Determine page type?


   Greetings,

I'm working on a PDF redaction product that will be used by our clients.  This is a secure redaction process and the only way I know to do this is to render the PDF page to image, draw on the image, then save it back to PDF.  This is working fine, but it's incredibly slow.  I've got a few questions I'm hoping someone here can answer.

I've tried handling documents differently depending on the document type.  I try to determine the makeup of the document by finding images: if the number of images equals the number of pages, I assume these are scanned pages and extract the TIFF.  Otherwise, I render the document to file and then convert the rendered pages to TIFF.  My redaction software creates the thumbnails and the preview image to draw on from TIFF files (I'm using Imageman).

I'm using an old version of QuickPDF (4.49) and am wondering if some of the bugs and slowness have been improved in the latest version.

FindImages seems to extract JPEGs into the PDF's directory, which is quite annoying.  Does anyone know anything about that?

FindImages is also really slow.  Is there a better or faster way to determine the page type (text/image or scanned image)?

Lastly, has PDF rendering been improved?  I've found the only acceptable text -> image rendering is png at 150dpi.  Anything less than that and there's too much detail lost in the text.

Here is the code I'm using (yes, it's vb5):
    'QP is the QuickPDF ActiveX component

    If QP.LoadFromFile(sPDFFileName) = 1 Then    'loads document
        lTotal = QP.PageCount

        notTiff = True
        rendered = False
        sPDFShortName = Mid$(sPDFFileName, 1, Len(sPDFFileName) - 4)
        If lTotal > 0 Then
            images = QP.FindImages
            If images = lTotal Then    'one image per page: assume scanned pages
                notTiff = False
                For page = 1 To images
                    imgID = QP.ImageID(page)
                    QP.SelectImage imgID
                    'only accept images that pass the type and minimum-size checks
                    If QP.ImageType = 3 And QP.ImageWidth > 400 And QP.ImageHeight > 600 Then
                        QP.SaveImageToFile sPDFShortName & page & ".tif"
                    Else
                        notTiff = True    'fall back to rendering the whole document
                    End If
                Next page
            End If
            If notTiff Then
                QP.RenderDocumentToFile 150, 1, lTotal, 5, sPDFShortName & ".png"
                For iC = 1 To lTotal
                    frmUploadDocs.imPreview.Picture = sPDFShortName & iC & ".png"
                    sTempFile = sPDFShortName & iC & ".tif"
                    If FileExist(sTempFile) Then Kill sTempFile
                    frmUploadDocs.imPreview.SaveAs sPDFShortName & iC & ".tif"
                    Kill sPDFShortName & iC & ".png"
                Next iC
                rendered = True
            End If
        End If
        ConvertPDFtoTIF = lTotal ' success
    End If

Thanks for any help

Sam
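
On the 150 dpi figure in the post above: the cost of rendering can be estimated with simple arithmetic, since PDF page sizes are given in points (1/72 inch). The sketch below is only an illustration and is not QuickPDF code; the function names are made up for this example, and US Letter (612 x 792 points) is an assumed page size.

```python
# PDF page dimensions are measured in points; 1 point = 1/72 inch.
POINTS_PER_INCH = 72


def rendered_size(width_pt, height_pt, dpi):
    """Return (width_px, height_px) for a page rendered at the given DPI."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


def raw_rgb_bytes(width_px, height_px):
    """Uncompressed 24-bit RGB size, before any PNG or TIFF compression."""
    return width_px * height_px * 3


# Assumed page size for illustration: US Letter, 612 x 792 points.
for dpi in (100, 150, 300):
    w, h = rendered_size(612, 792, dpi)
    print(dpi, "dpi:", w, "x", h, "pixels,", raw_rgb_bytes(w, h), "raw bytes")
```

The pixel count grows with the square of the DPI, so dropping from 150 to 100 dpi cuts the pixels per page by more than half. Whether the text stays legible at the lower setting depends on the renderer and the source document.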

Original post by Sam (Beginner, joined 21 Aug 09, MN), posted 28 Aug 09 at 8:43PM

Michel_K17 (Newbie, www.exp-systems.com, joined 25 Jan 03), posted 29 Aug 09 at 2:38PM
Hello Sam,

I can answer a few of your questions (but not the one about finding images, sorry).

The library has improved dramatically since v4.49, in two ways. Today we are at v7.15, which reflects the last of the improvements from iSed, the improvements from the user community, and the work done by Debenu and their programmers. By "improvements" I mean that a large number of bugs have been fixed and compatibility with PDF content has improved. You mentioned rendering in particular, and yes, that portion of the code is far better: rendering now matches Adobe's Reader in quality.

Finally, Debenu is steadily adding new features, for which I am very thankful, as this brings the library back in line with the new technology being brought to the PDF format. This includes the ability to digitally sign documents, and much more. To be sure, it's a never-ending task, but Debenu has been really proactive about regular updates and about addressing specific requests from users when they can.

Hopefully someone else can address your image question. There is no doubt, though, that you should upgrade, at a minimum, to the last version that iSed published with the modifications from the users. This would be a free upgrade for you. It's available [here].

There is a list [here] of all the improvements by Debenu since v5.11 that you should take a look at. I believe the offer to upgrade to the v7.xx series is still available to users of the old version (you will need to provide proof of ownership). As I recall, they offer a $100 discount. The purchase page is [here].

I hope that helps. Cheers!

Michel

Shotgun Tom (Senior Member, joined 14 Aug 09, Phoenix, AZ), posted 29 Aug 09 at 5:32PM
A couple of thoughts for you, Sam.

1. HasFontResources is a fairly quick way to determine if the entire document consists of images. From the QuickPDF Manual: "Determines if the selected document has font resources. If the document does not it can be assumed to be an image only PDF."

2. I'm not all that familiar with Imageman. However, there is an ActiveX component called GdPicture Imaging SDK at www.gdpicture.com. At one point it directly supported the iSed library. This component has a method that quickly converts a PDF (and PDF/A) to multipage TIFF, and also multipage TIFF to PDF or PDF/A. The package includes a viewer that renders PDF and multipage TIFF very quickly. In combination with the latest QuickPDF library you would have a very powerful PDF/TIFF toolbox.

Sam (Beginner, joined 21 Aug 09, MN), posted 03 Sep 09 at 5:40PM
Thanks for the help. I upgraded to 7.15 and it is indeed quite a bit faster. I was also able to reduce the DPI, which made the resulting image files smaller.

Tom, if I have time I may play with HasFontResources. I don't know how that would work, though, if a page is made up of multiple images, or whether that's even common enough to worry about.
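
The multiple-images concern also affects the "images = number of pages" test from the original post. The sketch below is only an illustration of that pitfall. It assumes a list of per-page image counts is already available; the thread does not show how to get one from QuickPDF, and the function names and sample counts are hypothetical.

```python
def looks_scanned_by_total(images_per_page):
    """Document-level test from the original post: total images == page count."""
    return sum(images_per_page) == len(images_per_page)


def looks_scanned_per_page(images_per_page):
    """Per-page variant: every page must carry at least one image."""
    return len(images_per_page) > 0 and all(n >= 1 for n in images_per_page)


# Hypothetical per-page image counts for two 3-page documents.
uneven = [2, 0, 1]    # totals match the page count, but page 2 has no image
striped = [3, 1, 2]   # every page is imaged, some as several strips

print(looks_scanned_by_total(uneven), looks_scanned_per_page(uneven))
print(looks_scanned_by_total(striped), looks_scanned_per_page(striped))
```

The two tests disagree on both sample documents, which suggests the total-count comparison can fail in either direction. Neither test says anything about text on the page, so an image check alone cannot confirm that a page is a scan.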

Ingo (Moderator Group, joined 29 Oct 05), posted 03 Sep 09 at 6:56PM
Hi! An image-only PDF does not necessarily lack font resources. I have a sample that really is an image-only PDF, yet it contains Helvetica. What you can do is extract the text content. If there isn't any text content and there are embedded images (function FindImages), then you can be pretty sure that it's a scanned or image-only PDF. Cheers, Ingo
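
The check described in this reply can be sketched as a small decision function. This is a minimal illustration, not QuickPDF code. It assumes the extracted text and the image count for a page have already been obtained from the library; the function name, labels, and sample data are hypothetical.

```python
def classify_page(page_text: str, image_count: int) -> str:
    """Classify one page from its extracted text and its embedded image count."""
    has_text = bool(page_text.strip())  # whitespace-only text counts as no text
    if has_text:
        return "text"          # the page may still contain images
    if image_count > 0:
        return "image-only"    # no text but at least one image: likely a scan
    return "empty"             # neither text nor images were found


# Hypothetical (extracted text, image count) pairs for three pages.
pages = [("Invoice 2009-08", 1), ("  \n", 1), ("", 0)]
labels = [classify_page(text, count) for text, count in pages]
print(labels)
```

One limitation of this heuristic: a scanned page that carries a hidden OCR text layer would be labelled "text", even though what the reader sees is an image.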