Determine if a PDF file contains only images

Eric24, posted 24 Dec 10 at 8:48PM
	Is there an easy way to determine if a given PDF file contains only images and no other "visible" objects? For example, scanned documents are often stored as PDF files, where the visible portion of the file is CCITT images. If this is the case, and nothing else has been added (like annotations, text, or other objects), then I can simply extract the images directly, rather than "print" the PDF to a capture printer. But if anything has been added to the PDF (other than metadata that isn't visible), then I must "print" the PDF to a capture printer in order to get an accurate representation of the PDF contents.

Shotgun Tom, posted 25 Dec 10 at 2:11PM
	Hi Eric, and Merry Christmas! One way would be to look for fonts. Use the "HasFontResources" function. If it returns false, then the file is probably a rasterized PDF. For the reverse (a text-only PDF), you can use the "FindImages" method. Tom

Eric24, posted 25 Dec 10 at 6:00PM
	Merry Christmas to you, too! I had thought of that, but it doesn't cover the possibility that someone added "markup" in the form of lines, shapes, or other non-text objects. Is there a similar call that would "reveal" such objects? If so, is there anything else that's "visible" other than images, fonts, and "other non-text objects"?

Ingo (Moderator), posted 25 Dec 10 at 6:35PM
	Hi Eric! You can run text extraction on each page. If there's no result, then there are only images. Another way to detect scanned pages that have not been through an OCR run: check whether there are as many image objects as there are pages. Cheers, Ingo
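The checks suggested so far (no font resources, no extractable text, as many images as pages) could be combined into one heuristic. The following is a minimal sketch in plain Delphi / Free Pascal that deliberately does not call the library. TPageFacts and LooksLikePlainScan are hypothetical names, and the fields are assumed to be filled in by the caller from calls such as HasFontResources, per-page text extraction, and FindImages. It also reads the image count check as "exactly one image per page", which is an assumption. As Eric notes, this kind of check cannot detect vector markup such as lines or shapes.

```pascal
program ScanHeuristic;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Hypothetical per-page summary; the caller fills it in from the PDF library.
  TPageFacts = record
    HasFonts: Boolean;    // page references font resources
    TextLength: Integer;  // length of the text extracted from the page
    ImageCount: Integer;  // number of image objects found on the page
  end;

// True when every page has no fonts, no extractable text, and exactly one
// image. Heuristic only: vector markup (lines, shapes) is not detected.
function LooksLikePlainScan(const Pages: array of TPageFacts): Boolean;
var
  i: Integer;
begin
  Result := Length(Pages) > 0;  // an empty document is not treated as a scan
  for i := Low(Pages) to High(Pages) do
    if Pages[i].HasFonts or (Pages[i].TextLength > 0) or
       (Pages[i].ImageCount <> 1) then
    begin
      Result := False;
      Exit;
    end;
end;

var
  Doc: array [0..1] of TPageFacts;
begin
  // Two scanned pages, one image each, nothing else.
  Doc[0].HasFonts := False;
  Doc[0].TextLength := 0;
  Doc[0].ImageCount := 1;
  Doc[1] := Doc[0];
  WriteLn(LooksLikePlainScan(Doc));

  // A typed note was added to the second page, so fonts and text appear.
  Doc[1].HasFonts := True;
  Doc[1].TextLength := 12;
  WriteLn(LooksLikePlainScan(Doc));
end.
```

A False result only means "do not extract the images directly"; a True result still leaves the vector-graphics question open.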

Eric24, posted 03 Jan 11 at 2:59AM
	Thanks! That sounds like a very solid approach.

Dimitry, posted 13 Jan 11 at 10:40AM
	Besides font resources and extractable text, a PDF file can contain vector graphics, tables, or AcroForms that are rendered visually on the page. To make sure that there are no visual elements on the page besides images, we just need to remove all images from the page. After this, the page should become visually empty (blank). Here is a PageContainsImages() function that basically answers whether a PDF page contains nothing but images. Your opinions and testing results are welcome.

    procedure ClonePageDimensions(QPL: TQuickPDF; SourcePage, TargetPage: Integer);
    type
      TPageBox = record
        Left: Double;
        Top: Double;
        Width: Double;
        Height: Double;
      end;
    var
      i: Integer;
      width, height: Double;
      rotation: Integer;
      boxes: array [1..5] of TPageBox;
    begin
      with QPL do
      begin
        // Reading dimensions from Source Page
        SelectPage(SourcePage);
        width := PageWidth;
        height := PageHeight;
        rotation := PageRotation;
        for i := 1 to 5 do
        begin
          boxes[i].Left := GetPageBox(i, 0);
          boxes[i].Top := GetPageBox(i, 1);
          boxes[i].Width := GetPageBox(i, 2);
          boxes[i].Height := GetPageBox(i, 3);
        end;

        // Saving dimensions to Target Page
        SelectPage(TargetPage);
        SetPageDimensions(width, height);
        RotatePage(rotation);
        for i := 1 to 5 do
          SetPageBox(i, boxes[i].Left, boxes[i].Top, boxes[i].Width, boxes[i].Height);
      end;
    end;

    function PageContainsImages(QPL: TQuickPDF; Page: Integer; DPI: Integer): Boolean;
    var
      i: Integer;
      doc, doc_tmp: Integer;
      s, s_tmp: AnsiString;
    begin
      Result := False;
      with QPL do
      begin
        // custom Document is selected
        if FindImages = 0 then Exit;

        doc := SelectedDocument;
        doc_tmp := NewDocument;
        // temporary Document is selected; it is created before "try" so that
        // "finally" never removes an uninitialized document ID
        try
          CopyPageRanges(doc, IntToStr(Page));
          // Page 2 contains customer's page copy atm
          SelectPage(2);

          // clear all image content on Page 2
          for i := 1 to FindImages do
            ClearImage(GetImageID(i));

          // Page 1 is empty and its dimensions should be equal to Page 2
          ClonePageDimensions(QPL, 2, 1);

          s := RenderPageToString(DPI, 2, 0);
          s_tmp := RenderPageToString(DPI, 1, 0);

          // Compare Page 1 and Page 2 by size and content
          if Length(s) <> Length(s_tmp) then Exit;

          Result := True;
          for i := 1 to Length(s) do
            if s[i] <> s_tmp[i] then
            begin
              Result := False;
              Exit;
            end;
        finally
          RemoveDocument(doc_tmp);
          // reselect the caller's document
          SelectDocument(doc);
        end;
      end;
    end;

	Edited by Dimitry - 13 Jan 11 at 10:44AM
	Regards, Dmitry
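The final comparison step can be tried in isolation, without the library. The sketch below is only an illustration: SameRender is a hypothetical helper, and the short strings stand in for the buffers that RenderPageToString would return. It relies on the fact that AnsiString equality compares both length and content, so it gives the same answer as an explicit length check followed by a character loop. It also assumes that the renderer produces identical bytes for visually identical pages.

```pascal
program CompareRenders;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: True when two rendered page buffers are byte-identical.
// AnsiString equality compares length and content.
function SameRender(const A, B: AnsiString): Boolean;
begin
  Result := A = B;
end;

var
  Blank, Stripped, Marked: AnsiString;
begin
  // Stand-ins for rendered pages: 'W' is a white pixel, 'B' a black one.
  Blank    := 'WWWWWWWW';  // empty page with the same dimensions
  Stripped := 'WWWWWWWW';  // page with its images cleared, nothing left
  Marked   := 'WWBBBBWW';  // page where a drawn line survived the clearing

  WriteLn(SameRender(Stripped, Blank));
  WriteLn(SameRender(Marked, Blank));
end.
```

The first comparison corresponds to a page that held only images, the second to a page with extra markup that would still need to be "printed".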