Determine if a PDF file contains only images

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1694

Topic: Determine if a PDF file contains only images

Posted By: Eric24
Subject: Determine if a PDF file contains only images
Date Posted: 24 Dec 10 at 8:48PM

Is there an easy way to determine whether a given PDF file contains only images and no other "visible" objects? For example, scanned documents are often stored as PDF files in which the visible portion of the file consists of CCITT images. If that is the case, and nothing else has been added (such as annotations, text, or other objects), then I can simply extract the images directly rather than "print" the PDF to a capture printer. But if anything has been added to the PDF (other than metadata, which isn't visible), then I must "print" the PDF to a capture printer to get an accurate representation of its contents.

Replies:

Posted By: Shotgun Tom
Date Posted: 25 Dec 10 at 2:11PM

Hi Eric and Merry Christmas!

One way would be to look for fonts. Use the "HasFontResources" function. If it returns false, the PDF is probably rasterized.

For the reverse (a text-only PDF), you can use the "FindImages" method.
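
A minimal sketch of the font check applied to every page might look like the following. It assumes that HasFontResources inspects the currently selected page and returns an Integer (0 when there are no font resources), and that PageCount returns the number of pages in the selected document. DocumentHasNoFonts is a hypothetical helper name, not a library function.

```pascal
// Hypothetical helper: True if no page of the selected document
// reports font resources. Assumes HasFontResources returns 0 for "none".
function DocumentHasNoFonts(QPL: TQuickPDF): Boolean;
var
  p: Integer;
begin
  Result := True;
  for p := 1 to QPL.PageCount do
  begin
    QPL.SelectPage(p);
    if QPL.HasFontResources <> 0 then
    begin
      // at least one page references a font, so it is not purely raster
      Result := False;
      Exit;
    end;
  end;
end;
```

A True result only says that no fonts were found. It does not say that the pages contain images and nothing else.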

Tom

Posted By: Eric24
Date Posted: 25 Dec 10 at 6:00PM

Merry Christmas to you, too!

I had thought of that, but it doesn't cover the possibility that someone added "markup" in the form of lines, shapes, or other non-text objects. Is there a similar call that would "reveal" such objects? If so, is there anything else that's "visible" other than images, fonts, and "other non-text objects"?

Posted By: Ingo
Date Posted: 25 Dec 10 at 6:35PM

Hi Eric!

You can run text extraction on each page. If it returns nothing, then the page contains only images.

Another way to detect scanned pages that have not been through OCR: check whether there are as many image objects as there are pages.
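
Both checks could be combined roughly as follows. This is only a sketch under assumptions: that GetPageText(0) returns the extracted text of the selected page using the default extraction mode, that FindImages returns the number of images in the selected document, and that PageCount returns its page count. LooksLikeUnprocessedScan is a hypothetical name, and Trim comes from SysUtils.

```pascal
// Hypothetical helper: True if the selected document has exactly one
// image per page and no page yields any extractable text.
function LooksLikeUnprocessedScan(QPL: TQuickPDF): Boolean;
var
  p: Integer;
begin
  Result := False;
  // as many image objects as pages?
  if QPL.FindImages <> QPL.PageCount then
    Exit;
  for p := 1 to QPL.PageCount do
  begin
    QPL.SelectPage(p);
    // any non-blank text means the page is not image-only
    if Trim(QPL.GetPageText(0)) <> '' then
      Exit;
  end;
  Result := True;
end;
```

This is a heuristic. A scan split into several image strips per page would fail the count test even though it is image-only.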

Cheers, Ingo

Posted By: Eric24
Date Posted: 03 Jan 11 at 2:59AM

Thanks! That sounds like a very solid approach.

Posted By: Dimitry
Date Posted: 13 Jan 11 at 10:40AM

Besides font resources and extractable text, a PDF file can contain vector graphics, tables, or AcroForms that are rendered visually on the page.

To make sure that a page has no visual elements besides images, we just need to remove all images from it.

After this, the page should be visually empty (blank).

Here is a PageContainsImages() function that, despite its name, answers whether a PDF page contains images and nothing else visible.

Your opinion and testing results are welcome.

procedure ClonePageDimensions(QPL: TQuickPDF;
    SourcePage, TargetPage: Integer);
type
    TPageBox = record
      Left: Double;
      Top: Double;
      Width: Double;
      Height: Double;
    end;
var
    i: Integer;
    width, height: Double;
    rotation: Integer;
    boxes: array [1..5] of TPageBox;
begin
    with QPL do
    begin
      // Reading dimensions from Source Page
      SelectPage(SourcePage);
      width := PageWidth;
      height := PageHeight;
      rotation := PageRotation;
      for i := 1 to 5 do
      begin
        boxes[i].Left := GetPageBox(i, 0);
        boxes[i].Top := GetPageBox(i, 1);
        boxes[i].Width := GetPageBox(i, 2);
        boxes[i].Height := GetPageBox(i, 3);
      end;
      // Saving dimensions to Target Page
      SelectPage(TargetPage);
      SetPageDimensions(width, height);
      RotatePage(rotation);
      for i := 1 to 5 do
        SetPageBox(i, boxes[i].Left, boxes[i].Top, boxes[i].Width, boxes[i].Height);
    end;
end;

function PageContainsImages(QPL: TQuickPDF;
    Page: Integer; DPI: Integer): Boolean;
var
    i: Integer;
    doc, doc_tmp: Integer;
    s, s_tmp: AnsiString;
begin
    Result := False;
    with QPL do
    begin
      // the caller's document is selected; with no images there is nothing to test
      if FindImages = 0 then
        Exit;
      doc := SelectedDocument;
      doc_tmp := NewDocument;
      // the temporary document is now selected
      try
        CopyPageRanges(doc, IntToStr(Page));
        // Page 2 now holds the copy of the caller's page
        SelectPage(2);
        // clear all image content on Page 2
        for i := 1 to FindImages do
          ClearImage(GetImageID(i));
        // Page 1 is empty; give it the same dimensions as Page 2
        ClonePageDimensions(QPL, 2, 1);
        s := RenderPageToString(DPI, 2, 0);
        s_tmp := RenderPageToString(DPI, 1, 0);
        // Compare the renderings of Page 1 and Page 2 by size and content
        if Length(s) <> Length(s_tmp) then
          Exit;
        Result := True;
        for i := 1 to Length(s) do
          if s[i] <> s_tmp[i] then
          begin
            Result := False;
            Exit;
          end;
      finally
        // drop the temporary document and reselect the caller's document
        RemoveDocument(doc_tmp);
        SelectDocument(doc);
      end;
    end;
end;

-------------
Regards,
Dmitry