Debenu Quick PDF Library - PDF SDK Community Forum Homepage

delete blank pdf-pages

hjerteblod
Beginner
Joined: 06 Sep 19
Posted: 04 Nov 19 at 8:22AM
hey everybody,

I have a document-management program that displays my scanned PDFs. If a blank page was scanned, the program should delete that page automatically and show only the pages with text. Unfortunately, I have no idea where to start. I imagine something like this:

if ... there is no text then ..
delete the page
end if

First, I would use GetPageText to get the text from the page (Https://www.debenu.com/docs/pdf_library_reference/GetPageText.php). Then I would use DeletePages to delete the page (Https://www.debenu.com/docs/pdf_library_reference/DeletePages.php).

I know it's hard to help when I cannot show any code snippets yet. Maybe someone can give me a hint; otherwise I will hopefully come back later with more information.

Greetz,
hjerteblod

tfrost
Senior Member
Joined: 06 Sep 10
Location: UK
Posted: 04 Nov 19 at 9:51AM
It is much easier to detect blank pages in an image than from analysing all the objects in a PDF, some of which could be simply painting white space on the page. First, the software for one of the scanners I use (Canon) has an option to detect and ignore blank page images before writing them to the PDF. And a scanner will not give you anything from GetPageText if it does not support OCR, so you will have to look at the image anyway in these cases.

A simple solution is to take the page image from the scanner as a bitmap (or render the page to a bitmap if the PDF is not scanner output), and walk the bitmap scan line by scan line until you find a non-white pixel. Yes, this can be slow and memory-hungry, but if you can afford these resources it is relatively simple for you to implement, and it finishes quickly on non-blank pages because the walk stops at the first non-white pixel.
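
The scan-line walk described above could look roughly like the following Python sketch. It is not tied to Quick PDF Library: it assumes the page is already available as rows of grayscale values (0 = black, 255 = white), and both the function name `is_blank_bitmap` and the cutoff of 250 for "white" are assumptions chosen for illustration.

```python
from typing import Sequence

# Assumed cutoff: grayscale values at or above this count as white.
WHITE_THRESHOLD = 250


def is_blank_bitmap(rows: Sequence[Sequence[int]],
                    threshold: int = WHITE_THRESHOLD) -> bool:
    """Return True if no pixel in the grayscale bitmap is darker than `threshold`."""
    for row in rows:                # walk the bitmap scan line by scan line
        for pixel in row:
            if pixel < threshold:   # first non-white pixel: page is not blank
                return False
    return True


blank_page = [[255] * 8 for _ in range(4)]
text_page = [[255] * 8 for _ in range(4)]
text_page[2][5] = 0  # a single black pixel

print(is_blank_bitmap(blank_page))  # True
print(is_blank_bitmap(text_page))   # False
```

The early `return False` is what keeps non-blank pages cheap: only a truly blank page forces a walk over every pixel.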

hjerteblod
Posted: 04 Nov 19 at 10:10AM
Thank you for the quick reply. However, we already use this solution for JPEGs. Now a solution for PDFs has to be created. Could I base it on text recognition instead? That is, if searchable text is found on a page, delete nothing?

tfrost
Posted: 04 Nov 19 at 6:35PM
If you have a scanner which does OCR while it scans, then of course you can look at the page text and see if it is all empty. But this does not tell you whether there are box outlines and rules, or pictures in which the OCR found no text. So you need to iterate and check all images, as well as checking the text. Only you are aware of what may turn up in the documents you are working with, so none of the users here can suggest precisely what has to be absent to guarantee that the page is completely blank. Some experimentation is probably called for!
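
As a rough illustration of combining the two checks, here is a Python sketch under stated assumptions. `PageContent` is a hypothetical container for whatever text and image bitmaps were extracted from a page by other means; it is not a Quick PDF Library type. Vector drawings such as rules and box outlines are not covered, so this is only part of the answer.

```python
from dataclasses import dataclass, field
from typing import List

Bitmap = List[List[int]]  # grayscale rows, 0 = black, 255 = white


@dataclass
class PageContent:
    """Hypothetical container for what was extracted from one PDF page."""
    text: str = ""
    images: List[Bitmap] = field(default_factory=list)


def image_is_blank(bitmap: Bitmap, threshold: int = 250) -> bool:
    """True if no pixel is darker than the assumed white threshold."""
    return all(pixel >= threshold for row in bitmap for pixel in row)


def page_is_blank(page: PageContent) -> bool:
    """Blank means: no non-whitespace text and no image with visible content."""
    if page.text.strip():
        return False
    return all(image_is_blank(img) for img in page.images)


pages = [
    PageContent(text="Invoice 2019"),
    PageContent(text="  \n", images=[[[255, 255], [255, 254]]]),
    PageContent(images=[[[255, 0], [255, 255]]]),
]
print([page_is_blank(p) for p in pages])  # [False, True, False]
```

Checking the text first is deliberate: it is the cheaper test, so pages with searchable text never reach the pixel inspection.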

Ingo
Moderator Group
Joined: 29 Oct 05
Posted: 04 Nov 19 at 7:03PM
Hi,

Text extraction won't work on a scanned page, because normally the page will be an image.
Checking the scan lines (as already advised) can be a solution.
You should scale the image down to a very low resolution; then the performance will be good enough.
I'm doing similar things to check whether a PDF page has coloured content...
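
To illustrate why a lower resolution helps, the following Python sketch imitates a low-resolution rendering by keeping only every `step`-th pixel of every `step`-th scan line. This is an assumption made for the example: in practice the page would more likely be rendered at a low DPI in the first place, and plain subsampling like this can miss features thinner than `step` pixels.

```python
from typing import List

Bitmap = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def subsample(rows: Bitmap, step: int) -> Bitmap:
    """Keep every `step`-th pixel of every `step`-th scan line."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return [row[::step] for row in rows[::step]]


def is_blank(rows: Bitmap, threshold: int = 250) -> bool:
    """True if no sampled pixel is darker than the assumed white threshold."""
    return all(pixel >= threshold for row in rows for pixel in row)


# A 100 x 100 white page with a 10 x 10 dark logo in the middle.
page = [[255] * 100 for _ in range(100)]
for y in range(40, 50):
    for x in range(40, 50):
        page[y][x] = 0

small = subsample(page, 4)
print(len(small), len(small[0]))  # 25 25
print(is_blank(small))            # False
```

With a step of 4 the check touches 625 pixels instead of 10,000, which is where the speed-up comes from; the price is that very fine detail may fall between the sampled pixels.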

Cheers and welcome here,
Ingo



Edited by Ingo - 04 Nov 19 at 7:04PM

hjerteblod
Posted: 05 Nov 19 at 7:21AM
The scanned documents contain text and company logos, so I have to exclude these. We have already implemented the scan-line check elsewhere; now I have to program this explicitly for PDFs. Did I understand you correctly that I first have to check whether there is text and then whether there are pictures?

Thanks for the hints and the welcome.

greetings
hjerteblod

Ingo
Posted: 05 Nov 19 at 7:57AM
These should be the first functions to look at for your issue:
HasFontResources, ImageCount, FindImages.
The functionality for extracting images is in this group:
https://www.debenu.com/docs/pdf_library_reference/ImageHandling.php
The functions beginning with GetImage... will meet your needs.
You already know how to extract the text...

Cheers,
Ingo


hjerteblod
Posted: 05 Nov 19 at 8:03AM
awesome, thank you
I started with

        If OpenFileDialog1.ShowDialog() = DialogResult.OK Then
            result = OpenFileDialog1.FileName
            QP.LoadFromFile(result, "")
            ' Show the text extracted from the currently selected page
            MsgBox(QP.GetPageText(0))
        End If

this works, and now I will go on with your hints, thanks!!

hjerteblod
Posted: 05 Nov 19 at 9:40AM
So far, everything works as I wish. However, the pages are not deleted when I add the code for it, and there is no error message.


        Dim openFileDialog1 As New OpenFileDialog()
        Dim result As String

        If openFileDialog1.ShowDialog = DialogResult.OK Then
            result = openFileDialog1.FileName

            QP.LoadFromFile(result, "")
            QP.HasFontResources()
            QP.GetPageImageList(0)
            QP.FindImages()
            QP.ImageCount()
            QP.DeletePages(2, 1)

        End If


Can someone tell me if I am missing something? I am very grateful for any hints.

tfrost
Posted: 05 Nov 19 at 12:55PM
I assume that what is shown above is just an outline of your real code, omitting all the checks you are doing and error handling. Does the list above also omit a QP.SaveToFile which is in your real code, or is that omission the cause of the pages appearing not to be deleted?

hjerteblod
Posted: 05 Nov 19 at 1:06PM
Yes, you are right, I missed the SaveToFile.

hjerteblod
Posted: 06 Nov 19 at 7:58AM
hey guys,
So far almost everything works as desired, but one problem is left. I use the number of images (QP.ImageCount()) and the text (QP.GetPageText(0)) to filter out the blank pages. The problem is that blank pages are not recognized as blank; instead, 2 pictures are detected on them.

If my text is not searchable (non-OCR), it is recognized as an image. Therefore, I cannot distinguish between a blank page on which images are detected that are not really there and a page of non-searchable text that should not be deleted.

I find this problem unsolvable, but I still wanted to try to present it here.


tfrost
Posted: 06 Nov 19 at 12:07PM
Have you looked at GetPageImageList, and then if there is one, GetImageListCount? And then get the image ID using GetImageListItemProperty? I have not used these functions myself, but they should allow you to work out which page(s) the images you counted with ImageCount are on. The internal structure of a PDF is not designed to be easy to walk around in!
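
Once the images on a page have been identified, counting them is not enough, because a scanned blank page still contains an image (the scan itself). One possible way to tell such a page from a scan of non-searchable text is to measure how much of the image is dark and allow a small tolerance for scanner noise. The Python sketch below assumes the image has already been extracted and decoded into grayscale rows; the names `ink_ratio` and `looks_blank` and both thresholds are assumptions that would need tuning on real documents.

```python
from typing import List

Bitmap = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def ink_ratio(rows: Bitmap, dark_threshold: int = 128) -> float:
    """Fraction of pixels darker than `dark_threshold` (0.0 for an empty bitmap)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    dark = sum(1 for row in rows for pixel in row if pixel < dark_threshold)
    return dark / total


def looks_blank(rows: Bitmap, max_ink_ratio: float = 0.001) -> bool:
    """Treat the image as blank if at most `max_ink_ratio` of its pixels are dark."""
    return ink_ratio(rows) <= max_ink_ratio


# A blank scan with three dust speckles.
blank_scan = [[255] * 100 for _ in range(100)]
blank_scan[3][7] = blank_scan[50][50] = blank_scan[99][0] = 0

# A scan with five dark scan lines standing in for non-searchable text.
text_scan = [[255] * 100 for _ in range(100)]
for y in range(10, 15):
    text_scan[y] = [0] * 100

print(looks_blank(blank_scan))  # True
print(looks_blank(text_scan))   # False
```

Unlike a strict "first non-white pixel" test, this has to look at every pixel, so it trades speed for robustness against speckles.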

Ingo
Posted: 06 Nov 19 at 7:21PM
Hi,

Tim pushed you in the right direction.
Here's a part from a loop to show you how to extract embedded images from a pdf-document page by page (in Pascal/Delphi):

      for i2 := 1 to pagecount do
      begin
         // Select the page, then fetch its image list and the number of entries
         iref := QP.SelectPage(i2);
         ilst := QP.GetPageImageList(iref);
         icnt := QP.GetImageListCount(ilst);

         for i3 := 1 to icnt do
         begin
            // Property 400 is the image type; map it to a file extension
            lv_imtype := QP.GetImageListItemIntProperty(ilst, i3, 400);
            case lv_imtype of
               2: lv_type := '.bmp';
               3: lv_type := '.tif';
               4: lv_type := '.png';
            else
               lv_type := '.jpg'; // type 1 and any other value
            end;

            lv_width  := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 401));
            lv_height := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 402));
            lv_bpp    := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 403)); // BitsPerPixel
            lv_cst    := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 404)); // ColorSpaceType
            lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 406)); // Constant Image ID
            if ( Trim(lv_imguid) = '' ) or
               ( Trim(lv_imguid) = '0') then
               lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 405)); // Image ID

            // Build the target file name from source name, page number and image number, then save
            lv_imagefile := target + ExtractFileName(System.Copy(ws,1,Length(ws)-4)) + '-' + IntToStr(i2) + '-' + IntToStr(i3) + '-' + lv_type;
            QP.SaveImageListItemDataToFile(ilst, i3, 0, lv_imagefile);
         end;
      end;

Cheers,
Ingo


hjerteblod
Posted: 07 Nov 19 at 6:41AM
thank you guys, this helps a lot!!