delete blank pdf-pages

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 04 Nov 19 at 8:22AM
	Hey everybody,
	I have a document-management program that displays my scanned PDFs. If a blank page was scanned, the program should delete that page automatically and show only pages with text. Unfortunately, I have no idea where to start. I imagine something like this:
	    if ... there is no text then ..
	        delete the page
	    end if
	First, I would use GetPageText to get the information from the page (Https://www.debenu.com/docs/pdf_library_reference/GetPageText.php). Then I would delete the page with DeletePages (Https://www.debenu.com/docs/pdf_library_reference/DeletePages.php).
	I know it's hard to help when I cannot show any code snippets yet. Maybe someone can give me a hint; otherwise I will hopefully come back later with more information.
	Greetz,
	hjerteblod

tfrost | Senior Member | Joined: 06 Sep 10 | Location: UK | Posted: 04 Nov 19 at 9:51AM
	It is much easier to detect blank pages in an image than from analysing all the objects in a PDF, some of which could be simply painting white space on the page. First, the software for one of the scanners I use (Canon) has an option to detect and ignore blank page images before writing them to the PDF. And a scanner will not give you anything from GetPageText if it does not support OCR, so you will have to look at the image anyway in these cases. A simple solution is to take the page image from the scanner as a bitmap (or render the page to a bitmap if the PDF is not scanner output), and walk the bitmap scan line by scan line until you find a non-white pixel. Yes, this can be slow and memory-hungry, but if you can afford these resources it is relatively simple for you to implement, and quite quick to skip deletion for non-blank pages.
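
The scan-line walk described above can be sketched in a few lines. This is a minimal illustration, assuming the page has already been rendered or decoded into rows of 8-bit grayscale values (0 = black, 255 = white); `is_blank_bitmap` and `white_threshold` are hypothetical names chosen for this example, not part of any PDF library.

```python
from typing import Sequence


def is_blank_bitmap(rows: Sequence[Sequence[int]], white_threshold: int = 250) -> bool:
    """Return True if no pixel is darker than white_threshold.

    rows is assumed to hold 8-bit grayscale values (0 = black, 255 = white).
    """
    for row in rows:                      # walk the bitmap scan line by scan line
        for pixel in row:
            if pixel < white_threshold:   # first non-white pixel found
                return False              # stop early: the page is not blank
    return True


blank_page = [[255] * 8 for _ in range(4)]
inked_page = [[255] * 8 for _ in range(4)]
inked_page[2][5] = 0                      # a single black pixel

print(is_blank_bitmap(blank_page))        # True
print(is_blank_bitmap(inked_page))        # False
```

The early return is what makes non-blank pages cheap: the loop stops at the first dark pixel, so only truly blank pages are scanned in full.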

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 04 Nov 19 at 10:10AM
	Thank you for the quick reply. However, we already use this solution for JPEGs. Now a solution for PDFs has to be created. Could I base it on text recognition? If searchable text is found, then nothing gets deleted?

tfrost | Senior Member | Joined: 06 Sep 10 | Location: UK | Posted: 04 Nov 19 at 6:35PM
	If you have a scanner which does OCR while it scans, then of course you can look at the page text and see if it is all empty. But this does not tell you whether there are box outlines and rules, or pictures in which the OCR found no text. So you need to iterate and check all images, as well as checking the text. Only you are aware of what may turn up in the documents you are working with, so none of the users here can suggest precisely what has to be absent to guarantee that the page is completely blank. Some experimentation is probably called for!
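
One way to picture the combined check is a small decision function. This is only a sketch under assumptions: `PageSummary` is a hypothetical container that your own code would fill from the library (page text, one blank/not-blank verdict per image, and a flag for rules or box outlines); none of these names come from the PDF library itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page facts gathered by the caller."""
    text: str = ""                                   # extracted page text, if any
    image_blank_flags: List[bool] = field(default_factory=list)  # one verdict per image
    has_vector_graphics: bool = False                # rules, box outlines, etc.


def page_is_blank(page: PageSummary) -> bool:
    """A page counts as blank only if text, vector graphics and image content are all absent."""
    if page.text.strip():
        return False                                 # any visible text keeps the page
    if page.has_vector_graphics:
        return False                                 # outlines or rules keep the page
    return all(page.image_blank_flags)               # every image must itself be blank


logo_page = PageSummary(text="", image_blank_flags=[False])
empty_scan = PageSummary(text="  \n", image_blank_flags=[True, True])

print(page_is_blank(logo_page))    # False
print(page_is_blank(empty_scan))   # True
```

Which conditions belong in the function depends on the documents involved, so treat the three checks as a starting point for experimentation.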

Ingo | Moderator Group | Joined: 29 Oct 05 | Posted: 04 Nov 19 at 7:03PM
	Hi, text extraction won't work if you have a scanned page; then it will normally be an image. Checking the scan lines (as already advised) can be a solution. You should scale the image down to a very low resolution; then the performance will be good enough. I'm doing similar things to check if a PDF page has coloured content... Cheers and welcome here, Ingo
	Edited by Ingo - 04 Nov 19 at 7:04PM
	Cheers, Ingo
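
The gain from a low resolution is easy to estimate. The sketch below assumes the default PDF unit of 1/72 inch per point and an A4 page of 595 x 842 points; `render_size` is a hypothetical helper written for this example, not a library function.

```python
import math


def render_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at dpi, assuming 72 points per inch."""
    return (math.ceil(width_pt * dpi / 72), math.ceil(height_pt * dpi / 72))


full = render_size(595, 842, 300)    # typical scan resolution
small = render_size(595, 842, 36)    # deliberately coarse rendering

print(full)                          # (2480, 3509)
print(small)                         # (298, 421)
print(full[0] * full[1] / (small[0] * small[1]))  # how many times fewer pixels to scan
```

A coarse rendering leaves far fewer pixels to walk, but thin rules or small print may fade towards white when scaled down, so the whiteness threshold is worth testing on real documents.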

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 05 Nov 19 at 7:21AM
	The scanned documents contain text and company logos, so I have to exclude these. We have already implemented the scan-line check elsewhere; I now have to program this explicitly for PDF. So did I understand you correctly that I first have to check whether there is text and then whether there are pictures? Thanks for the hints and the welcome. Greetings, hjerteblod

Ingo | Moderator Group | Joined: 29 Oct 05 | Posted: 05 Nov 19 at 7:57AM
	These should be the first functions to look at for your issue: HasFontResources, ImageCount, FindImages. To extract images, you'll find the functionality in this group: https://www.debenu.com/docs/pdf_library_reference/ImageHandling.php The functions beginning with GetImage... will meet your needs. You already know how to extract the text...
	Cheers, Ingo

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 05 Nov 19 at 8:03AM
	Awesome, thank you. I started with:
	    If openFileDialog1.ShowDialog = DialogResult.OK Then
	        result = openFileDialog1.FileName
	        QP.LoadFromFile(result, "")
	        QP.GetPageText(0)
	        MsgBox(QP.GetPageText(0))
	    End If
	This works, and now I will go on with your hints, thanks!!

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 05 Nov 19 at 9:40AM
	So far, everything works as I wish. However, it does not delete pages when I write the code for it, and there is no error message.
	    Dim openFileDialog1 As New OpenFileDialog()
	    Dim result As String
	    If openFileDialog1.ShowDialog = DialogResult.OK Then
	        result = openFileDialog1.FileName
	        QP.LoadFromFile(result, "")
	        QP.HasFontResources()
	        QP.GetPageImageList(0)
	        QP.FindImages()
	        QP.ImageCount()
	        QP.DeletePages(2, 1)
	    End If
	Can someone tell me if I am missing something? I am very grateful for any hints.

tfrost | Senior Member | Joined: 06 Sep 10 | Location: UK | Posted: 05 Nov 19 at 12:55PM
	I assume that what is shown above is just an outline of your real code, omitting all the checks you are doing and error handling. Does the list above also omit a QP.SaveToFile which is in your real code, or is that omission the cause of the pages appearing not to be deleted?

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 05 Nov 19 at 1:06PM
	Yes, you are right, I missed the SaveToFile.
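
Once several blank pages have been found, the deletions themselves need some care, because removing a page shifts the numbers of all pages after it. The sketch below groups 1-based page numbers into runs and orders them highest first; it assumes that the two arguments of DeletePages are a start page and a page count, as the call `QP.DeletePages(2, 1)` above suggests. `deletion_runs` is a hypothetical helper, not a library function.

```python
from typing import Iterable, List, Tuple


def deletion_runs(blank_pages: Iterable[int]) -> List[Tuple[int, int]]:
    """Group 1-based page numbers into (start_page, page_count) runs, highest run first."""
    runs: List[Tuple[int, int]] = []
    for page in sorted(set(blank_pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            # page directly follows the current run: extend it
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((page, 1))        # start a new run
    runs.reverse()                        # delete from the back so earlier numbers stay valid
    return runs


print(deletion_runs([2, 7, 3, 8, 9]))     # [(7, 3), (2, 2)]
```

Each run would then be passed to DeletePages in the order returned, followed by a single SaveToFile once all runs are processed.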

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 06 Nov 19 at 7:58AM
	Hey guys, so far almost everything works as wished, but one problem is left. I use the number of images (QP.ImageCount()) and the text (QP.GetPageText(0)) to filter out the blank pages. Now I have the problem that blank pages are not recognized as blank; instead, 2 pictures are recognized. If my text is not searchable (non-OCR), it is recognized as an image. Therefore I cannot distinguish between a blank page on which non-existent images are detected and a page with non-searchable text that should not be deleted. This problem seems unsolvable to me, but I still wanted to present it here.

tfrost | Senior Member | Joined: 06 Sep 10 | Location: UK | Posted: 06 Nov 19 at 12:07PM
	Have you looked at GetPageImageList, and then if there is one, GetImageListCount? And then get the image ID using GetImageListItemProperty? I have not used these functions myself, but they should allow you to work out which page(s) the images you counted with ImageCount are on. The internal structure of a PDF is not designed to be easy to walk around in!

Ingo | Moderator Group | Joined: 29 Oct 05 | Posted: 06 Nov 19 at 7:21PM
	Hi, Tim pushed you in the right direction. Here's part of a loop that shows how to extract embedded images from a PDF document page by page (in Pascal/Delphi):
	    for i2 := 1 to pagecount do
	    begin
	      iref := QP.SelectPage(i2);
	      ilst := QP.GetPageImageList(iref);
	      icnt := QP.GetImageListCount(ilst);
	      for i3 := 1 to icnt do
	      begin
	        lv_imtype := QP.GetImageListItemIntProperty(ilst, i3, 400);
	        if ( lv_imtype > 4 ) then lv_type := '.jpg';
	        if ( lv_imtype = 1 ) then lv_type := '.jpg';
	        if ( lv_imtype = 2 ) then lv_type := '.bmp';
	        if ( lv_imtype = 3 ) then lv_type := '.tif';
	        if ( lv_imtype = 4 ) then lv_type := '.png';
	        lv_width := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 401));
	        lv_height := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 402));
	        lv_bpp := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 403)); // BitsPerPixel
	        lv_cst := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 404)); // ColorSpaceType
	        lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 406)); // Constant Image ID
	        if ( Trim(lv_imguid) = '' ) or ( Trim(lv_imguid) = '0') then
	          lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 405)); // Image ID
	        lv_imagefile := target + ExtractFileName(System.Copy(ws,1,Length(ws)-4)) + '-' + IntToStr(i2) + '-' + IntToStr(i3) + '-' + lv_type;
	        QP.SaveImageListItemDataToFile(ilst, i3, 0, lv_imagefile);
	      end; // for i3
	    end; // for i2
	Cheers, Ingo
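
After the page images have been extracted, each one still has to be judged as blank or not. Scanned "blank" pages often carry a few specks of dust or noise, so a strict "no dark pixel at all" test may never fire. The sketch below uses a different, more tolerant criterion that is not taken from this thread: it measures the share of dark pixels and compares it with a small limit. It assumes decoded 8-bit grayscale rows; `ink_ratio`, `image_looks_blank`, and both default values are illustrative assumptions that would need tuning on real scans.

```python
from typing import Sequence


def ink_ratio(rows: Sequence[Sequence[int]], dark_threshold: int = 128) -> float:
    """Fraction of pixels darker than dark_threshold (0 = black, 255 = white)."""
    total = 0
    dark = 0
    for row in rows:
        for pixel in row:
            total += 1
            if pixel < dark_threshold:
                dark += 1
    return dark / total if total else 0.0


def image_looks_blank(rows: Sequence[Sequence[int]], max_ink_ratio: float = 0.001) -> bool:
    """Treat the image as blank when at most max_ink_ratio of its pixels are dark."""
    return ink_ratio(rows) <= max_ink_ratio


# 100 x 100 white image with five specks of scanner noise
speckled = [[255] * 100 for _ in range(100)]
for i in range(5):
    speckled[i * 10][i * 7] = 0

# the same image with two solid lines of "text"
written = [row[:] for row in speckled]
written[40] = [0] * 100
written[41] = [0] * 100

print(image_looks_blank(speckled))   # True
print(image_looks_blank(written))    # False
```

A page whose images all pass this test and whose extracted text is empty would be a candidate for deletion; a page image that contains non-searchable text fails it and is kept.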

hjerteblod | Beginner | Joined: 06 Sep 19 | Posted: 07 Nov 19 at 6:41AM
	Thank you guys, this helps a lot!!