Delete blank PDF pages

hjerteblod (Beginner, joined 06 Sep 19)
Posted: 04 Nov 19 at 8:22AM

Hey everybody,

I have a document-management program that displays my scanned PDFs. If a blank page was scanned, the program should delete that page automatically and show only pages with text. Unfortunately, I have no idea where to start. I imagine something like this:

    if there is no text then
        delete the page
    end if

First, I would use GetPageText to get the text from the page (https://www.debenu.com/docs/pdf_library_reference/GetPageText.php). Then I would delete the page with DeletePages (https://www.debenu.com/docs/pdf_library_reference/DeletePages.php).

I know it's hard to help when I cannot show any code snippets yet. Maybe someone can give me a hint; otherwise I will hopefully come back later with more information.

Greetz, hjerteblod

tfrost (Senior Member, joined 06 Sep 10, UK)

It is much easier to detect blank pages in an image than by analysing all the objects in a PDF, some of which could simply be painting white space on the page.

First, the software for one of the scanners I use (Canon) has an option to detect and ignore blank page images before writing them to the PDF. Second, a scanned PDF will not give you anything from GetPageText if the scanner does not support OCR, so you will have to look at the image anyway in these cases.

A simple solution is to take the page image from the scanner as a bitmap (or render the page to a bitmap if the PDF is not scanner output), and walk the bitmap scan line by scan line until you find a non-white pixel. Yes, this can be slow and memory-hungry, but if you can afford these resources it is relatively simple to implement, and non-blank pages are skipped quickly because the walk stops at the first non-white pixel.
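The scan-line walk described above could look roughly like the following Python sketch. It is not Quick PDF Library code: it assumes the page has already been rendered to a grayscale bitmap held as rows of 0-255 values, and the function name `is_blank_bitmap` and the `white_threshold` default are illustrative assumptions rather than anything from the thread.

```python
from typing import Sequence


def is_blank_bitmap(rows: Sequence[Sequence[int]], white_threshold: int = 250) -> bool:
    """Return True if no pixel is darker than white_threshold.

    rows holds grayscale scan lines (0 = black, 255 = white). Both the
    representation and the threshold are assumptions made for this sketch.
    """
    for row in rows:                      # one scan line at a time
        for pixel in row:
            if pixel < white_threshold:   # first non-white pixel found
                return False              # stop early: the page is not blank
    return True


# Tiny demonstration bitmaps: one all white, one with a single black pixel.
blank_page = [[255] * 8 for _ in range(4)]
inked_page = [[255] * 8 for _ in range(4)]
inked_page[2][5] = 0

print(is_blank_bitmap(blank_page))
print(is_blank_bitmap(inked_page))
```

The early return is what makes non-blank pages cheap: only a truly blank page forces a walk over every pixel.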

hjerteblod

Thank you for the quick reply. However, we already use this solution for JPEGs; now a solution for PDFs has to be created. Could I base it on text recognition instead? That is, if searchable text is found on a page, nothing is deleted?

tfrost

If you have a scanner which does OCR while it scans, then of course you can look at the page text and see if it is empty. But this does not tell you whether there are box outlines and rules, or pictures in which the OCR found no text. So you need to iterate over and check all images, as well as checking the text. Only you know what may turn up in the documents you are working with, so none of the users here can say precisely what has to be absent to guarantee that the page is completely blank. Some experimentation is probably called for!
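The combined check described above (text first, then images) might be sketched as follows. This is a hedged illustration only: `PageSummary` and `is_deletion_candidate` are hypothetical names, and the sketch assumes the caller has already gathered the page text and the image dimensions with whatever library calls it uses.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PageSummary:
    """Hypothetical per-page record filled in by the caller."""
    text: str = ""                                                     # extracted page text
    image_sizes: List[Tuple[int, int]] = field(default_factory=list)  # (width, height) per image


def is_deletion_candidate(page: PageSummary) -> bool:
    """True only if the page has neither text nor images.

    As the post above warns, this does not prove the page is blank:
    vector content such as rules or box outlines is not covered here.
    """
    has_text = bool(page.text.strip())    # whitespace-only text counts as empty
    has_images = len(page.image_sizes) > 0
    return not has_text and not has_images


empty_page = PageSummary(text="  \n")
logo_page = PageSummary(text="", image_sizes=[(120, 40)])
text_page = PageSummary(text="Invoice 2019")

print([is_deletion_candidate(p) for p in (empty_page, logo_page, text_page)])
```

Treating the result as a candidate rather than a verdict leaves room for the experimentation the post recommends.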

Ingo (Moderator Group, joined 29 Oct 05)

Hi,

Text extraction won't work on a scanned page, because the page will normally be an image. Checking the scan lines (as already advised) can be a solution. You should scale the image down to a very low resolution; then the performance will be good enough. I'm doing similar things to check whether a PDF page has coloured content...

Cheers and welcome here, Ingo

Edited by Ingo - 04 Nov 19 at 7:04PM
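The low-resolution idea can be illustrated with the Python sketch below. It assumes a grayscale bitmap as rows of 0-255 values and simulates the downscaling by block averaging; in practice the gain would come from rendering the page at a low resolution in the first place. The names `downscale` and `is_blank_lowres`, the factor, and the threshold are all assumptions for illustration.

```python
from typing import List, Sequence


def downscale(rows: Sequence[Sequence[int]], factor: int) -> List[List[int]]:
    """Average each factor x factor block into one pixel (edge blocks may be smaller)."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    small = []
    for top in range(0, height, factor):
        out_row = []
        for left in range(0, width, factor):
            block = [rows[y][x]
                     for y in range(top, min(top + factor, height))
                     for x in range(left, min(left + factor, width))]
            out_row.append(sum(block) // len(block))  # integer mean of the block
        small.append(out_row)
    return small


def is_blank_lowres(rows: Sequence[Sequence[int]], factor: int = 4,
                    white_threshold: int = 250) -> bool:
    """Blank check on the reduced bitmap; the threshold is an assumed value."""
    return all(p >= white_threshold for row in downscale(rows, factor) for p in row)


white = [[255] * 8 for _ in range(8)]
speck = [[255] * 8 for _ in range(8)]
speck[1][1] = 0                      # a single black pixel

print(is_blank_lowres(white))
print(is_blank_lowres(speck, factor=4))
```

There is a trade-off to keep in mind: the larger the factor, the more a small mark is averaged away, so a very coarse reduction can report a lightly marked page as blank.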

hjerteblod

The scanned documents contain text and company logos, so I have to exclude such pages from deletion. We have already implemented the scan-line check elsewhere; I now have to program this explicitly for PDF. So did I understand you correctly that I first have to check whether there is text, and then whether there are pictures?

Thanks for the hints and the welcome.

Greetings, hjerteblod

Ingo

These are the first functions to look at for your issue: HasFontResources, ImageCount, FindImages. To extract images, you'll find the functionality in this group: https://www.debenu.com/docs/pdf_library_reference/ImageHandling.php The functions beginning with GetImage... will meet your needs. The extraction of text you know already...

Cheers,
Ingo

hjerteblod

Awesome, thank you. I started with:

    If openFileDialog1.ShowDialog = DialogResult.OK Then
        result = openFileDialog1.FileName
        QP.LoadFromFile(result, "")
        QP.GetPageText(0)
        MsgBox(QP.GetPageText(0))
    End If

This works, and now I will go on with your hints, thanks!!

hjerteblod

So far, everything works as I wish. However, pages are not deleted when I write the code for it, and there is no error message.

    Dim openFileDialog1 As New OpenFileDialog()
    Dim result As String
    If openFileDialog1.ShowDialog = DialogResult.OK Then
        result = openFileDialog1.FileName
        QP.LoadFromFile(result, "")
        QP.HasFontResources()
        QP.GetPageImageList(0)
        QP.FindImages()
        QP.ImageCount()
        QP.DeletePages(2, 1)
    End If

Can someone tell me if I am missing something? I am very grateful for any hints.

tfrost

I assume that what is shown above is just an outline of your real code, omitting all the checks and error handling you are doing. Does the listing above also omit a QP.SaveToFile call that is in your real code, or is that omission the reason the pages appear not to be deleted?

hjerteblod

Yes, you are right, I missed the SaveToFile.

hjerteblod

Hey guys, so far almost everything works as wished, but one problem is left. I use the number of images (QP.ImageCount()) and the text (QP.GetPageText(0)) to filter out the blank pages. The problem is that blank pages are not recognized as blank; instead, 2 pictures are detected on them. If my text is not searchable (no OCR), it is also recognized as an image. Therefore, I cannot distinguish between a blank page on which images are detected although nothing is visible, and a page of non-searchable text that should not be deleted. This problem seems unsolvable to me, but I still wanted to present it here.
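One possible way to tell the two cases apart, in line with the earlier advice to look at the pixels, is to measure how much of the page image is actually dark. The Python sketch below assumes the page (or its embedded scan image) is available as a grayscale bitmap; `dark_pixel_ratio`, `is_effectively_blank`, and both thresholds are hypothetical values that would need tuning against real scans.

```python
from typing import Sequence


def dark_pixel_ratio(rows: Sequence[Sequence[int]], dark_threshold: int = 128) -> float:
    """Fraction of pixels darker than dark_threshold (0 = black, 255 = white)."""
    total = 0
    dark = 0
    for row in rows:
        for pixel in row:
            total += 1
            if pixel < dark_threshold:
                dark += 1
    return dark / total if total else 0.0


def is_effectively_blank(rows: Sequence[Sequence[int]],
                         max_dark_ratio: float = 0.001) -> bool:
    """Treat the page as blank if only a tiny share of pixels is dark.

    max_dark_ratio is an assumed tolerance for scanner specks, not a
    recommended value.
    """
    return dark_pixel_ratio(rows) <= max_dark_ratio


# 100 x 100 test bitmaps: a few specks versus two lines of "text".
specks = [[255] * 100 for _ in range(100)]
for i in range(5):
    specks[i * 10][i * 7] = 0          # 5 isolated dark pixels

written = [[255] * 100 for _ in range(100)]
for x in range(100):
    written[20][x] = 0                 # 200 dark pixels in two rows
    written[40][x] = 0

print(is_effectively_blank(specks))
print(is_effectively_blank(written))
```

With this kind of check, a scanned blank page and a scanned page of non-searchable text both contain an image, but only the latter contains a meaningful share of dark pixels.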

tfrost

Have you looked at GetPageImageList and then, if there is a list, GetImageListCount? And then at getting the image ID using GetImageListItemIntProperty? I have not used these functions myself, but they should allow you to work out which page(s) the images you counted with ImageCount are on. The internal structure of a PDF is not designed to be easy to walk around in!

Ingo

Hi,

Tim pushed you in the right direction. Here is part of a loop that shows how to extract embedded images from a PDF document page by page (in Pascal/Delphi):

    for i2 := 1 to pagecount do
    begin
      iref := QP.SelectPage(i2);
      ilst := QP.GetPageImageList(iref);
      icnt := QP.GetImageListCount(ilst);
      for i3 := 1 to icnt do
      begin
        lv_imtype := QP.GetImageListItemIntProperty(ilst, i3, 400); // image type
        if ( lv_imtype > 4 ) then lv_type := '.jpg';
        if ( lv_imtype = 1 ) then lv_type := '.jpg';
        if ( lv_imtype = 2 ) then lv_type := '.bmp';
        if ( lv_imtype = 3 ) then lv_type := '.tif';
        if ( lv_imtype = 4 ) then lv_type := '.png';
        lv_width := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 401));
        lv_height := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 402));
        lv_bpp := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 403)); // BitsPerPixel
        lv_cst := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 404)); // ColorSpaceType
        lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 406)); // Constant Image ID
        if ( Trim(lv_imguid) = '' ) or ( Trim(lv_imguid) = '0') then
          lv_imguid := IntToStr(QP.GetImageListItemIntProperty(ilst, i3, 405)); // Image ID
        lv_imagefile := target + ExtractFileName(System.Copy(ws,1,Length(ws)-4)) + '-' + IntToStr(i2) + '-' + IntToStr(i3) + '-' + lv_type;
        QP.SaveImageListItemDataToFile(ilst, i3, 0, lv_imagefile);
      end; // for i3
    end; // for i2

Cheers,
Ingo
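The chain of `if` statements that picks a file extension in the excerpt above can also be read as a lookup table. The Python sketch below mirrors only that part and the file-name pattern; it makes no library calls. The type codes are taken from the excerpt, while the function names and the `.jpg` fallback for codes below 1 are assumptions made for this illustration.

```python
import os

# Type codes as they appear in the Delphi excerpt (image list property 400).
_EXTENSIONS = {1: ".jpg", 2: ".bmp", 3: ".tif", 4: ".png"}


def image_extension(image_type: int) -> str:
    """Map an image type code to an extension; anything else falls back to .jpg."""
    return _EXTENSIONS.get(image_type, ".jpg")


def image_file_name(target_dir: str, pdf_path: str, page: int,
                    index: int, image_type: int) -> str:
    """Build <target><pdf name>-<page>-<index>-<ext>, following the excerpt's pattern."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]  # PDF name without extension
    return f"{target_dir}{stem}-{page}-{index}-{image_extension(image_type)}"


print(image_file_name("out/", "scans/invoice.pdf", 3, 1, 4))
```

One small difference from the excerpt: `os.path.splitext` removes an extension of any length, whereas the Delphi code cuts off exactly four characters.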

hjerteblod

Thank you guys, this helps a lot!!

Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.