I need help - I can help - Image Legibility


Image Legibility

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=878
Printed Date: 26 Dec 25 at 3:30PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: Image Legibility

Posted By: bb46970
Subject: Image Legibility
Date Posted: 14 Mar 08 at 5:32PM

I have a client who scans text documents. They do not perform OCR; the pages are simply stored as images inside the PDFs. Sometimes the people scanning do a poor job, which results in pages that are very dark or entirely black. I am looking for a way to programmatically check the pages and flag any that are "suspect."

Replies:

Posted By: peteratoce
Date Posted: 18 Mar 08 at 8:49AM

First of all, your problem is more in the field of image processing than in PDF handling. Further, it is almost impossible to generate a good image from a really bad scan.

That said, you can have a look at e.g. PixEdit, which allows you to load PDFs and offers a COM interface to do all kinds of operations on your (hopefully monochrome?) images.

If you want to identify images that are too dark, you would probably have to look at a region in the margin that should be white, i.e. free of black pixels.

Unfortunately, there is no API function in PixEdit that returns the number of black pixels in a given area, but you can export an area to a file and then perhaps turn to ImageMagick to count the black pixels.

On second thought, simply save in the format "Uncompressed, No header", read in the bytes, and count the number of "1" bits in each byte yourself.

Peter
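The bit-counting idea from this reply can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exported margin area is a headerless dump of packed 1-bit pixels, a 1 bit means a black pixel, and the names black_fraction, margin_looks_dirty and the 2% limit are made up for the example. Padding bits at the end of each row, if the export adds any, are counted as pixels here.

```python
# Lookup table: number of set bits for every possible byte value (0..255).
BITS_SET = bytes(bin(value).count("1") for value in range(256))


def black_fraction(raw: bytes) -> float:
    """Return the share of 1 bits in packed 1-bit image data.

    Assumption: headerless dump, each 1 bit is one black pixel.
    """
    if not raw:
        return 0.0
    ones = sum(BITS_SET[b] for b in raw)
    return ones / (len(raw) * 8)


def margin_looks_dirty(raw: bytes, max_black: float = 0.02) -> bool:
    """Flag a margin excerpt that should be white but is not.

    max_black is a hypothetical limit; tune it on real scans.
    """
    return black_fraction(raw) > max_black
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to margin_looks_dirty.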

Posted By: bb46970
Date Posted: 18 Mar 08 at 11:37AM

Thanks for the reply. I came up with an option that may work. I believe that they do scan the documents as 1-bit, but I went with a greyscale approach. I use scanline to examine each pixel and add up its red, green, and blue values. If the sum falls below 50% of the maximum, I treat the pixel as "dark." If necessary, I can adjust the 50%. I keep a tally of all the dark pixels, then apply a threshold to the page. For example, if 45% of the page is dark pixels, I flag it as "suspect." I do that for each page in the document. My only concern is finding suspect pages for a human to examine, so that they can decide whether the document needs to be rescanned.
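The approach described above might look roughly like the following Python sketch. It is not the poster's actual code: it assumes each page has already been rendered to packed R, G, B bytes (3 per pixel), and the function names and parameters are invented for the illustration. The 50% and 45% values are the ones mentioned in the post.

```python
def dark_fraction(rgb: bytes, dark_cutoff: float = 0.5) -> float:
    """Return the share of pixels whose R+G+B sum is below the cutoff.

    Assumption: rgb holds packed R, G, B bytes, 3 per pixel.
    dark_cutoff is a fraction of the maximum sum (3 * 255).
    """
    pixel_count = len(rgb) // 3
    if pixel_count == 0:
        return 0.0
    limit = dark_cutoff * 3 * 255
    dark = 0
    for i in range(0, pixel_count * 3, 3):
        if rgb[i] + rgb[i + 1] + rgb[i + 2] < limit:
            dark += 1
    return dark / pixel_count


def is_suspect(rgb: bytes, dark_cutoff: float = 0.5,
               page_threshold: float = 0.45) -> bool:
    """Flag the page when at least page_threshold of it is dark."""
    return dark_fraction(rgb, dark_cutoff) >= page_threshold
```

Both thresholds are left as parameters so they can be adjusted after looking at a few known good and known bad scans.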

If anyone has a better option - particularly faster - I am open to it. Some of these documents are hundreds of pages long, and I may have to process hundreds of documents at a time.
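One possible way to cut the work, not something proposed in the thread, is to examine only every Nth row of each page. Since the goal is just to flag pages for a human to look at, an estimate from a sample of rows may be good enough. The sketch below assumes 8-bit greyscale data with one byte per pixel stored row by row; sampled_dark_fraction, row_step and the cutoff of 128 are hypothetical names and values.

```python
def sampled_dark_fraction(gray: bytes, width: int,
                          row_step: int = 8, cutoff: int = 128) -> float:
    """Estimate the dark-pixel share from every row_step-th row.

    Assumption: gray is 8-bit greyscale, one byte per pixel,
    rows stored one after another, each `width` bytes long.
    """
    if width <= 0 or row_step <= 0:
        raise ValueError("width and row_step must be positive")
    height = len(gray) // width
    dark = 0
    seen = 0
    for y in range(0, height, row_step):
        row = gray[y * width:(y + 1) * width]
        dark += sum(1 for value in row if value < cutoff)
        seen += len(row)
    return dark / seen if seen else 0.0
```

With row_step set to 8, roughly one eighth of the pixels are visited. A page whose estimate lands close to the threshold could be rechecked with row_step set to 1.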