Image Legibility


bb46970

Team Player

Joined: 06 Mar 06
Status: Offline
Points: 33


Topic: Image Legibility
Posted: 14 Mar 08 at 5:32PM

I have a client who scans text documents. They do not perform OCR; they just save the pages as images inside the PDFs. Sometimes the people scanning do a poor job, resulting in pages that are really dark or completely black. I am looking for a way to programmatically check the pages and flag any that are "suspect."

peteratoce

Beginner

Joined: 23 Feb 07
Location: Germany
Status: Offline
Points: 8


Posted: 18 Mar 08 at 8:49AM

First of all, your problem is more in the field of image processing than in PDF handling. Further, it is almost impossible to generate a good image from a really bad scan.

That said, you can have a look at e.g. PixEdit, which allows you to load PDFs and offers a COM interface to do all kinds of operations on your (hopefully monochrome?) images.

If you want to identify images that are too dark, you would probably have to look at a region in the margin that should be white, i.e. free of black pixels.

There is unfortunately no API function in PixEdit that returns the number of black pixels in a given area, but you can export an area to a file and then perhaps turn to ImageMagick to count the black pixels.

On second thought, simply save in the "Uncompressed, No header" format, read in the bytes, and count the number of "1" bits in each byte yourself.

Peter

Edited by peteratoce - 18 Mar 08 at 9:10AM
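The bit-counting idea from the post above can be sketched roughly as follows. This is only an illustration in Python, not PixEdit code. It assumes the headerless export packs 8 pixels per byte with the leftmost pixel in the highest bit, pads each row to a whole byte, and stores black as 1. None of that is confirmed in the thread, and the name `black_fraction` is made up for the example.

```python
# Lookup table: number of 1-bits in every possible byte value.
POPCOUNT = [bin(value).count("1") for value in range(256)]


def black_fraction(raw: bytes, width: int, height: int) -> float:
    """Return the fraction of pixels set to 1 in a packed 1-bit image.

    Assumptions (not confirmed by the thread): rows are padded to a whole
    byte, the leftmost pixel is the most significant bit, and 1 means black.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    row_bytes = (width + 7) // 8
    if len(raw) < row_bytes * height:
        raise ValueError("buffer is too small for the given dimensions")

    # Mask that drops the padding bits at the end of each row.
    pad_bits = row_bytes * 8 - width
    last_mask = (0xFF << pad_bits) & 0xFF

    black = 0
    for y in range(height):
        row = raw[y * row_bytes:(y + 1) * row_bytes]
        black += sum(POPCOUNT[b] for b in row[:-1])
        black += POPCOUNT[row[-1] & last_mask]
    return black / (width * height)
```

Running it only on a strip that should be blank margin, as suggested above, would give a value near 0 for a clean scan and a noticeably higher value for a dark one.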

bb46970


Posted: 18 Mar 08 at 11:37AM

Thanks for the reply. I came up with an option that may work. I believe that they do scan the documents as 1-bit. However, I went with a greyscale approach. I use scanline to examine each pixel and add up its red, green, and blue values. If the sum falls below 50% of the maximum, I treat the pixel as "dark." If necessary, I can adjust the 50%. I keep a tally of all of the dark pixels. Then I set a threshold for the page. For example, if 45% of the page is dark pixels, I flag it as "suspect." I do that for each page in the document. My only concern is finding suspect pages for a human to examine and decide whether the document needs to be rescanned.

If anyone has a better option, particularly a faster one, I am open to it. Some of these documents are hundreds of pages long, and I may have to process hundreds of documents at a time.
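For readers who want to try the approach described in this post, here is a minimal sketch in Python. It is not the poster's actual code. It assumes the page has already been rendered to a flat buffer of 8-bit R, G, B triples, and the names `dark_fraction`, `is_suspect`, `dark_cutoff`, `page_threshold`, and `step` are hypothetical. The `step` parameter is an addition not mentioned in the thread: it checks only every n-th pixel, which trades some accuracy for speed.

```python
def dark_fraction(rgb: bytes, dark_cutoff: float = 0.5, step: int = 1) -> float:
    """Return the fraction of sampled pixels whose R+G+B sum is 'dark'.

    rgb is assumed to be a flat buffer of 8-bit R, G, B triples.
    A pixel counts as dark when its channel sum is below
    dark_cutoff * 765 (765 = 3 * 255, the brightest possible sum).
    """
    if len(rgb) == 0 or len(rgb) % 3 != 0:
        raise ValueError("buffer must hold at least one whole RGB triple")
    if step < 1:
        raise ValueError("step must be at least 1")

    limit = dark_cutoff * 3 * 255
    pixel_count = len(rgb) // 3
    sampled = range(0, pixel_count, step)  # every step-th pixel

    dark = 0
    for p in sampled:
        i = p * 3
        if rgb[i] + rgb[i + 1] + rgb[i + 2] < limit:
            dark += 1
    return dark / len(sampled)


def is_suspect(rgb: bytes, dark_cutoff: float = 0.5,
               page_threshold: float = 0.45, step: int = 1) -> bool:
    """Flag a page when at least page_threshold of its pixels are dark."""
    return dark_fraction(rgb, dark_cutoff, step) >= page_threshold
```

With the defaults this mirrors the 50% pixel cutoff and the 45% page threshold from the post. Both values would need tuning against real scans.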
