ocr - where is the recognized text?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2102
Printed Date: 22 Dec 25 at 6:37AM

Topic: ocr - where is the recognized text?

Posted By: vladob
Subject: ocr - where is the recognized text?
Date Posted: 13 Jan 12 at 5:36PM

Hi all

I have the following question: when you ask OCR software to read an image-only PDF (scanned pictures saved as PDF), the OCR engine injects the recognized text into the PDF file. Can you let me know where? I mean, how can I access that recognized text with QuickPDF?

Many thanks

Vladimir

Replies:

Posted By: Ingo
Date Posted: 13 Jan 12 at 6:47PM

Hi Vladimir!

I don't know if I understand your question correctly, but ...
First there's a scanned invoice, for example.
It is initially scanned into a PDF as an image.
You can view this PDF via QuickPDF, change its properties and so on, but text extraction isn't possible.
Then there are OCR tools that go through this PDF and produce readable text content from the "image PDF".
The "image PDF" itself remains the same, but the OCR tool additionally inserts real text content.
Now you can extract this text with QuickPDF, and things like full-text search become possible.

With QuickPDF you can determine whether a PDF has been OCR-ed, because text extraction has an option to extract with font names. OCR fonts are very special fonts, and in most cases the font name contains "ocr" too.
The other way to recognize an OCR-ed PDF is:
the number of embedded images equals the page count, and the images have the same dimensions as the pages.
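
The two checks above can be sketched in plain C#. This is only a minimal illustration, not QuickPDF code: the class and method names (OcrHeuristics, HasOcrFontName, LooksLikeScannedPages) are hypothetical, and the sketch assumes you have already collected the font names and the page and image sizes (in the same unit) from the library yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of QuickPDF: it works on values
// you have already read from the PDF by other means.
public static class OcrHeuristics
{
    // Check 1: true if any font name contains "ocr" (case-insensitive).
    public static bool HasOcrFontName(IEnumerable<string> fontNames)
    {
        return fontNames.Any(name =>
            name != null &&
            name.IndexOf("ocr", StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // Check 2: exactly one image per page, and each image matches its
    // page's width and height within a tolerance (same unit for both).
    public static bool LooksLikeScannedPages(
        IList<(double Width, double Height)> pageSizes,
        IList<(double Width, double Height)> imageSizes,
        double tolerance = 1.0)
    {
        if (pageSizes.Count == 0 || pageSizes.Count != imageSizes.Count)
            return false;

        for (int i = 0; i < pageSizes.Count; i++)
        {
            if (Math.Abs(pageSizes[i].Width - imageSizes[i].Width) > tolerance ||
                Math.Abs(pageSizes[i].Height - imageSizes[i].Height) > tolerance)
                return false;
        }
        return true;
    }
}
```

Both checks are only hints. A document typeset in an ordinary font whose name happens to contain "ocr" passes the first check, and a photo album with one full-page picture per page passes the second, so it is probably wise to combine them with an actual text extraction attempt.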

I hope I could help a little bit, and perhaps now you have further ideas ;-)

Cheers and welcome here,
Ingo

Posted By: AndrewC
Date Posted: 16 Jan 12 at 2:22PM

OCR text is often inserted as an invisible text object that cannot be seen but can be extracted with the GetPageText text extraction function in QPL.

int ret = QP.LoadFromFile("ocred.pdf", ""); // second argument is the password (empty here)

string s = QP.GetPageText(3); // you can also try option 7 or 8.
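
To see which pages actually received a text layer from the OCR tool, the same extraction can be repeated for every page and the empty results collected. The following is a rough sketch under assumptions: OcrLayerCheck and PagesWithoutText are made-up names, and the page text is supplied through a callback so that the helper does not depend on any particular QPL version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of QPL.
public static class OcrLayerCheck
{
    // Returns the 1-based numbers of all pages whose extracted text is
    // null, empty, or whitespace only. getPageText is supplied by the
    // caller and must return the extracted text for the given page.
    public static List<int> PagesWithoutText(int pageCount, Func<int, string> getPageText)
    {
        var missing = new List<int>();
        for (int page = 1; page <= pageCount; page++)
        {
            if (string.IsNullOrWhiteSpace(getPageText(page)))
                missing.Add(page);
        }
        return missing;
    }
}
```

With the library, the callback would presumably select the page and then extract its text, for example p => { QP.SelectPage(p); return QP.GetPageText(3); }, with QP.PageCount() as the page count. Treat those calls as assumptions and check them against the function reference of your QPL version.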

Posted By: vladob
Date Posted: 17 Jan 12 at 7:43AM

Many thanks for your precious help

It works

Have a nice day

Posted By: wubuer
Date Posted: 25 Nov 15 at 7:56AM

AndrewC wrote:

string s = QP.GetPageText(3); // you can also try option 7 or 8.

Thanks, it helps a lot.