
ocr - where is the recognized text?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2102
Printed Date: 01 May 24 at 2:05PM


Topic: ocr - where is the recognized text?
Posted By: vladob
Subject: ocr - where is the recognized text?
Date Posted: 13 Jan 12 at 5:36PM
Hi all
 
I have the following question: when OCR software reads an image-only PDF (scanned pictures saved as a PDF), the OCR engine injects the recognized text into the PDF file. Can you let me know where it puts it? I mean, how can I access that recognized text with QuickPDF?
 
Many thanks
 
 
Vladimir



Replies:
Posted By: Ingo
Date Posted: 13 Jan 12 at 6:47PM
Hi Vladimir!

I don't know if I understand your question right, but ...
First there's a scanned invoice, for example.
It's scanned as an image into a PDF first.
You can view this PDF via QuickPDF, change its properties and so on, but text extraction isn't possible.
Then there are OCR tools available that go through this PDF and create readable text content from the "image PDF".
The "image PDF" itself remains the same, but the OCR tool additionally inserts real text content.
Now you can extract this text with QuickPDF, and things like full-text search and others are possible.

With QuickPDF you can determine whether a PDF has been OCR-ed, because text extraction has an option to extract with font names... OCR fonts are very special fonts, and in most cases the font name contains "ocr" too.
The other way to detect an OCR PDF is:
the number of inserted images is the same as the page count, and the images have the same dimensions as the pages.
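
The two hints above can be sketched as a small, library-independent check. This is only an illustration under stated assumptions: the `Page` record and the functions `looks_scanned`, `has_ocr_font` and `classify` are hypothetical names, not QuickPDF functions, and you would fill the records from whatever font, image and page-size information your PDF library reports. The sketch also tightens the image-count rule to "exactly one page-sized image per page", and the font-name test stays a hint rather than proof, since not every OCR tool uses a font with "ocr" in its name.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical per-page summary, filled from your PDF library of choice."""
    width: float
    height: float
    # (width, height) of each image as drawn on the page, in page units
    image_sizes: list = field(default_factory=list)
    # names of the fonts used by text on the page
    font_names: list = field(default_factory=list)


def looks_scanned(pages, tolerance=1.0):
    """True if every page holds exactly one image that covers the whole page."""
    if not pages:
        return False
    for page in pages:
        if len(page.image_sizes) != 1:
            return False
        img_w, img_h = page.image_sizes[0]
        if abs(img_w - page.width) > tolerance or abs(img_h - page.height) > tolerance:
            return False
    return True


def has_ocr_font(pages):
    """True if any font name contains "ocr" (case-insensitive)."""
    return any("ocr" in name.lower() for page in pages for name in page.font_names)


def classify(pages):
    """Combine both hints into a rough label (a heuristic, not a guarantee)."""
    if looks_scanned(pages):
        if has_ocr_font(pages):
            return "scanned, OCR text present"
        return "scanned, no OCR text found"
    return "not a plain scan"
```

For example, `classify([Page(595, 842, [(595, 842)], ["SomeOCRFont"])])` reports a scanned page with OCR text, where the font name is made up for the illustration.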

I hope I could help a little bit, and perhaps now you have further ideas ;-)

Cheers and welcome here,
Ingo
 


Posted By: AndrewC
Date Posted: 16 Jan 12 at 2:22PM
OCR text is often inserted into an invisible text object that cannot be seen but can be extracted with GetPageText text extraction functions within QPL.

  int ret = QP.LoadFromFile("ocred.pdf", "");   // second argument is the password; empty for an unprotected file
  string s = QP.GetPageText(3);    // you can also try option 7 or 8.




Posted By: vladob
Date Posted: 17 Jan 12 at 7:43AM
Many thanks for your precious help
It works
Have a nice day
V.


Posted By: wubuer
Date Posted: 25 Nov 15 at 7:56AM
Originally posted by AndrewC:

  string s = QP.GetPageText(3);    // you can also try option 7 or 8.



Thanks, it helps a lot.


