Problem with text extraction from an OCR PDF



Ingo, Moderator Group, Joined: 29 Oct 05	Posted: 22 Aug 08 at 9:30AM
	Hi! I have a scanned PDF document (created with ScandAll PRO V1.0 / Adobe PDF Scan Library 2.3). During scanning, the running text was converted by OCR. With QP I haven't found a way to get the text content via text extraction. Does anybody here have experience with this? What should I do? The document conforms to PDF specification 1.4 and is not encrypted. Thanks in advance and best regards, Ingo
	Edited by Ingo - 22 Aug 08 at 9:32AM

DELBEKE, Debenu Quick PDF Library Expert, Joined: 31 Oct 05, Location: France	Posted: 22 Aug 08 at 1:36PM
	Hi Ingo, can you send me the file? I'll have a look at it. jean-luc(at)delbeke(dot)fr
	Edited by DELBEKE - 22 Aug 08 at 1:37PM

Ingo	Posted: 23 Aug 08 at 8:27AM
	Hi Jean-Luc! Thanks for helping. I've sent it to you. Cheers and best regards, Ingo

DELBEKE	Posted: 23 Aug 08 at 1:16PM
	Hi Ingo. No problem for me, it works perfectly. In your document, the text is on a separate layer. Try calling the CombineLayers function before the GetPageText function. I've made some enhancements to this function; perhaps you missed this post: http://www.quickpdf.org/forum/forum_posts.asp?TID=923 In the meantime, I've made more enhancements to the annotation functions and sent them to Michel, but he seems to be on holiday. For now, I am working on a digital signing function, but I'm having some difficulties with Delphi (I only started with Delphi in July).

Ingo	Posted: 28 Aug 08 at 5:18AM
	Hi Jean-Luc! I'm in England at the moment, hence the late answer. I'll try it with CombineLayers. Thanks a lot. If you have any enhancements for the library, please send them to me first. I'll implement them and then send the whole package to Michel, who uploads it so the members can test it. Cheers and best regards, Ingo

Ingo	Posted: 31 Aug 08 at 4:47PM
	Hi Jean-Luc! How are you using the extract function? My code still doesn't work. If you want to take a look:

	QP := TiSEDQuickPDF.Create;
	try
	  QP.UnlockKey('mycode');
	  dafh := QP.DAOpenFile(fneu,'');
	  x := QP.DAGetPageCount(dafh);
	  STR := '';
	  ProgressBar1.Min := 0;
	  ProgressBar1.Max := x;
	  verztxt := fneu + '.txt';
	  AssignFile(cf,verztxt);
	  Rewrite(cf);
	  for i := 1 to x Do
	  begin
	    ProgressBar1.Position := i;
	    ProgressBar1.Repaint;
	    dapr := QP.DAFindPage(dafh,i);
	    QP.CombineLayers;
	    STR := QP.DAExtractPageText(dafh,dapr,0);
	    WriteLn(cf, ' ');
	    WriteLn(cf, ' page ' + IntToStr(i) + ' from ' + IntToStr(x) + ' ');
	    WriteLn(cf, ' ');
	    WriteLn(cf,Trim(STR));
	  end;
	  CloseFile(cf);
	finally
	  QP.DACloseFile(dafh);
	  QP.Free;
	  ProgressBar1.Position := 0;
	  Screen.Cursor := Save_Cursor; { Always restore to normal }
	end;

	Thanks a lot! Cheers, Ingo

DELBEKE	Posted: 31 Aug 08 at 5:33PM
	Hi Ingo, I have not tried it with the direct access functions. It is too late today for me to try, and tomorrow I won't be able to test. So I'll look at it later, but I will do it.

Ingo	Posted: 01 Sep 08 at 1:48AM
	Hi Jean-Luc! This is already a help - I'll try the "normal" extract function ;-) Cheers, Ingo

DELBEKE	Posted: 02 Sep 08 at 12:59PM
	Hi Ingo, I've tested with GetPageText(Parameter). With Parameter 0 and 1, no text is extracted. The other values work correctly. Strange.

DELBEKE	Posted: 02 Sep 08 at 2:10PM
	Hi again, I've traced the program to understand the difference. The text contained in your file is Unicode-encoded. The extraction method used with parameter 0/1 (parameters 0 and 1 are strictly identical) doesn't use the rendering engine and can't extract the ANSI string. This should be improved.

Ingo	Posted: 03 Sep 08 at 1:38AM
	Hi Jean-Luc! I've had a similar experience. I can only extract the text from my particular document with GetPageText(4). Parameters 0 and 3 don't work. I wonder why, but I'm happy that it works now with your help! Thanks a lot! Cheers, Ingo
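
For later readers, the approach the thread converges on (the regular functions instead of the direct access ones, CombineLayers before extraction, and GetPageText(4)) might look roughly like the sketch below. It has not been tested against the file discussed here. Only TiSEDQuickPDF, UnlockKey, CombineLayers and GetPageText(4) come from the posts above; the unit name iSEDQuickPDF, the single-argument LoadFromFile call and its return value of 1 on success, PageCount, SelectPage, and the procedure name ExtractOcrText are assumptions to check against the reference for your library version.

```delphi
program ExtractOcrTextDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  iSEDQuickPDF; // assumed unit name for the TiSEDQuickPDF class

// Hypothetical helper: writes the text of every page of PdfFile to TxtFile.
procedure ExtractOcrText(const PdfFile, TxtFile: string);
var
  QP: TiSEDQuickPDF;
  OutFile: TextFile;
  PageNo, Total: Integer;
begin
  QP := TiSEDQuickPDF.Create;
  try
    QP.UnlockKey('mycode');
    // Assumption: LoadFromFile takes only the file name and returns 1 on
    // success; some library versions expect a password as a second argument.
    if QP.LoadFromFile(PdfFile) <> 1 then
      raise Exception.Create('Could not load ' + PdfFile);
    Total := QP.PageCount;
    AssignFile(OutFile, TxtFile);
    Rewrite(OutFile);
    try
      for PageNo := 1 to Total do
      begin
        QP.SelectPage(PageNo);
        // The OCR text sits on a separate layer, so merge the layers first.
        QP.CombineLayers;
        WriteLn(OutFile, 'page ' + IntToStr(PageNo) + ' from ' + IntToStr(Total));
        // Option 4 is the one reported to work for this document;
        // options 0 and 1 returned no text.
        WriteLn(OutFile, Trim(QP.GetPageText(4)));
      end;
    finally
      CloseFile(OutFile);
    end;
  finally
    QP.Free;
  end;
end;

begin
  // Usage: ExtractOcrTextDemo <file.pdf> writes <file.pdf>.txt
  ExtractOcrText(ParamStr(1), ParamStr(1) + '.txt');
end.
```

The nested try/finally blocks close the text file and free the library object even if a page fails. Nothing is closed before it has been opened, which the posted code risks by calling DACloseFile in its finally block even when the failure happened before DAOpenFile.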