
Problem with textextraction from ocr-pdf

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=931
Printed Date: 26 Jan 26 at 10:39AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Problem with textextraction from ocr-pdf
Posted By: Ingo
Subject: Problem with textextraction from ocr-pdf
Date Posted: 22 Aug 08 at 9:30AM
Hi!
 
I have a scanned PDF document (created with ScandAll PRO V1.0 / Adobe PDF Scan Library 2.3). During scanning, the body text was converted to text via OCR. With QP I haven't found a way to get the text content via text extraction. Does anybody here have experience with this case? What should I do?
 
The document uses PDF specification 1.4 and is not encrypted.
 
Thanks in advance and best regards,
Ingo



Replies:
Posted By: DELBEKE
Date Posted: 22 Aug 08 at 1:36PM
Hi, Ingo
Can you send me the file? I'll have a look at it.
 
jean-luc(at)delbeke(dot)fr


Posted By: Ingo
Date Posted: 23 Aug 08 at 8:27AM
Hi Jean-Luc!
 
Thanks for helping.
I've sent it to you.
 
Cheers and best regards,
Ingo 


Posted By: DELBEKE
Date Posted: 23 Aug 08 at 1:16PM
Hi Ingo.
No problem for me, it works perfectly.
In your document, the text is on a separate layer. Try calling the CombineLayers function before the GetPageText function.
I've made some enhancements to this function; perhaps you missed this post: http://www.quickpdf.org/forum/forum_posts.asp?TID=923
In the meantime, I've made more enhancements to the annotation functions and sent them to Michel, but he seems to be on holiday.
For now, I am working on a digital signing function, but I have some difficulties with Delphi (I only started with Delphi in July).


Posted By: Ingo
Date Posted: 28 Aug 08 at 5:18AM
Hi Jean-Luc!
 
I'm in England at the moment, hence the late answer.
I'll try it with CombineLayers. Thanks a lot.
If you have any enhancements for the library, please send them to me first. I'll integrate them and then send the whole package to Michel, who uploads it so the members can test it.
 
Cheers and best regards,
Ingo


Posted By: Ingo
Date Posted: 31 Aug 08 at 4:47PM
Hi Jean-Luc!
 
How are you using the extract function?
My code still doesn't work. If you want to look:
   QP := TiSEDQuickPDF.Create;
   try
       QP.UnlockKey('mycode');
       dafh := QP.DAOpenFile(fneu,'');
       x    := QP.DAGetPageCount(dafh);
       STR  := '';
       ProgressBar1.Min := 0;
       ProgressBar1.Max := x;
       verztxt := fneu + '.txt';
       AssignFile(cf,verztxt);
       Rewrite(cf);
       for i := 1 to x Do
       begin
          ProgressBar1.Position := i;
          ProgressBar1.Repaint;
          dapr := QP.DAFindPage(dafh,i);
          QP.CombineLayers;
          STR  := QP.DAExtractPageText(dafh,dapr,0);      
          WriteLn(cf, '   ');
          WriteLn(cf, ' page ' + IntToStr(i) + ' from ' + IntToStr(x) + ' ');
          WriteLn(cf, '   ');
          WriteLn(cf,Trim(STR));
       end;
       CloseFile(cf);
   finally
       QP.DACloseFile(dafh);
       QP.Free;
       ProgressBar1.Position := 0;
       Screen.Cursor := Save_Cursor;  { Always restore to normal }
   end;
 
Thanks a lot!
 
Cheers,
Ingo


Posted By: DELBEKE
Date Posted: 31 Aug 08 at 5:33PM
Hi Ingo
I have not tried it with the direct access functions. Today it is too late for me to try, and tomorrow I will not be able to test. So I'll look at it later, but I will do it.


Posted By: Ingo
Date Posted: 01 Sep 08 at 1:48AM
Hi Jean-Luc!

This is already a help - I'll try the "normal" extract function ;-)

Cheers,
Ingo


Posted By: DELBEKE
Date Posted: 02 Sep 08 at 12:59PM
Hi Ingo
 
I've tested with GetPageText(Parameter).
With Parameter 0 and 1, no text is extracted. The other values work correctly.
 
Strange


Posted By: DELBEKE
Date Posted: 02 Sep 08 at 2:10PM
Hi again,
I've traced the program to understand the difference.
 
The text contained in your file is Unicode. The extraction method used with parameter 0/1 (parameters 0 and 1 are strictly identical) doesn't use the rendering engine and can't extract the ANSI string. This should be improved.


Posted By: Ingo
Date Posted: 03 Sep 08 at 1:38AM
Hi Jean-Luc!

I've had a similar experience. I can only extract the text from my particular document with GetPageText(4). Parameters 0 and 3 don't work. I wonder why, but I'm happy that it works now with your help! Thanks a lot!

Cheers,
Ingo
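
For later readers, the approach that ended up working in this thread (the regular, non-direct-access functions, CombineLayers before extraction, and GetPageText(4)) might look roughly like the sketch below. It is not code posted by either participant. It assumes the TiSEDQuickPDF class from the code above and that LoadFromFile, PageCount, SelectPage, CombineLayers and GetPageText are available in your library version; ExtractPdfText and its parameters are hypothetical names, and the LoadFromFile signature and return value may differ between versions.

```pascal
// Sketch only, not the library's reference usage.
// Needs SysUtils and Classes in the uses clause, plus the Quick PDF unit
// (its name depends on the library version).
// ExtractPdfText, PdfName and TxtName are hypothetical names.
procedure ExtractPdfText(const PdfName, TxtName: string);
var
  QP: TiSEDQuickPDF;
  Lines: TStringList;
  i, n: Integer;
begin
  QP := TiSEDQuickPDF.Create;
  Lines := TStringList.Create;
  try
    QP.UnlockKey('mycode');
    // Assumption: LoadFromFile returns 1 on success. Newer library
    // versions may expect a second (password) argument.
    if QP.LoadFromFile(PdfName) <> 1 then
      raise Exception.Create('Could not load ' + PdfName);
    n := QP.PageCount;
    for i := 1 to n do
    begin
      QP.SelectPage(i);
      QP.CombineLayers;  // the OCR text sits on a separate layer
      Lines.Add(' page ' + IntToStr(i) + ' from ' + IntToStr(n) + ' ');
      // Option 4 is the value reported to work for this document;
      // 0 and 1 returned no text.
      Lines.Add(Trim(QP.GetPageText(4)));
    end;
    Lines.SaveToFile(TxtName);
  finally
    Lines.Free;
    QP.Free;
  end;
end;
```

A call such as ExtractPdfText(fneu, fneu + '.txt') would mirror the file naming used in the original code. Which GetPageText option works may depend on the document, so it can be worth trying the other values if 4 returns nothing.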


