Problem with textextraction from ocr-pdf

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=931
Printed Date: 26 Jan 26 at 10:39AM

Topic: Problem with textextraction from ocr-pdf

Posted By: Ingo
Subject: Problem with textextraction from ocr-pdf
Date Posted: 22 Aug 08 at 9:30AM

Hi!

I have a scanned PDF document (created with ScandAll PRO V1.0 / Adobe PDF Scan Library 2.3). During scanning, the body text was converted to text by OCR. With QP I haven't found a way to get the text content via text extraction. Does anybody here have experience with this? What should I do?

The document uses PDF specification 1.4 and is not encrypted.

Thanks in advance and best regards,

Ingo

Replies:

Posted By: DELBEKE
Date Posted: 22 Aug 08 at 1:36PM

Hi, Ingo

Can you send me the file? I'll have a look at it.

jean-luc(at)delbeke(dot)fr

Posted By: Ingo
Date Posted: 23 Aug 08 at 8:27AM

Hi Jean-Luc!

Thanks for helping.

I've sent it to you.

Cheers and best regards,

Ingo

Posted By: DELBEKE
Date Posted: 23 Aug 08 at 1:16PM

Hi Ingo.

No problem for me, it works perfectly.

In your document, the text is on a separate layer. Try calling the CombineLayers function before the GetPageText function.

I've made some enhancements to this function; perhaps you missed this post: http://www.quickpdf.org/forum/forum_posts.asp?TID=923

In the meantime, I've made more enhancements to the annotation functions and sent them to Michel, but he seems to be on holiday.

For now, I am working on a digital signing function, but I have some difficulties with Delphi (I only started with Delphi in July).

Posted By: Ingo
Date Posted: 28 Aug 08 at 5:18AM

Hi Jean-Luc!

I'm in England at the moment, hence the late answer.

I'll try it with CombineLayers. Thanks a lot.

If you have any enhancements for the library, please send them to me first. I'll implement them and then send the whole package to Michel, who uploads it so the members can test it.

Cheers and best regards,

Ingo

Posted By: Ingo
Date Posted: 31 Aug 08 at 4:47PM

Hi Jean-Luc!

How are you using the extract function?

My code still doesn't work. Here it is, if you want to take a look:

   QP := TiSEDQuickPDF.Create;
   try
       QP.UnlockKey('mycode');
       dafh := QP.DAOpenFile(fneu,'');
       x    := QP.DAGetPageCount(dafh);
       STR := '';
       ProgressBar1.Min := 0;
       ProgressBar1.Max := x;
       verztxt := fneu + '.txt';
       AssignFile(cf,verztxt);
       Rewrite(cf);
       for i := 1 to x Do
       begin
          ProgressBar1.Position := i;
          ProgressBar1.Repaint;
          dapr := QP.DAFindPage(dafh,i);
          QP.CombineLayers;
          STR := QP.DAExtractPageText(dafh,dapr,0);

          WriteLn(cf, '   ');
          WriteLn(cf, ' page ' + IntToStr(i) + ' from ' + IntToStr(x) + ' ');
          WriteLn(cf, '   ');

          WriteLn(cf,Trim(STR));
       end;
       CloseFile(cf);
   finally
       QP.DACloseFile(dafh);
       QP.Free;
       ProgressBar1.Position := 0;
       Screen.Cursor := Save_Cursor; { Always restore to normal }
   end;

Thanks a lot!

Cheers,

Ingo

Posted By: DELBEKE
Date Posted: 31 Aug 08 at 5:33PM

Hi Ingo

I have not tried it with the direct access functions. It is too late today for me to try, and tomorrow I won't be able to test. I'll look at it later, but I will do it.

Posted By: Ingo
Date Posted: 01 Sep 08 at 1:48AM

Hi Jean-Luc!

That already helps. I'll try the "normal" extract function ;-)

Cheers,
Ingo

Posted By: DELBEKE
Date Posted: 02 Sep 08 at 12:59PM

Hi Ingo

I've tested with GetPageText(Parameter).

With Parameter 0 and 1, no text is extracted. The others work correctly.

Strange.

Posted By: DELBEKE
Date Posted: 02 Sep 08 at 2:10PM

Hi again,

I've traced the program to understand the difference.

The text contained in your files is Unicode. The extraction method used with parameter 0/1 (parameters 0 and 1 are strictly identical) doesn't use the rendering engine and can't extract the ANSI string. This would be a worthwhile improvement to make.

Posted By: Ingo
Date Posted: 03 Sep 08 at 1:38AM

Hi Jean-Luc!

I've had a similar experience. I can only extract the text from my particular document with GetPageText(4). Parameters 0 and 3 don't work. I wonder why, but I'm happy that it works now with your help! Thanks a lot!
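
For anyone who finds this thread later, here is a minimal sketch of the combination discussed above: load the document with the "normal" (non direct access) functions, call CombineLayers on each page, then call GetPageText(4). It is a sketch under assumptions, not the exact code from my project. CombineLayers, GetPageText and UnlockKey are the calls named in this thread. The single-argument LoadFromFile, its return value of 1 on success, PageCount, SelectPage and the unit name iSEDQuickPDF are assumptions about this library version, so please check them against your copy. ExtractOcrText is a made-up name. CombineLayers takes no file handle, so it presumably acts on the document loaded in memory and not on a file opened with DAOpenFile.

   uses SysUtils, iSEDQuickPDF; { unit name is an assumption }

   { Hypothetical helper: writes the text of every page to a text file }
   procedure ExtractOcrText(const PdfName, TxtName: string);
   var
      QP : TiSEDQuickPDF;
      cf : TextFile;
      i  : Integer;
   begin
      QP := TiSEDQuickPDF.Create;
      try
          QP.UnlockKey('mycode');
          { Assumption: LoadFromFile returns 1 on success }
          if QP.LoadFromFile(PdfName) <> 1 then
             Exit;
          AssignFile(cf, TxtName);
          Rewrite(cf);
          try
             for i := 1 to QP.PageCount do
             begin
                QP.SelectPage(i);
                { Merge the separate OCR text layer into the page content }
                QP.CombineLayers;
                { Option 4 worked for this document; 0 and 1 returned no text }
                WriteLn(cf, Trim(QP.GetPageText(4)));
             end;
          finally
             CloseFile(cf);
          end;
      finally
          QP.Free;
      end;
   end;

Call it as ExtractOcrText(fneu, fneu + '.txt'). If option 4 returns nothing for another document, try the other GetPageText options, since which one works seems to depend on how the text is encoded.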

Cheers,
Ingo