I need help - I can help - GetPageText with cid fonts doesn't work?


GetPageText with cid fonts doesn't work?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3915
Printed Date: 02 Jan 26 at 7:56PM

Topic: GetPageText with cid fonts doesn't work?

Posted By: Ingo
Subject: GetPageText with cid fonts doesn't work?
Date Posted: 16 May 21 at 1:40PM

Hi :)

Today I have a problem myself... ;-)
For features like full-text search I use GetPageText as the first preparation step.
Now I've had bad experiences with GetPageText on a few files.
The text extraction hangs: my app stops responding, I lose control, and in the end I have to kill the process with the Task Manager.
After a closer look at the affected documents I saw that they all contain many (more than 10) embedded fonts, and these fonts are always CID fonts.

I've tried some of the repair functionality offered by QuickPDF to transform the PDF and run the extraction afterwards, but nothing helps. It also doesn't matter which extraction option (0 to 8) I use.

Has anybody here had the same experience and perhaps found a solution?
Here are two of my samples for testing the extraction.
Thanks in advance:
https://www.is-soft.de/vx800/prob1.pdf
https://www.is-soft.de/vx800/prob2.pdf

Cheers,
Ingo


Replies:

Posted By: tfrost
Date Posted: 16 May 21 at 8:58PM

I can't help with a solution, but I tried analysing both files with PDF Analyzer Pro 5.0 (which normally gives a good report of issues). This also hung during text analysis.

Posted By: Sopracenery
Date Posted: 27 May 21 at 12:15AM

Hi Ingo,

Can you reduce one of your files to a minimum? I mean the smallest number of pages or words at which the error still occurs. That would simplify the search for the cause.

In another project I remember a bug that occurred when a particular character sat at a particular position (byte 1024) in a rich text file. A strange thing that was only found by reducing the file step by step.

Martin
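
The step-by-step reduction described above can be sketched as a bisection over the page range. This is only a rough sketch under assumptions: StillFails is a hypothetical stub that pretends page 7 is the culprit. In practice it would have to write the given page range to a new file and test the extraction on it, ideally in a separate process with a timeout, because the reported failure is a hang. The sketch also assumes the problem is caused by a single page, which may not hold if the number of fonts is the real cause.

```pascal
program BisectPages;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Returns True if the page range First..Last still shows the problem.
  TRangeTest = function(First, Last: Integer): Boolean;

// Hypothetical stub for illustration only: pretends page 7 is the culprit.
// A real test would save pages First..Last to a new file and try the extraction.
function StillFails(First, Last: Integer): Boolean;
begin
  Result := (First <= 7) and (7 <= Last);
end;

// Halves the page range until a single failing page remains.
// Assumes the whole range First..Last fails and one page is responsible.
function FindFailingPage(First, Last: Integer; Test: TRangeTest): Integer;
var
  Middle: Integer;
begin
  while First < Last do
  begin
    Middle := (First + Last) div 2;
    if Test(First, Middle) then
      Last := Middle        // the problem is in the lower half
    else
      First := Middle + 1;  // otherwise it must be in the upper half
  end;
  Result := First;
end;

begin
  Writeln('Failing page: ', FindFailingPage(1, 20, StillFails));
end.
```

With a 20-page document this needs about five test runs instead of twenty.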

Posted By: Ingo
Date Posted: 28 May 21 at 11:00PM

Hi Martin,

I think it's a general problem: too many CID fonts break some of the rendering functionality.
Here are the same documents, one reduced to a single page and the other to two pages. The problems are the same:
https://www.is-soft.de/vx800/prob1_klein.pdf
https://www.is-soft.de/vx800/prob2_klein.pdf

-------------
Cheers,
Ingo

Posted By: Sopracenery
Date Posted: 29 May 21 at 9:05AM

Hi Ingo,

I opened your sample prob2_klein.pdf (https://www.is-soft.de/vx800/prob2_klein.pdf) with 18.11 and there is no issue with GetPageText.

I tested options 0 and 7. Both work perfectly. Page 1 and page 2 are OK.

How can I help you?

Martin

Posted By: Ingo
Date Posted: 30 May 21 at 10:19PM

Hi Martin,

Thanks a lot for trying it yourself.
I'm working with the same release.
prob2_klein.pdf works with rendering (I have a small preview function) and I can see the text content, but when I use GetPageText with option 7 I get an empty text file.
prob1_klein.pdf doesn't work at all.

Here's a bit of code; perhaps you can see what I don't see ;-)
Thanks in advance.

QP := TDebenuPDFLibrary1811.Create;
// Create the lists before "try" so that "finally" can always free them
sl := TStringList.Create;
sl2 := TStringList.Create;
try
    QP.LoadFromFile(Edit1.Text, '');
    If (QP.EncryptionStatus > 0) Then
      QP.Decrypt;
    X := QP.PageCount;
    QP.CombineContentStreams;
    UNI := '';

// . . .

    filetxt := ChangeFileExt(ExtractFileName(Edit1.Text), '.txt');
    verztxt := tpath + '__' + filetxt;

    for i := 1 to X Do
    begin
      QP.SelectPage(i);
      QP.SetOrigin(1);
      QP.CombineContentStreams;
      // Separator line written in front of each page's text
      UNI2 := WideString(' ' + #13#10);
      UNI2 := UNI2 + WideString(' --- page ' + IntToStr(i) + ' from ' + IntToStr(X) + ' ---');
      UNI2 := UNI2 + WideString(' ' + #13#10);

      UNI := UNI2 + QP.GetPageText(7);

      UNI := UNI + #13#10;
      sl.Add(UNI);
    end;
    // Write the file only if the extraction finished without an exception
    sl.SaveToFile(verztxt, TEncoding.Unicode);
finally
    sl2.Free;
    sl.Free;
    QP.Free;
    Screen.Cursor := Save_Cursor; { Always restore to normal }
end;

-------------
Cheers,
Ingo

Posted By: Sopracenery
Date Posted: 31 May 21 at 8:42AM

I tried your sequence as follows:

QP.LoadFromFile(Edit1.Text, '');
X := QP.PageCount;
QP.CombineContentStreams;
QP.SelectPage(i);
QP.SetOrigin(1);
QP.CombineContentStreams;
QP.GetPageText(7);

and I see no issue.

But I do not have QP.Free in my library. So we are not working with the same binary.

I use DebenuPDFLibraryAX1811.dll on 32-bit Windows.

Please check your output directly after

QP.GetPageText(7);

in the debugger. Is there really nothing coming out?
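
One way to make that check visible without stepping through the debugger is to log what the call returned for each page. The following is only a minimal sketch: DescribeText is a hypothetical helper, not part of the library, and the string literals stand in for the value returned by QP.GetPageText(7) in the loop posted above.

```pascal
program CheckExtract;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: summarises what an extraction call returned for a page.
function DescribeText(PageNo: Integer; const S: WideString): string;
begin
  if S = '' then
    Result := Format('page %d: empty result', [PageNo])
  else
    Result := Format('page %d: %d characters', [PageNo, Length(S)]);
end;

begin
  // In the real loop, pass QP.GetPageText(7) instead of these literals.
  Writeln(DescribeText(1, ''));
  Writeln(DescribeText(2, 'Hello'));
end.
```

If every page reports an empty result, the problem lies in the extraction itself. If the lengths are non-zero but the file is still empty, the problem is more likely in how the text is collected or saved.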