
GetPageText with cid fonts doesn't work?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3915
Printed Date: 06 May 24 at 9:09AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: GetPageText with cid fonts doesn't work?
Posted By: Ingo
Subject: GetPageText with cid fonts doesn't work?
Date Posted: 16 May 21 at 1:40PM
Hi :)


Today I have a problem myself... ;-)
For features like full-text search I use GetPageText as the first preparation step.
Now I've had bad experiences with a few files and GetPageText.
The text extraction hangs... my app stops responding, I lose control, and in the end I have to use the Task Manager to kill the process.
After a closer look at the affected documents I've seen that they always contain many (more than 10) embedded fonts, and these fonts are always CID fonts.

I've tried some of the repair functions offered by QuickPDF to transform the PDF and run the extraction afterwards, but nothing helps. It also doesn't matter which extract option (0 to 8) I use.

Has anybody here had the same experience, and perhaps found a solution?
Here are two of my samples to test the extraction.
Thanks in advance:
https://www.is-soft.de/vx800/prob1.pdf
https://www.is-soft.de/vx800/prob2.pdf


Cheers,
Ingo







Replies:
Posted By: tfrost
Date Posted: 16 May 21 at 8:58PM
I can't help with a solution, but I tried analysing both files with PDF Analyzer Pro 5.0 (which normally gives a good report of issues). This also hung during text analysis.


Posted By: Sopracenery
Date Posted: 27 May 21 at 12:15AM
Hi Ingo,

Can you reduce one of your files to a minimum? I mean the smallest number of pages or words at which the error still occurs. That would simplify the search for the cause.

In another project I remember a bug that occurred when a particular character sat at a particular position, byte 1024, in a rich-text file. It was a strange thing that was only found by reducing the file step by step.
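
The step-by-step reduction could be started by splitting the document into single-page files and testing each one separately. This is only a rough sketch, assuming the Delphi edition of the library (unit DebenuPDFLibrary1811) and assuming that ExtractFilePages takes an input file name, a password, an output file name and a page range string, as I remember it from the library reference. WriteSinglePageFiles is a hypothetical helper name, and the license key call is left out.

```pascal
uses
  System.SysUtils, DebenuPDFLibrary1811;

// Hypothetical helper: writes every page of InputFile into its own PDF
// so that each page can be tested with GetPageText in isolation.
procedure WriteSinglePageFiles(const InputFile, OutputDir: string);
var
  QP: TDebenuPDFLibrary1811;
  i, PageTotal: Integer;
  OutFile: string;
begin
  QP := TDebenuPDFLibrary1811.Create;
  try
    // Assumption: LoadFromFile returns 1 on success.
    if QP.LoadFromFile(InputFile, '') <> 1 then
      Exit;
    PageTotal := QP.PageCount;
    for i := 1 to PageTotal do
    begin
      // e.g. page_001.pdf, page_002.pdf, ...
      OutFile := IncludeTrailingPathDelimiter(OutputDir) +
        Format('page_%.3d.pdf', [i]);
      // Assumption: the last parameter is a range list such as '3' or '1-2'.
      QP.ExtractFilePages(InputFile, '', OutFile, IntToStr(i));
    end;
  finally
    QP.Free;
  end;
end;
```

If only some of the single-page files still show the problem, those pages are the ones worth reducing further.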

Martin


Posted By: Ingo
Date Posted: 28 May 21 at 11:00PM
Hi Martin,

I think it's a general problem: too many CID fonts break some of the rendering functionality.
Here are the same documents, one with a single remaining page and the other with two pages now. The problems are the same:
https://www.is-soft.de/vx800/prob1_klein.pdf
https://www.is-soft.de/vx800/prob2_klein.pdf


-------------
Cheers,
Ingo



Posted By: Sopracenery
Date Posted: 29 May 21 at 9:05AM
Hi Ingo,

I opened your sample prob2_klein.pdf (https://www.is-soft.de/vx800/prob2_klein.pdf) with 18.11 and there is no issue with GetPageText.
I tested options 0 and 7. Both work perfectly. Page 1 and page 2 are OK.

How can I help you?
Martin


Posted By: Ingo
Date Posted: 30 May 21 at 10:19PM
Hi Martin,

Thanks a lot for trying it yourself.
I'm working with the same release.
prob2_klein.pdf works with rendering (I have a small preview function). I can see the text content, but when I use GetPageText with option 7 I get an empty text file.
prob1_klein.pdf doesn't work at all.

Here's a bit of code. Perhaps you can see what I don't see ;-)
Thanks in advance.

  QP := TDebenuPDFLibrary1811.Create;
  try
    QP.LoadFromFile(Edit1.Text, '');
    If (QP.EncryptionStatus > 0) Then
      QP.Decrypt;
    X := QP.PageCount;
    QP.CombineContentStreams;
    UNI := '';

//  . . .

    filetxt := ChangeFileExt(ExtractFileName(Edit1.Text), '.txt');
    verztxt := tpath + '__' + filetxt;

    sl  := TStringList.Create;
    sl2 := TStringList.Create;
    for i := 1 to X Do
    begin
      QP.SelectPage(i);
      QP.SetOrigin(1);
      QP.CombineContentStreams;
      UNI := WideString('');
      UNI2 := WideString(' ' + #13#10);
      UNI2 := UNI2 + WideString(' --- page ' + IntToStr(i) + ' from ' + IntToStr(X) + ' ---');
      UNI2 := UNI2 + WideString(' ' + #13#10);

      UNI := UNI2 + QP.GetPageText(7);

      UNI := UNI + #13#10;
      sl.Add(UNI);
    end;
  finally
    QP.Free;
    UNI := WideString('');
    UNI2 := WideString('');
    sl.SaveToFile(verztxt, TEncoding.Unicode);
    Screen.Cursor := Save_Cursor; { Always restore to normal }
  end;




-------------
Cheers,
Ingo



Posted By: Sopracenery
Date Posted: 31 May 21 at 8:42AM
I tried your sequence as follows:

    QP.LoadFromFile(Edit1.Text, '');
    X := QP.PageCount;
    QP.CombineContentStreams;
    QP.SelectPage(i);
    QP.SetOrigin(1);
    QP.CombineContentStreams;
    QP.GetPageText(7);

and I see no issue.
But I do not have QP.Free in my library, so we are not working with the same binary.
I use DebenuPDFLibraryAX1811.dll on 32-bit Windows.

Please check your output in the debugger directly after
QP.GetPageText(7);
Is there really nothing coming out?
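
One way to do this check is to store the return value in a variable and log its length before anything is written to a file. This is a minimal sketch, assuming the Delphi edition of the library (unit DebenuPDFLibrary1811) and assuming that LoadFromFile returns 1 on success. LogPageTextLengths is a hypothetical helper name, and the license key call is left out.

```pascal
uses
  System.SysUtils, Winapi.Windows, DebenuPDFLibrary1811;

// Hypothetical helper: logs how many characters GetPageText(7) returns
// for each page, so an empty result can be seen before any file is written.
procedure LogPageTextLengths(const FileName: string);
var
  QP: TDebenuPDFLibrary1811;
  i, PageTotal: Integer;
  Txt: WideString;
begin
  QP := TDebenuPDFLibrary1811.Create;
  try
    if QP.LoadFromFile(FileName, '') <> 1 then
    begin
      OutputDebugString(PChar('LoadFromFile failed, LastErrorCode = ' +
        IntToStr(QP.LastErrorCode)));
      Exit;
    end;
    PageTotal := QP.PageCount;
    for i := 1 to PageTotal do
    begin
      QP.SelectPage(i);
      Txt := QP.GetPageText(7);
      // The message appears in the IDE event log or in a debug viewer.
      OutputDebugString(PChar(Format('page %d of %d: %d characters',
        [i, PageTotal, Length(Txt)])));
    end;
  finally
    QP.Free;
  end;
end;
```

If the logged lengths are greater than zero but the text file is still empty, the problem is more likely in the code that collects and saves the strings than in GetPageText itself.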


