comment on text extraction

ukobsa (Senior Member, Germany)
Posted: 12 Jul 06 at 6:46AM
Hi,

just a short comment on text extraction using GetPageText with option 4:

I wrote a one liner:

Test "This, that"

using LaTeX, and tried to extract the text from the generated PDF with option 4. This results in the following:

"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,705.1921,209.4270,714.0393,133.7680,714.0393,"TTesTest Test”ThiTest”ThisTest”This, Test”This,thTest”This,that”"

After some debugging I found that GetPageText has trouble extracting words when the text is not written as a simple string, "(text) Tj", but as an array with individual glyph positioning, "[(t) 83 (ex) 83 (t)] TJ". That was the case in my example: [(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ
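To make the structure of such an array easier to see, here is a minimal sketch (not part of the library) that lists the operands of the TJ array above. DumpTJ is a hypothetical helper written for this illustration; it assumes literal strings without escapes or nested parentheses and ignores hex strings.

```pascal
program TJOperands;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: prints each operand of a TJ array.
// Simplified on purpose: no backslash escapes, no nested parentheses,
// no hex strings.
procedure DumpTJ(const S: string);
var
  I, Start: Integer;
begin
  I := 1;
  while I <= Length(S) do
  begin
    if S[I] = '(' then
    begin
      // Literal string: collect everything up to the closing parenthesis.
      Start := I + 1;
      while (I <= Length(S)) and (S[I] <> ')') do
        Inc(I);
      WriteLn('string: ', Copy(S, Start, I - Start));
      Inc(I);
    end
    else if (S[I] = '-') or ((S[I] >= '0') and (S[I] <= '9')) then
    begin
      // Numeric position adjustment between two strings.
      Start := I;
      Inc(I);
      while (I <= Length(S)) and
        (((S[I] >= '0') and (S[I] <= '9')) or (S[I] = '.')) do
        Inc(I);
      WriteLn('adjust: ', Copy(S, Start, I - Start));
    end
    else
      Inc(I); // skip brackets, whitespace and the operator name
  end;
end;

begin
  DumpTJ('[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ');
end.
```

The program prints the eight string operands (T, es, t, "Thi, s, the comma, th, at") interleaved with the seven adjustments (83, -1, -333, 1, -1, -333, 1). The visible text is the concatenation of the string operands only.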

So I think option 3 of GetPageText is currently the most usable one.

If someone has a fix for this problem, please let me know.

HTH,
Ulrich


Edited by ukobsa

ukobsa
Posted: 12 Jul 06 at 9:22AM
Update:

I have made a small change to the source so that the result is at least as good as with option 3. It cannot be guaranteed that the text is always split into single words, but the partially duplicated entries are avoided.

The remaining problem:
How can I determine if (T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at") are one, two or more words?
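One common heuristic, offered here only as a sketch and not as something the library does, is to treat a sufficiently negative adjustment as a word gap. In a TJ array the numbers are thousandths of a text space unit and are subtracted from the current position, so -333 moves the next glyph to the right by roughly the width of a space. The threshold GapThreshold below is an assumption chosen for this example; a real value would depend on the font.

```pascal
program TJWordSplit;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

const
  // Assumed threshold (thousandths of a text space unit): an adjustment
  // at or below this value is treated as a gap between two words.
  GapThreshold = -200;
  FragCount = 8;
  // String operands of the TJ array from the post.
  Fragments: array[0..FragCount - 1] of string =
    ('T', 'es', 't', '"Thi', 's', ',', 'th', 'at"');
  // Adjustment that precedes each fragment (0 for the first one).
  Adjust: array[0..FragCount - 1] of Integer =
    (0, 83, -1, -333, 1, -1, -333, 1);

var
  I: Integer;
  Current: string;
begin
  Current := '';
  for I := 0 to FragCount - 1 do
  begin
    // A large negative adjustment closes the word collected so far.
    if (Adjust[I] <= GapThreshold) and (Current <> '') then
    begin
      WriteLn(Current);
      Current := '';
    end;
    Current := Current + Fragments[I];
  end;
  // Emit the last word.
  if Current <> '' then
    WriteLn(Current);
end.
```

With this threshold the example splits into three words: Test, "This, and that". Small adjustments such as 83 or -1 are plain kerning and do not split a word.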

Is it allowed to post the changes here on this forum?

best regards,
Ulrich

DELBEKE (Debenu Quick PDF Library Expert, France)
Posted: 13 Jul 06 at 12:42AM

Please post it.

I have been waiting for this for years.

As far as I can see, positive indexes are used at the beginning of a word.

ukobsa
Posted: 13 Jul 06 at 7:28AM
Ok, here it is:

Description: With the following change, option 4 returns correct text for input like [(te) 83 (s) -1 (t) -333 (thi) 82 (s)]TJ.
The original version returns "tetestest thithis", while the new version returns "test this".
Restriction: Text like the above still cannot be split into single words, so for such text option 4 behaves more like option 3.
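The effect of the change can be reproduced outside the library with a small simulation. This is a sketch under assumptions: Extract is a hypothetical stand-in for the part of ShowText that appends each string operand to a word buffer and then emits the buffer, and the "|" separator only marks where one emit ends. It uses the fragments of the TJ array from the first post.

```pascal
program BufferReset;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

const
  FragCount = 8;
  // String operands of [(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ
  Fragments: array[0..FragCount - 1] of string =
    ('T', 'es', 't', '"Thi', 's', ',', 'th', 'at"');

// Hypothetical stand-in for the word buffer handling: the buffer is
// emitted once per string operand. ResetAfterEmit switches the fix on.
function Extract(ResetAfterEmit: Boolean): string;
var
  I: Integer;
  Buffer: string;
begin
  Result := '';
  Buffer := '';
  for I := 0 to FragCount - 1 do
  begin
    Buffer := Buffer + Fragments[I];
    if Result <> '' then
      Result := Result + '|';
    Result := Result + Buffer; // stands in for the AddText call
    if ResetAfterEmit then
      Buffer := '';            // corresponds to the added "UKO" line
  end;
end;

begin
  WriteLn('without reset: ', Extract(False));
  WriteLn('with reset:    ', Extract(True));
end.
```

Without the reset each emit repeats everything collected so far (T|Tes|Test|Test"Thi|...), which is the same cumulative pattern as in the output quoted in the first post. With the reset each fragment is emitted exactly once (T|es|t|"Thi|s|,|th|at").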

Search for UKO in the following code (2 lines) and add those lines to your source.

Unit: uPDFRenderer
Method: SubRender
local SubMethod: ShowText

procedure ShowText;
var
    X: Integer;
    C: Char;
    M: TPDFXForm;
    OldM: TPDFXForm;
    DX: Double;
    OI: Integer;
    Text: string;
    W: Double;
    CW: Word;
    WI: Integer;
    DXS: Double;
    MapText: string;
    TestP1: TPDFXFormPoint;
    TestP2: TPDFXFormPoint;
    RealTextSize: Double;
    GC: TPDFGenericCanvas;
    CIDToGIDMap: string;
    CN: string;
    MappedText: string;
    ThisMappedText: string;
    MatrixScale: Double;
    UC: Word;
    UX: Integer;
    UFound: Boolean;
begin
    MatrixScale := 1000;
    MappedText := '';
    CIDToGIDMap := FFontCol.FoundFontData.CIDToGIDMap;
    if FDestination = rdEPS then
      GC := FEPS
    else
      GC := Picasso;
    GC.BeginPath;
    if GS.TextSize <> 0 then
    begin
      if Assigned(FFontCol.FoundFontData.Rasterizer) then
        FFontCol.FoundFontData.Rasterizer.RenderingMode := GS.TextRenderingMode;
      SetFill(pfUnknown);
      SetPen;
      OI := Operands.Count - ArrayCount;
      SelectFont;
      OldM := CanvasMat;
      try
        CombinePDFXForm(CanvasMat, GS.TM, CanvasMat);
        if FFontCol.FoundFontData.Rasterizer is TPDFType3Rasterizer then
        begin
          MatrixScale := 1;
          CombinePDFXForm(CanvasMat, TPDFType3Rasterizer(
            FFontCol.FoundFontData.Rasterizer).FontMatrix, CanvasMat);
          TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FillColor :=
            GS.FillColor;
        end;
        Mat(M, GS.TextSize / MatrixScale * GS.TextScaling / 100, 0, 0, GS.TextSize / MatrixScale, 0, 0);
        CombinePDFXForm(CanvasMat, M, CanvasMat);

        TestP1.X := 0;
        TestP1.Y := 0;
        TestP2.X := 0;
        TestP2.Y := MatrixScale;
        TestP1 := DoPDFXForm(CanvasMat, TestP1);
        TestP2 := DoPDFXForm(CanvasMat, TestP2);
        RealTextSize := Sqrt(Sqr(TestP2.X - TestP1.X) + Sqr(TestP2.Y - TestP1.Y));

        DX := 0;
        repeat
          Text := Operands[OI];
          if (Copy(Text, 1, 1) <> '(') and (Copy(Text, 1, 1) <> '<') then
          begin
            DX := DX - TazzToFloat(Text) * MatrixScale / 1000;
          end else
          begin
            if (Copy(Text, 1, 1) = '(') then
            begin
              Text := FStructure.DecodeString(Text);
            end else
            if (Copy(Text, 1, 1) = '<') then
            begin
              Text := FStructure.DecodeHex(Text);
            end;
            if FFontCol.FoundFontData.IsComposite then
            begin
              DXS := DX;
              for X := 1 to Length(Text) div 2 do
              begin
               CW := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);

               if CIDToGIDMap <> '' then
               begin
                  if (CW * 2) < Length(CIDToGIDMap) then
                  begin
                    CN := 'GID:' + IntToStr(Ord(CIDToGIDMap[CW * 2 + 1]) * 256 +
                      Ord(CIDToGIDMap[CW * 2 + 2]));
                  end;
               end else
                  CN := 'GID:' + IntToStr(CW);
               if Assigned(FFontCol.FoundFontData.CIDWidths) then
               begin
                  WI := FFontCol.FoundFontData.CIDWidths.IndexOf('CID:' + IntToStr(CW));
                  if WI >= 0 then
                  begin
                    WI := Integer(FFontCol.FoundFontData.CIDWidths.Objects[WI]);
                    {***Th W := WI * GS.TextScaling / 100; ***}
                    W := WI{***Th W ***} + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
                  end else
                    W := MatrixScale;
               end else
                  W := MatrixScale;
               if FDestination <> rdTextFunnel then
               begin
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, CN);
               end;
               if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
               begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(CW)];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    FFunnel.SetNextMatch(False);
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
               end;
               DX := DX + W;
              end;
              FFunnel.SetNextMatch(True);
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
               FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
               MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
               MapText := '';
               //MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2])];
               for X := 1 to Length(Text) div 2 do
               begin
                  UC := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
                  UX := 0;
                  UFound := False;
                  while (not UFound) and (UX < Length(FFontCol.FoundFontData.DisplayCS2)) do
                  begin
                    if (UC >= FFontCol.FoundFontData.DisplayCS2[UX].StartCode) and
                      (UC <= FFontCol.FoundFontData.DisplayCS2[UX].EndCode) then
                    begin
                      UC := FFontCol.FoundFontData.DisplayCS2[UX].ResultCode +
                        UC - FFontCol.FoundFontData.DisplayCS2[UX].StartCode;
                      UFound := True;
                    end else
                      Inc(UX);
                  end;
                  if UFound then MapText := MapText + WideChar(UC);
               end;
               FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end else
            begin
              DXS := DX;
              for X := 1 to Length(Text) do
              begin
               C := Text[X];
               W := (Widths[Ord(C)] * Abs(GS.TextSize) {***Th GS.TextScaling / 100 ***} / Abs(GS.TextSize));
               if Assigned(FFontCol.FoundFontData.Rasterizer) then
                  W := W / FFontCol.FoundFontData.Rasterizer.FontMatrixScaling;
               W := W + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
               if C = #32 then
                  W := W + GS.WordSpacing * MatrixScale / Abs(GS.TextSize);
               if FDestination <> rdTextFunnel then
               begin
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, Encoding[Ord(C)]);
               end;
               if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
               begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    FFunnel.SetNextMatch(False);
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
               end;
               DX := DX + W;
              end;
              FFunnel.SetNextMatch(True);
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
               FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
               MappedText := '';   // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
               MapText := '';
               for X := 1 to Length(Text) do
                  MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
               FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end;
          end;
          Inc(OI);
        until OI = Operands.Count;
        Mat(M, 1, 0, 0, 1, DX * Abs(GS.TextSize) / MatrixScale, 0);
        CombinePDFXForm(GS.TM, M, GS.TM);
      finally
        CanvasMat := OldM;
      end;
    end;
    ArrayCount := 0;
    if FDestination = rdEPS then
    begin
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
      begin
        FEPS.PSSetColor(GS.FillColorEPS);
        FEPS.PSFill(epsFillModeNonZeroWinding);
      end;
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
      begin
        FEPS.PSSetColor(GS.StrokeColorEPS);
        FEPS.PSStroke;
      end;
    end else
    begin
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
          SetPen;
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
          SetFill(pfWinding);
      if GS.TextRenderingMode >= 4 then
        Picasso.SetClippingPath(pfWinding);
      case GS.TextRenderingMode of
        0: Picasso.FillPath;
        1: Picasso.StrokePath;
        2: Picasso.StrokeAndFillPath;
        4: Picasso.FillPath;
        5: Picasso.StrokePath;
        6: Picasso.StrokeAndFillPath;
        7: Picasso.NullPath;
      end;
    end;
end;


Please test it. If it works OK, can it be included in version 5.15?

best regards,
Ulrich

DELBEKE
Posted: 13 Jul 06 at 8:17AM

Thank you very much.
I'll try it on Monday.

Best regards

DELBEKE
Posted: 18 Jul 06 at 5:12AM

Good job.

Working fine for me.

The function can surely be improved, but it works better than the old version, which was too buggy.

Thanks very much, Ukobsa :)
