
comment on text extraction

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
Printed Date: 18 May 24 at 8:43PM

Topic: comment on text extraction
Posted By: ukobsa
Subject: comment on text extraction
Date Posted: 12 Jul 06 at 6:46AM

Just a short comment on text extraction using GetPageText with option 4.

I wrote a one-liner in LaTeX:

Test "This, that"

and tried to extract the text from the generated PDF with option 4. This results in the following:

"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,705.1921,209.4270,714.0393,133.7680,714.0393,"TTesTest Test”ThiTest”ThisTest”This, Test”This,thTest”This,that”"

After some debugging I found that GetPageText has problems extracting words when they are not written as a simple "(text) Tj" but as an array such as "[(t) 83 (ex) 83 (t)] TJ" (with individual glyph positioning), which was the case in my example: [(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ
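
To illustrate the difference, here is a minimal sketch (not the library's code) that joins the string operands of such a TJ array and ignores the numbers between them. JoinTJStrings is a hypothetical helper, and it assumes literal strings without escapes or nested parentheses.

```pascal
program TJConcat;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the string operands of a TJ array joined
// together, skipping the numeric adjustments between them.
// Assumption: literal strings only, no escapes, no nested parentheses.
function JoinTJStrings(const Arr: string): string;
var
  I: Integer;
  InString: Boolean;
begin
  Result := '';
  InString := False;
  for I := 1 to Length(Arr) do
  begin
    if (not InString) and (Arr[I] = '(') then
      InString := True
    else if InString and (Arr[I] = ')') then
      InString := False
    else if InString then
      Result := Result + Arr[I];
  end;
end;

begin
  // prints: Test"This,that"
  WriteLn(JoinTJStrings('[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]'));
end.
```

The joined text contains no space characters at all: in this PDF the gaps between words exist only as the -333 adjustments, so an extractor that looks for a space glyph to end a word never finds one.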

So I think that option 3 of GetPageText is currently the most usable one.

If someone has a fix for this problem, please let me know.


Posted By: ukobsa
Date Posted: 12 Jul 06 at 9:22AM

I have made a small change to the source so that the result becomes at least as good as with option 3. It cannot be guaranteed that the text is always split into single words, but it avoids the partially doubled entries.

The remaining problem:
How can I determine whether (T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at") is one, two or more words?
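
One common heuristic, sketched here under stated assumptions, is to look at the size of the adjustments. The numbers in a TJ array are given in thousandths of a text space unit and are subtracted from the current position, so a strongly negative value such as -333 opens a gap, while small values such as 83 or -1 are ordinary kerning. The sketch below treats any adjustment at or below an assumed threshold of -200 as a word break. SplitTJIntoWords and WordGapThreshold are hypothetical names, the threshold would need tuning per font, and only integer adjustments and unescaped literal strings are handled.

```pascal
program TJWords;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // Assumed cut-off in thousandths of a text space unit (not from the library).
  WordGapThreshold = -200;

// Hypothetical helper: splits the strings of a TJ array into words, starting
// a new word whenever an adjustment is at or below WordGapThreshold.
procedure SplitTJIntoWords(const Arr: string; Words: TStrings);
var
  I: Integer;
  Current, NumText: string;
  InString: Boolean;

  // Evaluates the number collected so far and ends the word on a large gap.
  procedure FlushNumber;
  begin
    if NumText <> '' then
    begin
      if (StrToIntDef(NumText, 0) <= WordGapThreshold) and (Current <> '') then
      begin
        Words.Add(Current);
        Current := '';
      end;
      NumText := '';
    end;
  end;

begin
  Current := '';
  NumText := '';
  InString := False;
  for I := 1 to Length(Arr) do
  begin
    if InString then
    begin
      if Arr[I] = ')' then
        InString := False
      else if Arr[I] = ' ' then
      begin
        // A real space glyph inside a string also ends the word.
        if Current <> '' then
          Words.Add(Current);
        Current := '';
      end
      else
        Current := Current + Arr[I];
    end
    else if Arr[I] = '(' then
    begin
      FlushNumber;
      InString := True;
    end
    else if ((Arr[I] >= '0') and (Arr[I] <= '9')) or (Arr[I] = '-') then
      NumText := NumText + Arr[I]
    else
      FlushNumber;
  end;
  FlushNumber;
  if Current <> '' then
    Words.Add(Current);
end;

var
  Found: TStringList;
  K: Integer;
begin
  Found := TStringList.Create;
  try
    SplitTJIntoWords('[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]', Found);
    for K := 0 to Found.Count - 1 do
      WriteLn(Found[K]);
  finally
    Found.Free;
  end;
end.
```

For the sample array this prints three lines: Test, then "This, and then that". A fixed threshold is only a rough guide, since a tightly set font or a large kerning pair can cross it; comparing the gap with the font's own space width would be more robust.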

Is it allowed to post the changes here on this forum?

best regards,

Posted By: DELBEKE
Date Posted: 13 Jul 06 at 12:42AM

Please post it.

I have been waiting for this for years.

As far as I can see, positive numbers are used at the beginning of a word.


Posted By: ukobsa
Date Posted: 13 Jul 06 at 7:28AM

OK, here it is:

Description: With the following change, option 4 produces correct text for input such as [(te) 83 (s) -1 (t) -333 (thi) 82 (s)]TJ.
The original version returns "tetestest thithis", while the new version returns "test this".
Restriction: Text like the above still cannot be divided into single words, so for such text option 4 behaves more like option 3.
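
The doubled entries can be reproduced with a simplified model, which is an illustration and not the library code. Pending plays the role of MappedText, and the returned string stands for everything handed to FFunnel.AddText. Without a reset after each string of the array, every call passes on the text of all earlier strings again.

```pascal
program AccumulatorReset;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  // The first three strings of the TJ array from the original example.
  Fragments: array[0..2] of string = ('T', 'es', 't');

// Simplified model: collects what would be passed on after each fragment.
// ResetAfterEach = True mimics clearing the accumulator after every string.
function Extract(const Parts: array of string; ResetAfterEach: Boolean): string;
var
  I: Integer;
  Pending: string;
begin
  Result := '';
  Pending := '';
  for I := Low(Parts) to High(Parts) do
  begin
    Pending := Pending + Parts[I];
    Result := Result + Pending;   // the text emitted for this fragment
    if ResetAfterEach then
      Pending := '';              // the effect of the added reset
  end;
end;

begin
  WriteLn(Extract(Fragments, False));  // prints: TTesTest
  WriteLn(Extract(Fragments, True));   // prints: Test
end.
```

The first line matches the start of the faulty output from the first post, and the second line is the expected word.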

Search for UKO in the following code (2 lines) and add those lines to your source.

Unit: uPDFRenderer
Method: SubRender
local SubMethod: ShowText

  procedure ShowText;
  var
    X: Integer;
    C: Char;
    M: TPDFXForm;
    OldM: TPDFXForm;
    DX: Double;
    OI: Integer;
    Text: string;
    W: Double;
    CW: Word;
    WI: Integer;
    DXS: Double;
    MapText: string;
    TestP1: TPDFXFormPoint;
    TestP2: TPDFXFormPoint;
    RealTextSize: Double;
    GC: TPDFGenericCanvas;
    CIDToGIDMap: string;
    CN: string;
    MappedText: string;
    ThisMappedText: string;
    MatrixScale: Double;
    UC: Word;
    UX: Integer;
    UFound: Boolean;
  begin
    MatrixScale := 1000;
    MappedText := '';
    CIDToGIDMap := FFontCol.FoundFontData.CIDToGIDMap;
    if FDestination = rdEPS then
      GC := FEPS
    else
      GC := Picasso;
    if GS.TextSize <> 0 then
    begin
      if Assigned(FFontCol.FoundFontData.Rasterizer) then
        FFontCol.FoundFontData.Rasterizer.RenderingMode := GS.TextRenderingMode;
      OI := Operands.Count - ArrayCount;
      OldM := CanvasMat;
      try
        CombinePDFXForm(CanvasMat, GS.TM, CanvasMat);
        if FFontCol.FoundFontData.Rasterizer is TPDFType3Rasterizer then
        begin
          MatrixScale := 1;
          CombinePDFXForm(CanvasMat, TPDFType3Rasterizer(
            FFontCol.FoundFontData.Rasterizer).FontMatrix, CanvasMat);
          TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FillColor :=
            { right-hand side missing in the post }
        end;
        Mat(M, GS.TextSize / MatrixScale * GS.TextScaling / 100, 0, 0, GS.TextSize / MatrixScale, 0, 0);
        CombinePDFXForm(CanvasMat, M, CanvasMat);

        TestP1.X := 0;
        TestP1.Y := 0;
        TestP2.X := 0;
        TestP2.Y := MatrixScale;
        TestP1 := DoPDFXForm(CanvasMat, TestP1);
        TestP2 := DoPDFXForm(CanvasMat, TestP2);
        RealTextSize := Sqrt(Sqr(TestP2.X - TestP1.X) + Sqr(TestP2.Y - TestP1.Y));

        DX := 0;
        repeat
          Text := Operands[OI];
          if (Copy(Text, 1, 1) <> '(') and (Copy(Text, 1, 1) <> '<') then
          begin
            { numeric operand of a TJ array: a glyph position adjustment }
            DX := DX - TazzToFloat(Text) * MatrixScale / 1000;
          end else
          begin
            if (Copy(Text, 1, 1) = '(') then
            begin
              Text := FStructure.DecodeString(Text);
            end else
            if (Copy(Text, 1, 1) = '<') then
            begin
              Text := FStructure.DecodeHex(Text);
            end;
            if FFontCol.FoundFontData.IsComposite then
            begin
              DXS := DX;
              for X := 1 to Length(Text) div 2 do
              begin
                CW := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);

                if CIDToGIDMap <> '' then
                begin
                  if (CW * 2) < Length(CIDToGIDMap) then
                    CN := 'GID:' + IntToStr(Ord(CIDToGIDMap[CW * 2 + 1]) * 256 +
                      Ord(CIDToGIDMap[CW * 2 + 2]));
                end else
                  CN := 'GID:' + IntToStr(CW);
                if Assigned(FFontCol.FoundFontData.CIDWidths) then
                begin
                  WI := FFontCol.FoundFontData.CIDWidths.IndexOf('CID:' + IntToStr(CW));
                  if WI >= 0 then
                  begin
                    WI := Integer(FFontCol.FoundFontData.CIDWidths.Objects[WI]);
                    {***Th W := WI * GS.TextScaling / 100; ***}
                    W := WI{***Th W ***} + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
                  end else
                    W := MatrixScale;
                end else
                  W := MatrixScale;
                if FDestination <> rdTextFunnel then
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, CN);
                if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(CW)];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
                end;
                DX := DX + W;
              end;
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
                FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
                MapText := '';
                //MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2])];
                for X := 1 to Length(Text) div 2 do
                begin
                  UC := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
                  UX := 0;
                  UFound := False;
                  while (not UFound) and (UX < Length(FFontCol.FoundFontData.DisplayCS2)) do
                  begin
                    if (UC >= FFontCol.FoundFontData.DisplayCS2[X].StartCode) and
                      (UC <= FFontCol.FoundFontData.DisplayCS2[X].EndCode) then
                    begin
                      UC := FFontCol.FoundFontData.DisplayCS2[X].ResultCode +
                        UC - FFontCol.FoundFontData.DisplayCS2[X].StartCode;
                      UFound := True;
                    end else
                      Inc(UX); { presumed; this line is missing in the post }
                  end;
                  if UFound then MapText := MapText + WideChar(UC);
                end;
                FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end else
            begin
              DXS := DX;
              for X := 1 to Length(Text) do
              begin
                C := Text[X];
                W := (Widths[Ord(C)] * Abs(GS.TextSize) {***Th GS.TextScaling / 100 ***} / Abs(GS.TextSize));
                if Assigned(FFontCol.FoundFontData.Rasterizer) then
                  W := W / FFontCol.FoundFontData.Rasterizer.FontMatrixScaling;
                W := W + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
                if C = #32 then
                  W := W + GS.WordSpacing * MatrixScale / Abs(GS.TextSize);
                if FDestination <> rdTextFunnel then
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, Encoding[Ord(C)]);
                if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
                end;
                DX := DX + W;
              end;
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
                FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                MappedText := '';   // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
                MapText := '';
                for X := 1 to Length(Text) do
                  MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end;
          end;
          Inc(OI); { presumed; this line is missing in the post }
        until OI = Operands.Count;
        Mat(M, 1, 0, 0, 1, DX * Abs(GS.TextSize) / MatrixScale, 0);
        CombinePDFXForm(GS.TM, M, GS.TM);
      finally
        CanvasMat := OldM;
      end;
    end;
    ArrayCount := 0;
    if FDestination = rdEPS then
    begin
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
        { statement missing in the post }
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
        { statement missing in the post }
    end else
    begin
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
        { statement missing in the post }
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
        { statement missing in the post }
      if GS.TextRenderingMode >= 4 then
        { statement missing in the post }
      case GS.TextRenderingMode of
        0: Picasso.FillPath;
        1: Picasso.StrokePath;
        2: Picasso.StrokeAndFillPath;
        4: Picasso.FillPath;
        5: Picasso.StrokePath;
        6: Picasso.StrokeAndFillPath;
        7: Picasso.NullPath;
      end;
    end;
  end;

So please test it. If it works OK, can it be included in version 5.15?

best regards,

Posted By: DELBEKE
Date Posted: 13 Jul 06 at 8:17AM

Thank you very much.
I'll try it on Monday.


Best regards

Posted By: DELBEKE
Date Posted: 18 Jul 06 at 5:12AM

Good job.

It is working fine for me.

The function can surely be improved, but it works better than the old version, which was too buggy.

Thanks very much, Ukobsa :)
