
comment on text extraction

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=462
Printed Date: 18 May 24 at 8:43PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: comment on text extraction
Posted By: ukobsa
Subject: comment on text extraction
Date Posted: 12 Jul 06 at 6:46AM
Hi,

just a short comment on text extraction using GetPageText with option 4:

I wrote a one-liner in LaTeX:

Test "This, that"

and tried to extract the text from the generated PDF with option 4. This is the result:

"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,705.1921,209.4270,714.0393,133.7680,714.0393,"TTesTest Test”ThiTest”ThisTest”This, Test”This,thTest”This,that”"
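
For anyone post-processing this output, the line appears to be a comma-separated record. Below is a minimal sketch of splitting it, assuming the fields are font name, color, font size, four corner coordinate pairs and finally the quoted text. That field layout is only read off the sample above, and TTextEntry, NextField and ParseLine are made-up names, not part of the library.

```pascal
program ParseOption4Line;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

type
  // Hypothetical record; the field meanings are an assumption based on the sample line.
  TTextEntry = record
    FontName: string;
    Color: string;
    FontSize: string;               // numbers are kept as text to stay locale-independent
    Coords: array[0..7] of string;  // assumed: four X,Y corner pairs
    Content: string;
  end;

// Returns the field that starts at position P and moves P past the next comma.
function NextField(const S: string; var P: Integer): string;
var
  Start: Integer;
begin
  Start := P;
  while (P <= Length(S)) and (S[P] <> ',') do
    Inc(P);
  Result := Copy(S, Start, P - Start);
  Inc(P); // skip the comma
end;

function ParseLine(const Line: string): TTextEntry;
var
  P, I: Integer;
  Quoted: string;
begin
  P := 1;
  Quoted := NextField(Line, P);     // assumes the font name contains no comma
  Result.FontName := Copy(Quoted, 2, Length(Quoted) - 2);
  Result.Color := NextField(Line, P);
  Result.FontSize := NextField(Line, P);
  for I := 0 to 7 do
    Result.Coords[I] := NextField(Line, P);
  // Everything that is left is the quoted text, which may itself contain commas.
  Result.Content := Copy(Line, P + 1, Length(Line) - P - 1);
end;

var
  E: TTextEntry;
begin
  E := ParseLine('"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,' +
    '705.1921,209.4270,714.0393,133.7680,714.0393,"This, that"');
  WriteLn(E.FontName, ' | ', E.FontSize, ' | ', E.Content);
end.
```

Taking the text as "everything after the eleventh comma" is a deliberate choice here, because the extracted text can contain commas and quotes of its own.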

After some debugging I found that GetPageText has problems extracting words when the text is not written as a simple string, "(text) Tj", but as an array with individual glyph positioning, "[(t) 83 (ex) 83 (t)] TJ". That was the case in my example: [(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ

So I think option 3 of GetPageText is the most usable one for now.

If someone has a fix for this problem, please let me know.

HTH,
Ulrich



Replies:
Posted By: ukobsa
Date Posted: 12 Jul 06 at 9:22AM
Update:

I have made a small change to the source so that the result is at least as good as with option 3. It cannot be guaranteed that the text is always split into single words, but the partially doubled entries are avoided.

The remaining problem:
How can I determine if (T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at") are one, two or more words?
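
One common heuristic for this question, sketched below, is to treat a strongly negative number in the array as a word gap, since negative values move the next glyph to the right. This is only a sketch: the threshold of 200 (in thousandths of a text-space unit) is an assumption, a suitable value depends on the font, and TJToText is a hypothetical helper, not a library function. The parser is simplified and does not handle nested parentheses, octal escapes or hex strings.

```pascal
program TJWordSplit;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

const
  // Assumed threshold; adjustments at or below -GapThreshold count as a space.
  GapThreshold = 200;

// Rebuilds readable text from the contents of a TJ array.
function TJToText(const Arr: string): string;
var
  I, Start, Code: Integer;
  Adj: Double;
begin
  Result := '';
  I := 1;
  while I <= Length(Arr) do
  begin
    if Arr[I] = '(' then
    begin
      // Literal string: copy characters up to the closing parenthesis.
      Inc(I);
      while (I <= Length(Arr)) and (Arr[I] <> ')') do
      begin
        if (Arr[I] = '\') and (I < Length(Arr)) then
          Inc(I); // keep the escaped character as it is
        Result := Result + Arr[I];
        Inc(I);
      end;
      Inc(I); // skip ')'
    end
    else if Arr[I] in ['0'..'9', '-', '+', '.'] then
    begin
      // Positioning number: read it and test it against the threshold.
      Start := I;
      while (I <= Length(Arr)) and (Arr[I] in ['0'..'9', '-', '+', '.']) do
        Inc(I);
      Val(Copy(Arr, Start, I - Start), Adj, Code);
      if (Code = 0) and (Adj <= -GapThreshold) then
        Result := Result + ' ';
    end
    else
      Inc(I); // brackets and whitespace
  end;
end;

begin
  WriteLn(TJToText('[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]'));
end.
```

For the array from the first post this prints Test "This, that", because only the two -333 entries fall below the threshold. A more careful version would compare the gap with the width of the font's space glyph instead of using a fixed number.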

Is it allowed to post the changes here on this forum?

best regards,
Ulrich


Posted By: DELBEKE
Date Posted: 13 Jul 06 at 12:42AM

Please post it.

I have been waiting for this for years.

As far as I can see, positive numbers are used at the beginning of a word.
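
As background on those numbers: in the PDF text model a number inside a TJ array is subtracted from the current horizontal position, in thousandths of a text-space unit. A positive number therefore pulls the next glyph to the left (typical kerning), and a negative number pushes it to the right. The small sketch below shows the arithmetic, assuming 100% horizontal scaling and the 9.96 font size from the sample output. TJShift is a made-up name.

```pascal
program TJOffsetDemo;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

// Horizontal shift caused by one TJ number, scaled by the font size.
// Positive result = to the right, negative result = to the left.
function TJShift(Adjustment, FontSize: Double): Double;
begin
  Result := -Adjustment / 1000 * FontSize;
end;

begin
  WriteLn('  83 -> ', TJShift(83, 9.96):0:3);   // small shift to the left
  WriteLn('-333 -> ', TJShift(-333, 9.96):0:3); // larger shift to the right
end.
```

With these inputs the shifts are about -0.827 and 3.317, so the 83 after "T" is a small kerning correction, while the -333 entries open a gap of roughly a third of the font size.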

 



Posted By: ukobsa
Date Posted: 13 Jul 06 at 7:28AM
Ok, here it is:

Description: With the following change, option 4 returns correct text for input like [(te) 83 (s) -1 (t) -333 (thi) 82(s)]TJ.
The original version returns "tetestest thithis", while the new version returns "test this".
Restriction: text like the above cannot be divided into single words, so in this case option 4 works more like option 3.
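
The doubling can be modelled in isolation. The sketch below is a simplified stand-in for the library code, using the operands from the first post: EmitAll and Fragments are hypothetical names, and plain string concatenation stands in for FFunnel.AddText. It only shows what happens when the word buffer is emitted after every string operand without being cleared.

```pascal
program AccumulatorDemo;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

const
  // String operands of the TJ array from the first post, numbers left out.
  Fragments: array[0..7] of string =
    ('T', 'es', 't', '"Thi', 's', ',', 'th', 'at"');

// Emits the buffer once per operand; ResetAfterEmit models the added reset.
function EmitAll(ResetAfterEmit: Boolean): string;
var
  I: Integer;
  Mapped: string;
begin
  Result := '';
  Mapped := '';
  for I := Low(Fragments) to High(Fragments) do
  begin
    Mapped := Mapped + Fragments[I];
    Result := Result + Mapped;  // stands in for the AddText call
    if ResetAfterEmit then
      Mapped := '';             // the reset that avoids the repeated prefixes
  end;
end;

begin
  WriteLn('without reset: ', EmitAll(False));
  WriteLn('with reset:    ', EmitAll(True));
end.
```

Without the reset, every emitted entry repeats all earlier fragments, which matches the pattern of the doubled output in the first post apart from spacing. With the reset, each fragment is emitted exactly once.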

Search for UKO in the following code (2 lines) and add those lines to your source.

Unit: uPDFRenderer
Method: SubRender
local SubMethod: ShowText

procedure ShowText;
var
    X: Integer;
    C: Char;
    M: TPDFXForm;
    OldM: TPDFXForm;
    DX: Double;
    OI: Integer;
    Text: string;
    W: Double;
    CW: Word;
    WI: Integer;
    DXS: Double;
    MapText: string;
    TestP1: TPDFXFormPoint;
    TestP2: TPDFXFormPoint;
    RealTextSize: Double;
    GC: TPDFGenericCanvas;
    CIDToGIDMap: string;
    CN: string;
    MappedText: string;
    ThisMappedText: string;
    MatrixScale: Double;
    UC: Word;
    UX: Integer;
    UFound: Boolean;
begin
    MatrixScale := 1000;
    MappedText := '';
    CIDToGIDMap := FFontCol.FoundFontData.CIDToGIDMap;
    if FDestination = rdEPS then
      GC := FEPS
    else
      GC := Picasso;
    GC.BeginPath;
    if GS.TextSize <> 0 then
    begin
      if Assigned(FFontCol.FoundFontData.Rasterizer) then
        FFontCol.FoundFontData.Rasterizer.RenderingMode := GS.TextRenderingMode;
      SetFill(pfUnknown);
      SetPen;
      OI := Operands.Count - ArrayCount;
      SelectFont;
      OldM := CanvasMat;
      try
        CombinePDFXForm(CanvasMat, GS.TM, CanvasMat);
        if FFontCol.FoundFontData.Rasterizer is TPDFType3Rasterizer then
        begin
          MatrixScale := 1;
          CombinePDFXForm(CanvasMat, TPDFType3Rasterizer(
            FFontCol.FoundFontData.Rasterizer).FontMatrix, CanvasMat);
          TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FillColor :=
            GS.FillColor;
        end;
        Mat(M, GS.TextSize / MatrixScale * GS.TextScaling / 100, 0, 0, GS.TextSize / MatrixScale, 0, 0);
        CombinePDFXForm(CanvasMat, M, CanvasMat);

        TestP1.X := 0;
        TestP1.Y := 0;
        TestP2.X := 0;
        TestP2.Y := MatrixScale;
        TestP1 := DoPDFXForm(CanvasMat, TestP1);
        TestP2 := DoPDFXForm(CanvasMat, TestP2);
        RealTextSize := Sqrt(Sqr(TestP2.X - TestP1.X) + Sqr(TestP2.Y - TestP1.Y));

        DX := 0;
        repeat
          Text := Operands[OI];
          if (Copy(Text, 1, 1) <> '(') and (Copy(Text, 1, 1) <> '<') then
          begin
            DX := DX - TazzToFloat(Text) * MatrixScale / 1000;
          end else
          begin
            if (Copy(Text, 1, 1) = '(') then
            begin
              Text := FStructure.DecodeString(Text);
            end else
            if (Copy(Text, 1, 1) = '<') then
            begin
              Text := FStructure.DecodeHex(Text);
            end;
            if FFontCol.FoundFontData.IsComposite then
            begin
              DXS := DX;
              for X := 1 to Length(Text) div 2 do
              begin
               CW := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);

               if CIDToGIDMap <> '' then
               begin
                  if (CW * 2) < Length(CIDToGIDMap) then
                  begin
                    CN := 'GID:' + IntToStr(Ord(CIDToGIDMap[CW * 2 + 1]) * 256 +
                      Ord(CIDToGIDMap[CW * 2 + 2]));
                  end;
               end else
                  CN := 'GID:' + IntToStr(CW);
               if Assigned(FFontCol.FoundFontData.CIDWidths) then
               begin
                  WI := FFontCol.FoundFontData.CIDWidths.IndexOf('CID:' + IntToStr(CW));
                  if WI >= 0 then
                  begin
                    WI := Integer(FFontCol.FoundFontData.CIDWidths.Objects[WI]);
                    {***Th W := WI * GS.TextScaling / 100; ***}
                    W := WI{***Th W ***} + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
                  end else
                    W := MatrixScale;
               end else
                  W := MatrixScale;
               if FDestination <> rdTextFunnel then
               begin
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, CN);
               end;
               if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
               begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(CW)];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    FFunnel.SetNextMatch(False);
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
               end;
               DX := DX + W;
              end;
              FFunnel.SetNextMatch(True);
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
               FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
               MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
               MapText := '';
               //MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2])];
               for X := 1 to Length(Text) div 2 do
               begin
                  UC := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
                  UX := 0;
                  UFound := False;
                  // Scan the mapping ranges; UX is the index of the current range.
                  while (not UFound) and (UX < Length(FFontCol.FoundFontData.DisplayCS2)) do
                  begin
                    if (UC >= FFontCol.FoundFontData.DisplayCS2[UX].StartCode) and
                      (UC <= FFontCol.FoundFontData.DisplayCS2[UX].EndCode) then
                    begin
                      UC := FFontCol.FoundFontData.DisplayCS2[UX].ResultCode +
                        UC - FFontCol.FoundFontData.DisplayCS2[UX].StartCode;
                      UFound := True;
                    end else
                      Inc(UX);
                  end;
                  if UFound then MapText := MapText + WideChar(UC);
               end;
               FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end else
            begin
              DXS := DX;
              for X := 1 to Length(Text) do
              begin
               C := Text[X];
               W := (Widths[Ord(C)] * Abs(GS.TextSize) {***Th GS.TextScaling / 100 ***} / Abs(GS.TextSize));
               if Assigned(FFontCol.FoundFontData.Rasterizer) then
                  W := W / FFontCol.FoundFontData.Rasterizer.FontMatrixScaling;
               W := W + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
               if C = #32 then
                  W := W + GS.WordSpacing * MatrixScale / Abs(GS.TextSize);
               if FDestination <> rdTextFunnel then
               begin
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    with FFontCol.FoundFontData do
                      Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
                        Abs(GS.TextSize), GC, CanvasMat, Encoding[Ord(C)]);
               end;
               if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
               begin
                  ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                  if (ThisMappedText = ' ') and (MappedText <> '') then
                  begin
                    FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                      FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent, DX - DXS,
                        FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
                    MappedText := '';
                    FFunnel.SetNextMatch(False);
                    DXS := DX + W;
                  end else
                    MappedText := MappedText + ThisMappedText;
               end;
               DX := DX + W;
              end;
              FFunnel.SetNextMatch(True);
              if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
              begin
               FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
               MappedText := '';   // UKO Reset (necessary for positioned text glyphs)
              end;
              if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
              begin
               MapText := '';
               for X := 1 to Length(Text) do
                  MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
               FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
                  FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent, DX - DXS,
                    FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
              end;
            end;
          end;
          Inc(OI);
        until OI = Operands.Count;
        Mat(M, 1, 0, 0, 1, DX * Abs(GS.TextSize) / MatrixScale, 0);
        CombinePDFXForm(GS.TM, M, GS.TM);
      finally
        CanvasMat := OldM;
      end;
    end;
    ArrayCount := 0;
    if FDestination = rdEPS then
    begin
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
      begin
        FEPS.PSSetColor(GS.FillColorEPS);
        FEPS.PSFill(epsFillModeNonZeroWinding);
      end;
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
      begin
        FEPS.PSSetColor(GS.StrokeColorEPS);
        FEPS.PSStroke;
      end;
    end else
    begin
      if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
          SetPen;
      if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
        (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
          SetFill(pfWinding);
      if GS.TextRenderingMode >= 4 then
        Picasso.SetClippingPath(pfWinding);
      case GS.TextRenderingMode of
        0: Picasso.FillPath;
        1: Picasso.StrokePath;
        2: Picasso.StrokeAndFillPath;
        4: Picasso.FillPath;
        5: Picasso.StrokePath;
        6: Picasso.StrokeAndFillPath;
        7: Picasso.NullPath;
      end;
    end;
end;



So please test it. If it works OK, can it be included in version 5.15?

best regards,
Ulrich


Posted By: DELBEKE
Date Posted: 13 Jul 06 at 8:17AM

Thank you very much.
I'll try it on Monday.

 

Best regards



Posted By: DELBEKE
Date Posted: 18 Jul 06 at 5:12AM

Good job.

It is working fine for me.

The function can surely be improved, but it works better than the old version, which was too buggy.

Thanks very much, Ukobsa :)



