comment on text extraction
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=462
Topic: comment on text extraction
Posted By: ukobsa
Subject: comment on text extraction
Date Posted: 12 Jul 06 at 6:46AM
Hi,
Just a short comment on text extraction using GetPageText with option 4.
I wrote a one-liner in LaTeX:
Test "This, that"
and tried to extract the text from the generated PDF with option 4. This is the result:
"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,705.1921,209.4270,714.0393,133.7680,714.0393,"TTesTest Test”ThiTest”ThisTest”This, Test”This,thTest”This,that”"
After some debugging I found that GetPageText has problems extracting words when the text is not written as a simple string, "(text) Tj", but as an array with individual glyph positioning, "[(t) 83 (ex) 83 (t)] TJ". That was the case in my example: [(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ
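To make the structure of such an array easier to see, here is a minimal sketch that splits the operand list from the example into its string and numeric parts. It is a standalone console program, not code from the library. The constant name `Operands` is an assumption for illustration, and the parser deliberately ignores escaped or nested parentheses.

```pascal
program TJOperands;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // Operand list from the example, without the surrounding [ ] and the TJ operator.
  Operands = '(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")';

var
  I: Integer;
  Token, Joined: string;
begin
  Joined := '';
  I := 1;
  while I <= Length(Operands) do
  begin
    Token := '';
    if Operands[I] = '(' then
    begin
      // String operand: collect everything up to the closing parenthesis.
      // Simplification: escaped or nested parentheses are not handled.
      Inc(I);
      while (I <= Length(Operands)) and (Operands[I] <> ')') do
      begin
        Token := Token + Operands[I];
        Inc(I);
      end;
      Inc(I); // skip the closing ')'
      Joined := Joined + Token;
      WriteLn('string     ', Token);
    end
    else
    begin
      // Numeric operand: a position adjustment in thousandths of a text space unit.
      while (I <= Length(Operands)) and (Operands[I] <> '(') do
      begin
        Token := Token + Operands[I];
        Inc(I);
      end;
      WriteLn('adjustment ', Trim(Token));
    end;
  end;
  // The strings contain no space characters, so this prints: Test"This,that"
  WriteLn('joined     ', Joined);
end.
```

The joined text contains no blanks at all: the visible gaps after "Test" and after the comma exist only as the two -333 adjustments.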
So I think option 3 of GetPageText is currently the most usable one.
If someone has a fix for this problem, please let me know.
HTH,
Ulrich
|
Replies:
Posted By: ukobsa
Date Posted: 12 Jul 06 at 9:22AM
Update:
I have made a small change to the source so that the result is at least as good as with option 3. It cannot be guaranteed that the text is always split into single words, but the partially doubled entries are avoided.
The remaining problem:
How can I determine if (T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at") are one, two or more words?
Is it allowed to post the changes here on this forum?
best regards,
Ulrich
|
Posted By: DELBEKE
Date Posted: 13 Jul 06 at 12:42AM
Please post it.
I have been waiting for this for years.
As far as I can see, positive numbers are used at the beginning of a word.
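One possible heuristic for the word-boundary question, sketched here under stated assumptions: in a TJ array each number is subtracted from the current horizontal position (the ShowText code in this thread does the same with `DX := DX - ...`), so a large negative number leaves a gap before the next glyph. Treating every adjustment at or below some threshold as a word break recovers the words in the example. The threshold value and all names below are hypothetical. A real threshold would depend on the font, for instance a fraction of its space width.

```pascal
program SplitByGap;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

const
  // Hypothetical threshold: adjustments at or below this value count as a word gap.
  WordGapThreshold = -200;

  // TJ operands from the example, already separated into strings and numbers.
  // Adjust[I] is the number that follows Strs[I] (0 after the last string).
  Strs: array[0..7] of string = ('T', 'es', 't', '"Thi', 's', ',', 'th', 'at"');
  Adjust: array[0..7] of Integer = (83, -1, -333, 1, -1, -333, 1, 0);

var
  I: Integer;
  Line: string;
begin
  Line := '';
  for I := Low(Strs) to High(Strs) do
  begin
    Line := Line + Strs[I];
    // The number is subtracted from the pen position, so a large negative
    // value shifts the next glyph to the right and leaves a visible gap.
    if Adjust[I] <= WordGapThreshold then
      Line := Line + ' ';
  end;
  WriteLn(Line); // prints: Test "This, that"
end.
```

Small values such as 83, 1 and -1 are kerning corrections inside a word and stay below the threshold in magnitude, so they do not trigger a break.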
|
Posted By: ukobsa
Date Posted: 13 Jul 06 at 7:28AM
Ok, here it is:
Description: With the following change, option 4 returns correct text for input such as [(te) 83 (s) -1 (t) -333 (thi) 82(s)]TJ
The original version returns "tetestest thithis", while the new version returns "test this".
Restriction: for text like the above it is still not possible to split it into single words, so option 4 then behaves more like option 3.
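The effect of the change can be reproduced with a simplified model that is not the library's code. It assumes that one chunk of text is emitted per string operand, as the ShowText loop does with FFunnel.AddText, and that the buffer is either kept or cleared afterwards. `Extract` and `ResetAfterEmit` are illustrative names. The operands are those of the first post.

```pascal
program ResetDemo;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

const
  // String operands of the TJ array from the first post.
  Strs: array[0..7] of string = ('T', 'es', 't', '"Thi', 's', ',', 'th', 'at"');

// Emits one chunk per string operand. ResetAfterEmit models the added
// "MappedText := ''" line; without it the buffer keeps growing.
function Extract(ResetAfterEmit: Boolean): string;
var
  I: Integer;
  Mapped: string;
begin
  Result := '';
  Mapped := '';
  for I := Low(Strs) to High(Strs) do
  begin
    Mapped := Mapped + Strs[I];
    Result := Result + Mapped; // stands in for the AddText call
    if ResetAfterEmit then
      Mapped := '';
  end;
end;

begin
  // prints: TTesTestTest"ThiTest"ThisTest"This,Test"This,thTest"This,that"
  WriteLn('without reset: ', Extract(False));
  // prints: Test"This,that"
  WriteLn('with reset:    ', Extract(True));
end.
```

Apart from the blanks and the typographic quotes, the output without the reset has the same doubled pattern as the extraction result quoted in the first post.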
Search for UKO in the following code (2 lines) and add those lines to your source.
Unit: uPDFRenderer
Method: SubRender
local SubMethod: ShowText
procedure ShowText;
var
X: Integer;
C: Char;
M: TPDFXForm;
OldM: TPDFXForm;
DX: Double;
OI: Integer;
Text: string;
W: Double;
CW: Word;
WI: Integer;
DXS: Double;
MapText: string;
TestP1: TPDFXFormPoint;
TestP2: TPDFXFormPoint;
RealTextSize: Double;
GC: TPDFGenericCanvas;
CIDToGIDMap: string;
CN: string;
MappedText: string;
ThisMappedText: string;
MatrixScale: Double;
UC: Word;
UX: Integer;
UFound: Boolean;
begin
MatrixScale := 1000;
MappedText := '';
CIDToGIDMap := FFontCol.FoundFontData.CIDToGIDMap;
if FDestination = rdEPS then
GC := FEPS
else
GC := Picasso;
GC.BeginPath;
if GS.TextSize <> 0 then
begin
if Assigned(FFontCol.FoundFontData.Rasterizer) then
FFontCol.FoundFontData.Rasterizer.RenderingMode := GS.TextRenderingMode;
SetFill(pfUnknown);
SetPen;
OI := Operands.Count - ArrayCount;
SelectFont;
OldM := CanvasMat;
try
CombinePDFXForm(CanvasMat, GS.TM, CanvasMat);
if FFontCol.FoundFontData.Rasterizer is TPDFType3Rasterizer then
begin
MatrixScale := 1;
CombinePDFXForm(CanvasMat, TPDFType3Rasterizer(
FFontCol.FoundFontData.Rasterizer).FontMatrix, CanvasMat);
TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FillColor :=
GS.FillColor;
end;
Mat(M, GS.TextSize / MatrixScale * GS.TextScaling / 100, 0, 0, GS.TextSize / MatrixScale, 0, 0);
CombinePDFXForm(CanvasMat, M, CanvasMat);
TestP1.X := 0;
TestP1.Y := 0;
TestP2.X := 0;
TestP2.Y := MatrixScale;
TestP1 := DoPDFXForm(CanvasMat, TestP1);
TestP2 := DoPDFXForm(CanvasMat, TestP2);
RealTextSize := Sqrt(Sqr(TestP2.X - TestP1.X) + Sqr(TestP2.Y - TestP1.Y));
DX := 0;
repeat
Text := Operands[OI];
if (Copy(Text, 1, 1) <> '(') and (Copy(Text, 1, 1) <> '<') then
begin
DX := DX - TazzToFloat(Text) * MatrixScale / 1000;
end else
begin
if (Copy(Text, 1, 1) = '(') then
begin
Text := FStructure.DecodeString(Text);
end else
if (Copy(Text, 1, 1) = '<') then
begin
Text := FStructure.DecodeHex(Text);
end;
if FFontCol.FoundFontData.IsComposite then
begin
DXS := DX;
for X := 1 to Length(Text) div 2 do
begin
CW := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
if CIDToGIDMap <> '' then
begin
if (CW * 2) < Length(CIDToGIDMap) then
begin
CN := 'GID:' + IntToStr(Ord(CIDToGIDMap[CW * 2 + 1]) * 256 +
Ord(CIDToGIDMap[CW * 2 + 2]));
end;
end else
CN := 'GID:' + IntToStr(CW);
if Assigned(FFontCol.FoundFontData.CIDWidths) then
begin
WI := FFontCol.FoundFontData.CIDWidths.IndexOf('CID:' + IntToStr(CW));
if WI >= 0 then
begin
WI := Integer(FFontCol.FoundFontData.CIDWidths.Objects[WI]);
{***Th W := WI * GS.TextScaling / 100; ***}
W := WI{***Th W ***} + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
end else
W := MatrixScale;
end else
W := MatrixScale;
if FDestination <> rdTextFunnel then
begin
if Assigned(FFontCol.FoundFontData.Rasterizer) then
with FFontCol.FoundFontData do
Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
Abs(GS.TextSize), GC, CanvasMat, CN);
end;
if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
begin
ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(CW)];
if (ThisMappedText = ' ') and (MappedText <> '') then
begin
FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
MappedText := '';
FFunnel.SetNextMatch(False);
DXS := DX + W;
end else
MappedText := MappedText + ThisMappedText;
end;
DX := DX + W;
end;
FFunnel.SetNextMatch(True);
if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
begin
FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
end;
if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
begin
MapText := '';
//MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2])];
for X := 1 to Length(Text) div 2 do
begin
UC := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
UX := 0;
UFound := False;
while (not UFound) and (UX < Length(FFontCol.FoundFontData.DisplayCS2)) do
begin
// UX walks the range table; X is the character position in Text
if (UC >= FFontCol.FoundFontData.DisplayCS2[UX].StartCode) and
(UC <= FFontCol.FoundFontData.DisplayCS2[UX].EndCode) then
begin
UC := FFontCol.FoundFontData.DisplayCS2[UX].ResultCode +
UC - FFontCol.FoundFontData.DisplayCS2[UX].StartCode;
UFound := True;
end else
Inc(UX);
end;
if UFound then MapText := MapText + WideChar(UC);
end;
FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
end;
end else
begin
DXS := DX;
for X := 1 to Length(Text) do
begin
C := Text[X];
W := (Widths[Ord(C)] * Abs(GS.TextSize) {***Th GS.TextScaling / 100 ***} / Abs(GS.TextSize));
if Assigned(FFontCol.FoundFontData.Rasterizer) then
W := W / FFontCol.FoundFontData.Rasterizer.FontMatrixScaling;
W := W + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
if C = #32 then
W := W + GS.WordSpacing * MatrixScale / Abs(GS.TextSize);
if FDestination <> rdTextFunnel then
begin
if Assigned(FFontCol.FoundFontData.Rasterizer) then
with FFontCol.FoundFontData do
Rasterizer.RenderToCanvas(DX, GS.TextRise * MatrixScale /
Abs(GS.TextSize), GC, CanvasMat, Encoding[Ord(C)]);
end;
if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
begin
ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
if (ThisMappedText = ' ') and (MappedText <> '') then
begin
FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
MappedText := '';
FFunnel.SetNextMatch(False);
DXS := DX + W;
end else
MappedText := MappedText + ThisMappedText;
end;
DX := DX + W;
end;
FFunnel.SetNextMatch(True);
if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
begin
FFunnel.AddText(CanvasMat, MappedText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
end;
if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
begin
MapText := '';
for X := 1 to Length(Text) do
MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
FFunnel.AddText(CanvasMat, MapText, HTMLColor(GS.FillColor),
FFontCol.FoundFontData.FontName, RealTextSize, DXS,
GS.TextRise * MatrixScale / Abs(GS.TextSize) +
FFontCol.FoundFontData.Descent, DX - DXS,
FFontCol.FoundFontData.Ascent - FFontCol.FoundFontData.Descent);
end;
end;
end;
Inc(OI);
until OI = Operands.Count;
Mat(M, 1, 0, 0, 1, DX * Abs(GS.TextSize) / MatrixScale, 0);
CombinePDFXForm(GS.TM, M, GS.TM);
finally
CanvasMat := OldM;
end;
end;
ArrayCount := 0;
if FDestination = rdEPS then
begin
if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
(GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
begin
FEPS.PSSetColor(GS.FillColorEPS);
FEPS.PSFill(epsFillModeNonZeroWinding);
end;
if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
(GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
begin
FEPS.PSSetColor(GS.StrokeColorEPS);
FEPS.PSStroke;
end;
end else
begin
if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
(GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
SetPen;
if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
(GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
SetFill(pfWinding);
if GS.TextRenderingMode >= 4 then
Picasso.SetClippingPath(pfWinding);
case GS.TextRenderingMode of
0: Picasso.FillPath;
1: Picasso.StrokePath;
2: Picasso.StrokeAndFillPath;
4: Picasso.FillPath;
5: Picasso.StrokePath;
6: Picasso.StrokeAndFillPath;
7: Picasso.NullPath;
end;
end;
end;
So please test it. If it works OK, can it be included in version 5.15?
best regards,
Ulrich
|
Posted By: DELBEKE
Date Posted: 13 Jul 06 at 8:17AM
Thank you very much. I'll try it on Monday.
Best regards
|
Posted By: DELBEKE
Date Posted: 18 Jul 06 at 5:12AM
Good job.
Working fine for me.
The function can surely be improved, but it works better than the old version, which was too buggy.
Thanks very much, Ukobsa :)
|