I need help - I can help - Textextraction with danish characters

Textextraction with danish characters

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=789

Topic: Textextraction with danish characters

Posted By: Ingo
Subject: Textextraction with danish characters
Date Posted: 21 Sep 07 at 2:57PM

Hi!

I have documents with Danish text content. During extraction, lines break wherever Danish characters appear.

I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extracting I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å besparelser.

You see... whenever one of these (for me) strange characters appears, the line breaks.

Any advice on how to extract this in a better way?

Best regards and thanks for reading,
Ingo

Replies:

Posted By: ukobsa
Date Posted: 24 Sep 07 at 2:36PM

Hi Ingo,

After some short tests I think this is a document-dependent problem. Can you send me the file and the code you use for extraction? Am I right that you are using 5.21?

greetings,
Uli

Posted By: Ingo
Date Posted: 24 Sep 07 at 4:07PM

Hi Uli!

Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.
With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there's a Danish character then that character is a separate string... and because my output is string by string, there are some "rows" with only one Danish character.
With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF. Then everything is okay.

So the question remains whether it's possible to change the behavior of option 3...? And yes, I'm still using the open version 5.21 ;-)

I'll send you the Danish file and a code snippet. Thanks a lot in advance!
Best regards,
Ingo

Posted By: Ingo
Date Posted: 28 Sep 07 at 5:23AM

Hi!

I've got an answer from Uli (in German). I've translated it into my "German English", so it's there for everybody ;-)

Best regards,
Ingo

--- from Uli ---

Hi Ingo,

I've had a look at your (Danish) document:

- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.

- The carriage returns are there because the special Danish characters are written in the PDF code as ANSI codes (octal escapes such as \346) and not as literal characters. Each ANSI code entry is a separate string in the PDF code, so after extraction each special Danish character ends up on a separate row.

Here is an example from the document:

q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 13 280 Tm
-0.134 Tc
(Energim) Tj
ET
Q
q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 120 280 Tm
0.062 Tc
(\346) Tj
ET
Q
q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 141 280 Tm
97.8404 Tz
(rkning) Tj
ET

You can see how the word "Energimærkning" was built:
"Energim" + "\346" + "rkning"
That means three text blocks for QuickPDF.
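
To make the "\346" part concrete, here is a small Python sketch (not QuickPDF code) that turns the escapes of a PDF literal string back into characters. The function name decode_pdf_literal is made up for this post, and it assumes the font uses a WinAnsi-like encoding (cp1252), which a real document does not guarantee.

```python
import re

# Single-character escapes allowed inside a PDF literal string.
_SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
                   "(": "(", ")": ")", "\\": "\\"}


def decode_pdf_literal(body, encoding="cp1252"):
    """Decode the text between ( and ) of a PDF literal string.

    Hypothetical helper: assumes a single-byte, WinAnsi-like font encoding.
    """
    def replace(match):
        token = match.group(1)
        if token[0] in "01234567":
            # \ddd is an octal byte value, e.g. \346 -> 0xE6
            return bytes([int(token, 8) & 0xFF]).decode(encoding, errors="replace")
        # Known escapes are mapped; for unknown ones the backslash is dropped.
        return _SIMPLE_ESCAPES.get(token, token)

    return re.sub(r"\\([0-7]{1,3}|.)", replace, body)


# The three strings from the content stream above.
parts = ["Energim", r"\346", "rkning"]
word = "".join(decode_pdf_literal(p) for p in parts)
print(word)
```

Octal 346 is byte 0xE6, which is "æ" in cp1252, so the three pieces join back into one word.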

- What you can do: examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work most of the time ;-)

Example:

"Energim"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]

"æ"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]

"rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

The right-hand corner of the first part (...(95,35|591,40)...) is identical to the left-hand corner of the second part ([(95,35|591,40)...). The height is identical, too. And so on, and so on, ... So you can join the strings that belong together.
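
A rough Python sketch of this idea, under a few assumptions: the blocks arrive in reading order, the commas in the TextRect values are decimal commas (so 95,35 means 95.35), and the separate character is "æ" (octal \346). TextBlock and merge_adjacent are names invented for this sketch, not QuickPDF functions.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One extracted string with its box (hypothetical structure)."""
    text: str
    left: float
    right: float
    y1: float
    y2: float


def merge_adjacent(blocks, tol=0.5):
    """Join consecutive blocks that sit on the same line and touch each other."""
    merged = []
    for block in blocks:
        if merged:
            prev = merged[-1]
            same_line = (abs(prev.y1 - block.y1) <= tol
                         and abs(prev.y2 - block.y2) <= tol)
            touching = abs(prev.right - block.left) <= tol
            if same_line and touching:
                # Extend the previous block instead of starting a new row.
                merged[-1] = TextBlock(prev.text + block.text,
                                       prev.left, block.right, prev.y1, prev.y2)
                continue
        merged.append(block)
    return merged


# Values taken from the three TextRect entries above.
blocks = [
    TextBlock("Energim", 63.11, 95.35, 591.40, 599.38),
    TextBlock("æ", 95.35, 102.30, 591.40, 599.38),
    TextBlock("rkningen oplyser om ejendommens energiforbrug, "
              "mulighederne for at opn", 102.30, 408.17, 591.40, 599.38),
]
lines = merge_adjacent(blocks)
print(lines[0].text)
```

The tolerance tol is a guess. A value that is too large would also glue together words that are meant to be separated by a gap, so it is worth testing against real documents.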

Perhaps you can use option 4 in this case (word extraction)...

Best regards,
Uli