Debenu Quick PDF Library - PDF SDK Community Forum : Textextraction with danish characters

Author: Ingo
Subject: 789
Posted: 28 Sep 07 at 5:23AM

Hi!

I got an answer from Uli (in German). I've translated it into my "German English" so that everybody can read it ;-)

Best regards,
Ingo

--- from Uli ---

Hi Ingo,

I've had a look at your (Danish) document:

- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.

- The carriage returns are there because the special Danish characters are written in the PDF code as ANSI codes (and not as literal characters). Each ANSI code entry is a separate string inside the PDF code, so after extraction each special Danish character ends up on a separate row.

We can look in the document for examples:

q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 13 280 Tm
-0.134 Tc
(Energim) Tj
ET
Q
q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 120 280 Tm
0.062 Tc
(\346) Tj
ET
Q
q
13 273 512 32 re
W n
BT
/Fabc11 29 Tf
0 0.3569 0.5882 rg
1 0 0 1 141 280 Tm
97.8404 Tz
(rkning) Tj
ET

You can see how the word "Energimærkning" was built:
"Energim" + "\346" + "rkning"
This means three text blocks for QuickPDF.
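
As a side note, the "\346" above is an octal escape: 346 octal is byte 0xE6, which is "æ" in WinAnsi (cp1252). Here is a minimal Python sketch, not QuickPDF code, that pulls the Tj operands out of such a content stream and decodes them. The helper names (decode_pdf_literal, tj_strings) are made up for illustration, and the sketch assumes the font uses WinAnsi encoding and that the strings contain no nested or escaped parentheses; real documents may need the font's own encoding or ToUnicode map.

```python
import re

# Single-character escapes allowed inside a PDF literal string.
_SIMPLE_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
                   "(": 0x28, ")": 0x29, "\\": 0x5C}


def decode_pdf_literal(body: str, encoding: str = "cp1252") -> str:
    """Decode the text between '(' and ')' of a PDF literal string.

    Assumption: the font maps bytes like WinAnsi (cp1252).
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("latin-1")  # plain character, taken as one byte
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            # \ddd: one to three octal digits give the byte value
            out.append(int(octal.group(), 8) & 0xFF)
            i += octal.end()
        elif body[i] in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[body[i]])
            i += 1
        else:
            out += body[i].encode("latin-1")  # unknown escape: drop the backslash
            i += 1
    return out.decode(encoding)


def tj_strings(stream: str) -> list:
    """Return the decoded operand of every '(...) Tj' in a content stream."""
    return [decode_pdf_literal(m.group(1))
            for m in re.finditer(r"\((.*?)\)\s*Tj", stream)]


# The three text objects from the excerpt, reduced to their Tj lines.
STREAM = r"""
(Energim) Tj
(\346) Tj
(rkning) Tj
"""

parts = tj_strings(STREAM)   # three separate strings, one per text object
word = "".join(parts)        # the word as a reader sees it on the page
```

The point is only that the page looks like one word while the stream holds three strings, so any extractor that reports one block per string will split the word.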

- What you can do: examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work in most cases ;-)

Example:

"Energim"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]

"æ"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]

"rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn"
Font: "Verdana"
Textcolor: #000000
TextSize: 8.21
TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

The right-hand corner of the first part (...(95,35|591,40)...) is identical to the left-hand corner of the second part ([(95,35|591,40)...). The height is identical, too. The same holds for the second and third parts. So you can join the strings that belong together.
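
The coordinate check described above could be sketched like this in Python. This is only an illustration under assumptions: the TextRect strings are taken exactly as printed in the example (four corners, decimal comma), the fragments arrive in reading order, and the names Fragment, parse_text_rect, make_fragment and merge_fragments as well as the tolerance value are hypothetical, not part of QuickPDF.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def parse_text_rect(rect: str):
    """Turn '[(63,11|591,40) (95,35|591,40) ...]' into (x_min, x_max, y_min, y_max)."""
    corners = re.findall(r"\((-?\d+),(\d+)\|(-?\d+),(\d+)\)", rect)
    if not corners:
        raise ValueError(f"no corners found in {rect!r}")
    # The listing uses a decimal comma, so '63,11' means 63.11.
    xs = [float(f"{xi}.{xf}") for xi, xf, _, _ in corners]
    ys = [float(f"{yi}.{yf}") for _, _, yi, yf in corners]
    return min(xs), max(xs), min(ys), max(ys)


def make_fragment(text: str, rect: str) -> Fragment:
    return Fragment(text, *parse_text_rect(rect))


def merge_fragments(frags, tol=0.05):
    """Join neighbours that have the same height and touch horizontally."""
    merged = []
    for frag in frags:
        if merged:
            prev = merged[-1]
            same_line = (abs(prev.y_min - frag.y_min) <= tol
                         and abs(prev.y_max - frag.y_max) <= tol)
            touching = abs(prev.x_max - frag.x_min) <= tol
            if same_line and touching:
                # Extend the previous fragment instead of starting a new row.
                merged[-1] = Fragment(prev.text + frag.text, prev.x_min,
                                      frag.x_max, prev.y_min, prev.y_max)
                continue
        merged.append(frag)
    return merged


fragments = [
    make_fragment("Energim",
                  "[(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]"),
    make_fragment("æ",
                  "[(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]"),
    make_fragment("rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn",
                  "[(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]"),
]
merged = merge_fragments(fragments)  # the three pieces collapse into one string
```

A small tolerance is used instead of exact equality because rounded coordinates may differ in the last digit in other documents; whether 0.05 is a good value would have to be tested per document.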

Perhaps you can use option 4 in this case (word extraction)...

Best regards,
Uli

Author: Ingo
Subject: 789
Posted: 24 Sep 07 at 4:07PM

Hi Uli!

Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.
With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there's a Danish character, that character is a separate string... and because my output is written string by string, there are some "rows" with only one Danish character.
With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF. Then everything is okay.

So the question remains whether it's possible to change how "option 3" works...? And yes, I'm still using the open version 5.21 ;-)

I'll send you the Danish file and a code snippet. Thanks a lot in advance!
Best regards,
Ingo

Author: ukobsa
Subject: 789
Posted: 24 Sep 07 at 2:36PM

Hi Ingo,

After some short tests I think this is a document-dependent problem. Can you send me the file and the code that you use for the extraction? Am I right that you are using 5.21?

greetings,
Uli

Author: Ingo
Subject: 789
Posted: 21 Sep 07 at 2:57PM

Hi!

I have documents with Danish text content. During extraction the lines break wherever Danish characters appear.

I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extracting I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å besparelser.

You see... whenever one of these (for me) strange characters appears, the line breaks.

Any advice on how to extract this in a better way?

Best regards and thanks for reading,
Ingo
