Text extraction with Danish characters


Ingo (Moderator Group, Joined: 29 Oct 05), Posted: 21 Sep 07 at 2:57PM

Hi!

I have documents with Danish text content. When I extract the text, the lines break wherever Danish characters appear.

I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extracting I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å besparelser.

You see... whenever one of these (for me) strange characters appears, the line breaks.

Any advice on how to extract this in a better way?

Best regards and thanks for reading,
Ingo

ukobsa (Senior Member, Joined: 29 May 06, Location: Germany), Posted: 24 Sep 07 at 2:36PM

Hi Ingo,

after some short tests I think that this is a document-dependent problem. Can you send me the file and the code that you use to extract? Am I right that you are using 5.21?

Greetings,
Uli

Ingo (Moderator Group, Joined: 29 Oct 05), Posted: 24 Sep 07 at 4:07PM

Hi Uli!

Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.

With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there's a Danish character then this character is a separate string... and because my output is string by string, there are some "rows" with only one Danish character.

With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF. Then it's all okay. So the question remains whether it's possible to change the behaviour of "option 3"...?

Yes, I'm still using the open version 5.21 ;-) I'll send you the Danish file and a code snippet.

Thanks a lot in advance!

Best regards,
Ingo

Ingo (Moderator Group, Joined: 29 Oct 05), Posted: 28 Sep 07 at 5:23AM

Hi!

I've got an answer from Uli (in German). I've translated it into my "German English", so it's for everybody ;-)

Best regards,
Ingo

--- from Uli ---

Hi Ingo,

I've had a look at your (Danish) document:

- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.
- The line breaks are there because the special Danish characters are included in the PDF code as ANSI codes (and not as real characters). Each ANSI code entry is a separate string inside the PDF code.
- So after extraction each special Danish character ends up on a separate row.

We can look in the document for examples:

    q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 13 280 Tm -0.134 Tc (Energim) Tj ET Q
    q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 120 280 Tm 0.062 Tc (\346) Tj ET Q
    q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 141 280 Tm 97.8404 Tz (rkning) Tj ET

Here you can see how the word "Energimærkning" was built: "Energim" + "\346" + "rkning" (octal 346 is the ANSI code for "æ"). This means three text blocks for QuickPDF.

- What you can do: you can examine the string coordinates to check which strings belong together. Not a 100 percent solution, but mostly it will work ;-)

Example:

    "Energim" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]
    "æ" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]
    "rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

The second corner of the first part (... (95,35|591,40) ...), on its right edge, is identical to the first corner of the second part ([(95,35|591,40) ...), on its left edge. The height is identical, too. And so on, and so on, ...

So you can put together the strings which belong together. Perhaps you can use option 4 in this case (word extraction)...

Best regards,
Uli
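
For anyone who wants to try the coordinate check Uli describes, here is a minimal sketch in Python. It is not QuickPDF code and does not call the library. It assumes you have already parsed each extracted string into a text plus its rectangle, with the decimal commas converted to points. The `Fragment` class, the `merge_fragments` function and the `tol` tolerance are names made up for this illustration, and the fourth fragment in the sample data is invented to show a string on a different row.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with its bounding box (hypothetical structure)."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def merge_fragments(fragments, tol=0.5):
    """Join neighbouring fragments that sit on the same row and touch each other.

    Assumes the fragments arrive in reading order. `tol` is an assumed
    tolerance in page units, not a value taken from QuickPDF.
    """
    merged = []
    for frag in fragments:
        if merged:
            last = merged[-1]
            same_height = (abs(frag.y_min - last.y_min) <= tol
                           and abs(frag.y_max - last.y_max) <= tol)
            touching = abs(frag.x_min - last.x_max) <= tol
            if same_height and touching:
                # Extend the previous fragment instead of starting a new row.
                merged[-1] = Fragment(last.text + frag.text,
                                      last.x_min, last.y_min,
                                      frag.x_max, last.y_max)
                continue
        merged.append(frag)
    return merged


# The first three rectangles are the ones from Uli's example.
example = [
    Fragment("Energim", 63.11, 591.40, 95.35, 599.38),
    Fragment("æ", 95.35, 591.40, 102.30, 599.38),
    Fragment("rkningen oplyser om ejendommens energiforbrug, "
             "mulighederne for at opn", 102.30, 591.40, 408.17, 599.38),
    # Invented fragment on another row, so it must stay separate.
    Fragment("Next row", 63.11, 579.40, 110.00, 587.38),
]

merged = merge_fragments(example)
for fragment in merged:
    print(fragment.text)
```

As Uli says, this is not a 100 percent solution: a real gap between two words on the same row looks the same as a gap between two columns, so the tolerance needs tuning per document.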