Text extraction with Danish characters
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524
Posted: 21 Sep 07 at 2:57PM
Hi!
I have documents with Danish text content. During extraction, the lines break wherever a Danish character appears. For example, I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extraction I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å
besparelser.
As you can see, the line breaks whenever one of these (to me) unusual characters appears. Any advice on how to extract the text in a better way?
Best regards and thanks for reading, Ingo
ukobsa
Senior Member Joined: 29 May 06 Location: Germany Status: Offline Points: 115
Hi Ingo,
after some short tests I think this problem depends on the document. Can you send me the file and the code that you use for the extraction? Am I right that you are using 5.21?
Greetings, Uli
Ingo
Hi Uli!
Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.
With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but a Danish character always comes back as a separate string. Because my output is written string by string, some "rows" contain only a single Danish character.
With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF, and then everything is okay. So the question remains whether it's possible to change the behavior of option 3.
And yes, I'm still using the open version 5.21 ;-) I'll send you the Danish file and a code snippet. Thanks a lot in advance!
Best regards, Ingo
Ingo
Hi!
I've got an answer from Uli (in German). I've translated it into my "German English" so it's there for everybody ;-)
Best regards, Ingo

--- from Uli ---
Hi Ingo,
I've had a look at your (Danish) document:
- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.
- The line breaks appear because the special Danish characters are written in the PDF code as ANSI codes (such as \346) and not as literal characters. Each ANSI code entry is a separate string inside the PDF code, so after extraction each special Danish character ends up on a separate row. We can look in the document for an example:

q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 13 280 Tm -0.134 Tc (Energim) Tj ET Q
q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 120 280 Tm 0.062 Tc (\346) Tj ET Q
q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 141 280 Tm 97.8404 Tz (rkning) Tj ET

Here you can see how the word "Energimærkning" was built: "Energim" + "\346" + "rkning". That means three text blocks for QuickPDF.
- What you can do: examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work most of the time ;-) Example (the numbers use a decimal comma):

"Energim" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]
"æ" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]
"rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

The right-hand corner of the first part (...(95,35|591,40)...) is identical to the left-hand corner of the second part ([(95,35|591,40)...). The height is identical, too. The same holds for the second and third parts, and so on.
So you can join the strings that belong together. Perhaps you can use option 4 (word extraction) in this case...
Best regards, Uli
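The coordinate check Uli describes could be sketched roughly as follows. This is only an illustration in plain Python, not Quick PDF Library code: `Fragment`, `merge_fragments`, `decode_octal_escapes` and the tolerance value are hypothetical names and choices made up for this example. It assumes the extracted strings arrive in reading order, that their rectangles have already been parsed into numbers, and that the font uses a Latin-1 compatible encoding, so that the octal code \346 maps to "æ".

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with the edges of its text rectangle (hypothetical type)."""
    text: str
    left: float
    right: float
    bottom: float
    top: float


def decode_octal_escapes(literal: str) -> str:
    """Replace octal escapes such as \\346 with characters.

    Assumption: the code points match Latin-1, which holds for 0o346 -> 'æ'.
    Other escape sequences (for example an escaped backslash) are not handled.
    """
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), literal)


def merge_fragments(fragments, tol=0.05):
    """Join consecutive fragments whose rectangles touch and share the same height."""
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        touches = (
            prev is not None
            and abs(prev.right - frag.left) <= tol    # right edge meets left edge
            and abs(prev.bottom - frag.bottom) <= tol  # same lower edge
            and abs(prev.top - frag.top) <= tol        # same upper edge
        )
        if touches:
            merged[-1] = Fragment(prev.text + frag.text,
                                  prev.left, frag.right, prev.bottom, prev.top)
        else:
            merged.append(frag)
    return merged


# Rectangles taken from the example above, with decimal commas written as points.
fragments = [
    Fragment("Energim", 63.11, 95.35, 591.40, 599.38),
    Fragment(decode_octal_escapes(r"\346"), 95.35, 102.30, 591.40, 599.38),
    Fragment("rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn",
             102.30, 408.17, 591.40, 599.38),
]

lines = merge_fragments(fragments)
print(lines[0].text)
```

With the three fragments from the example this yields a single string again. The tolerance is there because adjacent rectangles will not always match to the last digit; as Uli says, this is not a 100 percent solution, and a real document may need a larger tolerance or an extra rule for spaces between words.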