Text extraction with Danish characters
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=789
Printed Date: 10 May 24 at 2:42PM
Posted By: Ingo
Date Posted: 21 Sep 07 at 2:57PM
Hi!
I have documents with Danish text content. When I extract the text, the lines break wherever a Danish character appears.
For example, I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extraction I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å
besparelser.
As you can see, whenever one of these (to me) unusual characters appears, the line breaks.
Any advice on how to extract the text in a better way?
Best regards and thanks for reading, Ingo
Replies:
Posted By: ukobsa
Date Posted: 24 Sep 07 at 2:36PM
Hi Ingo,
After a few short tests I think this is a document-dependent problem. Can you send me the file and the code you use for the extraction? Am I right that you are using 5.21?
Greetings,
Uli
Posted By: Ingo
Date Posted: 24 Sep 07 at 4:07PM
Hi Uli!
Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper. With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there is a Danish character, that character becomes a separate string. Because my output is written string by string, some "rows" contain only a single Danish character. With text extraction option 0 I get the text content of complete pages, nearly as I see it in the PDF, and then everything is fine.
So the question remains whether it is possible to change the behavior of option 3. And yes, I'm still using the open version 5.21 ;-)
I'll send you the Danish file and a code snippet. Thanks a lot in advance! Best regards, Ingo
Posted By: Ingo
Date Posted: 28 Sep 07 at 5:23AM
Hi!
I've got an answer from Uli (in German). I've translated it into my "German English", so it's for everybody ;-)
Best regards, Ingo
--- from Uli ---
Hi Ingo,
I've had a look at your (Danish) document:
- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.
- The line breaks appear because the special Danish characters are stored in the PDF code as ANSI codes (for example \346) and not as literal characters. Each ANSI code entry is a separate string inside the PDF code, so after extraction each special Danish character ends up on a separate row.
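The "ANSI codes" mentioned here show up in PDF strings as a backslash followed by octal digits, such as the \346 quoted from the document in this thread. As a minimal sketch (not part of Quick PDF; `decode_octal_escapes` is a hypothetical helper name), the following Python turns such escapes back into characters. It assumes a simple single-byte font encoding and approximates WinAnsi with Python's `cp1252` codec. Other PDF string escapes (such as `\n` or `\(`) and fonts with custom encodings are not handled.

```python
import re

# A backslash followed by one to three octal digits, as used in PDF literal strings.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_octal_escapes(literal: str, encoding: str = "cp1252") -> str:
    """Replace \\ddd octal escapes with the character they stand for.

    Assumption: the font uses a plain single-byte encoding close to cp1252.
    """
    def _replace(match: re.Match) -> str:
        byte_value = int(match.group(1), 8) & 0xFF  # octal digits -> one byte
        return bytes([byte_value]).decode(encoding, errors="replace")

    return _OCTAL_ESCAPE.sub(_replace, literal)


# The three strings that make up one word in the content stream from this thread.
fragments = ["Energim", r"\346", "rkning"]
word = "".join(decode_octal_escapes(fragment) for fragment in fragments)
print(word)  # Energimærkning
```

Octal 346 is decimal 230 (hex E6), which is "æ" in cp1252, so decoding alone recovers the character. The split into separate strings still has to be undone afterwards.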
We can look in the document for examples:
q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 13 280 Tm -0.134 Tc (Energim) Tj ET Q
q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 120 280 Tm 0.062 Tc (\346) Tj ET Q
q 13 273 512 32 re W n BT /Fabc11 29 Tf 0 0.3569 0.5882 rg 1 0 0 1 141 280 Tm 97.8404 Tz (rkning) Tj ET
You can see how the word "Energimærkning" was built: "Energim" + "\346" + "rkning". This means three text blocks for QuickPDF.
- What you can do: you can examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work most of the time ;-)
Example:
"Energim" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]
"æ" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]
"rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn" Font: "Verdana" Textcolor: #000000 TextSize: 8.21 TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]
The top right corner of the first part (...(95,35|591,40)...) is identical to the bottom left corner of the second part ([(95,35|591,40)...). The height is identical, too, and the same holds for the following parts. So you can join the strings that belong together.
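The coordinate check described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Quick PDF API: `Fragment` and `merge_adjacent` are hypothetical names, the TextRect values are read with a decimal comma (63,11 means 63.11), the fragments are assumed to arrive in reading order, and the tolerance of 0.5 units is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with the bounding box taken from its TextRect."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def merge_adjacent(fragments, tolerance=0.5):
    """Join consecutive fragments whose boxes touch and share the same height."""
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        touches = (
            prev is not None
            and abs(prev.x_max - frag.x_min) <= tolerance  # right edge meets left edge
            and abs(prev.y_min - frag.y_min) <= tolerance  # same lower bound
            and abs(prev.y_max - frag.y_max) <= tolerance  # same upper bound
        )
        if touches:
            # Extend the previous fragment instead of starting a new row.
            merged[-1] = Fragment(prev.text + frag.text,
                                  prev.x_min, prev.y_min, frag.x_max, prev.y_max)
        else:
            merged.append(frag)
    return merged


# The three text blocks from the example, with decimal commas converted to points.
fragments = [
    Fragment("Energim", 63.11, 591.40, 95.35, 599.38),
    Fragment("æ", 95.35, 591.40, 102.30, 599.38),
    Fragment("rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn",
             102.30, 591.40, 408.17, 599.38),
]

for row in merge_adjacent(fragments):
    print(row.text)
```

With the sample data the three blocks collapse into one row. Fragments on a different baseline, or with a gap wider than the tolerance, stay separate, which is why this is a heuristic rather than a complete solution.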
Perhaps you can use option 4 (word extraction) in this case... Best regards, Uli