
Text extraction with Danish characters

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=789
Printed Date: 10 May 24 at 2:42PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Text extraction with Danish characters
Posted By: Ingo
Subject: Text extraction with Danish characters
Date Posted: 21 Sep 07 at 2:57PM
Hi!

I have documents with Danish text content. When I extract the text, the lines break wherever Danish characters appear.

I have this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extracting, I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å besparelser.

You see... whenever one of these (for me) strange characters appears, the line breaks.

Any advice on how to extract the text in a better way?

Best regards and thanks for reading,
Ingo




Replies:
Posted By: ukobsa
Date Posted: 24 Sep 07 at 2:36PM
Hi Ingo,

After some short tests I think this is a document-dependent problem. Can you send me the file and the code that you use for extraction? Am I right that you are using 5.21?

greetings,
Uli


Posted By: Ingo
Date Posted: 24 Sep 07 at 4:07PM
Hi Uli!

Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.
With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there is a Danish character, that character is a separate string... and because my output is written string by string, there are some "rows" containing only a single Danish character.
With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF. Then everything is fine.

So the question remains whether it's possible to change the behaviour of option 3...? And yes, I'm still using the open version 5.21 ;-)

I'll send you the Danish file and a code snippet. Thanks a lot in advance!
Best regards,
Ingo



Posted By: Ingo
Date Posted: 28 Sep 07 at 5:23AM
Hi!

I've got an answer from Uli (in German). I've translated it into my "German English" so it's there for everybody ;-)

Best regards,
Ingo

--- from Uli ---

Hi Ingo,

I've had a look at your (Danish) document:

- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.
 
- The line breaks are there because the special Danish characters are written in the PDF code as ANSI codes (escapes such as \346) and not as literal characters. Each ANSI-code entry is a separate string in the PDF code, so after extraction each special Danish character ends up on a separate row.

Here is an example from the document:

  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 13 280 Tm
  -0.134 Tc
  (Energim) Tj
  ET
  Q
  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 120 280 Tm
  0.062 Tc
  (\346) Tj
  ET
  Q
  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 141 280 Tm
  97.8404 Tz
  (rkning) Tj
  ET

  You can see how the word "Energimærkning" was built:
  "Energim" + "\346" + "rkning"
  For QuickPDF this means three text blocks.
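
  To illustrate what the "\346" entry stands for, here is a minimal Python sketch. It assumes the font uses a Windows ANSI (cp1252) encoding, which only the font's encoding entry in the PDF can confirm. The helper name decode_pdf_literal is made up for this example, and it handles octal escapes only, not the other escape sequences a PDF string may contain.

```python
import re


def decode_pdf_literal(raw: bytes, encoding: str = "cp1252") -> str:
    """Resolve \\ddd octal escapes in the body of a PDF literal string.

    Assumption: the bytes map to characters via `encoding` (cp1252 here).
    Other PDF escapes (\\n, \\(, \\\\ and so on) are deliberately ignored.
    """
    def octal_to_byte(match):
        # int(..., 8) turns b"346" into 230 (0xE6); mask to one byte.
        return bytes([int(match.group(1), 8) & 0xFF])

    unescaped = re.sub(rb"\\([0-7]{1,3})", octal_to_byte, raw)
    return unescaped.decode(encoding)


# The three strings from the content stream above, in drawing order.
parts = [rb"Energim", rb"\346", rb"rkning"]
word = "".join(decode_pdf_literal(p) for p in parts)
# word == "Energimærkning"
```

  Octal 346 is decimal 230 (hex E6), which is "æ" in cp1252, so the three strings together spell the word as it appears on the page.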
 
- What you can do: examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work in most cases ;-)
 
  Example:
 
  "Energim"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]
 
  "æ"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]
 
  "rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

  The right-hand corner of the first part (...(95,35|591,40)...) is identical to the left-hand corner of the second part ([(95,35|591,40)...). The height is identical, too. The same holds for the following parts. So you can join the strings that belong together.
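
  The joining idea could be sketched in Python roughly as follows. This is only an illustration under assumptions: the Fragment class, its field names, and merge_fragments are invented for the example and are not part of QuickPDF, the commas in the TextRect values above are read as decimal commas (95,35 = 95.35), and the tolerance of 0.1 is an arbitrary choice. Parsing the real extraction output into such fragments is left out.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with its bounding box (hypothetical structure)."""
    text: str
    x_left: float
    y_min: float
    x_right: float
    y_max: float


def merge_fragments(fragments, tol=0.1):
    """Join consecutive fragments that sit on the same line and touch.

    Two fragments are joined when their vertical extents match and the
    right edge of the first equals the left edge of the second, each
    within `tol`. Fragments are expected in reading order.
    """
    merged = []
    for frag in fragments:
        if merged:
            prev = merged[-1]
            same_line = (abs(prev.y_min - frag.y_min) <= tol
                         and abs(prev.y_max - frag.y_max) <= tol)
            touching = abs(prev.x_right - frag.x_left) <= tol
            if same_line and touching:
                # Extend the previous fragment instead of starting a new one.
                merged[-1] = Fragment(prev.text + frag.text, prev.x_left,
                                      prev.y_min, frag.x_right, prev.y_max)
                continue
        merged.append(frag)
    return merged


# The three fragments from the example above.
parts = [
    Fragment("Energim", 63.11, 591.40, 95.35, 599.38),
    Fragment("æ", 95.35, 591.40, 102.30, 599.38),
    Fragment("rkningen oplyser om ejendommens energiforbrug, "
             "mulighederne for at opn", 102.30, 591.40, 408.17, 599.38),
]
merged = merge_fragments(parts)
# merged holds a single fragment whose text starts with "Energimærkningen"
```

  A gap between two strings (for example a word space that is drawn by positioning rather than as a character) would not pass the "touching" test, so in practice the tolerance probably needs tuning per document.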

Perhaps you can use option 4 (word extraction) in this case...
 
Best regards,
Uli



