

Text extraction with Danish characters

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 21 Sep 07 at 2:57PM
Hi!

I have documents with Danish text content. During extraction, the lines break wherever Danish characters appear.

I've this one line:
Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.
After extracting I get these lines:
Energim
æ
rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn
å besparelser.

You see... whenever one of these (for me) strange characters appears, the line breaks.

Any advice on how to extract this in a better way?

Best regards and thanks for reading,
Ingo

ukobsa (Senior Member, Germany, joined 29 May 06)
Posted: 24 Sep 07 at 2:36PM
Hi Ingo,

After some short tests I think this is a document-dependent problem. Can you send me the file and the code that you use to extract? Am I right that you are using 5.21?

greetings,
Uli
Ingo (Moderator Group, joined 29 Oct 05)
Posted: 24 Sep 07 at 4:07PM
Hi Uli!

Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper.
With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there's a Danish character then that character is a separate string... and because my output is written string by string, there are some "rows" with only one Danish character.
With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF. Then it's all okay.

So the question remains whether it's possible to change how "option 3" behaves...? And yes, I'm still using the open version 5.21 ;-)

I'll send you the Danish file and a code snippet. Thanks a lot in advance!
Best regards,
Ingo

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 28 Sep 07 at 5:23AM
Hi!

I've got an answer from Uli (in German). I've translated it into my "German English" so it's here for everybody ;-)

Best regards,
Ingo

--- from Uli ---

Hi Ingo,

I've had a look at your (Danish) document:

- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.
 
- The line breaks are there because the special Danish characters are included in the PDF code as ANSI codes (escape sequences) and not as literal characters. Each ANSI code entry is a separate string inside the PDF code, so after extraction each special Danish character ends up on a separate row.

We can look in the document for examples:

  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 13 280 Tm
  -0.134 Tc
  (Energim) Tj
  ET
  Q
  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 120 280 Tm
  0.062 Tc
  (\346) Tj
  ET
  Q
  q
  13 273 512 32 re
  W n
  BT
  /Fabc11 29 Tf
  0 0.3569 0.5882 rg
  1 0 0 1 141 280 Tm
  97.8404 Tz
  (rkning) Tj
  ET

  You can see how the word "Energimærkning" was built:
  "Energim" + "\346" + "rkning"
  This means three text blocks for QuickPDF.
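
  To illustrate the three text blocks, here is a minimal Python sketch (not QuickPDF code) that pulls the strings out of the Tj operators and decodes the octal escape. The content stream is shortened to the three Tj lines from above. The names decode_pdf_literal and extract_tj_strings are made up for this example, and it assumes the font uses a WinAnsi-like encoding (cp1252); a font with a custom encoding would need its own mapping.

```python
import re

# Shortened content stream from the post: only the Tm/Tj parts are kept.
CONTENT = r"""
BT 1 0 0 1 13 280 Tm (Energim) Tj ET
BT 1 0 0 1 120 280 Tm (\346) Tj ET
BT 1 0 0 1 141 280 Tm (rkning) Tj ET
"""

# A literal string followed by Tj. Backslash escapes are allowed inside the
# string; nested unescaped parentheses are not handled in this sketch.
TJ_PATTERN = re.compile(r"\(([^()\\]*(?:\\.[^()\\]*)*)\)\s*Tj")

# A backslash followed by one to three octal digits, e.g. \346.
OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_pdf_literal(raw: str, encoding: str = "cp1252") -> str:
    # Replace each octal escape by the character it stands for.
    # Assumption: the byte value maps to text via cp1252 (WinAnsi-like).
    def replace(match):
        byte_value = int(match.group(1), 8) & 0xFF
        return bytes([byte_value]).decode(encoding, errors="replace")

    return OCTAL_ESCAPE.sub(replace, raw)


def extract_tj_strings(content: str) -> list[str]:
    # One decoded string per Tj operator, in stream order.
    return [decode_pdf_literal(m.group(1)) for m in TJ_PATTERN.finditer(content)]


parts = extract_tj_strings(CONTENT)
print(parts)           # ['Energim', 'æ', 'rkning']
print("".join(parts))  # Energimærkning
```

  Octal 346 is byte 230, which is "æ" in cp1252, so the three blocks join back into the original word.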
 
- What you can do: examine the string coordinates to check which strings belong together. It's not a 100 percent solution, but it will work in most cases ;-)
 
  Example:
 
  "Energim"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]
 
  "æ"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]
 
  "rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn"
  Font: "Verdana"
  Textcolor: #000000
  TextSize: 8.21
  TextRect: [(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]

  The second corner of the first part (...(95,35|591,40)...) is identical to the first corner of the second part ([(95,35|591,40)...), so the first part ends exactly where the second part begins. The height is identical, too. The same holds for the second and third parts. So you can join the strings that belong together.
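
  The coordinate check described above could be sketched in Python roughly like this. It is only an illustration, not QuickPDF code: Fragment, parse_fragment and merge_fragments are names invented here, the TextRect values are read as "x|y" pairs with a decimal comma (an assumption based on the example output), the strings are assumed to arrive in reading order, and the tolerance of 0.5 is a guess you would have to tune.

```python
import re
from dataclasses import dataclass

# One corner such as (63,11|591,40): x and y, each written with a decimal comma.
CORNER = re.compile(r"\((\d+),(\d+)\|(\d+),(\d+)\)")


@dataclass
class Fragment:
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def parse_fragment(text: str, text_rect: str) -> Fragment:
    # Turn the four corners into a simple bounding box.
    corners = [(float(f"{a}.{b}"), float(f"{c}.{d}"))
               for a, b, c, d in CORNER.findall(text_rect)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return Fragment(text, min(xs), max(xs), min(ys), max(ys))


def merge_fragments(fragments, tol=0.5):
    # Join a fragment to the previous one when it starts where the previous
    # one ends and both have the same vertical extent (within tol).
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        if (prev is not None
                and abs(prev.x_max - frag.x_min) <= tol
                and abs(prev.y_min - frag.y_min) <= tol
                and abs(prev.y_max - frag.y_max) <= tol):
            merged[-1] = Fragment(prev.text + frag.text,
                                  prev.x_min, frag.x_max, prev.y_min, prev.y_max)
        else:
            merged.append(frag)
    return merged


fragments = [
    parse_fragment("Energim",
                   "[(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]"),
    parse_fragment("æ",
                   "[(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]"),
    parse_fragment("rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn",
                   "[(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]"),
]

merged = merge_fragments(fragments)
for frag in merged:
    print(frag.text)
```

  With the three rectangles from the example this prints a single joined line. A string on a different row, or one that starts with a gap after the previous string, stays separate, so where a space should be inserted between words is still something you have to decide yourself.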

Perhaps you can use option 4 (word extraction) in this case...
 
Best regards,
Uli
