<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : Textextraction with danish characters</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of; Debenu Quick PDF Library - PDF SDK Community Forum : I need help - I can help : Textextraction with danish characters]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Wed, 13 May 2026 02:18:49 +0000</pubDate>
  <lastBuildDate>Fri, 28 Sep 2007 05:23:34 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=789</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[Textextraction with danish characters : Hi! I've got an answer from...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3813.html#3813</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 789<br /><strong>Posted:</strong> 28 Sep 07 at 5:23AM<br /><br />Hi!<br><br>I've got an answer from Uli (in German). I've translated it into my "German English" so it's there for everybody ;-)<br><br>Best regards,<br>Ingo<br><br>--- from Uli ---<br><br>Hi Ingo,<br><br>I've had a look at your (Danish) document:<br><br>- You should use option 3 or 4. In my experience the results are nearly always better than with option 0 or 1. I think options 0 and 1 are only included for compatibility with older versions.<br>&nbsp; <br>- The line breaks are there because the special Danish characters are stored in the PDF code as ANSI codes (and not as literal characters). Each ANSI code entry is a separate string inside the PDF code, so after extraction each special Danish character ends up on a separate row.<br><br>We can look in the document for examples:<br><br>&nbsp; q<br>&nbsp; 13 273 512 32 re<br>&nbsp; W n<br>&nbsp; BT<br>&nbsp; /Fabc11 29 Tf<br>&nbsp; 0 0.3569 0.5882 rg<br>&nbsp; 1 0 0 1 13 280 Tm<br>&nbsp; -0.134 Tc<br>&nbsp; (Energim) Tj<br>&nbsp; ET<br>&nbsp; Q<br>&nbsp; q<br>&nbsp; 13 273 512 32 re<br>&nbsp; W n<br>&nbsp; BT<br>&nbsp; /Fabc11 29 Tf<br>&nbsp; 0 0.3569 0.5882 rg<br>&nbsp; 1 0 0 1 120 280 Tm<br>&nbsp; 0.062 Tc<br>&nbsp; (\346) Tj<br>&nbsp; ET<br>&nbsp; Q<br>&nbsp; q<br>&nbsp; 13 273 512 32 re<br>&nbsp; W n<br>&nbsp; BT<br>&nbsp; /Fabc11 29 Tf<br>&nbsp; 0 0.3569 0.5882 rg<br>&nbsp; 1 0 0 1 141 280 Tm<br>&nbsp; 97.8404 Tz<br>&nbsp; (rkning) Tj<br>&nbsp; ET <br><br>&nbsp; You can see how the word "Energimærkning" was built:<br>&nbsp; "Energim" + "\346" + "rkning"<br>&nbsp; This means three text blocks for QuickPDF. <br>&nbsp; <br>- What you can do: examine the string coordinates to check which strings belong together. 
It's not a 100 percent solution, but it will work most of the time ;-)<br>&nbsp; <br>&nbsp; Example:<br>&nbsp; <br>&nbsp; "Energim"<br>&nbsp; Font: "Verdana"<br>&nbsp; Textcolor: #000000<br>&nbsp; TextSize: 8.21<br>&nbsp; TextRect: &#091;(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)&#093;<br>&nbsp; <br>&nbsp; "æ"<br>&nbsp; Font: "Verdana"<br>&nbsp; Textcolor: #000000<br>&nbsp; TextSize: 8.21<br>&nbsp; TextRect: &#091;(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)&#093;<br>&nbsp; <br>&nbsp; "rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn"<br>&nbsp; Font: "Verdana"<br>&nbsp; Textcolor: #000000<br>&nbsp; TextSize: 8.21<br>&nbsp; TextRect: &#091;(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)&#093;<br><br>&nbsp; The right-hand corner of the first part (...(95,35|591,40)...) is identical to the left-hand corner of the second part (&#091;(95,35|591,40)...). The height is identical, too. The same holds for the following parts, so you can join the strings that belong together.<br><br>Perhaps you can use option 4 (word extraction) in this case...&nbsp; <br>&nbsp;<br>Best regards,<br>Uli<br><br>]]>
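    <![CDATA[<br>A minimal sketch of the coordinate check described above, written in plain Python rather than against the QuickPDF API. It assumes the fragments are already available as (text, TextRect) pairs in the format shown in the example, with decimal commas and four corners per rectangle. The names FRAGMENTS, parse_rect, touches and merge_fragments, and the 0.01 tolerance, are hypothetical and chosen only for illustration.<br><br>

```python
import math
import re

# Hypothetical input: (text, TextRect) pairs copied from the example above.
# Each corner is written as "x|y" with decimal commas.
FRAGMENTS = [
    ("Energim",
     "[(63,11|591,40) (95,35|591,40) (95,35|599,38) (63,11|599,38)]"),
    ("æ",
     "[(95,35|591,40) (102,30|591,40) (102,30|599,38) (95,35|599,38)]"),
    ("rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn",
     "[(102,30|591,40) (408,17|591,40) (408,17|599,38) (102,30|599,38)]"),
]

# Matches one corner such as "(95,35|591,40)".
CORNER = re.compile(r"\(([\d,]+)\|([\d,]+)\)")


def parse_rect(text):
    """Return the corners of a TextRect string as a list of (x, y) floats."""
    return [(float(x.replace(",", ".")), float(y.replace(",", ".")))
            for x, y in CORNER.findall(text)]


def touches(left, right, tol=0.01):
    """True if the second corner of `left` coincides with the first corner
    of `right` and both rectangles cover the same vertical range."""
    same_corner = all(math.isclose(a, b, abs_tol=tol)
                      for a, b in zip(left[1], right[0]))
    same_height = (math.isclose(left[0][1], right[0][1], abs_tol=tol)
                   and math.isclose(left[2][1], right[2][1], abs_tol=tol))
    return same_corner and same_height


def merge_fragments(fragments, tol=0.01):
    """Join consecutive fragments whose rectangles touch on the same line."""
    merged = []
    for text, rect_text in fragments:
        rect = parse_rect(rect_text)
        if merged and touches(merged[-1][1], rect, tol):
            prev_text, prev_rect = merged[-1]
            # Keep the left corners of the earlier fragment and take the
            # right corners from the fragment that was just appended.
            merged[-1] = (prev_text + text,
                          [prev_rect[0], rect[1], rect[2], prev_rect[3]])
        else:
            merged.append((text, rect))
    return merged
```

<br>Calling merge_fragments(FRAGMENTS) joins the three example fragments into one string starting with "Energimærkningen". As noted above, this is a heuristic: fragments that do not share a corner within the tolerance stay separate, so the tolerance may need adjusting per document.<br><br>]]>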
   </description>
   <pubDate>Fri, 28 Sep 2007 05:23:34 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3813.html#3813</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction with danish characters : Hi Uli! Before sending it to you...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3805.html#3805</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 789<br /><strong>Posted:</strong> 24 Sep 07 at 4:07PM<br /><br />Hi Uli!<br><br>Before sending it to you I thought to myself "don't embarrass yourself" and looked deeper. <br>With text extraction option 3 I get separate strings. Sometimes these are long strings (a complete row), but if there's a Danish character, that character becomes a separate string... and because my output is written string by string, there are some "rows" with only one Danish character.<br>With text extraction option 0 I get the text content of complete pages nearly as I see it in the PDF, and then everything is fine.<br><br>So the question remains whether it's possible to change how option 3 works...? And yes, I'm still using the open version 5.21 ;-)<br><br>I'll send you the Danish file and a code snippet. Thanks a lot in advance!<br>Best regards,<br>Ingo<br><br>]]>
   </description>
   <pubDate>Mon, 24 Sep 2007 16:07:52 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3805.html#3805</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction with danish characters : Hi Ingo,  after some short tests...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3804.html#3804</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=370">ukobsa</a><br /><strong>Subject:</strong> 789<br /><strong>Posted:</strong> 24 Sep 07 at 2:36PM<br /><br />Hi Ingo,<br /><br />After some short tests I think this is a document-dependent problem. Can you send me the file and the code that you use to extract the text? Am I right that you are using 5.21?<br /><br />Greetings,<br />Uli]]>
   </description>
   <pubDate>Mon, 24 Sep 2007 14:36:34 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3804.html#3804</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction with danish characters : Hi! I've documents with danish...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3801.html#3801</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 789<br /><strong>Posted:</strong> 21 Sep 07 at 2:57PM<br /><br />Hi!<br><br>I have documents with Danish text content. During extraction, the lines break wherever Danish characters appear.<br><br>I have this one line:<br>Energimærkningen oplyser om ejendommens energiforbrug, mulighederne for at opnå besparelser.<br>After extracting I get these lines:<br>Energim<br>æ<br>rkningen oplyser om ejendommens energiforbrug, mulighederne for at opn<br>å besparelser.<br><br>You see... whenever one of these (for me) strange characters appears, the line breaks.<br><br>Any advice on how to extract this in a better way?<br><br>Best regards and thanks for reading,<br>Ingo<br><br>]]>
   </description>
   <pubDate>Fri, 21 Sep 2007 14:57:18 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-with-danish-characters_topic789_post3801.html#3801</guid>
  </item> 
 </channel>
</rss>