<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : text extraction</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of: Debenu Quick PDF Library - PDF SDK Community Forum : I need help - I can help : text extraction]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Mon, 25 May 2026 11:13:30 +0000</pubDate>
  <lastBuildDate>Wed, 09 Mar 2022 12:10:59 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=3974</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[text extraction : Just to save others the trouble,...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16044.html#16044</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1388">tfrost</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 09 Mar 22 at 12:10PM<br /><br />Just to save others the trouble: I found that the DA versions of these functions produce the same garbage output, although the source code for the top-level functions is completely different.<div><br></div><div>The winsoft.sk PDFIUM wrapper has an extract-text sample which extracts all four of your examples correctly. It is available for Delphi or .NET, but I have only tested the former. I feel able to mention it here because of the QPDF EOL announcement and the slim chance of this being improved in QPDF.</div>]]>
   </description>
   <pubDate>Wed, 09 Mar 2022 12:10:59 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16044.html#16044</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : I've got the same described...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16043.html#16043</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 08 Mar 22 at 10:14PM<br /><br />I've got the same described issues with normal <br>GetPageText with option 7 and option 3!<br><br>My sample looks like this:<br><br>&nbsp;<br>&nbsp;--- page 1 from 1 --- <br>&nbsp;<br>"Arial";000000;0;234/118;"Text1"<br><br>"AAAAAA+ISOCPEUR";000000;105;234/241;"T㠀䂎 쁰䀀䂏㎀ӷ툠彯切䁚T�ĕὈˑ糌˔ꯣ}꼹}e㠀䂎 쁰䀀䂏㎀ӷ툠彯切䁚 䂀 䂀eꄸӲ�ĕ艔ĕὈˑ糌˔ꯣ}꼹}꽊}냎}궠Ӯ군Ӯx㠀䂎 쁰䀀䂏㎀ӷ툠彯切䁚　䁼㠀䂎x鏨Ӻ萴ĉ분đὈˑ糌˔ꯣ}꼹}꽊}냎}궠Ӯ군Ӯ῜ˑt㠀䂎 쁰䀀䂏㎀ӷ툠彯切䁚　䁼⠀䂖t׭脴Č戌ĎὈˑ糌˔ꯣ}꼹}꽊}냎}궠Ӯ군Ӯ῜ˑ2㠀䂎 쁰䀀䂏㎀ӷ툠彯切䁚　䁸"<br><br>"AAAAAA+Tahoma";000000;93;234/369;"T⠀䂐　쁺䀀䂏㎀ӷ氠穸咥䁗T脴Č׭耼˔ꯣ}꼹}e⠀䂐　쁺䀀䂏㎀ӷ氠穸咥䁗䀀䂂䀀䂂eꄸӲ脴Č矌Č׭耼˔ꯣ}꼹}꽊}냎}궠Ӯ군Ӯx⠀䂐　쁺䀀䂏㎀ӷ氠穸咥䁗瀀䂀堀䂑x鏨Ӻ脴Č鈌Ď׭耼˔ꯣ}꼹}꽊}냎}궠Ӯ군ӮὌˑt⠀䂐　쁺䀀䂏㎀ӷ氠穸咥䁗䁾᐀䂙t׭脴Č戌Ď׭耼˔ꯣ}꼹}꽊}냎}궠Ӯ군ӮὌˑ3⠀䂐　쁺䀀䂏㎀ӷ氠穸咥䁗䁴"<br><br>"AAAAAA+ArialMT";000000;95;234/497;"T䀀䂐倀쁴䀀䂏㎀ӷ鄀ﭾ뀺䁗T矌ĕ䋸ˑ׭갦}꼹}e䀀䂐倀쁴䀀䂏㿰㎀ӷ鄀ﭾ뀺䁗က䂃᠀䂃က䂃eꄸӲ职ĕ䋸ˑ׭갦}꼹}꽊}냎}궠Ӯ군Ӯx䀀䂐倀쁴䀀䂏㿰㎀ӷ鄀ﭾ뀺䁗က䂃怀䂁㠀䂒x鏨Ӻ脴Č职ĕ䋸ˑ׭갦}꼹}꽊}냎}궠Ӯ군Ӯ׭׭t䀀䂐倀쁴䀀䂏뿰㎀ӷ鄀ﭾ뀺䁗ఀ䂚䀀䁿ఀ䂚tৈ׭脴Č䋸ˑ׭갦}꼹}꽊}냎}궠Ӯ군Ӯ׭׭4䀀䂐倀쁴䀀䂏㿰㎀ӷ鄀ﭾ뀺䁗怀䂞怀䁱&nbsp;&nbsp;&nbsp; "<br><br>]]>
   </description>
   <pubDate>Tue, 08 Mar 2022 22:14:37 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16043.html#16043</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : I think it's a bug (which...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16042.html#16042</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1388">tfrost</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 08 Mar 22 at 3:35PM<br /><br />I think it's a bug (which I can reproduce) in GetTextBlockText. After each valid character in the 130 characters returned for each (5-character) line of your text, there is a block of garbage 8-bit characters which form the mojibake string. This happens in every text line except your first one. Dropping all characters not in 0..9, a..z, A..Z reveals the correct results.<div><br></div><div>It looks as if GetPageText(0) sanitizes the output somehow, to get rid of all the garbage. But it is hard to tell from the source where it does this, because the primary filter in this function is the area covered, and the filtering on content must then happen below this. My (ancient) copy of Debenu PDF Tools Pro must use this function, because it also displays all the text correctly when extracting.</div><div><br></div><div>Interestingly, I opened your PDF in Affinity Publisher (which can edit PDFs), and found that Text1 completely disappears, whereas the other three are clear and editable, in substituted fonts Consolas, Tahoma and Arial respectively.</div>]]>
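A minimal sketch of the filtering workaround described above, written in Python because it is plain string post-processing and does not depend on the library. The function name strip_garbage and the sample string are made up for illustration; the sample is not the real output of GetTextBlockText. Note that this also drops spaces and punctuation, so it only suits test strings like the ones in this thread.

```python
import string

# Characters the workaround keeps: 0-9, a-z, A-Z (ASCII only).
ALLOWED = frozenset(string.ascii_letters + string.digits)


def strip_garbage(block_text: str) -> str:
    """Return block_text with every character outside 0-9, a-z, A-Z removed."""
    return "".join(ch for ch in block_text if ch in ALLOWED)


# Hypothetical sample: each valid character is followed by junk code points,
# loosely imitating the mojibake posted in this thread.
sample = (
    "T\u3800\u408e"
    "e\uc070\u4000"
    "x\u33b7\ud220"
    "t\u5f6f\u5207"
    "2"
)

print(strip_garbage(sample))  # prints Text2
```

An explicit ASCII allow-list is used instead of str.isalnum(), because isalnum() also returns True for many CJK and Hangul characters and would let most of the garbage through.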
   </description>
   <pubDate>Tue, 08 Mar 2022 15:35:28 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16042.html#16042</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Hi, at the following link you find...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16041.html#16041</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=3315">BAULOG</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 08 Mar 22 at 1:04PM<br /><br />Hi,<div><br></div><div><br></div><div>At the following link you will find a sample file with four lines of text, each in a different font.</div><div>I only get the first line as readable text.</div><div><br></div><div><a href="http://cloud.baulog.de/index.php/s/4cSjGXdLXfcZqeJ" target="_blank" rel="nofollow">https://cloud.baulog.de/index.php/s/4cSjGXdLXfcZqeJ</a></div><div><br></div><div>Maybe someone else has an idea.</div><div>Thanks for your help.</div><div><br></div><div>Peter</div>]]>
   </description>
   <pubDate>Tue, 08 Mar 2022 13:04:30 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16041.html#16041</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Your "hieroglyphics"...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16040.html#16040</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1388">tfrost</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 28 Feb 22 at 2:04PM<br /><br />Your "hieroglyphics" look to me like 'mojibake', which is the name for Japanese (or Chinese) characters appearing when you mix up ASCII and DBCS. Have you checked in a debugger that you are not accidentally assigning the wrong character size somewhere?]]>
   </description>
   <pubDate>Mon, 28 Feb 2022 14:04:01 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16040.html#16040</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Hi Peter, perhaps the PDF is encrypted? I...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16039.html#16039</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 27 Feb 22 at 8:20PM<br /><br />Hi Peter,<br><br>Perhaps the PDF is encrypted?<br>I don't see a decryption step in your code...<br>There are free, ad-supported online file hosts that offer a bit of space.<br>You could upload your sample PDFs to one of them and post the link here.<br><br>]]>
   </description>
   <pubDate>Sun, 27 Feb 2022 20:20:48 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16039.html#16039</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Hi Ingo, thank you very much for...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16034.html#16034</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=3315">BAULOG</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 23 Feb 22 at 8:02AM<br /><br />Hi Ingo,<div><br></div><div>Thank you very much for the quick answer. Shall we stick with English, so that other users can benefit from it too?</div><div><br></div><div>Here is the code snippet.</div><div><br></div><div>I have omitted the variable declarations (all variable names start with "_").</div><div>PDFLib is the reference to the Debenu DLL.</div><div><br></div><div><div>If FileOpen(_File, _ID) Then</div><div>&nbsp; _PageCount = PDFLib.PageCount()</div><div><br></div><div>&nbsp; For _PageCounter = 1 To _ToPage</div><div>&nbsp; &nbsp; _PCounter += 1</div><div>&nbsp; &nbsp; _TextCounter = 0</div><div>&nbsp; &nbsp; PDFLib.SelectPage(_PageCounter)</div><div>&nbsp; &nbsp; PDFLib.NormalizePage(0)</div><div>&nbsp; &nbsp; _TBID = PDFLib.ExtractPageTextBlocks(3)</div><div>&nbsp; &nbsp; _TBCount = PDFLib.GetTextBlockCount(_TBID)</div><div><br></div><div>&nbsp; &nbsp; For _TBCounter = 1 To _TBCount</div><div>&nbsp; &nbsp; &nbsp; _TextList.add(PDFLib.GetTextBlockText(_TBID, _TBCounter))</div><div>&nbsp; &nbsp; Next</div><div><br></div><div>&nbsp; Next</div></div><div><br></div><div>FileClose(_ID)</div><div><br></div><div>End If</div><div><br></div><div>The PDF file was generated by a CAD application. None of the TTF text (Arial etc.) can be read correctly. The SHX fonts (CAD-specific) are read correctly.</div><div><br></div><div>I would like to attach the PDF file, but I can't find a way to do it.</div><div><br></div><div>Many thanks</div><div><br></div><div>Peter</div><span style="font-size:10px"><br /><br />Edited by BAULOG - 23 Feb 22 at 8:05AM</span>]]>
   </description>
   <pubDate>Wed, 23 Feb 2022 08:02:35 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16034.html#16034</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Additional hint: A CombineContentStream...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16033.html#16033</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 22 Feb 22 at 8:49PM<br /><br />Additional hint:<br>A CombineContentStream and NormalizePage can help as well...<br>]]>
   </description>
   <pubDate>Tue, 22 Feb 2022 20:49:37 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16033.html#16033</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Hi Peter :) It looks as if a Decrypt...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16032.html#16032</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 22 Feb 22 at 8:48PM<br /><br />Hi Peter :) <br><br>It looks as if a Decrypt is missing after load and before extraction?<br>BTW: GetTextBlockText needs an ExtractPageTextBlocks first - what's the result of this function? Perhaps a longer, relevant code snippet makes it easier to say more...<br><br>Cheers and welcome here,<br>Ingo (living near Bremen ;-)<br>]]>
   </description>
   <pubDate>Tue, 22 Feb 2022 20:48:15 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16032.html#16032</guid>
  </item> 
  <item>
   <title><![CDATA[text extraction : Hi, at first, hello and greetings...]]></title>
   <link>http://www.quickpdf.org/forum/text-extraction_topic3974_post16031.html#16031</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=3315">BAULOG</a><br /><strong>Subject:</strong> 3974<br /><strong>Posted:</strong> 22 Feb 22 at 1:33PM<br /><br />Hi,<div><br></div><div><div>at first, hello and greetings from Germany.</div><div>We have been using the Quick PDF library for many years and have solved all tasks well. But now I have a problem for which I cannot find a solution.</div><div><br></div><div>The problem is text extraction.&nbsp;</div><div>With TTF fonts I get with the function&nbsp;</div><div><br></div><div>GetTextBlockText(_TBID, _TBCounter))&nbsp;</div><div><br></div><div>I always get hieroglyphics like</div><div>&nbsp;</div><div>T㠀䂎 쁰䀀䂏㴷砄␋Ȩ䁓T쳤視﫸〶�⻈ă徆㙺ăă抙㙺ăe㠀䂎 쁰䀀䂏㴷砄␋Ȩ䁓 䂀 䂀eࣘ〨튴밁젼視﫸〶�⻈ă徆㙺ăă抙㙺ăă抪㙺ăă搮㙺ăꚐ㯐Ꙡ㯐x㠀䂎 쁰䀀䂏㿰㴷砄␋Ȩ䁓 䂎 䁼 䂎x㯘쳤밁﫸〶⻍ă徆㙺ăă抙㙺ăă抪㙺ăă搮㙺ăꚐ㯐Ꙡ㯐�⻈⻍t㠀䂎 쁰䀀䂏㿰㴷砄␋Ȩ䁓 䂎 䁼␀䂖t㯘튴밁꼤륫﫸〶⻍ă徆㙺ăă抙</div><div><br></div><div>If I get the whole text of the page with&nbsp;</div><div><br></div><div>GetPageText(0)</div><div><br></div><div>I get the correct text</div><div><br></div><div>"Text"</div><div><br></div><div>Can anyone help me?</div><div><br></div><div>regards</div></div><div><br></div><div>Peter</div>]]>
   </description>
   <pubDate>Tue, 22 Feb 2022 13:33:00 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/text-extraction_topic3974_post16031.html#16031</guid>
  </item> 
 </channel>
</rss>