<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : Problem with text extraction</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of: Debenu Quick PDF Library - PDF SDK Community Forum : I need help - I can help : Problem with text extraction]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Sat, 04 Apr 2026 23:09:27 +0000</pubDate>
  <lastBuildDate>Sat, 23 Nov 2013 20:56:11 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=2788</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[Problem with text extraction : Hi all, I am trying to extract...]]></title>
   <link>http://www.quickpdf.org/forum/problem-with-text-extraction_topic2788_post11352.html#11352</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=2501">rstojnic</a><br /><strong>Subject:</strong> 2788<br /><strong>Posted:</strong> 23 Nov 13 at 8:56PM<br /><br />Hi all,<div><br></div><div>I am trying to extract text from this PDF (on Mac):</div><div><br></div><div>http://research.microsoft.com/pubs/145347/bodypartrecognition.pdf</div><div><br></div><div>These are the results of my efforts so far:</div><div><p>&#091;pdfText appendString:&#091;DQPL ExtractFilePageText:pdfFilePath :@"" :nPage+1 :0&#093;&#093;;</p></div><div>-&gt; crashes with an unmapped memory exception deep in the Debenu code. Options 7 and 8 work, but return only part of the text; roughly half of it is missing.</div><div><br></div>
    <div>&#091;pdfText appendString:&#091;DQPL ExtractFilePageText:pdfFilePath :@"" :nPage+1 :5&#093;&#093;;</div><div><br></div><div>Returns only half of the text on the page. The last line of the CSV output is only partially written to the string, which makes me think it crashes silently, although the code apparently runs fine and outputs to the file. The same is true for options 3, 4 and 6.</div><div><br></div><div>Using the following code:</div>
    <div><p>&nbsp; &nbsp; &nbsp; &nbsp; int textblockID = &#091;DQPL ExtractFilePageTextBlocks:pdfPath :@"" :1 :3&#093;;</p><p>&nbsp; &nbsp; &nbsp; &nbsp; int count = &#091;DQPL GetTextBlockCount:textblockID&#093;;</p><p>&nbsp; &nbsp; &nbsp; &nbsp; for(int i=0;i&lt;count;i++){</p><p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NSString *line = &#091;DQPL GetTextBlockText:textblockID :i+1&#093;;</p><p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NSLog(@"Page 1 block %d = %@", i+1, line);</p><p>&nbsp; &nbsp; &nbsp; &nbsp; }</p>
    <p>I can extract only every other text block on the page, although the loop does reach the end of the page. So half of the text is missing again, but a different half than before!</p><p>Any pointers on what I might be doing wrong would be greatly appreciated! The PDF renders fine, which makes me think that the problem is in the text extraction code.</p><p>On a related note: is it possible to get glyph information _before_ it is put into blocks? For example, the CSV output on an individual glyph basis, without any processing? I think that would be very useful.</p><p>Cheers, R.</p></div>]]>
   </description>
   <pubDate>Sat, 23 Nov 2013 20:56:11 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/problem-with-text-extraction_topic2788_post11352.html#11352</guid>
  </item> 
 </channel>
</rss>