<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : Textextraction: Determine the used codepage</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of; Debenu Quick PDF Library - PDF SDK Community Forum : I need help - I can help : Textextraction: Determine the used codepage]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Tue, 16 Jun 2026 05:24:51 +0000</pubDate>
  <lastBuildDate>Thu, 29 Dec 2011 19:53:09 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=2078</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi, here is some short advice on how to make...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8875.html#8875</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1569">edvoigt</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 29 Dec 11 at 7:53PM<br /><br />Hi,<br><br>here is some short advice on how to make it less dirty and more reliable.<br><br>The solution above determines whether a font uses afii codes. Consequently, you have to figure out which font is used in which part of the PDF text. Testing only whether such a font exists is rather inexact. A single page may contain a mixture of languages (and codepages). This is quite likely when parts of different PDFs have been combined into a new one. Therefore you need to walk first through the font definitions and then through the content.<br><br>Werner<br>]]>
   </description>
   <pubDate>Thu, 29 Dec 2011 19:53:09 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8875.html#8875</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi, this seems to be a good source...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8855.html#8855</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1569">edvoigt</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 23 Dec 11 at 1:41PM<br /><br />Hi,<br><br>this seems to be a good source:<br><br><a href="http://www.science.co.il/language/locale-codes.asp?s=codepage" target="_blank">http://www.science.co.il/language/locale-codes.asp?s=codepage</a><br><br>It has some further links.<br><br>You also need the mapping between afii codes and codepage.<br><br><br>Maybe it helps.<br><br>Werner<br>]]>
   </description>
   <pubDate>Fri, 23 Dec 2011 13:41:50 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8855.html#8855</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi Werner! Thanks a lot for this! Now...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8854.html#8854</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 23 Dec 11 at 12:53PM<br /><br />Hi&nbsp;Werner!<BR><BR>Thanks a lot for this!<BR>Now I can go further ...<BR>So the question is which encodings are the most widely used worldwide ;-)<DIV>&nbsp;</DIV><DIV>To you and everyone else here</DIV><DIV>a Merry Christmas and a Happy New Year,</DIV><DIV>Ingo</DIV><DIV>&nbsp;</DIV>]]>
   </description>
   <pubDate>Fri, 23 Dec 2011 12:53:28 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8854.html#8854</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi Ingo, Hi Andrew, the goal...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8853.html#8853</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1569">edvoigt</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 23 Dec 11 at 11:22AM<br /><br />Hi Ingo, Hi Andrew,<br><br>the goal is not to paint a glyph correctly, so it is enough to detect the use of codepage 1251 with its charset.<br><br>Here is a quick&amp;dirty solution without too much work:<br><br><font face="Courier New, Courier, mono">function IsQP1251(fn: string): boolean;<br>var<br>&nbsp; QP: TQuickPDF;<br>&nbsp; obj: string;<br>&nbsp; i, n, glyphno, error, p: integer;<br>begin<br>&nbsp; Result := false;<br>&nbsp; QP := TQuickPDF.Create;<br>&nbsp; try<br>&nbsp;&nbsp;&nbsp; if QP.UnlockKey({$I PDFkey.inc}) = 1 // 8.xx<br>&nbsp;&nbsp;&nbsp; then begin<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; QP.LoadFromFile(fn, '');<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; n := QP.GetObjectCount;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i := 1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; repeat // search for a PDF object /Encoding that contains afii codes for Cyrillic<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; obj := QP.GetObjectToString(i); // afii10017-afii10846<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (Pos('/Encoding', obj)&gt;0) and (Pos('/Differences', obj)&gt;0)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; then begin<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; p := Pos('afii10', obj);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (p&gt;0)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; then begin // 17..846<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Val(copy(obj, p+6, 3), glyphno, error);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result := (error=0) and (glyphno&gt;=17) and (glyphno&lt;=846);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; until (i&gt;n) or Result;<br>&nbsp;&nbsp;&nbsp; end;<br>&nbsp; finally<br>&nbsp;&nbsp;&nbsp; QP.Free; // release the library instance<br>&nbsp; end;<br>end;<br></font><br>This works under Delphi7 with QuickPDF 8.13<br><br>The basic idea is:<br><br>inside the PDF there is an encoding object, which may start like this:<br>/Type /Encoding /BaseEncoding /WinAnsiEncoding /Differences<br>So I go over all objects and look for /Encoding and /Differences.<br><br>The /Differences array is the key. It contains names starting with "afii" followed by a number. They denote a special interpretation of a byte value. <br><br>It is dirty because I am not sure whether such an /Encoding entry is always present in the PDF.<br><br>It is dirty because I take the short way directly to an object. It would be better to walk through the tree (/Resources =&gt; /Font =&gt; /Encoding), but the afii codes are rather unique in conjunction with /Differences.<br><br>I don't know whether this encoding object may also be compressed. In that case it would be more work, which is another reason for my wish (case 9605) for a <span id="Bugs">GetInflatedObjectTo... as big brother of the GetObjectTo... used in the solution above.</span><br><br>You can get an overview here:<br><a href="http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt" target="_blank">http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt</a><br><br>Another good source is<br><a href="http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5013.Cyrillic_Font_Spec.pdf" target="_blank">http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5013.Cyrillic_Font_Spec.pdf</a><br><br>The solution works with the two examples I got from Ingo. Maybe there is more to do, but it is a first start.<br><br><br>Cheers and Merry Christmas<br><br>Werner<br><span style="font-size:10px"><br /><br />Edited by edvoigt - 23 Dec 11 at 12:42PM</span>]]>
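    <![CDATA[
The function above looks only at the first 'afii10' match in each object. A minimal sketch of a slightly stricter check follows, assuming the object text has already been fetched as a string (for example with GetObjectToString as above). The name CountCyrillicAfii and the sample string are hypothetical and not part of QuickPDF; the code only needs PosEx from StrUtils (available since Delphi 7).

```pascal
program AfiiCount;
{$APPTYPE CONSOLE}

uses
  StrUtils; // PosEx

// Hypothetical helper: counts every glyph name afii10017..afii10846
// that occurs in the text of one PDF object.
function CountCyrillicAfii(const Obj: string): Integer;
var
  P, GlyphNo, Error: Integer;
begin
  Result := 0;
  P := PosEx('afii10', Obj, 1);
  while P > 0 do
  begin
    // the three digits after 'afii10' carry the glyph number
    Val(Copy(Obj, P + 6, 3), GlyphNo, Error);
    if (Error = 0) and (GlyphNo >= 17) and (GlyphNo <= 846) then
      Inc(Result);
    P := PosEx('afii10', Obj, P + 6); // continue behind this match
  end;
end;

begin
  // made-up sample text of an encoding object, for illustration only
  WriteLn(CountCyrillicAfii('/Differences [192 /afii10017 /afii10018 /afii10019]'));
end.
```

Counting the matches instead of stopping at the first one lets the caller require a minimum number of Cyrillic glyph names before treating a font as codepage 1251, which may reduce false positives.
    ]]>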
   </description>
   <pubDate>Fri, 23 Dec 2011 11:22:44 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8853.html#8853</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi Werner, Hi Andrew!  I begin...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8852.html#8852</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 23 Dec 11 at 8:13AM<br /><br />Hi Werner, Hi Andrew!<DIV>&nbsp;</DIV><DIV>I begin to realize ... really not an easy job ;-)</DIV><DIV>I have to read ... Thank you both for the links.</DIV><DIV>&nbsp;</DIV><DIV>Cheers, Ingo</DIV><DIV>&nbsp;</DIV>]]>
   </description>
   <pubDate>Fri, 23 Dec 2011 08:13:04 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8852.html#8852</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Extracting text is not an easy...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8851.html#8851</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1483">AndrewC</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 23 Dec 11 at 7:59AM<br /><br />Extracting text is not an easy process. &nbsp;Each font can be defined as either a single-byte or a multi-byte font, and it also contains an Encoding entry in the CMap, which is similar to a code page. &nbsp;To make things more complex, a font can also contain a ToUnicode mapping array to help text extraction routines. &nbsp;On top of that, a font can contain a Differences array, which can remap any character code to a new code.<div><br></div><div>Some documents also contain subsetted fonts, which allow you to remap any character code to any other code. This can make text extraction impossible, because you can tell the font to draw an 'A' while the font actually draws a 'B'.</div><div><br></div><div>QPL hides this functionality to make text extraction easy to use. &nbsp;GetPageText options 3 - 8 are more advanced than options 0 and 1.</div><div><br></div><div>The only way to understand what is going on is to look at the CMaps contained in each font.</div><div><br></div><div><a href="http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf" target="_blank">http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf</a>&nbsp;</div><div><br></div><div>Andrew<br><br></div>]]>
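    <![CDATA[
To illustrate how a Differences array remaps character codes, here is a minimal sketch in Delphi. It relies on the rule from the PDF specification that a number in the array sets the next character code and each following glyph name takes the next consecutive code. The procedure name ExpandDifferences and the sample array are hypothetical; the parsing is deliberately simple and assumes the text between [ and ] has already been cut out of the object.

```pascal
program DiffDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: expands the inside of a /Differences array
// into lines of the form "code=glyphname".
procedure ExpandDifferences(const Body: string; Output: TStrings);
var
  I, Start, Code: Integer;
begin
  Code := 0;
  I := 1;
  while I <= Length(Body) do
  begin
    if Body[I] = '/' then
    begin
      // glyph name: runs up to the next delimiter
      Inc(I);
      Start := I;
      while (I <= Length(Body)) and
            not (Body[I] in [' ', #9, #10, #13, '/', '[', ']']) do
        Inc(I);
      Output.Add(IntToStr(Code) + '=' + Copy(Body, Start, I - Start));
      Inc(Code); // the next name gets the next code
    end
    else if Body[I] in ['0'..'9'] then
    begin
      // number: sets the code for the next glyph name
      Start := I;
      while (I <= Length(Body)) and (Body[I] in ['0'..'9']) do
        Inc(I);
      Code := StrToInt(Copy(Body, Start, I - Start));
    end
    else
      Inc(I); // skip whitespace and brackets
  end;
end;

var
  Map: TStringList;
  I: Integer;
begin
  Map := TStringList.Create;
  try
    // made-up sample array, for illustration only
    ExpandDifferences('192 /afii10017 /afii10018 224 /afii10065', Map);
    for I := 0 to Map.Count - 1 do
      WriteLn(Map[I]);
  finally
    Map.Free;
  end;
end.
```

With the sample input, code 192 maps to afii10017, 193 to afii10018 and 224 to afii10065. The glyph names still have to be translated to Unicode, for example with the Adobe glyph list.
    ]]>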
   </description>
   <pubDate>Fri, 23 Dec 2011 07:59:07 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8851.html#8851</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi Werner! You've got an...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8850.html#8850</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 22 Dec 11 at 10:31PM<br /><br />Hi Werner!<br><br>You've got an email ... right now ;-)<br>Thanks in advance.<br><br>Cheers, Ingo<br><span style="font-size:10px"><br /><br />Edited by Ingo - 22 Dec 11 at 10:32PM</span>]]>
   </description>
   <pubDate>Thu, 22 Dec 2011 22:31:52 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8850.html#8850</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi Ingo, the PDF spec gives no...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8849.html#8849</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1569">edvoigt</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 22 Dec 11 at 10:19PM<br /><br />Hi Ingo,<br><br>the PDF spec gives no results if I search for "codepage", so it is not that simple. But there is an explanation in <u>5.5.5 Character Encoding</u>. I hope this is the key.<br><br>To verify this, it is necessary to take a longer look at an inflated encoding dictionary and an entry in it. For such a test, a PDF with codepage 1251 is needed. <br><br>This means a longer walk through some objects. Can you mail me such a PDF? <br><br>Explanations are of course better for me in German.<br><br>Cheers,<br><br>Werner<br>]]>
   </description>
   <pubDate>Thu, 22 Dec 2011 22:19:53 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8849.html#8849</guid>
  </item> 
  <item>
   <title><![CDATA[Textextraction: Determine the used codepage : Hi! Now myself ... ;-) I'm...]]></title>
   <link>http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8848.html#8848</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 2078<br /><strong>Posted:</strong> 22 Dec 11 at 3:36PM<br /><br />Hi! <DIV>&nbsp;</DIV><DIV>Now myself ... ;-)</DIV><DIV>I'm an extensive user of the extraction functionality; I'm still mostly working with version 7.26 together with Delphi 2007 and Delphi 5.</DIV><DIV>Currently I'm trying to get the text content as Unicode.</DIV><DIV>I can extract text written in Arabic, Russian and many other languages as long as it was created with UTF-8, and after extraction I get it all with ...</DIV><DIV>widestring := UTF8Decode(utf8string)&nbsp;</DIV><DIV>&nbsp;</DIV><DIV>As samples I have many foreign PDF documents, including a few Russian ones.</DIV><DIV>Some of them I can extract because they were made with UTF-8; other Russian documents fail because they were made with codepage 1251.</DIV><DIV>&nbsp;</DIV><DIV>My questions now:</DIV><DIV>Is there functionality available to detect the codepage used in a text content automatically?</DIV><DIV>How can codepage 1251 be decoded? Is there something similar to UTF8Decode?</DIV><DIV>&nbsp;</DIV><DIV>Hope someone can help me out.</DIV><DIV>Thanks a lot in advance.</DIV><DIV>&nbsp;</DIV><DIV>Cheers, Merry Christmas and a Happy New Year to all of you</DIV><DIV>Ingo</DIV><DIV>&nbsp;</DIV><span style="font-size:10px"><br /><br />Edited by Ingo - 22 Dec 11 at 3:39PM</span>]]>
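    <![CDATA[
For the second question, one possible counterpart to UTF8Decode on Windows is the Win32 function MultiByteToWideChar called with code page 1251. The sketch below assumes the extracted text arrives as raw single-byte data in an AnsiString; the function name Cp1251ToWide and the sample bytes are hypothetical, and this is not a QuickPDF call.

```pascal
program Cp1251Demo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: converts Windows-1251 bytes to a WideString.
function Cp1251ToWide(const S: AnsiString): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // first call: ask for the required number of wide characters
  Len := MultiByteToWideChar(1251, 0, PAnsiChar(S), Length(S), nil, 0);
  SetLength(Result, Len);
  // second call: perform the conversion
  if Len > 0 then
    MultiByteToWideChar(1251, 0, PAnsiChar(S), Length(S), PWideChar(Result), Len);
end;

const
  // the Russian word "Privet" encoded in Windows-1251 (first letter is U+041F)
  Sample: array[0..5] of Byte = ($CF, $F0, $E8, $E2, $E5, $F2);
var
  A: AnsiString;
  W: WideString;
  I: Integer;
begin
  SetString(A, PAnsiChar(@Sample[0]), Length(Sample));
  W := Cp1251ToWide(A);
  // print the resulting UTF-16 code units as hex
  for I := 1 to Length(W) do
    WriteLn(IntToHex(Ord(W[I]), 4));
end.
```

Any other single-byte code page can be decoded the same way by passing its number instead of 1251, once the code page of the text is known.
    ]]>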
   </description>
   <pubDate>Thu, 22 Dec 2011 15:36:37 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/textextraction-determine-the-used-codepage_topic2078_post8848.html#8848</guid>
  </item> 
 </channel>
</rss>