ExtractFilePageText Inconsistencies (ANSI/Unicode)


Hi there,
I have some code that extracts text from a PDF document, like this:

for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance) 
	ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance,ls_filename,"",ll_page,7)
next

This works, and I really like how ExtractOption = 7 preserves the formatting of the text in the PDF. After scrutinizing the result, though, I found a problem. For documents that contain telephone numbers such as "555-1234", ExtractOption = 7 ends up excluding the phone number altogether. I soon realized that it has nothing to do with the text being a phone number; rather, the hyphen causes the entire word (or phone number) to be removed from the extracted text. Here is a snippet of the extracted text:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
                                or (250)             (Victoria).

Here is a snippet of the text I'd expect:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
1-800-663-7206 or (250) 952-2668 (Victoria).

Digging even further, I've realized that it's not the hyphen's fault either; this looks like an ANSI vs. Unicode issue. The 'hyphen' isn't actually a hyphen, it's an en dash character, which is Unicode and not ANSI. It seems that the entire word is removed if it contains a Unicode character.
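A side note on terminology, as a minimal sketch in Python rather than the PowerScript used above: whether the en dash counts as "ANSI" depends on which code page is meant. The check below assumes that "ANSI" refers to either plain ASCII, Latin-1, or the Windows-1252 code page; the helper name `encodable` is made up for illustration.

```python
EN_DASH = "\u2013"       # the character that appears in the PDF text
HYPHEN_MINUS = "\u002d"  # the ordinary keyboard hyphen

def encodable(ch: str, encoding: str) -> bool:
    """Return True if `ch` can be represented in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Print, per encoding, whether the hyphen and the en dash are representable.
for name in ("ascii", "latin-1", "cp1252"):
    print(name, encodable(HYPHEN_MINUS, name), encodable(EN_DASH, name))
```

The en dash is missing from ASCII and Latin-1 but does have a slot in Windows-1252, so the character set alone may not fully explain why the word disappears.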

This is inconsistent, because if I change my code to use ExtractOption = 0, it has no problem dealing with the Unicode character: it simply discards that character, resulting in text that looks like this:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
18006637206 or (250) 9522668 (Victoria).

To me, this scenario is much more desirable than the previous scenario; however, there is clearly an inconsistency with how this is working.
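One way to spot pages affected by this is to compare the two outputs token by token. The following is a rough sketch in Python (not PowerScript), using the two snippets quoted above as sample data; `missing_tokens` is a hypothetical helper, and it assumes both extraction modes otherwise produce the same whitespace-separated words, which may not hold for every document.

```python
from collections import Counter
from typing import List

def missing_tokens(plain_text: str, layout_text: str) -> List[str]:
    """List tokens found in the option 0 text that are absent from the option 7 text."""
    remaining = Counter(layout_text.split())
    missing = []
    for token in plain_text.split():
        if remaining[token] > 0:
            remaining[token] -= 1   # matched: consume one occurrence
        else:
            missing.append(token)   # no counterpart in the layout text
    return missing

# Sample data copied from the snippets in this post.
option0_text = (
    "lf you have any difficulties or questions, please call the Teleplan Support Centre at\n"
    "18006637206 or (250) 9522668 (Victoria)."
)
option7_text = (
    "lf you have any difficulties or questions, please call the Teleplan Support Centre at\n"
    "                                or (250)             (Victoria)."
)

print(missing_tokens(option0_text, option7_text))
```

For the sample above, the tokens reported are the two phone-number fragments that option 7 dropped.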

Is there anything I can do so that ExtractOption = 7 discards the Unicode characters (as ExtractOption = 0 does) rather than discarding the entire word?

Thanks in advance for any help that someone might be able to provide.
-John

Original post by aitchisj (Beginner, joined 01 Jun 12), posted 05 Jun 12 at 11:37PM

Reply by AndrewC (Moderator Group, joined 08 Dec 10, Geelong, Aust), posted 07 Jun 12 at 2:08PM
There will be some fixes in the 8.16 beta 3 release to improve this. The PDF was using a composite font, and the hyphen character was not defined in the PDF font. It will now be replaced with a space character. Options 0, 1 and 2 use a totally different method for text extraction than options 3 - 8.
Andrew.
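The three behaviours discussed in this thread can be illustrated with a toy model. This is only a sketch in Python under stated assumptions, not the library's actual implementation: `None` stands for a character the font cannot map to text, and all function names are invented for the illustration.

```python
from typing import List, Optional

Glyphs = List[List[Optional[str]]]  # words, each a list of characters or None

def as_glyphs(text: str, unmapped: str = "\u2013") -> Glyphs:
    """Split text into words and mark the unmapped character as None."""
    return [[None if c == unmapped else c for c in word] for word in text.split()]

def drop_word(words: Glyphs) -> str:
    """Behaviour reported for option 7: a word with an unmapped character vanishes."""
    return " ".join("".join(w) for w in words if None not in w)

def drop_char(words: Glyphs) -> str:
    """Behaviour reported for option 0: only the unmapped character is discarded."""
    return " ".join("".join(c for c in w if c is not None) for w in words)

def space_for_unmapped(words: Glyphs) -> str:
    """Behaviour described for the fix: the unmapped character becomes a space."""
    return " ".join("".join(" " if c is None else c for c in w) for w in words)

words = as_glyphs("or (250) 952\u20132668 (Victoria).")
print(drop_word(words))
print(drop_char(words))
print(space_for_unmapped(words))
```

In this model the third variant keeps the digits and leaves a gap where the dash was, which is closer to the expected text than either of the other two.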

Reply by aitchisj (Beginner, joined 01 Jun 12), posted 07 Jun 12 at 5:12PM
Andrew,
I appreciate the quick response and hope that this will be resolved in a future release of QPL.
Have a great day,
John