ExtractFilePageText Inconsistencies (ANSI/Unicode)

aitchisj (Beginner)
Posted: 05 Jun 12 at 11:37PM
Hi There,

I have some code that extracts text from a PDF document like this:

// Concatenate the text of every page, using ExtractOption = 7
for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance)
    ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance, ls_filename, "", ll_page, 7)
next

This is working and I really like how ExtractOption = 7 is able to preserve the formatting of text in the PDF.  After scrutinizing the result, I realize there is a bit of a problem.  For documents which contain telephone numbers that look something like "555-1234", using ExtractOption = 7 ends up excluding the phone number altogether.  I soon realized it has nothing to do with it being a phone number, but rather the hyphen is the problem and causes the entire word (or phone number) to be removed from the extracted text.  Here is a snippet of the text that is extracted:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
                                or (250)             (Victoria).

Here is a snippet of the text I'd expect:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
1-800-663-7206 or (250) 952-2668 (Victoria).

Digging even further, I've realized that it's not the hyphen's fault either; this is an ANSI vs. Unicode issue.  The 'hyphen' isn't actually a hyphen, it's an en dash character, which is Unicode and not ANSI.  It seems that the entire word is being removed if it contains a Unicode character.
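
For anyone wanting to check their own text, one quick way to tell a hyphen from an en dash is to list every non-ASCII character with its code point. This is only a minimal Python sketch run on a plain string, not part of Quick PDF Library; the helper name `non_ascii_chars` and the sample string are made up for illustration.

```python
import unicodedata

def non_ascii_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# Sample built with en dashes (U+2013) instead of ASCII hyphens (U+002D)
sample = "1\u2013800\u2013663\u20137206"
for entry in non_ascii_chars(sample):
    print(entry)
```

A string typed with ordinary hyphens produces an empty list, so anything that shows up here is a candidate for the character that is causing trouble.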

This is inconsistent because if I change my code to use ExtractOption = 0, it has no problem dealing with the Unicode character and simply discards it, resulting in text that looks like this:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
18006637206 or (250) 9522668 (Victoria).

To me, this scenario is much more desirable than the previous scenario; however, there is clearly an inconsistency with how this is working.

Is there anything I can do to make it so that I can use ExtractOption = 7 and have it discard the Unicode characters (as is done for ExtractOption = 0) rather than discarding the entire word?
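
To make the behaviour I'm asking for concrete, here is a rough Python sketch of "discard only the offending character, keep the rest of the word". It works on a plain string and is not a Quick PDF Library call; `drop_unencodable` and `normalize_dashes` are hypothetical helper names, and it cannot recover words that the extraction has already dropped.

```python
def drop_unencodable(text, encoding="ascii"):
    """Drop only the characters that cannot be encoded, keeping the rest."""
    return text.encode(encoding, errors="ignore").decode(encoding)

def normalize_dashes(text):
    """Alternative: turn en dashes (U+2013) into ASCII hyphens instead."""
    return text.replace("\u2013", "-")

line = "1\u2013800\u2013663\u20137206 or (250) 952\u20132668 (Victoria)."
print(drop_unencodable(line))   # digits survive, dashes are discarded
print(normalize_dashes(line))   # dashes become ordinary hyphens
```

The first helper mirrors what ExtractOption = 0 appears to do in my output above; the second would be even better for phone numbers, since the separators are kept.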

Thanks in advance for any help that someone might be able to provide.
-John

AndrewC (Moderator Group)
Posted: 07 Jun 12 at 2:08PM
There will be some fixes in the 8.16 beta 3 release to improve this.

The PDF was using a composite font and the hyphen character was not defined in the PDF font.  It will now be replaced with a space character.
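
As a rough illustration of that behaviour only (this is not the library's actual code), the idea is that a character code with no entry in the font's mapping becomes a space instead of causing the whole word to be dropped. In this Python sketch the function `decode_glyphs`, the `to_unicode` table, and the code values are all invented for the example.

```python
def decode_glyphs(codes, to_unicode):
    """Map character codes to text; codes missing from the table become a space."""
    return "".join(to_unicode.get(code, " ") for code in codes)

# Hypothetical mapping: the font defines the digits but not the dash (code 99)
to_unicode = {1: "5", 2: "1", 3: "2", 4: "3", 5: "4"}
codes = [1, 1, 1, 99, 2, 3, 4, 5]

print(repr(decode_glyphs(codes, to_unicode)))
```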

Options 0, 1 and 2 use a totally different method for text extraction than options 3 to 8.

Andrew.

aitchisj (Beginner)
Posted: 07 Jun 12 at 5:12PM
Andrew,

I appreciate the quick response and hope that this will be resolved in a future release of QPL.  
Have a great day,
John