ExtractFilePageText - Underscores

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1503
Printed Date: 05 Apr 26 at 5:34AM

Posted By: masterofdesaster
Subject: ExtractFilePageText - Underscores
Date Posted: 28 Jun 10 at 9:04PM

Hi, I'm using the DLL Version.

I have a PDF that contains "OBET_2007". The ExtractFilePageText method splits this into two words, OBET and 2007, which is not what I want.

Is there a setting/dictionary so I get the whole word?

thank you very much

Hanspeter Stutz

Replies:

Posted By: Ingo
Date Posted: 28 Jun 10 at 10:46PM

Hi!

Which extraction option are you using?
If you are extracting string by string and the "2007" was inserted later, then that string sits in a completely different part of the file content (but with the correct position data).
What I mean is: try the different options and see whether the results differ. You should try option 0.
This has nothing to do with the library. First in, first out and so on... you'll know what I mean ;-)

Cheers and welcome here, Ingo

Posted By: masterofdesaster
Date Posted: 29 Jun 10 at 4:19PM

Hi Ingo,

Thanks for the welcome, appreciated!

First I have to say that I have absolutely no experience with how this library, or this kind of thing in general, works, so apologies for any dumb questions :-)

I tried different options, and for my usage 3 or 4 works best. I have to identify each page by a number that is always on the same line, except for this OBET_2007.

I can easily work around this, but I am just curious why it happens. Can you recommend something I should read to understand it better?

cheers

Hanspeter

Posted By: Ingo
Date Posted: 29 Jun 10 at 8:03PM

Hi HP!

If you want to use option 3 or 4, there is nothing you can do about it.
Did you see the same behavior with other underscores, too?
Perhaps the page was created some time ago with "OBET_2006", and the extracted string had (as an example) the position line 5, column 6, Arial, 10, "OBET_2006".
The page was finished, but then the "2006" had to be replaced by "2007". Our sample string is now line 5, column 6, Arial, 10, "OBET_", and at the end of the text content there is a new string with line 5, column 11, Arial, 10, "2007".
First in, first out / last in, last out.
When a PDF reader displays a page, it collects all the strings of the page and puts them into the correct sequence according to their position data.
If you use option 3 or 4 for extraction, the sequence of the strings can differ from that.
With option 0, QuickPDF does the thinking for you and puts the strings into the correct sequence, but it has other disadvantages. It's your choice.
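
To make the idea concrete, here is a minimal Python sketch of what "putting the strings into the correct sequence" could look like. It does not call the library at all: the Fragment type, the line/column fields and the sample data are hypothetical stand-ins built from the example above, and a real PDF stores coordinates in points rather than character columns.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with simplified, hypothetical position data."""
    line: int
    column: int
    text: str


def reading_order(fragments):
    # Sort by position instead of trusting the order in the file content.
    return sorted(fragments, key=lambda f: (f.line, f.column))


def merge_adjacent(fragments):
    # Join fragments on the same line when one starts exactly where
    # the previous one ends (assumes one column per character).
    merged = []
    for frag in reading_order(fragments):
        last = merged[-1] if merged else None
        if (last is not None
                and last.line == frag.line
                and last.column + len(last.text) == frag.column):
            merged[-1] = Fragment(last.line, last.column, last.text + frag.text)
        else:
            merged.append(frag)
    return merged


# Order as stored in the file: "2007" was appended after the page was finished.
stream_order = [
    Fragment(5, 6, "OBET_"),
    Fragment(6, 1, "Page total"),
    Fragment(5, 11, "2007"),
]

words = [f.text for f in merge_adjacent(stream_order)]
print(words)
```

With real output from option 3 or 4 you would need a tolerance on the positions rather than an exact column match, since the gap between two strings depends on the font and size.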

Cheers, Ingo

Posted By: masterofdesaster
Date Posted: 06 Jul 10 at 4:29PM

Hi Ingo,

OK, I understand now, and I have it working. Thanks for your help.

cheers

Hanspeter