ExtractFilePageText - Underscores



masterofdesaster (Beginner, Joined: 28 Jun 10, Location: Switzerland)	Posted: 28 Jun 10 at 9:04PM
	Hi, I'm using the DLL version. I have a PDF that contains "OBET_2007". The ExtractFilePageText method splits this into two words, OBET and 2007, which is not what I want. Is there a setting or dictionary so that I get the whole word? Thank you very much, Hanspeter Stutz

Ingo (Moderator Group, Joined: 29 Oct 05)	Posted: 28 Jun 10 at 10:46PM
	Hi! Which extraction options are you using? If you're extracting string by string and the "2007" was inserted later, then this string sits in a completely different part of the file content (but with the correct position data). What I want to say: try different options and see whether there are differences. You should try option 0. This has nothing to do with the library. First in, first out and so on... you'll know what I mean ;-) Cheers and welcome here, Ingo (Edited by Ingo - 28 Jun 10 at 10:47PM)

masterofdesaster (Beginner, Joined: 28 Jun 10, Location: Switzerland)	Posted: 29 Jun 10 at 4:19PM
	Hi Ingo, Thanks for the welcome, appreciated! First I have to say that I have absolutely no experience with how this library, or this kind of thing in general, works, so I apologize for any dumb questions :-) I tried different options, and for my usage 3 or 4 is best. I have to identify each page based on a number that is always on the same line, except for this OBET_2007. I can easily work around this, but I am curious why it happens. Can you recommend something I should read to understand it better? Cheers, Hanspeter

Ingo (Moderator Group, Joined: 29 Oct 05)	Posted: 29 Jun 10 at 8:03PM
	Hi HP! If you want to use option 3/4 then there is nothing you can do about it. Did you see the described behavior with other underscores, too? Perhaps the page was created some time ago with "OBET_2006", and the extracted string could have had the position (as an example) line 5, column 6, Arial, 10, "OBET_2006". The page was complete, but then the "2006" had to be replaced by "2007". Our sample string is now line 5, column 6, Arial, 10, "OBET_". At the end of the text content there is a new string with line 5, column 11, Arial, 10, "2007". First in, first out / last in, last out. When displaying a page, a PDF reader collects all the strings of the page and uses the position data to put them into the correct sequence. If you use option 3/4 for extraction, the sequence of the strings can differ from that. With option 0, QuickPDF thinks for you and puts the strings into the correct sequence, but then there are other disadvantages. It's your choice. Cheers, Ingo
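To make Ingo's explanation concrete, here is a minimal Python sketch of the idea. It does not call the QuickPDF library at all. The `Fragment` type, its `line`/`column` fields and both helper functions are hypothetical, invented only for this illustration, and real extraction output will carry different position data (for example coordinates in points rather than character columns). It only shows how strings stored in insertion order can be put back into reading order and rejoined when their positions touch.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted string with simplified, hypothetical position data."""
    line: int    # row on the page (assumed unit: text lines)
    column: int  # starting column (assumed unit: characters)
    text: str


def reading_order(fragments):
    """Sort fragments by position, the way a viewer arranges them on screen."""
    return sorted(fragments, key=lambda f: (f.line, f.column))


def join_adjacent(fragments):
    """Merge fragments on the same line when one starts where the previous ends."""
    merged = []
    for frag in reading_order(fragments):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.line == frag.line
                and prev.column + len(prev.text) == frag.column):
            # Same line and touching columns: treat as one word.
            merged[-1] = Fragment(prev.line, prev.column, prev.text + frag.text)
        else:
            merged.append(frag)
    return merged


# Insertion order, as in Ingo's example: "2007" was added last,
# after the rest of the page had already been written.
stream = [
    Fragment(5, 6, "OBET_"),
    Fragment(6, 1, "Some other text"),
    Fragment(5, 11, "2007"),
]

for frag in join_adjacent(stream):
    print(frag.line, frag.column, frag.text)
```

Read in stored order, the strings come out as "OBET_", "Some other text", "2007". After sorting by position and merging, "OBET_" (line 5, column 6, five characters) meets "2007" at column 11 and the two become one string again.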

masterofdesaster (Beginner, Joined: 28 Jun 10, Location: Switzerland)	Posted: 06 Jul 10 at 4:29PM
	Hi Ingo, OK, I understand now, and I have it working. Thanks for your help. Cheers, Hanspeter