Searching a string in an existing PDF file

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2150
Printed Date: 20 May 26 at 7:06PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Posted By: balane78
Subject: Searching a string in an existing PDF file
Date Posted: 15 Feb 12 at 1:58PM

Hi,
Sorry for this newbie question, which will most probably look stupid, but I have been crawling through the documentation since yesterday.
Which function should I use to search for a predefined string inside a PDF file and get the page number?

Replies:

Posted By: edvoigt
Date Posted: 15 Feb 12 at 5:27PM

Hi,

the normal way takes more than one step.

You go through your document page by page. Depending on your goal, either stop at the first result or build a list of the places where the search string is found.

First get the text of a page with GetPageText. Depending on your wishes and on what you know about the PDF you want to search, use the right extract option. On the result of GetPageText you then run the kind of search that fits the chosen extract option (you have to code this search yourself in your program). Play with the extract options and look at their output. Keep an eye on option 3 or 4.
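The page-by-page loop described above could be sketched like this in Python. This is only an illustration under assumptions: the list page_texts stands for the strings returned by GetPageText, one per page, and the extraction call itself is not shown. The names find_pages and first_only are made up for this example and are not part of the library.

```python
def find_pages(page_texts, needle, first_only=False):
    """Return the 1-based page numbers whose extracted text contains needle."""
    hits = []
    needle = needle.casefold()  # case-insensitive comparison
    for page_number, text in enumerate(page_texts, start=1):
        if needle in text.casefold():
            hits.append(page_number)
            if first_only:
                break  # stop at the first result
    return hits


# Hypothetical extracted text, one string per page.
pages = ["Introduction", "Invoice total: 42 EUR", "Appendix with invoice copies"]
print(find_pages(pages, "invoice"))                   # all pages with a hit
print(find_pages(pages, "invoice", first_only=True))  # only the first hit
```

A plain substring test like this only fits an extract option that returns plain text. If the chosen option returns structured output, the search has to be adapted to that format.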

Important! Because a PDF is by definition not a word-processing file, text extraction cannot guarantee that words are detected as words. In a PDF a word can be drawn letter by letter and in the wrong order. So the text extraction of QuickPDF has a harder job than it seems. But in the normal case (meaning: the text is written with one font, one size, in order and without tricks) you only have problems with words that run from the end of one line to the start of the next. They will possibly come out as two words, while in your search string they are a single word.
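To show why drawing order matters, here is a small Python sketch that rebuilds reading order from glyph positions. It is not how QuickPDF works internally. The (x, y, char) tuples, the function name glyphs_to_lines and the tolerance value are all assumptions made for the example.

```python
def glyphs_to_lines(glyphs, line_tolerance=2.0):
    """Rebuild text lines from (x, y, char) tuples drawn in arbitrary order."""
    lines = []  # each entry: [baseline_y, [(x, char), ...]]
    # PDF y coordinates grow upwards, so the top line has the largest y.
    for x, y, char in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, char))  # same baseline: same line
        else:
            lines.append([y, [(x, char)]])  # new baseline: new line
    # Within each line, order the glyphs from left to right.
    return ["".join(char for _, char in sorted(row)) for _, row in lines]


# Hypothetical glyphs: the word "Test" drawn out of order, then a second line.
glyphs = [(30, 700, "s"), (10, 700, "T"), (40, 700, "t"),
          (20, 700, "e"), (10, 688, "o"), (20, 688, "k")]
print(glyphs_to_lines(glyphs))
```

Real pages add rotated text, several columns and varying font sizes, so a simple sort like this will not be enough in general.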

For more information, search the forum for other posts dealing with text extraction and searching for words.

Werner

Posted By: balane78
Date Posted: 15 Feb 12 at 6:22PM

OK, thanks.
BTW, I wonder how the Acrobat Reader search tool works.

Posted By: Ingo
Date Posted: 15 Feb 12 at 7:35PM

Acrobat Reader comes with an installation of over 100 MB...
so it's probably a bit faster ;-)

Posted By: edvoigt
Date Posted: 15 Feb 12 at 7:56PM

Hi,

I did the following test: a very small Word text, printed to PDF with PDF-Creator. It looks like this:

Test for search-
ing with acro-
bat

For better understanding, this is how it looks inside (the page content stream):
[(T)-15.8907(e)-2.05734(s)3.21993(t)0.721099( )-3.16695(f)7.49943(o)-6.3339(r)-4.55617( )-3.16695(s)3.21993(e)-2.05734(a)-2.05734(r)-4.55617(c)-2.05726(h)5.7217(-)333]TJ
11.52 TL
T*[(i)0.721099(n)5.7217(g)5.7217( )-27.2782(w)10.7194(i)0.721099(t)0.721099(h)5.7217( )-3.16695(a)-2.05734(c)-2.05734(r)-4.55617(o)-6.3339(-)]TJ
11.4 TL
T*[(b)-6.3339(a)-2.05734(t)0.721161( )]TJ
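To make the content stream above easier to read, here is a small Python sketch that pulls the literal strings out of each TJ array and joins them. It is a simplification written for this example: it ignores the kerning numbers, does not decode escape sequences or hex strings, and would break on a string containing "]". It is not a general PDF parser, and the name tj_strings is made up.

```python
import re

# One "[ ... ] TJ" array per match; assumes each array sits on a single line.
TJ_ARRAY = re.compile(r"\[(.*?)\]\s*TJ")
# A literal string "( ... )", allowing backslash-escaped characters inside.
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def tj_strings(content):
    """Return the concatenated literal strings of each TJ array, in order."""
    lines = []
    for array in TJ_ARRAY.findall(content):
        # The numbers between the strings are kerning adjustments; skip them.
        lines.append("".join(STRING.findall(array)))
    return lines


stream = r"""[(T)-15.8907(e)-2.05734(s)3.21993(t)0.721099( )-3.16695(f)7.49943(o)-6.3339(r)-4.55617( )-3.16695(s)3.21993(e)-2.05734(a)-2.05734(r)-4.55617(c)-2.05726(h)5.7217(-)333]TJ
11.52 TL
T*[(i)0.721099(n)5.7217(g)5.7217( )-27.2782(w)10.7194(i)0.721099(t)0.721099(h)5.7217( )-3.16695(a)-2.05734(c)-2.05734(r)-4.55617(o)-6.3339(-)]TJ
11.4 TL
T*[(b)-6.3339(a)-2.05734(t)0.721161( )]TJ"""

for line in tj_strings(stream):
    print(repr(line))
```

Each letter is its own string in the array, and the hyphen at the end of a line is an ordinary character. That is why an extractor sees "search-" and "ing" as separate pieces.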

In the original post the word "Test" was marked in red; it is the first four strings, (T)(e)(s)(t), of the first array.

Try a search for "search-": Acrobat Reader X doesn't find it! But "searching" is found.

Conclusion: Acrobat does a lot of things to get its search results. It seems that they prepare the text by omitting some characters (newline, hyphen + newline). It's a little bit like a compiler ignoring comments and spaces.
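A preparation step of the kind guessed at here could be sketched as follows in Python. This is only an assumption about what such a preparation might look like, not a description of what Acrobat really does. The function name prepare is made up, and the input is the text of the test document, one line per text line.

```python
import re


def prepare(text):
    """Join words hyphenated at line ends, then collapse all whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)      # drop hyphen + newline
    return re.sub(r"\s+", " ", text).strip()   # newlines become single spaces


extracted = "Test for search-\ning with acro-\nbat \n"
prepared = prepare(extracted)

print(prepared)
print("search-" in prepared)    # the hyphenated fragment is gone
print("searching" in prepared)  # the joined word is found
```

The price of this simple rule is that a real hyphen at a line end is removed too, so a compound word broken at its hyphen would come out as one word without the hyphen.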

In most cases, it finds what you are searching for.

As you see, it depends on the quality of text extraction, preparation and search tactics.

Werner