Searching a string in an existing PDF file
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2150
Printed Date: 04 Apr 26 at 10:15PM Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com
Topic: Searching a string in an existing PDF file
Posted By: balane78
Date Posted: 15 Feb 12 at 1:58PM
Hi, sorry for this newbie question, which will most probably look stupid, but I have been crawling through the documentation since yesterday. Which function should I use to search for a predefined string inside a PDF file and get the page number?
|
Replies:
Posted By: edvoigt
Date Posted: 15 Feb 12 at 5:27PM
Hi,
the normal way takes more than one step.
You can go page by page through your document. Depending on your goal, stop at the first result or build a list of the places where the search string is found.
First get the text of a page with GetPageText. Depending on your needs and on what you know about the PDF you want to search, use the appropriate extract option. On the result of GetPageText you then run the kind of search that fits the chosen extract option (you have to code this search yourself in your program). Play with the extract options and look at their output. Keep an eye on option 3 or 4.
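The page-by-page loop described above could be sketched roughly like this in Python. This is only an illustration: `get_page_text` is a hypothetical callable standing in for whatever call returns the extracted text of one page (for example a wrapper around GetPageText), and `find_pages`, `first_only` and the sample pages are made-up names and data, not part of the library.

```python
from typing import Callable, List


def find_pages(page_count: int,
               get_page_text: Callable[[int], str],
               needle: str,
               first_only: bool = False) -> List[int]:
    """Return the 1-based page numbers whose extracted text contains needle."""
    hits: List[int] = []
    wanted = needle.casefold()  # case-insensitive comparison
    for page in range(1, page_count + 1):
        text = get_page_text(page)  # hypothetical: text of one page
        if wanted in text.casefold():
            hits.append(page)
            if first_only:  # stop at the first result if that is all you need
                break
    return hits


# Stand-in for a real document: three pages of already extracted text.
sample_pages = {1: "Introduction", 2: "Invoice total: 42 EUR", 3: "Invoice appendix"}
print(find_pages(3, sample_pages.__getitem__, "invoice"))  # prints [2, 3]
```

Passing `first_only=True` gives the "stop with the first result" variant; the default builds the full list of pages.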
Important! Because a PDF is by definition not a word-processing data file, text extraction cannot guarantee to detect words as words. In a PDF a word can be drawn letter by letter and in the wrong order. So the text extraction of QuickPDF has a harder job than it seems. But in the normal case (meaning: the text is written in one font, one size, in order and without tricks) you only have problems with words that run from the end of one line to the start of the next. They will possibly come out as two words, but are only one word in your search string.
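To illustrate why drawing order matters, here is a small Python sketch that rebuilds reading order from glyph positions. It is an assumption-laden toy, not how QuickPDF works internally: it assumes you already have each character with its x/y position (the `Glyph` tuples and `line_tolerance` are invented for this example), groups glyphs into lines by y, and sorts each line by x.

```python
from typing import List, Tuple

Glyph = Tuple[float, float, str]  # (x, y, character); y grows upwards as in PDF


def glyphs_to_lines(glyphs: List[Glyph], line_tolerance: float = 2.0) -> List[str]:
    """Group glyphs into lines by y position, then order each line by x."""
    lines: List[List[Glyph]] = []
    # Walk from top to bottom so glyphs of the same line end up adjacent.
    for glyph in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    # Inside each line, left-to-right order gives the readable text.
    return ["".join(ch for _, _, ch in sorted(line, key=lambda g: g[0]))
            for line in lines]


# The word "Test" drawn in scrambled order on one line, "ok" on the line below.
scrambled = [(30.0, 700.0, "s"), (10.0, 700.0, "T"), (40.0, 700.0, "t"),
             (20.0, 700.0, "e"), (20.0, 688.0, "k"), (10.0, 688.0, "o")]
print(glyphs_to_lines(scrambled))  # prints ['Test', 'ok']
```

Real documents add rotated text, columns and varying font sizes, which is where a simple sort like this stops being enough.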
To get more information, use the forum search to find other posts dealing with text extraction and searching for words.
Werner
|
Posted By: balane78
Date Posted: 15 Feb 12 at 6:22PM
OK thanks. BTW I wonder how the Acrobat Reader search tool works.
|
Posted By: Ingo
Date Posted: 15 Feb 12 at 7:35PM
Acrobat Reader comes with an installation of over 100 MB... so it's probably a bit faster ;-)
|
Posted By: edvoigt
Date Posted: 15 Feb 12 at 7:56PM
Hi,
I ran the following test: a very short Word text, printed to PDF with PDF-Creator. It looks like this:
Test for search-
ing with acro-
bat
For better understanding, this is how it looks inside the PDF:
[(T)-15.8907(e)-2.05734(s)3.21993(t)0.721099( )-3.16695(f)7.49943(o)-6.3339(r)-4.55617( )-3.16695(s)3.21993(e)-2.05734(a)-2.05734(r)-4.55617(c)-2.05726(h)5.7217(-)333]TJ 11.52 TL
T*[(i)0.721099(n)5.7217(g)5.7217( )-27.2782(w)10.7194(i)0.721099(t)0.721099(h)5.7217( )-3.16695(a)-2.05734(c)-2.05734(r)-4.55617(o)-6.3339(-)]TJ 11.4 TL
T*[(b)-6.3339(a)-2.05734(t)0.721161( )]TJ
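To make the stream above easier to read, here is a rough Python sketch that pulls the single letters out of each TJ array and drops the kerning numbers between them. It is a simplification under clear assumptions: the strings contain no escaped or nested parentheses, and the font encoding maps the bytes directly to readable characters, which is not true for every PDF. The names `tj_lines`, `TJ_ARRAY` and `LITERAL` are made up for this example.

```python
import re
from typing import List

# One TJ array: "[" ... "]TJ". Assumes no "]" occurs inside the strings.
TJ_ARRAY = re.compile(r"\[(.*?)\]\s*TJ", re.DOTALL)
# One literal string "(...)". Assumes no escaped or nested parentheses.
LITERAL = re.compile(r"\(([^()]*)\)")


def tj_lines(content: str) -> List[str]:
    """Return the text shown by each TJ operator, ignoring the kerning numbers."""
    return ["".join(LITERAL.findall(array)) for array in TJ_ARRAY.findall(content)]


# The content stream from the test, split over several literals for readability.
stream = (
    "[(T)-15.8907(e)-2.05734(s)3.21993(t)0.721099( )-3.16695(f)7.49943"
    "(o)-6.3339(r)-4.55617( )-3.16695(s)3.21993(e)-2.05734(a)-2.05734"
    "(r)-4.55617(c)-2.05726(h)5.7217(-)333]TJ 11.52 TL "
    "T*[(i)0.721099(n)5.7217(g)5.7217( )-27.2782(w)10.7194(i)0.721099"
    "(t)0.721099(h)5.7217( )-3.16695(a)-2.05734(c)-2.05734(r)-4.55617"
    "(o)-6.3339(-)]TJ 11.4 TL "
    "T*[(b)-6.3339(a)-2.05734(t)0.721161( )]TJ"
)
print(tj_lines(stream))  # prints ['Test for search-', 'ing with acro-', 'bat ']
```

Each TJ array here corresponds to one printed line, so the hyphen at the end of "search-" really is a separate character in the file.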
I have marked the word Test in red.
Try a search for "search-": Acrobat Reader X does not find it! But "searching" is found.
Conclusion: Acrobat does a lot of things to get its search results. It seems to prepare the text by omitting some characters (newline, hyphen + newline). It is a little bit like a compiler ignoring comments and spaces.
In most cases it finds what you are searching for.
As you can see, it depends on the quality of the text extraction, the preparation and the search tactics.
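The preparation step guessed at above (dropping newline and hyphen + newline before searching) might look like this in Python. This is only my guess at the behaviour seen in the test, not Acrobat's actual algorithm, and `prepare_for_search` is an invented name.

```python
import re


def prepare_for_search(text: str) -> str:
    """Join words hyphenated at a line end and turn remaining newlines into spaces."""
    joined = re.sub(r"-\s*\n\s*", "", text)     # "search-\ning" -> "searching"
    return re.sub(r"\s+", " ", joined).strip()  # collapse newlines and space runs


# The three lines of the test document as a text extraction might return them.
page_text = "Test for search-\ning with acro-\nbat "
prepared = prepare_for_search(page_text)
print(prepared)                 # prints Test for searching with acrobat
print("searching" in prepared)  # prints True
print("search-" in prepared)    # prints False
```

This reproduces the result of the test: "searching" is found and "search-" is not. The price is that a genuine hyphen at a line end (as in a compound word broken after its hyphen) is removed as well, so it is a trade-off rather than a complete solution.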
Werner
|