
Extract non-formatted Tabular Text

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3057
Printed Date: 28 Apr 24 at 11:50PM

Topic: Extract non-formatted Tabular Text

Posted By: chrisreed
Subject: Extract non-formatted Tabular Text
Date Posted: 21 Jan 15 at 10:16AM

Can't find any site to upload the example PDF that I'm trying to process without our Firewall blocking it (tried docdroid, scribd, dropbox) so the best I can do is upload an image.

http://s5.postimg.org/5hncgugsn/Example_PDF.jpg

The text "looks" like it is separated by TABS, but there is no formatting. When I try to use the DAExtractPageText and DAExtractBlockText functions, instead of each <Field Name>: <Field Value> pair aligning with each other, they are all over the place.

I also tried all the different options in DASetTextExtractionOptions to no avail.

How can I extract this unformatted text so that each <Field Name>: <Field Value> pair stays together?

eg. Surname: TEST etc.

Thanks Chris.

Replies:

Posted By: AndrewC
Date Posted: 27 Jan 15 at 10:18AM

Chris,

PDF files do not have TAB characters, words, sentences or paragraphs. Text is drawn at a specific x and y location. Extraction attempts to collect all the drawn text and reassemble it, but the result is not always perfect.
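To illustrate the point about positioned text, here is a minimal Python sketch of how an extractor might rebuild lines from fragments. It is not the library's algorithm. The `(x, y, text)` tuples, the `rebuild_lines` name and the `y_tolerance` value are all assumptions made up for the example.

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    Fragments whose y value is within y_tolerance of the first fragment
    in a row are treated as the same row. Hypothetical helper, for
    illustration only.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, read left to right.
    return [
        " ".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for _, items in rows
    ]


# Made-up fragments in arbitrary drawing order.
sample = [
    (300, 700.0, "DOB:"),
    (50, 700.0, "Surname:"),
    (120, 700.5, "TEST"),
    (360, 699.8, "29 Mar 1928"),
    (50, 680.0, "Firstname:"),
    (120, 680.0, "Kenneth"),
]
lines = rebuild_lines(sample)
```

If the tolerance is too small, a name and its value can land in different rows; if it is too large, neighbouring rows merge. That is one reason extraction results differ between options.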

GetPageText or DAExtractPageText using option 7 will be your best chance.

Andrew.

Posted By: chrisreed
Date Posted: 04 Feb 15 at 7:10AM

Hi Andrew,

Sorry for the lateness of my reply, but I never received an e-mail saying that you had posted a reply.

Believe me, I tried all the Extraction Options (from 1 to 11) and none of them were any good. So instead of having the fields/values go across the page, I just had them going down the page as follows:

Surname: Tester

Firstname: Kenneth

DOB: 29 Mar 1928

Exam Date: 30 Jan 2015 07:46

Site ID: RPH etc....

and used the Extraction Option (5) - Sort text blocks based on top left position.

This worked a lot better, in that this option returned most of the <Field Names> first and then the <Field Values> next, but some still got mixed up so that I couldn't associate all the correct <Field Name> with the matching <Field Value>.

Posted By: chrisreed
Date Posted: 04 Feb 15 at 10:17AM

Sorry Andrew I was too quick with my reply.

Yes, if I use Option 7 the output matches what is in the PDF file very well - thanks for your help.

Chris
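Once the extracted text has one <Field Name>: <Field Value> pair per line, as in the vertical layout shown earlier in the thread, the pairs can be read into a dictionary. This is a rough Python sketch only; `parse_fields` is a hypothetical helper, and it assumes field names never contain a colon.

```python
def parse_fields(text):
    """Turn 'Name: Value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        # Split on the FIRST colon only, so values such as times
        # ("07:46") keep their own colons.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


extracted = """Surname: Tester

Firstname: Kenneth
DOB: 29 Mar 1928
Exam Date: 30 Jan 2015 07:46
Site ID: RPH"""
record = parse_fields(extracted)
```

If the extraction option returns several pairs on one line instead, this simple split will not be enough and the lines would need to be broken up by field name first.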