
Extract non-formatted Tabular Text

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3057
Printed Date: 28 Apr 24 at 11:50PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Extract non-formatted Tabular Text
Posted By: chrisreed
Subject: Extract non-formatted Tabular Text
Date Posted: 21 Jan 15 at 10:16AM
I can't find any site where I can upload the example PDF I'm trying to process without our firewall blocking it (tried docdroid, scribd, dropbox), so the best I can do is upload an image.
 
http://s5.postimg.org/5hncgugsn/Example_PDF.jpg
 
The text "looks" like it is separated by TABS, but there is no formatting. When I try to use the DAExtractPageText and DAExtractBlockText functions, instead of each <Field Name>: <Field Value> pair lining up, the names and values come out all over the place.
 
I also tried all the different options in DASetTextExtractionOptions, to no avail.
 
How can I extract this unformatted text so that each <Field Name>: <Field Value> pair stays together?
e.g. Surname: TEST etc.

Thanks Chris.



Replies:
Posted By: AndrewC
Date Posted: 27 Jan 15 at 10:18AM
Chris,

PDF files do not have TAB characters, words, sentences or paragraphs. Text is drawn at a specific x and y location. Extraction attempts to collect all the drawn text, but it is not always perfect.
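
To illustrate the point about x and y locations, here is a minimal Python sketch of how positioned text fragments could be regrouped into lines. It is not the library's algorithm: the `(x, y, text)` fragment format, the `group_into_lines` helper, the tolerance value and the coordinates are all assumptions made up for the example.

```python
from typing import List, Tuple

# Hypothetical fragment type: (x, y, text), with y growing towards the top of the page
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Rebuild reading-order lines from positioned text fragments."""
    lines: List[List[Fragment]] = []
    # Visit fragments from the top of the page downwards (larger y first)
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same line if the baseline is close to that of the current line's first fragment
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right and join them with spaces
    return [" ".join(t for _, _, t in sorted(line, key=lambda f: f[0])) for line in lines]


# Fragments in the arbitrary order they might be drawn in; positions are invented
fragments = [
    (300.0, 700.0, "Firstname:"),
    (120.0, 700.5, "TEST"),
    (50.0, 700.0, "Surname:"),
    (370.0, 699.8, "Kenneth"),
    (50.0, 680.0, "DOB:"),
    (120.0, 680.0, "29 Mar 1928"),
]
reconstructed = group_into_lines(fragments)
for line in reconstructed:
    print(line)
```

Small differences in baseline (700.0 versus 700.5) are why a tolerance is needed, and a tolerance that is too tight or too loose is one way extraction can go wrong.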

GetPageText or DAExtractPageText using option 7 will be your best chance.

Andrew.


Posted By: chrisreed
Date Posted: 04 Feb 15 at 7:10AM
Hi Andrew,
Sorry for the lateness of my reply, but I never received an e-mail saying that you had posted a reply :(
 
Believe me, I tried all the Extraction Options (from 1 to 11) and none of them were any good. So instead of having the fields/values go across the page, I just had them going down the page as follows:
 
<Field Name> <Field Value>
Surname:       Tester
Firstname:     Kenneth
DOB:           29 Mar 1928
Exam Date:     30 Jan 2015 07:46
Site ID:       RPH    etc....
 
and used the Extraction Option (5) - Sort text blocks based on top left position.
 
This worked a lot better, in that this option returned most of the <Field Names> first and then the <Field Values> next, but some still got mixed up so that I couldn't associate all the correct <Field Name> with the matching <Field Value>.
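
When names and values come back in mixed order like this, one possible workaround is to pair them by position rather than by output order. The Python sketch below assumes each fragment's position is available as a hypothetical `(x, y, text)` tuple and that every field name ends with a colon; `pair_fields` and the coordinates are invented for illustration and are not part of the library.

```python
from typing import Dict, List, Tuple

# Hypothetical fragment type: (x, y, text)
Fragment = Tuple[float, float, str]


def pair_fields(fragments: List[Fragment], y_tolerance: float = 2.0) -> Dict[str, str]:
    """Match each '<Field Name>:' fragment with the nearest value to its right on the same row."""
    names = [f for f in fragments if f[2].endswith(":")]
    values = [f for f in fragments if not f[2].endswith(":")]
    pairs: Dict[str, str] = {}
    for nx, ny, name in names:
        # Candidate values share roughly the same baseline and sit to the right of the name
        same_row = [v for v in values if abs(v[1] - ny) <= y_tolerance and v[0] > nx]
        if same_row:
            nearest = min(same_row, key=lambda v: v[0] - nx)
            pairs[name.rstrip(":")] = nearest[2]
    return pairs


# Names first, then values in scrambled order; positions are made up
fragments = [
    (50.0, 700.0, "Surname:"),
    (50.0, 685.0, "Firstname:"),
    (50.0, 670.0, "DOB:"),
    (50.0, 655.0, "Exam Date:"),
    (50.0, 640.0, "Site ID:"),
    (150.0, 670.0, "29 Mar 1928"),
    (150.0, 700.0, "Tester"),
    (150.0, 640.0, "RPH"),
    (150.0, 685.0, "Kenneth"),
    (150.0, 655.0, "30 Jan 2015 07:46"),
]
fields = pair_fields(fragments)
for name, value in fields.items():
    print(f"{name}: {value}")
```

This only works if the extraction output carries coordinates, and it would need adjusting for values that themselves end in a colon or that wrap onto a second line.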


Posted By: chrisreed
Date Posted: 04 Feb 15 at 10:17AM
Sorry Andrew, I was too quick with my reply.
 
Yes, if I use Option 7 the output matches what is on the PDF file very well - thanks for your help.
 
Chris


