
Extract non-formatted Tabular Text

chrisreed
Team Player
Joined: 29 Apr 13
Location: Australia
Posted: 21 Jan 15 at 10:16AM
Can't find any site to upload the example PDF that I'm trying to process without our Firewall blocking it (tried docdroid, scribd, dropbox) so the best I can do is upload an image.
 
 
The text "looks" like it is separated by TABS, but there is no formatting.  When I try to use the DAExtractPageText and DAExtractBlockText functions, instead of each <Field Name>: <Field Value> pair aligning with each other, they are all over the place.
 
I also tried all the different options in DASetTextExtractionOptions to no avail.
 
How can I extract this unformatted text so that each <Field Name>: <Field Value> pair aligns with each other?
e.g.  Surname: TEST etc.

Thanks Chris.
AndrewC
Moderator Group
Joined: 08 Dec 10
Location: Geelong, Aust
Posted: 27 Jan 15 at 10:18AM
Chris,

PDF files do not have TAB characters, words, sentences or paragraphs.  Text is drawn at a specific x and y location.  Extraction attempts to collect all the drawn text, but it is not always perfect.
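
To illustrate why layout has to be inferred rather than read, here is a minimal Python sketch of one way an extractor might rebuild rows from positioned text. It is not how Quick PDF Library works internally. The `(x, y, text)` fragment shape, the `group_into_rows` name and the `y_tolerance` value are all assumptions made for illustration, and it assumes the default PDF coordinate system where y grows upward.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text). Assumes y grows upward,
# as in the default PDF coordinate system.
Fragment = Tuple[float, float, str]


def group_into_rows(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group text fragments into rows by y position, then order each row by x."""
    rows: List[Tuple[float, List[Fragment]]] = []
    # Visit fragments from the top of the page down, left to right.
    for frag in sorted(fragments, key=lambda f: (-f[1], f[0])):
        y = frag[1]
        # Join the current row if this fragment sits close to the row's first y.
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append((y, [frag]))
    # Within each row, order fragments left to right and join with a space.
    return [
        " ".join(text for _, _, text in sorted(items, key=lambda f: f[0]))
        for _, items in rows
    ]
```

A label and a value drawn on slightly different baselines only end up on the same row if the tolerance covers the gap, which is one reason extracted text can come out "all over the place".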

GetPageText or DAExtractPageText using option 7 will be your best chance.

Andrew.
chrisreed
Posted: 04 Feb 15 at 7:10AM
Hi Andrew,
Sorry for the lateness in my reply, but I never received an e-mail that you had posted a reply.
 
Believe me I tried all the Extraction Options (from 1 to 11) and none of them were any good.  So instead of having the fields/values go across the page I just had them going down the page as follows:
 
<Field Name>   <Field Value>
Surname:       Tester
Firstname:     Kenneth
DOB:           29 Mar 1928
Exam Date:     30 Jan 2015 07:46
Site ID:       RPH
etc.
 
and used the Extraction Option (5) - Sort text blocks based on top left position.
 
This worked a lot better, in that this option returned most of the <Field Names> first and then the <Field Values> next, but some still got mixed up so that I couldn't associate all the correct <Field Name> with the matching <Field Value>.
chrisreed
Posted: 04 Feb 15 at 10:17AM
Sorry Andrew, I was too quick with my reply.
 
Yes, if I use Option 7 the output matches what is on the PDF file very well. Thanks for your help.
 
Chris
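
Once the extracted text has one field per line, as in the layout described above, pairing each <Field Name> with its <Field Value> could look roughly like the Python sketch below. The `parse_fields` helper is hypothetical and not part of Quick PDF Library. It assumes the page text is already in a string and that every field line has the form `Name: Value`.

```python
from typing import Dict


def parse_fields(page_text: str) -> Dict[str, str]:
    """Turn lines of the form 'Name: Value' into a name -> value mapping."""
    fields: Dict[str, str] = {}
    for line in page_text.splitlines():
        # Split on the first colon only, so a value such as
        # '30 Jan 2015 07:46' keeps its own colon.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
        # Lines without a colon (headings, blank lines) are skipped.
    return fields
```

Splitting on the first colon only is a deliberate choice here. It would give the wrong result for a field name that itself contains a colon, or for a line that holds more than one field.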