Debenu Quick PDF Library - PDF SDK Community Forum Homepage
Text in separate cells being extracted together

l33m4n (Beginner, joined 26 Jan 12)
Posted: 28 Jan 12 at 8:12PM
Hi,

First I have to say how impressed I am with this toolkit and the fantastic help and support that comes with it. I'm sure there is a quick answer to this and it's probably come up a number of times.

I'm extracting text from a PDF that was generated by printing a Word document to CutePDF (my application is specific to PDFs that have been created this way).

I'm using the following code to extract the text along with its co-ordinates:

  strText = strText + QP.ExtractFilePageText(strInputFilePath, "", nPage, 3);
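
For anyone wanting to work with the result programmatically: judging by the sample output in this post, each line returned with option 3 is a comma-separated record holding the font name, the colour, the font size, eight co-ordinates (four corner points) and the quoted text. Here is a minimal Python sketch that reads one such line into named fields. It works on the returned string only and does not call Quick PDF Library; `TextBlock` and `parse_block` are hypothetical names, and the field layout is an assumption read off the sample.

```python
import csv
from typing import NamedTuple


class TextBlock(NamedTuple):
    font: str
    color: str
    size: float
    left: float
    bottom: float
    right: float
    top: float
    text: str


def parse_block(line: str) -> TextBlock:
    """Parse one option-3 style record into a TextBlock (hypothetical helper)."""
    # csv copes with the quoted font name and the quoted text field.
    fields = next(csv.reader([line]))
    font, color, size = fields[0], fields[1], float(fields[2])
    coords = [float(v) for v in fields[3:11]]
    xs, ys = coords[0::2], coords[1::2]
    # min/max avoids relying on the order in which the corners are listed.
    return TextBlock(font, color, size, min(xs), min(ys), max(xs), max(ys), fields[11])


sample = ('"VRQGOC+TT15Ct00",#000000,11.04,148.3199,759.1013,208.0535,759.1013,'
          '208.0535,766.7079,148.3199,766.7079,"Three  Four "')
block = parse_block(sample)
print(block.left, block.right, repr(block.text))
```

Text that itself contains double quotes may need extra handling, since the sample does not show how the library escapes them.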

On the whole this works well, but in some cases a single extracted piece of text is actually made up of several separate pieces that sit in separate cells of the same table row, both in the original Word document and in the PDF generated from it. The cells have borders, so it seems sensible that the text in each cell would be detected separately, but I suspect the text extraction function simply looks for text and doesn't take the lines between pieces of text into account.

I should note that this is intermittent, as you will see in the example below. I suspect it has something to do with the proximity of the text in the next cell, which varies depending on the length of the word.


One | Two | Three | Four | Five | Six | Seven
1   | 2   | 3     | 4    | 5    | 6   | 7

Is extracted as:-

"VRQGOC+TT15Ct00",#000000,11.04,71.9995,759.1013,95.493,759.1013,95.493,766.7079,71.9995,766.7079,"One  "
"VRQGOC+TT15Ct00",#000000,11.04,112.8,759.1013,134.3735,759.1013,134.3735,766.7079,112.8,766.7079,"Two "
"VRQGOC+TT15Ct00",#000000,11.04,148.3199,759.1013,208.0535,759.1013,208.0535,766.7079,148.3199,766.7079,"Three  Four "
"VRQGOC+TT15Ct00",#000000,11.04,219.1199,759.1013,239.6135,759.1013,239.6135,766.7079,219.1199,766.7079,"Five "
"VRQGOC+TT15Ct00",#000000,11.04,254.6399,759.1013,269.4934,759.1013,269.4934,766.7079,254.6399,766.7079,"Six "
"VRQGOC+TT15Ct00",#000000,11.04,283.1999,759.1013,312.4534,759.1013,312.4534,766.7079,283.1999,766.7079,"Seven "
"VRQGOC+TT15Ct00",#000000,11.04,71.9995,745.1813,80.013,745.1813,80.013,752.7879,71.9995,752.7879,"1 "
"VRQGOC+TT15Ct00",#000000,11.04,112.8,745.1813,120.8135,745.1813,120.8135,752.7879,112.8,752.7879,"2 "
"VRQGOC+TT15Ct00",#000000,11.04,148.3199,745.1813,156.3335,745.1813,156.3335,752.7879,148.3199,752.7879,"3 "
"VRQGOC+TT15Ct00",#000000,11.04,185.0399,745.1813,193.0535,745.1813,193.0535,752.7879,185.0399,752.7879,"4 "
"VRQGOC+TT15Ct00",#000000,11.04,219.1199,745.1813,227.1335,745.1813,227.1335,752.7879,219.1199,752.7879,"5 "
"VRQGOC+TT15Ct00",#000000,11.04,254.6399,745.1813,262.6534,745.1813,262.6534,752.7879,254.6399,752.7879,"6 "
"VRQGOC+TT15Ct00",#000000,11.04,283.1999,745.1813,291.2134,745.1813,291.2134,752.7879,283.1999,752.7879,"7 "
"VRQGOC+TT15Ct00",#000000,11.04,71.9995,731.1413,74.493,731.1413,74.493,738.7479,71.9995,738.7479," "
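
One possible workaround, sketched in Python against the output above: take the left edges from the row where every cell came out separately as column positions, then flag any block whose bounding box crosses more than one of them. This only post-processes the extracted text and calls nothing in Quick PDF Library. `columns_spanned` and `split_merged` are hypothetical helpers, the 0.5 tolerance is an arbitrary assumption, and splitting on runs of two or more spaces is a guess based on the "Three  Four " sample.

```python
import csv
import re

TOL = 0.5  # tolerance when comparing x positions (assumed value)

HEADER_ROW = '''"VRQGOC+TT15Ct00",#000000,11.04,112.8,759.1013,134.3735,759.1013,134.3735,766.7079,112.8,766.7079,"Two "
"VRQGOC+TT15Ct00",#000000,11.04,148.3199,759.1013,208.0535,759.1013,208.0535,766.7079,148.3199,766.7079,"Three  Four "
"VRQGOC+TT15Ct00",#000000,11.04,219.1199,759.1013,239.6135,759.1013,239.6135,766.7079,219.1199,766.7079,"Five "'''

REFERENCE_ROW = '''"VRQGOC+TT15Ct00",#000000,11.04,112.8,745.1813,120.8135,745.1813,120.8135,752.7879,112.8,752.7879,"2 "
"VRQGOC+TT15Ct00",#000000,11.04,148.3199,745.1813,156.3335,745.1813,156.3335,752.7879,148.3199,752.7879,"3 "
"VRQGOC+TT15Ct00",#000000,11.04,185.0399,745.1813,193.0535,745.1813,193.0535,752.7879,185.0399,752.7879,"4 "
"VRQGOC+TT15Ct00",#000000,11.04,219.1199,745.1813,227.1335,745.1813,227.1335,752.7879,219.1199,752.7879,"5 "'''


def parse(line):
    """Read one record into a dict with its horizontal extent and text."""
    f = next(csv.reader([line]))
    xs = [float(v) for v in f[3:11:2]]  # the four x co-ordinates
    return {"left": min(xs), "right": max(xs), "text": f[11]}


def columns_spanned(block, column_lefts):
    """Column left edges that fall inside the block's horizontal extent."""
    return [x for x in column_lefts if block["left"] - TOL <= x < block["right"]]


def split_merged(block, column_lefts):
    """Split a block into (column_left, text) pairs when it spans several columns."""
    spanned = columns_spanned(block, column_lefts)
    parts = [p for p in re.split(r" {2,}", block["text"].strip()) if p]
    if len(spanned) > 1 and len(parts) == len(spanned):
        return list(zip(spanned, parts))
    # Not merged, or the pieces cannot be matched to columns: leave it alone.
    return [(block["left"], block["text"].strip())]


column_lefts = [parse(line)["left"] for line in REFERENCE_ROW.splitlines()]
header = [parse(line) for line in HEADER_ROW.splitlines()]
cells = [cell for b in header for cell in split_merged(b, column_lefts)]
print(cells)
```

This depends on at least one row being extracted cleanly, and it would not help where a cell's own text legitimately contains double spaces.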

I would be grateful if anybody can offer some advice.

Thanks!


Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.