Debenu Quick PDF Library - PDF SDK Community Forum Homepage

Extract text from PDF with Layout.

devMan (Beginner; joined 02 Apr 08; Luxembourg), posted 02 Apr 08 at 9:48AM:
Hi everybody!

I'm working with Visual Basic 6.
My goal right now is to extract the text from a PDF file so I can import it into an Oracle DB.
I've found an OCX that gives me the entire text in a string variable, but without the data separated the way it is in the file.

My test file contains values arranged in columns, and I need these values separated by a semicolon, for example.

So I would like to know whether your library can do this?


Thanks in advance ;)


P.S.: I can send you a test file if you need it to understand (I don't know if my explanation is clear...)

chicks (Debenu Quick PDF Library Expert; joined 29 Oct 05; United States), posted 02 Apr 08 at 11:52AM:
Your best bet is probably pdftohtml. Its XML output option provides positional information. You can then do an XSL transform to get the data into your final format. It's worked well for me in the past.
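
For illustration, here is a minimal sketch of the post-processing step, written in Python in place of the XSL transform suggested above. It assumes the XML came from `pdftohtml -xml` and that each `<text>` element carries numeric `top` and `left` attributes. The sample document and the function name `rows_from_xml` are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in the assumed pdftohtml -xml shape.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="30" height="10" font="0">Name</text>
    <text top="100" left="120" width="20" height="10" font="0">Qty</text>
    <text top="115" left="50" width="40" height="10" font="0">Widget</text>
    <text top="115" left="120" width="10" height="10" font="0">3</text>
  </page>
</pdf2xml>"""


def rows_from_xml(xml_text, sep=";"):
    """Join text elements that share the same 'top' value into one row."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            text = "".join(node.itertext()).strip()
            rows[float(node.get("top"))].append((float(node.get("left")), text))
        # 'top' is measured from the top of the page, so ascending order
        # reads the page from top to bottom.
        for top in sorted(rows):
            ordered = sorted(rows[top], key=lambda item: item[0])
            lines.append(sep.join(text for _, text in ordered))
    return lines


for line in rows_from_xml(SAMPLE_XML):
    print(line)
# Name;Qty
# Widget;3
```

Real files rarely align every column on exactly the same `top` value, so a tolerance when grouping rows would probably be needed.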

devMan, posted 03 Apr 08 at 2:39AM:
Hi,

Thank you for your answer!

My goal is to have the content of the PDF in my app, in a variable, so I can process it.
If I can, I'd prefer not to go through a file conversion.

But I'll keep your solution as a last resort.

Ingo (Moderator Group; joined 29 Oct 05), posted 03 Apr 08 at 3:00AM:
Hi!
I'm wondering... Perhaps I don't understand, but...
Why not use the text extraction functions from QuickPDF?
They work page by page, so you can get the text content of each page. With option 3 you get the individual text strings from each page along with additional data such as position on the page, font, color, ...
I can't imagine that you need more ;-)
Best regards,
Ingo

devMan, posted 03 Apr 08 at 3:32AM:
Hi Ingo,

Yes, my question is just whether QuickPDF can extract the text from a PDF that has columns and return the text separated according to the PDF layout...

I've checked the iSEDQuickPDF 5.11 Reference Guide.pdf and found 3 functions:
- GetPageLayout
- SetPageLayout
- ExtractFilePageContent

I've seen that there is VB6 example code.
I hope one of these helps me.

Ingo, posted 03 Apr 08 at 5:19AM:
Hi!

GetPageMode and GetPageLayout only retrieve what you see when a document is opened... That won't help you.

For text extraction you can use these functions:
DAExtractPageText
ExtractFilePageText
GetPageText

With option "3" you'll get CSV strings with position data (in pixels) and much more. With this data you can rebuild your PDF layout as a text file.
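
A minimal sketch of that rebuilding step, written in Python rather than VB6 to keep it short. It does not call the library. It only post-processes lines that are assumed to look like this: font name, color, font size, eight corner coordinates, then the text in quotes. The function names, the `y_tolerance` parameter and the sample values are all made up for illustration.

```python
import csv
from collections import defaultdict


def parse_fragment(line):
    """Return (x, y, text) for one CSV line of extracted text."""
    fields = next(csv.reader([line]))
    # Assumed field order: font, color, size, eight corner coordinates, text.
    return float(fields[3]), float(fields[4]), fields[-1]


def rebuild_rows(lines, y_tolerance=1.0, sep=";"):
    """Group fragments with a similar y into rows and join them left to right."""
    rows = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        x, y, text = parse_fragment(line)
        rows[round(y / y_tolerance)].append((x, text))
    # Assumption: y grows upward, so the largest y is the first row on the page.
    return [
        sep.join(text for _, text in sorted(frags, key=lambda f: f[0]))
        for _, frags in sorted(rows.items(), reverse=True)
    ]


# Made-up sample values in the assumed layout.
sample = [
    '"F",#000000,4.00,48.9,1107.1,75.8,1107.1,75.8,1111.9,48.9,1111.9,"Name"',
    '"F",#000000,4.00,91.1,1107.1,97.5,1107.1,97.5,1111.9,91.1,1111.9,"Qty"',
    '"F",#000000,4.00,48.9,1101.9,78.7,1101.9,78.7,1106.7,48.9,1106.7,"Widget"',
    '"F",#000000,4.00,91.1,1101.9,97.5,1101.9,97.5,1106.7,91.1,1106.7,"3"',
]
for row in rebuild_rows(sample):
    print(row)
# Name;Qty
# Widget;3
```

Bucketing by rounded y is the simplest possible grouping. Fragments that sit close to a bucket boundary can end up in different rows, so a real import would probably need a smarter row detection.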

Best regards,
Ingo

devMan, posted 03 Apr 08 at 8:54AM:
Re-Hi ;)

Thank you very much for your help, Ingo!
I'm trying the methods you sent me.

But would it be possible for me to upload my test file to show you exactly what I need?

Thank you again for your help!

Edit:
I've tried your function (DAExtractPageText with DAOpenFile and DAFindPage) with option 3 to get the PDF text, and I get a string like this:
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2188,1112.2983,112.3148,1112.2983,112.3148,1117.0743,104.2188,1117.0743,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3948,1112.2983,149.1708,1112.2983,149.1708,1117.0743,118.3948,1117.0743,"  "
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,48.9348,1107.1483,75.7988,1107.1483,75.7988,1111.9243,48.9348,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,91.1228,1107.1483,97.5708,1107.1483,97.5708,1111.9243,91.1228,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2148,1107.1483,112.3108,1107.1483,112.3108,1111.9243,104.2148,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3908,1107.1483,126.4868,1107.1483,126.4868,1111.9243,118.3908,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,130.7188,1107.1483,149.1668,1107.1483,149.1668,1111.9243,130.7188,1111.9243," "
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,48.9348,1101.9983,78.7468,1101.9983,78.7468,1106.7743,48.9348,1106.7743,""

But all the text fields (at the end of each line) are either empty or contain only spaces, never the values from my file...

Can you help me?
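
One quick way to confirm this symptom on a whole page is to count how many fragments carry nothing but whitespace in the last field. A minimal Python sketch follows. It assumes the text is always the last CSV field, the function names are made up, and the sample uses two of the lines quoted above.

```python
import csv

# Two of the lines quoted in this post.
SAMPLE = [
    '"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2188,1112.2983,112.3148,'
    '1112.2983,112.3148,1117.0743,104.2188,1117.0743,""',
    '"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3948,1112.2983,149.1708,'
    '1112.2983,149.1708,1117.0743,118.3948,1117.0743,"  "',
]


def text_of(line):
    """Return the last CSV field, assumed to hold the extracted text."""
    return next(csv.reader([line]))[-1]


def count_blank_fragments(lines):
    """Return (blank, total) over all non-empty input lines."""
    blank = total = 0
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines between records
        total += 1
        if not text_of(line).strip():
            blank += 1
    return blank, total


print(count_blank_fragments(SAMPLE))  # (2, 2)
```

If every fragment on a page comes back blank while the positions look plausible, the file itself is a likely suspect rather than the calling code.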


Edited by devMan - 03 Apr 08 at 9:35AM

Ingo, posted 03 Apr 08 at 9:33AM:
ingo [dot] schmoekel [at] ewetel [dot] net

devMan, posted 03 Apr 08 at 9:39AM:
Email sent!

Thank you!!

Ingo, posted 03 Apr 08 at 10:10AM:
Hi!

I get the same result... No content!
How was the PDF created? Is it only scanned?
Anyway, I've tested more than one function, and QP can't extract the text in this case. Sorry.

Best regards,
Ingo

devMan, posted 04 Apr 08 at 1:59AM:
Hello,

Whoa... my test PDF is part of another PDF file, and I think the person who split the file created a bad file...
(With another OCX, I get back strings and numbers that are not displayed in the file...)

OK, I'll try with a new test file!


Edit: I've taken a new file to test, and now QuickPDF works very well!!
It's exactly what I'm looking for!!
It parses each part of my PDF as fields with option 3 in the DAExtractPageText() method!
And with the positions of the fields, I'll be able to select an area in the file...
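
Selecting an area could look roughly like the Python sketch below. It works only on the extracted CSV lines and assumes the field order font, color, size, eight corner coordinates, text. The `Fragment` type, the function names and the sample values are made up for illustration.

```python
import csv
from typing import NamedTuple


class Fragment(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float
    text: str


def parse_fragment(line):
    """Build a bounding box from the four corner points of one CSV line."""
    fields = next(csv.reader([line]))
    xs = [float(v) for v in fields[3:11:2]]  # x of each corner
    ys = [float(v) for v in fields[4:11:2]]  # y of each corner
    return Fragment(min(xs), min(ys), max(xs), max(ys), fields[-1])


def fragments_in_area(lines, left, bottom, right, top):
    """Keep only the fragments that lie completely inside the rectangle."""
    selected = []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines between records
        frag = parse_fragment(line)
        if (frag.left >= left and frag.right <= right
                and frag.bottom >= bottom and frag.top <= top):
            selected.append(frag)
    return selected


# Made-up sample values in the assumed layout.
SAMPLE = [
    '"F",#000000,4.00,48.9,1107.1,75.8,1107.1,75.8,1111.9,48.9,1111.9,"Name"',
    '"F",#000000,4.00,91.1,1107.1,97.5,1107.1,97.5,1111.9,91.1,1111.9,"Qty"',
    '"F",#000000,4.00,48.9,1101.9,78.7,1101.9,78.7,1106.7,48.9,1106.7,"Widget"',
    '"F",#000000,4.00,91.1,1101.9,97.5,1101.9,97.5,1106.7,91.1,1106.7,"3"',
]
column = fragments_in_area(SAMPLE, left=90, bottom=1100, right=100, top=1115)
print([frag.text for frag in column])  # ['Qty', '3']
```

Taking the minimum and maximum over all four corners avoids relying on the order in which the corners are listed.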

So, I think my company will buy a license for your ActiveX! ;)


Edited by devMan - 04 Apr 08 at 2:36AM

devMan, posted 04 Apr 08 at 4:59AM:
Can you tell me where I can find the conditions for purchasing a license and all other information about QuickZip, please?

Ingo, posted 04 Apr 08 at 5:53AM:
I don't know where you can get "QuickZip" :-)
Perhaps you mean "QuickPDF" ;-)

Have a look here:
http://www.quickpdf.org/forum/forum_posts.asp?TID=698

Best regards,
Ingo



Edited by Ingo - 04 Apr 08 at 5:54AM

devMan, posted 04 Apr 08 at 7:45AM:
Oops LOL

devMan, posted 04 Apr 08 at 7:55AM:
Last question (normally):

We are a company of 120 users, and there are 5 developers in my team.
Do we need to buy 1 license for everyone, or more?

And after that, a last technical question:
If we scan a paper document with a standard scanner, is it possible that QuickPDF won't extract the text correctly?
Do you have any recommendations?

Thank you for all your support!


Edited by devMan - 04 Apr 08 at 7:57AM

Ingo, posted 04 Apr 08 at 8:29AM:
Normally, scanned text ends up as an image...
Only a few scanners can scan in an OCR mode... In that case you can do text extraction, too.

For your company you need one Enterprise license...

Best regards,
Ingo 

devMan, posted 04 Apr 08 at 9:47AM:
And how much does this Enterprise license cost?

Ingo, posted 04 Apr 08 at 11:35AM:

Sorry... The correct version name is "Site License". It comes with source. Once you have it, you can send me the invoice or one of the smallest files from the source package, and then you'll get a password for the source section to get the latest version.

Please keep in mind: we're doing this here because we like to help... we get nothing and we want nothing... one for all and all for one ;-)
We have nothing to do with the iSED team. It still sells the old version 5.11. Here, many talented "pdf-artists" have pushed the version number up to 6.02... and that's not the end.
 
Best regards,
Ingo 
  

devMan, posted 07 Apr 08 at 3:09AM:
Originally posted by Ingo:

Please keep in mind: we're doing this here because we like to help... we get nothing and we want nothing... one for all and all for one ;-)
We have nothing to do with the iSED team. It still sells the old version 5.11. Here, many talented "pdf-artists" have pushed the version number up to 6.02... and that's not the end.


OK. If I've understood correctly, the iSED team sells only the license for v5.11 of QuickPDF, and you and your team are developing the new versions.

If that's the case, and we want to use your (better) version of QuickPDF, do we have to pay something??
Or must we buy a license for version 5.11 on the iSED site?

Ingo, posted 07 Apr 08 at 4:06AM:
Hi!
Buy an iSED site license...
and send me a copy of the invoice as a PDF or one of the smallest source files.
Then you'll get access to our latest version... and you have to pay nothing.
Best regards,
Ingo

devMan, posted 08 Apr 08 at 1:42AM:
Ok thank you.

Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.