Extract Text (2)

tren (Beginner), posted 13 Jun 06 at 10:24PM
Hi There,

I was hoping someone could help me with a problem I'm having, or at least point me in the right direction. I'm trying to extract the coordinates of every word on a PDF page programmatically. I've tried using GetPageText(4) in conjunction with GetPageText(1), because GetPageText(4) on its own gives extremely corrupted results. In some cases I've got it to work, but in others words are merged into one another or letters appear in the wrong order, which makes it hard to compare the two results.

Does anyone know of another package that could let me do this? The price of the product doesn't matter too much, as long as it's reliable.

Thank you

chicks (Debenu Quick PDF Library Expert), posted 13 Jun 06 at 11:14PM
http://pdftohtml.sourceforge.net/

This generally does an excellent job, and it's free. Command-line only, unless you know C really well...

Use the XML output option to get the text, font and positional info. You will probably need to do an XSL transform to get the output into a really usable form - if you need help, I can share some examples.
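
For readers who would rather post-process the XML in a script than with XSL, here is a minimal Python sketch of the same idea. It is not one of the examples offered above. The sample document is made up, and its layout (`page`, `fontspec` and `text` elements with `top`, `left`, `width`, `height` and `font` attributes) is an assumption about what the XML output looks like, so check it against a real file before relying on it.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the element and attribute names are assumed, not guaranteed.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="150" height="15" font="0">Hello <b>PDF</b> world</text>
<text top="120" left="50" width="60" height="15" font="0">Second line</text>
</page>
</pdf2xml>"""


def extract_text_runs(xml_string):
    """Return one dict per <text> element with its position, font and text."""
    root = ET.fromstring(xml_string)
    # Collect font definitions document-wide, keyed by their id attribute.
    fonts = {f.get("id"): dict(f.attrib) for f in root.iter("fontspec")}
    runs = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            font = fonts.get(text.get("font"), {})
            runs.append({
                "page": int(page.get("number")),
                "left": int(text.get("left")),
                "top": int(text.get("top")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
                "font_family": font.get("family"),
                "font_size": font.get("size"),
                # itertext() flattens inline markup such as <b> or <i>.
                "text": "".join(text.itertext()),
            })
    return runs


if __name__ == "__main__":
    for run in extract_text_runs(SAMPLE_XML):
        print(run["page"], run["left"], run["top"], run["text"])
```

Each entry describes a run of text (often a whole line) rather than a single word, so per-word positions still need a further step.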

tren (Beginner), posted 13 Jun 06 at 11:51PM
Chicks, you're a legend.

I've done a bit of XSLT in the past so should be ok with this. To get the absolute position of words, I imagine I'm going to have to calculate the width of each word on the line based on font/font size?

Thanks again for the help -- btw quickpdf rules.
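
On the question of word widths: if the tool reports a position and width only for each run of text, one crude way to estimate per-word positions is to divide the run's width across its characters. The Python sketch below does this under the loud assumption that every character, spaces included, is equally wide. That holds only for monospaced fonts; for proportional fonts the result is an approximation, and real glyph metrics for the font and size would be needed for accuracy. The function name and the tuple layout are hypothetical.

```python
def estimate_word_boxes(text, left, width):
    """Estimate (word, left, width) for each word in a run of text.

    Assumes every character (spaces included) has the same width,
    which is only an approximation for proportional fonts.
    """
    if not text:
        return []
    char_width = width / len(text)
    boxes = []
    offset = 0  # character index where the current word starts
    for word in text.split(" "):
        if word:
            boxes.append((word, left + offset * char_width, len(word) * char_width))
        # Advance past the word and the single space that followed it.
        offset += len(word) + 1
    return boxes


if __name__ == "__main__":
    for word, word_left, word_width in estimate_word_boxes("Hello PDF world", 50, 150):
        print(word, word_left, word_width)
```

Since the vertical position and height apply to the whole run, only the horizontal extent needs to be estimated this way.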

ECPVFR (Beginner), posted 12 Jul 06 at 1:27PM
Hi tren,

Did you ever try 'ExtractFilePageText' with option '4'?
This does the whole job of extracting every piece of text, and option '4' also extracts all the associated information (font, color, text size and position, along with the text itself).
Best Regards,
Volker