Extract Text (2)
tren
Beginner Joined: 07 Feb 06 Location: Australia Status: Offline Points: 5
Posted: 13 Jun 06 at 10:24PM
Hi there,
I was hoping someone could help me with a problem I'm having, or at least point me in the right direction. I'm trying to extract the co-ordinates of every word on a page of a PDF programmatically. I've tried using GetPageText(4) in conjunction with GetPageText(1), because GetPageText(4) gives extremely corrupted results. In some cases I've got it to work, but in others words are merged into one another or letters are in the wrong order, making it hard to compare the two results. Does anyone know of another package that could let me do this? The price of the product doesn't matter too much, as long as it's reliable. Thank you
chicks
Debenu Quick PDF Library Expert Joined: 29 Oct 05 Location: United States Status: Offline Points: 251
http://pdftohtml.sourceforge.net/
This generally does an excellent job, and it's free. It's command-line only, unless you know C really well... Use the XML output option to get the text, font and positional info. You will probably need to do an XSL transform to get the output into a really usable form. If you need help, I can share some examples.
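As an alternative to an XSL transform, the XML can also be read directly in a script. The following is a minimal sketch, assuming the XML output has `page`, `fontspec` and `text` elements with `top`, `left`, `width`, `height` and `font` attributes. The sample document is hand-written for illustration, not real tool output, and `extract_runs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape the XML output is assumed to have.
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="100" left="50" width="150" height="15" font="0">Hello <b>PDF</b> world</text>
    <text top="120" left="50" width="60" height="15" font="0">Second line</text>
  </page>
</pdf2xml>"""


def extract_runs(xml_text):
    """Return one dict per <text> element with its position and font info."""
    root = ET.fromstring(xml_text)
    # Collect font specs across the whole document, keyed by id, so a run
    # can reference a font declared on an earlier page.
    fonts = {f.get("id"): f for f in root.iter("fontspec")}
    runs = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            font = fonts.get(t.get("font"))
            runs.append({
                "page": int(page.get("number")),
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                # itertext() also picks up text inside inline <b>/<i> tags.
                "text": "".join(t.itertext()),
                "font_family": font.get("family") if font is not None else None,
                "font_size": int(font.get("size")) if font is not None else None,
            })
    return runs


if __name__ == "__main__":
    for run in extract_runs(SAMPLE):
        print(run)
```

Each entry describes a whole run of text rather than a single word, so per-word positions still have to be derived from it.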
tren
Beginner Joined: 07 Feb 06 Location: Australia Status: Offline Points: 5
Chicks, you're a legend.
I've done a bit of XSLT in the past, so I should be OK with this. To get the absolute position of words, I imagine I'm going to have to calculate the width of each word on the line based on font/font size? Thanks again for the help -- btw quickpdf rules.
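One rough way to approach the word-width question without real font metrics is to spread the known width of a text run across its characters. The sketch below assumes every character, spaces included, is equally wide, which is only true for monospaced fonts. For proportional fonts the result is an approximation, and per-glyph widths from the font would be needed for accuracy. `estimate_word_boxes` is a hypothetical helper, not part of any library mentioned in this thread.

```python
def estimate_word_boxes(text, left, width):
    """Estimate (word, left, width) for each word in a text run.

    Assumes all characters (including spaces) share the run's width
    equally, so this is a crude approximation for proportional fonts.
    """
    if not text:
        return []
    char_w = width / len(text)
    boxes = []
    pos = 0  # character offset of the current word within the run
    for word in text.split(" "):
        if word:  # skip empty pieces produced by repeated spaces
            boxes.append((word, left + pos * char_w, len(word) * char_w))
        pos += len(word) + 1  # +1 for the space that followed the word
    return boxes


if __name__ == "__main__":
    for box in estimate_word_boxes("Hello PDF world", left=50, width=150):
        print(box)
```

The top and height of each word can simply be taken from the run itself, since the words of a run share one baseline.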
ECPVFR
Beginner Joined: 17 May 06 Location: Germany Status: Offline Points: 11
Hi tren,
Did you ever try 'ExtractFilePageText' with option '4'? It does the whole job of extracting every piece of text, and option '4' also extracts all the associated information (font, color, text size, position and the text itself).
Best Regards,
Volker
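Output of the kind Volker describes could be post-processed along the following lines. This is only a sketch: it assumes each line is a CSV record of font name, text color, text size, four corner points (X1, Y1 to X4, Y4) and then the text. That column layout and the sample line are assumptions for illustration and should be checked against what the library actually returns. `parse_text_rows` is a hypothetical helper.

```python
import csv
import io

# Hand-written sample line in the ASSUMED layout:
# font, color, size, x1, y1, x2, y2, x3, y3, x4, y4, text
SAMPLE = (
    '"Helvetica",#000000,12.00,'
    "72.00,700.00,110.00,700.00,110.00,712.00,72.00,712.00,"
    '"Hello"\n'
)


def parse_text_rows(csv_text):
    """Parse assumed 12-column CSV records into dicts with corner points."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        if len(rec) < 12:
            continue  # skip blank or malformed lines
        coords = [float(v) for v in rec[3:11]]
        rows.append({
            "font": rec[0],
            "color": rec[1],
            "size": float(rec[2]),
            # Pair up the eight numbers as four (x, y) corner points.
            "corners": list(zip(coords[0::2], coords[1::2])),
            # Rejoin in case unquoted text itself contained commas.
            "text": ",".join(rec[11:]),
        })
    return rows


if __name__ == "__main__":
    for row in parse_text_rows(SAMPLE):
        print(row)
```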