I need help - I can help - Extract Text (2)

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=440
Printed Date: 19 May 24 at 2:11AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: Extract Text (2)

Posted By: tren
Subject: Extract Text (2)
Date Posted: 13 Jun 06 at 10:24PM

Hi There,

I was hoping someone could help me with a problem I'm having, or at least point me in the right direction. I'm trying to extract the co-ordinates of every word on a page of a PDF programmatically. I've tried using GetPageText(4) in conjunction with GetPageText(1), because GetPageText(4) on its own gives extremely corrupted results. In some cases I've got it to work, but in others words are merged into one another or letters are in the wrong order, making it hard to compare the two results.

Does anyone know of another package that could let me do this? The price of the product doesn't matter too much, as long as it's reliable.

Thank you

Replies:

Posted By: chicks
Date Posted: 13 Jun 06 at 11:14PM

http://pdftohtml.sourceforge.net/

This generally does an excellent job, and it's free. Command-line only, unless you know C really well...

Use the XML output option to get the text, font and positional info. You will probably need to do an XSL transform to get the output into a really usable form - if you need help, I can share some examples.
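As an alternative to XSLT, here is a rough Python sketch of reading that XML output. It assumes the layout pdftohtml typically produces with its XML option: `page` elements containing `fontspec` and `text` elements, with `top`, `left`, `width`, `height` and `font` attributes. The sample document and the function name `extract_text_runs` are made up for illustration, so check them against your own output before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the XML that pdftohtml is assumed to emit.
SAMPLE_XML = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="100" left="50" width="110" height="15" font="0">Hello world</text>
    <text top="120" left="50" width="60" height="15" font="0"><b>Bold</b> run</text>
  </page>
</pdf2xml>"""


def extract_text_runs(xml_string):
    """Return one dict per <text> element with its position and font info."""
    root = ET.fromstring(xml_string)

    # Collect font definitions for the whole document, keyed by their id.
    fonts = {f.get("id"): f.attrib for f in root.iter("fontspec")}

    runs = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            font = fonts.get(t.get("font"), {})
            runs.append({
                "page": int(page.get("number")),
                "left": int(t.get("left")),
                "top": int(t.get("top")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                # itertext() also picks up text nested inside <b>/<i> tags.
                "text": "".join(t.itertext()),
                "font_family": font.get("family"),
                "font_size": font.get("size"),
            })
    return runs


if __name__ == "__main__":
    for run in extract_text_runs(SAMPLE_XML):
        print(run)
```

Each entry describes a run of text (often a whole line or fragment), not a single word, so per-word positions still need a further step.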

Posted By: tren
Date Posted: 13 Jun 06 at 11:51PM

Chicks, you're a legend.

I've done a bit of XSLT in the past so should be ok with this. To get the absolute position of words, I imagine I'm going to have to calculate the width of each word on the line based on font/font size?

Thanks again for the help -- btw quickpdf rules.
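For the word-width question, a crude first approximation is to split each text run's box in proportion to character count. The sketch below assumes every character in the run has the same width, which is only really true for monospaced fonts; proper results need the actual font metrics. The function name `estimate_word_boxes` and its parameters are hypothetical, chosen to match the `left`/`top`/`width`/`height` values of a text run.

```python
def estimate_word_boxes(left, top, width, height, text):
    """Estimate a bounding box per word inside one text run.

    Assumption: all characters (including spaces) are equally wide,
    so this is only an approximation for proportional fonts.
    """
    if not text:
        return []

    char_width = width / len(text)
    boxes = []
    pos = 0  # character offset of the current word within the run
    for word in text.split(" "):
        if word:  # skip empty pieces produced by consecutive spaces
            boxes.append({
                "word": word,
                "left": left + pos * char_width,
                "top": top,
                "width": len(word) * char_width,
                "height": height,
            })
        pos += len(word) + 1  # +1 for the space that followed the word
    return boxes


if __name__ == "__main__":
    for box in estimate_word_boxes(50, 100, 110, 15, "Hello world"):
        print(box)
```

If the estimates drift too far on long lines, the per-character widths from the font itself would be the thing to swap in for `char_width`.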

Posted By: ECPVFR
Date Posted: 12 Jul 06 at 1:27PM

Hi tren,

Did you ever try 'ExtractFilePageText' with option '4'?
It does the whole job of extracting every piece of text, and option '4' also returns the associated information for each piece (font, color, text size and position, along with the text itself).

-------------
Best Regards,
Volker