Debenu Quick PDF Library - PDF SDK Community Forum : Extract Text (2)

Author: ECPVFR
Subject: 440
Posted: 12 Jul 06 at 1:27PM

Hi tren,

Did you ever try 'ExtractFilePageText' with option '4'?
It extracts every piece of text on the page, and option '4' also returns the details for each piece (font, color, text size, position and the text itself).
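For anyone post-processing that output, here is a minimal Python sketch of reading it, assuming option '4' hands back one CSV row per piece of text. The column order used here (font, color, size, four corner x/y pairs, then the text) is an assumption for illustration, as are the `TextRun` and `parse_runs` names and the sample row; check the library's function reference for the real layout before relying on it.

```python
import csv
import io
from typing import NamedTuple


class TextRun(NamedTuple):
    font: str
    color: str
    size: float
    corners: tuple  # four (x, y) pairs, in the order they appear in the row
    text: str


def parse_runs(csv_text):
    """Parse CSV rows into TextRun records (ASSUMED 12-column layout)."""
    runs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 12:
            continue  # skip blank or malformed rows
        coords = [float(v) for v in row[3:11]]
        corners = tuple(zip(coords[0::2], coords[1::2]))
        runs.append(TextRun(row[0], row[1], float(row[2]), corners, row[11]))
    return runs


# Made-up sample row, not real library output.
SAMPLE = '"Helvetica","#000000",12,72,700,130.5,700,130.5,712,72,712,"Hello world"\n'

runs = parse_runs(SAMPLE)
for run in runs:
    print(run.text, run.corners[0])
```

Using the `csv` module rather than splitting on commas matters here, because the text field itself can contain commas and quotes.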

Author: tren
Subject: 440
Posted: 13 Jun 06 at 11:51PM

Chicks, you're a legend.

I've done a bit of XSLT in the past so should be ok with this. To get the absolute position of words, I imagine I'm going to have to calculate the width of each word on the line based on font/font size?
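A rough Python sketch of that calculation, assuming each text line comes with its starting x position and font size. The `word_boxes` function, the `widths_em` table and the 0.5 em default advance are all hypothetical; real character widths come from the font's metrics, so treat the result as an estimate only.

```python
def word_boxes(text, x_start, font_size, widths_em=None, default_em=0.5):
    """Estimate (word, x_left, x_right) for each word in a line of text.

    widths_em maps a character to its advance width in em units.
    Characters missing from the table use default_em (an ASSUMED average).
    """
    widths_em = widths_em or {}
    boxes = []
    x = x_start
    word = ""
    word_x = x
    for ch in text:
        advance = widths_em.get(ch, default_em) * font_size
        if ch == " ":
            if word:
                boxes.append((word, word_x, x))  # close the current word
                word = ""
        else:
            if not word:
                word_x = x  # first letter of a new word
            word += ch
        x += advance
    if word:
        boxes.append((word, word_x, x))
    return boxes


boxes = word_boxes("Hi there", x_start=100, font_size=10)
print(boxes)
```

With the default 0.5 em advance and a 10 pt font every character is 5 units wide, so "Hi" spans 100 to 110 and "there" spans 115 to 140. Kerning, character spacing and word spacing settings in the PDF will shift the real positions.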

Thanks again for the help -- btw quickpdf rules.

Author: chicks
Subject: 440
Posted: 13 Jun 06 at 11:14PM

http://pdftohtml.sourceforge.net/

This generally does an excellent job, and it's free. Command-line only, unless you know C really well...

Use the XML output option to get the text, font and positional info. You will probably need to do an XSL transform to get the output into a really usable form - if you need help, I can share some examples.
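As an alternative to XSLT (a swapped-in technique, not the transform mentioned above), the XML can also be read with a few lines of Python. This is a sketch assuming the usual `pdftohtml -xml` layout of `<page>`, `<fontspec>` and `<text top left width height font>` elements; the sample document and the `extract_text_items` name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed pdftohtml -xml shape, not real tool output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="72" width="80" height="14" font="0">Hello <b>world</b></text>
</page>
</pdf2xml>"""


def extract_text_items(xml_string):
    """Return one dict per <text> element with its position, font and text."""
    root = ET.fromstring(xml_string)
    # Collect fonts document-wide: a font id can be reused on later pages.
    fonts = {f.get("id"): f.attrib for f in root.iter("fontspec")}
    items = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            font = fonts.get(t.get("font"), {})
            items.append({
                "page": int(page.get("number")),
                "left": int(t.get("left")),
                "top": int(t.get("top")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                "font": font.get("family"),
                "size": font.get("size"),
                # itertext() also picks up text inside inline <b>/<i> tags
                "text": "".join(t.itertext()),
            })
    return items


items = extract_text_items(SAMPLE_XML)
for item in items:
    print(item)
```

Each item is one text run rather than one word, so getting per-word coordinates still means splitting the run and estimating widths.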

Author: tren
Subject: 440
Posted: 13 Jun 06 at 10:24PM

Hi There,

I was hoping someone could help me with a problem I'm having, or at least point me in the right direction. I'm trying to extract the coordinates of every word on a PDF page programmatically. I've tried using GetPageText(4) in conjunction with GetPageText(1), because GetPageText(4) on its own gives extremely corrupted results. In some cases I've got it to work, but in others words are merged into one another or letters come out in the wrong order, making it hard to compare the two results.

Does anyone know of another package that could let me do this? The price of the product doesn't matter too much, as long as it's reliable.

Thank you