
Text extraction and columns

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1727
Printed Date: 12 May 25 at 11:21PM


Topic: Text extraction and columns
Posted By: StMike38
Date Posted: 05 Feb 11 at 10:24PM
I am trying to extract text in continuous form -- right across lines when in normal paragraphs, successive lines within a column when the text is in columns.

GetPageText( 2 ) does a good job of distinguishing columns and keeping their contents separate when extracting text. But in ordinary paragraphs, GetPageText(2) often breaks lines mid-word, and that's a pain.

GetPageText( 3 ) provides better detail in ordinary paragraphs, but intermittently (not always!) it runs columns together.

Is there any way to get GetPageText( 3 ) to return the text in each column separately [without sacrificing its ability to handle ordinary paragraphs]?

StMike38



Replies:
Posted By: Ingo
Date Posted: 06 Feb 11 at 12:40PM
Hi Mike!

QuickPDF doesn't offer real support for extracting text columns.
Text comes out in the order it was written into the page: first in, first out. So if a few corrections were inserted into the first line of the page at the very end,
those corrections will be extracted last.

I would calculate the columns myself.
The extract-functions offer all position and font data.

If you want the extraction for searching, I can only suggest option 4 (for me the best).

Cheers and welcome here,
Ingo
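
Ingo's suggestion to calculate the columns from the position data could be sketched roughly as below. This is only an illustration, not Quick PDF code: it assumes the extraction output has already been parsed into fragments with left/right/top coordinates, and the Fragment class, split_into_columns, column_text and the min_gutter value are all hypothetical names and settings chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    # Hypothetical container; fill it from whatever position data you extract.
    text: str
    left: float
    right: float
    top: float  # assumes PDF-style coordinates: larger value = higher on the page

def split_into_columns(fragments, min_gutter=15.0):
    """Group fragments into columns separated by an empty vertical band."""
    if not fragments:
        return []
    ordered = sorted(fragments, key=lambda f: f.left)
    columns = [[ordered[0]]]
    column_right = ordered[0].right
    for frag in ordered[1:]:
        if frag.left - column_right >= min_gutter:
            # Wide empty band to the left of this fragment: start a new column.
            columns.append([frag])
            column_right = frag.right
        else:
            columns[-1].append(frag)
            column_right = max(column_right, frag.right)
    return columns

def column_text(column):
    """Read one column top to bottom, then left to right."""
    ordered = sorted(column, key=lambda f: (-f.top, f.left))
    return " ".join(f.text for f in ordered)

if __name__ == "__main__":
    page = [
        Fragment("Left one", 50, 200, 700),
        Fragment("Right one", 300, 450, 700),
        Fragment("Left two", 50, 210, 680),
        Fragment("Right two", 300, 440, 680),
    ]
    for col in split_into_columns(page):
        print(column_text(col))
```

A full-width paragraph on the same page would bridge the gutter and merge everything into one column, so in practice this would probably have to be run separately on each horizontal band of the page.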


Posted By: StMike38
Date Posted: 10 Feb 11 at 12:49AM
Ingo,

I agree that option 4 is the best for getting at continuous text. On some PDFs it nicely puts out a word at a time. But in a 2 MB PDF file it suddenly decides to output one or two letters at a time. With variable width fonts, there is no sure way of recognizing / calculating space between words. Attempts so far have had two undesirable results -- some words wind up fragmented, while parts of separate words get pushed together.

Is there any way to get control so that option 4 actually gets a word at a time?

StMike38
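
One common heuristic for the fragmentation problem described above is to compare the horizontal gap between neighbouring pieces with a fraction of the font size, and to start a new word only when the gap exceeds it. The sketch below is an assumption-laden illustration, not a Quick PDF feature: join_pieces, the tuple layout (text, left, right, font_size) and the gap_ratio of 0.2 are all hypothetical, and as noted in the post no threshold is reliable for every variable-width font.

```python
def join_pieces(pieces, gap_ratio=0.2):
    """Merge the pieces of one text line into words.

    pieces: list of (text, left, right, font_size) tuples, all on the same line.
    A gap wider than gap_ratio * font_size is treated as a word break.
    """
    words = []
    current = ""
    prev_right = None
    for text, left, right, size in sorted(pieces, key=lambda p: p[1]):
        if prev_right is not None and left - prev_right > gap_ratio * size:
            # Gap is large relative to the font size: close the current word.
            words.append(current)
            current = ""
        current += text
        prev_right = right
    if current:
        words.append(current)
    return words

if __name__ == "__main__":
    line = [
        ("He", 10.0, 20.0, 10.0),
        ("llo", 20.5, 32.0, 10.0),
        ("wor", 36.0, 50.0, 10.0),
        ("ld", 50.2, 58.0, 10.0),
    ]
    print(join_pieces(line))
```

The ratio would need tuning per document: too low and words stay fragmented, too high and separate words get pushed together, which are the two failure modes mentioned above.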


