Text extraction and columns
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1727
Topic: Text extraction and columns
Posted By: StMike38
Date Posted: 05 Feb 11 at 10:24PM
I am trying to extract text in continuous form -- right across lines when in normal paragraphs, successive lines within a column when the text is in columns.
GetPageText( 2 ) does a good job of distinguishing columns and keeping their contents separate when extracting text. But in ordinary paragraphs, GetPageText(2) often breaks lines mid-word, and that's a pain.
GetPageText( 3 ) provides better detail in ordinary paragraphs, but intermittently (not always!) it runs columns together.
Is there any way to get GetPageText( 3 ) to return the text in each column separately [without sacrificing its ability to handle ordinary paragraphs]?
StMike38
Replies:
Posted By: Ingo
Date Posted: 06 Feb 11 at 12:40PM
Hi Mike!
QuickPDF doesn't offer real support for extracting text columns. Text comes out first in, first out: if you insert a few corrections into the first line of a page at the very end, those corrections will be extracted last.
I would calculate the columns on my own. The extraction functions return all the position and font data.
If you want the extraction for searching, I can only suggest option 4 (for me the best).
Cheers and welcome here, Ingo
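Ingo's suggestion to calculate the columns from the position data could be sketched roughly as below. This is only an illustration under stated assumptions: the Fragment tuple, its field names, and the min_gutter threshold are hypothetical and are not the library's output format, so the extracted position data would first have to be parsed into this shape. The sketch also assumes PDF-style coordinates where a larger "top" value is higher on the page, and it does not handle full-width paragraphs that span the gutter.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One extracted text piece (hypothetical layout, not the library's format)."""
    left: float   # x of the left edge
    right: float  # x of the right edge
    top: float    # y position; assumed to grow upward, as in PDF coordinates
    text: str


def split_into_columns(fragments: List[Fragment],
                       min_gutter: float = 20.0) -> List[List[Fragment]]:
    """Group fragments into columns separated by an empty band wider than min_gutter."""
    columns: List[List[Fragment]] = []
    column_right = None  # right-most edge seen so far in the current column
    for frag in sorted(fragments, key=lambda f: f.left):
        if column_right is None or frag.left - column_right > min_gutter:
            # The horizontal gap is wide enough to count as a gutter: start a new column.
            columns.append([])
            column_right = frag.right
        else:
            column_right = max(column_right, frag.right)
        columns[-1].append(frag)
    # Inside each column, read top to bottom, then left to right.
    return [sorted(col, key=lambda f: (-f.top, f.left)) for col in columns]


def column_text(column: List[Fragment]) -> str:
    """Join the fragments of one column into continuous text."""
    return " ".join(f.text for f in column)


sample = [
    Fragment(300, 380, 700, "Right"),
    Fragment(50, 120, 700, "Left"),
    Fragment(50, 130, 680, "column"),
    Fragment(300, 370, 680, "side"),
]
texts = [column_text(c) for c in split_into_columns(sample)]
```

The gutter width is a judgment call per document; a value somewhat larger than the widest inter-word space and smaller than the narrowest column gap is a reasonable starting point.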
Posted By: StMike38
Date Posted: 10 Feb 11 at 12:49AM
Ingo,
I agree that option 4 is the best for getting at continuous text. On some PDFs it nicely puts out a word at a time. But in a 2 MB PDF file it suddenly decides to output one or two letters at a time. With variable-width fonts, there is no sure way of recognizing or calculating the space between words. Attempts so far have two undesirable results -- some words wind up fragmented, while parts of separate words get pushed together.
Is there any way to get control so that option 4 actually gets a word at a time?
StMike38
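One heuristic for the fragmentation problem described above is to rejoin pieces whose horizontal gap is small relative to the font size. The sketch below is an assumption-laden illustration, not a library feature: the Piece tuple, its fields, and the gap_ratio default are hypothetical, the input is assumed to be the pieces of a single text line already parsed from the extracted position data, and the threshold will still misjudge some fonts, as the post notes.

```python
from typing import List, NamedTuple


class Piece(NamedTuple):
    """One extracted piece of a single text line (hypothetical layout)."""
    left: float       # x of the left edge
    right: float      # x of the right edge
    font_size: float  # font size in the same units as the coordinates
    text: str


def rebuild_words(line_pieces: List[Piece], gap_ratio: float = 0.2) -> List[str]:
    """Merge neighbouring pieces into words when the gap between them is small.

    Two pieces are treated as part of the same word if the space between them
    is at most gap_ratio * font_size. This is a heuristic, not an exact rule.
    """
    words: List[str] = []
    prev_right = None  # right edge of the previous piece on the line
    for piece in sorted(line_pieces, key=lambda p: p.left):
        close_enough = (
            prev_right is not None
            and piece.left - prev_right <= gap_ratio * piece.font_size
        )
        if close_enough:
            words[-1] += piece.text   # continue the current word
        else:
            words.append(piece.text)  # gap is word-sized: start a new word
        prev_right = piece.right
    return words


line = [
    Piece(10.0, 22.0, 12.0, "Ex"),
    Piece(22.5, 48.0, 12.0, "tract"),
    Piece(52.0, 61.0, 12.0, "te"),
    Piece(61.2, 70.0, 12.0, "xt"),
]
words = rebuild_words(line)
```

Scaling the threshold by font size keeps it roughly stable across headings and body text; tuning gap_ratio per document, or per font, is likely to be necessary.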