Extract Text From Rendered Pages.

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2309
Printed Date: 19 May 26 at 10:10AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Extract Text From Rendered Pages.
Posted By: alinux08
Subject: Extract Text From Rendered Pages.
Date Posted: 20 Jun 12 at 3:25PM
Is it possible to extract text from a rendered page based on a user-defined bounding box?

Thanks.

Mark



Replies:
Posted By: Ingo
Date Posted: 20 Jun 12 at 8:03PM
Hi Mark!

To me, a rendered page means an image.
So I don't think it's possible to extract text from it.

Cheers and welcome here,
Ingo


Posted By: alinux08
Date Posted: 22 Jun 12 at 10:42PM
Ingo, thanks.

What about extracting text from the actual page based on a defined bounding box?


Posted By: AndrewC
Date Posted: 23 Jun 12 at 8:44AM

You can use SetTextExtractionArea to limit the extraction results.

http://www.quickpdflibrary.com/help/quickpdf/SetTextExtractionArea.php

If you want to perform multiple extractions from the same page, it is more efficient to call GetPageText once with option 3 or 4 and filter the returned bounding boxes yourself, which is quite easy to do.
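
As a rough illustration of filtering the results yourself, here is a minimal Python sketch. It is not taken from the library documentation: it assumes that GetPageText(3) or (4) returns one comma-separated record per text block in the form font, color, size, x1, y1, x2, y2, x3, y3, x4, y4, text, so check that layout against the help for your version. The names parse_blocks and text_in_box and the sample string are hypothetical, made up for this example.

```python
import csv
import io
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float
    text: str


def parse_blocks(page_text: str) -> List[TextBlock]:
    """Parse records ASSUMED to look like:
    font, color, size, x1, y1, x2, y2, x3, y3, x4, y4, text
    """
    blocks = []
    for row in csv.reader(io.StringIO(page_text), skipinitialspace=True):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        try:
            xs = [float(v) for v in row[3:11:2]]  # x1, x2, x3, x4
            ys = [float(v) for v in row[4:11:2]]  # y1, y2, y3, y4
        except ValueError:
            continue  # coordinates were not numeric
        text = ",".join(row[11:])  # re-join text that held unquoted commas
        # Normalise the four corners to an axis-aligned rectangle.
        blocks.append(TextBlock(min(xs), min(ys), max(xs), max(ys), text))
    return blocks


def text_in_box(blocks: List[TextBlock], left: float, bottom: float,
                right: float, top: float) -> List[str]:
    """Return the text of every block lying entirely inside the box."""
    return [b.text for b in blocks
            if b.left >= left and b.right <= right
            and b.bottom >= bottom and b.top <= top]


# Made-up sample in the assumed format, standing in for real GetPageText output.
sample = (
    'Helvetica,#000000,12,72,700,110,700,110,712,72,712,"Invoice"\n'
    'Helvetica,#000000,12,300,700,340,700,340,712,300,712,"Total"\n'
)
blocks = parse_blocks(sample)
print(text_in_box(blocks, 50, 690, 200, 720))  # prints ['Invoice']
```

The page text is parsed once and text_in_box can then be called for as many regions as needed. The box must be given in the same units and origin as the extracted coordinates.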

If you can highlight and select (copy/paste) text using Acrobat Reader, then it should be possible to use GetPageText to perform text extraction. This can apply even to image-based documents, because many of them have been processed using OCR and so contain a text layer.

Andrew.


