
ExtractFilePageText - Options 0 and 8

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2929
Printed Date: 29 Mar 24 at 1:08AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: ExtractFilePageText - Options 0 and 8
Posted By: mLipok
Subject: ExtractFilePageText - Options 0 and 8
Date Posted: 02 Jul 14 at 12:49PM
In some cases I have an issue like this:

I have a PDF that was scanned and OCRed with FineReader Recognition Server 3..
It contains something like this:

blabla
TEXT1 TEXT2
blablabla

When I use option 0 then I get:
....
....
TEXT1 TEXT2
....
....


When I use option 8 then I get:
....
....
TEXT1
TEXT2
....
....

I need to use option 8 because it gives me all the content.
But I want TEXT1 and TEXT2 to come out on the same line, as they do with option 0.



-------------
Here you can find description how to test my examples:
http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600



Replies:
Posted By: mLipok
Date Posted: 02 Jul 14 at 1:04PM
I need this because of this:
http://www.quickpdf.org/forum/extractfilepagetext-strange-behavior_topic2906.html

By the way, option 7 works OK.

So now I have a question.

What is the real difference between options 7 and 8?

I have observed that with option 7 the result preserves indentation: when I write the output to a text file, text that sits on the right side of the PDF page also ends up on the right, with extra spaces to its left.


Or are there specific cases where option 8 returns more text than option 7?





Posted By: AndrewC
Date Posted: 03 Jul 14 at 7:02AM

We would need to see the original PDF file.

The problem is most likely that the two text blocks are using a different font or could have overlapping bounding boxes. FineReader doesn't always output the cleanest text boxes.

Option 0 will only work on some files. Option 8 extracts all text lines and outputs them one by one. A line of text is considered to be a group of characters that have the same font, size and colour. You can ignore some of these criteria by using SetTextExtractionOptions.

SetTextExtractionOptions is quite powerful and can be used to solve all sorts of complex PDF issues.  

Text extraction, like OCR, is not an exact science. Debenu Quick PDF Library has to decide where words and line breaks are located, which requires characters to be grouped first and then analysed into words and then lines. We can get it wrong when PDFs use strange logic, fonts without any font information, fonts without a ToUnicode table, overlapping bounding boxes, etc.
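
To illustrate the grouping step in general terms, here is a minimal Python sketch that joins text fragments sharing roughly the same baseline into one line, which is the effect asked for above (TEXT1 and TEXT2 on one line). It is not the library's actual algorithm. The fragment format (text, left x, baseline y), the function name merge_fragments_into_lines and the y_tolerance parameter are all assumptions made up for this example; it assumes you can obtain fragment positions by some means and that y grows upward, as in PDF page coordinates.

```python
from typing import List, Tuple

# Hypothetical fragment format: (text, left x, baseline y), y growing upward.
Fragment = Tuple[str, float, float]


def merge_fragments_into_lines(
    fragments: List[Fragment], y_tolerance: float = 2.0
) -> List[str]:
    """Join fragments whose baselines are within y_tolerance into one line."""
    # Order top to bottom (descending y), then left to right.
    ordered = sorted(fragments, key=lambda f: (-f[2], f[1]))

    # Each entry holds the baseline of the line's first fragment and its members.
    lines: List[Tuple[float, List[Fragment]]] = []
    for frag in ordered:
        _, _, y = frag
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current line's baseline: same line.
            lines[-1][1].append(frag)
        else:
            # Otherwise start a new line.
            lines.append((y, [frag]))

    # Inside each line, order fragments left to right and join with spaces.
    return [
        " ".join(text for text, _, _ in sorted(group, key=lambda f: f[1]))
        for _, group in lines
    ]


if __name__ == "__main__":
    sample = [
        ("blabla", 50.0, 700.0),
        ("TEXT2", 200.0, 650.5),  # slightly different baseline, as OCR output often has
        ("TEXT1", 50.0, 650.0),
        ("blablabla", 50.0, 600.0),
    ]
    for line in merge_fragments_into_lines(sample):
        print(line)
    # prints:
    # blabla
    # TEXT1 TEXT2
    # blablabla
```

The tolerance is the main design choice here: too small and OCR output with slightly uneven baselines stays split, too large and neighbouring lines get merged.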


Andrew


Posted By: mLipok
Date Posted: 03 Jul 14 at 7:36AM
I can send you this PDF file, but you would need to send me your public GPG key so that I can encrypt it.





Posted By: AndrewC
Date Posted: 03 Jul 14 at 7:41AM

Michael,
You can create a support case; it will only be seen by support staff and can be deleted when resolved.

Andrew.


Posted By: mLipok
Date Posted: 03 Jul 14 at 7:52AM
I will, but please understand: I follow security procedures for the protection of personal data. Encrypting PDF files with PGP is standard practice in this case, and I cannot ignore my client's internal rules on this point.



