General Discussion - Bug when extracting text

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1171
Printed Date: 13 Mar 26 at 6:19PM

Topic: Bug when extracting text

Posted By: AIM
Subject: Bug when extracting text
Date Posted: 12 Aug 09 at 10:09AM

Hi,

I use QuickPDF 7.15 with Option #3 to extract text from PDF files and ran into an annoying bug.

Create a simple PDF file that contains the text "QuickPDF Library" and use another color for the character "P".

Then QuickPDF extracts the following content from "QuickPDF Library":

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,147.3920,776.6920,147.3920,784.7920,56.8000,784.7920,"Quick DF Library"
"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,92.8720,776.6920,92.8720,784.7920,86.2000,784.7920,"P"

As you can see, "P" is extracted after "Quick DF Library", which has a gap where the "P" belongs. The output should definitely be:

...,"Quick"
...,"P"
...,"DF Library"

However, when more than one character uses another color, it works correctly. Use another color for "PD", and the text extraction from "QuickPDF Library" returns the strings in the correct order:

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,86.1040,776.6920,86.1040,784.7920,56.8000,784.7920,"Quick"
"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,101.5600,776.6920,101.5600,784.7920,86.2000,784.7920,"PD"
"BAAAAA+TimesNewRomanPSMT",#000000,12.00,101.5000,776.6920,147.1960,776.6920,147.1960,784.7920,101.5000,784.7920,"F Library"

So it seems that this happens only for single characters. Any chance to get this fixed in the next version?
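
For anyone who wants to work with these records in code: the Option 3 lines quoted above can be read as CSV. This is only a sketch in Python, not QuickPDF code. The field meanings (font name, color, size, four corner x/y pairs, text) are an assumption read off the sample lines, and `Fragment` and `parse_option3` are made-up names.

```python
import csv
from dataclasses import dataclass


@dataclass
class Fragment:
    font: str
    color: str
    size: float
    corners: list  # four (x, y) pairs, in the order they appear in the record
    text: str


def parse_option3(lines):
    """Parse Option 3 style records (assumed layout) into Fragment objects."""
    fragments = []
    for row in csv.reader(lines):
        if len(row) != 12:
            continue  # skip blank lines and anything with an unexpected layout
        nums = [float(v) for v in row[3:11]]
        corners = list(zip(nums[0::2], nums[1::2]))
        fragments.append(Fragment(row[0], row[1], float(row[2]), corners, row[11]))
    return fragments


sample = [
    '"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,147.3920,776.6920,'
    '147.3920,784.7920,56.8000,784.7920,"Quick DF Library"',
    '"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,92.8720,776.6920,'
    '92.8720,784.7920,86.2000,784.7920,"P"',
]
fragments = parse_option3(sample)
for f in fragments:
    print(f.color, f.corners[0], repr(f.text))
```

Parsed this way, it is easy to see that the red "P" sits horizontally inside the box of the black string.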

Replies:

Posted By: Ingo
Date Posted: 12 Aug 09 at 10:43AM

Hi AIM!

Looking at your sample, you seem to expect the output to be sorted by row and starting column...
and this would make sense ;-)
Instead it works like this: first in, first out... last in, last out... and the position of a string doesn't matter.

If you insert "QuickPDF Library" and later make "Qui" red, then "Qui" will be the last string.

Cheers, Ingo

Posted By: AIM
Date Posted: 12 Aug 09 at 12:53PM

> Looking at your sample, you seem to expect the output to be sorted by row and starting column...
> and this would make sense ;-)
> Instead it works like this: first in, first out... last in, last out... and the position of a string doesn't matter.

Ingo, I'm not sure I fully understand your answer. But if you have, for example, "Ingo" in your PDF, you would get "In o" and "g". I don't see how or why this would make sense, e.g. if I want to search a PDF for "ingo".

In my opinion, "In" + "g" + "o" would be the only correct solution. This is at least how it works if two or more letters are in red.

In the meantime I also saw that it happens only with single characters in the middle of a word, not at the beginning.

OK, let's use the following examples:

- "Ingo" with "I" in red extracts "I" + "ngo" ..... OK
- "Ingo" with "g" in red extracts "In o" + "g" ..... error in my opinion
- "Ingo" with "ng" in red extracts "I" + "ng" + "o" ..... OK

In all three tests I entered "Ingo" and colored a character in red afterwards.

> If you insert "QuickPDF Library" and later make "Qui" red, then "Qui" will be the last string.

Do you mean that QuickPDF should extract "ckPDF Library" + "Qui"?

In that case, however, you get "Qui" + "ckPDF Library" (which is correct in my opinion).

Thanks,
Martin

Posted By: Ingo
Date Posted: 12 Aug 09 at 1:39PM

Hi Martin!

If you insert "ngo" first and then "I", the extraction returns "ngo" as the first string and "I" as the second.
If you insert "I" first and then "ngo", the extraction returns "I" as the first string and "ngo" as the second.
That's how text content is stored in a PDF. It has nothing to do with QuickPDF.
If you write a whole page of text and at the end insert a single character at the top left position, the extraction WITH OPTION 3 will return this character as the very last string... First in, first out ;-)
With option 0, for example, you can avoid this behavior. Option 0 concatenates the strings the way they should be, so if you want to do a text search you shouldn't use option 3.

Cheers, Ingo

Posted By: AIM
Date Posted: 12 Aug 09 at 6:19PM

> If you insert "ngo" first and then "I", the extraction returns "ngo" as the first string and "I" as the second.
> If you insert "I" first and then "ngo", the extraction returns "I" as the first string and "ngo" as the second.
> That's how text content is stored in a PDF. It has nothing to do with QuickPDF.
> If you write a whole page of text and at the end insert a single character at the top left position, the extraction WITH OPTION 3 will return this character as the very last string... First in, first out ;-)

OK, I understand your answer, but it doesn't explain the different behavior of QuickPDF for the 3 examples I gave (I always entered the text in Open Office, colored the letters afterwards, and then created the PDF). So either one or two of them do not work correctly.

> With option 0, for example, you can avoid this behavior. Option 0 concatenates the strings the way they should be...

The other options seem to be a bit buggy. Option 3 always extracts the most text (except for this annoyance with single characters).

OK, back to the "QuickPDF Library." example.

Option 0 gives the following output:

k  .

Option 1 and 2 give the following output:

56.80,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","k"
93.50,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","."

Option 3 gives the following output:

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,149.9120,776.6920,149.9120,784.7920,56.8000,784.7920,"Quick DF Library."
"CAAAAA+TimesNewRomanPS-BoldMT",#000000,12.00,86.2000,776.6920,93.5200,776.6920,93.5200,784.7920,86.2000,784.7920,"P"

Options 0, 1 and 2 are completely useless in that example. Option 4 would work here but didn't extract as much as Option 3 from several other PDFs I have tried (so not a real solution in my case).

So what would you suggest to fully extract these two words? Or is it impossible?

Thanks for any tips,
Martin

Posted By: Ingo
Date Posted: 12 Aug 09 at 7:16PM

Hi Martin!

So the best way is to use option 3 and concatenate the single strings yourself, using their row and column values. A PDF page in A4 format is a grid of 595 x 842 points (width x height). These single points are called PSUnits. The first PSUnit is at the bottom left of the page. Anything (pictures, text strings, ...) can be put on the page at any time. The coordinates inside the PDF say where the objects shall appear. Please keep in mind that below the surface a PDF doesn't look as nice as it does later in the PDF reader ;-)
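
A minimal sketch of what concatenating by row and column could look like. It is plain Python, not the QuickPDF API. The name `join_by_position`, the `(x, y, text)` tuples and the row tolerance of 2 points are assumptions for illustration. The coordinates come from the "PD" example earlier in this thread, deliberately shuffled.

```python
def join_by_position(fragments, row_tolerance=2.0):
    """Join (x, y, text) fragments into lines, top to bottom, left to right.

    Assumes the origin is at the bottom left of the page, so a larger y
    means a row closer to the top.
    """
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    rows = []  # each entry: [y of the row's first fragment, [(x, text), ...]]
    for x, y, text in ordered:
        if rows and abs(rows[-1][0] - y) <= row_tolerance:
            rows[-1][1].append((x, text))  # same row as the previous fragment
        else:
            rows.append([y, [(x, text)]])  # start a new row
    # Within a row, order by x and concatenate without adding separators
    return ["".join(t for _, t in sorted(items)) for _, items in rows]


# The three fragments of the "PD" example, in a shuffled (hypothetical) order
stream_order = [
    (101.5, 776.692, "F Library"),
    (56.8, 776.692, "Quick"),
    (86.2, 776.692, "PD"),
]
print(join_by_position(stream_order))  # ['QuickPDF Library']
```

This only reorders whole strings. It inserts no spaces between fragments, and it cannot repair a string such as "Quick DF Library" that already contains a gap where another fragment belongs.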

Cheers, Ingo

Posted By: AIM
Date Posted: 12 Aug 09 at 8:40PM

> So the best way is to use option 3 and concatenate the single strings yourself, using their row and column values. A PDF page in A4 format is a grid of 595 x 842 points (width x height). These single points are called PSUnits.

Do you have any code snippets or demos?
But I think this is too complicated for a feature like text extraction, which should be part of QuickPDF (where it nearly works).

I'm sorry, but I still believe that this is a bug in QuickPDF. If the "Ingo" example #2 behaved like examples #1 and #3, there wouldn't be a problem and everything would work perfectly.

JFYI, I tried your pdftext.dll and it has the same problem! Here is the output of your DLL:

page 1 / 1
 
Quick DF Library.
P

Posted By: swb1
Date Posted: 12 Aug 09 at 9:21PM

Martin,

Ingo is correct in that this is not a bug but rather the nature of the way that the PDF is constructed. There is no rule that says text that appears to be one word when displayed by Acrobat or rendered by QuickPDF actually be stored as one word inside the PDF. I have seen PDFs that were constructed one letter at a time! Each letter would appear as a single text element complete with location and font information. While this is an extremely inefficient way to construct a PDF it works nonetheless and appears to be just fine from the outside when rendered by Acrobat.

The text extraction routines of QuickPDF do not re-assemble the words. These routines merely extract the text as it is stored in the document and tell you where it should appear on the page and how it should be formatted.

A text extraction routine that is smart enough to re-assemble the words and tell me their origins would be a terrific enhancement to the library but as far as I know no such routine exists here today.
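
As a rough idea of what such a re-assembly step might do for Martin's single-character case, here is a heuristic sketch in Python. It is not part of QuickPDF. The name `fill_gap` and the `(x_left, x_right, text)` tuples are made up, and the caller is assumed to have checked that both fragments share a baseline. It assumes every character in the outer string is equally wide, which is not true for a proportional font, and that the displaced character left a blank behind.

```python
def fill_gap(outer, inner):
    """Insert inner's text into the blank of outer that lies closest to it.

    outer, inner: (x_left, x_right, text). Returns the merged text, or None
    when inner does not lie inside outer or outer has no blank to fill.
    """
    o_left, o_right, o_text = outer
    i_left, i_right, i_text = inner
    if o_right <= o_left or not (o_left <= i_left and i_right <= o_right):
        return None
    # Estimate the character index under inner's left edge,
    # assuming equal width per character (a crude approximation)
    guess = (i_left - o_left) / (o_right - o_left) * len(o_text)
    blanks = [i for i, ch in enumerate(o_text) if ch == " "]
    if not blanks:
        return None
    idx = min(blanks, key=lambda i: abs(i - guess))
    # Replace that blank with the inner text
    return o_text[:idx] + i_text + o_text[idx + 1:]


# x ranges taken from the Option 3 output in the first post
outer = (56.8, 147.392, "Quick DF Library")
inner = (86.2, 92.872, "P")
merged = fill_gap(outer, inner)
print(merged)  # QuickPDF Library
```

The estimate can pick a genuine word space when one lies closer than the real gap, so a serious routine would need actual glyph widths from the font.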

If Debenu does not add such a feature soon (hint, hint Karl;-) ) I will probably have to write one of my own.

Best of luck to you,

Steve

Posted By: AIM
Date Posted: 12 Aug 09 at 10:18PM

Thanks for all your information and suggestions. I think I understand now that my real "problem" with these 3 examples arises at PDF creation time.

Seems that I have to invest some time and implement a fully working text extraction myself...