Bug when extracting text

AIM (Beginner, joined 12 Aug 09), posted 12 Aug 09 at 10:09AM

Hi,

I use QuickPDF 7.15 with Option #3 to extract text from PDF files and ran into an annoying bug.

Create a simple PDF file that contains the text "QuickPDF Library" and use another color for the character "P".

Then QuickPDF extracts the following content from "QuickPDF Library":

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,147.3920,776.6920,147.3920,784.7920,56.8000,784.7920,"Quick DF Library"
"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,92.8720,776.6920,92.8720,784.7920,86.2000,784.7920,"P"

As you can see, "P" is extracted after "Quick DF Library" with a missing "P", but the output should definitely be:

...,"Quick"
...,"P"
...,"DF Library"

However, when more than one character is in another color, it works correctly. If you use another color for "PD", the text extraction from "QuickPDF Library" returns the strings in the correct order:

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,86.1040,776.6920,86.1040,784.7920,56.8000,784.7920,"Quick"
"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,101.5600,776.6920,101.5600,784.7920,86.2000,784.7920,"PD"
"BAAAAA+TimesNewRomanPSMT",#000000,12.00,101.5000,776.6920,147.1960,776.6920,147.1960,784.7920,101.5000,784.7920,"F Library"

So it seems that this happens only for single characters. Any chance to get this fixed in the next version?
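
For anyone who wants to process this output in code, here is a minimal Python sketch that splits Option 3 lines into fields. The field meanings (font name, color, size, four corner x/y pairs, text) are inferred from the sample lines above and are an assumption, not taken from the QuickPDF documentation. `Fragment` and `parse_option3` are hypothetical names.

```python
import csv
from dataclasses import dataclass


@dataclass
class Fragment:
    font: str
    color: str
    size: float
    corners: list  # four (x, y) pairs, in the order they appear in the line
    text: str


def parse_option3(output: str) -> list:
    """Parse comma-separated extraction lines into Fragment records."""
    fragments = []
    for row in csv.reader(output.splitlines()):
        if len(row) != 12:
            continue  # skip blank or unexpected lines
        nums = [float(v) for v in row[3:11]]
        corners = list(zip(nums[0::2], nums[1::2]))
        fragments.append(Fragment(row[0], row[1], float(row[2]), corners, row[11]))
    return fragments


# The two lines reported in this post (assumed layout, see above).
sample = '''"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,147.3920,776.6920,147.3920,784.7920,56.8000,784.7920,"Quick DF Library"
"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,92.8720,776.6920,92.8720,784.7920,86.2000,784.7920,"P"'''

for frag in parse_option3(sample):
    print(frag.color, frag.corners[0], repr(frag.text))
```

Parsing with the `csv` module keeps commas inside the quoted text field intact, which a plain `split(",")` would not.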

Ingo (Moderator Group, joined 29 Oct 05), posted 12 Aug 09 at 10:43AM

Hi AIM!

Looking at your sample, it seems you expect the output to be sorted by row and starting column...
and that would make sense ;-)
Instead it works like this: first in, first out ... last in, last out ... and the position of a string on the page doesn't matter.

If you inserted "QuickPDF Library" and later made "Qui" red, then "Qui" would be the last string.

Cheers, Ingo
 
AIM (Beginner), posted 12 Aug 09 at 12:53PM

Quote Looking at your sample, it seems you expect the output to be sorted by row and starting column...
and that would make sense ;-)
Instead it works like this: first in, first out ... last in, last out ... and the position of a string on the page doesn't matter.

Ingo, I'm not sure I fully understand your answer. But if you have, for example, "Ingo" in your PDF, you would get "In o" and "g". I don't see how or why this would make sense, e.g. if I want to search a PDF for "ingo".

In my opinion, "In" + "g" + "o" would be the only correct result. At least, this is how it works if two or more letters are in red.

In the meantime I also saw that it happens only with single characters in the middle of a word, not at the beginning.

OK, let's use the following examples:

- "Ingo" (with "I" in red) extracts "I" + "ngo" ..... OK
- "Ingo" (with "g" in red) extracts "In o" + "g" ..... error in my opinion
- "Ingo" (with "ng" in red) extracts "I" + "ng" + "o" ..... OK

In all three tests I entered "Ingo" and colored a character in red afterwards.
 
Quote If you inserted "QuickPDF Library" and later made "Qui" red, then "Qui" would be the last string.

Do you mean that QuickPDF should extract "ckPDF Library" + "Qui" ?

But in that case you get "Qui" + "ckPDF Library" (which is correct in my opinion).

Thanks,
Martin

Ingo (Moderator Group), posted 12 Aug 09 at 1:39PM

Hi Martin!

If you insert "ngo" first and then "I", the extraction will return "ngo" as the first string and "I" as the second.
If you insert "I" first and then "ngo", the extraction will return "I" as the first string and "ngo" as the second.
That's how text content is stored in a PDF. It has nothing to do with QuickPDF.
If you write a whole page of text and at the end insert a single character at the top left position, the extraction WITH OPTION 3 will return this character as the very last string... First in, first out ;-)
If you use option 0, for example, you can avoid this behavior. Option 0 concatenates the strings the way they should be, so if you want to do a text search you shouldn't use option 3.

Cheers, Ingo

AIM (Beginner), posted 12 Aug 09 at 6:19PM

Quote If you insert "ngo" first and then "I", the extraction will return "ngo" as the first string and "I" as the second.
If you insert "I" first and then "ngo", the extraction will return "I" as the first string and "ngo" as the second.
That's how text content is stored in a PDF. It has nothing to do with QuickPDF.
If you write a whole page of text and at the end insert a single character at the top left position, the extraction WITH OPTION 3 will return this character as the very last string... First in, first out ;-)

OK, I understand your answer, but it doesn't explain the different behavior of QuickPDF in the 3 examples I gave (I always entered the text in Open Office, colored the letters afterwards, and then created the PDF). So either one or two of them do not work correctly.

Quote If you use option 0, for example, you can avoid this behavior. Option 0 concatenates the strings the way they should be ...

The other options seem to be a bit buggy. Option 3 always extracts the most text (except for this annoyance with single characters).

OK, back to the "QuickPDF Library." example.

Option 0 gives the following output:

k  .

Option 1 and 2 give the following output:

56.80,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","k"
93.50,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","."

Option 3 gives the following output:

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,149.9120,776.6920,149.9120,784.7920,56.8000,784.7920,"Quick DF Library."
"CAAAAA+TimesNewRomanPS-BoldMT",#000000,12.00,86.2000,776.6920,93.5200,776.6920,93.5200,784.7920,86.2000,784.7920,"P"

Options 0, 1 and 2 are completely useless in that example. Option 4 would work here but didn't extract as much as Option 3 from several other PDFs I have tried (so not a real solution in my case).

So what would you suggest to fully extract these two words? Or is it impossible?

Thanks for any tips,
Martin



Edited by AIM - 12 Aug 09 at 6:21PM

Ingo (Moderator Group), posted 12 Aug 09 at 7:16PM

Hi Martin!

So the best way is to use option 3 and join the individual strings yourself based on their row and column values. A PDF page is laid out as a grid of points, for example 842 x 595 (rows x columns) for A4 in portrait. These single points are called PSUnits. The first PSUnit is at the bottom left of the page. Any object (pictures, text strings, ...) can be placed on the page at any time. The coordinates inside the PDF say where the objects should appear. Please keep in mind that below the surface a PDF doesn't look as nice as it does later in the PDF reader ;-)
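
A minimal Python sketch of this idea, assuming each fragment has already been reduced to a hypothetical (x_left, x_right, y, text) tuple taken from the Option 3 coordinates: group the fragments into rows by y, sort each row by x, and join them. `join_fragments` and the tolerance values are illustrative assumptions, not part of QuickPDF.

```python
def join_fragments(fragments, row_tol=2.0, gap_tol=1.0):
    """Reassemble text lines from (x_left, x_right, y, text) tuples.

    Returns a list of strings, top of the page first. The origin is
    assumed to be at the bottom left, so a larger y is higher up.
    """
    rows = []  # each row is a list of fragments with a similar y
    for frag in sorted(fragments, key=lambda f: -f[2]):
        if rows and abs(rows[-1][0][2] - frag[2]) <= row_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    lines = []
    for row in rows:
        row.sort(key=lambda f: f[0])  # left to right within the row
        text = row[0][3]
        for prev, cur in zip(row, row[1:]):
            # Insert a space only when there is a visible horizontal gap.
            if cur[0] - prev[1] > gap_tol:
                text += " "
            text += cur[3]
        lines.append(text)
    return lines


# The "PD" example from earlier in the thread, deliberately out of order.
fragments = [
    (101.5, 147.196, 776.692, "F Library"),
    (56.8, 86.104, 776.692, "Quick"),
    (86.2, 101.56, 776.692, "PD"),
]
print(join_fragments(fragments))  # ['QuickPDF Library']
```

This only reorders whole fragments. It does not handle a fragment whose box lies inside another fragment's box, as in the single-character "P" case from the first post.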

Cheers, Ingo
 
AIM (Beginner), posted 12 Aug 09 at 8:40PM

Quote So the best way is to use option 3 and join the individual strings yourself based on their row and column values. A PDF page is laid out as a grid of points, for example 842 x 595 (rows x columns) for A4 in portrait. These single points are called PSUnits.

Do you have any code snippets or demos?
But I think this is too complicated for functionality like text extraction, which should be part of QuickPDF (where it nearly works).

I'm sorry, but I still believe that this is a bug in QuickPDF. If the "Ingo" example #2 behaved like examples #1 and #3, there wouldn't be a problem and everything would work perfectly.

JFYI, I tried your pdftext.dll and it has the same problem! Here is the output of your DLL:
 

page 1 / 1
 
Quick DF Library.
P

swb1 (Debenu Quick PDF Library Expert, joined 05 Dec 05, United States), posted 12 Aug 09 at 9:21PM

Martin,

 

Ingo is correct that this is not a bug but rather a consequence of the way the PDF is constructed. There is no rule that says text that appears to be one word when displayed by Acrobat or rendered by QuickPDF must actually be stored as one word inside the PDF. I have seen PDFs that were constructed one letter at a time! Each letter would appear as a single text element complete with location and font information. While this is an extremely inefficient way to construct a PDF, it works nonetheless and looks just fine from the outside when rendered by Acrobat.

 

The text extraction routines of QuickPDF do not re-assemble the words. These routines merely extract the text as it is stored in the document and tell you where it should appear on the page and how it should be formatted.

 

A text extraction routine that is smart enough to re-assemble the words and tell me their origins would be a terrific enhancement to the library but as far as I know no such routine exists here today.
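
As a rough illustration of what such a reassembly step might look like for the single-character case in this thread, here is a Python sketch. It assumes the extracted host string contains a space where the separately stored character belongs, and it guesses the position from the relative x coordinates, which is only approximate for proportional fonts. `merge_contained` is a hypothetical helper, not a QuickPDF function.

```python
def merge_contained(host, inner):
    """Put inner's text into the matching gap of host.

    host, inner: (x_left, x_right, text) tuples on the same row.
    Returns the merged string, or None if inner does not lie inside
    host's horizontal extent or host has no space to fill.
    """
    h_left, h_right, h_text = host
    i_left, i_right, i_text = inner
    if h_right <= h_left or not (h_left <= i_left and i_right <= h_right):
        return None

    # Estimate the character index from the relative x position.
    # This treats all glyphs as equally wide, which is only a heuristic.
    est = (i_left - h_left) / (h_right - h_left) * len(h_text)

    spaces = [i for i, ch in enumerate(h_text) if ch == " "]
    if not spaces:
        return None
    idx = min(spaces, key=lambda i: abs(i - est))

    # Replace the placeholder space with the separately stored text.
    return h_text[:idx] + i_text + h_text[idx + 1:]


# Coordinates taken from the Option 3 output in the first post.
host = (56.8, 147.392, "Quick DF Library")
inner = (86.2, 92.872, "P")
print(merge_contained(host, inner))  # QuickPDF Library
```

Picking the nearest space rather than the exact estimated index makes the guess a little more tolerant of uneven glyph widths, but it can still choose the wrong gap in longer strings.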

 

If Debenu does not add such a feature soon (hint, hint Karl;-) ) I will probably have to write one of my own.

  

Best of luck to you,

Steve



Edited by swb1 - 12 Aug 09 at 9:31PM

AIM (Beginner), posted 12 Aug 09 at 10:18PM

Thanks for all your information and suggestions. I think I understand now that the real "problem" with these 3 examples occurs at PDF creation time.

Seems that I have to invest some time and implement a fully working text extraction myself...
