Debenu Quick PDF Library - PDF SDK Community Forum Homepage

problems with GetPageText()

tj asher
Posted: 09 Jan 13 at 7:26PM
Hello,
 
I'm using Delphi XE2 and version 912 of the Debenu Library VCL component.
 
When I call GetPageText, I get some odd decoding issues with some PDFs.
 
Here is a snippet of my code to get the page text, which is pretty straightforward:
 
          for x := 1 to PDFLibrary.PageCount do begin
            PDFLibrary.SelectPage(x);
            Memo1.Text := Memo1.Text + PDFLibrary.GetPageText(7);//passing 7 preserves formatting
          end;
 
Here is how the text looks in the PDF. You'll have to trust me that it looks like this, since I cannot post a screenshot.
 
Tax
        Labor Tax          @    7.00%        $4.20  
        Parts Tax           @    7.00%      $27.75  
  Tax Total                                       $31.95
 
 
When I get the page text I get stuff like this:
 
Tax
                                             $4.20
           Labor Tax   @       7.00%        $27.75
                      @
            Parts Tax
                               7.00%
 Tax Total                                  $31.95
 
I'm guessing there is something wacky about how this PDF was created. Is there anything I can do to get my page text in a format closer to how it appears in the actual PDF? I need the text to be properly formatted in order to parse it.
 
Thanks for any advice.
 
Regards,
TJ Asher
 
Ingo
Posted: 09 Jan 13 at 8:21PM
Hi TJ!

Option 7 is best for your needs.
For better parsing, you can try the word-by-word extraction option.
Another idea: do the extraction with the additional data on text formatting and positions.
Then you can do the layout on your own.
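
A minimal sketch of the "do the layout on your own" idea, assuming you have already parsed the positioned output into one record per word. The TWordPos record, the MakeWord helper, the RebuildLines procedure and the sample coordinates are all made up for illustration; the exact option number and column order of the positioned output are not shown here, so check the function reference for your library version. The sketch groups words whose baselines are within a tolerance of each other into one line, then orders each line left to right:

```delphi
program LineRebuildDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Math,
  System.Generics.Collections, System.Generics.Defaults;

type
  // Hypothetical record: one extracted word with its left edge (X)
  // and baseline (Y) in points. Not a type from the library.
  TWordPos = record
    X, Y: Double;
    Text: string;
  end;

function MakeWord(AX, AY: Double; const AText: string): TWordPos;
begin
  Result.X := AX;
  Result.Y := AY;
  Result.Text := AText;
end;

// Sorts the collected words of one line by X, joins them with single
// spaces, appends the result to Lines and empties LineWords.
procedure FlushLine(LineWords: TList<TWordPos>; Lines: TStrings);
var
  W: TWordPos;
  S: string;
begin
  if LineWords.Count = 0 then
    Exit;
  LineWords.Sort(TComparer<TWordPos>.Construct(
    function(const A, B: TWordPos): Integer
    begin
      Result := CompareValue(A.X, B.X); // left to right
    end));
  S := '';
  for W in LineWords do
  begin
    if S <> '' then
      S := S + ' ';
    S := S + W.Text;
  end;
  Lines.Add(S);
  LineWords.Clear;
end;

// Rebuilds text lines from positioned words. Words is sorted in place.
// Assumes PDF-style coordinates where a larger Y is higher on the page.
procedure RebuildLines(Words: TList<TWordPos>; YTol: Double; Lines: TStrings);
var
  LineWords: TList<TWordPos>;
  W: TWordPos;
  LineY: Double;
begin
  Lines.Clear;
  Words.Sort(TComparer<TWordPos>.Construct(
    function(const A, B: TWordPos): Integer
    begin
      Result := CompareValue(B.Y, A.Y); // top of page first
    end));
  LineWords := TList<TWordPos>.Create;
  try
    LineY := 0;
    for W in Words do
    begin
      // A word too far below the first word of the current line starts a new line.
      if (LineWords.Count > 0) and (Abs(W.Y - LineY) > YTol) then
        FlushLine(LineWords, Lines);
      if LineWords.Count = 0 then
        LineY := W.Y;
      LineWords.Add(W);
    end;
    FlushLine(LineWords, Lines);
  finally
    LineWords.Free;
  end;
end;

var
  Words: TList<TWordPos>;
  Lines: TStringList;
  S: string;
begin
  Words := TList<TWordPos>.Create;
  Lines := TStringList.Create;
  try
    // Made-up sample data in scrambled order with slightly uneven baselines.
    Words.Add(MakeWord(300, 500.0, '$4.20'));
    Words.Add(MakeWord(300, 486.6, '$27.75'));
    Words.Add(MakeWord(40, 499.2, 'Labor Tax'));
    Words.Add(MakeWord(150, 500.5, '@'));
    Words.Add(MakeWord(150, 485.5, '@'));
    Words.Add(MakeWord(40, 486.0, 'Parts Tax'));
    Words.Add(MakeWord(200, 499.8, '7.00%'));
    Words.Add(MakeWord(200, 486.2, '7.00%'));
    RebuildLines(Words, 2.0, Lines);
    for S in Lines do
      Writeln(S);
  finally
    Lines.Free;
    Words.Free;
  end;
end.
```

With the sample data above, this prints "Labor Tax @ 7.00% $4.20" and then "Parts Tax @ 7.00% $27.75". The tolerance (2.0 points here) is a guess; it probably needs tuning to the font sizes in the real PDF, and column spacing would have to be rebuilt from the X values if the parser depends on it.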

Cheers and welcome here,
Ingo

tj asher
Posted: 09 Jan 13 at 10:02PM
Ingo,

I am using option 7 but for some reason the actual text returned *from* the page is not how it looks *on* the page.

I will consider the option of extracting the text with position data.
 
Selecting just the text in question in Acrobat or Foxit PDF Reader is difficult, because other areas that don't appear related get selected too. So I suspect flaws in the organization of the PDF document itself.
 
Regards,
TJ Asher

AndrewC
Posted: 13 Jan 13 at 1:48PM
We would need to see the actual PDF to explain exactly why the results look the way they do.

I suspect the PDF uses different fonts and sizes for the text.  Text extraction is not an exact science and it is a bit like putting together a jigsaw puzzle.

Andrew.