extract Text (words)

munteanu24d
Beginner, joined 03 Sep 09
Posted: 11 Sep 09 at 2:36PM
Hello!

I am trying to get the text from a PDF file using the getPageText(option) method.
I have tried option = 3 and option = 4.

When I print the text obtained with option 3, I get only the first word of each row, but the coordinates of the whole row.

When I print the text obtained with option 4, I get fragments of words; for instance, for the word "constant" I get "const" and "ant".

I need to implement a find-word function, but I cannot do it as long as I get fragments instead of whole words.

What can I do?

P.S. I have tried different PDFs and the result is the same.

Best wishes,
D.M.

Edited by munteanu24d - 11 Sep 09 at 2:38PM
Ingo
Moderator Group, joined 29 Oct 05
Posted: 11 Sep 09 at 7:16PM
Hi D.M.!

Option 3 returns the strings exactly as they were inserted into the PDF.
A line of characters that you see in the PDF was not necessarily inserted in one run.
Another thing: if a word was missing from a row and was inserted later, it will be extracted as the last content of the page, because it was inserted late; which row it belongs to does not matter.
If you always get a single word with option 3, then I think all your PDF documents come from the same source and are automatically generated.
You can send me two samples and we can examine them:
ingo  -dot-  schmoekel  -at-   ewetel  -dot-  net
Alternatively, I can send you a file with "longer" strings ;-)

Cheers, Ingo
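
As a rough illustration of the point above (pieces of text come out in insertion order, not reading order), here is a minimal Python sketch that regroups fragments into rows and joins fragments that nearly touch. It does not call Quick PDF Library at all: the `Fragment` type, the `merge_fragments` function and both tolerance values are hypothetical, and it assumes you have already parsed each extracted piece into its text, left edge, right edge and baseline.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical container; fill it from whatever your extraction returns.
    text: str
    left: float      # x coordinate of the left edge
    right: float     # x coordinate of the right edge
    baseline: float  # y coordinate of the text baseline


def merge_fragments(fragments, gap_tolerance=1.0, row_tolerance=2.0):
    """Group fragments into rows, then join fragments that nearly touch."""
    # Step 1: cluster by baseline, top of page first (PDF y grows upward).
    rows = []
    for frag in sorted(fragments, key=lambda f: -f.baseline):
        if rows and abs(rows[-1][0].baseline - frag.baseline) <= row_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    # Step 2: inside each row, read left to right and glue close neighbours.
    words = []
    for row in rows:
        row.sort(key=lambda f: f.left)
        current = row[0]
        for frag in row[1:]:
            if frag.left - current.right <= gap_tolerance:
                current = Fragment(current.text + frag.text, current.left,
                                   max(current.right, frag.right),
                                   current.baseline)
            else:
                words.append(current)
                current = frag
        words.append(current)
    return words


if __name__ == "__main__":
    pieces = [
        Fragment("ant", 130.0, 148.0, 700.0),   # inserted first, belongs second
        Fragment("const", 100.0, 130.0, 700.0),
        Fragment("value", 155.0, 180.0, 700.0),
    ]
    print([w.text for w in merge_fragments(pieces)])
```

The tolerances depend on font size and on how the PDF was produced, so expect to tune them per document source.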

munteanu24d
Beginner, joined 03 Sep 09
Posted: 15 Sep 09 at 10:30AM
Thank you for your answer. I have changed the PDF files, and now I can search for words in the rows with option 3.

What I cannot manage is getting the character width.

_QP.CharWidth(myASCIcode) always returns 0.
I have checked the selected font ID and it is also 0. That might be the problem, but I have not managed to fix it... :(

Edited by munteanu24d - 15 Sep 09 at 10:31AM
Ingo
Moderator Group, joined 29 Oct 05
Posted: 15 Sep 09 at 11:48AM
Hi!

With option = 3 you also get the font height.
With a few calculations on the other values (the four corner points of the bounding rectangle) you can get the complete length of the string, too.
For the number of characters, you can use "len(...)", "length(...)" or the equivalent in your language.
So you have the string's length and its height, and it shouldn't be a big problem to calculate a matching factor for each character width.

Cheers, Ingo
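
The calculation described above could be sketched like this in Python. It is only an approximation under stated assumptions: the function names `average_char_width` and `estimate_word_box` are made up for this example, the row is assumed to be horizontal, and every character is treated as equally wide, which is exact only for monospaced fonts. The left and right values are assumed to come from the row's bounding coordinates.

```python
def average_char_width(text, left, right):
    """Average width per character of a horizontal string."""
    if not text:
        raise ValueError("text must not be empty")
    return (right - left) / len(text)


def estimate_word_box(row_text, row_left, row_right, word):
    """Estimate the horizontal extent of `word` inside an extracted row.

    Returns (left, right), or None if the word is not in the row.
    """
    start = row_text.find(word)
    if start < 0:
        return None
    width = average_char_width(row_text, row_left, row_right)
    return (row_left + start * width,
            row_left + (start + len(word)) * width)


if __name__ == "__main__":
    # A 16-character row spanning x = 100 .. 196.
    print(estimate_word_box("a constant value", 100.0, 196.0, "constant"))
```

For proportional fonts the estimate drifts, so a per-font correction factor based on the font height, as suggested above, would be the next refinement.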
