Debenu Quick PDF Library - PDF SDK Community Forum

ExtractFilePageText

REGH (Beginner; joined 14 Aug 17; location: Sweden)
Posted: 14 Aug 17 at 5:16PM
Hi!
I'm using ExtractFilePageText to extract text strings together with their bounding box coordinates. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text) I don't get the text in human-readable form, and none of the options that do return readable text give me the bounding box coordinates. Is there a way to work around this?

I tried running two extractions (option=2 and option=3), putting the results into two separate arrays and then merging them. But with option 2 some text objects are read twice, so the two arrays end up with different numbers of entries...
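
One way to cope with text objects that are reported twice is to drop exact repeats before comparing the two result sets. A minimal Python sketch, assuming each extracted row is one line of text and that a duplicate is a character-for-character identical row; `drop_duplicate_rows` is a hypothetical helper, not part of the library:

```python
def drop_duplicate_rows(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = row.strip()
        if not key or key in seen:
            continue  # skip blank lines and rows that were already kept
        seen.add(key)
        unique.append(key)
    return unique
```

Rows that differ even slightly (for example in a rounded coordinate) are treated as distinct, so this only helps when the repeats are truly identical.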

Ingo (Moderator Group; joined 29 Oct 05)
Posted: 14 Aug 17 at 9:22PM
Hi Reg,

That's strange behavior you're describing.
For me the extract functions are the most stable ones in the library.
What you should do is:
Post your relevant code snippet here, so that somebody can perhaps spot a problem in your code.
Upload the PDF you're working with to a free file hoster, so we can try our own extractions and see whether the problem is the PDF itself ;-)

Cheers and welcome here,
Ingo

REGH
Posted: 15 Aug 17 at 5:29PM
Hi Ingo,
The file I'm extracting text from was created from a CAD drawing (with TTF text).
When I run the same code on a PDF created from MS Word, there is no problem.
Anyway, this is my VB code for testing the text extraction:
    QP.UnlockKey (strLicenseKey)
    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
    QP.DASetTextExtractionOptions 5, 1 'Sort
   
    For iOption = 2 To 3 Step 1
        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)
   
        iOutFileNo = FreeFile
        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
        Open strOutFileName For Output As #iOutFileNo
   
        TextArray = Split(strTmpText, vbCr)
        For i = 0 To UBound(TextArray) - 1
            Print #iOutFileNo, TextArray(i)
        Next i
       
        Close #iOutFileNo
    Next iOption

Each generated text file contains a row that seems to refer to the same text. For Option=2 it looks like this:
67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

And for Option=3 the same text is:
"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ???   ?A  ?  ? ? ??? ?A?    A ? ???? ???  ???  ??         "

Option=2 gives me readable text, but Option=3 doesn't.
Here's a link to a zip containing the two textfiles and the pdf used for the test.
http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
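
For reference, an option 3 row like the one above can be split into its fields with an ordinary CSV parser. A rough Python sketch, assuming the field order quoted in the first post (Font Name, Text Color, Text Size, X1, Y1 ... X4, Y4, Text) and that text containing commas is wrapped in double quotes as in the sample row; `TextBox` and `parse_option3_row` are hypothetical names, and how the library escapes quotes inside the text is not covered here:

```python
import csv
from typing import NamedTuple, Tuple


class TextBox(NamedTuple):
    font: str
    color: str
    size: float
    corners: Tuple[Tuple[float, float], ...]  # four (x, y) corner points
    text: str


def parse_option3_row(row: str) -> TextBox:
    """Split one option 3 CSV row into font, color, size, corners and text."""
    fields = next(csv.reader([row]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    coords = [float(v) for v in fields[3:11]]
    # Pair up X1,Y1 ... X4,Y4 into four corner points.
    corners = tuple(zip(coords[0::2], coords[1::2]))
    return TextBox(fields[0], fields[1], float(fields[2]), corners, fields[11])


row = ('"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,'
       '416.9278,523.4061,67.4646,523.4061,"sample, text"')
box = parse_option3_row(row)
```

With the coordinates of the posted row, all four corners share the same Y value, so the box has zero height; any link rectangle built from it would probably need its height derived from the text size.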

Ingo
Posted: 15 Aug 17 at 7:37PM
Option 2 is like option 3 but a bit more accurate in extracting.
Don't mix the options. The content each one returns can differ slightly (otherwise having two options would make no sense), and that can lead to "nearly" duplicate content.
At the top you've used DASetTextExtractionOptions. That works only with the DA functions! Don't mix the two types of functions!
Your hoster wants my email address. He won't get it ;-)
If the TTF font is not a common one and is not embedded, that can lead to bad extraction, too.

Cheers,
Ingo

REGH
Posted: 16 Aug 17 at 6:47AM
http://elcc.se/download/ExtractFilePageText.zip
I would like to use option=3 to get the bounding box coordinates for creating links, but the text returned with this option is gibberish. So I also tried option=2, intending to merge the two results, but I can't find an obvious way to match the rows with each other so that I end up with both bounding box coordinates and readable text...
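
One possible way to pair the two outputs is by position rather than by row index. The Python sketch below is only an illustration under several assumptions: option 2 rows are taken to be X, Y, color, size, font, text (inferred from the single sample row posted above, not from the documentation), option 3 rows follow the documented field order, and the tolerances are guesses based on that one pair, where X agrees closely but Y differs by about 3 units. `merge_by_position` is a hypothetical helper.

```python
import csv


def parse_rows(block):
    """Parse a block of CSV lines into field lists, skipping empty lines."""
    return [fields for fields in csv.reader(block.splitlines()) if fields]


def merge_by_position(option2_text, option3_text, x_tol=0.5, y_tol=5.0):
    """Pair readable option 2 text with option 3 boxes that start nearby."""
    readable = [(float(f[0]), float(f[1]), f[5])
                for f in parse_rows(option2_text) if len(f) == 6]
    merged = []
    used = set()  # each option 2 row is matched at most once
    for f in parse_rows(option3_text):
        if len(f) != 12:
            continue
        box = [float(v) for v in f[3:11]]
        x1, y1 = box[0], box[1]
        best = None
        for i, (x, y, _text) in enumerate(readable):
            if i in used:
                continue
            dx, dy = abs(x - x1), abs(y - y1)
            if dx <= x_tol and dy <= y_tol and (best is None or dx + dy < best[0]):
                best = (dx + dy, i)
        if best is not None:
            used.add(best[1])
            merged.append((readable[best[1]][2], box))
    return merged


opt2 = ('67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A"\n'
        '10.0,10.0,#000000,1.4,"AAAAAA+ArialNarrow","other"')
opt3 = ('"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,'
        '416.9278,523.4061,67.4646,523.4061," ? ??? "')
pairs = merge_by_position(opt2, opt3)
```

Because every option 2 row can be used only once, a text object that option 2 reports twice simply leaves one unmatched row instead of shifting all later pairs. Whether the tolerances hold for a whole drawing would need checking against more rows than the one shown in this thread.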

Ingo
Posted: 16 Aug 17 at 8:18AM
Hi Reg,

I've run some tests with the PDF.
The source is BricsCAD; the file was converted from DWG format.
I have the same problems when extracting text.
Perhaps it's a code page problem?
Rendering works, but a few text parts overlap each other.
BTW: there's a malformed xref table at the end of the file.

Using Google I found many community posts about problems with BricsCAD's direct PDF export function.
Another thing: the encoding is Identity-H, which can be a problem, too.
My conclusion: you won't get proper text extraction from PDF documents from this source. Sorry. Anyway, if you do succeed, please let us know how you did it. Thanks.

Cheers,
Ingo
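
To illustrate why Identity-H can get in the way: with that encoding the string bytes in the page content are two-byte codes that select glyphs, not characters, so readable text depends on a separate code-to-Unicode map (in a PDF, typically the font's ToUnicode CMap). The Python sketch below only illustrates the idea; it is not a description of what the library does internally, the code values and the map are made up, and `decode_identity_h` is a hypothetical helper. Since option 2 does return readable text for this file, a missing map may not be the whole story here.

```python
def decode_identity_h(raw: bytes, to_unicode: dict) -> str:
    """Decode two-byte big-endian codes, using '?' where no mapping is known."""
    if len(raw) % 2:
        raise ValueError("Identity-H strings consist of two-byte codes")
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(to_unicode.get(code, "?") for code in codes)


# Made-up glyph codes: two of them have a Unicode mapping, one does not.
raw = bytes([0x00, 0x24, 0x00, 0x28, 0x00, 0x2F])
partial_map = {0x0024: "A", 0x0028: "E"}
decoded = decode_identity_h(raw, partial_map)
```

Codes without a mapping come out as "?" in this sketch, which resembles the mostly-question-mark text in the option 3 output above.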
