I need help - I can help - ExtractFilePageText

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3492
Printed Date: 05 Apr 26 at 1:03AM

Topic: ExtractFilePageText

Posted By: REGH
Subject: ExtractFilePageText
Date Posted: 14 Aug 17 at 5:16PM

Hi!
I'm using ExtractFilePageText to extract text strings together with the bounding box coordinates of each text. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text) I don't get the text in human-readable form. None of the options that return readable text give me the bounding box coordinates. Is there a way to work around this?

I tried making two extractions (option=2 and option=3), putting the results into two different arrays and then merging them. But with option 2 some text objects are read twice, so the two arrays end up with different numbers of entries...

Replies:

Posted By: Ingo
Date Posted: 14 Aug 17 at 9:22PM

Hi Reg,

That's strange behavior you're describing.

For me the extract functions are the most stable ones in the library.

What you should do is:

Post your relevant code snippet here - so perhaps somebody here can determine problems inside your code.

Upload the PDF you're working with to a free file hoster, so we can run our own extractions and see whether the problem is the PDF itself ;-)

Cheers and welcome here,

Ingo

-------------
Cheers,
Ingo

Posted By: REGH
Date Posted: 15 Aug 17 at 5:29PM

Hi Ingo,
The file I'm extracting text from was created from a CAD drawing (containing TTF texts).
When I run the same code on a PDF created from MS Word, there is no problem.
Anyway, this is my VB code for testing the text extraction:
    QP.UnlockKey (strLicenseKey)
    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
    QP.DASetTextExtractionOptions 5, 1 'Sort

    For iOption = 2 To 3 Step 1
        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)

        iOutFileNo = FreeFile
        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
        Open strOutFileName For Output As #iOutFileNo

        TextArray = Split(strTmpText, vbCr)
        For i = 0 To UBound(TextArray) - 1
            Print #iOutFileNo, TextArray(i)
        Next i

        Close #iOutFileNo
    Next iOption

Each generated text file contains a row that seems to refer to the same text. For Option=2 it looks like this:
67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

And for Option=3 the same text is:
"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ???   ?A ? ? ? ??? ?A?    A ? ???? ??? ??? ??         "

Option=2 gives me readable text, but Option=3 doesn't.
Here's a link to a zip containing the two textfiles and the pdf used for the test.
http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
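Both sample rows above are CSV with quoted fields, and the readable text itself contains commas, so a plain split on "," would cut the text into pieces. Below is a minimal Python sketch of parsing such rows with the standard `csv` module. The field order is inferred only from the two sample rows in this post; the names `parse_rows` and `size_raw` are hypothetical, and the meaning of the fourth field in the option 2 row (1.4) is an assumption.

```python
import csv


def parse_rows(raw, option):
    """Parse ExtractFilePageText CSV output into dicts.

    The column layout is inferred from the sample rows in this thread,
    not taken from the library documentation.
    """
    rows = []
    # splitlines() copes with CR, LF and CRLF line endings alike.
    for rec in csv.reader(raw.splitlines()):
        if not rec:
            continue  # skip empty lines
        if option == 2:
            x, y, color, size_raw, font, text = rec
            rows.append({"font": font, "color": color,
                         "size_raw": float(size_raw),  # meaning assumed
                         "x": float(x), "y": float(y), "text": text})
        elif option == 3:
            font, color, size = rec[:3]
            coords = [float(v) for v in rec[3:11]]
            # Pair up X1,Y1 .. X4,Y4 into four corner points.
            corners = list(zip(coords[0::2], coords[1::2]))
            rows.append({"font": font, "color": color, "size": float(size),
                         "corners": corners, "text": rec[11]})
        else:
            raise ValueError("only options 2 and 3 are handled here")
    return rows


if __name__ == "__main__":
    sample = ('67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow",'
              '"ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"')
    print(parse_rows(sample, 2)[0]["text"])
```

The quoted text field stays in one piece even though it contains commas, which a manual split on "," would not guarantee.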

Posted By: Ingo
Date Posted: 15 Aug 17 at 7:37PM

Option 2 is like option 3 but a bit more accurate in extracting.

Don't mix the options. Each resulting content can differ a little bit (otherwise the two options make no sense) and this can lead to "nearly" duplicate content.

At the top you've used the DASetTextExtractionOptions - this will work only with DA-functions! Don't mix both types of functions!

Your hoster wants my email address - he won't get it ;-)

If the TTF font is not a common one and is not embedded, this can lead to bad extraction, too.

-------------
Cheers,
Ingo

Posted By: REGH
Date Posted: 16 Aug 17 at 6:47AM

http://elcc.se/download/ExtractFilePageText.zip
I would like to use option=3 to get the bounding box coordinates for creating links. Since the text returned by this option is gibberish, I also ran option=2 and tried to merge the two results, but I can't find an obvious way to match the rows with each other so that I end up with both bounding box coordinates and readable text...
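One way the matching could be attempted is by font, color and position, since the two sample rows posted earlier agree on font name and color, their X values differ by less than 0.01, and their Y values differ by about 2.9. The Python sketch below pairs rows greedily on that basis. It is only an illustration under those assumptions: `match_rows`, `bounding_rect` and the tolerance defaults are hypothetical, and the input dicts are assumed to hold already-parsed fields.

```python
def bounding_rect(corners):
    """Axis-aligned rectangle (left, bottom, right, top) around four corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def match_rows(readable, boxed, tol_x=0.5, tol_y=5.0):
    """Pair option 2 rows (readable text) with option 3 rows (corner points).

    Each option 3 row takes the closest still-unused option 2 row that has
    the same font and color and lies within the tolerances. Every option 2
    row is used at most once; leftovers (for example, rows the extraction
    returned twice) are simply ignored.
    """
    unused = list(readable)
    pairs = []
    for box in boxed:
        x1, y1 = box["corners"][0]  # first corner of the bounding box
        best = None
        for cand in unused:
            if cand["font"] != box["font"] or cand["color"] != box["color"]:
                continue
            dx = abs(cand["x"] - x1)
            dy = abs(cand["y"] - y1)
            if dx <= tol_x and dy <= tol_y:
                if best is None or dx + dy < best[0]:
                    best = (dx + dy, cand)
        if best is not None:
            unused.remove(best[1])
            pairs.append((best[1]["text"], bounding_rect(box["corners"])))
    return pairs
```

Note that in the posted option 3 row all four Y values are identical, so the resulting rectangle has zero height; a link area would probably need to be padded, for example by the reported text size.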

Posted By: Ingo
Date Posted: 16 Aug 17 at 8:18AM

Hi Reg,

I've made some tests with the pdf...
The source is from BricsCAD.
It's converted from dwg-format.
I get the same problems myself when extracting text.
Perhaps a codepage problem?
Rendering works, but a few text parts overlap each other.
BTW: At the end there's a malformed xref table.

Using Google I found many community posts about problems with the direct PDF export function of BricsCAD.
Another thing: the encoding is Identity-H, which can be a problem, too.
My advice: I don't think you'll get a proper text extraction from PDF documents from this source. Sorry. Anyway, if you do succeed, please let us know how you did it. Thanks.

-------------
Cheers,
Ingo
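As background on the Identity-H remark: with Identity-H, each character in a PDF text string is a 2-byte code that maps to a glyph identifier rather than directly to a Unicode value, so an extractor needs a separate code-to-Unicode table (normally the font's ToUnicode CMap) to recover readable text. The toy Python sketch below shows how codes missing from such a table could end up as "?". Whether this is what actually happens in the PDF discussed here is not confirmed in the thread; `decode_identity_h` and the mapping values are purely hypothetical.

```python
def decode_identity_h(raw, to_unicode):
    """Decode 2-byte big-endian character codes using a code-to-Unicode table.

    `to_unicode` stands in for a font's ToUnicode mapping. Codes that are
    not in the table cannot be translated and are replaced by "?".
    A trailing odd byte, if any, is ignored.
    """
    codes = [int.from_bytes(raw[i:i + 2], "big")
             for i in range(0, len(raw) - 1, 2)]
    return "".join(to_unicode.get(code, "?") for code in codes)


if __name__ == "__main__":
    # Hypothetical, incomplete table: only two of the three codes are known.
    table = {0x0028: "E", 0x002F: "L"}
    print(decode_identity_h(bytes([0x00, 0x28, 0x00, 0x2F, 0x00, 0x27]), table))
```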