
ExtractFilePageText

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3492
Printed Date: 28 Apr 24 at 7:55PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: ExtractFilePageText
Posted By: REGH
Subject: ExtractFilePageText
Date Posted: 14 Aug 17 at 5:16PM
Hi!
I'm using ExtractFilePageText to extract text strings together with their bounding box coordinates. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text), the text doesn't come out in (human-)readable form. None of the options that produce readable text give me the bounding box coordinates. Is there a way to work around this?

I tried running two extractions (option=2 and option=3), putting the results into two separate arrays and then merging them. But with option 2 some text objects are read twice, so the two arrays end up with different numbers of entries...




Replies:
Posted By: Ingo
Date Posted: 14 Aug 17 at 9:22PM
Hi Reg,

That's strange behavior you're describing.
For me, the extraction functions are the most stable ones in the library.
What you should do is:
Post your relevant code snippet here, so perhaps somebody can spot a problem in your code.
Upload the PDF you're working with to a free file hoster, so we can run our own extractions and see whether the problem is the PDF itself ;-)

Cheers and welcome here,
Ingo



-------------
Cheers,
Ingo



Posted By: REGH
Date Posted: 15 Aug 17 at 5:29PM
Hi Ingo,
The file I'm extracting text from was created from a CAD drawing (with TTF text).
When I run the same code on a PDF created from MS Word, there is no problem.
Anyway, this is my VB code for testing the text extraction:
    QP.UnlockKey (strLicenseKey)
    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
    QP.DASetTextExtractionOptions 5, 1 'Sort
   
    For iOption = 2 To 3 Step 1
        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)
   
        iOutFileNo = FreeFile
        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
        Open strOutFileName For Output As #iOutFileNo
   
        TextArray = Split(strTmpText, vbCr)
        For i = 0 To UBound(TextArray) - 1
            Print #iOutFileNo, TextArray(i)
        Next i
       
        Close #iOutFileNo
    Next iOption

Here is one row from each generated text file; both seem to refer to the same text. For Option=2 it looks like this:
67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

And for Option=3 the same text is:
"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ???   ?A  ?  ? ? ??? ?A?    A ? ???? ???  ???  ??         "

Option=2 gives me readable text, but Option=3 doesn't.
Here's a link to a zip containing the two text files and the PDF used for the test.
http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
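
Editor's note: the Option=3 row quoted above has twelve comma-separated fields (font name, color, size, four X/Y corner pairs, text), and the text field may itself contain commas. A minimal VB6 sketch of splitting such a row follows. The TextBlock type and both helper functions are hypothetical names made up for this illustration, not part of Quick PDF Library, and the sketch assumes the font name never contains a comma.

```vb
' Hypothetical helpers for illustration only; not part of Quick PDF Library.
Private Type TextBlock
    FontName As String
    Color As String
    Size As Double
    X(1 To 4) As Double   ' X1..X4 of the bounding box
    Y(1 To 4) As Double   ' Y1..Y4 of the bounding box
    Text As String
End Type

' Removes one pair of surrounding double quotes, if present.
Private Function StripQuotes(ByVal s As String) As String
    s = Trim$(s)
    If Len(s) >= 2 And Left$(s, 1) = """" And Right$(s, 1) = """" Then
        s = Mid$(s, 2, Len(s) - 2)
    End If
    StripQuotes = s
End Function

' Fills blk from one Option=3 row. Returns False if the row
' does not have the expected twelve fields.
Private Function ParseOption3Line(ByVal strLine As String, ByRef blk As TextBlock) As Boolean
    Dim parts() As String
    Dim i As Long

    ' Drop a stray line feed left over from splitting CR/LF output on vbCr.
    strLine = Replace(strLine, vbLf, "")

    ' Limit the split to 12 pieces so commas inside the text stay in the last piece.
    parts = Split(strLine, ",", 12)
    If UBound(parts) <> 11 Then Exit Function

    blk.FontName = StripQuotes(parts(0))
    blk.Color = parts(1)
    blk.Size = Val(parts(2))
    For i = 1 To 4
        blk.X(i) = Val(parts(1 + 2 * i))   ' fields 3, 5, 7, 9
        blk.Y(i) = Val(parts(2 + 2 * i))   ' fields 4, 6, 8, 10
    Next i
    blk.Text = StripQuotes(parts(11))
    ParseOption3Line = True
End Function
```

Val is used instead of CDbl because Val always treats the period as the decimal separator, whereas CDbl follows the regional settings of the machine.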



Posted By: Ingo
Date Posted: 15 Aug 17 at 7:37PM
Option 2 is like option 3 but a bit more accurate in extracting.
Don't mix the options. The content each one returns can differ slightly (otherwise there would be no point in having both), and this can lead to "nearly" duplicate content.
At the top you've used DASetTextExtractionOptions. This only works with the DA functions! Don't mix both types of functions!
Your hoster wants my email address. He won't get it ;-)
If the TTF font is uncommon and not embedded, this can lead to bad extraction, too.
 





-------------
Cheers,
Ingo
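
Editor's note: one way to read the advice above is to do the extraction itself with the DA (direct access) functions, so that the DASetTextExtractionOptions call and the extraction call belong to the same family. The sketch below is untested. It assumes QP is the same library object used in the code earlier in the thread and that the function names and parameter order match the library reference for your version; check both before relying on it. The option IDs are simply copied from the earlier post.

```vb
' Sketch only: verify each call against the Quick PDF Library reference.
Private Function ExtractWithDA(QP As Object, ByVal strFile As String, _
        ByVal lngPage As Long, ByVal lngOption As Long) As String
    Dim hFile As Long
    Dim hPage As Long

    QP.DASetTextExtractionOptions 8, 1   ' option ID and value copied from the post above

    hFile = QP.DAOpenFileReadOnly(strFile, "")   ' empty password, as in the original call
    If hFile <> 0 Then
        hPage = QP.DAFindPage(hFile, lngPage)
        If hPage <> 0 Then
            ExtractWithDA = QP.DAExtractPageText(hFile, hPage, lngOption)
        End If
        QP.DACloseFile hFile
    End If
End Function
```

A call such as ExtractWithDA(QP, "C:\Temp\N09A.pdf", 1, 3) would then take the place of the ExtractFilePageText call in the loop.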



Posted By: REGH
Date Posted: 16 Aug 17 at 6:47AM
http://elcc.se/download/ExtractFilePageText.zip
I would like to use option=3 to get the bounding box coordinates for creating links. Since the text from this option is gibberish, I also tried option=2 and merging the two results, but I can't find an obvious way to match the rows with each other so that I end up with both bounding box coordinates and readable text...
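
Editor's note: one possible way to pair the rows is by position instead of by array index. In the sample rows quoted earlier, the Option=2 X/Y and the Option=3 X1/Y1 differ by roughly 0.005 in X and 2.9 in Y, and font and color are identical. The VB6 sketch below looks for the closest Option=3 row within a tolerance. FindOption3Match is a hypothetical helper, the tolerances are guesses based on that single sample, and nothing here is confirmed to work for every text object in the file.

```vb
' Hypothetical helper: returns the index in opt3Lines() of the row closest to
' one Option=2 row, or -1 if no row is within the given tolerances.
Private Function FindOption3Match(ByVal opt2Line As String, opt3Lines() As String, _
        ByVal tolX As Double, ByVal tolY As Double) As Long
    Dim p2() As String, p3() As String
    Dim i As Long, result As Long
    Dim dx As Double, dy As Double, best As Double

    result = -1
    ' Option=2 row: X, Y, color, size, "font", "text"
    p2 = Split(Replace(opt2Line, vbLf, ""), ",", 6)
    If UBound(p2) = 5 Then
        For i = LBound(opt3Lines) To UBound(opt3Lines)
            ' Option=3 row: "font", color, size, X1, Y1, ..., X4, Y4, "text"
            p3 = Split(Replace(opt3Lines(i), vbLf, ""), ",", 12)
            If UBound(p3) = 11 Then
                dx = Abs(Val(p3(3)) - Val(p2(0)))
                dy = Abs(Val(p3(4)) - Val(p2(1)))
                ' Same font and color, and start point within tolerance
                If p3(0) = p2(4) And p3(1) = p2(2) And dx <= tolX And dy <= tolY Then
                    If result = -1 Or dx + dy < best Then
                        best = dx + dy
                        result = i
                    End If
                End If
            End If
        Next i
    End If
    FindOption3Match = result
End Function
```

For example, idx = FindOption3Match(opt2Rows(k), opt3Rows, 0.05, 5), where opt3Rows is declared as Dim opt3Rows() As String and filled with Split. Duplicate Option=2 rows then simply map to the same Option=3 index, so the two arrays no longer need to have the same length.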



Posted By: Ingo
Date Posted: 16 Aug 17 at 8:18AM
Hi Reg,

I've run some tests with the PDF...
The source is BricsCAD.
It was converted from DWG format.
I get the same problems when extracting text.
Perhaps a codepage problem?
Rendering works, but a few text parts overlap each other.
BTW: At the end there's a malformed xref table.

With Google I've found many community posts about problems with BricsCAD's direct PDF export function.
Another thing: the encoding is Identity-H. This can be a problem, too.
My advice: you won't get proper text extraction from PDF documents from this source. Sorry. Anyway, if you succeed, please let us know with your "how to...". Thanks.



-------------
Cheers,
Ingo



