ExtractFilePageText

REGH (Beginner; Joined: 14 Aug 17; Location: Sweden)
Posted: 14 Aug 17 at 5:16PM
Hi!
I'm using ExtractFilePageText to extract text strings together with the bounding box coordinates of each text. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text), the text does not come out in human-readable form, and none of the options that do return readable text give me the bounding box coordinates. Is there a way to work around this?
I tried running two extractions (option=2 and option=3), putting the results into two separate arrays and then merging them. But with option 2 some text objects are read twice, so the two arrays end up with different numbers of texts...

Ingo (Moderator Group; Joined: 29 Oct 05)
Hi Reg,
That's strange behavior you're describing. For me the extract functions are the most stable ones in the library. What you should do is:
Post your relevant code snippet here, so that somebody can perhaps spot a problem in your code.
Upload the PDF you're working with to a free file hoster, so we can run our own extractions and see whether the problem is the PDF itself ;-)
Cheers and welcome here,
Ingo

Cheers,
Ingo

REGH
Hi Ingo,
The file I'm extracting text from was created from a CAD drawing (with TTF texts). When I run the same code on a PDF created from MS Word, there is no problem. Anyway, this is my VB code for testing the text extraction:

    QP.UnlockKey (strLicenseKey)
    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
    QP.DASetTextExtractionOptions 5, 1 'Sort
    'Extract page 1 once with option 2 and once with option 3,
    'writing each result to its own text file
    For iOption = 2 To 3 Step 1
        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)
        iOutFileNo = FreeFile
        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
        Open strOutFileName For Output As #iOutFileNo
        TextArray = Split(strTmpText, vbCr)
        For i = 0 To UBound(TextArray) - 1
            Print #iOutFileNo, TextArray(i)
        Next i
        Close #iOutFileNo
    Next iOption

Each generated text file contains a row that seems to refer to the same text. For Option=2 it looks like this:

    67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

And for Option=3 the same text is:

    "AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ??? ?A ? ? ? ??? ?A? A ? ???? ??? ??? ?? "

Option=2 gives me readable text, but Option=3 doesn't. Here's a link to a zip containing the two text files and the PDF used for the test.
http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
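
A side note on reading these rows back: the Option=2 sample above has commas inside the quoted text, so a plain Split(row, ",") would cut that text into several pieces. Below is a minimal sketch of a quote-aware splitter. SplitQuotedRow is a hypothetical helper, not a Quick PDF Library function, and it assumes that double quotes only ever delimit a field (how the library writes a quote character that occurs inside the text is not shown in this thread, so that case is not handled).

```vb
' Hypothetical helper, not part of Quick PDF Library.
' Splits one extracted row on the commas that lie outside double quotes.
' Assumption: a double quote always opens or closes a field.
Public Function SplitQuotedRow(ByVal sRow As String) As Collection
    Dim colFields As New Collection
    Dim sField As String
    Dim sChar As String
    Dim bInQuotes As Boolean
    Dim i As Long

    For i = 1 To Len(sRow)
        sChar = Mid$(sRow, i, 1)
        If sChar = """" Then
            ' Quote marks only toggle the state; they are not kept in the field
            bInQuotes = Not bInQuotes
        ElseIf sChar = "," And Not bInQuotes Then
            ' A comma outside quotes ends the current field
            colFields.Add sField
            sField = ""
        Else
            sField = sField & sChar
        End If
    Next i

    ' The last field has no trailing comma
    colFields.Add sField
    Set SplitQuotedRow = colFields
End Function
```

Applied to the Option=2 row above, this should give six fields, with the whole readable text (commas included) as the sixth. The Option=3 row should give twelve, matching the field list Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text.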

Ingo
Option 2 is like option 3 but a bit more accurate in extracting.
Don't mix the options. The content each one returns can differ slightly (otherwise having two options would make no sense), and this can lead to "nearly" duplicate content.
At the top you've used DASetTextExtractionOptions. This works only with the DA functions! Don't mix the two types of functions!
Your hoster wants my email address. He won't get it ;-)
If the TTF font is not a common one and is not embedded, this can lead to bad extraction, too.

Cheers,
Ingo

REGH
http://elcc.se/download/ExtractFilePageText.zip
I would like to use option=3 to get the bounding box coordinates for creating links. But since the text from this option is gibberish, I also tried option=2 and merging the two results. However, I can't find an obvious way to match the results with each other so that I end up with both bounding box coordinates and readable text...
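
One possible way to pair the rows, sketched here only from the single sample row posted earlier in the thread: in that sample the Option=2 X value (67.46) looks like the Option=3 X1 value (67.4646) rounded to two decimals, the font name is identical, and the Y values differ by about 3 points (526.32 vs 523.4061). SameTextObject below is a hypothetical helper and the tolerances are guesses based on that one row, so treat it as a starting point rather than a confirmed rule for how the two options relate.

```vb
' Hypothetical helper, not part of Quick PDF Library.
' Decides whether an option-2 row and an option-3 row describe the same
' text object, judging only by font name and position.
' Assumption: X2/Y2 are the first two fields of the option-2 row and
' X3/Y3 are the X1/Y1 fields of the option-3 row, already split out.
Public Function SameTextObject(ByVal sFont2 As String, ByVal dX2 As Double, ByVal dY2 As Double, _
                               ByVal sFont3 As String, ByVal dX3 As Double, ByVal dY3 As Double, _
                               ByVal dTolX As Double, ByVal dTolY As Double) As Boolean
    SameTextObject = (sFont2 = sFont3) _
                     And (Abs(dX2 - dX3) <= dTolX) _
                     And (Abs(dY2 - dY3) <= dTolY)
End Function

' Usage with the values from the sample rows in this thread.
' Val is used instead of CDbl because Val always treats "." as the
' decimal separator, whatever the regional settings are.
Public Sub DemoSameTextObject()
    Dim bMatch As Boolean
    bMatch = SameTextObject("AAAAAA+ArialNarrow", Val("67.46"), Val("526.32"), _
                            "AAAAAA+ArialNarrow", Val("67.4646"), Val("523.4061"), _
                            0.01, 5)
    Debug.Print bMatch
End Sub
```

With the guessed tolerances (0.01 for X, 5 for Y) the two sample rows count as the same object. Since option 2 reportedly returns some text objects twice, a merge loop would also need to skip an option-2 row once its position has already been matched.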

Ingo
Hi Reg,
I've run some tests with the PDF... The source is BricsCAD; the file was converted from DWG format. I have the same problems when extracting text. Perhaps a codepage problem? Rendering works, but a few text parts overlap each other. BTW: there's a malformed xref table at the end. With Google I've found many community posts about problems with the direct PDF export function in BricsCAD. Another thing: the encoding is Identity-H, which can be a problem, too. My advice: you won't get proper text extraction from PDF documents from this source. Sorry. Anyway... if you do succeed, please let us know your "how to...". Thanks.

Cheers,
Ingo