ExtractFilePageText

REGH (Beginner; Joined: 14 Aug 17; Location: Sweden)	Posted: 14 Aug 17 at 5:16PM
	Hi! I'm using ExtractFilePageText to extract text strings and their bounding box coordinates. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text) I don't get the text in human-readable format. None of the options that return readable text give me the bounding box coordinates. Is there a way to work around this?

	I tried running two extractions (option=2 and option=3), putting the results into two different arrays and then merging them. But with option 2 some text objects are read twice, so the two arrays end up with different numbers of texts...

Ingo (Moderator Group; Joined: 29 Oct 05)	Posted: 14 Aug 17 at 9:22PM
	Hi Reg, that's strange behavior you're describing. For me the extract functions are the most stable ones in the library. What you should do: post your relevant code snippet here, so perhaps somebody can spot problems in your code. And upload the PDF you're working with to a free file hoster, so we can try our own extractions and see whether the problem is the PDF itself ;-) Cheers and welcome here, Ingo
	Cheers, Ingo

REGH (Beginner; Joined: 14 Aug 17; Location: Sweden)	Posted: 15 Aug 17 at 5:29PM
	Hi Ingo, the file I'm extracting text from was created from a CAD drawing (with TTF texts). When I run the same code on a PDF created from MS Word, there is no problem. Anyway, this is my VB code for testing the text extraction:

	    QP.UnlockKey (strLicenseKey)
	    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
	    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
	    QP.DASetTextExtractionOptions 5, 1 'Sort
	    For iOption = 2 To 3 Step 1
	        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)
	        iOutFileNo = FreeFile
	        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
	        Open strOutFileName For Output As #iOutFileNo
	        TextArray = Split(strTmpText, vbCr)
	        For i = 0 To UBound(TextArray) - 1
	            Print #iOutFileNo, TextArray(i)
	        Next i
	        Close #iOutFileNo
	    Next iOption

	Each generated text file contains one row that seems to refer to the same text. For Option=2 it looks like this:

	    67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

	And for Option=3 the same text is:

	    "AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ??? ?A ? ? ? ??? ?A? A ? ???? ??? ??? ?? "

	Option=2 gives me readable text, but Option=3 doesn't. Here's a link to a zip containing the two text files and the PDF used for the test: http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
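	For readers who want to work with the option 3 rows outside VB, here is a minimal Python sketch that splits one row into the twelve fields listed in the first post and derives an axis-aligned bounding box from the four corner points. The function name parse_option3_row is hypothetical, and the sketch assumes the rows are plain comma-separated text with double-quoted strings, as in the sample above; how the library escapes a quote character inside the text is not known from this thread.

```python
import csv

def parse_option3_row(row):
    """Split one option-3 row into font, color, size, corner points and text."""
    # csv handles the quoted fields, so commas inside the text do not split it.
    fields = next(csv.reader([row]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    coords = [float(v) for v in fields[3:11]]
    corners = list(zip(coords[0::2], coords[1::2]))  # (X1,Y1) .. (X4,Y4)
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return {
        "font": fields[0],
        "color": fields[1],
        "size": float(fields[2]),
        "corners": corners,
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "text": fields[11],
    }

# The option-3 row quoted in the post above.
sample = ('"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,'
          '523.4061,416.9278,523.4061,67.4646,523.4061,'
          '" ? ??? ?A ? ? ? ??? ?A? A ? ???? ??? ??? ?? "')
parsed = parse_option3_row(sample)
print(parsed["bbox"])
```

	Note that in this sample row all four Y values are identical, so the resulting box has zero height.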

Ingo (Moderator Group; Joined: 29 Oct 05)	Posted: 15 Aug 17 at 7:37PM
	Option 2 is like option 3 but a bit more accurate in extracting. Don't mix the options. The content each one returns can differ a little (otherwise having two options would make no sense), and this can lead to "nearly" duplicate content. At the top you've used DASetTextExtractionOptions - this works only with DA functions! Don't mix both types of functions! Your hoster wants my email address - he won't get it ;-) If the TTF font is not common and not embedded, this can lead to bad extraction, too.
	Cheers, Ingo

REGH (Beginner; Joined: 14 Aug 17; Location: Sweden)	Posted: 16 Aug 17 at 6:47AM
	http://elcc.se/download/ExtractFilePageText.zip

	I would like to use option=3 to get the bounding box coordinates for creating links. Since the text returned by this option is gibberish, I also tried option=2 and merging the two results, but I can't find an obvious way to match the rows with each other so that I end up with both bounding box coordinates and readable text...
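	One possible way to pair the rows, sketched in Python under several assumptions: the option 2 field order (x, y, color, a numeric value, font, text) is only inferred from the single sample row in this thread, the helper names are hypothetical, and the 5-point tolerance is a guess based on that sample, where X agrees closely but Y differs by about 3. Each option 3 row is paired with the nearest unused option 2 row that has the same font name.

```python
import csv

def parse_option2_row(row):
    # Field order inferred from the sample row: x, y, color, number, font, text.
    f = next(csv.reader([row]))
    return {"x": float(f[0]), "y": float(f[1]), "font": f[4], "text": f[5]}

def parse_option3_row(row):
    # Font, color, size, then X1,Y1 .. X4,Y4, then the (unreadable) text.
    f = next(csv.reader([row]))
    coords = [float(v) for v in f[3:11]]
    return {"font": f[0], "x": coords[0], "y": coords[1], "coords": coords}

def match_rows(opt2_rows, opt3_rows, tol=5.0):
    """Pair each option-3 row with the closest unused option-2 row (same font)."""
    pairs, used = [], set()
    for r3 in opt3_rows:
        best, best_d = None, tol
        for i, r2 in enumerate(opt2_rows):
            if i in used or r2["font"] != r3["font"]:
                continue
            # Largest per-axis difference between the two start points.
            d = max(abs(r2["x"] - r3["x"]), abs(r2["y"] - r3["y"]))
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            pairs.append((opt2_rows[best]["text"], r3["coords"]))
    return pairs

# The two sample rows quoted earlier in the thread.
row2 = ('67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow",'
        '"ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"')
row3 = ('"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,'
        '523.4061,416.9278,523.4061,67.4646,523.4061,'
        '" ? ??? ?A ? ? ? ??? ?A? A ? ???? ??? ??? ?? "')
result = match_rows([parse_option2_row(row2)], [parse_option3_row(row3)])
print(result)
```

	Because every option 2 row is used at most once, rows that option 2 reports twice are simply left unmatched. The greedy pairing depends on row order and may mispair texts that sit very close together, so the output would need checking against the actual drawing.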

Ingo (Moderator Group; Joined: 29 Oct 05)	Posted: 16 Aug 17 at 8:18AM
	Hi Reg, I've made some tests with the PDF... The source is BricsCAD; the file was converted from DWG format. I have the same problems when extracting text. Perhaps a codepage problem? Rendering works, but a few text parts overlay each other. BTW: at the end there's a malformed xref table. With Google I've found many community posts about problems with the direct PDF export function in BricsCAD. Another thing: the encoding is Identity-H - this can be a problem, too. My assessment: you won't get proper text extraction from PDF documents from this source. Sorry. Anyway... if you succeed, please let us know with your "how to...". Thanks.
	Cheers, Ingo