PDF Extracting values incorrectly... |
Anu77
Beginner Joined: 16 Sep 20 Status: Offline Points: 2 |
Posted: 16 Sep 20 at 2:15PM |
Hi guys,

I'm extracting titles from PDFs for an upload, but the extracted title text comes out flipped: its lines are in reverse order. In addition, when I do the same for the Ref. No, the number is split into two parts, with the prefix A1 placed on a separate line (in the same cell as the Document No) for some reason.

Based on the coordinates, the title text should be extracted as:

    BOROREN SUBSTATION COMMON SERVICES DRAWING LIST

However, it is extracted as:

    DRAWING LIST SUBSTATION COMMON SERVICES BOROREN

Similarly, the Ref. No text should be extracted on one line:

    A1-H-157950-001

However, it is split into two lines and extracted as:

    -H-157950-001
    A1

Below is the code I've used for text extraction.

    pdfLibrary.LoadFromFile(pdfFileName, "");
    pdfLibrary.SetOrigin(1); // Sets the 0,0 coordinate position to the top left of the page; the default is bottom left
    pdfLibrary.SetMeasurementUnits(measureUnit); // Sets the measurement unit to mm/inch
    pdfLibrary.CombineContentStreams();
    pdfLibrary.NormalizePage(0);
    pdfLibrary.SetTextExtractionArea(left, top, width, height);
    extractedContent = pdfLibrary.GetPageText(8);

I have attached the sample PDF I used for text extraction. It can be downloaded using the link below.

https://drive.google.com/file/d/19Ge9L4udGVCQgYhmhFz3WZS5GuDlJmhx/view?usp=sharing

Any assistance with this would be greatly appreciated.

Thank you.
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi Anu,

Try extracting with extract option 2, 3, 4 or 5. These options return the coordinates of the strings or words, and you may see small differences in them that explain the split.

I've now tried it myself. The file you processed is not a typical PDF. It seems to be assembled from several small tables, some in landscape and some in portrait. Option 8 (more accurate), with its special algorithm, seems to have problems with it.

As I said: try option 2, 3, 4 or 5 and it will work. Do your own sort using the row and column values, then remove the position data, and you'll have the correct result.

Below is my test result (font data already removed).

Extraction with positions:

    page;from pages;row;column;textcontent
    00001;00001;00798;00880"BOROREN
    00001;00001;00806;00846"SUBSTATION COMMON SERVICES
    00001;00001;00813;00873"DRAWING LIST

Extraction as wordlist (with wrong results):

    ...
    S.BENIWAL
    DRAWING
    LIST
    SUBSTATION
    COMMON
    SERVICES
    BOROREN
    A1
    ...

Extraction as wordlist with positions:

    ...
    00001;00001;00798;00880"BOROREN
    ...
    00001;00001;00806;00846"SUBSTATION
    00001;00001;00806;00886"COMMON
    00001;00001;00806;00915"SERVICES
    ...
    00001;00001;00813;00873"DRAWING
    00001;00001;00813;00904"LIST
    ...

Cheers and welcome here,
Ingo

Edited by Ingo - 16 Sep 20 at 7:20PM
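The "sort by row and column, then remove the position data" step could look roughly like the sketch below. It is written in Python only to keep it short, and it works on the text records rather than on the library itself. It assumes each record has the `page;from pages;row;column"text` layout shown in the test result in this thread; the raw output of extract options 2 to 5 may be laid out differently, so the parsing would need adapting. The names `parse_line` and `reorder` are hypothetical helpers, not part of Quick PDF Library.

```python
def parse_line(line):
    """Split one 'page;pages;row;column"text' record into (page, row, column, text).

    Assumption: the record layout matches the test result shown in this thread.
    """
    position, text = line.split('"', 1)
    page, _pages, row, column = (int(part) for part in position.split(";"))
    return page, row, column, text


def reorder(lines):
    """Sort records by page, row, column and join the words that share a row."""
    # Skip anything that is not a position record (e.g. "..." placeholders).
    records = sorted(parse_line(line) for line in lines if '"' in line)
    rows = {}  # dicts keep insertion order, so rows stay sorted top to bottom
    for page, row, _column, text in records:
        rows.setdefault((page, row), []).append(text)
    # The position data is dropped here; only the text is returned.
    return [" ".join(words) for words in rows.values()]


# Word records from the thread, deliberately out of reading order.
sample = [
    '00001;00001;00813;00873"DRAWING',
    '00001;00001;00813;00904"LIST',
    '00001;00001;00806;00846"SUBSTATION',
    '00001;00001;00806;00886"COMMON',
    '00001;00001;00806;00915"SERVICES',
    '00001;00001;00798;00880"BOROREN',
]

for text_line in reorder(sample):
    print(text_line)
```

This sketch groups words only when their row values are exactly equal. If words on the same visual line differ slightly in row (one possible cause of the separated A1 prefix), rows within a small tolerance of each other would need to be merged before joining.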
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.