PDF Extracting values incorrectly...


Hi Guys,

I'm extracting titles from PDFs for an upload, but the extracted title lines come out in reverse order.
In addition, when I do the same for the Ref. No, the number is split into two parts, with the prefix A1 placed on a separate line in the same Document No cell, for some reason.

Based on the coordinates, the title text should be extracted as below:
BOROREN
SUBSTATION COMMON SERVICES
DRAWING LIST

However, it is extracted as below:

DRAWING LIST
SUBSTATION COMMON SERVICES
BOROREN

Similarly, the Ref. No text should be extracted on one line, as below:
A1-H-157950-001

However, it is split into two lines and extracted as below:
-H-157950-001
A1

Below is the code I've used for text extraction.

pdfLibrary.LoadFromFile(pdfFileName, "");
pdfLibrary.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
pdfLibrary.SetMeasurementUnits(measureUnit); // Set the measurement unit to MM/Inch
pdfLibrary.CombineContentStreams();
pdfLibrary.NormalizePage(0);
pdfLibrary.SetTextExtractionArea(left, top, width, height);
extractedContent = pdfLibrary.GetPageText(8); // Extract option 8 (the "more accurate" algorithm)

I have attached the sample PDF I used for text extraction. It can be downloaded using the link below.

https://drive.google.com/file/d/19Ge9L4udGVCQgYhmhFz3WZS5GuDlJmhx/view?usp=sharing

If someone could give any assistance on this, it would be greatly appreciated.

Thank You.

Anu77 (Beginner, Joined: 16 Sep 20) Posted: 16 Sep 20 at 2:15PM

Ingo (Moderator Group, Joined: 29 Oct 05) Posted: 16 Sep 20 at 7:18PM
Hi Anu,

try extracting with extract option 2, 3, 4 or 5. These return the coordinates of the strings or words, so you may see small position differences that explain the split.

I've now tried it myself. The PDF you've processed is not a normal one. It seems to be assembled from several small tables, some in landscape and some in portrait. Option 8 (more accurate), with its special algorithm, seems to have problems with it.

As I said already: try option 2, 3, 4 or 5 and it will work. Sort the results yourself by the row and column values and finally remove the position data. Then you'll have the correct result.

Below is my test result (font data already removed):

Extraction with positions:
page;from pages;row;column;textcontent
>>> 00001;00001;00798;00880"BOROREN
>>> 00001;00001;00806;00846"SUBSTATION COMMON SERVICES
>>> 00001;00001;00813;00873"DRAWING LIST

Extraction as wordlist (with wrong results):
...
S.BENIWAL
>>> DRAWING
>>> LIST
>>> SUBSTATION
>>> COMMON
>>> SERVICES
>>> BOROREN
A1
...

Extraction as wordlist with positions:
...
00001;00001;00798;00880"BOROREN
...
00001;00001;00806;00846"SUBSTATION
00001;00001;00806;00886"COMMON
00001;00001;00806;00915"SERVICES
...
00001;00001;00813;00873"DRAWING
00001;00001;00813;00904"LIST
...

Cheers and welcome here,
Ingo

Edited by Ingo - 16 Sep 20 at 7:20PM
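
The "make your own sort" step suggested in the reply could be sketched roughly as below. This is only an illustration in Python, not code from the PDF library: the function names `parse_record` and `rebuild_lines` and the `row_tolerance` parameter are hypothetical, and the record layout (page;from pages;row;column, then a `"` before the text) is assumed from the test output quoted in the reply, with font data already removed. The sample records are the word positions from that output, deliberately given in the wrong order.

```python
def parse_record(line):
    """Split one 'page;pages;row;column"text' record into (page, row, column, text).

    The layout is an assumption based on the listing quoted in the thread.
    """
    position, _, text = line.partition('"')
    page, _pages, row, column = (int(part) for part in position.split(";"))
    return page, row, column, text


def rebuild_lines(records, row_tolerance=0):
    """Sort records by page and row, group them into lines, order each line by column.

    row_tolerance (hypothetical) lets records whose row values differ slightly
    still be treated as one line; it is in the same units as the positions.
    """
    # Lines without a '"' (headers, "..." placeholders) are skipped.
    parsed = sorted(parse_record(r) for r in records if '"' in r)
    clusters = []
    for rec in parsed:
        last = clusters[-1] if clusters else None
        same_page = last is not None and last[0][0] == rec[0]
        if same_page and rec[1] - last[0][1] <= row_tolerance:
            last.append(rec)
        else:
            clusters.append([rec])
    # Within a line, order left to right by column, then drop the position data.
    return [
        " ".join(rec[3] for rec in sorted(cluster, key=lambda rec: rec[2]))
        for cluster in clusters
    ]


# Word positions from the reply, listed here in scrambled order.
words = [
    '00001;00001;00813;00873"DRAWING',
    '00001;00001;00813;00904"LIST',
    '00001;00001;00806;00846"SUBSTATION',
    '00001;00001;00806;00886"COMMON',
    '00001;00001;00806;00915"SERVICES',
    '00001;00001;00798;00880"BOROREN',
]

for line in rebuild_lines(words):
    print(line)
# BOROREN
# SUBSTATION COMMON SERVICES
# DRAWING LIST
```

Rows are matched exactly by default. If the two parts of the Ref. No turn out to sit on slightly different row values, a small non-zero `row_tolerance` might bring them back onto one line, but that depends on the positions actually reported for that cell.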