PDF Extracting values incorrectly...

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3842
Printed Date: 22 Oct 20 at 12:38PM


Topic: PDF Extracting values incorrectly...
Posted By: Anu77
Date Posted: 16 Sep 20 at 2:15PM
Hi Guys,

I'm extracting titles from PDFs for an upload, but the lines of the extracted title come out in reverse order.
In addition, when I do the same for the Ref. No., the number is split into two parts, with the prefix A1 placed on a separate line in the same cell for the Document No.

Based on the coordinates, the title text should be extracted as below:
BOROREN
SUBSTATION COMMON SERVICES
DRAWING LIST

However, it is extracted as below:

DRAWING LIST
SUBSTATION COMMON SERVICES
BOROREN

Similarly, the Ref. No. text should be extracted on one line, as below:
A1-H-157950-001

However, it is split into two lines and extracted as below:
-H-157950-001
A1

Below is the code I've used for text extraction.

pdfLibrary.LoadFromFile(pdfFileName, "");
pdfLibrary.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
pdfLibrary.SetMeasurementUnits(measureUnit); // Set the measurement unit to MM/Inch
pdfLibrary.CombineContentStreams();
pdfLibrary.NormalizePage(0);
pdfLibrary.SetTextExtractionArea(left, top, width, height);
extractedContent = pdfLibrary.GetPageText(8);

I have attached the sample PDF I used for text extraction. It can be downloaded using the link below.

https://drive.google.com/file/d/19Ge9L4udGVCQgYhmhFz3WZS5GuDlJmhx/view?usp=sharing

If someone can give any assistance on this, it would be greatly appreciated.

Thank You.



Replies:
Posted By: Ingo
Date Posted: 16 Sep 20 at 7:18PM
Hi Anu,

Try extracting with extract option 2, 3, 4 or 5.
With these you'll get the coordinates of the strings or words, and you may see small position differences that explain the split.
I've now tried it myself.
The PDF you processed isn't a normal one.
It seems to be assembled from several small tables, some in landscape, some in portrait, ...
Option 8 (more accurate) with its special algorithm seems to have problems with it.
As I said already: try option 2, 3, 4 or 5 and it will work.
Sort the results yourself by their row and column values and finally remove the position data; then you'll have the correct result.

Below is my test result (font data already removed):

Extraction with positions:
page;from pages;row;column;textcontent
>>> 00001;00001;00798;00880"BOROREN
>>> 00001;00001;00806;00846"SUBSTATION COMMON SERVICES
>>> 00001;00001;00813;00873"DRAWING LIST

Extraction as wordlist (with wrong results):
...
S.BENIWAL
>>> DRAWING
>>> LIST
>>> SUBSTATION
>>> COMMON
>>> SERVICES
>>> BOROREN
A1
...

Extraction as wordlist with positions:
...
00001;00001;00798;00880"BOROREN
...
00001;00001;00806;00846"SUBSTATION
00001;00001;00806;00886"COMMON
00001;00001;00806;00915"SERVICES
...
00001;00001;00813;00873"DRAWING
00001;00001;00813;00904"LIST
...
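
The do-it-yourself sort described above could look roughly like the following Python sketch. It assumes the position lines have exactly the layout shown in this test output (page;from pages;row;column"text, font data already removed) and that smaller row values are higher on the page, as in the sample. The function names and the parameters row_tolerance and separator are made up for illustration; they are not part of the Debenu library.

```python
from typing import Iterable, List, Tuple

# (page, row, column, text) -- field order is an assumption based on the
# header line 'page;from pages;row;column;textcontent' in the test output.
Fragment = Tuple[int, int, int, str]


def parse_fragment(line: str) -> Fragment:
    """Split one position line into page, row, column and text."""
    # Drop surrounding whitespace and any leading '>>> ' highlight marker.
    line = line.strip().lstrip("> ")
    position, _, text = line.partition('"')
    page, _from_pages, row, column = (int(field) for field in position.split(";"))
    return page, row, column, text


def to_reading_order(lines: Iterable[str],
                     row_tolerance: int = 0,
                     separator: str = " ") -> List[str]:
    """Sort fragments top-to-bottom, left-to-right, and drop the positions.

    Fragments on the same page whose row differs from the first fragment
    of the current line by at most row_tolerance are merged into one line.
    """
    fragments = sorted(parse_fragment(l) for l in lines if '"' in l)
    rows: List[List[Fragment]] = []
    for frag in fragments:
        current = rows[-1] if rows else None
        if (current
                and current[0][0] == frag[0]
                and frag[1] - current[0][1] <= row_tolerance):
            current.append(frag)
        else:
            rows.append([frag])
    # Within each merged line, order by column, then keep only the text.
    return [separator.join(f[3] for f in sorted(row, key=lambda f: f[2]))
            for row in rows]


if __name__ == "__main__":
    # Word-level positions from the test output, deliberately shuffled.
    sample = [
        '00001;00001;00813;00873"DRAWING',
        '00001;00001;00813;00904"LIST',
        '00001;00001;00806;00846"SUBSTATION',
        '00001;00001;00806;00886"COMMON',
        '00001;00001;00806;00915"SERVICES',
        '00001;00001;00798;00880"BOROREN',
    ]
    for text in to_reading_order(sample):
        print(text)
    # BOROREN
    # SUBSTATION COMMON SERVICES
    # DRAWING LIST
```

For a value that arrives in pieces, such as the split Ref. No., a non-zero row_tolerance together with separator="" might rejoin the parts, but that depends on the actual coordinates of the pieces, which are not shown in this thread.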


Cheers and welcome here,
Ingo