PDF Extracting values incorrectly...

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3842
Printed Date: 07 Jan 26 at 3:51PM

Topic: PDF Extracting values incorrectly...

Posted By: Anu77
Subject: PDF Extracting values incorrectly...
Date Posted: 16 Sep 20 at 2:15PM

Hi Guys,

I'm extracting titles from PDFs for an upload, but when the titles are extracted the order of the lines is reversed.

In addition, when I do the same for the Ref. No, the number is split into two parts, with the prefix A1 placed on a separate line within the same Document No cell, for some reason.

Based on the coordinates, the title text should be extracted as below:

BOROREN

SUBSTATION COMMON SERVICES

DRAWING LIST

However, it is extracted as below:

DRAWING LIST

SUBSTATION COMMON SERVICES

BOROREN

Similarly, the Ref. No text should be extracted on one line, as below:

A1-H-157950-001

However, it is split into two lines and the text is extracted as below:

-H-157950-001

Below is the code I've used for text extraction.

pdfLibrary.LoadFromFile(pdfFileName, "");

pdfLibrary.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left

pdfLibrary.SetMeasurementUnits(measureUnit); // Set the measurement unit to MM/Inch

pdfLibrary.CombineContentStreams();

pdfLibrary.NormalizePage(0);

pdfLibrary.SetTextExtractionArea(left, top, width, height);

extractedContent = pdfLibrary.GetPageText(8);

I have attached the sample PDF I used for text extraction. It can be downloaded using the link below.

https://drive.google.com/file/d/19Ge9L4udGVCQgYhmhFz3WZS5GuDlJmhx/view?usp=sharing

If someone can give any assistance on this, it would be greatly appreciated.

Thank You.

Replies:

Posted By: Ingo
Date Posted: 16 Sep 20 at 7:18PM

Hi Anu,

Try extracting with extract option 2, 3, 4 or 5.
With these you get the coordinates of the strings or words, and you may see small differences in position that explain the split.
I've now tried it myself.
The PDF you've processed is not a normal one.
It seems to be assembled from several small tables, some in landscape, some in portrait, ...
It seems option 8 (more accurate), with its special algorithm, has problems with it.
As I said already: try option 2, 3, 4 or 5 and it will work.
Do your own sort using the row and column values and finally remove the position data; then you'll have the correct result.

Below my test result (font data already removed):

Extraction with positions:
page;from pages;row;column;textcontent
>>> 00001;00001;00798;00880"BOROREN
>>> 00001;00001;00806;00846"SUBSTATION COMMON SERVICES
>>> 00001;00001;00813;00873"DRAWING LIST

Extraction as wordlist (with wrong results):
...
S.BENIWAL
>>> DRAWING
>>> LIST
>>> SUBSTATION
>>> COMMON
>>> SERVICES
>>> BOROREN
A1
...

Extraction as wordlist with positions:
...
00001;00001;00798;00880"BOROREN
...
00001;00001;00806;00846"SUBSTATION
00001;00001;00806;00886"COMMON
00001;00001;00806;00915"SERVICES
...
00001;00001;00813;00873"DRAWING
00001;00001;00813;00904"LIST
...
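
The "make your own sort" step described above can be sketched roughly as follows. This is only an illustration in Python, not library code: it assumes the positioned records look exactly like the listing above (page;from pages;row;column"text), that row values grow from top to bottom of the page, and that the raw library output has already been reduced to that form. The names Fragment, parse_fragments, rebuild_text, row_tolerance and separator are made up for this example.

```python
import re
from typing import NamedTuple


class Fragment(NamedTuple):
    page: int
    row: int
    column: int
    text: str


# Matches records such as: 00001;00001;00798;00880"BOROREN
# An optional ">>> " highlight marker in front of the record is ignored.
LINE_RE = re.compile(r'^(?:>>>\s*)?(\d+);(\d+);(\d+);(\d+)"(.*)$')


def parse_fragments(lines):
    """Turn positioned records into Fragment tuples, skipping other lines."""
    fragments = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. "..." or any line without position data
        page, _from_pages, row, column, text = match.groups()
        fragments.append(Fragment(int(page), int(row), int(column), text))
    return fragments


def rebuild_text(fragments, row_tolerance=0, separator=" "):
    """Sort by page, row and column, then drop the position data.

    Fragments whose row lies within row_tolerance of the first fragment
    of the current line are treated as belonging to that line.
    """
    ordered = sorted(fragments, key=lambda f: (f.page, f.row, f.column))
    lines, current = [], []

    def flush():
        # Within one output line, order strictly left to right.
        by_column = sorted(current, key=lambda f: f.column)
        lines.append(separator.join(f.text for f in by_column))

    for frag in ordered:
        starts_new_line = current and (
            frag.page != current[0].page
            or frag.row - current[0].row > row_tolerance
        )
        if starts_new_line:
            flush()
            current = []
        current.append(frag)
    if current:
        flush()
    return lines


if __name__ == "__main__":
    # Word records from the listing above, fed in the "wrong" order.
    sample = [
        '00001;00001;00813;00873"DRAWING',
        '00001;00001;00813;00904"LIST',
        '00001;00001;00806;00846"SUBSTATION',
        '00001;00001;00806;00886"COMMON',
        '00001;00001;00806;00915"SERVICES',
        '00001;00001;00798;00880"BOROREN',
    ]
    for line in rebuild_text(parse_fragments(sample)):
        print(line)
```

With the word records from the listing this prints BOROREN, SUBSTATION COMMON SERVICES and DRAWING LIST, in that order. If the two parts of the Ref. No turn out to sit on almost the same row, a small row_tolerance together with separator="" might join them again, but that has to be checked against the actual coordinates.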

Cheers and welcome here,
Ingo
