Debenu Quick PDF Library - PDF SDK Community Forum Homepage

PDF Extracting values incorrectly...

Author: Anu77
Posted: 16 Sep 20 at 2:15PM
Hi Guys,

I'm extracting titles from PDFs for an upload, but the extracted title lines come out in reverse order.
In addition, when doing the same for the Ref. No, the number is split into two parts, with the prefix A1 placed on a separate line, in the same cell for the Document No, for some reason.

Based on the coordinates, the title text should be extracted as below:
BOROREN
SUBSTATION COMMON SERVICES
DRAWING LIST

However, it's extracted as below:

DRAWING LIST
SUBSTATION COMMON SERVICES
BOROREN

Similarly, the Ref. No text should be extracted on one line, as below:
A1-H-157950-001

However, it's split into two lines and extracted as below:
-H-157950-001
A1

Below is the code I've used for text extraction.

pdfLibrary.LoadFromFile(pdfFileName, ""); // Second argument is the password; empty here
pdfLibrary.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
pdfLibrary.SetMeasurementUnits(measureUnit); // Set the measurement unit to MM/Inch
pdfLibrary.CombineContentStreams();
pdfLibrary.NormalizePage(0);
pdfLibrary.SetTextExtractionArea(left, top, width, height); // Restrict extraction to this rectangle
extractedContent = pdfLibrary.GetPageText(8); // Extraction option 8

I have attached the sample PDF I used for text extraction. It can be downloaded using the link below.

https://drive.google.com/file/d/19Ge9L4udGVCQgYhmhFz3WZS5GuDlJmhx/view?usp=sharing

If someone can give any assistance on this, it would be greatly appreciated.

Thank You.

Author: Ingo (Moderator Group)
Posted: 16 Sep 20 at 7:18PM
Hi Anu,

Try extracting with extract option 2, 3, 4 or 5.
These options return the coordinates of the strings or words, and you may see small position differences that explain the split.
I've now tried it myself.
The PDF you processed isn't a normal one.
It seems to be assembled from several small tables, in landscape, as portrait, ...
It seems option 8 (more accurate), with its special algorithm, has problems with it.
As I said already: try option 2, 3, 4 or 5 and it will work.
Do your own sort using the row and column values and, as a last step, remove the position data. Then you'll have the correct result.

Below is my test result (font data already removed):

Extraction with positions:
page;from pages;row;column;textcontent
>>> 00001;00001;00798;00880"BOROREN
>>> 00001;00001;00806;00846"SUBSTATION COMMON SERVICES
>>> 00001;00001;00813;00873"DRAWING LIST
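
The "make your own sort" step can be sketched roughly as follows. This is only a minimal Python sketch, not library code: it assumes each line has the layout shown in the header above (page;from pages;row;column, then a double quote, then the text) and that smaller row values are higher on the page, which matches the sample. The function names parse_positioned_line and sort_and_strip are made up for this illustration.

```python
def parse_positioned_line(line):
    """Parse one 'page;from pages;row;column"text' line into a tuple."""
    header, _, text = line.partition('"')
    page, _from_pages, row, column = (int(field) for field in header.split(";"))
    return page, row, column, text


def sort_and_strip(lines):
    """Order the strings top to bottom, left to right, then drop positions."""
    records = []
    for raw in lines:
        line = raw.strip().lstrip("> ")  # drop an optional '>>> ' highlight
        if '"' not in line:  # skip the header line, '...' and empty lines
            continue
        records.append(parse_positioned_line(line))
    records.sort(key=lambda rec: rec[:3])  # page, then row, then column
    return [text for _page, _row, _column, text in records]


# Sample taken from the positioned output above, deliberately in the
# flipped order that the original question describes.
sample = [
    '00001;00001;00813;00873"DRAWING LIST',
    '00001;00001;00806;00846"SUBSTATION COMMON SERVICES',
    '00001;00001;00798;00880"BOROREN',
]

for text in sort_and_strip(sample):
    print(text)
```

Sorting on the row value first restores the top-to-bottom order regardless of the order in which the strings are stored in the content stream.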

Extraction as wordlist (with wrong results):
...
S.BENIWAL
>>> DRAWING
>>> LIST
>>> SUBSTATION
>>> COMMON
>>> SERVICES
>>> BOROREN
A1
...

Extraction as wordlist with positions:
...
00001;00001;00798;00880"BOROREN
...
00001;00001;00806;00846"SUBSTATION
00001;00001;00806;00886"COMMON
00001;00001;00806;00915"SERVICES
...
00001;00001;00813;00873"DRAWING
00001;00001;00813;00904"LIST
...
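
For the word list with positions, the same idea needs one more step: words whose row values are (nearly) equal have to be put back on one line. Below is a hedged Python sketch of that grouping. The name words_to_lines, the row_tolerance of 2 units and the joiner parameter are assumptions made for this illustration; the tolerance in particular would need tuning against the real documents.

```python
def words_to_lines(entries, row_tolerance=2, joiner=" "):
    """Rebuild text lines from 'page;from pages;row;column"word' entries."""
    words = []
    for entry in entries:
        header, _, text = entry.partition('"')
        page, _from_pages, row, column = (int(f) for f in header.split(";"))
        words.append((page, row, column, text))
    words.sort()  # page, then row, then column

    lines = []  # each item: (page, row of the line's first word, [(column, text), ...])
    for page, row, column, text in words:
        same_line = (
            lines
            and lines[-1][0] == page
            and row - lines[-1][1] <= row_tolerance
        )
        if same_line:
            lines[-1][2].append((column, text))
        else:
            lines.append((page, row, [(column, text)]))
    # Within each line, order the words left to right and drop the positions.
    return [joiner.join(t for _, t in sorted(ws)) for _, _, ws in lines]


# Word entries copied from the output above, in shuffled order.
title_words = [
    '00001;00001;00813;00904"LIST',
    '00001;00001;00806;00915"SERVICES',
    '00001;00001;00798;00880"BOROREN',
    '00001;00001;00806;00846"SUBSTATION',
    '00001;00001;00813;00873"DRAWING',
    '00001;00001;00806;00886"COMMON',
]
print(words_to_lines(title_words))

# HYPOTHETICAL positions: the thread does not show coordinates for the
# Ref. No fragments, so these row/column values are invented.
ref_fragments = [
    '00001;00001;00851;00912"-H-157950-001',
    '00001;00001;00850;00900"A1',
]
print(words_to_lines(ref_fragments, joiner=""))
```

The second call only illustrates how a small row difference between two fragments could be absorbed by the tolerance. Whether that is really the cause of the A1 split has to be checked against the actual coordinates.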


Cheers and welcome here,
Ingo




Edited by Ingo - 16 Sep 20 at 7:20PM