Debenu Quick PDF Library - PDF SDK Community Forum Homepage
PDF Extract Text is not extracting text accurately

nshsk (Beginner, joined 28 Sep 17)
Posted: 28 Sep 17 at 2:52AM

For some PDF files this code does not extract the text accurately, while for others it works perfectly. Is there a solution for this? Am I doing anything wrong here?

Here is my sample code. 

pdfLibrary.LoadFromFile(pdfFileName, "");
pdfLibrary.SetOrigin(1); 
pdfLibrary.SetMeasurementUnits(1);
pdfLibrary.SetTextExtractionArea(left, top, width, height);
pdfLibrary.SetTextExtractionOptions(5, 1);
pdfLibrary.NormalizePage(0);
string ExtractedContent = pdfLibrary.GetPageText(8);

tfrost (Senior Member, joined 06 Sep 10, Location: UK)
Posted: 28 Sep 17 at 9:54AM
What exactly do you mean by 'not accurately'?  Are some characters incorrect? Some characters missing? Whole sections of text missing?  Or is some text not extracted because it is part of an image (which would need OCR to identify)?

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 28 Sep 17 at 8:22PM
Hi nshsk,

Is the result the same with all the other possible extraction options, or only when using 8?
Where is the PDF? You should upload it somewhere using a file host, so we can check whether we get the same results.
Your sample is okay.

Cheers and welcome here,
Ingo

nshsk (Beginner, joined 28 Sep 17)
Posted: 02 Oct 17 at 12:46AM
Hi Ingo / tfrost,

Thanks for the reply. For some PDF files, the extracted text is not within the coordinates that I specify in my code. I have only tried option 8, because my requirement is to extract specific text within a given region. Since the other options extract more text along with some formatting information, I'm not using them.

I have uploaded two sample files to Dropbox that I used to test my code; the links are below. From these PDFs I need to extract information such as Dwg No, Rev No, Project No, and Title. These are in the bottom-right region of the file. These details are not extracted correctly.

https://www.dropbox.com/s/zq6hjxtuohrlz6z/1_Sample_DA-2014-626-1-Architectural-Plans-Part-1-1-5-Main-Street-MOUNT-ANNAN-LOT-206-DP-1070297.pdf?dl=0

https://www.dropbox.com/s/lmjyr8ipysk2bf7/A02.01%202%20-%20BASEMENT%204%20FLOOR%20PLAN.pdf?dl=0



Edited by nshsk - 02 Oct 17 at 4:36AM

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 02 Oct 17 at 3:39PM
Hi nshsk,

When PDF documents are created, rotation is often used.
This means that the text you see when you load the PDF in a viewer can be stored INSIDE the PDF at a different place, and perhaps on a different layer.
So you should call
CombineContentStreams
and
NormalizePage
first, before any extraction.
 
Cheers,
Ingo

nshsk (Beginner, joined 28 Sep 17)
Posted: 03 Oct 17 at 6:46AM
Hi Ingo,

Thanks for the prompt response!
As instructed, I have used the CombineContentStreams method. Now the PDF text extraction works perfectly. :)
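
For anyone landing on this thread later, the change discussed above might look roughly like the sketch below. It reuses only the calls and argument values that appear in this thread. The names `RegionTextExtractor` and `ExtractRegionText` are made up for illustration, `dynamic` is used only so the sketch compiles without the library wrapper, and calling CombineContentStreams and NormalizePage directly after loading is an assumption based on the advice to run them "first, before any extraction".

```csharp
public static class RegionTextExtractor
{
    // Hypothetical helper, not part of the library. pdfLibrary is expected to
    // be the same library object used in the first post; it is typed as
    // dynamic here so that this sketch does not depend on the vendor wrapper.
    public static string ExtractRegionText(
        dynamic pdfLibrary, string pdfFileName,
        double left, double top, double width, double height)
    {
        pdfLibrary.LoadFromFile(pdfFileName, "");

        // Assumed placement: merge the content streams and normalize the page
        // before any extraction settings or extraction calls are made.
        pdfLibrary.CombineContentStreams();
        pdfLibrary.NormalizePage(0);

        // Same settings and values as in the first post.
        pdfLibrary.SetOrigin(1);
        pdfLibrary.SetMeasurementUnits(1);
        pdfLibrary.SetTextExtractionArea(left, top, width, height);
        pdfLibrary.SetTextExtractionOptions(5, 1);

        // Extraction option 8, as in the first post.
        return pdfLibrary.GetPageText(8);
    }
}
```

The only differences from the code in the first post are the added CombineContentStreams call and the earlier position of NormalizePage. The calls act on the currently selected page, as in the original sample.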