PDF Extract Text is not extract text accurately

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3507

Posted By: nshsk
Subject: PDF Extract Text is not extract text accurately
Date Posted: 28 Sep 17 at 2:52AM

For some PDF files this code does not extract the text accurately, while for others it works perfectly. Is there a solution for this? Am I doing anything wrong here?

Here is my sample code.

pdfLibrary.LoadFromFile(pdfFileName, "");
pdfLibrary.SetOrigin(1);
pdfLibrary.SetMeasurementUnits(1);
pdfLibrary.SetTextExtractionArea(left, top, width, height);
pdfLibrary.SetTextExtractionOptions(5, 1);
pdfLibrary.NormalizePage(0);
string ExtractedContent = pdfLibrary.GetPageText(8);

Replies:

Posted By: tfrost
Date Posted: 28 Sep 17 at 9:54AM

What exactly do you mean by 'not accurately'? Are some characters incorrect? Some characters missing? Whole sections of text missing? Or is some text not extracted because it is part of an image (which would need OCR to identify)?

Posted By: Ingo
Date Posted: 28 Sep 17 at 8:22PM

Hi nshsk,

Is the result the same with the other extraction options, or only with option 8?

Where is the PDF? Please upload it to a file hosting service so we can check whether we get the same results.

Your sample is okay.

Cheers and welcome here,

Ingo

-------------
Cheers,
Ingo

Posted By: nshsk
Date Posted: 02 Oct 17 at 12:46AM

Hi Ingo / tfrost,

Thanks for the reply. For some PDF files, the extracted text is not within the coordinates that I specify in my code. I have only tried option 8, because my requirement is to extract specific text within a given region. Since the other options extract more text along with some formatting information, I'm not using them.

I have uploaded 2 sample files to Dropbox that I used to test my code; the links are below. From these PDFs I need to extract information such as Dwg No, Rev No, Project No, Title, etc. These are in the bottom right region of the page. These details are not extracted correctly.

https://www.dropbox.com/s/zq6hjxtuohrlz6z/1_Sample_DA-2014-626-1-Architectural-Plans-Part-1-1-5-Main-Street-MOUNT-ANNAN-LOT-206-DP-1070297.pdf?dl=0

https://www.dropbox.com/s/lmjyr8ipysk2bf7/A02.01%202%20-%20BASEMENT%204%20FLOOR%20PLAN.pdf?dl=0
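
To make the "bottom right region" concrete, here is a minimal sketch of how the four values passed to SetTextExtractionArea(left, top, width, height) could be computed. It assumes a top-left origin with y increasing downwards, which is what SetOrigin(1) in the sample code is understood to select, and it assumes the page size is given in the same measurement units as the extraction area. The Region struct, the TitleBlock class, and the fraction parameters are hypothetical helpers for illustration and are not part of the Debenu Quick PDF Library.

```csharp
using System;

// Hypothetical value type holding an extraction rectangle.
public readonly struct Region
{
    public Region(double left, double top, double width, double height)
    {
        Left = left;
        Top = top;
        Width = width;
        Height = height;
    }

    public double Left { get; }
    public double Top { get; }
    public double Width { get; }
    public double Height { get; }
}

// Hypothetical helper, not a library API.
public static class TitleBlock
{
    // Returns the bottom-right corner of a page as a rectangle measured
    // from a top-left origin (assumption: y grows downwards).
    // widthFraction and heightFraction give the share of the page
    // covered by the region and must lie in (0, 1].
    public static Region BottomRight(double pageWidth, double pageHeight,
                                     double widthFraction, double heightFraction)
    {
        if (pageWidth <= 0 || pageHeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(pageWidth), "Page size must be positive.");
        if (widthFraction <= 0 || widthFraction > 1)
            throw new ArgumentOutOfRangeException(nameof(widthFraction));
        if (heightFraction <= 0 || heightFraction > 1)
            throw new ArgumentOutOfRangeException(nameof(heightFraction));

        double width = pageWidth * widthFraction;
        double height = pageHeight * heightFraction;

        // The left and top edges are whatever remains of the page
        // after the region itself has been subtracted.
        return new Region(pageWidth - width, pageHeight - height, width, height);
    }
}
```

For example, TitleBlock.BottomRight(841, 594, 0.25, 0.25) describes the lower-right quarter-width, quarter-height corner of a page measuring 841 by 594 units. The fractions are placeholders; the real title block size has to be measured on the drawings in question.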

Posted By: Ingo
Date Posted: 02 Oct 17 at 3:39PM

Hi nshsk,

When PDF documents are created, rotation is often used.

This means that the text you see when loading the PDF in a viewer can be at a different place INSIDE the PDF, and perhaps on a different layer.

So you should call CombineContentStreams and NormalizePage first, before any extraction.

-------------
Cheers,
Ingo
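
The advice above can be sketched as a reordering of the original sample: combine the content streams and normalize the page right after loading, and only then set up and run the extraction. The call names and argument values are the ones used in this thread. The IPdfTextLibrary interface is a hypothetical stand-in so the sketch is self-contained; its return types, and the check that LoadFromFile returns 1 on success, are assumptions to verify against your version of the library. Placing the two calls before SetOrigin is also a choice made for this sketch; the thread only says they must come before the extraction.

```csharp
using System;

// Hypothetical interface listing only the calls used in this thread.
// Return types are assumptions; adapt them to your library wrapper.
public interface IPdfTextLibrary
{
    int LoadFromFile(string fileName, string password);
    int CombineContentStreams();
    int NormalizePage(int normalizeOptions);
    int SetOrigin(int origin);
    int SetMeasurementUnits(int units);
    int SetTextExtractionArea(double left, double top, double width, double height);
    int SetTextExtractionOptions(int optionId, int newValue);
    string GetPageText(int extractOptions);
}

// Hypothetical helper, not a library API.
public static class RegionTextExtractor
{
    public static string Extract(IPdfTextLibrary pdf, string fileName,
                                 double left, double top, double width, double height)
    {
        // Assumption: 1 signals success.
        if (pdf.LoadFromFile(fileName, "") != 1)
            throw new InvalidOperationException("Could not load " + fileName);

        // Merge the page's content streams and normalize the page
        // before any extraction, as suggested in the reply above.
        pdf.CombineContentStreams();
        pdf.NormalizePage(0);

        // Same settings as the original sample code.
        pdf.SetOrigin(1);
        pdf.SetMeasurementUnits(1);
        pdf.SetTextExtractionArea(left, top, width, height);
        pdf.SetTextExtractionOptions(5, 1);

        return pdf.GetPageText(8);
    }
}
```

With the real library, the interface can be replaced by the library class itself, or implemented by a thin adapter that forwards each call.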

Posted By: nshsk
Date Posted: 03 Oct 17 at 6:46AM

Hi Ingo,

Thanks for the prompt response!

As instructed, I have used the CombineContentStreams method. Now the PDF text extraction works perfectly.