PDF Extract Text does not extract text accurately

nshsk, Beginner, Joined: 28 Sep 17, Posted: 28 Sep 17 at 2:52AM
	For some PDF files this code does not extract the text accurately. For other PDF files it works perfectly. Is there any solution for this? Am I doing anything wrong here?

	Here is my sample code.

	pdfLibrary.LoadFromFile(pdfFileName, "");
	pdfLibrary.SetOrigin(1);
	pdfLibrary.SetMeasurementUnits(1);
	pdfLibrary.SetTextExtractionArea(left, top, width, height);
	pdfLibrary.SetTextExtractionOptions(5, 1);
	pdfLibrary.NormalizePage(0);
	string ExtractedContent = pdfLibrary.GetPageText(8);

tfrost, Senior Member, Joined: 06 Sep 10, Location: UK, Posted: 28 Sep 17 at 9:54AM
	What exactly do you mean by 'not accurately'? Are some characters incorrect? Some characters missing? Whole sections of text missing? Or is some text not extracted because it is part of an image (which would need OCR to identify)?

Ingo, Moderator Group, Joined: 29 Oct 05, Posted: 28 Sep 17 at 8:22PM
	Hi nshsk, is the result the same with all the other possible extract options, or only when using 8? Where is the PDF? You should upload it somewhere using a file hoster so we can check whether we get the same results. Your sample is okay. Cheers and welcome here, Ingo

nshsk, Beginner, Joined: 28 Sep 17, Posted: 02 Oct 17 at 12:46AM
	Hi Ingo / tfrost, thanks for the reply. For some PDF files the extracted text is not within the coordinates which I'm specifying in my code. I have only tried option 8, since my requirement is to extract specific text within the given region. Since the other options extract more text with some formatting information, I'm not using them. I have uploaded 2 sample files to Dropbox which I have used to test my code; the links are below. From these PDFs I need to extract information such as Dwg No, Rev No, Project No, and Title. These are in the bottom right region of the file. These details are not extracted correctly.
	https://www.dropbox.com/s/zq6hjxtuohrlz6z/1_Sample_DA-2014-626-1-Architectural-Plans-Part-1-1-5-Main-Street-MOUNT-ANNAN-LOT-206-DP-1070297.pdf?dl=0
	https://www.dropbox.com/s/lmjyr8ipysk2bf7/A02.01%202%20-%20BASEMENT%204%20FLOOR%20PLAN.pdf?dl=0
	Edited by nshsk - 02 Oct 17 at 4:36AM

Ingo, Moderator Group, Joined: 29 Oct 05, Posted: 02 Oct 17 at 3:39PM
	Hi nshsk, rotation is often used when PDF documents are created. This means that the text you see when loading the PDF in a viewer can be at a different place INSIDE the PDF, and perhaps on a different layer. So you should call CombineContentStreams and NormalizePage first, before any extraction.
	Cheers, Ingo

nshsk, Beginner, Joined: 28 Sep 17, Posted: 03 Oct 17 at 6:46AM
	Hi Ingo, thanks for the prompt response! As instructed, I have used the CombineContentStreams method. Now the PDF text extraction works perfectly.
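
nshsk did not post the revised code, so the following is only a sketch of how the original sample might look with Ingo's suggestion applied. The interface IPdfTextLibrary, the class RegionTextExtractor and the method ExtractRegionText are hypothetical names introduced here so the snippet compiles on its own. The parameter and return types are assumptions and should be checked against the library's reference. Placing CombineContentStreams directly before the existing NormalizePage call is also an assumption; Ingo only says both should run before extraction.

```csharp
// Hypothetical stand-in for the pdfLibrary object used in this thread.
// Only the method names come from the posts; the parameter and return
// types are assumptions, not the vendor's documented signatures.
public interface IPdfTextLibrary
{
    int LoadFromFile(string fileName, string password);
    int SetOrigin(int origin);
    int SetMeasurementUnits(int units);
    int SetTextExtractionArea(double left, double top, double width, double height);
    int SetTextExtractionOptions(int optionId, int newValue);
    int CombineContentStreams();
    int NormalizePage(int normalizeOptions);
    string GetPageText(int extractOptions);
}

public static class RegionTextExtractor
{
    // Same call sequence as the first post, with CombineContentStreams added.
    public static string ExtractRegionText(
        IPdfTextLibrary pdfLibrary, string pdfFileName,
        double left, double top, double width, double height)
    {
        pdfLibrary.LoadFromFile(pdfFileName, "");
        pdfLibrary.SetOrigin(1);
        pdfLibrary.SetMeasurementUnits(1);
        pdfLibrary.SetTextExtractionArea(left, top, width, height);
        pdfLibrary.SetTextExtractionOptions(5, 1);

        // Added per Ingo's advice: combine the page's content streams,
        // then normalize the page, before any text is extracted.
        pdfLibrary.CombineContentStreams();
        pdfLibrary.NormalizePage(0);

        // Option 8 is the extraction mode nshsk used for region-based text.
        return pdfLibrary.GetPageText(8);
    }
}
```

With the real library object, the same calls would be made on it directly, as in the first post. The interface is only there to keep the sketch self-contained.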