PDF Extract Text is not extracting text accurately
nshsk
Beginner Joined: 28 Sep 17 Status: Offline Points: 6
Posted: 28 Sep 17 at 2:52AM
For some PDF files this code does not extract the text accurately; for other PDF files it works perfectly. Is there a solution for this? Am I doing anything wrong here? Here is my sample code:

pdfLibrary.LoadFromFile(pdfFileName, "");
pdfLibrary.SetOrigin(1);
pdfLibrary.SetMeasurementUnits(1);
pdfLibrary.SetTextExtractionArea(left, top, width, height);
pdfLibrary.SetTextExtractionOptions(5, 1);
pdfLibrary.NormalizePage(0);
string ExtractedContent = pdfLibrary.GetPageText(8);
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437
What exactly do you mean by 'not accurately'? Are some characters incorrect? Some characters missing? Whole sections of text missing? Or is some text not extracted because it is part of an image (which would need OCR to identify)?
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524
Hi nshsk,
Is the result the same with all the other possible extraction options, or only when using 8? Where is the PDF? Please upload it somewhere using a file hoster so we can check whether we get the same results. Your sample is okay. Cheers and welcome here, Ingo
nshsk
Beginner Joined: 28 Sep 17 Status: Offline Points: 6
Hi Ingo / tfrost,
I have uploaded two sample files to Dropbox that I used to test my code; the links are below. From these PDFs I need to extract information such as Dwg No, Rev No, Project No and Title. These are in the bottom-right region of the file. These details are not extracted correctly.
https://www.dropbox.com/s/zq6hjxtuohrlz6z/1_Sample_DA-2014-626-1-Architectural-Plans-Part-1-1-5-Main-Street-MOUNT-ANNAN-LOT-206-DP-1070297.pdf?dl=0
https://www.dropbox.com/s/lmjyr8ipysk2bf7/A02.01%202%20-%20BASEMENT%204%20FLOOR%20PLAN.pdf?dl=0
Edited by nshsk - 02 Oct 17 at 4:36AM
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524
Hi nshsk,
When PDF documents are created, rotation is often used. This means that the text you see when loading the PDF in a viewer can be stored INSIDE the PDF at a different place, and perhaps on a different layer. So you should call CombineContentStreams and NormalizePage first, before any extraction.
Cheers,
Ingo
nshsk
Beginner Joined: 28 Sep 17 Status: Offline Points: 6
Hi Ingo,
Thanks for the prompt response! As instructed, I have used the CombineContentStreams method. Now the PDF text extraction works perfectly.
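
For reference, the fix discussed in this thread can be sketched roughly as follows. This is only a sketch, not code posted by anyone above: it reuses the calls from the original sample and adds CombineContentStreams before NormalizePage. The wrapper class, the method name ExtractRegionText and the dynamic parameter are hypothetical, used only so the snippet does not depend on a particular version of the library's C# wrapper. Placing the two calls directly after LoadFromFile is an assumption; the advice above only says they should come before any extraction.

```csharp
static class TextExtractionSketch
{
    // Hypothetical helper. pdfLibrary is typed as dynamic only so that this
    // sketch compiles without referencing a specific wrapper class.
    public static string ExtractRegionText(
        dynamic pdfLibrary, string pdfFileName,
        double left, double top, double width, double height)
    {
        pdfLibrary.LoadFromFile(pdfFileName, "");

        // Combine the content streams and normalize the page before any
        // extraction call, as advised in this thread.
        // (Assumption: doing this right after loading is early enough.)
        pdfLibrary.CombineContentStreams();
        pdfLibrary.NormalizePage(0);

        // Same settings as in the original sample.
        pdfLibrary.SetOrigin(1);
        pdfLibrary.SetMeasurementUnits(1);
        pdfLibrary.SetTextExtractionArea(left, top, width, height);
        pdfLibrary.SetTextExtractionOptions(5, 1);

        return pdfLibrary.GetPageText(8);
    }
}
```

With a real library instance, the call would look like ExtractRegionText(pdfLibrary, pdfFileName, left, top, width, height), using the same variables as the original sample.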