DASetTextExtractionArea with different origin

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2850

Topic: DASetTextExtractionArea with different origin

Posted By: Cirunz
Subject: DASetTextExtractionArea with different origin
Date Posted: 18 Mar 14 at 2:32PM

Hi, I'm trying to extract text from a specific area on a large number of PDF files.

My first approach is to loop over every file, open it, select the page and extract the text with GetPageText:

//Code to initialize dll reference DPDF

int i = 0;
int mode = 7;
List<string> foundlines = new List<string>();
for (; i < pdffiles.Length; i++)
{
	//Load the current file (the [i] index was missing from the posted code)
	if (DPDF.LoadFromFile(pdffiles[i], "") != 0)
	{
		if (DPDF.SelectPage(1) != 0)//I'm always searching in the first page
		{
			DPDF.SetMeasurementUnits(1);//Millimeters
			DPDF.SetOrigin(1);//Left-Top margin

			//field contains extraction area data
			if (DPDF.SetTextExtractionArea(field.Left, field.Top, field.Width, field.Height) == 1)
			{
				foundlines.Add(DPDF.GetPageText(mode).ToString().Trim());
			}
			
			DPDF.RemoveDocument(DPDF.SelectedDocument());
		}
		else
		{
			errormessage = "SelectPage: " + pdffiles[i];
			break;
		}
	}
	else
	{
		errormessage = "LoadFromFile: " + pdffiles[i];
		break;
	}
}//Extraction cycle end here

if (string.IsNullOrEmpty(errormessage))
{
	if (foundlines != null && foundlines.Count > 0)
	{
		File.WriteAllLines(@"C:\resultlines.txt", foundlines.ToArray());
		result = true;
	}
}

It works fine, but it's not very fast, and it uses a lot of memory.

Worried by these results, I decided to try ExtractFilePageText instead, to keep CPU and memory usage low.

So I changed the loop above as follows:

int i = 0;
int mode = 7;
List<string> foundlines = new List<string>();
DPDF.SetMeasurementUnits(1);//Millimeters
DPDF.SetOrigin(1);//Left-Top margin
for (; i < pdffiles.Length; i++)
{
	//field contains extraction area data
	if (DPDF.DASetTextExtractionArea(field.Left, field.Top, field.Width, field.Height) == 1)
	{
		foundlines.Add(DPDF.ExtractFilePageText(pdffiles[i], "", 1, mode).ToString().Trim());
	}
}//Extraction cycle end here

if (foundlines != null && foundlines.Count > 0)
{
	File.WriteAllLines(@"C:\resultlines.txt", foundlines.ToArray());
	result = true;
}

This does not find anything.

There is a simple explanation for this: the documentation (http://www.debenu.com/docs/pdf_library_reference/DASetTextExtractionArea.php) says DASetTextExtractionArea is relative to the bottom-left corner of the page, and does not mention a way to make SetOrigin (or SetMeasurementUnits) affect this function.

Is there no way to do so? Can ExtractFilePageText only be used with the default origin?

Thank you.

Replies:

Posted By: AndrewC
Date Posted: 19 Mar 14 at 10:58AM

Cirunz,

Yes. It is a complex thing to explain. The DA functions do not support SetOrigin, because SetOrigin is not a DA function. You cannot normally mix DA and non-DA functions, as they use different code to process the file. The exception to this rule is that most of the Extract* functions use the DA code rather than the non-DA code.

You need to adjust the Y position by calling

YPos := QP.DAGetPageHeight(dahandle, dapageref) - YPos;

Andrew.
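
For readers working in C#, the Y flip above can be sketched as a small conversion helper. This is only an illustration under stated assumptions, not library code: `ExtractionAreaConverter` and its methods are hypothetical names, the DA functions are assumed to work in points (1/72 inch) because SetMeasurementUnits is described in this thread as not affecting them, and the page height is assumed to come from DAGetPageHeight as in the line above.

```csharp
using System;

// Hypothetical helper, not part of the Quick PDF Library API.
public static class ExtractionAreaConverter
{
    // 1 inch = 25.4 mm = 72 PDF points.
    private const double PointsPerMillimetre = 72.0 / 25.4;

    public static double MmToPoints(double mm) => mm * PointsPerMillimetre;

    // Converts a rectangle measured in millimetres from the top-left corner
    // of the page into (Left, Top, Width, Height) measured in points from
    // the bottom-left corner.
    // Assumption: pageHeightPoints is the page height in points, e.g. the
    // value returned by DAGetPageHeight.
    public static (double Left, double Top, double Width, double Height) ToBottomLeftPoints(
        double leftMm, double topMm, double widthMm, double heightMm,
        double pageHeightPoints)
    {
        double left = MmToPoints(leftMm);

        // The Y flip: distance from the bottom edge = page height - distance from the top edge.
        double top = pageHeightPoints - MmToPoints(topMm);

        // Width and height are lengths, so they only need the unit conversion.
        return (left, top, MmToPoints(widthMm), MmToPoints(heightMm));
    }
}
```

As a rough check, on a page 792 points tall, an area 25.4 mm from the left and 25.4 mm from the top comes out at about 72 points from the left and about 720 points from the bottom. The four resulting values would then be passed to DASetTextExtractionArea in place of the millimetre values from `field`.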

Posted By: Cirunz
Date Posted: 19 Mar 14 at 11:03AM

AndrewC wrote:

You need to adjust the Y position by calling

YPos := QP.DAGetPageHeight(dahandle, dapageref) - YPos;

Andrew.

Thank you Andrew, this is really helpful.

I have a mixed scenario, so I will use this function to adjust the coordinates, depending on the case.

Thanks again.

Fabio.