Do you own a Debenu Quick PDF Library version 7, 8, 9, 10, 11, 12, 13 or iSEDQuickPDF license? Upgrade to Debenu Quick PDF Library 14 today!
Extracting Text By CSV Coordinates |
Post Reply |
Author | |
andreweberle
Beginner Joined: 21 Apr 20 Location: Australia Status: Offline Points: 6 |
Post Options
Thanks(0)
Posted: 21 Apr 20 at 3:16AM |
Hello,
I'm trying to extract text by the coordinates using C# I first get the coordinates like so List<TextExtraction> textExtractions = quickPdf.ExtractFilePageText(pdfPath, null, 1, 3).ToCSV('\n').ToList(); this converts the CSV data into a generic list. the list object looks like this. public class TextExtraction { public double FontSize { get; set; } public string FontName { get; set; } public string FontColour { get; set; } public string Text { get; set; } public List<TextPoint> Points { get; set; } } public class TextPoint { public Vector X1Y1 { get; set; } public Vector X2Y2 { get; set; } public Vector X3Y3 { get; set; } public Vector X4Y4 { get; set; } } I am then trying to Set The Text Extraction Area however I am having trouble getting the correct coordinates so it only gets that particular text. var bottomLeft = textExtractions[2].Points[0].X4Y4.X; var topRight = textExtractions[2].Points[0].X2Y2.Y; var width = textExtractions[2].Points[0].X2Y2.X; var height = textExtractions[2].Points[0].X3Y3.Y; quickPdf.SetTextExtractionArea(bottomLeft, topRight, width, height); string getTextByCoordinates = quickPdf.GetPageText(3); Here is an example of the data. ( I have removed most of the lines to save space) "CIDFont+F1",#000000,11.04,481.44,735.4012,483.9351,735.4012,483.9351,748.8706,481.44,748.8706," " "CIDFont+F1",#000000,11.04,36,724.0012,38.4951,724.0012,38.4951,737.4706,36,737.4706," " "CIDFont+F1",#000000,11.04,36,709.4812,95.8794,709.4812,95.8794,722.9506,36,722.9506,"TAX INVOICE " "CIDFont+F1",#000000,11.04,107.9576,709.4812,110.4528,709.4812,110.4528,722.9506,107.9576,722.9506," " "CIDFont+F1",#000000,11.04,143.9495,709.4812,146.4447,709.4812,146.4447,722.9506,143.9495,722.9506," " "CIDFont+F1",#000000,11.04,180.0518,709.4812,182.5469,709.4812,182.5469,722.9506,180.0518,722.9506," " "CIDFont+F1",#000000,11.04,216.0436,709.4812,229.6547,709.4812,229.6547,722.9506,216.0436,722.9506,"25 " Edited by andreweberle - 21 Apr 20 at 3:20AM |
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Post Options
Thanks(0)
|
Hi Andrew,
Before extraction you should set the origin you wanna have: SetOrigin https://www.debenu.com/docs/pdf_library_reference/SetOrigin.php To set your local measurementunits is good as well: SetMeasurementUnits https://www.debenu.com/docs/pdf_library_reference/SetMeasurementUnits.php Because you can have a document with rotated textcontent or other specialities you should use: CombineContentStreams https://www.debenu.com/docs/pdf_library_reference/CombineContentStreams.php NormalizePage https://www.debenu.com/docs/pdf_library_reference/NormalizePage.php Now starts your extraction and it will work like expected ;-) Cheers and welcome here, Ingo |
|
Cheers,
Ingo |
|
andreweberle
Beginner Joined: 21 Apr 20 Location: Australia Status: Offline Points: 6 |
Post Options
Thanks(0)
|
Hey Ingo,
Thanks very much for your reply. I was able to achieve this with your advice. Here is how I achieved it for future people. Although it can detect other lines, from the testing I have done the word you want will always be the first line, anything else can be disregarded.
class Program { public static PDFLibrary QP = new PDFLibrary("DebenuPDFLibraryDLL1016.dll"); static void Main(string[] args) { QP.UnlockKey(LICENCE_KEY); if (QP.Unlocked() > 0) { // Load The File. QP.LoadFromFile(pdfPath, null); QP.SetOrigin(1); QP.SetMeasurementUnits(2); QP.CombineContentStreams(); QP.NormalizePage(2); // Get The Text Collection. List<TextExtraction> textCollection = QP.GetPageText(3).ToList("\r\n").ToList(); // Get Rec. Rect rec = textCollection[6].Points[0].Rect; // X4 Y4 -- X3 QP.DrawBox(rec.Left, rec.Top, rec.Width, rec.Height, 0); // Set The Text Region. if (QP.SetTextExtractionArea(rec.Left, rec.Top, rec.Width, rec.Height) > 0) { // Attempt To Get The Text From The Selected Region. string text = QP.GetPageText(3); // Save The New File. QP.SaveToFile(newPath); // Print Text. Console.WriteLine(text); } // Release Library. QP.ReleaseLibrary(); // Open The PDF. System.Diagnostics.Process.Start(newPath); Console.ReadKey(); } } } /// <summary> /// /// </summary> public class TextExtraction { public string FontName { get; set; } public string FontColour { get; set; } public string Text { get; set; } public List<TextPoint> Points { get; set; } } /// <summary> /// /// </summary> public class TextPoint { public double FontSize { get; set; } public double TextWidth { get; set; } public (double,double) X1Y1 { get; set; } public (double, double) X2Y2 { get; set; } public (double, double) X3Y3 { get; set; } public (double, double) X4Y4 { get; set; } public Rect Rect { get; set; } } /// <summary> /// /// </summary> public static class Extenstions { static readonly Regex CsvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled); /// <summary> /// /// </summary> /// <param name="payload"></param> /// <param name="c"></param> /// <returns></returns> public static IEnumerable<TextExtraction> ToList(this string payload, string c) { // Split The Payload String. List<string> payloadCollection = payload.Split(new string[] {c}, StringSplitOptions.RemoveEmptyEntries).ToList(); payloadCollection.Remove(payloadCollection.Last()); foreach (string str in payloadCollection) { StringBuilder row = new StringBuilder(); // Split The String To Make It Easier To // Get Each Object. foreach (Match match in CsvSplit.Matches(str)) { row.Append(match.Value.TrimStart(',') + '\t'); } row.Length--; string[] obj = row.ToString().Split('\t'); //Create The Text Extraction Object. TextExtraction textExtraction = new TextExtraction() { FontName = obj[0], FontColour = obj[1], Text = obj.Last() }; textExtraction.Text.Replace("\"", ""); textExtraction.Text.TrimEnd(); // Create The Point. textExtraction.Points = new List<TextPoint> { new TextPoint() { FontSize = Convert.ToDouble(obj[2]), X1Y1 = (Convert.ToDouble(obj[3]), Convert.ToDouble(obj[4])), X2Y2 = (Convert.ToDouble(obj[5]), Convert.ToDouble(obj[6])), X3Y3 = (Convert.ToDouble(obj[7]), Convert.ToDouble(obj[8])), X4Y4 = (Convert.ToDouble(obj[9]), Convert.ToDouble(obj[10])), Rect = new Rect(Convert.ToDouble(obj[9]) - 0.1, Convert.ToDouble(obj[10]) + 0.1, Convert.ToDouble(obj[3]), Convert.ToDouble(obj[2]) /72), } }; // TODO: Get Text Width. // Get Text Width. textExtraction.Points[0].TextWidth = GetTextWidth(textExtraction); //Add The Point To The Collection. yield return textExtraction; } } } |
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Post Options
Thanks(0)
|
Hi Andrew,
thanks for your sample - this will help :) I'll move your sample in the samples section later. Again... Thanks a lot. |
|
Cheers,
Ingo |
|
Post Reply | |
Tweet
|
Forum Jump | Forum Permissions You cannot post new topics in this forum You cannot reply to topics in this forum You cannot delete your posts in this forum You cannot edit your posts in this forum You cannot create polls in this forum You cannot vote in polls in this forum |
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved. About — Contact — Blog — Support — Online Store