Extract table(s) from pdf-file (Delphi)

meligo (Beginner, Joined: 04 May 20, Location: RU, Points: 6) Posted: 04 May 20 at 5:02PM
	Hi! How can I extract a table (or tables, if there are several) from a PDF file? I looked at the functions on your site and searched for table-extraction examples, but did not find anything like it. A Delphi example would be ideal, if possible, but C# is fine too. Thanks in advance!

meligo Posted: 04 May 20 at 5:17PM
	Clarification: each extracted table should be a text block in the form of a list of lines, where each line is one row of the table and the fields within a line are separated by a delimiter, for example <TAB>.
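To make the requested output format concrete, here is a minimal Python sketch of it. The function name `table_to_tsv` and the sample data are made up for illustration; the sketch only shows the target format (one line per row, TAB between fields, empty cells kept), not how the cells are obtained from a PDF.

```python
def table_to_tsv(rows):
    """Join a table (list of rows, each a list of cells) into TAB-separated text."""
    lines = []
    for row in rows:
        # Empty cells become empty strings so that column positions are preserved.
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Hypothetical sample table with one empty cell.
sample = [["Name", "Qty", "Price"], ["Bolt", "10", "0.25"], ["Nut", None, "0.10"]]
tsv_text = table_to_tsv(sample)
print(tsv_text)
```

Text in this form can be pasted or dragged into a spreadsheet, which splits it back into cells.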

Ingo (Moderator Group, Joined: 29 Oct 05, Points: 3530) Posted: 04 May 20 at 7:04PM
	Hi Meligo, sorry, but as you have already found, no samples have been published for extracting table content. For your needs you should take a closer look at the text extraction functions. The CSV option should be relevant for you, because it gives you position data from which rows and columns can be worked out. As AndrewC (R.I.P.) said in the past: "There is going to be no easy way to read this with Debenu Quick PDF Library without some complex strategies." His advice for the first steps was: "Try this QP.SetTextExtractionScaling(0, 2, 8); QP.SetTextExtractionWordGap(0.2); And then call GetPageText(7); You can then split the text into columns and trim leading and trailing spaces." Good luck! Cheers and welcome here, Ingo
	Cheers, Ingo
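The "split into rows and columns using position data" idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is assumed to be already parsed into `(x, y, text)` tuples (the exact layout of the library's CSV output is not shown in this thread), `y` is assumed to grow downwards from the top of the page, and `words_to_rows` and `row_tol` are hypothetical names.

```python
def words_to_rows(words, row_tol=3.0):
    """Group (x, y, text) tuples into rows by similar y, then order each row by x."""
    rows = []  # each entry: (y of the first word in the row, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= row_tol:
            # Close enough vertically: same table row.
            rows[-1][1].append((x, text.strip()))
        else:
            rows.append((y, [(x, text.strip())]))
    # Left-to-right order inside each row; trim was applied above.
    return [[t for _, t in sorted(cells, key=lambda c: c[0])] for _, cells in rows]


# Hypothetical positions, deliberately unordered and slightly misaligned.
words = [
    (200, 101, " 10 "), (50, 100, "Bolt"), (50, 120, "Nut"),
    (200, 119.5, "25"), (50, 80, "Item"), (200, 80, "Qty"),
]
table = words_to_rows(words)
print(table)
```

This simple grouping does not detect empty cells; for that, each word's x position would also have to be matched against known column boundaries.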

meligo Posted: 06 May 20 at 1:33AM
	Hi Ingo! Thanks for such a quick reply!
	Your answer addresses the more general, complex case, where the tables may be drawn in arbitrary ways and the cells then have to be worked out by non-trivial, poorly formalized means.
	I am considering a simpler case: the source file with standard tables is created in, for example, MS Word or OpenOffice, and is then either saved as a PDF file or printed to a virtual PDF printer, producing a PDF file in which the structure of the original tables is preserved.
	If you send such a PDF file to any of the many free PDF2Word web services and get a ".docx" document back, the whole structure of the source tables is preserved in the resulting file, and the tables can easily be extracted from it.
	I am attaching an archive, Test.zip, containing: an example MS Word test file with two tables created with the standard "Insert table" method; the PDF file obtained from it via a virtual PDF printer; the result of the pdf2word conversion; and two <TAB>-separated text tables extracted from that last file (by my own program that extracts tables from MS Word documents).
	In my case, I would like to eliminate the extra PDF2Word conversion step and extract the tables directly from the PDF, since, as we can see, this information is stored in the PDF file and is successfully restored during conversion.
	Note: if you drag these two text files into, for example, an open MS Excel window, you will see the two tables extracted from the PDF file, including, by the way, the empty cells, a problem you discussed in one of your posts here on the forum.
	Edited by meligo - 08 May 20 at 5:04AM

Ingo Posted: 06 May 20 at 10:26PM
	Hi Meligo, it seems you don't know that this is a user forum and that all help here is given in people's personal free time? Perhaps somebody can give you more detailed advice...
	Cheers, Ingo

meligo Posted: 08 May 20 at 2:52AM
	Dear Ingo! You were wrong to see irony in my previous post! I sincerely thanked you for your prompt reply!
	On the substance of the issue under discussion: there may be a false impression that the table extraction method discussed above applies only to tables created in MS Word or OpenOffice documents and sent to a virtual PDF printer. However, if you create a PDF file with a table directly, using only CreateTable() in DebenuPDFLibrary, as for example in http://www.quickpdf.org/forum/create-table-exactly-like-this-sample_topic1907.html, the table is just as easy to extract with the same technique:
	1. Send the file created by this demo (table.pdf) to a web service, for example https://www.pdf2go.com/pdf-to-word, and get an MS Word document back.
	2. Open the downloaded file in MS Word and select the table by clicking the icon at the upper left corner of the table.
	3. The selected table can now easily be copied out (for example, into an open MS Excel window) with a simple Ctrl-C / Ctrl-V. Profit!
	Clearly, DebenuPDFLibrary lacks the inverse of CreateTable(): an ExtractTable() or something like it!
	Edited by meligo - 09 May 20 at 6:57PM

meligo Posted: 08 May 20 at 4:52AM
	In addition to my previous post: the directly created test file table.pdf does not store any data about the table as an independent object (unlike .docx, where inside document.xml each table really is a separate object with the <w:tbl> tag). This is easy to see by looking at the contents of both table.pdf and document.xml in any text viewer. So the question arose: what happens during extraction when the PDF file contains several tables and they partially overlap?
	To clarify this, let's run the following experiment. Make a small change to the program: after the original line
	QP.DrawTableRows(TableID, 50, 50, 400, 1, 0);
	insert the lines
	QP.DrawTableRows(TableID, 70, 210, 400, 1, 0); // 2nd table, shifted <right / down>
	QP.DrawTableRows(TableID, 30, 300, 400, 1, 0); // 3rd table, shifted <left / up>, partially overlaps the 2nd table
	If you comment out the last line, we have two tables, and both are successfully recognized and extracted from the PDF file, even though the 2nd table is shifted horizontally and vertically relative to the first. However, if we uncomment that line, the third table partially overlaps the second, and after recognition and extraction we get a surprising picture: the first table is successfully recognized and can be extracted, but the second and third tables are glued together into one complex composite table!
	This fully confirms the assumption we made when examining the contents of the PDF file: it does not store any data about the table as an independent object, and the pdf2word algorithm actually performs non-trivial calculations on the page canvas to recognize the table.
	PS: Excuse my Google Translate English.
	Edited by meligo - 08 May 20 at 10:02PM
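One plausible reason for the "glued" result is that a layout-based recognizer groups drawn regions by geometric overlap. The Python sketch below illustrates that idea only; it is not the algorithm of any particular pdf2word service. The box sizes are invented (the thread gives only the left/top positions 50/50, 70/210 and 30/300), and `boxes_overlap` and `group_overlapping` are hypothetical helper names.

```python
def boxes_overlap(a, b):
    """True if two (left, top, right, bottom) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def group_overlapping(boxes):
    """Return groups of box indices; boxes connected by overlap share a group."""
    groups = []
    for i, box in enumerate(boxes):
        # Find every existing group that this box touches.
        touching = [g for g in groups if any(boxes_overlap(box, boxes[j]) for j in g)]
        merged = [i]
        for g in touching:
            merged.extend(g)
            groups.remove(g)
        groups.append(sorted(merged))
    return groups


# Left/top values follow the three DrawTableRows calls; width 300 and
# height 120 are assumptions made only for this illustration.
tables = [
    (50, 50, 350, 170),    # 1st table
    (70, 210, 370, 330),   # 2nd table
    (30, 300, 330, 420),   # 3rd table, overlaps the 2nd
]
groups = group_overlapping(tables)
print(groups)
```

With these assumed sizes the first table stays on its own while the second and third fall into a single group, which matches the behaviour described in the experiment.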

mLipok (Senior Member, Joined: 23 Apr 14, Location: Poland, Zabrze, Points: 453) Posted: 18 May 20 at 4:03AM
	Using this kind of code:
	; Dump every object in the document to the clipboard and a message box
	For $iObj_idx = 1 To $iObjectCount
	    ClipPut(BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
	    MsgBox(0, 'QP Obj: ' & $iObj_idx & ' = ' & $oQP.GetObjectDecodeError($iObj_idx), BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
	Next
	I get:
	<< /Filter /FlateDecode /Length 269 >> stream
	[compressed binary data, unreadable as text]
	Question: How can I decode the stream using QuickPDF Library?
	Here you can find description how to test my examples: http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600

mLipok Posted: 19 May 20 at 2:05AM
	Solved with zlib (the stream is Deflate-compressed). And now I see that in this test.pdf the Tj operators contain glyph codes, not simple text.
	Here you can find description how to test my examples: http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600
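The zlib step mentioned above can be sketched in Python with the standard `zlib` module, which handles the zlib-wrapped Deflate data used by /FlateDecode streams. The sketch assumes the raw bytes between `stream` and `endstream` have already been isolated; `inflate_stream` and `find_tj_operands` are hypothetical helper names, and the sample content stream is invented. Only `Tj` is matched here, not `TJ` arrays.

```python
import re
import zlib

# Literal string operand: (text) Tj, allowing backslash escapes inside.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")
# Hex string operand: <00480065> Tj, typically glyph or character codes.
TJ_HEX = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def inflate_stream(raw):
    """Decompress the raw bytes of a /FlateDecode stream."""
    return zlib.decompress(raw)


def find_tj_operands(content):
    """Return (literal_strings, hex_strings) shown by Tj operators."""
    return TJ_LITERAL.findall(content), TJ_HEX.findall(content)


# Invented content stream, compressed here to stand in for data read from a PDF.
original = b"BT /F1 12 Tf 50 700 Td (Hello) Tj <00480065> Tj ET"
raw = zlib.compress(original)

content = inflate_stream(raw)
literals, hexes = find_tj_operands(content)
print(literals, hexes)
```

Hex operands like the second one are codes in the font's own encoding; turning them into readable text generally requires the font's encoding information (for example a /ToUnicode map), which is why the decoded stream may not show plain text.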