I need help - I can help - Extract table(s) from pdf-file (Delphi)

Extract table(s) from pdf-file (Delphi)

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3810
Printed Date: 31 Jul 26 at 3:10AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: Extract table(s) from pdf-file (Delphi)

Posted By: meligo
Subject: Extract table(s) from pdf-file (Delphi)
Date Posted: 04 May 20 at 5:02PM

Hi!

How can I extract a table (or tables, if there are several) from a PDF file?

I looked through the functions on your site and also searched for examples of extracting tables,

but did not find anything suitable.

An example in Delphi would be ideal, if possible,

but one in C# would be fine too.

Thanks in advance!

Replies:

Posted By: meligo
Date Posted: 04 May 20 at 5:17PM

Clarification:

Each extracted table should be a text block consisting of a list of lines, one line per table row, with the fields in each line separated by a delimiter, for example <TAB>.

Posted By: Ingo
Date Posted: 04 May 20 at 7:04PM

Hi Meligo,

Sorry, but as you have already found, there are no published samples for extracting table content.
You should take a closer look at the text extraction functions. The CSV option should be relevant for you because it returns position data from which rows and columns can be worked out.
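
A minimal sketch of how position data could be turned into rows and columns. It is written in Python rather than Delphi so that it stays independent of the library. It assumes that each text fragment has already been parsed from the CSV output into a hypothetical (x, y, text) tuple; the real field layout of the CSV output has to be taken from the library reference. The tolerances and the bottom-left origin are assumptions as well.

```python
from typing import List, Tuple

# Hypothetical fragment shape (x, y, text); NOT the library's CSV field layout.
Fragment = Tuple[float, float, str]


def group_rows(fragments: List[Fragment], y_tol: float = 3.0) -> List[List[Fragment]]:
    """Group fragments whose y lies within y_tol of the first fragment of a row."""
    rows: List[List[Fragment]] = []
    # Assumes a bottom-left origin: larger y means higher on the page.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Order the fragments of each row from left to right.
    return [sorted(row, key=lambda f: f[0]) for row in rows]


def column_starts(rows: List[List[Fragment]], x_tol: float = 5.0) -> List[float]:
    """Collect distinct start positions; each one is treated as a column."""
    starts: List[float] = []
    for x in sorted(f[0] for row in rows for f in row):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_tsv(fragments: List[Fragment], y_tol: float = 3.0, x_tol: float = 5.0) -> str:
    """Return the fragments as lines of TAB-separated cells."""
    rows = group_rows(fragments, y_tol)
    starts = column_starts(rows, x_tol)
    lines = []
    for row in rows:
        cells = [""] * len(starts)
        for x, _, text in row:
            # Put the fragment into the column whose start is nearest.
            idx = min(range(len(starts)), key=lambda i: abs(starts[i] - x))
            cells[idx] = (cells[idx] + " " + text).strip()
        lines.append("\t".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [(50, 700, "Name"), (200, 700, "Qty"), (50.5, 680.4, "Apple"), (201, 681, "3")]
    print(to_tsv(demo))
```

Left-aligned cells are assumed here; right-aligned or centred columns would need the cell's right edge or midpoint instead of its start position.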

As AndrewC (R.I.P.) said in the past:
"There is going to be no easy way to read this with Debenu Quick PDF Library without some complex strategies."

His advice for the first steps was:
"Try this

QP.SetTextExtractionScaling(0, 2, 8);
QP.SetTextExtractionWordGap(0.2);

And then call GetPageText(7). You can then split the text into columns and trim leading and trailing spaces."
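
The "split the text into columns" step could look roughly like the sketch below. It is in Python, not Delphi, and works on a plain string that is assumed to have been returned by the GetPageText call already. It also assumes that neighbouring columns end up separated by at least two spaces, which depends on the scaling and word gap settings and on the document itself. The function name and the `min_gap` parameter are made up for illustration.

```python
import re


def lines_to_tsv(page_text: str, min_gap: int = 2) -> str:
    """Turn space-aligned text lines into lines of TAB-separated cells."""
    # A run of min_gap or more spaces is treated as a column boundary (assumption).
    gap = re.compile(r" {%d,}" % min_gap)
    out = []
    for line in page_text.splitlines():
        line = line.strip()  # trim leading and trailing spaces
        if not line:
            continue  # skip empty lines
        cells = [cell.strip() for cell in gap.split(line)]
        out.append("\t".join(cells))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "  Item          Qty   Price \n  Green apple   3     1.20\n"
    print(lines_to_tsv(sample))
```

Single spaces inside a cell ("Green apple") survive because only longer runs count as separators; empty cells are not detected by this approach and would need the position data instead.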

Good luck!

Cheers and welcome here,
Ingo

-------------
Cheers,
Ingo

Posted By: meligo
Date Posted: 06 May 20 at 1:33AM

Hi Ingo!

Thanks for such a quick reply!

Your answer covers the more general, complex case in which the tables may be drawn arbitrarily. In that case the cells do indeed have to be worked out in a non-trivial, poorly formalized way.

I am considering a simpler case: the source document with standard tables is created in, for example, MS Word or OpenOffice, and is then either saved as a PDF file or printed to a virtual PDF printer, producing a PDF file in which the structure of the original tables is preserved.

Now, if you send such a PDF file to any of the many free PDF-to-Word web services and get a ".docx" document back, the whole structure of the source tables is preserved in the resulting file, and the tables can easily be extracted from it.

I am attaching an archive: https://yadi.sk/d/nHjE9l9Xkh8HNg

Posted By: Ingo
Date Posted: 06 May 20 at 10:26PM

Hi Meligo,

It seems you don't know that this is a user forum and that all help here is given in people's personal free time.
Perhaps somebody can give you more detailed advice...

-------------
Cheers,
Ingo

Posted By: meligo
Date Posted: 08 May 20 at 2:52AM

Dear Ingo!

There was no irony in my previous post!

I sincerely thanked you for your prompt reply!

Regarding the substance of the issue under discussion:

You may have got the false impression that the table extraction method discussed above applies only to tables created in MS Word or OpenOffice documents and sent to a virtual PDF printer.

However, if you create a PDF file with a table directly, using only CreateTable() in DebenuPDFLibrary, as for example in http://www.quickpdf.org/forum/create-table-exactly-like-this-sample_topic1907.html , the table is just as easy to extract with the same technique:

1. Send the file created by this demo (table.pdf) to a web service, for example https://www.pdf2go.com/pdf-to-word , and get an MS Word document back.

2. Open the downloaded file in MS Word and select the table by clicking the icon at the upper left corner of the table.

3. The selected table can now be copied out (for example, into an open MS Excel sheet) with a simple Ctrl-C / Ctrl-V. Profit!

Clearly, DebenuPDFLibrary lacks the inverse of CreateTable(): an ExtractTable() or something like it!

Posted By: meligo
Date Posted: 08 May 20 at 4:52AM

In addition to my previous post:

The directly created test file table.pdf does not store the table as an independent object (unlike .docx, where in document.xml each table really is a separate element with the <w:tbl> tag). This is easy to see by looking at the contents of both files (table.pdf and document.xml) in any text viewer. So the question arose: what happens during extraction when the PDF file contains several tables and they partially overlap?

To clarify this, let us run the following experiment. Make a small change to the program: after the original line

QP.DrawTableRows (tableID, 50, 50, 400, 1, 0);

insert the lines:

QP.DrawTableRows (TableID, 70, 210, 400, 1, 0); // 2-nd table shift <right / down>

QP.DrawTableRows (TableID, 30, 300, 400, 1, 0); // 3-rd table shift <left / up> - partially overlaps 2-nd table

If you comment out the last line, we have 2 tables, and both are successfully recognized and extracted from the PDF file, even though the 2nd table is shifted horizontally and vertically relative to the first.

However, if we uncomment that line, the third table partially overlaps the second, and after recognition and extraction we get a surprising picture: the first table is recognized and can be extracted, but the second and third tables are glued together into one complex composite table!

This fully confirms the conclusion drawn from examining the contents of the PDF file: it does not store the table as an independent object, and the pdf2word algorithm really does perform non-trivial calculations on the drawn content in order to recognize tables.
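
One plausible explanation of the merge, as a rough Python sketch. How the web service actually recognizes tables is not known; the sketch merely assumes a recognizer that groups drawn content by overlapping bounding boxes. Only the x/y offsets are taken from the DrawTableRows calls above. The table size (200 x 100) and the top-left origin are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom); y is assumed to grow downward (top-left origin).
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cluster(boxes: List[Box]) -> List[List[int]]:
    """Group box indices so that overlapping boxes end up in the same cluster."""
    clusters: List[List[int]] = []
    for i, box in enumerate(boxes):
        # Find every existing cluster that the new box touches.
        touching = [c for c in clusters if any(overlaps(box, boxes[j]) for j in c)]
        merged = [i]
        for c in touching:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


def table_box(x: float, y: float, width: float = 200.0, height: float = 100.0) -> Box:
    """Hypothetical bounding box of a table drawn at (x, y)."""
    return (x, y, x + width, y + height)


if __name__ == "__main__":
    tables = [table_box(50, 50), table_box(70, 210), table_box(30, 300)]
    print(cluster(tables[:2]))  # two tables drawn
    print(cluster(tables))      # three tables drawn
```

Under these assumptions the first two boxes stay separate, while the third overlaps the second and the two end up in one cluster, which matches the behaviour observed in the experiment.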

P.S. Please excuse my Google Translate English.

Posted By: mLipok
Date Posted: 18 May 20 at 4:03AM

Using this kind of code:

For $iObj_idx = 1 To $iObjectCount
    ; Copy the raw object to the clipboard
    ClipPut(BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
    ; Show the object index, its decode error code and the raw object
    MsgBox(0, 'QP Obj: ' & $iObj_idx & ' = ' & $oQP.GetObjectDecodeError($iObj_idx), BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
Next

I get:

/Filter /FlateDecode

/Length 269

stream

xś]‘ÁnĂ †ďy

Question:

How can I decode the stream using Quick PDF Library?

-------------
Here you can find description how to test my examples:
http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600

Posted By: mLipok
Date Posted: 19 May 20 at 2:05AM

Solved with zlib: the stream is Deflate-compressed (/FlateDecode), so it has to be inflated.

And now I see that in this test.pdf the Tj operators carry glyph codes rather than plain text.
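
For reference, a minimal Python sketch of inflating such a stream with zlib, outside the library. It assumes the raw object bytes are already available (for example, from a dump like the one above) and that /FlateDecode is the only filter. The helper names are made up, and the stream boundaries are found by a simplified keyword search instead of honouring /Length.

```python
import zlib


def extract_stream_bytes(obj: bytes) -> bytes:
    """Cut the payload between 'stream' and 'endstream' out of a raw PDF object."""
    start = obj.index(b"stream") + len(b"stream")
    # The 'stream' keyword is followed by CRLF or LF before the data begins.
    if obj[start:start + 2] == b"\r\n":
        start += 2
    elif obj[start:start + 1] == b"\n":
        start += 1
    end = obj.rindex(b"endstream")
    return obj[start:end]


def inflate_stream(raw: bytes) -> bytes:
    """Inflate a /FlateDecode payload; bytes after the zlib data are ignored."""
    decoder = zlib.decompressobj()
    return decoder.decompress(raw) + decoder.flush()


if __name__ == "__main__":
    # Build a small self-made object to demonstrate the round trip.
    content = b"BT /F1 12 Tf (Hello) Tj ET"
    packed = zlib.compress(content)
    obj = b"<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(packed)
    obj += packed + b"\nendstream"
    print(inflate_stream(extract_stream_bytes(obj)))
```

The inflated bytes are the page's content stream, in which text-showing operators such as Tj appear.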
