Debenu Quick PDF Library - PDF SDK Community Forum : Extract table(s) from pdf-file (Delphi)

Extract table(s) from pdf-file (Delphi) : solved with zLib deflate.and now...

Tue, 19 May 2020 02:05:40 +0000

Author: mLipok
Subject: 3810
Posted: 19 May 20 at 2:05AM

Solved with zlib (inflating the deflate-compressed stream).

And now I see that in this test.pdf the Tj operators carry glyph codes rather than simple text.
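
For anyone reading later: a /FlateDecode stream is ordinary zlib data, so any zlib binding can inflate it once the bytes between "stream" and "endstream" are isolated. Below is a minimal Python sketch of that step. It assumes the raw object bytes are already available; the sample object is built inside the script, and `inflate_pdf_stream` is a made-up helper name, not a Quick PDF Library function.

```python
import re
import zlib


def inflate_pdf_stream(raw_object: bytes) -> bytes:
    """Return the decompressed body of a /FlateDecode stream object."""
    # The body starts after the "stream" keyword plus one end-of-line
    # and runs up to the "endstream" keyword.
    match = re.search(rb"stream\r?\n(.*?)endstream", raw_object, re.DOTALL)
    if match is None:
        raise ValueError("no stream found in object")
    # decompressobj stops at the end of the zlib data, so the end-of-line
    # bytes that precede "endstream" land in unused_data instead of
    # raising an error.
    return zlib.decompressobj().decompress(match.group(1))


# Build a small sample object so the sketch is self-contained.
content = b"BT /F1 12 Tf 50 700 Td (Hello) Tj ET"
packed = zlib.compress(content)
sample = (
    b"<< /Filter /FlateDecode /Length "
    + str(len(packed)).encode("ascii")
    + b" >>\nstream\n"
    + packed
    + b"\nendstream"
)

print(inflate_pdf_stream(sample).decode("latin-1"))
```

This only undoes the compression. If the strings shown by Tj are glyph codes, the inflated content stream will still not contain readable text.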

Mon, 18 May 2020 04:03:28 +0000

Author: mLipok
Subject: 3810
Posted: 18 May 20 at 4:03AM

Using this kind of code:

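For $iObj_idx = 1 To $iObjectCount
    ; Copy the raw object data to the clipboard
    ClipPut(BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
    ; Show the object index, its decode error status, and the raw data
    MsgBox(0, 'QP Obj: ' & $iObj_idx & ' = ' & $oQP.GetObjectDecodeError($iObj_idx), BinaryToString($oQP.GetObjectToVariant($iObj_idx)))
Next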

I get:

/Filter /FlateDecode

/Length 269

stream

xś]‘ÁnĂ †ďy

Ţ & ¤•*_şK›¦m/@ÁT9” šööĂxŮa‡é[č·űóĺĺ’–Mőďeőź´©¸¤Pč±>‹'uĄŰ’:=¨°ří×Úéď.wýůŐĺŻďLŞPswę?ôÜn´ôř5Đ#;OĹĄu'

Question:

How can I decode this stream using Quick PDF Library?

Fri, 08 May 2020 04:52:51 +0000

Author: meligo
Subject: 3810
Posted: 08 May 20 at 4:52AM

In addition to my previous post:

Since the test file table.pdf, which was created directly, does not store the table as an independent object (unlike .docx, where document.xml holds each table as a separate element with its own tag), as is easy to see by opening table.pdf and document.xml in any text viewer, a question arose: what happens during extraction when the PDF file contains several tables and they partially overlap?

To find out, we run the following experiment with a small change to the program. After the original line:

QP.DrawTableRows (tableID, 50, 50, 400, 1, 0);

insert the lines:

QP.DrawTableRows (tableID, 70, 210, 400, 1, 0); // 2nd table, shifted

QP.DrawTableRows (tableID, 30, 300, 400, 1, 0); // 3rd table, shifted; partially overlaps the 2nd table

If you comment out the last line, we have two tables, and both are successfully recognized and extracted from the PDF file, even though the second table is shifted horizontally and vertically relative to the first.

However, if we restore that last line, the third table partially overlaps the second, and after recognition and extraction we get a surprising picture: the first table is recognized and can be extracted, but the second and third tables are glued into one complex composite table!

This fully confirms the assumption we made when examining the contents of the PDF file: it does not store the table as an independent object, and the pdf2word algorithm has to perform non-trivial calculations on the page geometry to recognize a table.
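
The merged result is what one would expect from a converter that groups drawn cells purely by geometry. Here is a minimal Python sketch of that idea, assuming tables are found by clustering cell rectangles that touch or overlap. This is a guess at the general approach, not the actual algorithm of pdf2go or any other converter. The cell coordinates are invented for illustration, and `overlaps` and `group_cells` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (left, bottom, right, top)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_cells(cells: List[Rect]) -> List[List[int]]:
    """Cluster cell indices into connected groups of touching rectangles."""
    parent = list(range(len(cells)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of cells that touch or overlap.
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if overlaps(cells[i], cells[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(cells)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented coordinates: each "table" is one row of two adjacent cells.
table1 = [(50, 50, 150, 70), (150, 50, 250, 70)]
table2 = [(70, 210, 170, 230), (170, 210, 270, 230)]
table3 = [(30, 220, 130, 240), (130, 220, 230, 240)]  # overlaps table2

print(group_cells(table1 + table2))           # two separate groups
print(group_cells(table1 + table2 + table3))  # still two: tables 2 and 3 merge
```

With only the first two tables the sketch finds two groups. Adding the third still gives two groups, because the cells of the second and third tables form one connected cluster.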

PS: Please excuse my Google Translate English.

Edited by meligo - 08 May 20 at 10:02PM

Fri, 08 May 2020 02:52:56 +0000

Author: meligo
Subject: 3810
Posted: 08 May 20 at 2:52AM

Dear Ingo!

You were wrong to see irony in my previous post!

I sincerely thanked you for your prompt reply!

Regarding the substance of the issue under discussion:

One might get the false impression that the table extraction method discussed above applies only to tables created in MS Word or OpenOffice documents and sent to a virtual PDF printer.

However, if you create a PDF file with a table directly, using only CreateTable() in DebenuPDFLibrary (as, for example, in http://www.quickpdf.org/forum/create-table-exactly-like-this-sample_topic1907.html), it is just as easy to extract this table using the same technique:

1. Send the created file (table.pdf) from this demo to a web service, for example, https://www.pdf2go.com/pdf-to-word and get the MS Word document back.

2. Open the downloaded file in MS Word and select the table by clicking the icon in the upper left corner of the table.

3. Now the selected table can be easily extracted (for example, into open MS Excel) using simple Ctrl-C / Ctrl-V - Profit!

Evidently, DebenuPDFLibrary lacks the inverse of CreateTable(): an ExtractTable() or something similar!

Edited by meligo - 09 May 20 at 6:57PM

Wed, 06 May 2020 22:26:08 +0000

Author: Ingo
Subject: 3810
Posted: 06 May 20 at 10:26PM

Hi Meligo,

it seems you don't know that this is a user forum and that all help here is given in people's personal free time.
Perhaps somebody can give you more detailed advice...

Wed, 06 May 2020 01:33:01 +0000

Author: meligo
Subject: 3810
Posted: 06 May 20 at 1:33AM

Hi Ingo!

Thanks for such a quick reply!

Your answer covers the more general, complex case where the tables can be drawn arbitrarily. In that case the cells of these tables do indeed have to be computed in a non-trivial, hard-to-formalize way.

I am considering a simpler case: the source file with standard tables is created, for example, in MS Word or OpenOffice, and is then either saved as a PDF file or printed to a virtual PDF printer, producing a PDF file in which the structure of the original tables is preserved.

Now, if you send such a PDF file to any of the many free PDF-to-Word web services and get a “.docx” document back, the whole structure of the source tables is preserved in the resulting file, and the tables can easily be extracted from it.

I am attaching an archive, Test.zip, containing an example MS Word test file with two tables created using the standard “Insert table” method, the PDF file obtained from it via a virtual PDF printer, the result of the pdf2word conversion, and two delimiter-separated text files with the tables extracted from that last file (by my own program for extracting tables from MS Word documents).

In my case, I would like to eliminate the extra PDF2Word conversion step and extract these tables directly from the PDF, since this information, as we can see, is stored in the PDF file and is successfully restored during conversion.

Note: if you drag these two text files into, for example, an open MS Excel window, you will see the two tables extracted from the PDF file, including, by the way, empty cells, a problem you discussed in one of your posts on this forum.

Edited by meligo - 08 May 20 at 5:04AM

Mon, 04 May 2020 19:04:07 +0000

Author: Ingo
Subject: 3810
Posted: 04 May 20 at 7:04PM

Hi Meligo,

sorry, but as you have already found, there are no published samples for extracting table content.
You should take a closer look at the text extraction functions. The CSV option should be relevant for you, since it gives you position data from which rows and columns can be determined.

As AndrewC (R.I.P.) said in the past:
"There is going to be no easy way to read this with Debenu Quick PDF Library without some complex strategies."

His advice for the first steps was:
"Try this

QP.SetTextExtractionScaling(0, 2, 8);
QP.SetTextExtractionWordGap(0.2);

And then call GetPageText(7); You can then split the text into columns and trim leading and trailing spaces.".
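
The last step of that advice, splitting the extracted text into columns, might look like the following Python sketch. It assumes the extraction settings leave at least two spaces between columns and only single spaces inside a cell. The sample text is invented, and `split_columns` is a hypothetical helper, not part of the library.

```python
import re
from typing import List


def split_columns(page_text: str) -> List[List[str]]:
    """Split each non-empty line into trimmed cells on runs of 2+ whitespace."""
    rows = []
    for line in page_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between table rows
        cells = [cell.strip() for cell in re.split(r"\s{2,}", stripped)]
        rows.append(cells)
    return rows


# Invented sample of what extracted page text could look like.
sample_text = (
    "Item      Qty   Price\n"
    "Red pen   10    1.50\n"
    "\n"
    "Notebook  2     3.20\n"
)

for row in split_columns(sample_text):
    print(row)
```

Note that an empty cell simply widens the gap between its neighbours, so this approach silently drops it. Recovering empty cells needs the word positions as well.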

Good luck!

Cheers and welcome here,
Ingo

Mon, 04 May 2020 17:17:11 +0000

Author: meligo
Subject: 3810
Posted: 04 May 20 at 5:17PM

Clarification:

Each extracted table should be a text block in the form of a list of lines, where each line is one row of the table and the fields within a row are separated by a delimiter, for example, .
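
A small Python sketch of that target format, assuming the rows have already been recovered as lists of strings. The tab used as the separator is an assumption made here for illustration, and `table_to_text` is a hypothetical name.

```python
import csv
import io
from typing import List


def table_to_text(rows: List[List[str]], delimiter: str = "\t") -> str:
    """Serialize table rows as one line per row, fields joined by delimiter."""
    buffer = io.StringIO()
    # csv.writer quotes any field that itself contains the delimiter.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample table; the empty string stands for an empty cell.
table = [["Name", "Qty"], ["Red pen", ""], ["Notebook", "2"]]
print(table_to_text(table))
```

Empty cells survive as empty fields between two separators, so the column count stays the same in every row.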

Mon, 04 May 2020 17:02:26 +0000

Author: meligo
Subject: 3810
Posted: 04 May 20 at 5:02PM

Hi!

How can I extract a table (or tables, if there are several) from a PDF file?

I looked at the functions on your site and also searched for examples of extracting tables, but did not find anything like that.

An example in Delphi would be preferable, if possible, but C# would be fine too.

Thanks in advance!