Extract table(s) from pdf-file (Delphi)

Posted by meligo - 04 May 20 at 5:02PM
Hi!

How can I extract a table (or several tables, if there is more than one) from a PDF file?
I looked through the functions on your site and searched for examples of table extraction,
but did not find anything suitable.

An example in Delphi would be ideal, if possible,
but C# would be fine too.

Thanks in advance!

Posted by meligo - 04 May 20 at 5:17PM
Clarification:
Each extracted table should be returned as a block of text: a list of lines in which each line is one row of the table and the fields within a row are separated by a delimiter, for example <TAB>.
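
A minimal sketch of the requested output format, written in Python rather than Delphi and not using the library at all. The function name `table_to_text` and the sample data are made up for illustration; the point is only that an empty cell still produces its separator, so the column count stays the same in every row.

```python
def table_to_text(rows, sep="\t"):
    """Serialize a table (a list of rows, each a list of cell strings) to text.

    Each row becomes one line and cells are joined with `sep`.
    Empty cells are kept as empty fields, so columns stay aligned.
    """
    return "\n".join(sep.join(cell.strip() for cell in row) for row in rows)


if __name__ == "__main__":
    # Hypothetical table with one empty cell in the second row.
    table = [["Name", "Qty", "Price"], ["Bolt", "", "0.25"]]
    print(table_to_text(table))
```

Text in this form can be pasted or dragged into a spreadsheet, which splits it back into cells on the separator.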

Posted by Ingo - 04 May 20 at 7:04PM
Hi Meligo,

Sorry, but as you have already found, there are no published samples for extracting table content.
Take a closer look at the text extraction functions. The CSV option should be the relevant one for you, because it gives you position data from which rows and columns can be worked out.
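
One possible way to turn position data into rows and columns, sketched in Python without calling the library. Everything here is an assumption for illustration: each input item is taken to be one cell's text with its left edge `x` and baseline `y` (as might be parsed from a position-aware extraction), the origin is assumed to be bottom-left, and the tolerances `y_tol` and `x_tol` are arbitrary.

```python
def group_pieces(pieces, y_tol=2.0, x_tol=5.0):
    """Arrange positioned text pieces into a grid of cell strings.

    pieces: iterable of (x, y, text), one entry per cell.
    Returns a list of rows; missing cells are returned as "".
    """
    pieces = list(pieces)
    # Column anchors: distinct left edges, merged when closer than x_tol.
    anchors = []
    for x in sorted(p[0] for p in pieces):
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)
    # Rows: walk from the top of the page down (larger y first) and start
    # a new row whenever the baseline moves by more than y_tol.
    rows = []  # list of (row_y, cells)
    for x, y, text in sorted(pieces, key=lambda p: (-p[1], p[0])):
        if not rows or abs(rows[-1][0] - y) > y_tol:
            rows.append((y, [""] * len(anchors)))
        # Put the piece into the column whose anchor is nearest.
        col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


if __name__ == "__main__":
    sample = [(50, 700, "Name"), (150, 700, "Qty"), (250, 700, "Price"),
              (50, 680, "Bolt"), (250, 680.5, "0.25")]
    for row in group_pieces(sample):
        print("\t".join(row))
```

Because columns are inferred from all rows together, a row with a missing piece still gets an empty field in that position. Tables with merged cells or right-aligned numbers would need a smarter anchor rule.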

As AndrewC (R.I.P.) said in the past:
"There is going to be no easy way to read this with Debenu Quick PDF Library without some complex strategies.".

His advice for the first steps was:
"Try this

  QP.SetTextExtractionScaling(0, 2, 8);
  QP.SetTextExtractionWordGap(0.2);

And then call GetPageText(7);  You can then split the text into columns and trim leading and trailing spaces.".
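
The splitting step AndrewC describes could look roughly like this. It is a sketch in Python, not Delphi, and it does not call the library: `page_text` stands in for whatever string GetPageText returns, and the assumption that columns are separated by at least two spaces (`min_gap`) is mine, so the threshold would need tuning per document.

```python
import re


def split_columns(page_text, min_gap=2):
    """Split each non-empty line into trimmed cells.

    A run of at least `min_gap` spaces is treated as a column boundary;
    single spaces are kept as part of the cell text.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if line:
            rows.append([cell.strip() for cell in gap.split(line)])
    return rows


if __name__ == "__main__":
    # Hypothetical extracted text, not real library output.
    text = "Name         Qty   Price\n  Steel bolt   10    0.25  \n"
    for row in split_columns(text):
        print("\t".join(row))
```

Note that an empty cell leaves no trace in text like this, so this approach cannot tell a missing value from a wide gap.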

Good luck!


Cheers and welcome here,
Ingo

Posted by meligo - 06 May 20 at 1:33AM
Hi Ingo!
Thanks for such a quick reply!

Your answer covers the general, more complex case in which the tables can be drawn in the PDF in an arbitrary way. In that case the table cells really do have to be worked out in some non-trivial, hard-to-formalize way.

I am considering a simpler case: the source document with standard tables is created in, for example, MS Word or OpenOffice and is then either saved as PDF or printed to a virtual PDF printer. Either way the result is a PDF file in which the structure of the original tables is preserved.

If you send such a PDF file to any of the many free PDF2Word web services and get a “.docx” document back, the whole structure of the source tables is preserved in the resulting file, and the tables can easily be extracted from it.

I am attaching an archive, Test.zip. It contains a sample MS Word file with two tables created with the standard “Insert table” command, the PDF file produced from it via a virtual PDF printer, the result of the pdf2word conversion, and two <TAB>-separated text files with the tables extracted from that last file (using my own program for extracting tables from MS Word documents).

In my case I would like to skip the intermediate PDF2Word step and extract these tables directly from the PDF, since this information is evidently stored in the PDF file and is successfully restored during conversion.

Note: if you drag these two text files into, for example, an open MS Excel window, you will see the two tables extracted from the PDF file, including the empty cells, a problem you discussed here on the forum in one of your posts.


Edited by meligo - 08 May 20 at 5:04AM

Posted by Ingo - 06 May 20 at 10:26PM
Hi Meligo,

It seems you are not aware that this is a user forum and that all help here is given in people's personal free time.
Perhaps somebody else can give you more detailed advice...

Cheers,
Ingo

Posted by meligo - 08 May 20 at 2:52AM
Dear Ingo!
There was no irony intended in my previous post!
I sincerely thanked you for your prompt reply!

Regarding the substance of the issue under discussion:

The above may have given the false impression that this method of extracting tables applies only to tables created in MS Word or OpenOffice documents and sent to a virtual PDF printer.

However, if you create a PDF file with a table directly, using only CreateTable() in DebenuPDFLibrary, as for example in http://www.quickpdf.org/forum/create-table-exactly-like-this-sample_topic1907.html, the table is just as easy to extract using the same technique:

1. Send the file created by that demo (table.pdf) to a web service, for example https://www.pdf2go.com/pdf-to-word, and get an MS Word document back.

2. Open the downloaded file in MS Word and select the table by clicking the icon at its upper left corner.

3. The selected table can now easily be copied out (for example, into an open MS Excel window) with a simple Ctrl-C / Ctrl-V. Profit!

Clearly, DebenuPDFLibrary lacks the inverse of CreateTable(): an ExtractTable() or something similar!


Edited by meligo - 09 May 20 at 6:57PM

Posted by meligo - 08 May 20 at 4:52AM
In addition to my previous post:

The directly created test file table.pdf does not store any data about the table as an independent object (unlike .docx, where in document.xml each table really is a separate element with the <w:tbl> tag). This is easy to see by looking at the contents of both table.pdf and document.xml in any text viewer. So the question arose: what happens during extraction when the PDF file contains several tables and they partially overlap?

To find out, let us run the following experiment. Make a small change to the program: after the original line
    QP.DrawTableRows (tableID, 50, 50, 400, 1, 0);

insert the lines:
    QP.DrawTableRows (TableID, 70, 210, 400, 1, 0); // 2-nd table shift <right / down>
    QP.DrawTableRows (TableID, 30, 300, 400, 1, 0); // 3-rd table shift <left / up> - partially overlaps 2-nd table

If you comment out the last of these lines, there are 2 tables, and both are successfully recognized and extracted from the PDF file, even though the 2nd table is shifted horizontally and vertically relative to the first.

However, if we uncomment that third DrawTableRows call, the third table partially overlaps the second, and after recognition and extraction we get a surprising picture: the first table is recognized and can be extracted, but the second and third tables are glued together into one complex composite table!

This fully confirms the assumption we made when examining the contents of the PDF file: it does not store any data about the table as an independent object, and the pdf2word algorithm actually performs non-trivial calculations on the page canvas to recognize a table.
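
The gluing effect is what one would expect from any recognizer that works on page geometry alone. The Python sketch below is not the pdf2word algorithm, which is unknown here; it only illustrates the idea by merging overlapping bounding boxes into regions. The rectangle sizes are invented for the example and are not taken from the demo program.

```python
def overlaps(a, b):
    """True if two rectangles (left, bottom, right, top) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_regions(boxes):
    """Union overlapping rectangles until no two remaining ones overlap."""
    regions = [tuple(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if overlaps(regions[i], regions[j]):
                    a, b = regions[i], regions[j]
                    # Replace the pair with their common bounding box.
                    regions[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                  max(a[2], b[2]), max(a[3], b[3]))
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions


if __name__ == "__main__":
    # Three hypothetical table outlines: the 2nd and 3rd overlap.
    tables = [(50, 0, 250, 50), (70, 110, 270, 210), (30, 200, 230, 300)]
    print(merge_regions(tables))
```

With these inputs the first outline stays on its own while the second and third collapse into a single region, which matches the behaviour observed in the experiment.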

PS: Please excuse my Google Translate English.


Edited by meligo - 08 May 20 at 10:02PM

Posted by mLipok - 18 May 20 at 4:03AM
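Using this kind of code:

; $iObjectCount is assumed to already hold the number of objects in the document
For $iObj_idx = 1 To $iObjectCount
    ; Fetch the raw object once and reuse it for the clipboard and the message box
    Local $sObj = BinaryToString($oQP.GetObjectToVariant($iObj_idx))
    ClipPut($sObj)
    MsgBox(0, 'QP Obj: ' & $iObj_idx & ' = ' & $oQP.GetObjectDecodeError($iObj_idx), $sObj)
Next

I get:
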
<<
/Filter /FlateDecode
/Length 269
>>
stream
[binary FlateDecode-compressed data, not printable]

Question:
How can I decode this stream using Quick PDF Library?

Here you can find a description of how to test my examples:
http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600

Posted by mLipok - 19 May 20 at 2:05AM
Solved by decompressing the stream with zlib.
And now I see that in this test.pdf the Tj operators carry glyph codes, not simple text.
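
For reference, a sketch of the same zlib step in Python instead of AutoIt. It assumes the raw object bytes are already available (for example, from GetObjectToVariant) and that the only filter is /FlateDecode. Locating the data by searching for `stream` and `endstream` is a shortcut of mine; a real parser would use the /Length entry instead.

```python
import re
import zlib


def inflate_stream(raw_object: bytes) -> bytes:
    """Return the decompressed contents of a FlateDecode stream object."""
    match = re.search(rb"stream\r?\n(.*?)endstream", raw_object, re.DOTALL)
    if match is None:
        raise ValueError("no stream found in object")
    # decompressobj() stops at the end of the zlib data and ignores the
    # end-of-line bytes that precede 'endstream'.
    return zlib.decompressobj().decompress(match.group(1))


if __name__ == "__main__":
    # Build a small hypothetical object to demonstrate the round trip.
    content = b"BT /F1 12 Tf <0036004C> Tj ET"
    packed = zlib.compress(content)
    obj = (b"<<\n/Filter /FlateDecode\n/Length %d\n>>\nstream\n" % len(packed)
           + packed + b"\nendstream")
    print(inflate_stream(obj).decode("latin-1"))
```

In the decoded content, a Tj operand written as a hex string such as `<0036004C>` holds font-specific character codes. Turning those back into readable text requires the font's encoding or ToUnicode map, which is what a text extraction function has to resolve.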
