Debenu Quick PDF Library - PDF SDK Community Forum Homepage

Compare PDFs

tbean (Team Player, joined 23 Dec 14, Plano, TX)
Posted: 23 Oct 15 at 5:43PM
I need to compare PDFs to determine if a new PDF is a duplicate of one we already have received.
 
Is there anything in the library that will allow me to compare two PDFs?
 
Thanks,
Tom

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 23 Oct 15 at 6:37PM
Hi Tom,

To be honest, you won't need the library for this job.
Read both documents into strings or streams and compare them for differences. Or (easier still) check the file lengths first: if the lengths differ, the files cannot be identical.
Other things you can do with the library:
Run a text extraction on both files and compare the resulting strings.
Check the last-modified date of both files.
There are plenty of options...
Here's an older post regarding this issue:
http://www.quickpdf.org/forum/pdf-comparison_topic2467.html
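
The library-free check described above could look roughly like this in C#. This is only a sketch: the class and method names (`PdfBytes`, `AreIdentical`) are made up for illustration and are not part of Debenu Quick PDF Library. It assumes both PDFs are already available as byte arrays, for example from `File.ReadAllBytes` or a database column.

```csharp
using System;
using System.Linq;

public static class PdfBytes
{
    // Returns true only when both buffers hold exactly the same bytes.
    public static bool AreIdentical(byte[] first, byte[] second)
    {
        if (first == null || second == null)
            return false;

        // Cheap pre-check: buffers of different length can never be identical.
        if (first.Length != second.Length)
            return false;

        // Equal length proves nothing by itself, so compare the bytes as well.
        return first.SequenceEqual(second);
    }
}
```

Note that this only detects exact copies. Two PDFs that look the same on screen but were saved separately usually differ in a few bytes (dates, document IDs), which is where the text-extraction approach comes in.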

Cheers and welcome here,
Ingo


tbean
Posted: 04 Nov 15 at 3:58PM
Ingo,
 
I looked at the older post you provided and was able to find the duplicate using GetContentStreamToString(), which you said in the other post returns the layout.
 
Is comparing the results of GetContentStreamToString() more comprehensive than comparing those from GetPageText()?
 
If I need to use GetPageText(), which ExtractOptions would be best?
 
Thanks,
Tom
 
 
tbean
Posted: 04 Nov 15 at 4:39PM
Ingo,
 
Another thing occurred to me.
 
Which of these methods would be better if the PDF contains an image or will either of them work?
 
Thanks,
Tom

tbean
Posted: 04 Nov 15 at 9:09PM
Ingo,
 
Some of my PDFs are multi-page and when I call GetContentStreamToString(), I get back an array 27 bytes in length and GetPageText() returns an empty string.
 
What do I need to do to get the content and text from a multi-page PDF?
 
Thanks,
Tom

tbean
Posted: 06 Nov 15 at 6:16PM
Ingo,
 
I found out GetContentStreamToString() returns an array 27 bytes in length and GetPageText() returns an empty string when the PDF was an image-only PDF.
 
I added a call to HasFontResources() to determine if the PDF is image-only and don't make calls to GetContentStreamToString() and GetPageText() if it is image-only.
 
Is HasFontResources() the best way to determine if the PDF is image-only?
 
Since I can't compare the images, I assume the PDFs are different.
 
Thanks,
Tom

Ingo
Posted: 08 Nov 15 at 8:22PM
Hi Tom,

GetContentStream analyses the content of the page.
The content will be more than just text ;-)

If two PDF files are image-only, you'll find differences in the results of the two GetContentStream calls.
Yes: HasFontResources is a good function to check whether the document is image-only.

With this code (without using the library) you can always see if there are differences:

// ... (TFile needs System.IOUtils in the uses clause)
   textcomplete : WideString;
// ...
   try
      textcomplete := TFile.ReadAllText(Edit1.Text); // c:\temp\test.pdf
   except
      { Catch the possible exceptions }
      MessageDlg('Incorrect path', mtError, [mbOK], 0);
      Exit;
   end;
// ...
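
When a new file has to be checked against many files received earlier, one common option is to store a fingerprint of each file instead of comparing whole files every time. The following C# sketch shows the idea with SHA-256 from the .NET base class library. The names `PdfFingerprint` and `Sha256Hex` are hypothetical, and storing the fingerprint next to each PDF is an assumption about the application, not something the library does.

```csharp
using System;
using System.Security.Cryptography;

public static class PdfFingerprint
{
    // Returns the SHA-256 hash of the raw file bytes as a lowercase hex string.
    public static string Sha256Hex(byte[] fileBytes)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(fileBytes);
            // BitConverter yields "AB-CD-..."; strip the dashes and lowercase it.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Identical files always produce the same fingerprint, so a lookup on the stored value finds exact duplicates. As with any byte-level comparison, a file that was re-saved with the same visible content will generally get a different fingerprint.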

Cheers, Ingo




Edited by Ingo - 08 Nov 15 at 8:23PM

tbean
Posted: 11 Nov 15 at 7:07PM
Ingo,
 
My code is written in C# so I tried using File.ReadAllText() but the resulting text didn't contain any of the data from the PDF.  Below are a few lines from the text.
 
When I tried using GetPageText(), the only data it returned was the static text of the form and none of the data in the fields.  The form was a fillable PDF that I had flattened.  When I tried opening it in Adobe LiveCycle Designer, I got an error saying no fields could be found.
 
Is there some other method I can use to get the field data?
 
Would it be possible for me to upload the text file and the PDF document for you to take a look at?
 
Thanks,
Tom
---------------------
%PDF-1.6
%����
192 0 obj
<<
>>
endobj
190 0 obj
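
A likely reason for output like the above: a PDF is a binary file, and File.ReadAllText() decodes it as text (UTF-8 unless a byte order mark says otherwise), replacing every byte sequence that is not valid text with a placeholder character. On top of that, page content inside a PDF is usually stored compressed, so the field data would not be readable in the raw file anyway. For a byte-level comparison, File.ReadAllBytes() is the safer choice. The sketch below shows the loss without needing a file; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BinaryVersusText
{
    // Decodes raw bytes as UTF-8 text and encodes the text back to bytes.
    public static byte[] RoundTripThroughText(byte[] raw)
    {
        string asText = Encoding.UTF8.GetString(raw);
        return Encoding.UTF8.GetBytes(asText);
    }

    // True when nothing was lost by treating the bytes as text.
    public static bool SurvivesTextRoundTrip(byte[] raw)
    {
        return raw.SequenceEqual(RoundTripThroughText(raw));
    }
}
```

Plain ASCII such as the `%PDF-1.6` header survives the round trip, while binary bytes like those on the second line of a typical PDF do not, which is why comparing the text returned by ReadAllText() can hide real differences.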

Ingo
Posted: 11 Nov 15 at 11:03PM
Hi Tom,

Yes... you can upload it anywhere and then post the link here.
Then perhaps someone from the user community, or I myself, will have a look.

Cheers, Ingo

tbean
Posted: 12 Nov 15 at 5:03PM
Ingo,
 
I zipped both the PDF and the extracted text file.  They are at: http://www.filedropper.com/pdf_9.
 
Thanks,
Tom

tbean
Posted: 18 Nov 15 at 4:46PM
Ingo,
 
I tried the link today and it didn't work so I uploaded the file again.
 
Please try it now and you should be able to download the file.
 
Thanks,
Tom

Ingo
Posted: 23 Nov 15 at 9:17PM
Hi Tom,

The link doesn't work.

Cheers, Ingo

tbean
Posted: 23 Nov 15 at 9:31PM
Ingo,
 
I just uploaded it again, now the link is:  http://www.filedropper.com/pdf_8
 
Tom

Ingo
Posted: 24 Nov 15 at 8:46PM
Hi Tom,

There's no problem extracting the text content.
The extract starts with "CERTIFICATE OF LIABILITY INSURANCE".
Form field values are extracted, too.
You can see that in content like this: "AUTOMOBILE LIABILITY         X".

I've checked the file with my tool.
The PDF was a form that was flattened after being filled out.
But there's still one real form field... Strange?

But you're telling us that you can't extract the text content, and this shouldn't happen. Please post your code here. You should start with creating the instance and end with freeing the instance. I'm pretty sure somebody here can tell you what's wrong with the code ;-)

Cheers, Ingo

tbean
Posted: 24 Nov 15 at 9:03PM
Ingo,
 
My code is below.
 
Tom
 
----------------------------

public static bool GetPDFContent(byte[] pdfData, ref byte[] pdfContent, ref string pdfText)
{
    if (pdfData.Length > 0)
    {
        try
        {
            ClearPDFLibrary();
            pdfContent = new byte[0];
            pdfText = string.Empty;
            byte[] pageBytes;
            pdfData = FlattenPDF(pdfData);

            // Load PDF Document
            pdfLibrary.LoadFromString(pdfData, "");

            // if the PDF has no fonts, it is an image only PDF
            if (pdfLibrary.HasFontResources() == 0)
            {
                return false;
            }

            int documentId = pdfLibrary.SelectedDocument(); // get the document id
            int count = pdfLibrary.PageCount();
            int index = 1;

            while (index <= count)
            {
                int newDocumentId = pdfLibrary.NewDocument(); // create new document (has an empty page)
                pdfLibrary.CopyPageRanges(documentId, index.ToString()); // copy page to new document

                // delete the empty page
                pdfLibrary.DeletePages(1, 1);
                pdfLibrary.SelectPage(1);
                pdfLibrary.SelectContentStream(1);

                // get the content of each page and append them to pageBytes
                pageBytes = pdfLibrary.GetContentStreamToString();
                Array.Resize(ref pdfContent, pdfContent.Length + pageBytes.Length);
                Buffer.BlockCopy(pageBytes, 0, pdfContent, pdfContent.Length - pageBytes.Length, pageBytes.Length);

                // get the text of each page and concatenate them to pdfText
                pdfText += pdfLibrary.GetPageText(0);

                // remove the new document
                pdfLibrary.RemoveDocument(newDocumentId);
                ++index;
            }

            // remove the original document
            pdfLibrary.RemoveDocument(documentId);
        }
        catch (Exception ex)
        {
            ExceptionLog exceptionLog = new ExceptionLog(ERCPDFApplication, ERCPDFUser, ex);
            exceptionLog.Commit();
            return false;
        }
    }
    return true;
}

 



Edited by tbean - 24 Nov 15 at 9:09PM

Ingo
Posted: 24 Nov 15 at 9:14PM
Hi Tom,

I don't know the language, but I think the problem is already at the beginning.
What's inside pdfData? How and where do you get this content?

In Pascal/Delphi you'll need something similar to this:
// ...
  try
    QP := TDebenuPDFLibrary1114.Create;
    try
       ret1 := QP.UnlockKey('your_reg_key');
       ret2 := QP.LoadFromFile(FName.Text, '');
// ...

Perhaps the other guys here can give better help?

Cheers, Ingo

tbean
Posted: 24 Nov 15 at 9:47PM
Ingo,
 
pdfData is a byte array with the contents of the PDF.  I get the PDF data from our database and use LoadFromString() to load it into the PDFLibrary, so it's equivalent to LoadFromFile().
 
Regards,
Tom

Ingo
Posted: 25 Nov 15 at 8:24PM
Hi Tom,

Why are you calling FlattenPDF before LoadFromString?
I think this won't work...
Is the += okay in this programming language?
pdfText += pdfLibrary.GetPageText(0);

You should make your function smaller to get a better overview, and then try again.
Try the text extraction first, without the other stuff.
Only LoadFromString, SelectPage and the extraction ...

Cheers, Ingo
 
tbean
Posted: 25 Nov 15 at 10:03PM
Ingo,
 
I only added the FlattenPDF call after I encountered the problem with no field data being found for a couple of documents like the one I uploaded.  When we received this document it was fillable and I thought maybe that was the reason no field data was found.
 
The C# operator += is used to append the data of the variable on the right to the variable on the left.  It is equivalent to "pdfText = pdfText + pdfLibrary.GetPageText(0);".
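
For a document with many pages, the same accumulation is often written with a StringBuilder, which avoids creating a new string on every +=. A small sketch, where `PageTextJoiner` and `Join` are hypothetical names and the page texts are assumed to have been collected already (for example from one GetPageText() call per page):

```csharp
using System.Collections.Generic;
using System.Text;

public static class PageTextJoiner
{
    // Joins the text of all pages into one string, in page order.
    public static string Join(IEnumerable<string> pageTexts)
    {
        var builder = new StringBuilder();
        foreach (string pageText in pageTexts)
        {
            builder.Append(pageText); // same effect as: result += pageText
        }
        return builder.ToString();
    }
}
```

The result is the same string that repeated += would produce; only the way it is built differs.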
 
Regards,
Tom

Ingo
Posted: 25 Nov 15 at 10:30PM
As I've written:
Remove everything else and start small again with LoadFromString, SelectPage, ExtractPage, ...
If that runs, then go on step by step...

Cheers, Ingo