How to maintain a searchable PDF file?

dky (Beginner), posted 28 Oct 20 at 3:28AM
Hi,
  I have a searchable PDF file converted from MS Word. I need to export some pages to TIFF image files, then replace specific pages of the searchable PDF file with the TIFF images.
  How can I do this? Could you please provide sample code? I'm using Delphi 7 to develop the system. C# is OK too.

Ingo (Moderator Group), posted 28 Oct 20 at 11:22AM
Hi Mike,

you want to export single pages from your PDF in TIFF format (PDF page -> TIFF).
As a second step you want to replace those pages in your PDF with a new TIFF (TIFF -> PDF page).
This is possible with QuickPDF, but the replaced pages won't be searchable anymore.
I think that will be a show-stopper for you?

Cheers and welcome here,
Ingo

tfrost (Senior Member, UK), posted 28 Oct 20 at 5:21PM
If I scan a document with my Canon scanner, which includes an OCR function, it creates a picture of the document, which is displayed as an image in a PDF viewer. You can see that it is an image because it looks very slightly fuzzy (it is not a very high resolution scanner). But cleverly Canon also invisibly places the OCR text below the image, so that it appears you can select and search the text in the image. The scanned image then appears to be fully searchable. In theory you could implement the same trick with QuickPDF on a TIFF file which you have placed on the page, but you would need both to develop or incorporate an OCR facility AND the means to place the OCR output text invisibly (white on white) on a layer under the image in exactly the right position and size, so that a box can be drawn to highlight the search result. In the end it would be much cheaper and much quicker to purchase an inexpensive scanner with an OCR feature and scan each page on which you have inserted a TIFF.

The searchable PDF you save from Office is completely different. It contains only the text, which is rendered directly onto the viewed or printed page, so it is always searchable. Once you have destroyed all this by rendering it to TIFF, it cannot be recovered without using OCR.

dky (Beginner), posted 30 Oct 20 at 1:44AM
Dear Ingo,
  Thanks for your help. You are right: I want to replace pages in my PDF with a new TIFF (TIFF -> PDF page), and it is fine that they won't be searchable anymore.
  Could you please provide sample code to export single pages from my PDF in TIFF format (PDF page -> TIFF), and to replace those pages in my PDF with a new TIFF (TIFF -> PDF page)?
  I'm using Delphi 7 to develop the system.

dky (Beginner), posted 30 Oct 20 at 1:48AM
Dear tfrost,
  Thanks for your help.
  I need to export some pages to TIFF images and then replace them with new TIFF images. It's OK for me that these imported pages are no longer searchable, as long as the other pages remain searchable.

tfrost (Senior Member, UK), posted 30 Oct 20 at 10:08AM
To export, look at this function:

https://www.debenu.com/docs/pdf_library_reference/RenderPageToFile.php

To import, see:

https://www.debenu.com/docs/pdf_library_reference/AddImageFromFile.php

There are many similar functions shown in the reference guide, and there are examples of using them in the scripts that you can find in DebenuPDFLibrartyDemo.exe. 
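
Putting the two functions together, a minimal Delphi sketch of the round trip might look like the following. It is untested and rests on several assumptions: the unit and class names (DebenuPDFLibrary1811, TDebenuPDFLibrary1811) depend on the installed version, the license key and file paths are placeholders, and replacing the page by inserting a blank page and deleting the old one is only one possible approach, not something confirmed in this thread. InsertPages, DeletePages, SetPageDimensions, SelectImage and DrawImage come from the same reference guide; please check their exact behaviour there.

```delphi
program ReplacePageWithTiff;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  DebenuPDFLibrary1811; // assumption: unit name matches the installed version

const
  PageNo  = 2;                // page to export and replace (example value)
  SrcPdf  = 'C:\in.pdf';      // placeholder paths
  TiffOut = 'C:\page2.tif';
  DstPdf  = 'C:\out.pdf';

var
  QP: TDebenuPDFLibrary1811;  // assumption: class name matches the version
  W, H: Double;
  ImageID: Integer;
begin
  try
    QP := TDebenuPDFLibrary1811.Create;
    try
      if QP.UnlockKey('your license key here') <> 1 then
        raise Exception.Create('License key was not accepted');
      if QP.LoadFromFile(SrcPdf, '') <> 1 then
        raise Exception.Create('Could not load ' + SrcPdf);

      // Export: 600 DPI, option 10 (the same call used later in this thread)
      if QP.RenderPageToFile(600, PageNo, 10, TiffOut) <> 1 then
        raise Exception.Create('Rendering failed');

      // Remember the size of the page that will be replaced
      QP.SelectPage(PageNo);
      W := QP.PageWidth;
      H := QP.PageHeight;

      // Assumed behaviour: InsertPages adds the blank page before PageNo,
      // so the old page moves to PageNo + 1 and can then be deleted
      QP.InsertPages(PageNo, 1);
      QP.DeletePages(PageNo + 1, 1);

      // Draw the TIFF over the whole blank page (default origin: bottom left)
      QP.SelectPage(PageNo);
      QP.SetPageDimensions(W, H);
      // For TIFF input the second parameter selects the image page (assumed: 1 = first)
      ImageID := QP.AddImageFromFile(TiffOut, 1);
      if ImageID = 0 then
        raise Exception.Create('Could not load ' + TiffOut);
      QP.SelectImage(ImageID);
      QP.DrawImage(0, H, W, H);

      if QP.SaveToFile(DstPdf) <> 1 then
        raise Exception.Create('Could not save ' + DstPdf);
    finally
      QP.Free;
    end;
  except
    on E: Exception do
      Writeln('Error: ', E.Message);
  end;
end.
```

The replaced page then contains only the image, so it is no longer searchable; the untouched pages keep their text.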

Ingo (Moderator Group), posted 02 Nov 20 at 8:24PM
Hi,

I think you'll manage RenderPageToFile (PDF to image format) on your own.
Here's a sample going the other way round (AddImageFromFile):
https://www.debenu.com/kb/add-images-pdf-programmatically/

Cheers,
Ingo

dky (Beginner), posted 03 Nov 20 at 5:41AM
Hi,

  Thanks for your help. I can render a page to TIFF now, but I have encountered one problem: one Traditional Chinese font is broken. There are several fonts on the PDF page, and all of them render normally except one font named "標楷體", which is broken.

Do you know why? Should I set any parameters before calling RenderPageToFile()?

I'm using RenderPageToFile(600, 1, 10, 'C:\test.tiff')

dky (Beginner), posted 03 Nov 20 at 5:43AM
I can't post the tiff image screen capture here.

Ingo (Moderator Group), posted 03 Nov 20 at 5:00PM
Hi Mike,

it seems to me that this Chinese font doesn't exist on your PC? Normal standard fonts can be rendered by QuickPDF, but not this unusual font. If the font isn't installed on your PC, rendering will fail.
To avoid problems like this, creators normally embed special fonts into the PDF. Perhaps this wasn't done, or the font is no longer attached, or there was an error while embedding.
To look into this problem more deeply, you should upload the PDF to a free file hoster and post the link here.
 
Cheers,
Ingo

dky (Beginner), posted 04 Nov 20 at 1:18AM
Hi Ingo,
  Thanks again for your help.

  The Chinese font does exist on my PC because it's a popular font for us, and I can view the PDF normally in Acrobat Reader.

  I encountered the same problem with another PDF SDK named Spire.Pdf. They then released another version that included a Spire.FontType.dll, and after I set the documentTextIsAsiaFont parameter to true in my program, it was OK.

  The URL listed below has the test files; you can download them to test.

1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010.

2. WordToPDF_Standard_Export0001.tif: the TIFF image exported by the RenderPageToFile() function. One Chinese font is broken.

3. Test_01PDF_OriginalPDF.jpg: the original searchable PDF generated by MS Word 2010, as viewed in Acrobat Reader DC.

4. Test_02Tiff_MarkBrokenFont.jpg: a screen capture of the broken TIFF, with the broken font marked by red rectangles.


tfrost (Senior Member, UK), posted 04 Nov 20 at 6:42PM
Thanks for uploading these files. I can reproduce your font problem with QPDF 18.11. We have over the years reported several problems with CJK fonts in QPDF and though the rendering is much improved, it is not perfect in the default renderer.

I recommend that you try a different renderer, such as the PDFIUM renderer, which is supplied with Quick PDF. You can use the SelectRenderer function to try it: read the documentation for this function for details.  When I render your PDF with PDFIUM the font issues do not occur, but our application uses direct calls to the PDFIUM DLL, not via Quick PDF.  With PDFIUM the font looks exactly like your PDF, but I do not read Chinese so I cannot be 100% certain.  In our application we work mainly with faxable TIF at 200dpi so all your smaller fonts are a bit fuzzy in the TIF here, though the characters look correct.

Our applications can use either the standard renderer or PDFIUM; note that the latter may not work in a multi-threaded application.
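
For anyone who wants to try this through Quick PDF itself, a rough, untested Delphi sketch of switching the renderer before exporting could look like this. The renderer ID (3), the DLL file name and path, and the unit and class names are all assumptions on my part; the SelectRenderer and SetPDFiumFileName pages in the reference guide give the real values for your version.

```delphi
program RenderWithPDFium;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  DebenuPDFLibrary1811; // assumption: unit name matches the installed version

const
  // Assumption: 3 selects the PDFium renderer; verify in the SelectRenderer docs
  RendererPDFium = 3;
  // Hypothetical location and name of the PDFium renderer DLL shipped with the library
  PDFiumDLL = 'C:\QuickPDF\DebenuPDFRendererDLL1811.dll';

var
  QP: TDebenuPDFLibrary1811;  // assumption: class name matches the version
begin
  try
    QP := TDebenuPDFLibrary1811.Create;
    try
      if QP.UnlockKey('your license key here') <> 1 then
        raise Exception.Create('License key was not accepted');

      // Tell the library where the PDFium DLL is, then switch renderers
      QP.SetPDFiumFileName(PDFiumDLL);
      QP.SelectRenderer(RendererPDFium);

      if QP.LoadFromFile('C:\WordToPDF_Standard.pdf', '') <> 1 then
        raise Exception.Create('Could not load the PDF');

      // Same export call as earlier in the thread, now using the selected renderer
      if QP.RenderPageToFile(600, 1, 10, 'C:\test.tiff') <> 1 then
        raise Exception.Create('Rendering failed');
    finally
      QP.Free;
    end;
  except
    on E: Exception do
      Writeln('Error: ', E.Message);
  end;
end.
```

If the CJK glyphs still look wrong after this, it is worth confirming that the DLL was actually found, since otherwise the library may still be using the default renderer.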

Ingo (Moderator Group), posted 04 Nov 20 at 7:00PM
I agree with tfrost.
Looking from a greater distance it seems to be OK, but closer up it looks a bit inaccurate and shady, though not really wrong. But I'm not Chinese ;-)

Cheers,
Ingo

dky (Beginner), posted 05 Nov 20 at 6:51AM
Hi tfrost & Ingo,

  Thanks to you both. I can now render pages to TIFF with normal Chinese fonts, but another problem has occurred.

  Every page of the original PDF was searchable. After I delete a page from or insert a page into the PDF and save it, the pages after the deleted or inserted page are no longer searchable. The pages before it are still searchable. Am I missing anything before deleting or inserting a page?

  The URL listed below has the test files; you can download them to test.

https://drive.google.com/drive/folders/1bZfm0ELUe8U_djisLaAAesPE8XpCmMW6?usp=sharing

1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010; every page is searchable.

2. WordToPDF_Standard_SaveInsertPage.pdf: the PDF after I inserted one blank page at page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable.

3. WordToPDF_Standard_SaveDeletePage.pdf: the PDF after I deleted page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable.

Ingo (Moderator Group), posted 05 Nov 20 at 7:30AM
Hi,

we said this already...
The leading page wasn't touched, so it still has the parallel text content behind the image.
The whole extracted page was turned into an IMAGE (without the content behind it), and this lone IMAGE became the new PDF page.
You aren't doing anything wrong; that's just how it is.

Cheers,
Ingo

dky (Beginner), posted 06 Nov 20 at 3:06AM
Hi Ingo,
  My problem is not with extracting images. I just insert one blank page into the PDF file, or delete one existing page from it, and then save the PDF file. The pages before the inserted or deleted page are still searchable, but the pages after it are unsearchable. It seems as if the page contents were destroyed by inserting or deleting the page.

tfrost (Senior Member, UK), posted 06 Nov 20 at 10:34AM
Something seems wrong with your testing method, because your two files (insertpage and deletepage) are fully searchable here.

For the insertpage version, in PDF Tools Pro I could still extract the text on all pages, shown as being on the correct page.  In Acrobat Reader DC I could highlight and copy text on page 3 (formerly page 2), for example as 客戶基本資料表.  And if I paste this into the find dialog, Acrobat finds it in the text and places a highlight over it, as expected.

The same applies with deletepage, where I can copy text at the top of page 2 (formerly 3) as for example 因發生第 and also paste this into the find dialog and successfully find it. In the case of this document I found it a little harder to select the exact characters - I sometimes got an extra character in the selection.  But it works, basically.

The example copied glyphs above appear OK in the preview in my browser here, but I guess they may not show correctly in all browsers.

dky (Beginner), posted 10 Nov 20 at 6:07AM
Hi tfrost,
  Thanks again for your help.
  After I insert one page at page 2, I can't search the PDF using Acrobat Reader DC Chinese Edition (version 2020.013.20064). When I search for a keyword it searches only page 1 and loops; it will not continue automatically to search for the keyword after page 2, so I have to jump to the page myself. Strangely, it works normally in the DC English Edition on my customer's PC.

  The URL is listed below, and the file is named "WordToPDFA3_InsertPage.pdf".

  There is one more strange problem. I can open a PDF and save it to another PDF file, and the result is normal. But when I open a PDF file provided by my customer and save it to another PDF in the same way as with my own test file, the saved PDF file is broken and cannot be opened in Acrobat Reader, which reports the file as damaged. The broken file is named "TestSave_Bad.pdf" and the normal original file is named "TestSave_Original.pdf". The saved PDF file is bigger than the original even if I delete pages or just open and save.

  Finally, can I convert a PDF from color pages to black and white pages to reduce the PDF file size?

  Thank you very much.

tfrost (Senior Member, UK), posted 11 Nov 20 at 9:24AM
Since I do not have and could not use a Chinese Acrobat version, I cannot help with differences in how an English and Chinese Acrobat process your PDF.  And I suggest that this and also the 'bad PDF' issue are matters you should raise with Foxit support.