How to maintain a searchable PDF File?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3858
Printed Date: 20 Apr 24 at 1:16PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: How to maintain a searchable PDF File?
Posted By: dky
Subject: How to maintain a searchable PDF File?
Date Posted: 28 Oct 20 at 3:28AM
Hi,
  I have a searchable PDF file converted from MS Word. I need to export some pages as TIFF image files, then put TIFF images back into the corresponding pages of the searchable PDF file.
  How can I do this? Could you please provide sample code? I'm using Delphi 7 to develop the system. C# is OK too.



Replies:
Posted By: Ingo
Date Posted: 28 Oct 20 at 11:22AM
Hi Mike,

You want to export single pages from your PDF in TIFF format (PDF page -> TIFF).
As a second step, you want to replace those pages in your PDF with a new TIFF (TIFF -> PDF page).
This is possible with QuickPDF, but the replaced pages won't be searchable anymore.
I think this will be your show stopper?

Cheers and welcome here,
Ingo



-------------
Cheers,
Ingo



Posted By: tfrost
Date Posted: 28 Oct 20 at 5:21PM
If I scan a document with my Canon scanner, which includes an OCR function, it creates a picture of the document, which a PDF viewer displays as an image. You can tell it is an image because it looks slightly fuzzy (it is not a very high resolution scanner). But cleverly, Canon also places the OCR text invisibly below the image, so that you can apparently select and search the image. The scanned image then appears to be fully searchable.

In theory you could implement the same trick with QuickPDF on a TIFF file that you have placed on the page, but you would need both to develop or incorporate an OCR facility AND the means to place the OCR output text invisibly (white on white) on a layer under the image, in exactly the right position and size, so that a viewer can draw a box to highlight the search result. In the end it would be much cheaper and much quicker to buy an inexpensive scanner with an OCR feature and scan each page on which you have inserted a TIFF.

The searchable PDF you save from Office is completely different. It only contains the text, and the text is rendered directly onto the viewed or printed page, so it is always searchable. Once you have destroyed all this by rendering it to TIFF, it cannot be unscrambled without using OCR.
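
The positioning step described above comes down to a unit conversion. OCR engines usually report word boxes in image pixels with the origin at the top-left, while PDF page space is measured in points (1/72 inch) with the origin at the bottom-left by default. Here is a minimal Python sketch of that mapping, assuming the image fills the whole page and its DPI is known. The `WordBox` type and the function name are made up for illustration and are not part of Quick PDF.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0


@dataclass
class WordBox:
    """One OCR word in image pixels, origin at the top-left corner (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def to_pdf_points(box, dpi, page_height_pt):
    """Return (x, y, width, height) in points, origin at the bottom-left.

    y is the bottom edge of the box, which is roughly where the baseline
    of the hidden text would go. Assumes the image covers the full page.
    """
    scale = POINTS_PER_INCH / dpi
    x = box.left * scale
    w = box.width * scale
    h = box.height * scale
    # Flip the vertical axis: pixel rows count down, PDF points count up.
    y = page_height_pt - (box.top + box.height) * scale
    return (x, y, w, h)


# Example: a word found on a 300 dpi scan of a US Letter page (792 pt tall).
box = WordBox("invoice", left=300, top=600, width=450, height=60)
print(to_pdf_points(box, dpi=300, page_height_pt=792.0))
```

The hidden text would then be drawn at that position and scaled to that width and height, which is the part that needs care for search highlights to line up.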



Posted By: dky
Date Posted: 30 Oct 20 at 1:44AM
Dear Ingo,
  Thanks for your help. You are right: I want to replace pages inside my PDF with a new TIFF (TIFF -> PDF page), and it is fine that they won't be searchable anymore.
  Could you please provide sample code to export single pages from my PDF in TIFF format (PDF page -> TIFF) and to replace those pages in my PDF with a new TIFF (TIFF -> PDF page)?
  I'm using Delphi 7 to develop the system.


Posted By: dky
Date Posted: 30 Oct 20 at 1:48AM
Dear tfrost,
  Thanks for your help.
  I need to export some pages to TIFF images and then replace them with new TIFF images. It's okay for me if these imported pages are no longer searchable, as long as the other pages stay searchable.


Posted By: tfrost
Date Posted: 30 Oct 20 at 10:08AM
To export, look at this function:

https://www.debenu.com/docs/pdf_library_reference/RenderPageToFile.php

To import, see:

https://www.debenu.com/docs/pdf_library_reference/AddImageFromFile.php

There are many similar functions shown in the reference guide, and there are examples of using them in the scripts that you can find in DebenuPDFLibrartyDemo.exe. 
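
For the export half, the call sequence can be sketched in Python against a small interface. This is only an outline under assumptions: the argument order (DPI, Page, Options, FileName) follows the linked RenderPageToFile page, the return convention (1 for success) is assumed, the Options value 10 is simply the one dky uses for TIFF later in this thread, and the `_Export0001` naming is illustrative. `lib` stands for whatever Quick PDF binding is in use, with the PDF already loaded.

```python
from pathlib import Path
from typing import Iterable, List, Protocol

TIFF_OPTION = 10  # Options value used for TIFF output elsewhere in this thread


class Renderer(Protocol):
    """The one call this sketch needs; the signature is an assumption to
    check against the RenderPageToFile reference page."""

    def RenderPageToFile(self, dpi: int, page: int, options: int, file_name: str) -> int: ...


def export_pages(lib: Renderer, pdf_path: str, pages: Iterable[int], dpi: int = 600) -> List[str]:
    """Render each listed page of the already-loaded PDF to its own TIFF file."""
    stem = Path(pdf_path).with_suffix("")  # drop ".pdf" to build output names
    written = []
    for page in pages:
        out = f"{stem}_Export{page:04d}.tif"
        # Assumed convention: 1 means the page was rendered successfully.
        if lib.RenderPageToFile(dpi, page, TIFF_OPTION, out) != 1:
            raise RuntimeError(f"rendering page {page} failed")
        written.append(out)
    return written
```

Calling `export_pages(lib, "WordToPDF_Standard.pdf", [2, 5])` would request `WordToPDF_Standard_Export0002.tif` and `WordToPDF_Standard_Export0005.tif`. The import half (AddImageFromFile plus drawing the image onto the page) is left to the reference guide and demo scripts mentioned above.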


Posted By: Ingo
Date Posted: 02 Nov 20 at 8:24PM
Hi,

I think you'll manage to use RenderPageToFile (PDF to image format) yourself.
Here's a sample going the other way round (AddImageFromFile):
https://www.debenu.com/kb/add-images-pdf-programmatically/



Posted By: dky
Date Posted: 03 Nov 20 at 5:41AM

Hi,

  Thanks for your help. I can render a page to TIFF now, but I have run into a problem: one Traditional Chinese font is broken. There are several fonts on the PDF page, and all of them render normally except one named "標楷體".

Do you know why? Should I set any parameters before calling RenderPageToFile()?

I'm calling RenderPageToFile(600, 1, 10, 'C:\test.tiff')



Posted By: dky
Date Posted: 03 Nov 20 at 5:43AM
I can't post the screen capture of the TIFF image here.


Posted By: Ingo
Date Posted: 03 Nov 20 at 5:00PM
Hi Mike,

It seems to me that this Chinese font doesn't exist on your PC? Normal standard fonts can be rendered by QuickPDF, but not this unusual font. If the font isn't installed on your PC, rendering will fail.
To avoid problems like this, creators normally embed special fonts in the PDF. Perhaps this wasn't done, or the font is detached now, or there was an error while embedding.
To check this problem more closely, you should upload the PDF to a free file hoster and post the link here.
 


Posted By: dky
Date Posted: 04 Nov 20 at 1:18AM
Hi Ingo,
  Thanks again for your help.

  The Chinese font does exist on my PC because it is a popular font here, and I can view the PDF normally in Acrobat Reader.

  I encountered the same problem with another PDF SDK named Spire.Pdf. They then released another version that includes a Spire.FontType.dll, and after I set the documentTextIsAsiaFont parameter to true in my program, it was okay.

  The test files are at the URL below; you can download them to test.
https://drive.google.com/drive/folders/1bZfm0ELUe8U_djisLaAAesPE8XpCmMW6?usp=sharing

1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010.

2. WordToPDF_Standard_Export0001.tif: the TIFF image produced by the RenderPageToFile() function. One Chinese font is broken.

3. Test_01PDF_OriginalPDF.jpg: the original searchable PDF generated by MS Word 2010, viewed in Acrobat Reader DC.

4. Test_02Tiff_MarkBrokenFont.jpg: a screen capture of the broken TIFF, with the broken font marked by red rectangles.


Posted By: tfrost
Date Posted: 04 Nov 20 at 6:42PM
Thanks for uploading these files. I can reproduce your font problem with QPDF 18.11. We have over the years reported several problems with CJK fonts in QPDF and though the rendering is much improved, it is not perfect in the default renderer.

I recommend that you try a different renderer, such as the PDFIUM renderer, which is supplied with Quick PDF. You can use the SelectRenderer function to try it: read the documentation for this function for details.  When I render your PDF with PDFIUM the font issues do not occur, but our application uses direct calls to the PDFIUM DLL, not via Quick PDF.  With PDFIUM the font looks exactly like your PDF, but I do not read Chinese so I cannot be 100% certain.  In our application we work mainly with faxable TIF at 200dpi so all your smaller fonts are a bit fuzzy in the TIF here, though the characters look correct.

Our applications can use either the standard renderer or PDFIUM; note that the latter may not work in a multi-threaded application.
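
The fuzziness at 200 dpi is largely a matter of pixel budget. As a rough Python illustration, a glyph's em height in pixels is its point size divided by 72 and multiplied by the DPI. The 9 pt size and the A4 page dimensions below are arbitrary examples, not values measured from the uploaded PDF.

```python
POINTS_PER_INCH = 72


def em_height_px(font_size_pt, dpi):
    """Pixels available for one em of text at the given resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


def page_size_px(width_pt, height_pt, dpi):
    """Raster dimensions (width, height) of a page rendered at the given resolution."""
    return (round(width_pt / POINTS_PER_INCH * dpi),
            round(height_pt / POINTS_PER_INCH * dpi))


# Compare the fax-style 200 dpi with the 600 dpi used in the RenderPageToFile call.
for dpi in (200, 600):
    print(dpi, "dpi:", em_height_px(9, dpi), "px per em,",
          page_size_px(595, 842, dpi), "px for an A4 page")
```

At 9 pt this works out to 25 px per em at 200 dpi against 75 px at 600 dpi, which leaves far fewer pixels for dense CJK strokes at the lower resolution.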


Posted By: Ingo
Date Posted: 04 Nov 20 at 7:00PM
I agree with tfrost.
From a distance it looks okay, but up close it looks a bit inaccurate and shady, though not really wrong. But I'm not Chinese ;-)



Posted By: dky
Date Posted: 05 Nov 20 at 6:51AM

Hi tfrost & Ingo,

  Thanks to you both. I can now render pages to TIFF with correct Chinese fonts, but another problem has occurred.

  In the original PDF every page was searchable. After I delete or insert a page and save the PDF, the pages after the deleted or inserted page are no longer searchable. The pages before it are still searchable. Am I missing something before deleting or inserting a page?

  The test files are at the URL below; you can download them to test.

https://drive.google.com/drive/folders/1bZfm0ELUe8U_djisLaAAesPE8XpCmMW6?usp=sharing

1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010; every page is searchable.

2. WordToPDF_Standard_SaveInsertPage.pdf: the PDF after I inserted one blank page at page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable.

3. WordToPDF_Standard_SaveDeletePage.pdf: the PDF after I deleted page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable.



Posted By: Ingo
Date Posted: 05 Nov 20 at 7:30AM
Hi,

We explained this already...
The leading pages weren't touched, so they still have the text content behind what you see.
The whole extracted page was turned into an IMAGE (without the content behind it), and this lone IMAGE became the new PDF page.
You aren't doing anything wrong; that's just how it is.



Posted By: dky
Date Posted: 06 Nov 20 at 3:06AM
Hi, Ingo
  My problem is not about extracting images. I just insert one blank page into the PDF file, or delete one existing page from it, and then save the file. The pages before the inserted or deleted page are still searchable, but the pages after it are not. It seems as if the page contents were destroyed by the insert or delete.


Posted By: tfrost
Date Posted: 06 Nov 20 at 10:34AM
Something seems wrong with your testing method, because your two files, insertpage and deletepage, are fully searchable here.

For the insertpage version, in PDF Tools Pro I could still extract the text on all pages, shown as being on the correct page.  In Acrobat Reader DC I could highlight and copy text on page 3 (formerly page 2), for example as 客戶基本資料表.  And if I paste this into the find dialog, Acrobat finds it in the text and places a highlight over it, as expected.

The same applies with deletepage, where I can copy text at the top of page 2 (formerly 3) as for example 因發生第 and also paste this into the find dialog and successfully find it. In the case of this document I found it a little harder to select the exact characters - I sometimes got an extra character in the selection.  But it works, basically.

The example copied glyphs above appear OK in the preview in my browser here, but I guess they may not show correctly in all browsers.


Posted By: dky
Date Posted: 10 Nov 20 at 6:07AM
Hi tfrost,
  Thanks again for your help.
  I can't search the PDF after I insert one page at page 2 when using Acrobat Reader DC Chinese Edition (version 2020.013.20064). When I search for a keyword, it searches only page 1 and loops; it does not continue automatically to the pages after page 2, so I have to jump to those pages myself. Strangely, it works normally in DC English Edition on my customer's PC.

  The URL is below, and the file is named "WordToPDFA3_InsertPage.pdf".
  https://drive.google.com/drive/folders/1bZfm0ELUe8U_djisLaAAesPE8XpCmMW6?usp=sharing

  There is one more strange problem. I can open a PDF and save it to another PDF file, and the result is normal. But when I open a PDF file provided by my customer and save it to another PDF, in the same way as with my own test file, the saved PDF is broken and cannot be opened in Acrobat Reader, which reports the file as broken. The broken file is named "TestSave_Bad.pdf" and the normal original file is named "TestSave_Original.pdf". The saved PDF is also bigger than the original, even if I delete pages or just open and save.

  Finally, can I convert a PDF from color pages to black and white pages to reduce the file size?

  Thank you very much.


Posted By: tfrost
Date Posted: 11 Nov 20 at 9:24AM
Since I do not have and could not use a Chinese Acrobat version, I cannot help with differences in how an English and Chinese Acrobat process your PDF.  And I suggest that this and also the 'bad PDF' issue are matters you should raise with Foxit support.


Posted By: WalterWiggos
Date Posted: 05 Mar 23 at 8:32PM
To extract specific pages from a PDF and save them as TIFF image files, you can use a PDF library such as iTextSharp. Once you have the TIFF files, you can use a library like LibTiff to manipulate them as needed. To replace specific pages in your searchable PDF file with the TIFF images, you can use a PDF library to insert the images into the PDF at the desired page location. I won't have time to keep up with modern technologies, as everything is developing too fast. Recently, our company has integrated a document scanner from https://smartengines.com/. This scanner is very effective and, most importantly, reduces the workload of employees.


