
Creating Text Searchable PDF with hocr file

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2673
Printed Date: 20 May 25 at 6:26AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Creating Text Searchable PDF with hocr file
Posted By: McHaigh
Subject: Creating Text Searchable PDF with hocr file
Date Posted: 18 Jun 13 at 4:16PM
Hi there,

Could someone please elaborate on the answer posted here: http://www.quickpdf.org/forum/creating-pdf-from-image-hocr-data_topic2562.html

"With the draw-functionalities of QP you can insert the real text
without having an eye on the layout. "

I am unsure what they mean by "without having an eye on the layout". As far as I can see, one would still have to parse the hocr file's XML and calculate the position of the boxes described within it relative to the document.

Am I missing something? Is there simple functionality already written to parse hocr data onto a PDF document?



Replies:
Posted By: Ingo
Date Posted: 18 Jun 13 at 10:19PM
...without having an eye on the layout...

Because you draw the text right from scratch.
You can then place the original layout over the drawn text on a new layer/content group.
So the first text is invisible but extractable.
Regarding hocr there was a hint: "Looking at Google" ;-)

Cheers and welcome here,
Ingo



Posted By: AndrewC
Date Posted: 20 Jun 13 at 4:15AM
Hello,

Debenu Quick PDF Library does not have any functions to process hocr files.  

Firstly you will need to turn the image into a PDF file by using AddImageFromFile to import the TIFF file that you have just OCR'ed.
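For the invisible text to line up with the scanned image later, the page generally needs to match the physical size of the image. The sketch below shows only the unit arithmetic (1 point = 1/72 inch). It assumes the scan resolution in DPI is known; `page_size_points` is a hypothetical helper, not a Quick PDF Library function, and Python is used only to illustrate the calculation.

```python
def page_size_points(width_px, height_px, dpi):
    """Convert image pixel dimensions to PDF points (1 point = 1/72 inch)."""
    return width_px * 72.0 / dpi, height_px * 72.0 / dpi


# Example: a 2550 x 3300 pixel scan at 300 DPI (8.5 x 11 inches).
page_w, page_h = page_size_points(2550, 3300, 300)
```

The same factor (72 / DPI) is what converts the pixel coordinates in the hocr file into page coordinates.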

Then, as you said, you will need to parse the XML-based hocr file to calculate the x,y position and point size of the font for each word. For each word you then need to do something like this:

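  QP.SetTextMode(3);  // text mode 3 = invisible text
  for i := 1 to ocr_wordcount do
  begin
    // ocr_size, ocr_x, ocr_y and ocr_word are arrays filled from the parsed hocr file
    QP.SetTextSize(ocr_size[i]);
    QP.DrawText(ocr_x[i], ocr_y[i], ocr_word[i]);
  end;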

This will draw invisible text onto the PDF.  It is this process that will make it searchable.
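The parsing step is not part of the library, so here is a rough sketch of how the word text and positions might be pulled out of a hocr file. It assumes the usual hocr layout, where each word is a `span` with class `ocrx_word` and a `title` containing `bbox x0 y0 x1 y1` in image pixels, measured from the top-left corner. It also assumes the PDF side works in points with a bottom-left origin, and it uses the box height as a rough font size, so the placement is approximate. `HocrWords` and `to_pdf_units` are hypothetical names, and Python is used only for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span, in image pixels."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0    # spans nested inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            if tag == "span":
                self._depth += 1
            return
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            m = BBOX_RE.search(attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, *self._bbox))
        self._bbox = None


def to_pdf_units(word, page_height_px, dpi):
    """Map a pixel bbox (top-left origin) to points with a bottom-left origin."""
    text, x0, y0, x1, y1 = word
    x = x0 * 72.0 / dpi
    y = (page_height_px - y1) * 72.0 / dpi  # bottom edge of the box, from the page bottom
    size = (y1 - y0) * 72.0 / dpi           # box height as a rough font size
    return text, x, y, size


sample = """<span class='ocr_line' title='bbox 300 300 900 360'>
<span class='ocrx_word' title='bbox 300 300 450 360; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 480 300 700 360; x_wconf 91'>world</span>
</span>"""

parser = HocrWords()
parser.feed(sample)
placed = [to_pdf_units(w, page_height_px=3300, dpi=300) for w in parser.words]
```

Each tuple in `placed` holds the word, x, y and size that would feed the SetTextSize/DrawText loop. If the library is set up with a top-left origin instead, the vertical flip in `to_pdf_units` would not be needed.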

Andrew.



