Debenu Quick PDF Library - PDF SDK Community Forum Homepage
Creating Text Searchable PDF with hocr file

McHaigh (Beginner, joined 18 Jun 13)
Posted: 18 Jun 13 at 4:16PM
Hi there,

Could someone please elaborate on the answer posted here?

"With the draw-functionalities of QP you can insert the real text
without having an eye on the layout. "

I am unsure what they mean by "without having an eye on the layout". As far as I can see, the steps would be to parse the hOCR file's XML and calculate the position of the boxes it describes relative to the document.

Am I missing something? Is there existing functionality to parse hOCR data onto a PDF document?

Ingo (Moderator Group, joined 29 Oct 05)
Posted: 18 Jun 13 at 10:19PM
...without having an eye on the layout...

Because you draw the text from scratch.
You can then place the original layout over the drawn text on a new layer/content group.
That way the first text is invisible but still extractable.
Regarding hocr there was a hint: "Looking at Google" ;-)

Cheers and welcome here,
Ingo


AndrewC (Moderator Group, joined 08 Dec 10, Geelong, Aust)
Posted: 20 Jun 13 at 4:15AM
Hello,

Debenu Quick PDF Library does not have any functions to process hocr files.  

Firstly you will need to turn the image into a PDF file by using AddImageFromFile to import the TIFF file that you have just OCR'ed.

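Then, as you have said, you will need to parse the XML-based hocr files to calculate the x,y position and point size of the font for each word.  With those values collected into arrays, you draw each word like this:

  QP.SetTextMode(3);  // text mode 3 = invisible text
  for i := 1 to ocr_wordcount do
  begin
    // ocr_size, ocr_x, ocr_y, ocr_word: per-word arrays filled from the hocr file
    QP.SetTextSize(ocr_size[i]);
    QP.DrawText(ocr_x[i], ocr_y[i], ocr_word[i]);
  end;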

This will draw invisible text onto the PDF.  It is this process that will make it searchable.
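
The parsing step can be sketched roughly as follows. This is Python rather than Delphi and only a minimal sketch: it assumes Tesseract-style hocr output, where each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1" in image pixels. The names HocrWords, parse_hocr_words and SAMPLE are made up for illustration and are not part of Quick PDF Library.

```python
import re
from html.parser import HTMLParser

# "bbox x0 y0 x1 y1" inside the title attribute, in image pixels.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hocr document."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical two-word sample in the assumed layout.
SAMPLE = """
<div class='ocr_page' title='image "scan.tif"; bbox 0 0 2480 3508'>
 <span class='ocr_line' title='bbox 300 400 900 460'>
  <span class='ocrx_word' title='bbox 300 400 520 460; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 560 400 900 460; x_wconf 93'>world</span>
 </span>
</div>
"""

for word, bbox in parse_hocr_words(SAMPLE):
    print(word, bbox)
```

The resulting list is what would feed the ocr_word array in the loop; the boxes still have to be converted from pixels to PDF units before they can be used as ocr_x, ocr_y and ocr_size.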

Andrew.
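
Turning one hocr box into the x, y and point size for DrawText is mostly unit conversion. A rough sketch in Python, assuming the scan resolution (dpi) is known, the hocr pixel coordinates start at the top-left corner of the image, and the PDF page uses the default bottom-left origin with 72 points per inch. bbox_to_pdf is a hypothetical helper, and treating the box height as the font size and the box bottom as the text baseline are approximations that may need tuning.

```python
def bbox_to_pdf(bbox, page_height_px, dpi):
    """Map a top-left-origin pixel bbox to (x, y, size) in PDF points.

    x, y : lower-left corner of the box, measured from the bottom-left
           corner of the page (y is used as an approximate baseline).
    size : box height in points, used as an approximate font size.
    """
    x0, y0, x1, y1 = bbox
    x = x0 * 72.0 / dpi                      # pixels -> points
    y = (page_height_px - y1) * 72.0 / dpi   # flip the y axis, then scale
    size = (y1 - y0) * 72.0 / dpi
    return x, y, size


# Example values only: a 3508 px tall page scanned at 300 dpi.
print(bbox_to_pdf((300, 400, 520, 460), page_height_px=3508, dpi=300))
```

Descenders make real word boxes taller than the font's nominal size, so a common refinement is to take the size from the line box instead of each word box.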
