
DAExtractPageText losing characters

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1596
Printed Date: 06 May 25 at 4:25PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: DAExtractPageText losing characters
Posted By: Mike4ql
Subject: DAExtractPageText losing characters
Date Posted: 08 Oct 10 at 11:39AM

I am trying to extract the text from a PDF. Most of it works fine, but occasionally letters are missing from the extracted text. This appears to be because the PDF uses octal escape codes for those characters.

This is the text which should be produced and is rendered correctly by DARenderPageToString:
Top Line ->  6 fyodor dostoyevsky
Space ->
Next Line -> flowers in a stuffy city apartment, but because everybody is

Here are the page content commands for the same section:

BT
0 0 0 1 k
/GS0 gs
/T1_0 1 Tf
8.25 0 0 8.25 262.7389 564.0571 Tm
[(\036)-100(\035)-55(\034)-100(\033)-100(\034)-100(\032)-100( )-100(\033)-100(\034)-100(\031)-100(\030)-82(\034)-45(\035)-100(\027)-100(\026)-100(\031)-100(\025)-100(\035)]TJ
8.5 0 0 8.5 83.3622 564.0571 Tm
(\f)Tj
10.5104 0 0 10.25 83.3622 543.058 Tm
[(\023)10(o)10(w)10(e)10(r)10(s)10( )-125(i)10(n)10( )-126(a)10( )-125(s)10(t)10(u)10(f)10(f)10(y)10( )-125(c)10(i)10(t)10(y)10( )-126(a)10(p)10(a)10(r)10(t)10(m)10(e)10(n)10(t)10(,)47( )-126(b)10(u)10(t)10( )-125(b)10(e)10(c)10(a)10(u)10(s)10(e)10( )-125(e)10(v)10(e)10(r)10(y)10(b)10(o)10(d)10(y)10( )-125(i)10(s )]TJ

DAExtractPageText (option 3) returns 2 lines. For the Top Line it returns an empty string and a space (or perhaps 2), and it drops the "fl" from the beginning of the Next Line.
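
For reference, the octal escapes in the extract above can be decoded by hand. The following is a minimal Python sketch, not part of Quick PDF Library; the helper names `decode_literal` and `tj_codes` are made up for illustration, and the parser assumes the strings contain no nested or escaped parentheses. It lists the character code behind each string so the codes can be lined up against the rendered text.

```python
import re

# One literal string "( ... )" inside a Tj/TJ operand. Assumption: no nested
# or escaped parentheses, which holds for the extract in this post.
STRING_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)")
# A backslash escape: 1-3 octal digits, or a single escape character.
ESCAPE_RE = re.compile(r"\\([0-7]{1,3}|.)")

# Single-character escapes defined for PDF literal strings.
SIMPLE = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
          "(": 40, ")": 41, "\\": 92}


def decode_literal(body):
    """Return the character codes encoded by one literal string body.

    Line continuations (backslash followed by a newline) are not handled.
    """
    codes = []
    pos = 0
    while pos < len(body):
        match = ESCAPE_RE.match(body, pos)
        if match:
            token = match.group(1)
            if token[0] in "01234567":
                codes.append(int(token, 8) & 0xFF)  # octal escape, e.g. \036
            else:
                codes.append(SIMPLE.get(token, ord(token)))
            pos = match.end()
        else:
            codes.append(ord(body[pos]))  # ordinary character
            pos += 1
    return codes


def tj_codes(operand):
    """Collect the codes of every string in an operand, skipping kerning numbers."""
    codes = []
    for match in STRING_RE.finditer(operand):
        codes.extend(decode_literal(match.group(1)))
    return codes


# Top Line operand copied from the extract above.
top_line = tj_codes(
    r"[(\036)-100(\035)-55(\034)-100(\033)-100(\034)-100(\032)-100( )-100"
    r"(\033)-100(\034)-100(\031)-100(\030)-82(\034)-45(\035)-100(\027)-100"
    r"(\026)-100(\031)-100(\025)-100(\035)]"
)

# Pair each code with the character that DARenderPageToString shows.
for code, char in zip(top_line, "fyodor dostoyevsky"):
    print(code, repr(char))
```

Lined up against the rendered text, `\f` (code 12) corresponds to "6" and `\023` (code 19) to the "fl" ligature, and the Top Line letters all use codes 21 to 30. Every missing character has a code below 32, so one possible explanation (an assumption, not confirmed here) is that the extractor discards codes in the control-character range rather than mapping them through the font's encoding.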

Is there any way I can correct this?




Replies:
Posted By: Mike4ql
Date Posted: 12 Oct 10 at 7:15PM
Has nobody else seen this?

It seems to be a fundamental flaw that prevents anyone from using Quick PDF to extract text from a PDF.
 
I would be grateful for any suggestions.
 
Mike


