
GetPageText(3/4) returns invalid rectangles

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=819
Printed Date: 10 May 24 at 2:21PM


Posted By: DELBEKE
Date Posted: 23 Nov 07 at 4:31AM

GetPageText(3) or GetPageText(4) returns lines that describe each piece of text together with its bounding rectangle and font.

From time to time, the coordinates returned do not form a rectangle.
 
The following VB6 code was written to demonstrate this:

Private Sub Command1_Click()
  Dim lPnt As Long
  Dim lRet As Long
  Dim iPosit As Integer
  Dim hFich As Integer
  Dim Text As String
  Dim InputFileName As String
  Dim OutputFileName As String
  Dim DocId As Long
  Dim Tbl1 As Variant
  Dim Tbl2 As Variant
  Dim sTemp As String
  Dim X1 As Double
  Dim Y1 As Double
  Dim X2 As Double
  Dim Y2 As Double
  Dim X3 As Double
  Dim Y3 As Double
  Dim X4 As Double
  Dim Y4 As Double
 
  Dim Ok As Boolean
  'declared locally so the sample compiles with Option Explicit
  Dim Doc As iSED.QuickPDF
 
  'get input and output filenames
  InputFileName = LCase(Text1)
  OutputFileName = Replace(InputFileName, ".pdf", ".txt")
  'get a free handle number
  hFich = FreeFile
  'create a new instance of iSED
  Set Doc = New iSED.QuickPDF
  'unlock iSED
  lRet = Doc.UnlockKey("XXXXXXXXXXXXXXXXXXXXXX")
  'load the sample file
  DocId = Doc.LoadFromFile(InputFileName)
  'select first page
  lRet = Doc.SelectPage(1)
  'combine layers to get the whole text
  lRet = Doc.CombineLayers
  'set the measurement units to millimeters
  Doc.SetMeasurementUnits 1
  ' set origin to top left
  Doc.SetOrigin 1
  'get the text
  Text = Doc.GetPageText(3)
  
  'free memory
  Doc.RemoveDocument DocId
 
  'split the text into a table, using CR+LF as the line separator
  Tbl1 = Split(Text, vbCrLf)
  'analyse each line of the table
  '(the last element is skipped, assuming the text ends with a trailing CR+LF)
  For lPnt = 0 To UBound(Tbl1) - 1
    sTemp = Tbl1(lPnt)
    'remove the font name
    iPosit = InStr(sTemp, Chr(34) & ",")
    sTemp = Mid(sTemp, iPosit + 1)
    'split the line into parts using comma as separator
    Tbl2 = Split(sTemp, ",")
    X1 = CDbl(Tbl2(3))
    Y1 = CDbl(Tbl2(4))
    X2 = CDbl(Tbl2(5))
    Y2 = CDbl(Tbl2(6))
    X3 = CDbl(Tbl2(7))
    Y3 = CDbl(Tbl2(8))
    X4 = CDbl(Tbl2(9))
    Y4 = CDbl(Tbl2(10))
    'to be a rectangle (assuming the points are defined clockwise)
    'x1 must equal x4
    'y1 must equal y2
    'x2 must equal x3
    'y3 must equal y4
    Ok = True 'by default, the data defines a rectangle
    If X1 <> X4 Then
      Ok = False
    End If
    If X2 <> X3 Then
      Ok = False
    End If
    If Y1 <> Y2 Then
      Ok = False
    End If
    If Y3 <> Y4 Then
      Ok = False
    End If
    If Not Ok Then
      MsgBox "this line does not define a rectangle" & vbCrLf & _
              Tbl1(lPnt)
    End If
  Next
  MsgBox "test finished"
End Sub
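
A side note on the test itself: comparing Double values with <> is an exact comparison, so two coordinates that differ only by rounding would be reported as "not a rectangle". Whether that explains the lines flagged here is not known. The sketch below shows one way to compare with a tolerance instead. NearlyEqual and IsAxisAlignedRect are hypothetical helper names, not part of the Quick PDF Library, and the 0.001 tolerance (in the current measurement units) is an arbitrary assumption.

```vb
' Hypothetical helpers, not part of the Quick PDF Library.
' The default tolerance of 0.001 is an assumed value; adjust it
' to the measurement units in use.
Private Function NearlyEqual(ByVal A As Double, ByVal B As Double, _
    Optional ByVal Tolerance As Double = 0.001) As Boolean
  ' True when the two values differ by no more than Tolerance
  NearlyEqual = (Abs(A - B) <= Tolerance)
End Function

Private Function IsAxisAlignedRect( _
    ByVal X1 As Double, ByVal Y1 As Double, _
    ByVal X2 As Double, ByVal Y2 As Double, _
    ByVal X3 As Double, ByVal Y3 As Double, _
    ByVal X4 As Double, ByVal Y4 As Double) As Boolean
  ' Same four conditions as in Command1_Click (points taken clockwise),
  ' but each equality is checked within a tolerance
  IsAxisAlignedRect = NearlyEqual(X1, X4) And NearlyEqual(X2, X3) And _
                      NearlyEqual(Y1, Y2) And NearlyEqual(Y3, Y4)
End Function
```

Inside the loop, the four If blocks could then be replaced by Ok = IsAxisAlignedRect(X1, Y1, X2, Y2, X3, Y3, X4, Y4). Note that this test, like the original, only accepts rectangles whose sides are parallel to the page axes; four corners of rotated text would fail it even if they form a valid rectangle.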

 
 



Replies:
Posted By: swb1
Date Posted: 23 Nov 07 at 9:06AM

I have had this problem as well. It seems to depend on what created the document. There are a number of different ways of expressing the location of text on the page, and some of these ways seem to hide the text from QuickPDF. I do not have an answer, and because I use QuickPDF principally to extract text, I am hopeful that there is a solution.

 

Steve


Posted By: DELBEKE
Date Posted: 23 Nov 07 at 10:12AM
The problem is to get the real Y position; the bottom line can then be found by adding the height of the current font.
The Y position may be the bottom or the top line of the text (but it should always be the same one).
As far as I can tell, one of y1/y2/y3/y4 is the right one, but I cannot work out which.
 
PS: I have found some documents where y1=y2=y3=y4.
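
One possible workaround, offered only as a sketch: instead of guessing which of the four Y values is the reliable one, take the smallest and largest of them as the vertical extent of the text. GetVerticalExtent is a hypothetical helper, not a library function, and it assumes the origin is at the top left (as set with SetOrigin 1 in the first post), so that the smaller Y is the top.

```vb
' Hypothetical helper, not part of the Quick PDF Library.
' Assumes a top-left origin, so Y grows downward:
' the smallest Y is the top edge, the largest Y is the bottom edge.
Private Sub GetVerticalExtent( _
    ByVal Y1 As Double, ByVal Y2 As Double, _
    ByVal Y3 As Double, ByVal Y4 As Double, _
    ByRef TopY As Double, ByRef BottomY As Double)
  Dim Ys(0 To 3) As Double
  Dim i As Integer

  Ys(0) = Y1
  Ys(1) = Y2
  Ys(2) = Y3
  Ys(3) = Y4

  ' Start from the first value, then widen the range as needed
  TopY = Ys(0)
  BottomY = Ys(0)
  For i = 1 To 3
    If Ys(i) < TopY Then TopY = Ys(i)
    If Ys(i) > BottomY Then BottomY = Ys(i)
  Next
End Sub
```

When y1=y2=y3=y4, TopY and BottomY come out equal, so the extent has zero height and the font height would still be needed to find the other edge.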


Posted By: swb1
Date Posted: 23 Nov 07 at 10:59AM

I guess my problem is not the same after all. My issue is not with the bounding rectangle. GetPageText(3) is returning the correct rectangle boundaries; however, it is not returning the correct text. In most cases the text is empty.



