
ExtractFilePageText extracts question mark!

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=3740
Printed Date: 25 Apr 24 at 3:13PM


Topic: ExtractFilePageText extracts question mark!
Posted By: rezabb
Date Posted: 09 Sep 19 at 3:05AM
Hi,

I have a problem extracting text from my PDF file. The PDF content is in Persian (Farsi).

After calling the ExtractFilePageText function in VB6, I receive question marks (??? ??) instead of the actual text. How can I get the real Persian text rather than a series of question marks?

Thanks,
Reza

P.S. This is my code in VB6:

Dim ClassName
Dim LicenseKey

ClassName = "DebenuPDFLibraryAX1613.PDFLibrary"
LicenseKey = "***"

Dim DPL
Dim Result
 
Set DPL = CreateObject(ClassName)
Result = DPL.UnlockKey(LicenseKey)

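' Load the document so that PageCount can be queried
DPL.LoadFromFile strInputFilePath, ""

iNumPages = DPL.PageCount() ' Number of pages in the loaded document

strText = ""

For nPage = 1 To iNumPages
    ' Append each page's text so that earlier pages are not overwritten
    strText = strText & DPL.ExtractFilePageText(strInputFilePath, "", nPage, 0) & vbCrLf
Next

' Show the combined text once all pages have been read
Clipboard.Clear
Clipboard.SetText strText
text1.text = strText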



Replies:
Posted By: Ingo
Date Posted: 09 Sep 19 at 10:46AM
Hi Reza,

If you look at the online reference you'll see that all string content is handled as Unicode (wchar). VB6 doesn't support Unicode with normal VB6 code, so you have to convert the content first, from Unicode to string or the other way round (depending on whether you want to get or put).
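
To illustrate where the question marks come from, here is a minimal sketch in Python (not VB6, purely for illustration). It assumes the text is lost when Unicode is converted to a single-byte ANSI code page; cp1252 is only an assumed example of such a code page, and the function names are hypothetical.

```python
def through_ansi(text: str, codepage: str = "cp1252") -> str:
    """Round-trip text through a single-byte code page (assumed: cp1252)."""
    # Characters that the code page cannot represent are replaced with "?"
    return text.encode(codepage, errors="replace").decode(codepage)


def through_utf16(text: str) -> str:
    """Round-trip text through UTF-16, which can represent Persian letters."""
    return text.encode("utf-16-le").decode("utf-16-le")


persian = "\u0633\u0644\u0627\u0645"  # the Persian word "salam"

print(through_ansi(persian))              # ????
print(through_utf16(persian) == persian)  # True
```

The text itself is intact as long as it stays in Unicode; it only degrades at the point where it is forced into a code page without Persian letters.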

Cheers and welcome here,
Ingo



-------------
Cheers,
Ingo



Posted By: rezabb
Date Posted: 09 Sep 19 at 1:39PM
Dear Ingo,
Many thanks for your answer and for the warm welcome to the forum.
Using your exact keywords and the conversion you mentioned, I found the proper VB code, and I have now extracted the text in Persian. It worked like a charm.

Another question: in Persian, text is written from right to left. Quick PDF seems to extract text (characters) starting from the top left of the page. Is there an option to reverse this, say to start from the top right of the page? My words are extracted in reversed character order, which makes them unreadable (for example, the word "google" would be extracted as "elgoog").

When I copy text from the Persian PDF and paste it into MS Word, the text is correct. I want my extracted text to match what MS Word produces.

Thanks again,
Reza



Posted By: Ingo
Date Posted: 09 Sep 19 at 4:56PM
Try extract option 0 or 7 to get normal readable content, or option 4 to get the text word by word.
This should make it easier for you.
https://www.debenu.com/docs/pdf_library_reference/ExtractFilePageText.php
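
If the words still come back in reversed character order, one possible post-processing step is to reverse only the words that contain right-to-left letters. The following is a rough sketch in Python (not VB6, purely for illustration); the function names are hypothetical, and it assumes each word arrives as a separate string with its characters in exactly reversed order. It does not handle digits or mixed-direction words correctly.

```python
import unicodedata

# Unicode bidirectional classes used by right-to-left letters
# ("AL" covers Arabic-script letters, which Persian uses)
RTL_CLASSES = {"R", "AL"}


def is_rtl_word(word: str) -> bool:
    """Return True if any character in the word is a right-to-left letter."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def restore_logical_order(word: str) -> str:
    """Reverse right-to-left words; leave all other words untouched."""
    return word[::-1] if is_rtl_word(word) else word


reversed_word = "\u0645\u0627\u0644\u0633"  # "salam" with its letters reversed
print(restore_logical_order(reversed_word) == "\u0633\u0644\u0627\u0645")  # True
print(restore_logical_order("google"))  # google
```

Latin words pass through unchanged, so a page that mixes Persian and English words can be processed word by word.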



-------------
Cheers,
Ingo



Posted By: rezabb
Date Posted: 09 Sep 19 at 5:45PM
Great. Option 4 is the better solution: I can handle the text word by word and post-process the extracted words later in VB code.
In my PDF I have a few half-space characters. A half-space acts like a space, but the two words it connects are treated as one word rather than two separate words (just like the "-" character in English; for example, "non-destructive" is considered one word, not two).

I see that Quick PDF treats words with a half-space character in the middle as two separate words and extracts them separately. Is there any way to extract them as one word, perhaps by defining the separator character? The character codes for space and half-space are not the same.
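
One way to rejoin such words after extraction could look like the sketch below, written in Python (not VB6, purely for illustration). It makes two assumptions that are not confirmed in this thread: that the half-space is the zero-width non-joiner (U+200C), and that this character survives extraction attached to one of the two fragments. The function name and the sample fragments are hypothetical.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, assumed here to be the half-space


def merge_half_space_words(words):
    """Join neighbouring fragments that are connected by a half-space."""
    merged = []
    for word in words:
        # Glue this fragment to the previous one if a half-space sits between them
        if merged and (merged[-1].endswith(ZWNJ) or word.startswith(ZWNJ)):
            merged[-1] += word
        else:
            merged.append(word)
    return merged


fragments = ["\u0645\u06cc" + ZWNJ, "\u0631\u0648\u0645", "\u062e\u0627\u0646\u0647"]
print(len(merge_half_space_words(fragments)))  # 2
```

If the half-space is dropped during extraction, this approach has nothing to key on, and the word positions would have to be compared instead.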


Posted By: Ingo
Date Posted: 09 Sep 19 at 7:10PM
Hi again,


I have a few code snippets for you showing how to deal with Unicode and integer values in VB6:

module1.bas
-----------

Attribute VB_Name = "Module1"

Public Declare Function functionname1 Lib "function.dll" (ByVal parameter As String) As Integer 
Public Declare Function functionname2 Lib "function.dll" (ByVal parameter1 As String, ByVal parameter2 As Integer) As Long ' The returned string content

Public Declare Function apiLStrCopyW Lib "kernel32.dll" Alias "lstrcpyW" (ByVal lpString1 As Long, ByVal lpString2 As Long) As Long
Public Declare Function apiLStrLenW Lib "kernel32.dll" Alias "lstrlenW" (ByVal lpString As Long) As Long

Public Function GetStringFromPtrW(ByVal ptr As Long) As String
  'create a matching buffer
  GetStringFromPtrW = String$(apiLStrLenW(ptr), 0)
  'copying the string into the buffer
  apiLStrCopyW StrPtr(GetStringFromPtrW), ptr
End Function
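
For readers who want to see the same idea outside VB6, here is a small Python sketch (purely for illustration) of what GetStringFromPtrW does: it reads a null-terminated wide string starting at a raw memory address. The function name is hypothetical, and the buffer below merely stands in for memory that a DLL would return. The width of a wide character is platform-specific; on Windows it is 16 bits.

```python
import ctypes


def string_from_wide_ptr(address: int) -> str:
    """Read a null-terminated wide string starting at the given address."""
    # wstring_at scans up to the terminating null, much like
    # measuring with lstrlenW and then copying with lstrcpyW
    return ctypes.wstring_at(address)


# Stand-in for a string owned by a DLL: a wide-character buffer in memory
buffer = ctypes.create_unicode_buffer("\u0633\u0644\u0627\u0645")
address = ctypes.addressof(buffer)

print(string_from_wide_ptr(address) == "\u0633\u0644\u0627\u0645")  # True
```
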

form1.frm
---------

VERSION 5.00
Begin VB.Form Form1
   Caption         =   "vb6-sample - ..."
   ClientHeight    =   5475
   ClientLeft      =   45
   ClientTop       =   435
   ClientWidth     =   7365
   LinkTopic       =   "Form1"
   ScaleHeight     =   5475
   ScaleWidth      =   7365
   StartUpPosition =   3  'Windows Default
   Begin VB.CheckBox Check7

"  . . .

Public r As String

Private Sub Command1_Click()
  Dim sPfad() As Byte

    sPfad = StrConv(Text1.Text, vbUnicode)
   
    Text7.Text = Str(functionname1(sPfad))

End Sub

Private Sub option1_Click()
  Dim sPfad() As Byte
  Dim tPfad() As Byte
  Dim title() As Byte
  Dim sp As Integer
" . . .
    
    If Check1.Value = 1 Then
       sp = 1
      Else
       sp = 0
    End If
"   . . .



-------------
Cheers,
Ingo



