Problem with text extraction


Hi all,
I am trying to extract text from this PDF (on Mac):

http://research.microsoft.com/pubs/145347/bodypartrecognition.pdf

These are the results of my current efforts:

[pdfText appendString:[DQPL ExtractFilePageText:pdfFilePath :@"" :nPage+1 :0]];
-> crashes with an unmapped memory exception deep in the Debenu code. Options 7 and 8 work, but return only part of the text; roughly half of it is missing.

[pdfText appendString:[DQPL ExtractFilePageText:pdfFilePath :@"" :nPage+1 :5]];

Returns only half of the text on the page. The last line of the CSV output is only partially written to the string, which makes me think it silently crashes, although the code apparently runs fine and outputs to the file. The same is true for options 3, 4 and 6.

Using the following code:

        int textblockID = [DQPL ExtractFilePageTextBlocks:pdfPath :@"" :1 :3];
        int count = [DQPL GetTextBlockCount:textblockID];        
        for(int i=0;i<count;i++){
            NSString *line = [DQPL GetTextBlockText:textblockID :i+1];
            NSLog(@"Page 1 block %d = %@", i+1, line);
        }

I can extract only every alternate text block on the page, although the loop does manage to reach the end of the page. So again half is missing, but a different half than before!

Any pointers on what I might be doing wrong would be greatly appreciated! The PDF renders fine, which makes me think that the problem is in the text extraction code.

On a related note: is it possible to get glyph information _before_ it is put into blocks? E.g. the CSV output on an individual glyph basis, without any processing? I think that would be very useful.

Cheers, R.

Posted by rstojnic (Location: United Kingdom) on 23 Nov 13 at 8:56PM