Debenu Quick PDF Library - PDF SDK Community Forum Homepage
How to distinguish between hidden text and display

BJhuizhi (Beginner)
Joined: 25 Mar 22
Location: china
Points: 7
Posted: 25 Mar 22 at 8:45AM
Hi,
In some PDFs, text that is hidden and not displayed can still be obtained with DAGetTextBlockText. How can I distinguish which text is displayed and which is hidden?

Thanks!

tfrost (Senior Member)
Joined: 06 Sep 10
Location: UK
Points: 437
Posted: 26 Mar 22 at 12:22PM
It all depends on how it is hidden. For example you could retrieve the text block bounds and determine if it is outside the Media, Crop or Trim box (depending on the type of PDF), or it might be hidden by setting the text colour to the same as the background colour, or hidden by an overlaid image or shape. Unless you control the PDFs yourself, you might have to test multiple things.
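
As a rough illustration of two of these tests, here is a minimal sketch in plain Pascal. It makes no library calls: TBox, MakeBox, IsOutsideBox and SameColour are hypothetical helper names, and the page box and colour values are made-up examples. It assumes you have already read the text block bounds, the page box and the colours by whatever means your library offers. It does not cover text hidden under an overlaid image or shape.

```pascal
program HiddenTextChecks;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Axis-aligned rectangle in PDF user space (Y grows upward).
  TBox = record
    Left, Bottom, Right, Top: Double;
  end;

// Hypothetical helper: builds a TBox from its four edges.
function MakeBox(ALeft, ABottom, ARight, ATop: Double): TBox;
begin
  Result.Left := ALeft;
  Result.Bottom := ABottom;
  Result.Right := ARight;
  Result.Top := ATop;
end;

// True when the block does not intersect the page box at all,
// i.e. the text lies completely outside the Media/Crop/Trim box.
function IsOutsideBox(const Block, PageBox: TBox): Boolean;
begin
  Result := (Block.Right <= PageBox.Left) or (Block.Left >= PageBox.Right) or
            (Block.Top <= PageBox.Bottom) or (Block.Bottom >= PageBox.Top);
end;

// True when two RGB colours (components in 0..1) differ by at most
// Tolerance in every channel, e.g. white text on a white background.
function SameColour(R1, G1, B1, R2, G2, B2, Tolerance: Double): Boolean;
begin
  Result := (Abs(R1 - R2) <= Tolerance) and
            (Abs(G1 - G2) <= Tolerance) and
            (Abs(B1 - B2) <= Tolerance);
end;

var
  CropBox: TBox;
begin
  // Example value only: an A4 page measured in points.
  CropBox := MakeBox(0, 0, 595, 842);

  // A block inside the page, then a block far to the left of it.
  WriteLn(IsOutsideBox(MakeBox(100, 700, 300, 720), CropBox));
  WriteLn(IsOutsideBox(MakeBox(-200, 700, -50, 720), CropBox));

  // White text compared with a white background.
  WriteLn(SameColour(1, 1, 1, 1, 1, 1, 0.01));
end.
```

A block that only partly overlaps the page box counts as visible here; whether that is the right rule depends on your PDFs.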

Ingo (Moderator Group)
Joined: 29 Oct 05
Points: 3524
Posted: 26 Mar 22 at 2:03PM
Hi,

First call CombineContentStreams, then NormalizePage, and then do the extraction.
Try the normal GetPageText with the "csv-extraction" option. Then you can see the text bounds and positions, and you'll know why the extraction works or not.

Cheers and welcome here,
Ingo

BJhuizhi (Beginner)
Posted: 28 Mar 22 at 2:47AM
Hello Ingo,

Using the method you mentioned, the text blocks returned by DAGetPageText with the "csv-extraction" option also overlap in position and have the same text color. There is no information that distinguishes the displayed text from the hidden text.

Ingo (Moderator Group)
Posted: 28 Mar 22 at 4:28PM
You should upload the relevant PDF to free web hosting space, Dropbox or anywhere else and post the link here.
Then we can try it ourselves.
Before that, please explain in detail what result you need and what result you get.

Cheers,
Ingo

BJhuizhi (Beginner)
Posted: 29 Mar 22 at 9:05AM
Hello,
http://124.205.131.133:8084/test2.pdf
test.pdf: I cannot distinguish the hidden text. The text blocks returned by DAGetPageText with the "csv-extraction" option also overlap in position and have the same text color.
When extracting the text content (see test.png), there is text content in the yellow areas, and the yellow blocks overlap.

test2.pdf: I can't get any text content with DAGetPageText.

Thanks!

Edited by BJhuizhi - 29 Mar 22 at 10:50AM

Ingo (Moderator Group)
Posted: 29 Mar 22 at 9:50AM
Sorry.
Nothing on the download site is readable for me.
In the end, clicking through blindly, I was asked to install a baidu.exe.
Perhaps someone else here can read it and proceed safely.

Cheers,
Ingo

BJhuizhi (Beginner)
Posted: 29 Mar 22 at 10:50AM
Hello,

http://124.205.131.133:8084/test2.pdf
test.pdf: I cannot distinguish the hidden text. The text blocks returned by DAGetPageText with the "csv-extraction" option also overlap in position and have the same text color.
When extracting the text content (see test.png), there is text content in the yellow areas, and the yellow blocks overlap.

test2.pdf: I can't get any text content with DAGetPageText.

Thanks!

Ingo (Moderator Group)
Posted: 29 Mar 22 at 9:04PM
Hi,

my first quick test shows me all the (for me) standard characters and the Chinese characters as well.
Which extraction option are you using?
I think that's the problem.
There are many; try options 7 and 8, the result should be better.

Cheers,
Ingo

BJhuizhi (Beginner)
Posted: 30 Mar 22 at 10:15AM
Hello,

Because the PDF file is relatively large, I use the direct access functionality to read the PDF file content:

  // Text extraction options (option ID, value)
  QukPdf.DASetTextExtractionOptions(1,0);
  QukPdf.DASetTextExtractionOptions(2,0);
  QukPdf.DASetTextExtractionOptions(3,0);
  QukPdf.DASetTextExtractionOptions(4,0);
  QukPdf.DASetTextExtractionOptions(5,0);
  QukPdf.DASetTextExtractionOptions(6,1);
  QukPdf.DASetTextExtractionOptions(7,1);
  QukPdf.DASetTextExtractionOptions(8,1);
  QukPdf.DASetTextExtractionOptions(9,0);
  QukPdf.DASetTextExtractionOptions(10,0);
  QukPdf.DASetTextExtractionOptions(11,1);
  QukPdf.DASetTextExtractionOptions(12,0);
  QukPdf.DASetTextExtractionOptions(13,1);
  QukPdf.DASetTextExtractionOptions(14,0);
  QukPdf.DASetTextExtractionOptions(15,1);

  // Restrict extraction to the area of interest, then extract the text blocks
  QukPdf.DASetTextExtractionArea(cpdfXy.Left,cpdfXy.Top,cpdfXy.Width,cpdfXy.Height);
  Pid:=QukPdf.DAExtractPageTextBlocks(FileHandle,PgRef,4);
  QukPdf.DASetTextExtractionWordGap(0.2);
  Pm:=QukPdf.DAGetTextBlockCount(Pid);
  if Pm>0 then
    begin
      SetLength(AXFInfoLst(InfS),Pm);
      Qid:=0;
      while Qid<Pm do
        begin
          Inc(Qid);  // blocks are addressed 1..Pm here, the array is 0-based
          // Font, bounds (scaled by RateW/RateH) and text of each block
          AXFInfoLst(InfS)[Qid-1].fnt.fSiz:=Round(QukPdf.DAGetTextBlockFontSize(Pid,Qid));
          AXFInfoLst(InfS)[Qid-1].fnt.fNam:=QukPdf.DAGetTextBlockFontName(Pid,Qid);
          AXFInfoLst(InfS)[Qid-1].XYPt.Left:=Round(QukPdf.DAGetTextBlockBound(Pid,Qid,7)/RateW);
          AXFInfoLst(InfS)[Qid-1].XYPt.Top:=Round(QukPdf.DAGetTextBlockBound(Pid,Qid,6)/RateH);
          AXFInfoLst(InfS)[Qid-1].XYPt.Right:=Round(QukPdf.DAGetTextBlockBound(Pid,Qid,5)/RateW);
          AXFInfoLst(InfS)[Qid-1].XYPt.Bottom:=Round(QukPdf.DAGetTextBlockBound(Pid,Qid,2)/RateH);
          AXFInfoLst(InfS)[Qid-1].WStr:=QukPdf.DAGetTextBlockText(Pid,Qid);
        end;
    end;


Edited by BJhuizhi - 30 Mar 22 at 10:18AM

Ingo (Moderator Group)
Posted: 30 Mar 22 at 1:04PM
The PDFs are small:
test1.pdf: 1 page, 722 embedded objects, 1.8 MB
test2.pdf: 1 page, 522 embedded objects, 0.8 MB

Cheers,
Ingo

BJhuizhi (Beginner)
Posted: 31 Mar 22 at 7:19AM
For test.pdf, in Quick PDF Library 18, how can I make DAGetTextBlockText skip the content that is not displayed?

For test2.pdf, in Quick PDF Library 18, how can I make DAGetTextBlockText extract the displayed content, or make DAGetPageImageList extract it as an image?

Ingo (Moderator Group)
Posted: 01 Apr 22 at 2:25PM
I'm not getting anywhere with this.
In my opinion test.pdf is okay.
There are different characters (standard and Chinese) and the extraction works.

test2.pdf has only this extractable content:
fontname;hex-color;font height;row;col;text
"AGXXCW+Arial-BoldMT";231F20;9;-520/769;"6"
"AGXXCW+HYb1gj";231F20;7;-495/766;"《税务研究》"
"BOHDQU+ArialMT";231F20;7;-453/767;"2020"
"AGXXCW+HYb1gj";231F20;7;-437/766;"年第"
"BOHDQU+ArialMT";231F20;7;-423/767;"12"
"AGXXCW+HYb1gj";231F20;7;-415/766;"期 "
The negative row values could be the problem as well...
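
To check such lines automatically, a minimal sketch in plain Pascal could look like this. It makes no library calls: FieldAt and ParsePosition are hypothetical helper names, and the sketch assumes the position is always the fourth ';'-separated field, written as two integers joined by '/', as in the lines above. A negative value only suggests that a block may lie outside the visible page; that depends on where the page boxes sit.

```pascal
program NegativePositionCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Returns the N-th (1-based) field of a ';'-separated line, or '' if missing.
// Assumption: the fields before the text never contain ';' themselves.
function FieldAt(const Line: string; N: Integer): string;
var
  I, Current, StartPos: Integer;
begin
  Result := '';
  Current := 1;
  StartPos := 1;
  for I := 1 to Length(Line) + 1 do
    if (I > Length(Line)) or (Line[I] = ';') then
    begin
      if Current = N then
      begin
        Result := Copy(Line, StartPos, I - StartPos);
        Exit;
      end;
      Inc(Current);
      StartPos := I + 1;
    end;
end;

// Splits a position such as '-520/769' into its two integer parts.
// Returns False when the field does not have that shape.
function ParsePosition(const Field: string; out First, Second: Integer): Boolean;
var
  SlashPos: Integer;
begin
  First := 0;
  Second := 0;
  SlashPos := Pos('/', Field);
  Result := (SlashPos > 1) and
            TryStrToInt(Copy(Field, 1, SlashPos - 1), First) and
            TryStrToInt(Copy(Field, SlashPos + 1, MaxInt), Second);
end;

const
  // First data line of the listing above.
  SampleLine = '"AGXXCW+Arial-BoldMT";231F20;9;-520/769;"6"';
var
  A, B: Integer;
begin
  if ParsePosition(FieldAt(SampleLine, 4), A, B) then
  begin
    WriteLn('position: ', A, ' / ', B);
    if (A < 0) or (B < 0) then
      WriteLn('negative coordinate: block may lie outside the visible page');
  end
  else
    WriteLn('no position found in line');
end.
```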

Opening both PDF files in Sumatra, Foxit or Adobe shows me a long index on the left with more than 140 pages.
But the readers offer only a single page to read; the index entries lead nowhere.

I'll stop here. Perhaps someone else can help. Sorry.

Cheers,
Ingo
