Maker Pro

OT: Question for XP Users

Joerg

Jan 1, 1970
Hello Mike,

Naturally, the smaller app note loads faster. Browsing and searching are
also improved, so these vendors' files tend to get stored on my hard disk.

Searching is a huge downside of PDF. You cannot do a disk search for,
say, "N-Channel" + "logic level". I guess that's why Google translates
them all into HTML. Else it wouldn't find anything.

Regards, Joerg
 
Roy L. Fuchs

Jan 1, 1970
Hello Mike,


Searching is a huge downside of PDF. You cannot do a disk search for,
say, "N-Channel" + "logic level". I guess that's why Google translates
them all into HTML. Else it wouldn't find anything.

That depends ENTIRELY on how the pdf gets constructed.

Try again.
 
mike monett

Jan 1, 1970
[...]
Searching is a huge downside of PDF. You cannot do a disk search for,
say, "N-Channel" + "logic level". I guess that's why Google translates
them all into HTML. Else it wouldn't find anything.

Regards, Joerg

So how do you find stuff on your hard disk? Do you store things in certain
directories, such as VCO, ADC, PWM, etc? If so, what do you do when an
article falls in more than one category?

I'll bet some people have huge archives, like Win or Jim. I wonder how they
ever manage to find things buried in pdf files.

Regards,

Mike Monett
 
kai-martin knaak

Jan 1, 1970
wrote:
I'll bet some people have huge archives, like Win or Jim. I wonder how
they ever manage to find things buried in pdf files.

They might use indexing search engines like beagle. This is like google on
your hard disk.

---<(kaimartin)>---
 
Joerg

Jan 1, 1970
Hello Georg,

IIRC that operates similarly to Google. It does not search the pdf directly
but first strips out the text and generates a text file for each pdf file. Then all
these will be stored (taking up extra space, of course) and these text
files are what's searched.
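
A minimal sketch of that "strip to text, then search the text" idea, in Python. It assumes the `pdftotext` command-line tool (shipped with xpdf) is installed and on the PATH. The function names and the sidecar directory layout are made up for illustration and are not taken from any particular desktop search product.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def extract_sidecar(pdf: Path, out_dir: Path) -> Path | None:
    """Write the text of one PDF to out_dir/<name>.txt, or return None
    if pdftotext is not installed (assumption: xpdf's pdftotext)."""
    if shutil.which("pdftotext") is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    txt = out_dir / (pdf.stem + ".txt")
    subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)
    return txt


def search_sidecars(out_dir: Path, *phrases: str) -> list[str]:
    """Return the names of sidecar files containing every phrase
    (case-insensitive substring match)."""
    wanted = [p.lower() for p in phrases]
    hits = []
    for txt in sorted(out_dir.glob("*.txt")):
        body = txt.read_text(errors="ignore").lower()
        if all(p in body for p in wanted):
            hits.append(txt.name)
    return hits
```

With the sidecar files in place, `search_sidecars(Path("sidecars"), "N-Channel", "logic level")` would list every extracted file that mentions both phrases. The extra disk space goes to the `.txt` copies.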

Might as well store all that as HTML, then there is no need to convert
for searches.

Regards, Joerg
 
mike monett

Jan 1, 1970
Georg Baum said:

Thanks for the reply, Georg.

Hmm... 78 megabytes of pdf files results in a 27 megabyte index. Not very
economical. Why index the entire pdf file? Why not just get some key words
and phrases that are unique to that pdf file. Also, how does it index
scanned image files that have no text? How does google search these pdf
files?

I wonder if xpdf handles recent pdf versions. I tried it a while ago and it
croaked on acrobat versions 4.0 and above.

It looks like there may be some room for improvement:)

Regards,

Mike Monett
 
mike monett

Jan 1, 1970
kai-martin knaak said:


They might use indexing search engines like beagle. This is like google on
your hard disk.

---<(kaimartin)>---

Thanks Kai,

A quick search located a wiki article that has more information and some
links:

http://en.wikipedia.org/wiki/Desktop_search

I'll check this out when I get some time.

Thanks!

Mike Monett
 
Joerg

Jan 1, 1970
Hello Mike,

Hmm... 78 megabytes of pdf files results in a 27 megabyte index. Not very
economical. Why index the entire pdf file? Why not just get some key words
and phrases that are unique to that pdf file. ...


That is risky. You never know in advance what you'd be searching for
some day. Sometimes you kind of remember a hint or maybe just a
reference quoted at the end of the text that you really want to find
back. With just a few stored keywords it's gone.
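
The trade-off can be illustrated with a small Python sketch. It compares a full-text inverted index against an index that keeps only each document's most frequent words. The sample document, the function names, and the "top 3 words" rule are all invented for illustration.

```python
from __future__ import annotations

import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z0-9]+")


def tokens(text: str) -> list[str]:
    """Split text into lowercase alphanumeric words."""
    return WORD.findall(text.lower())


def full_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in tokens(text):
            index[word].add(name)
    return dict(index)


def keyword_index(docs: dict[str, str], top_n: int = 3) -> dict[str, set[str]]:
    """Map only each document's top_n most frequent words to it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word, _count in Counter(tokens(text)).most_common(top_n):
            index[word].add(name)
    return dict(index)


# Hypothetical document: the rare word sits in a reference at the end.
docs = {
    "vco_note": "oscillator phase noise oscillator noise oscillator phase "
                "see hajimiri reference",
}

full = full_index(docs)
small = keyword_index(docs)
print("hajimiri" in full, "hajimiri" in small)
```

The keyword index is smaller, but the half-remembered name from the reference list can only be found through the full index.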

... How does google search these pdf files?

AFAIK they convert the whole thing to HTML and then index. That is why
instead of the "cached" link at the bottom you'll see a "View as HTML"
link on the 2nd line.

Text in graphics or images is often lost for indexing purposes.

Regards, Joerg
 
Chuck Harris

Jan 1, 1970
mike said:
I wonder if xpdf handles recent pdf versions. I tried it a while ago and it
croaked on acrobat versions 4.0 and above.

I was just using it on some files that were made for Acrobat 6, and it worked
just fine. The only place I have a problem with xpdf is it doesn't know what
to do with high density pdf pages that are sent to my postscript printer. The
file goes to the printer, but nothing gets printed. For those, I have to use
ghostview.

-Chuck
 
Spehro Pefhany

Jan 1, 1970
Hello Mike,




That is risky. You never know in advance what you'd be searching for
some day. Sometimes you kind of remember a hint or maybe just a
reference quoted at the end of the text that you really want to find
back. With just a few stored keywords it's gone.

Of course, if you have the software, you should run Catalog to create
searchable full-text indexes of groups of PDFs.


Best regards,
Spehro Pefhany
 
mike monett

Jan 1, 1970
Joerg said:
Hello Mike,




That is risky. You never know in advance what you'd be searching for
some day. Sometimes you kind of remember a hint or maybe just a
reference quoted at the end of the text that you really want to find
back. With just a few stored keywords it's gone.

Nah. Most articles are rehashes of info that has already been rehashed a
thousand times. There may be one paragraph or idea in the entire article
that catches your attention. I use a few key words or phrases from that
section for the search index. Much faster.

There are few writers with entirely new approaches to something, such as
Hajimiri and Lee on low noise colpitts oscillators. These people get a
whole directory to themselves where I can feast on their insights with no
distractions from lesser authors.

AFAIK they convert the whole thing to HTML and then index. That is why
instead of the "cached" link at the bottom you'll see a "View as HTML"
link on the 2nd line.

Text in graphics or images is often lost for indexing purposes.

That's what I'm talking about. Google has some way of extracting text from
scanned pdf files. They give a sentence or two containing your search terms
so you can see if you want to download the file. When you get it
downloaded, it is generally much larger than normal, and the text selection
tool won't work. It's a scanned image.

So how does google extract the text from these files? Do they have some
kind of super OCR software that handles fuzzy and broken characters scanned
at an angle in arbitrary fonts with flyspecks and dirt all over the page?
Where can I download this software?

Best,

Mike Monett
 
mike monett

Jan 1, 1970
[...]
I was just using it on some files that were made for Acrobat 6, and it worked
just fine. The only place I have a problem with xpdf is it doesn't know what
to do with high density pdf pages that are sent to my postscript printer. The
file goes to the printer, but nothing gets printed. For those, I have to use
ghostview.

Thanks, Chuck. That's good news - I'll go download it and give it another
try.

Regards,

Mike Monett
 
Spehro Pefhany

Jan 1, 1970
That's what I'm talking about. Google has some way of extracting text from
scanned pdf files. They give a sentence or two containing your search terms
so you can see if you want to download the file. When you get it
downloaded, it is generally much larger than normal, and the text selection
tool won't work. It's a scanned image.

Try *search* on such a file. I'll bet you'll find that the scanned
image is overlaid over an uncorrected OCR file created by Adobe
Capture. I don't think it's anything to do with Google.
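
One way to check whether a scanned-looking PDF carries such a hidden text layer is to run a text extractor over it and see whether anything comes back. The Python sketch below assumes xpdf's `pdftotext` is installed. The function names and the 20-character threshold are arbitrary choices for illustration, not a standard.

```python
from __future__ import annotations

import shutil
import subprocess


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: the PDF counts as searchable if extraction yields at
    least min_chars letters or digits. A pure image scan typically
    yields only whitespace and page breaks."""
    return sum(ch.isalnum() for ch in extracted) >= min_chars


def extract_text(pdf_path: str) -> str | None:
    """Return whatever text pdftotext finds, or None if the tool is
    not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

If `has_text_layer(extract_text("scan.pdf") or "")` comes back true for a file whose text selection tool appears dead, the page image is probably sitting on top of OCR text. Here `scan.pdf` is a placeholder file name.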


Best regards,
Spehro Pefhany
 
Joerg

Jan 1, 1970
Hello Mike,

Nah. Most articles are rehashes of info that has already been rehashed a
thousand times. ...


Not data sheets. Those are full of crucial information. Also some WDF
papers that I haven't found in any other format than pdf.

So how does google extract the text from these files? Do they have some
kind of super OCR software that handles fuzzy and broken characters scanned
at an angle in arbitrary fonts with flyspecks and dirt all over the page?
Where can I download this software?

Don't know but even Acrobat contains a graphics and a text tool. With
that you can highlight the desired range, do a CTRL-C and whoopdidou
plant it into your text file. After that it's searchable. I guess this
could be automated and they must have done something like that.

OCR is another option when the text tool won't work. Even back in the
early 90's I had an OCR package that was pretty good in deciphering
noisy trans-atlantic faxes. Often this was the only option to exchange
information with a co-author of a paper from a few thousand miles away.
Internet wasn't everywhere back then and modems weren't either. So it
was either OCR or typing it all off again.

Regards, Joerg
 
mike monett

Jan 1, 1970
Spehro Pefhany said:
Try *search* on such a file. I'll bet you'll find that the scanned
image is overlaid over an uncorrected OCR file created by Adobe
Capture. I don't think it's anything to do with Google.


Darn it, Speff, can't you wait til I get my chores done:)

I mentioned searching these files for text strings doesn't work. It tries
but quits instantly with no results.

Looking for an example, I tried searching google for a pdf article that I
downloaded long ago: Elaine Balliew, "The Challenge of Testing ADSL
Modems", Evaluation Engineering, October 1998. The copy I have is
definitely scanned with no text at all. Unfortunately, the only copy left
on the web is a text version stripped of all images. (Not that they were
that important.)

http://www.evaluationengineering.com/archive/articles/1098adsl.htm

So I'll never know why I downloaded the article - whether it was her wit
and charm, or the results of a google search, or if it was simply linked in
another long lost article:)

Anyway, I should never use google until all the dishes are washed and
everything is put away. It is simply too easy to get lost in very
interesting pages that you never seem to hit when you have lots of time to
spend. Here's one that might interest hams - low noise single-signal direct
conversion receivers

http://lekstutis.com/Artie/Ham/Projects/RT2.html

Lots of good links at the bottom, including one to Analog Devices AN-741,
"Little Known Characteristics of Phase Noise". Of course the link doesn't
work anymore - ADI has gone nuts with their file naming conventions.
Instead of simply calling it AN741.pdf, the file is now

http://www.analog.com/UploadedFiles/Application_Notes/54275316215330254506699244016AN741_0.pdf

Safely tucked away while I do the dishes...

Regards,

Mike Monett
 
Georg Baum

Jan 1, 1970
mike said:
Hmm... 78 megabytes of pdf files results in a 27 megabyte index. Not very
economical.

No. But still better than no search possibility at all.

Why index the entire pdf file? Why not just get some key words
and phrases that are unique to that pdf file. Also, how does it index
scanned image files that have no text?

That depends on the pdf. For example, some of the older papers you can get
at ieeexplore.ieee.org are scanned images of the original journal pages,
but they did run OCR over that when creating the pdfs and stored a
text-only version internally as well. I am not sure, but I think that
pdfsearch evaluates that.

How does google search these pdf
files?

I don't know, but I think it does not do that either.

I wonder if xpdf handles recent pdf versions. I tried it a while ago and
it croaked on acrobat versions 4.0 and above.

It does work here also with newer pdfs.

It looks like there may be some room for improvement:)

Definitely.


Georg
 
Joerg

Jan 1, 1970
Hello Mike,

Anyway, I should never use google until all the dishes are washed and
everything is put away. ...


Think about investing in a dishwasher. It's them big boxes that hiss and
whoosh and clang and buzz :)

SCNR, Joerg
 