Maker Pro

wget'ing complete web pages/sites

Don Y

Hi,

I've been downloading driver sets for various laptops,
etc. from manufacturer web sites. This is a royal PITA.
Why isn't there a "download all" button? (or, an ftp
server with directories per product that could be
pulled down in one shot!)

Anyway, I have tried a couple of utilities that claim
to be able to do this -- with little success. With no
first-hand experience *building* web pages/sites, I can
only guess as to what the problem is:

Instead of static links on the page, everything hides
behind JS (?). And, the tools I have used aren't clever
enough to know how to push the buttons?

[or, they don't/can't capture the cookie that must exist
to tell the site *what* I want (product, os, language)]

Is there a workaround for this? It will be an ongoing
effort for me for several different models of PC and laptop
so I'd love to have a shortcut -- ordering restore CD's
means a long lag between when I get the machines and when
I can finish setting them up. I'd like NOT to be warehousing
stuff for a nonprofit! (SWMBO won't take kindly to that!)

Thx,
--don
 
miso

Anyway, I have tried a couple of utilities that claim
to be able to do this -- with little success.

If wget doesn't work, you will need to write your own crawler. Some
websites are complicated enough that wget won't work. Wget isn't very
good for dynamic data.

Find someone competent in BeautifulSoup or learn it yourself.
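
A minimal sketch of that kind of scrape, assuming the driver page
serves plain <a href> links (the URL and the .exe/.zip filter below
are made up, and JS-generated links won't show up at all):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "http://support.example.com/model1234/drivers"  # hypothetical
html = requests.get(page_url).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    link = urljoin(page_url, a["href"])           # resolve relative links
    if link.lower().endswith((".exe", ".zip")):   # keep only likely driver files
        print(a.get_text(strip=True), link)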
 
Don Kuenz

In sci.electronics.design Don Y said:
Hi,

I've been downloading driver sets for various laptops,
etc. from manufacturer web sites. This is a royal PITA.
Why isn't there a "download all" button? (or, an ftp
server with directories per product that could be
pulled down in one shot!)

FWIW I've gotten a fair bit of mileage out of Dell's hybrid HTTP/FTP
site: http://ftp.dell.com/published/Pages/index.html

OT - it surprised me to see Microsoft's FTP site still open after all
these decades: ftp://ftp.microsoft.com/
 
Don Y

On 11/17/2013 10:41 PM, miso wrote:
If wget doesn't work, you will need to write your own crawler. Some
websites are complicated enough that wget won't work. Wget isn't very
good for dynamic data.

Find someone competent in BeautifulSoup or learn it yourself.

The implication is that this is an exercise I would have to repeat
for each manufacturer's site? :<

I think one of the tools I have will let me browse-and-save (at
least that saves a bunch of keyclicks for each download!)
 
Don Y

Hi Jeff,

Because if they did that, there would only be one opportunity to sell
you something via the ads. By forcing you to come back repeatedly to
the same page, there are more opportunities to sell you something.
Even if they have nothing to sell, the web designers might want to
turn downloading into an ordeal process so that the "click" count is
dramatically increased.

A wee bit cynical, eh Jeff? :>

[Actually, I suspect the problem is that such a Big Button would
end up causing folks to take the lazy/safe way out -- too often.
And, their servers see a bigger load than "necessary".

Besides, most vendors see no cost/value to *your* time! :-/ ]
Sorry, I don't have a solution. JavaScript and dynamic web content
derived from an SQL database are not going to work. CMSes (content
management systems) are also difficult to bypass.

That's what I figured when I took the time to look at the page's
source. :<
I use WinHTTrack
instead of wget for downloading sites. It tries hard, but usually
fails on CMS built sites:
<http://www.httrack.com>

That was the first option I tried. It pulled down all the "fluff"
that I would have ignored -- and skipped over the meat and potatoes!
See the FAQ under Troubleshooting for clues:
<http://www.httrack.com/html/faq.html>
However, even if you find the manufacturer's secret stash of drivers,
they usually have cryptic names that defy easy identification. I once
did this successfully, and then spent months trying to identify what I
had just accumulated.

In the past, I've just invoked each one and cut-pasted some banner
that the executable displays, using that as the new file name (putting
the old name in parens).

HP's site is particularly annoying. E.g., all the *documents* that
you download are named "Download.pdf" (what idiot thought that would
be a reasonable solution? Download two or more documents and you
have a name conflict! :< )
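
(A tiny sketch of one way around that collision, purely illustrative:
append a counter until the name is free.)

import os

def unique_name(path):
    # "Download.pdf" -> "Download (1).pdf" -> "Download (2).pdf" ...
    base, ext = os.path.splitext(path)
    candidate, n = path, 1
    while os.path.exists(candidate):
        candidate = f"{base} ({n}){ext}"
        n += 1
    return candidate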

Looks like I'll go the browse-and-save route I described in my other
reply :-/
If the manufacturer has an FTP site, you might try snooping around the
public section to see if the driver files are available via ftp.

Yes, but those sites tend to not present the "cover material" that goes
with each download. Release notes, versioning info, prerequisites,
etc. So, you have no context for the files...

Maybe I'll "take the survey" -- not that it will do any good!
("Buy our driver CD!" "Yes, but will it have the *latest* drivers
for the OS I want? And, all the accompanying text? And, do I
get a discount if I purchase them for 15 different models??")
 
miso

The implication is that this is an exercise I would have to repeat
for each manufacturer's site? :<

I think one of the tools I have will let me browse-and-save (at
least that saves a bunch of keyclicks for each download!)

Yes, the software has to be tweaked per site. Somebody good at
BeautifulSoup can crank it out quickly. The code can be a few dozen
lines, but if you don't know the fu, it is a monumental task.
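
For what a "few dozen lines" might look like once the links are in
hand, here is a sketch that just fetches each one and saves it under
its original file name (the URL list and output directory are
invented):

import os
import requests
from urllib.parse import urlparse

driver_urls = [
    "http://support.example.com/files/x1234h66.exe",  # hypothetical
]
out_dir = "drivers"
os.makedirs(out_dir, exist_ok=True)

for url in driver_urls:
    name = os.path.basename(urlparse(url).path) or "unnamed"
    with requests.get(url, stream=True) as r:  # stream: don't hold 200MB in RAM
        r.raise_for_status()
        with open(os.path.join(out_dir, name), "wb") as f:
            for chunk in r.iter_content(chunk_size=65536):
                f.write(chunk)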
 
George Neuner

Instead of static links on the page, everything hides
behind JS (?). And, the tools I have used aren't clever
enough to know how to push the buttons?

I use HTTrack when I need to clone (parts of) sites. It's available
for both Windows and Linux. http://www.httrack.com/

I don't know exactly how smart it is, but I've seen it reproduce some
pretty complicated pages. It can scan scripts and it will grab files
from any links it can find. It's also pretty well configurable.

George
 
Tauno Voipio

I use HTTrack when I need to clone (parts of) sites. It's available
for both Windows and Linux. http://www.httrack.com/

I don't know exactly how smart it is, but I've seen it reproduce some
pretty complicated pages. It can scan scripts and it will grab files
from any links it can find. It's also pretty well configurable.


The OP has already lost the game. It is obvious that the owner
of the website does not want automatic vacuuming of the data.

If there were a downloader that could handle the scripts, the website
owners would resort to CAPTCHAs, which are intended to thwart
automated tools.
 
Don Y

Hi Tauno,


I think the problem is that the page is "built" on the fly.
The OP has already lost the game. It is obvious that the owner
of the website does not want automatic vacuuming of the data.

I don't think that's the case.

Imagine you were producing some number of PC's that support
some number of OS's in some number of languages with some
number of updates to some number of subsystems...

Undoubtedly, you would store this information in a configuration
management system/DBMS somewhere. It would allow you to
"mechanically" indicate the interrelationships between
updates, etc.

An update would have a specific applicability, release notes, etc.

Why create hundreds (more like tens of thousands when you
consider each download has its own descriptive page -- or 5)
of web pages when you can, instead, create a template that you
fill in based on the user's choices? (model, OS, language)

Issue a bunch of specific queries to the DBMS/CMS, sort the
results (by driver category) for the given (OS, language, model)
and stuff the results into a specific form that you repeat on
every page!

I.e., a driver that handles 4 particular languages would
magically appear on the pages for those four languages -- and
no others. To be replaced by something else as appropriate
on the remaining pages.

Sure, you could generate a static page from all this and present
*that* to the user. But, why bother? What do *you* gain?
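
If a page really is built that way, the same selections can sometimes
be replayed from a script as ordinary request parameters. The endpoint
and parameter names here are pure guesses, just to illustrate the idea:

import requests

# hypothetical endpoint and parameter names
params = {"model": "FooTastic-123", "os": "win7-64", "lang": "en"}
resp = requests.get("http://support.example.com/drivers/list", params=params)
print(resp.url)          # the URL the site would have synthesized
print(resp.status_code)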
 
R.Wieser

Hello Don,
And, the tools I have used aren't clever enough to know
how to push the buttons?

You could take a look at a program called AutoIt, though I'm not sure
it's available on all platforms (I have been using it on Windows).

Although it originated as a simple "record and replay" tool, it has
become quite versatile, enabling you to script mouse clicks dependent
on data you read from web pages (as long as the browser has an
interface to do so, of course).

Hope that helps,
Rudy Wieser
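
In the same spirit, but scripted from Python rather than AutoIt, a
browser-automation library such as Selenium can push the buttons on a
JS-driven page and then read whatever links appear. The page URL,
element IDs, and dropdown text below are all invented:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()                        # needs geckodriver installed
driver.get("http://support.example.com/model1234")  # hypothetical page
Select(driver.find_element(By.ID, "os")).select_by_visible_text("Windows 7 64-bit")
driver.find_element(By.ID, "show-drivers").click()  # hypothetical button id
for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.exe']"):
    print(a.text, a.get_attribute("href"))
driver.quit()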


 
George Neuner

The OP has already lost the game. It is obvious that the owner
of the website does not want automatic vacuuming of the data.

Not necessarily.

Unless there are protected directories, HTTrack can grab everything:
HTML, scripts, linked files/resources ... everything. It will even
work cross-site [though not by default].

Sometimes you have to do a bit of work figuring out the site structure
before you can configure HTTrack to clone it. More often than not,
the difficulty with HTTrack is that it grabs *more* than you want.

George
 
Don Y

Hi Jeff,

That hasn't been a problem for me because I only read the docs after I
screw things up.

What I am *most* interested in is the "cover page" for each download.
It's the simplest way to map some bogus file name (x1234h66.exe) to
a description of the file ("nVidia graphic driver for FooTastic 123,
version 978 11/15/2013"). Too much hassle to type or cut/paste
this sort of stuff, otherwise!

(and file headers often aren't consistent about presenting these
"annotations" -- so, you end up having to *invoke* each file,
later, to figure out what it is supposed to do...)
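
One way to build that mapping, sketched with BeautifulSoup; the cover
page URL is invented, and the page title is used as the description,
which may or may not hold for a given vendor:

import csv
import requests
from bs4 import BeautifulSoup

# bogus file name -> its "cover page" (both hypothetical)
downloads = {
    "x1234h66.exe": "http://support.example.com/detail/x1234h66",
}

with open("driver_index.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "description"])
    for fname, cover_url in downloads.items():
        soup = BeautifulSoup(requests.get(cover_url).text, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        writer.writerow([fname, title])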
 
Don Y

Hi Don,

FWIW I've gotten a fair bit of mileage out of Dell's hybrid HTTP/FTP
site: http://ftp.dell.com/published/Pages/index.html

OT - it surprised me to see Microsoft's FTP site still open after all
these decades: ftp://ftp.microsoft.com/

The problem is finding a "list" of pertinent downloads (along
with descriptions instead of just file names) for a specific
product/os/language.

I.e., "these are ALL the files you will (eventually) need
to download if you are building this product to run this os in
that language. Then, ideally, fetch them all for you!

For an FP directory PER PRODUCT/OS/LANGUAGE, this would be easy;
just copy the entire directory over! (IE sems to be able to
do this easily -- along with many other products -- Firefox
seems to insist on "file at a time")

E.g., ages ago, a web page would list individual files and have
static links to the files, their descriptions, release notes,
etc. So, you could point a tool at such a page and say, "resolve
every link on this page and get me whatever is on the other end!"

This doesn't appear to be the case, anymore.

E.g., MS has all (most!) of their "updates" individually accessible,
with documentation, WITHOUT going through the update service.
Nothing "secret", there.

But, finding them all listed in one place (so you could "get all")
is a different story!

I had found a site that had done this -- static links to each
update/knowledge base article. All pointing to PUBLIC urls on
MS's servers. I figured this would be an excellent asset to
use to pull down ALL the updates to, e.g., XP before it goes
dark!

Unfortunately, it appears the feds didn't like the site! Nothing
that *I* can see was wrong with the page I was looking at (all
LINKS, and all to MS, not some pirate site). But, I have no idea
what else may have been on the site; or, hosted by the same *server*!

<frown>

I'll now see if I have a local copy of the page and see if I can
trick HTTrack into fetching them from a File: URL!
 
Don Y

Yes, the software has to be tweaked per site.

Possibly "per page" -- if there is any inconsistency from page (product)
to page (product)?
Somebody good at
BeautifulSoup can crank it out quickly. The code can be a few dozen
lines, but if you don't know the fu, it is a monumental task.

Not a viable option, then. Seems like the easiest would be a
tool that you can use in "follow me" mode -- *point* to stuff
and let it worry about the downloads.
 
Don Y

Here is an example of a BS program, well, actually Python using BS.
It is a 14-line program to find every URL in a website.

But, that really isn't much better than just grep(1)-ing the HTML.
When you scrape, you don't want to see the website as it is presented to
the human viewing the browser, you just want the goodies. That is why
the program is somewhat tweaked to the website, once you figure out how
they store the data.

And, you need to be smart enough NOT to pull down the same content
multiple times! E.g., if there are multiple links to a 200MB file,
you only want *one* copy of it.
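
A sketch of that bookkeeping: remember every URL already fetched
(with "#fragment" anchors stripped, since they point at the same file)
and skip repeats:

from urllib.parse import urldefrag

seen = set()

def should_fetch(url):
    clean, _fragment = urldefrag(url)
    if clean in seen:
        return False
    seen.add(clean)
    return True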

Apparently, the sites I've been hitting recently synthesize the
URLs dynamically. Anything wanting to scrape the page would
have to invoke the JS methods for each "button", dropdown, etc.

I think that would be asking a lot from a tool. And, probably
result in *getting* more than you really want from the site!
(do I *really* want every language variant of every OS supported
on a particular product?? :< )

HTTrack does a reasonably good job with "traditional" pages
(though getting the "depth" right is tricky)

An amusing trick -- after the fact -- is to right-click the
"object" in the "Downloads" list (i.e., after or during the
download) and select "Copy URL". This allows you to examine
the URL that was ultimately invoked/transferred.

(Speaking in terms of Firefox, here)
 