
commands to mirror entire websites effectively

To mirror virtually any site neatly and completely, use one of the command variations below, depending on the situation.

> wget

Full-speed crawl, when the host doesn't care or is powerful enough (most cases):

wget --mirror --convert-links --adjust-extension --page-requisites -r -p -e robots=off -U mozilla [URL WITH HTTP/HTTPS AND WWW]
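
For example, here is what that command might look like with a placeholder address filled in (example.com is just a stand-in for the real site):

wget --mirror --convert-links --adjust-extension --page-requisites -r -p -e robots=off -U mozilla https://www.example.com/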

Randomly-timed crawl, for when the host might blacklist you:

wget --mirror --random-wait --convert-links --adjust-extension --page-requisites -r -p -e robots=off -U mozilla [URL WITH HTTP/HTTPS AND WWW]
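
As a side note, the wget manual describes --random-wait as varying the delay between requests around the value set with --wait, so it may be worth pairing the two. A possible variation (the 2-second base wait is just an arbitrary choice):

wget --mirror --wait=2 --random-wait --convert-links --adjust-extension --page-requisites -r -p -e robots=off -U mozilla [URL WITH HTTP/HTTPS AND WWW]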

Note that the -e robots=off option tells wget to ignore the site's robots.txt restrictions. To my understanding, this may cause the site to be downloaded differently from how it is laid out on the hosting server, which in turn can leave some of the downloaded pages unusable because of link problems.
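
If that is a concern, a gentler variation is to simply drop -e robots=off so that wget honours robots.txt; everything else stays the same:

wget --mirror --convert-links --adjust-extension --page-requisites -U mozilla [URL WITH HTTP/HTTPS AND WWW]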


Explanations and Sources

The basic command structure is from here.

The "-r -p -e robots=off -U mozilla" parameters are from here.

Both links contain explanations.
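
For quick reference, and to the best of my understanding of the wget manual, the options used above mean roughly the following:

--mirror: turns on recursion and time-stamping with unlimited depth (equivalent to -r -N -l inf --no-remove-listing)
--convert-links: rewrites links in the downloaded pages so they work locally
--adjust-extension: adds suitable extensions (such as .html) to saved files where needed
--page-requisites: also downloads the images, stylesheets and other files each page needs to display properly
-r and -p: short forms of recursive download and --page-requisites (redundant next to --mirror and --page-requisites, but harmless)
-e robots=off: ignores the site's robots.txt restrictions
-U mozilla: sends "mozilla" as the user-agent string
--random-wait: varies the delay between requests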


why mirror websites?

You might be wondering why anyone would want to mirror a website, let alone an entire website with all of its files. We've got Wi-Fi, right? Can't we just go to the website when we need it?

While websites can contain amazingly useful information, they are inherently volatile. They are served by machines that have to stay running 24/7, which creates the illusion that they are always available on demand. In reality, a site can be here one day and gone forever the next, so it can never be completely relied on to be available at any given moment. The reason to mirror a website, then, especially one containing important or obscure information, is to preserve that information and put its availability under your own control. The same applies to all media on the Internet.


notable alternatives

viable alternative: httrack

Information on HTTrack can be found by clicking its link above.

The HTTrack command below (run from a terminal) proved to be just as good as, if not better and more stable than, wget for mirroring complicated websites, because wget can at times fail to convert some links inside HTML files. The trade-off is that HTTrack is much slower.

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -A100000000 -#L500000000 '[URL]'

Explanation of the command and its origin can be found here.
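
For simpler sites, a much shorter HTTrack invocation may also be enough. A minimal sketch (the URL, the output directory and the filter pattern are placeholders to adjust):

httrack 'https://www.example.com/' -O ./example-mirror '+*.example.com/*'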