Posted on 09-03-2007 under howto, software, web crawling

Some time back, I was working on a project involving the ARM microcontroller, and I really needed resources on the ARM. During one of my extensive web crawls, I found this wonderful site. It was the most comprehensive collection of ARM-related technical information I had seen at that time. I had to have it.

Problem? How to download the entire site. The site is basically a mirror of one of the CD-ROMs that the Digital Systems Laboratory at the University of New South Wales hands out to its students. Who has the time to descend through all those directories and download each and every file? Even if I used FlashGot, I’d still have to click on each and every folder. (Here’s another example of this sort of problem: I needed the API documentation on my hard drive to browse offline.) Short of outsourcing this job to some sweatshop in South East Asia, I needed to find some sort of automated solution.

A few friends suggested some software that could work, but none of it did. Every program had some sort of limit. Some would not let you descend more than an arbitrary number of folders. Others would refuse to resume the previous night’s work. There had to be something I could use.

As a huge Linux/Unix fanboy, I decided to give the command line a try. Logging on to Mandriva (my Linux distro at that time), I pulled up a terminal. A few false starts and some googling later, I found wget. This gem of a tool can descend into directories to any depth, and can easily resume downloads where it left off. It can work in the background, even when you’re logged off. You can also run it occasionally to keep your local copy synchronized. The man page shows the power you can wield, if you understand how to use it:

$ man wget

Ok. Hold on just a minute. You don’t have Linux. How the heck are you supposed to do this? You use Cygwin. It’s a fully functional Unix command line that runs on Windows. Interested people can go to Lifehacker and read their “Introduction to Cygwin” parts I, II and III, but this is not needed for what I’m about to show you. All you need is a Cygwin installation with wget. Download the Cygwin installer from their site and run it (but remember to select the wget package when you see the package options).
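
If you want to confirm that the install really picked up wget, a quick check from the Cygwin prompt could look like the sketch below. The helper name have_tool is made up for this example; command -v is the standard shell way to ask whether a command is on your PATH.

```shell
# have_tool is a hypothetical helper name, not part of Cygwin or wget.
# It succeeds (exit status 0) when the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool wget; then
    echo "wget found at: $(command -v wget)"
else
    echo "wget not found; re-run the Cygwin installer and select the wget package"
fi
```

If it reports that wget is missing, running the installer again and ticking the wget package should be enough; you should not need to reinstall everything.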

Once you’ve got that done, you can run wget:

$ wget www.google.com

This will get you the first page on www.google.com. Meh. I could have done that with Firefox. Here comes the good part. (Warning: this can flood your hard drive if you try to download the Internet.)

$ wget -r -l inf -nc -c -k -p <insert website address here>

The -r option turns on recursive mode, which makes wget descend into directories. -l inf sets the maximum recursion depth to infinity, which is what we need. -nc tells wget not to overwrite files that already exist, and -c tells it to continue partially downloaded files; the idea is that an interrupted run can pick up where it left off. -k edits the links in the downloaded HTML files to point to the local copies, so that the pages work for local browsing. Finally, -p makes wget fetch everything a page needs to display completely (e.g. images).
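
If the single-letter flags are hard to remember, the same command can be spelled with wget’s long option names. The sketch below is only an illustration: the URL is a placeholder, and the mirror function and the DRY_RUN variable are names I made up for it. By default it only prints the command, so you can read it before letting it loose.

```shell
# Placeholder address (an assumption for this example); use the real site here.
SITE="http://example.com/docs/"

# mirror and DRY_RUN are hypothetical names used only in this sketch.
mirror() {
    # Long-option spelling of: wget -r -l inf -nc -c -k -p <url>
    set -- wget --recursive --level=inf --no-clobber --continue \
        --convert-links --page-requisites "$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run: only show the command that would be executed
    else
        "$@"         # real run: start the download
    fi
}

mirror "$SITE"
```

Setting DRY_RUN=0 before calling mirror starts the actual download. Printing first is a cheap way to catch a mistyped address before wget begins filling the disk.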

See, it’s as simple as typing one line. You could also copy and paste it directly from this page, if you’re too lazy to type. The result is a complete, locally browsable copy. There are lots more options for you to explore. wget is definitely a very powerful tool. Try it out and you’ll see that the command line is very frickin’ powerful.
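
Two of those other options cover the background and keep-in-sync uses mentioned earlier. This is a rough sketch rather than a tested recipe: the URL is a placeholder and background_sync_command is a made-up helper that only prints the command. The --background (-b) and --timestamping (-N) options are real wget options; -nc is left out here because, as far as I know, wget will not combine it with -N.

```shell
# Placeholder address (an assumption for this example).
SITE="http://example.com/docs/"

# background_sync_command is a hypothetical helper; it only prints the command.
#   --background    detach from the terminal and log to a file (wget-log by default)
#   --timestamping  skip files whose local copy is already up to date
background_sync_command() {
    echo "wget --background --recursive --level=inf --timestamping --page-requisites $1"
}

background_sync_command "$SITE"
```

Paste the printed line into the terminal to start it, then read the log file to see how far it has got.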

Now you know.