2009-10-17

Batch Image Download With wget

wget is often used to download single files from the command line, but it can also mirror a website locally or download just part of a website. By specifying the right parameters we can make wget act as a batch downloader, retrieving only the files we want.

In this example we assume a website with a sequence of pages, where each page links to the next in the sequence and they all contain a JPEG image. We want to download all the images to the current directory. The following command line does this:

$ wget --recursive --level=inf --no-directories --no-parent --accept '*.jpg' URL

Or if you prefer, the shorter but more obscure:

$ wget -r -l inf -nd -np -A '*.jpg' URL

Let’s take a look at the parameters:

--recursive
    Makes wget follow links from the start page.

--level=inf
    This allows for infinite recursion. In combination with another option that limits how
    far the recursion can spread, like --no-parent, we don’t need to know the necessary
    depth. Otherwise you should specify a number to set a limit.

--no-directories
    By default wget recreates the directory structure of the website. This option
    makes wget put all files in the same directory.

--no-parent
    Do not follow links to pages above the starting page in the hierarchy.

--accept '*.jpg'
    Here we specify which files to download. The parameter can be a comma-separated list
    of file name suffixes or patterns. Quote the pattern so the shell does not expand it.

For other options and details about those listed here, check the wget man page.
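
A note on the pattern given to --accept: since the images land in the current directory, an unquoted *.jpg can be expanded by the shell on a second run, before wget ever sees it. The sketch below only illustrates that shell behaviour and does not call wget. The show_args helper and the scratch directory are made up for the demonstration.

```shell
# Print each argument on its own line, to show what a command actually receives
show_args() { printf '%s\n' "$@"; }

# Hypothetical scratch directory holding two leftover images
demo_dir=$(mktemp -d)
touch "$demo_dir/a.jpg" "$demo_dir/b.jpg"

# Unquoted: the shell replaces *.jpg with the matching file names
unquoted=$(cd "$demo_dir" && show_args --accept *.jpg)
# Quoted: the pattern reaches the command unchanged
quoted=$(cd "$demo_dir" && show_args --accept '*.jpg')

printf 'unquoted:\n%s\n\nquoted:\n%s\n' "$unquoted" "$quoted"
rm -rf "$demo_dir"
```

In the unquoted case the command receives the existing file names instead of the pattern, which is why quoting is the safer habit here.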


In one case where I used wget, the file names had to be zero-padded afterwards (like img-01.jpg instead of img-1.jpg). For a single directory this did the job (padding single-digit numbers to two digits):

$ rename 's/-([0-9])\./-0$1\./' *.jpg

For a directory tree this was used:

$ find . -name "*.jpg" -type f -print0 | xargs --null rename 's/-([0-9])\./-0$1\./'
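
The rename commands above assume the Perl-based rename; some distributions ship a different rename with another syntax. As a fallback, here is a rough sketch that does the same padding with find, sed and mv. The function name pad_jpgs is made up for this example, and unlike the regex above it only rewrites a single digit directly before the .jpg extension. It also assumes file names without newlines.

```shell
# Hypothetical helper: zero-pad names like img-1.jpg to img-01.jpg
# in the given directory tree (default: current directory).
pad_jpgs() {
    find "${1:-.}" -type f -name '*-[0-9].jpg' | while IFS= read -r path; do
        dir=$(dirname "$path")
        base=$(basename "$path")
        # Rewrite only a trailing "-<digit>.jpg" into "-0<digit>.jpg"
        new=$(printf '%s\n' "$base" | sed 's/-\([0-9]\)\.jpg$/-0\1.jpg/')
        mv -- "$path" "$dir/$new"
    done
}
```

Calling pad_jpgs without arguments works on the current directory and everything below it; files that already have two digits are left alone because they do not match the find pattern.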