Types of Files - GNU Wget 1.10 Manual


4.2 Types of Files

When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading gifs, you will not be overjoyed to get loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in .wgetrc.

-A acclist

--accept acclist

accept = acclist

The argument to the --accept option is a list of file suffixes or patterns that Wget will download during recursive retrieval. A suffix is the ending part of a file name, and consists of “normal” letters, e.g. gif or .jpg. A matching pattern contains shell-like wildcards, e.g. books* or zelazny*196[0-9]*.

So, specifying wget -A gif,jpg will make Wget download only the files ending with gif or jpg, i.e. gifs and jpegs. On the other hand, wget -A "zelazny*196[0-9]*" will download only files whose names begin with zelazny and contain a number from 1960 to 1969 anywhere after that. Look up the manual of your shell for a description of how pattern matching works.

Of course, any number of suffixes and patterns can be combined into a comma-separated list, and given as an argument to -A.
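To get a feel for the wildcard patterns described above, they can be tried out in the shell itself before being handed to Wget. The following is a minimal sketch, not part of Wget: the helper name matches_pattern and the sample file names are made up for illustration, and it assumes that the shell's case matching agrees with Wget's matching for simple patterns like these.

```shell
#!/bin/sh
# matches_pattern PATTERN NAME (hypothetical helper, not a Wget command):
# succeed if NAME matches the shell wildcard PATTERN.
matches_pattern() {
    case $2 in
        $1) return 0 ;;   # unquoted $1 is interpreted as a pattern
        *)  return 1 ;;
    esac
}

# Illustrative file names only; the pattern is quoted so that the shell
# passes it along unchanged instead of expanding it.
for name in zelazny-1967-lord-of-light.txt zelazny-1975.txt amber-1969.txt; do
    if matches_pattern 'zelazny*196[0-9]*' "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Only the first name matches: the second contains no number in the 1960s, and the third does not begin with zelazny.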

-R rejlist

--reject rejlist

reject = rejlist

The --reject option works the same way as --accept, only its logic is the reverse; Wget will download all files except the ones matching the suffixes (or patterns) in the list.

So, if you want to download a whole page except for the cumbersome mpegs and .au files, you can use wget -R mpg,mpeg,au. Analogously, to download all files except the ones beginning with bjork, use wget -R "bjork*". The quotes are to prevent expansion by the shell.

The -A and -R options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. wget -A "*zelazny*" -R .ps will download all the files having zelazny as a part of their name, but not the PostScript files.
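The way the two lists interact can be previewed with a small shell sketch. It is an approximation under stated assumptions, not Wget's actual code: the helpers in_list and would_download and the file names are hypothetical, entries containing wildcards are treated as patterns and all other entries as suffixes, and the special handling of HTML files is ignored.

```shell
#!/bin/sh
# in_list LIST NAME (hypothetical helper): succeed if NAME matches any
# comma-separated entry of LIST. Runs in a subshell so that the IFS and
# "set -f" changes do not leak into the caller.
in_list() (
    IFS=,
    set -f                          # do not expand wildcards against local files
    for entry in $1; do
        case $entry in
            *[\*\?\[]*) pattern=$entry ;;     # entry contains wildcards: a pattern
            *)          pattern="*$entry" ;;  # plain entry: match as a suffix
        esac
        case $2 in
            $pattern) exit 0 ;;
        esac
    done
    exit 1
)

# would_download ACCLIST REJLIST NAME (hypothetical helper).
# An empty list stands for "option not given".
would_download() {
    if [ -n "$1" ] && ! in_list "$1" "$3"; then
        return 1                    # an accept list exists and NAME is not on it
    fi
    if [ -n "$2" ] && in_list "$2" "$3"; then
        return 1                    # NAME is on the reject list
    fi
    return 0
}

# Mirrors the example wget -A "*zelazny*" -R .ps from the text.
for name in zelazny-amber.txt zelazny-amber.ps herbert-dune.txt; do
    if would_download "*zelazny*" ".ps" "$name"; then
        echo "download: $name"
    else
        echo "skip:     $name"
    fi
done
```

With these sample names, only zelazny-amber.txt passes both checks; the PostScript file is rejected by -R, and the third name never matches the accept pattern.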

Note that these two options do not affect the downloading of HTML files; Wget must load all the HTML files to know where to go at all, since recursive retrieval would make no sense otherwise.