Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig (see Recursive Retrieval Options).
basic (insecure) or the digest authentication scheme.
Another way to specify username and password is in the URL itself (see URL Format). Either method reveals your password to anyone who bothers to run ps. To prevent the passwords from being seen, store them in .wgetrc or .netrc, and make sure to protect those files from other users with chmod. If the passwords are really important, do not leave them lying in those files either; edit the files and delete them after Wget has started the download.
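As a rough sketch of the advice above, the following creates a .netrc-style file that only its owner can read. The temporary directory, the host name example.com, and the credentials foo and bar are placeholders invented for illustration; a real file would normally live in your home directory.

```shell
# Sketch: create a .netrc-style credentials file readable only by its owner.
# demo_dir, example.com, "foo" and "bar" are hypothetical placeholders.
demo_dir=$(mktemp -d)
netrc="$demo_dir/.netrc"

# Restrict permissions before any secret is written, so the file is never
# readable by other users, not even briefly.
umask 077
printf 'machine example.com login foo password bar\n' > "$netrc"
chmod 600 "$netrc"

# Show the resulting mode; the permission column should read -rw-------
ls -l "$netrc"
```

Setting the umask first matters because a file created with default permissions could be read by others in the short window before chmod runs.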
Caching is allowed by default.
Set-Cookie header, and the client responds with the same cookie upon further requests. Since cookies allow the server owners to keep track of visitors and for sites to exchange this information, some consider them a breach of privacy. The default is to use cookies; however, storing cookies is not on by default.
You will typically use this option when mirroring sites that require that you be logged in to access some or all of their content. The login process usually works by the web server issuing an HTTP cookie upon receiving and verifying your credentials. The browser then resends the cookie when accessing that part of the site, and so proves your identity.
Mirroring such a site requires Wget to send the same cookies your browser sends when communicating with the site. This is achieved by --load-cookies: simply point Wget to the location of the cookies.txt file, and it will send the same cookies your browser would send in the same situation. Different browsers keep textual cookie files in different locations.
If you cannot use --load-cookies, there might still be an alternative. If your browser supports a “cookie manager”, you can use it to view the cookies used when accessing the site you're mirroring. Write down the name and value of the cookie, and manually instruct Wget to send those cookies, bypassing the “official” cookie support:
wget --no-cookies --header "Cookie: name=value"
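When more than one cookie has to be sent this way, the pairs can be combined into a single Cookie header, separated by a semicolon and a space. The following is a minimal sketch; cookie_header is a hypothetical helper (not part of Wget), and the cookie names and values are invented for illustration.

```shell
# Sketch: join name=value pairs into one Cookie header line.
# cookie_header is a hypothetical helper, not part of Wget.
cookie_header() {
    header='Cookie: '
    sep=''
    for pair in "$@"; do
        header="$header$sep$pair"
        sep='; '
    done
    printf '%s\n' "$header"
}

# Hypothetical cookies copied from a browser's cookie manager.
cookie_header 'session=abc123' 'lang=en'

# Possible use with Wget (placeholder URL, left commented out):
# wget --no-cookies --header "$(cookie_header 'session=abc123' 'lang=en')" http://server.com/
```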
Since the cookie file format does not normally carry session cookies, Wget marks them with an expiry timestamp of 0. Wget's --load-cookies recognizes those as session cookies, but it might confuse other browsers. Also note that cookies so loaded will be treated as other session cookies, which means that if you want --save-cookies to preserve them again, you must use --keep-session-cookies again.
Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header, as if it never existed.
You may define more than one additional header by specifying --header more than once.
wget --header='Accept-Charset: iso-8859-2' \
     --header='Accept-Language: hr' \
     http://fly.srk.fer.hr/
Specification of an empty string as the header value will clear all previous user-defined headers.
As of Wget 1.10, this option can be used to override headers otherwise generated automatically. This example instructs Wget to connect to localhost, but to specify foo.bar in the Host header:
wget --header="Host: foo.bar" http://localhost/
In versions of Wget prior to 1.10, such use of --header caused duplicate headers to be sent.
basic authentication scheme.
Security considerations similar to those with --http-password pertain here as well.
The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as Wget/version, version being the current version number of Wget.
However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.
Specifying an empty user agent with --user-agent="" instructs Wget not to send the User-Agent header in HTTP requests.
--post-data sends string as data, whereas --post-file sends the contents of file. Other than that, they work in exactly the same way.
Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to --post-file must be a regular file; specifying a FIFO or something like /dev/stdin won't work.
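One way to cope with this restriction is to spool generated data into a temporary regular file before handing it to Wget. The following is a sketch under that assumption; spool_to_file is a hypothetical helper, and the POST body and URL are placeholders.

```shell
# Sketch: copy streamed data into a regular file so that its size is known
# before the request is made. spool_to_file is a hypothetical helper.
spool_to_file() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"            # read everything from standard input
    printf '%s\n' "$tmp"    # report the path of the spooled file
}

# Data that would otherwise arrive through a pipe or FIFO.
post_file=$(printf 'user=foo&password=bar' | spool_to_file)

# A regular file has a definite size, which is what --post-file needs.
[ -f "$post_file" ] && wc -c < "$post_file"

# Possible use with Wget (placeholder URL, left commented out):
# wget --post-file "$post_file" http://server.com/auth.php
# rm -f "$post_file"
```

Remember to delete the temporary file afterwards, especially when it holds credentials.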
It's not quite clear how to work around this limitation inherent in
HTTP/1.0. Although HTTP/1.1 introduces chunked transfer that
doesn't require knowing the request length in advance, a client can't
use chunked unless it knows it's talking to an HTTP/1.1 server. And it
can't know that until it receives a response, which in turn requires the
request to have been completed – a chicken-and-egg problem.
Note: if Wget is redirected after the POST request is completed, it will not send the POST data to the redirected URL. This is because URLs that process POST often respond with a redirection to a regular page, which does not desire or accept POST. It is not completely clear that this behavior is optimal; if it doesn't work out, it might be changed in the future.
This example shows how to log in to a server using POST and then proceed to download the desired pages, presumably only accessible to authorized users:
# Log in to the server.  This can be done only once.
wget --save-cookies cookies.txt \
     --post-data 'user=foo&password=bar' \
     http://server.com/auth.php

# Now grab the page or pages we care about.
wget --load-cookies cookies.txt \
     -p http://server.com/interesting/article.php
If the server is using session cookies to track user authentication, the above will not work because --save-cookies will not save them (and neither will browsers) and the cookies.txt file will be empty. In that case use --keep-session-cookies along with --save-cookies to force saving of session cookies.
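The session-cookie variant described above might look like the following sketch. It reuses the placeholder URLs and credentials from the earlier example; the WGET variable and the login_and_fetch function are conveniences introduced here for illustration, so that the commands can be previewed without contacting a server.

```shell
# Sketch: log in and fetch pages while preserving session cookies.
# WGET can be overridden (for example with echo) to preview the commands.
WGET=${WGET:-wget}

login_and_fetch() {
    # Save cookies, including session cookies, after the POST login.
    "$WGET" --save-cookies cookies.txt --keep-session-cookies \
        --post-data 'user=foo&password=bar' \
        http://server.com/auth.php || return 1

    # Reuse the saved cookies for the pages we actually want.
    "$WGET" --load-cookies cookies.txt \
        -p http://server.com/interesting/article.php
}

# Preview the two commands in a subshell without touching the network.
(WGET=echo; login_and_fetch)
```

To perform the real download, call login_and_fetch directly with WGET left at its default.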