


2.7 HTTP Options

-E
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.

Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig (see Recursive Retrieval Options).
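For instance, a mirroring run that combines this option with -k and -K might look like the following; site.com is just the placeholder host from the example above, and -r enables recursive retrieval:

          wget -E -k -K -r http://site.com/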


--http-user=user
--http-password=password
Specify the username user and password password on an HTTP server. Depending on the type of the challenge, Wget will encode them using either the basic (insecure) or the digest authentication scheme.

Another way to specify username and password is in the URL itself (see URL Format). Either method reveals your password to anyone who bothers to run ps. To prevent the passwords from being seen, store them in .wgetrc or .netrc, and make sure to protect those files from other users with chmod. If the passwords are really important, do not leave them lying in those files either—edit the files and delete them after Wget has started the download.
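For example (foo, bar, and server.com are placeholders, not real credentials or a real host):

          wget --http-user=foo --http-password=bar http://server.com/private/index.html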


--no-cache
Disable server-side caching. In this case, Wget will send the remote server an appropriate directive (Pragma: no-cache) to get the file from the remote service rather than returning the cached version. This is especially useful for retrieving and flushing out-of-date documents on proxy servers.

Caching is allowed by default.
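For example, to fetch a fresh copy of a possibly stale document through a caching proxy (the host and filename are placeholders):

          wget --no-cache http://server.com/news.html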


--no-cookies
Disable the use of cookies. Cookies are a mechanism for maintaining server-side state. The server sends the client a cookie using the Set-Cookie header, and the client responds with the same cookie upon further requests. Since cookies allow server owners to keep track of visitors and sites to exchange this information, some consider them a breach of privacy. The default is to use cookies; however, storing cookies is not on by default.


--load-cookies file
Load cookies from file before the first HTTP retrieval. file is a textual file in the format originally used by Netscape's cookies.txt file.

You will typically use this option when mirroring sites that require that you be logged in to access some or all of their content. The login process typically works by the web server issuing an HTTP cookie upon receiving and verifying your credentials. The cookie is then resent by the browser when accessing that part of the site, and so proves your identity.

Mirroring such a site requires Wget to send the same cookies your browser sends when communicating with the site. This is achieved by --load-cookies—simply point Wget to the location of the cookies.txt file, and it will send the same cookies your browser would send in the same situation. Different browsers keep textual cookie files in different locations:

Netscape 4.x.
The cookies are in ~/.netscape/cookies.txt.
Mozilla and Netscape 6.x.
Mozilla's cookie file is also named cookies.txt, located somewhere under ~/.mozilla, in the directory of your profile. The full path usually ends up looking somewhat like ~/.mozilla/default/some-weird-string/cookies.txt.
Internet Explorer.
You can produce a cookie file Wget can use by using the File menu, Import and Export, Export Cookies. This has been tested with Internet Explorer 5; it is not guaranteed to work with earlier versions.
Other browsers.
If you are using a different browser to create your cookies, --load-cookies will only work if you can locate or produce a cookie file in the Netscape format that Wget expects.
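Once you have located or exported such a file, point Wget at it. For example, assuming the exported file is named cookies.txt and server.com is a placeholder host, a mirroring run might look like this:

          wget --load-cookies cookies.txt -m http://server.com/members/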

If you cannot use --load-cookies, there might still be an alternative. If your browser supports a “cookie manager”, you can use it to view the cookies used when accessing the site you're mirroring. Write down the name and value of the cookie, and manually instruct Wget to send those cookies, bypassing the “official” cookie support:

          wget --no-cookies --header "Cookie: name=value"
     


--save-cookies file
Save cookies to file before exiting. This will not save cookies that have expired or that have no expiry time (so-called “session cookies”), but also see --keep-session-cookies.


--keep-session-cookies
When specified, causes --save-cookies to also save session cookies. Session cookies are normally not saved because they are meant to be kept in memory and forgotten when you exit the browser. Saving them is useful on sites that require you to log in or to visit the home page before you can access some pages. With this option, multiple Wget runs are considered a single browser session as far as the site is concerned.

Since the cookie file format does not normally carry session cookies, Wget marks them with an expiry timestamp of 0. Wget's --load-cookies recognizes those as session cookies, but it might confuse other browsers. Also note that cookies so loaded will be treated as other session cookies, which means that if you want --save-cookies to preserve them again, you must use --keep-session-cookies again.


--ignore-length
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.

With this option, Wget will ignore the Content-Length header—as if it never existed.
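For example, to retrieve a document from a misbehaving CGI script (the host and path below are placeholders):

          wget --ignore-length http://server.com/cgi-bin/broken.cgi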


--header=header-line
Send header-line along with the rest of the headers in each HTTP request. The supplied header is sent as-is, which means it must contain name and value separated by a colon, and must not contain newlines.

You may define more than one additional header by specifying --header more than once.

          wget --header='Accept-Charset: iso-8859-2' \
               --header='Accept-Language: hr'        \
                 http://fly.srk.fer.hr/
     

Specification of an empty string as the header value will clear all previous user-defined headers.
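For instance, given the behavior described above, the following would send only the second Accept-Language header, because the empty --header clears the first:

          wget --header='Accept-Language: hr' \
               --header='' \
               --header='Accept-Language: de' \
                 http://fly.srk.fer.hr/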

As of Wget 1.10, this option can be used to override headers otherwise generated automatically. This example instructs Wget to connect to localhost, but to specify foo.bar in the Host header:

          wget --header="Host: foo.bar" http://localhost/
     

In versions of Wget prior to 1.10 such use of --header caused sending of duplicate headers.


--proxy-user=user
--proxy-password=password
Specify the username user and password password for authentication on a proxy server. Wget will encode them using the basic authentication scheme.

Security considerations similar to those with --http-password pertain here as well.
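For example, assuming the proxy itself has already been configured (say through the http_proxy environment variable), and with placeholder credentials and host:

          wget --proxy-user=foo --proxy-password=bar http://server.com/file.html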


--referer=url
Include a `Referer: url' header in the HTTP request. This is useful for retrieving documents with server-side processing that assume they are always being retrieved by interactive web browsers and only come out properly when Referer is set to one of the pages that point to them.
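For example, to fetch an image that the server only serves when the request appears to come from one of its own pages (both URLs below are placeholders):

          wget --referer='http://server.com/gallery.html' \
               http://server.com/images/photo.jpg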


--save-headers
Save the headers sent by the HTTP server to the file, preceding the actual contents, with an empty line as the separator.


-U agent-string
--user-agent=agent-string
Identify as agent-string to the HTTP server.

The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as Wget/version, version being the current version number of Wget.

However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.

Specifying an empty user agent with --user-agent="" instructs Wget not to send the User-Agent header in HTTP requests.
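For example, to identify as a generic browser rather than as Wget (the agent string below is only an illustration, not a recommendation):

          wget --user-agent='Mozilla/5.0 (compatible)' http://server.com/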


--post-data=string
--post-file=file
Use POST as the method for all HTTP requests and send the specified data in the request body. --post-data sends string as data, whereas --post-file sends the contents of file. Other than that, they work in exactly the same way.

Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to --post-file must be a regular file; specifying a FIFO or something like /dev/stdin won't work. It's not quite clear how to work around this limitation inherent in HTTP/1.0. Although HTTP/1.1 introduces chunked transfer that doesn't require knowing the request length in advance, a client can't use chunked unless it knows it's talking to an HTTP/1.1 server. And it can't know that until it receives a response, which in turn requires the request to have been completed – a chicken-and-egg problem.

Note: if Wget is redirected after the POST request is completed, it will not send the POST data to the redirected URL. This is because URLs that process POST often respond with a redirection to a regular page, which does not desire or accept POST. It is not completely clear that this behavior is optimal; if it doesn't work out, it might be changed in the future.

This example shows how to log in to a server using POST and then proceed to download the desired pages, presumably only accessible to authorized users:

          # Log in to the server.  This can be done only once.
          wget --save-cookies cookies.txt \
               --post-data 'user=foo&password=bar' \
               http://server.com/auth.php
          
          # Now grab the page or pages we care about.
          wget --load-cookies cookies.txt \
               -p http://server.com/interesting/article.php
     

If the server is using session cookies to track user authentication, the above will not work because --save-cookies will not save them (and neither will browsers) and the cookies.txt file will be empty. In that case use --keep-session-cookies along with --save-cookies to force saving of session cookies.
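In that case the login step of the example above becomes (the second command stays the same; server.com, auth.php, and the credentials remain placeholders):

          wget --save-cookies cookies.txt \
               --keep-session-cookies \
               --post-data 'user=foo&password=bar' \
               http://server.com/auth.php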