So I recently needed to download all files of a certain type out of a moderately complex HTML directory listing. I figured, "Oh, this should be easy, just use wget -r and it's as simple as that."
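For reference, the naive version looked something like this (the hostname and the .dat extension are just stand-ins for the real ones):

wget -r -np -nd -A '*.dat' http://www.host.com/foo/

-r recurses through the listing, -np keeps it from wandering up into the parent directory, -nd flattens everything into the current directory, and -A '*.dat' is supposed to keep only the files I actually want.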
Well, then I discovered that the moderately complex HTML directory listing required a list of about 12 reject patterns to keep wget from downloading duplicates. Then I hit a brick wall. The files I'm downloading are quite numerous, about 5-10 MB each, and there are two different sets of links in the listing that point to the same file, which makes it extra fun. One is
index.php?path=/blah/whatever&download=file.dat and the other is just
/blah/whatever/file.dat; they both point to the same thing. So then I realized that wget sucks.
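In theory you can knock out the query-string duplicates with a reject pattern, something like this (hostname and patterns are stand-ins again):

wget -r -np -nd -A '*.dat' -R 'index.php*' http://www.host.com/foo/

The catch is that the index.php?...&download=file.dat URLs also end in .dat, so the accept pattern alone doesn't filter them; -R 'index.php*' is what's supposed to do that. Multiply that by every path variant the listing generates and you arrive at my dozen reject patterns.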
Basically, the command-line output goes something like this:
--2011-03-18 23:41:24--  http://www.host.com/foo/index.php?path=/foo/whatever&download=file.dat
Reusing existing connection to www.host.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 4901125 (4.7M) [application/x-download]
Saving to: `www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat'

100%[======================================>] 4,901,125    183K/s   in 27s

2011-03-18 23:41:51 (180 KB/s) - `www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat' saved [4901125/4901125]

Removing www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat since it should be rejected.
Did you just see what I saw? Why the hell did it download and save the file, and then remove it because it's rejected?
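The workaround I'd reach for at this point is to stop recursing entirely: grab the listing once, scrape out only the direct links, and hand those to wget as a flat URL list. A rough sketch, assuming the direct links appear as href="..." attributes ending in .dat (hostname and paths are placeholders as before):

wget -qO- 'http://www.host.com/foo/' \
  | grep -oE 'href="[^"]*\.dat"' \
  | grep -v 'index.php' \
  | sed 's/^href="//; s/"$//' \
  | wget -nc --base='http://www.host.com/' -i -

-qO- dumps the index page to stdout, the grep/sed combo keeps only the direct .dat links and strips the href wrapper, and -i - reads the resulting URL list from stdin, with --base resolving any relative paths. No recursion, no dozen reject patterns, and nothing gets downloaded just to be thrown away.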
Sigh. I loathe thee, wget.