wget, I hate you

So I recently needed to download all files of a certain type out of a moderately complex HTML directory listing. I figured, "Oh, this should be easy, just use wget -r and be done with it."

Well, then I discovered that the moderately complex HTML directory listing required about 12 reject patterns to keep from downloading duplicates. Then I hit a brick wall. The files I'm downloading are quite numerous and about 5-10 MB each, and there are two different sets of links in the listing that point to the same file, which makes it extra fun. One is index.php?path=/blah/whatever&download=file.dat and the other is just /blah/whatever/file.dat; they both point to the same thing. So then I realized that wget sucks.
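
For the record, the invocation was something like this (a rough sketch only; www.host.com and the .dat extension are stand-ins, and the real reject list ran to about a dozen -R patterns):

# Recursive grab: accept only the data files, try to reject the
# duplicate index.php?...&download= links by pattern.
wget -r -np -A '*.dat' \
  -R 'index.php*download=*' \
  http://www.host.com/foo/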

Basically the output goes something like this:

--2011-03-18 23:41:24--  http://www.host.com/foo/index.php?path=/foo/whatever&download=file.dat
Reusing existing connection to www.host.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 4901125 (4.7M) [application/x-download]
Saving to: `www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat'

100%[======================================>] 4,901,125    183K/s   in 27s

2011-03-18 23:41:51 (180 KB/s) - `www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat' saved [4901125/4901125]

Removing www.host.com/foo/bar/index.php?path=foo%2Fwhatever%2F&download=file.dat since it should be rejected.

Did you catch that? Why the hell did it download and save the entire file and then remove it because it's rejected? (Apparently wget can't tell whether an index.php URL will turn out to be an HTML page full of links until it has downloaded it, so it pulls down all 4.7 MB first and only applies the reject rule afterward.)
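
If I ever have to do this again, the workaround I'll reach for is to skip -r entirely: fetch the listing once, scrape out only the direct links, and hand wget a flat URL list. Again just a sketch, assuming the same made-up host, a .dat extension, and root-relative links in the listing:

# Grab just the listing page.
wget -O listing.html 'http://www.host.com/foo/'

# Keep only the direct file links; excluding '?' from the match
# drops the index.php?path=...&download= duplicates.
grep -o 'href="[^"?]*\.dat"' listing.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sed -e 's|^|http://www.host.com|' > urls.txt

# Fetch the flat list, no recursion involved.
wget -i urls.txt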

Sigh. I loathe thee, wget.

Tags: wget rant
Posted: 3/19/2011 4:46:45 AM