Wget off the leash
As we all know, to grab a website with wget, we use the "-r" option to "recurse" through all the links. There is also the '-H' option, which means that wget won't restrict itself to just one host. In other words, with '-r -H' together, it'll try to spider the entire Internet. So I did that to see what would happen.

Well, for a 32-bit process, what happened is that after more than a month, it ran out of memory. Wget maintains an ever-growing list of URLs it still has to visit, which can easily run into the millions. At roughly a hundred bytes per URL and 2 gigabytes of virtual memory, it'll exhaust memory after about 20 million URLs -- far short of the billions on the net. That's what you see in the first screenshot below, where 'wget' has crashed after exhausting memory. Below that is the command I used to launch the process, starting at cnn.com as the seed with a max timeout of 5 seconds.
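The exact command line lives in that screenshot, which isn't reproduced here. A plausible reconstruction from the flags described above would look something like the following; the seed URL and 5-second timeout come from the text, while the infinite recursion depth is an assumption (wget's default '-r' depth is only 5 levels):

    # -r  recurse through links
    # -H  span hosts instead of staying on the seed's domain
    # -T 5  give up on any DNS/connect/read after 5 seconds
    # -l inf  assumed: lift the default 5-level recursion limit
    wget -r -H -T 5 -l inf http://www.cnn.com/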
How much data did I download from the Internet? According to 'du', the answer is 18 gigabytes, as seen in the following screenshot:
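The screenshot itself isn't reproduced here, but a total like that comes from a plain 'du' invocation over the download directory; something like this gives a single human-readable figure (the '.' path is just a placeholder for wherever wget was run):

    # -s  summarize the whole tree as one total
    # -h  print it in human-readable units (e.g. 18G)
    du -sh .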
It reached 79,425 individual domains, far short of the millions of URLs it still held in its queue.
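Since wget's recursive mode writes each host's files into its own top-level directory by default, a domain count like that can be approximated by counting those directories. A sketch, assuming the default directory layout and that the crawl ran from the current directory:

    # Each spidered host gets its own top-level directory under the download root,
    # so counting those directories approximates the number of domains reached.
    find . -mindepth 1 -maxdepth 1 -type d | wc -l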