                             ==Phrack Inc.==

               Volume 0x0b, Issue 0x39, Phile #0x0a of 0x12

|=-------------=[ Against the System: Rise of the Robots ]=--------------=|
|=-----------------------------------------------------------------------=|
|=-=[ (C)Copyright 2001 by Michal Zalewski ]=-=|


-- [1] Introduction -------------------------------------------------------

  "[...] big difference between the web and traditional well controlled
   collections is that there is virtually no control over what people can
   put on the web. Couple this flexibility to publish anything with the
   enormous influence of search engines to route traffic and companies
   which deliberately manipulating search engines for profit become a
   serious problem."

                  -- Sergey Brin, Lawrence Page (see references, [A])

  Consider a remote exploit that is able to compromise a remote system
  without sending any attack code to its victim. Consider an exploit which
  simply creates a local file to compromise thousands of computers, and
  which does not involve any local resources in the attack. Welcome to the
  world of zero-effort exploit techniques. Welcome to the world of
  automation, welcome to the world of anonymous, dramatically
  hard-to-stop attacks resulting from increasing Internet complexity.

  Zero-effort exploits create their 'wishlist' and leave it somewhere in
  cyberspace - possibly even on their home host - in a place where others
  can find it. Others - Internet workers (see references, [D]) - hundreds
  of never sleeping, endlessly browsing information crawlers, intelligent
  agents, search engines... They come to pick up this information, and -
  unknowingly - to attack victims. You can stop one of them, but you can't
  stop them all. You can find out what their orders are, but you can't
  guess what these orders will be tomorrow, hidden somewhere in the abyss
  of not yet explored cyberspace.

  Your private army, close at hand, picking up the orders you left for
  them on their way. You exploit them without having to compromise them.
  They do what they are designed for, and they do their best to accomplish
  it. Welcome to the new reality, where our A.I. machines can rise against
  us.

  Consider a worm. Consider a worm which does nothing. It is carried and
  injected by others - but not by infecting them. This worm creates a
  wishlist - a wishlist of, for example, 10,000 random addresses. And
  waits. Intelligent agents pick up this list, and with their united
  forces they try to attack all of them. Imagine they are not lucky, with
  a 0.1% success ratio. Ten new hosts infected. On each of them, the worm
  does exactly the same - and the agents come back, to infect 100 hosts.
  The story goes - or crawls, if you prefer.

  Agents work virtually invisibly; people have got used to their presence
  everywhere. And crawlers just slowly go ahead, in a never-ending loop.
  They work systematically, they do not choke on excessive data - they
  crawl, there's no "boom" effect. Week after week after week, they try
  new hosts, carefully, not overloading network uplinks, not generating
  suspicious traffic; the recurrent exploration never ends. Can you notice
  they carry a worm? Possibly...


-- [2] An example ---------------------------------------------------------

  When this idea came to my mind, I tried the simplest test, just to see
  if I was right. I targeted, if that's the right word, general-purpose
  web indexing crawlers. I created a very short HTML document and put it
  somewhere. And waited a few weeks. And then they came. Altavista, Lycos
  and dozens of others. They found new links and picked them up
  enthusiastically, then disappeared for days.

    bigip1-snat.sv.av.com: GET /indexme.html HTTP/1.0
    sjc-fe5-1.sjc.lycos.com: GET /indexme.html HTTP/1.0
    [...]

  They came back later, to see what I gave them to parse.
    http://somehost/cgi-bin/script.pl?p1=../../../../attack
    http://somehost/cgi-bin/script.pl?p1=;attack
    http://somehost/cgi-bin/script.pl?p1=|attack
    http://somehost/cgi-bin/script.pl?p1=`attack`
    http://somehost/cgi-bin/script.pl?p1=$(attack)
    http://somehost:54321/attack?`id`
    http://somehost/AAAAAAAAAAAAAAAAAAAAA...

  Our bots followed them, exploiting hypothetical vulnerabilities and
  compromising remote servers:

    sjc-fe6-1.sjc.lycos.com: GET /cgi-bin/script.pl?p1=;attack HTTP/1.0
    212.135.14.10: GET /cgi-bin/script.pl?p1=$(attack) HTTP/1.0
    bigip1-snat.sv.av.com: GET /cgi-bin/script.pl?p1=../../../../attack HTTP/1.0
    [...]

  (BigIP is one of the famous "I observe you" load balancers from F5Labs)

  Bots happily connected to the non-HTTP ports I prepared for them:

    GET /attack?`id` HTTP/1.0
    Host: somehost
    Pragma: no-cache
    Accept: text/*
    User-Agent: Scooter/1.0
    From: scooter@pa.dec.com

    GET /attack?`id` HTTP/1.0
    User-agent: Lycos_Spider_(T-Rex)
    From: spider@lycos.com
    Accept: */*
    Connection: close
    Host: somehost:54321

    GET /attack?`id` HTTP/1.0
    Host: somehost:54321
    From: crawler@fast.no
    Accept: */*
    User-Agent: FAST-WebCrawler/2.2.6 (crawler@fast.no; [...])
    Connection: close

    [...]

  But it is not only publicly available crawlbot engines that can be
  targeted. Crawlbots from alexa.com, ecn.purdue.edu, visual.com,
  poly.edu, inria.fr, powerinter.net, xyleme.com, and even more
  unidentified crawl engines found this page and enjoyed it.

  Some robots didn't pick up all the URLs. For example, some crawlers do
  not index scripts at all, and others won't use non-standard ports. But
  the majority of the most powerful bots will - and even if they did not,
  trivial filtering is not the answer. Many IIS vulnerabilities, among
  others, can be triggered without invoking any scripts.

  What if this server list was randomly generated, 10,000 IPs or 10,000
  .com domains? What if script.pl is replaced with invocations of the
  three, four, five or ten most popular IIS vulnerabilities or buggy Unix
  scripts? What if one out of 2,000 is actually exploited?
  What if somehost:54321 points to a vulnerable service which can be
  exploited with the partially user-dependent contents of HTTP requests
  (I consider the majority of fool-proof services that do not drop
  connections after the first invalid command vulnerable)? What if...

  There is an army of robots, different species, different functions,
  different levels of intelligence. And these robots will do whatever you
  tell them to do. It is scary.


-- [3] Social considerations ----------------------------------------------

  Who is guilty if a web crawler compromises your system? The most obvious
  answer is: the author of the original web page the crawler visited. But
  web page authors are hard to trace, and a web crawler indexing cycle
  takes weeks. It is hard to determine when a specific page was put on the
  net - pages can be delivered in so many ways, or processed by other
  robots earlier; there is no tracking mechanism of the kind we can find
  in SMTP and many other protocols. Moreover, many crawlers don't remember
  where they "learned" new URLs. Additional problems are caused by
  indexing flags, like "noindex" without the "nofollow" option. In many
  cases, the author's identity and the attack origin could not be
  determined, even though compromises would take place.

  And, finally, what if having a particular link followed by bots wasn't
  what the author meant? Consider "educational" papers, etc - bots won't
  read the disclaimer and the big fat warning "DO NOT TRY THESE LINKS"...

  By analogy to other cases, e.g. Napster being forced to filter its
  contents (or shut down its services) because of copyrighted information
  exchanged by its users, causing losses, it is reasonable to expect that
  intelligent bot developers would be forced to implement specific
  filters, or to pay enormous compensation to victims suffering because of
  bot abuse.

  On the other hand, it seems almost impossible to successfully filter
  contents to eliminate malicious code, if you consider the number and
  wide variety of known vulnerabilities.
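
  To make the filtering problem concrete, here is a minimal sketch of the
  kind of URL filter a crawler developer might be pushed to implement. It
  assumes Python and a hypothetical deny-list derived only from the sample
  links in section [2]; the names looks_suspicious, SUSPICIOUS_MARKERS,
  ALLOWED_PORTS and MAX_URL_LENGTH are illustrative assumptions and are
  not taken from any real crawler.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical deny-list (an assumption for illustration): markers lifted
# from the sample URLs in section [2]. It is NOT a complete rule set.
SUSPICIOUS_MARKERS = ("../", ";", "|", "`", "$(")

# Assumed policy: only default web ports are crawled (None = no explicit port).
ALLOWED_PORTS = {None, 80, 443}

# Assumed, arbitrary length cap meant to stop overflow-style URLs.
MAX_URL_LENGTH = 2048


def looks_suspicious(url: str) -> bool:
    """Return True if this naive filter would refuse to fetch the URL."""
    if len(url) > MAX_URL_LENGTH:
        return True

    parts = urlsplit(url)
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return True
    if port not in ALLOWED_PORTS:
        return True

    # Decode percent-escapes once, then look for shell/traversal markers
    # in the path and query string.
    decoded = unquote(parts.path + "?" + parts.query)
    return any(marker in decoded for marker in SUSPICIOUS_MARKERS)
```

  A filter like this rejects the sample links, but it both over- and
  under-blocks: it refuses harmless URLs that merely contain ';', while a
  doubly encoded traversal (p1=%252e%252e%252fattack) or an overflow
  string shorter than the arbitrary length cap passes untouched.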
  Targeted attacks make such filtering harder still (see references, [B],
  for more information on proprietary solutions and their insecurities).
  So the problem persists. An additional issue is that not all crawler
  bots are under U.S. jurisdiction, which makes the whole problem more
  complicated (in many countries, the U.S. approach is considered
  controversial at the least).


-- [4] Defense ------------------------------------------------------------

  As discussed above, a web crawler itself has very limited defense and
  avoidance possibilities, due to the wide variety of web-based
  vulnerabilities. One of the more reasonable defense ideas is to use
  secure and up-to-date software, but - obviously - this concept is
  extremely unpopular for some reason - www.google.com, with the unique
  documents filter enabled, returns 62,100 matches for the
  "cgi vulnerability" query (see also references, [D]).

  Another line of defense against bots is the standard /robots.txt robot
  exclusion mechanism (see references, [C], for specifications). The price
  you have to pay is partial or complete exclusion of your site from
  search engines, which, in most cases, is undesired. Also, some robots
  are broken, and do not respect /robots.txt when following a direct link
  to a new website.


-- [5] References ---------------------------------------------------------

  [A] "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
      Googlebot concept, Sergey Brin, Lawrence Page, Stanford University
      URL: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

  [B] Proprietary web solutions security, Michal Zalewski
      URL: http://lcamtuf.coredump.cx/milpap.txt

  [C] "A Standard for Robot Exclusion", Martijn Koster
      URL: http://info.webcrawler.com/mak/projects/robots/norobots.html

  [D] "The Web Robots Database"
      URL: http://www.robotstxt.org/wc/active.html
      URL: http://www.robotstxt.org/wc/active/html/type.html

  [E] "Web Security FAQ", Lincoln D. Stein
      URL: http://www.w3.org/Security/Faq/www-security-faq.html

|=[ EOF ]=---------------------------------------------------------------=|