AI crawlers are making even the most generous websites miserable
Wikipedia to bots: "Can you please chill for a minute?"
I really hope at least one of you appreciates a Hoobastank reference.
The Wikimedia Foundation — the organization that runs Wikipedia and its associated websites — says the bandwidth it uses to serve multimedia has grown 50% since January 2024, growth it attributes to automated traffic from AI web crawlers.
Background: Web crawlers are bots that visit web pages and browse their content. They’re typically used by search engines to make an index of web pages so they can be retrieved when a user makes a search. But they have also been used by developers to scrape the troves of data needed to train AI models.
Websites have a file called robots.txt that tells crawlers what they can and cannot access. But robots.txt is less a rule that bots are forced to follow and more a polite request, one that some AI companies have been known to ignore.
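For the technically curious, honouring robots.txt is about as simple as it sounds. Here's a rough Python sketch of a well-behaved crawler using the standard library's robotparser; the bot name is made up and the Wikipedia URLs are just examples.

```python
# A minimal sketch of a "polite" crawler that checks robots.txt before
# fetching anything. "ExampleBot" is a made-up name; the URLs are examples.
from urllib.robotparser import RobotFileParser
from urllib.request import Request, urlopen

USER_AGENT = "ExampleBot/1.0"  # hypothetical crawler

def fetch_if_allowed(url: str) -> bytes | None:
    """Download url only if the site's robots.txt allows our user agent."""
    rules = RobotFileParser("https://en.wikipedia.org/robots.txt")
    rules.read()  # fetch and parse the site's rules
    if not rules.can_fetch(USER_AGENT, url):
        return None  # the polite outcome: back off
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return response.read()

page = fetch_if_allowed("https://en.wikipedia.org/wiki/Web_crawler")
```

The catch, of course, is that nothing in that snippet is mandatory: a crawler that skips the robots.txt check entirely will fetch the page just the same.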
What’s happening: Wikimedia caches pages that are particularly popular or seeing a spike in traffic due to public interest, and serves them to users from the data centre closest to them. But web crawlers access everything, including obscure, rarely-visited pages that have to be served from Wikimedia’s core data centre, which uses more resources and is more expensive.
Wikimedia says 63% of its “most expensive” traffic comes from bots.
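To picture why bot traffic ends up being the expensive kind, here's a toy sketch of the cache-hit-versus-origin-fetch pattern the Foundation describes. It's nothing like Wikimedia's actual infrastructure; the page names and cost labels are invented for illustration.

```python
# Toy illustration of why crawler traffic is costly: popular pages come out
# of a nearby cache, while obscure pages fall through to the origin
# ("core") data centre. Page names and cost labels are made up.
CACHE = {"/wiki/Taylor_Swift": "<html>...cached copy...</html>"}

def serve(path: str) -> tuple[str, str]:
    if path in CACHE:                       # human traffic mostly lands here
        return CACHE[path], "edge cache (cheap)"
    page = f"<html>rendered {path}</html>"  # stand-in for a real origin fetch
    return page, "core data centre (expensive)"

print(serve("/wiki/Taylor_Swift")[1])       # edge cache (cheap)
print(serve("/wiki/Obscure_1897_Fern")[1])  # core data centre (expensive)
```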
Wikimedia has been relatively AI-friendly compared to other website publishers, acknowledging that its content makes up a significant part of AI training data. In a way, its mission of making knowledge openly available to all would suggest ChatGPT and the like are welcome to it.
But Wikimedia content, from the text in articles to the massive cache of photos people make available, is generally licensed under Creative Commons, meaning material is free to share and redistribute — so long as it is properly attributed, which Wikimedia says hasn’t been happening.
Beyond being a license violation, Wikimedia says the lack of attribution makes it harder to attract new users (be they volunteers who maintain pages and provide multimedia, or would-be donors).
Why it matters: An uptick in AI bot traffic typically raises copyright concerns. But even for people who want the stuff on their website to be used by any person or bot that can access it, AI is causing expensive and annoying problems. Some have compared the uptick in automated bot visits to DDoS attacks, where hackers overwhelm a web service and cause it to crash.
As AI companies try to make models capable of coding, open-source developers are seeing their projects and code repositories slow down and crash, and their bandwidth costs climb.
In January, Triplegangers — a company that sells pre-made 3D models for use in art and games — had so many visits from OpenAI’s crawlers that its ecommerce site went down.
Why it’s tricky: Some developers have made tools to block crawlers, but those risk blocking legitimate traffic. Plus, blocking a bot is a great way to prompt whoever owns it to come up with a tactic that gets around whatever hurdle has been put in front of it.
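For a sense of why blocking is blunt: the simplest tools just match the request's User-Agent header against a denylist, along the lines of the sketch below (the bot names are examples of crawlers that identify themselves, and the request format is simplified). A bot that changes its user-agent string sails straight through, and an overly broad pattern can catch legitimate visitors.

```python
# Sketch of the bluntest blocking tactic: match the User-Agent header
# against a denylist. The headers dict is a simplified stand-in for a real
# HTTP request; the weakness is that bots can simply change their string.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider")  # example crawler names

def should_block(headers: dict[str, str]) -> bool:
    agent = headers.get("User-Agent", "")
    return any(bot.lower() in agent.lower() for bot in BLOCKED_AGENTS)

print(should_block({"User-Agent": "GPTBot/1.2"}))                     # True
print(should_block({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))  # False
```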
Don’t block bots — mess with them: Networking and cybersecurity company Cloudflare has developed a new service for customers that wastes an AI crawler's time. When a bot is detected, Cloudflare serves it a series of AI-generated fake web pages that look real enough for a crawler to scrape, but that lead it into a “labyrinth” of links. The entrance to the maze is also hidden so only bots will be able to find it, ensuring no human visitors get trapped.
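The sketch below is not Cloudflare's code, just the general shape of the idea as described above: every decoy page is filler text plus links that lead only to more decoy pages, and the entrance is kept out of sight of human visitors.

```python
# Not Cloudflare's implementation: just a sketch of the "labyrinth" idea.
# Every decoy page is filler plus links that only lead to more decoy pages,
# so a crawler that follows them gets nowhere useful. Paths and text are
# made up for the example.
import random

def decoy_page(depth: int, fanout: int = 5) -> str:
    """Return an HTML page whose links all point at further decoy pages."""
    links = "".join(
        f'<a href="/maze/{depth + 1}/{random.randrange(10**6)}">Read more</a>'
        for _ in range(fanout)
    )
    filler = "Plausible-looking generated filler text would go here. " * 20
    return f"<html><body><p>{filler}</p>{links}</body></html>"

# The entrance would be hidden from people (e.g. a link humans never see),
# so only automated visitors that crawl every link end up inside.
print(decoy_page(depth=0)[:120])
```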