Dataset Curation and Robots.txt
Intersecting AI #19: Training GenAI requires lots of data. Robots.txt can help control access...if AI companies comply.
Robots.txt
As Wikipedia puts it, Robots.txt is the filename used to implement the Robots Exclusion Protocol, which instructs web crawlers/indexers/scrapers (collectively, “bots”) which portions of the website those bots are allowed to visit.1
When a site owner wishes to instruct bots, they place a text file called robots.txt in the root of the website hierarchy (e.g., https://www.example.com/robots.txt). This text file contains the instructions in a specific format (shown in the example below). Bots that follow the protocol try to fetch this file and read its instructions before fetching any other file from the website. If the file does not exist, bots assume that the website owner does not wish to place any limitations on crawling the entire site.
The following example shows a simple robots.txt file. The text after the # in each line is a “comment,” which explains in human-readable terms what the text before the # is doing.
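This is an illustrative sketch for a hypothetical example.com site, not any particular site’s real file:

```text
# robots.txt for https://www.example.com (hypothetical)

User-agent: *            # these rules apply to all bots
Disallow: /private/      # no bot may visit anything under /private/
Disallow: /drafts/       # ...or anything under /drafts/
Allow: /                 # everything else may be crawled

User-agent: BadBot       # these rules apply only to a bot named BadBot
Disallow: /              # BadBot may not visit any page at all

Sitemap: https://www.example.com/sitemap.xml  # where the sitemap lives
```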
Robots.txt files are particularly important for web crawlers from search engines such as Google. If those bots are blocked, Google can’t index a page and serve it as a search result.
However, this relies on voluntary compliance. Despite the use of the terms "allow" and "disallow," the protocol is purely advisory and relies on the compliance of the bot; robots.txt cannot enforce anything stated in the file. Not all bots comply with the standard. In fact, email harvesters, spambots, malware, and bots that scan for security vulnerabilities may intentionally start with the very portions of the website they have been asked (by the Robots Exclusion Protocol) to avoid.

Robots.txt and LLMs

Complying with robots.txt is infinitely better than ignoring it if one believes the wishes of the website owner matter. However, merely following robots.txt may not be sufficient when it comes to gathering data for LLMs.
The Conundrum
Robots.txt is generally an all-or-nothing proposition: either (a) allow your site to be discoverable on the web by allowing it to be indexed (i.e., cataloging the site), or (b) make your site virtually impossible to find by making it private.
By allowing a site to be discoverable for indexing, the site also inherently allows scraping (i.e., actually downloading the content and keeping a copy). Bots were initially meant to help with search discovery, so both the search engine and the site owner benefited from the arrangement. But downloaded content can also be used for other purposes, such as academic research and training LLMs.
Historically, sites didn’t mind being used for research. But when a site’s content is used to train a private entity’s GenAI, the benefit seems to disproportionately favor the GenAI creator, giving the site owner little or nothing in return if there is no prominent attribution in GenAI outputs.
AI’s current frothy economics mean that the calculation between honoring the unenforceable gentleman’s agreement of robots.txt and rushing to gather as much data as possible to build the largest, most sophisticated LLMs has tilted in favor of the latter. Perhaps no company will gain billions in valuation for following robots.txt, but it might for using all the data it can find.
Additionally, there are significant privacy concerns because web crawlers can acquire large amounts of information from websites. When corporations use web crawlers, the content they collect could violate user privacy. In some cases, a lack of secure practices, or even technical mistakes, could lead to data breaches and other personal data leaks. Likewise, information collected through web crawlers can be used (some might say exploited) without the knowledge and consent of the people to whom the information belongs.2
For sites that are especially opposed to being scraped for AI training, there isn’t much in the way of recourse. In the past, a site owner might have been able to claim trespass to chattels if the scraping interfered with the site’s functioning, but this claim is generally no longer viable because technology has advanced to the point that overloaded servers are rarely a problem. Site owners’ desire to block certain scraping hasn’t changed, but the technology enabling that scraping has, and some courts are finding that technological advances make behavior acceptable today that would have been unacceptable in the past.
Robots.txt and Opt-Out
Another way to view the issue: if an entity ignores robots.txt, offering users an option to opt out of having their data used in training later on is a poor remedy. Robots.txt is itself essentially a form of opt-out request. It makes little sense to ignore someone deliberately saying “don’t scrape me” and then put the onus on those same people to say again “don’t use my material” after they were scraped against their wishes.
This is underscored by the fact that the default state of a website is to have no robots.txt at all, allowing everything to be scraped; if a site goes out of its way to tell bots not to scrape something, that is a strong signal of non-consent.
There may be an important caveat, though. Scraping sites whose terms of service (ToS) forbid scraping to train AI may differ from robots.txt forbidding scraping in an important way: it is not easy for a bot to check whether a site has terms, open them, and determine whether they allow or forbid such scraping. Robots.txt, by contrast, is simple and machine-readable; honoring it is not burdensome to the scraper, since checking a single well-known file is trivial to automate, as the sketch below shows. Therefore, if nothing else, one may argue, scrapers should follow robots.txt if they must follow anything.
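As a minimal sketch of how little effort compliance requires, the following uses only Python’s standard library; the bot name and URLs are hypothetical placeholders:

```python
# A minimal sketch of a well-behaved scraper honoring robots.txt,
# using only Python's standard library.
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleResearchBot"  # hypothetical bot name
TARGET = "https://www.example.com/articles/some-page.html"

# Fetch and parse the site's robots.txt once, up front.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Before fetching any page, ask whether this user agent is allowed.
if parser.can_fetch(USER_AGENT, TARGET):
    print(f"{TARGET} may be crawled; proceed with the request.")
else:
    print(f"{TARGET} is disallowed for {USER_AGENT}; skip it.")
```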
Bot Proliferation
Some GenAI developers have created new scraping bots (e.g., OpenAI’s GPTBot). Some older bots have added new functionality (e.g., Google’s Googlebot, via Google’s Google-Extended token) to give site owners more nuanced control of their sites. With the latter, site owners can tell these bots not to scrape their content for AI training material while still allowing scraping for search indexing. A drawback, however, is that the burden of ensuring the “do not scrape” message is properly configured for each bot falls entirely on the website owner.
With the proliferation of entity-specific bots (Amazon has Amazonbot; Bing has Bingbot; and Pinterest, LinkedIn, and Twitter each have their own, for example), site owners will have to remain not just aware but vigilant, configuring settings to each bot’s specifications to protect their site’s content. The example below illustrates what such a configuration might look like.
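For instance, a site that wants to remain in search results while opting out of AI training might publish something like the following. GPTBot and Google-Extended are the tokens OpenAI and Google have documented for this purpose; any other vendor’s bot would need its own entry:

```text
# Keep ordinary search indexing available
User-agent: Googlebot
Allow: /

# Opt out of Google's AI-training uses (Google-Extended is a control
# token rather than a separate crawler)
User-agent: Google-Extended
Disallow: /

# Block OpenAI's training crawler from the entire site
User-agent: GPTBot
Disallow: /
```

Note that none of this stops a bot that simply ignores the protocol; these lines only work against crawlers that choose to comply.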
1. More information can be found in RFC 9309: https://www.rfc-editor.org/rfc/rfc9309

2. For more information on the constitutional rights that might be violated through web crawling and the legal and ethical considerations for web crawling and scraping, a thorough publication in the Washington Journal of Law, Technology and Arts is an extremely helpful resource: Robots Welcome? Ethical and Legal Considerations for Web Crawling and Scraping.
The following students from the University of Texas at Austin contributed to the editing and writing of the content of LEAI: Carter E. Moxley, Brian Villamar, Ananya Venkataramaiah, Parth Mehta, Lou Kahn, Vishal Rachpaudi, Chibudom Okereke, Isaac Lerma, Colton Clements, Catalina Mollai, Thaddeus Kvietok, Maria Carmona, Mikayla Francisco, Aaliyah Mcfarlin