“Data laundering” or “AI laundering” typically refers to an arrangement where an entity (like a university) receives an exemption (under a law or an agreement with the data provider), or otherwise collects data for nonprofit and/or research use, but then licenses the data in such a way that others (particularly for-profits) can use it. The end result is the same as if the for-profit had collected the data itself; in effect, the exempt entity is doing free labor and assuming some of the risk for the for-profit.
This approach allows a company to say, “Look, we didn’t scrape the data or violate any terms of use,” even though everyone understands that the data provider never intended the data to be used or provided for any purpose other than what was permitted under the exemption.
The most frequently cited example involves Stability AI.
From waxy.org:
Stability AI funded the development of the popular text-to-image AI generator Stable Diffusion, but it was created by the Machine Vision & Learning research group at the Ludwig Maximilian University of Munich. In their repo for the project, the LMU researchers thank Stability AI for the “generous compute donation” that made it possible. The massive image-text caption datasets used to train Stable Diffusion, Google’s Imagen, and the text-to-image component of Make-A-Video weren’t made by Stability AI either. They all came from LAION, a small nonprofit organization registered in Germany. Stability AI directly funds LAION’s compute resources, as well.
In scenarios where the government says that something is exempt for nonprofits or for research purposes, it doesn’t seem to make sense that the government also intended the data-laundering alternative, where the exempt entity could simply license the data to a for-profit (or any other non-exempt entity) such that the end result is the same as if the for-profit had done the data collection itself.
While a nonprofit or research org may have strong financial incentives to engage in such licensing (the for-profit might make a “donation” to the exempted org, for example), many will view doing so as subverting the will of the government, and therefore the intention of society. Perhaps it’s okay legally, but it strikes many as ethically and reputationally risky.
While the Stability AI example above focuses primarily on copyright, the loophole affects other laws as well. The same concern pops up in privacy law, where being a nonprofit (under the CCPA) or conducting scientific research (under the GDPR) grants exemptions. It’s not clear how or why it would be okay for the qualifying entity to license the artifacts created in reliance on those qualifications to other entities that don’t qualify for the exemptions.
Another Version
More recently, Wired ran an article about how the nonprofit EleutherAI scraped YouTube, in apparent violation of YouTube’s Terms of Service, to make a dataset of subtitles. EleutherAI then made that dataset freely available to everyone, and Apple, Nvidia, Anthropic, and many others have used it. Anthropic’s quote in the article is representative of many AI companies: “YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to the Pile authors.”
In other words, Anthropic doesn’t dispute that it could be problematic if Anthropic itself had scraped YouTube, but using a dataset created by someone else who scraped YouTube? Totally fair game. For those following along, this is the same issue as with the Books3 dataset, which includes nearly 200,000 books: EleutherAI helped create it, and entities like Meta used it to train their models.
Common Crawl
Even common sources of AI training material that are widely considered the good guys can be suspect from an AI laundering perspective. Common Crawl, for instance, was created for researchers, not to give for-profit companies easier access to training data.
Benefits of Common Crawl
It handles web crawling, which can be expensive and complicated; many researchers would not otherwise be able to collect the data at all.
The data supports a vast range of research areas: toxicity and bias on the internet, languages used, language translation, cultural and political trends, and more.
The findings of this research help inform policy decisions and help us better understand what is on the internet and how people use it. Without large datasets, we risk sample sizes that are too small or that underrepresent the internet, making findings less reliable or accurate.
According to Common Crawl’s website, “Common Crawl maintains a free, open repository of web crawl data that can be used by anyone… We make wholesale extraction, transformation and analysis of open web data accessible to researchers.”
Their homepage seems to suggest a conflict about who the data is for (“anyone” or “researchers”). But setting that aside, Common Crawl includes material taken against explicit requests not to take it (like books), without consent, and without payment, even though Common Crawl claims to adhere to robots.txt. How? Likely because the books were taken from behind a paywall or other protective technological measure and put on a publicly accessible part of the internet by someone else, and that’s where Common Crawl collected them.
This is not to say that Common Crawl wants this to occur. But it does happen, and it’s not clear that doing little or nothing about it, other than following an ineffectual robots.txt protocol, is the best approach.
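To make concrete why robots.txt is such a weak protection, the sketch below (Python standard library; a minimal illustration, not Common Crawl’s actual pipeline) shows how a compliant crawler consults a site’s robots.txt before fetching. “CCBot” is the user agent Common Crawl’s crawler actually uses; the publisher domain is hypothetical.

```python
# Minimal sketch of voluntary robots.txt compliance, using Python's stdlib.
# The directives below are what a publisher would serve to ask Common Crawl's
# crawler ("CCBot") to stay away from the entire site.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # normally fetched from the site's /robots.txt

# A compliant crawler checks before fetching; a non-compliant one never asks.
# "publisher.example" is a hypothetical domain used for illustration.
print(rp.can_fetch("CCBot", "https://publisher.example/book.html"))     # False
print(rp.can_fetch("OtherBot", "https://publisher.example/book.html"))  # True
```

Note the two gaps this illustrates: nothing in the protocol stops a crawler that simply never asks, and the rules attach to the site rather than the content, so a book re-uploaded somewhere with a permissive robots.txt is beyond the reach of the original publisher’s directives.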
An Alternative Approach(?)
But maybe we don’t need to throw the baby out with the bathwater. Perhaps the laws and regulations just need to be massaged to incentivize good behavior. After all, not all nonprofits serve the public good. Some do little work while paying their staff bloated salaries, for example, or are used by the wealthy to dodge taxes, or are used to promote a political view while shielding donors, thereby arguably undermining the transparency needed for a high-functioning democracy. [1]
In addition, not all for-profit entities are bad. Some push the boundaries of cancer research, climate research, and vaccine research, or make sweeping technological innovations that improve communications and transportation. The case for these entities is especially strong where the research and development arm is separate from the main for-profit parent. Indeed, we may want to encourage these entities because they can attract more funding and talent than the alternatives (nonprofits and universities).
The trick is coming up with a principled way to determine whether a for-profit entity should be allowed to use what otherwise might be considered laundered data. Ideally, the criteria would include a clear and easy way to determine the types of orgs that should be explicitly noted as not AI laundering when they use nonprofit data, the types of orgs that would need to apply to use the nonprofit data (i.e., maybe they should be exempt, maybe not), and the types of orgs that would be partaking in AI laundering if they used the nonprofit data.
For instance, maybe the buckets would fall into something like this: (1) orgs whose use of the nonprofit data is explicitly not AI laundering, (2) orgs that would need to apply to use the nonprofit data, and (3) orgs whose use of the nonprofit data would be AI laundering.
[1] Counterargument: the anonymity allows people to participate in the democratic process who otherwise may have felt uncomfortable doing so.
The following students from the University of Texas at Austin contributed to the editing and writing of the content of LEAI: Carter E. Moxley, Brian Villamar, Ananya Venkataramaiah, Parth Mehta, Lou Kahn, Vishal Rachpaudi, Chibudom Okereke, Isaac Lerma, Colton Clements, Catalina Mollai, Thaddeus Kvietok, Maria Carmona, Mikayla Francisco, Aaliyah Mcfarlin