AI and the "Why Now" of Data DAOs
Data DAOs represent one path to generating new high-quality data sets and overcoming the data wall in AI
Recent high-profile data licensing deals, such as OpenAI's agreements with News Corp and Reddit, underscore the need for high-quality data in AI. Frontier models are already trained on much of the internet—for example, Common Crawl, which indexes about 10% of all web pages, is used for LLM training and contains over 100 trillion tokens.
An avenue for further improvement in AI models is to expand and enhance the data they can train on. We’ve been discussing mechanisms for how data could be aggregated—particularly in a decentralized way. We’re especially interested in exploring how decentralized methods could help generate new datasets and economically reward contributors and creators.
One topic of discussion within crypto in the last few years is the idea of data DAOs, or collectives of individuals who create, organize, and govern data. The topic has been covered by Multicoin and others, but the rapid advancement of AI is a catalyst for a new “why now?” of data DAOs.
We wanted to share our thinking around the topic of data DAOs, in pursuit of the question: how can data DAOs accelerate AI development?
Data in AI Today
Today, AI models are trained on public data, either via partnerships like the News Corp and Reddit deals, or by scraping the open internet. For example, Meta's Llama 3 was trained on 15 trillion tokens from publicly available sources. These approaches have been effective at aggregating large amounts of data quickly—but they have limitations, both in terms of what data they collect and how.
First, the what: AI development is bottlenecked by data quality and quantity. Leopold Aschenbrenner has written about the “data wall” that limits further algorithmic improvements: “Very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.”
One way to push out the data wall is to open up availability of new datasets. For example, model companies are unable to scrape login-gated data without violating most websites’ terms of service, and, by definition, don’t have access to data that has not yet been aggregated. There are also vast quantities of private data that are out of reach for AI training today: think enterprise Google Drives, company Slacks, personal health data, or private messages.
Second, the how: Under the existing paradigm, companies that aggregate data capture the majority of the value. Reddit’s S-1 features data licensing as a major anticipated revenue stream: “We expect our growing data advantage and intellectual property to continue to be a key element in the training of future LLMs.” End users who generate the actual content don’t see any kind of economic benefit from these licensing deals or from the AI models themselves. This misalignment may stifle participation—already there are movements to sue generative AI companies or opt out of training data sets. That’s not to mention the socioeconomic implications of concentrating revenue in the hands of model companies or platforms, without passing a share along to end users.
The Data DAO Effect
A common thread runs through the data problems outlined above: they benefit from scaled contributions from a diverse, representative sample of users. Any individual data point might be negligible in value to a model’s performance, but collectively, a large group of users can aggregate novel data sets that are valuable for AI training. This is where the idea of data DAOs can fit in. With data DAOs, data contributors could see economic upside from contributing data as well as govern how that data is used and monetized.
What are some gaps in the current data landscape that data DAOs could address? Below are some ideas—note that this list is non-exhaustive, and there are certainly other opportunities for data DAOs:
Real-world data
In the decentralized physical infrastructure (DEPIN) world, networks like Hivemapper aim to collect the world's freshest global map data by incentivizing dashcam owners to contribute footage, as well as users to contribute updates via its app (for instance about road closures or repairs). One lens through which to view DEPINs is as real-world data DAOs, where the data set is generated from a network of hardware devices and/or users. That data is of commercial interest to various companies, with revenues accruing back to contributors in the form of token rewards.
Personal health data
Biohacking is a social movement in which individuals and communities take a DIY approach to studying biology, oftentimes by experimenting on themselves. For example, individuals may consume different nootropics to boost brain performance, or test different therapeutics or environmental changes to enhance sleep, all the way up to injecting oneself with experimental drugs.
Data DAOs could bring structure and incentives to these biohacking efforts by organizing participants around common experiments and methodically collecting results. Revenue earned by these personal health DAOs, for instance from research labs or pharma companies, could be passed back to the participants who contributed their personal health data.
Reinforcement learning with human feedback
Fine-tuning AI models with RLHF (reinforcement learning with human feedback) involves leveraging human input to improve the performance of AI systems. Often, the desired profile of the feedback giver is an expert in their field, who can effectively assess the model's output. For example, labs may seek math PhDs to improve the math capabilities of their LLMs. Token rewards can play a role in sourcing and incentivizing expert participation through their speculative upside, not to mention the global access afforded by using crypto payment rails. Companies like Sapien, Fraction, and Sahara are working in this space.
Private data
As publicly available data becomes exhausted for AI training, the basis of competition will likely shift towards proprietary datasets, including private user data. Vast amounts of high-quality data remain inaccessible behind login walls and in direct messages, private documents, etc. Such data could not only be effective in training personal AIs, but contain valuable information that isn’t accessible on the public web.
However, accessing and utilizing this data poses significant challenges, both legally and ethically. Data DAOs could offer a solution by enabling willing participants to upload and monetize their data and govern how it is used. For example, the Reddit data DAO allows users to upload their Reddit data, exported from the platform itself and containing comments, posts, and voting history, to a data treasury that can be sold or rented in a privacy-preserving way to AI companies. Token incentives allow users to earn not just as a one-time transaction for their data, but based on the value created by the AI models trained on their data.
Open Questions & Challenges
While the potential benefits of data DAOs are significant, there are several considerations and challenges.
Distortionary impact of incentives
If there’s one thing to be gleaned from the history of using token incentives in crypto, it’s that extrinsic incentives alter user behavior. That has direct implications for leveraging token incentives for data purposes: incentives could skew the participant base and the type of data being contributed.
The introduction of token incentives also creates the potential for participants to game the system, submitting low-quality or fabricated data to maximize their earnings. This matters because the revenue opportunity for these data DAOs depends on data quality. If contributions are skewed, it undermines the value of the dataset.
Data measurement and rewards
Core to a data DAO is the idea that contributors are rewarded for their submissions via token incentives, which over the long run should track the revenue earned by the DAO. However, knowing exactly how much to reward various data contributions is challenging, given the subjective nature of data value. In the biohacking example above: are some users' data more valuable than others'? If so, what are those determinants? And for mapping data: are some geographies' mapping information worth more than others, and how would such a difference be quantified? (There is active research around measuring data value in AI by calculating its incremental contribution to model performance, but such methods can be computationally intensive.)
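One concrete version of "incremental contribution to model performance" is leave-one-out valuation: score the model trained on everyone's data, then re-score it with each contributor's data removed, and reward contributors in proportion to the drop they cause. The sketch below is a minimal, hypothetical illustration of that idea — the "model" is just a mean estimator, the utility metric is negative squared error on a validation set, and all contributor names and numbers are invented. Real data DAOs would need far more robust models and metrics (and leave-one-out itself is a cheap proxy for more principled but costlier methods like Shapley-based valuation).

```python
import statistics

def utility(train_points, val_points):
    """Model-quality proxy: fit a mean estimator on the training data,
    score it by negative mean squared error on a held-out validation set."""
    if not train_points:
        return float("-inf")
    est = statistics.fmean(train_points)
    return -statistics.fmean((v - est) ** 2 for v in val_points)

def leave_one_out_values(contributions, val_points):
    """Each contributor's value = drop in utility when their data is removed."""
    all_points = [x for pts in contributions.values() for x in pts]
    base = utility(all_points, val_points)
    values = {}
    for name in contributions:
        rest = [x for other, pts in contributions.items()
                if other != name for x in pts]
        values[name] = base - utility(rest, val_points)
    return values

def reward_split(values, revenue):
    """Distribute DAO revenue pro rata over non-negative contributor values."""
    positive = {k: max(v, 0.0) for k, v in values.items()}
    total = sum(positive.values()) or 1.0
    return {k: revenue * v / total for k, v in positive.items()}

# Toy example: the validation set is centered near 10; alice's data is
# informative, bob's is noisy and off-target, so bob's removal helps.
contributions = {"alice": [9.0, 10.0, 11.0], "bob": [30.0, -5.0]}
vals = leave_one_out_values(contributions, val_points=[9.5, 10.5])
payouts = reward_split(vals, revenue=1000.0)
```

Note the design choice in `reward_split`: contributors whose data hurts validation performance (negative value) earn nothing rather than owing money, which is one simple way to blunt the low-quality-submission problem discussed above.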
Moreover, establishing robust mechanisms to verify the authenticity and accuracy of data is crucial. Without such measures, the system could be susceptible to fraudulent data submissions (e.g. creation of fake accounts) or Sybil attacks. DEPIN networks attempt to resolve this by integrating at the hardware device level, but other types of data DAOs that would depend on user-driven contributions can be susceptible to manipulation.
Incrementality of new data
Most of the open web has already been utilized for training purposes, and so a data DAO operator must consider whether their data set, collected through a distributed effort, is truly incremental and additive to the existing data available on the open web—and whether researchers could license that data from platforms or procure it through other means. The ideas outlined above emphasize the importance of gathering net-new data that goes beyond what currently exists, which then leads to the next consideration: magnitude of impact and the revenue opportunity.
Sizing the revenue opportunity
Essentially, data DAOs are building a two-sided marketplace, connecting data buyers with data contributors. The success of data DAOs thus hinges on attracting a stable and diverse customer base willing to pay for data.
Data DAOs need to identify and validate their end demand and ensure that the revenue opportunity is large enough, both on an aggregate basis and on a per-contributor basis, to incentivize the quantity and quality of data needed. For example, the idea of creating a user data DAO to pool together personal preference and browsing data for the purposes of advertising has been discussed for years, but ultimately, the revenue that such a network would be able to pass along to users is likely minimal. (As a comparison, Meta’s global ARPU at the end of 2023 was $13.12.) With AI companies planning to spend trillions of dollars on training, the revenue per user for their data could be compelling enough to induce large-scale contribution, posing an interesting “why now” for data DAOs.
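The per-contributor math above can be made explicit with a back-of-the-envelope calculation. Every number in this sketch is a hypothetical assumption for illustration (the deal size, contributor count, and fee are invented, not figures from any real DAO); the point is simply that per-contributor payout is total data revenue, minus the DAO's take, divided by the contributor base, which can then be compared against a benchmark like Meta's ~$13 global ARPU.

```python
def per_contributor_payout(annual_data_revenue, contributors, dao_take_rate=0.10):
    """Annual revenue passed to each contributor after the DAO
    retains a protocol fee (hypothetical parameters throughout)."""
    return annual_data_revenue * (1 - dao_take_rate) / contributors

# e.g. a hypothetical $50M/yr licensing deal split across 1M contributors,
# with the DAO keeping a 10% fee:
payout = per_contributor_payout(50_000_000, 1_000_000)  # → 45.0 dollars/year
```

The sensitivity is clear from the formula: the same deal split across 10M contributors yields $4.50 each, well below the ARPU benchmark, which is why validating both aggregate demand and contributor count matters before launching.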
Overcoming the Data Wall
Data DAOs represent one potentially promising path to generating new high-quality data sets and overcoming the data wall in AI. Exactly how that comes to fruition remains to be seen, but we’re excited to see this space develop.
If you’re a builder working in this space, please reach out—we’d love to hear from you.
Thanks to Matt Lim, Tom Hamer, Anastasios Angelopoulos, Nish Bhat, and Jason Zhao for their review, and to the Variant team for their conversations which contributed to these ideas!