In February 2024, Reddit reached a $60 million deal with Google, allowing the search giant to use data from the platform to train its AI models. Notably absent from the discussions were Reddit users, whose data was being sold.
The deal reflects the reality of the modern internet: Big tech companies own virtually all of our online data and decide what to do with it. Not surprisingly, many platforms monetize their data, and the fastest way to do that today is to sell it to AI companies, which are themselves large tech companies using the data to train increasingly powerful models.
Vana, a decentralized platform that began as a class project at MIT, is on a mission to give power back to users. The company has created a fully user-owned network that allows individuals to upload their data and control how it is used. AI developers can pitch users on ideas for new models, and if users agree to contribute their data for training, they receive proportional ownership in the models.
The idea is to provide everyone with a stake in an AI system that will increasingly affect our society while also unlocking new data pools to advance technology.
“This data is needed to create better AI systems,” said Anna Kazlauskas ’19, co-founder of Vana. “We have created a decentralized system to get better data, which sits inside big tech companies today, while still allowing users to retain ultimate ownership.”
From Economics to Blockchain
Many high school students have pictures of pop singers or athletes on their bedroom walls. Kazlauskas had a photo of former U.S. Treasury Secretary Janet Yellen.
Kazlauskas was sure she would become an economist, but she ended up being one of five students to join the MIT Bitcoin Club in 2015, an experience that led her into the world of blockchain and cryptocurrencies.
From her dorm room in MacGregor House, she began mining the cryptocurrency Ethereum. She even occasionally searched campus trash bins for discarded computer chips.
“It got me interested in everything about computer science and the web,” Kazlauskas said. “From a blockchain perspective, that involved distributed systems and how they can shift economic power to individuals, as well as artificial intelligence and econometrics.”
Kazlauskas met Art Abal, who was then attending Harvard University, in the former Media Lab class Emergent Ventures, and the two decided to work on a new approach to obtaining data for training AI systems.
“Our question was: How could people contribute to these AI systems using a more distributed network?” Kazlauskas recalled.
Kazlauskas and Abal were trying to address the status quo, in which most models are trained by scraping public data from the internet. Large tech companies often also purchase large datasets from other companies.
The founders’ approach evolved over the years and was informed by Kazlauskas’ experience working at the financial blockchain company Celo after graduation. But Kazlauskas credits her time at MIT with helping her think about these problems, and Emergent Ventures instructor Ramesh Raskar still helps Vana think through AI research questions today.
“It was great to have an open-ended opportunity to build, hack, and explore,” Kazlauskas said. “I think the spirit of MIT is really important. It’s just about building things, seeing what works, and keeping it going.”
Today, Vana takes advantage of lesser-known laws that allow users of most large technology platforms to export their data directly. Users can upload that information into an encrypted digital wallet in Vana and use it to train models as they see fit.
AI engineers can suggest ideas for new open-source models, and people can pool their data to help train them. In the blockchain world, these data pools are called data DAOs, where DAO stands for decentralized autonomous organization. The data can also be used to create personalized AI models and agents.
In Vana, data is used in a way that preserves user privacy because the system does not expose identifiable information. Once a model is created, users maintain ownership, so that every time it is used, they are rewarded in proportion to how much their data helped train it.
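The proportional-ownership idea can be illustrated with a small sketch. It is not Vana's actual mechanism: the article does not say how a contribution is measured, so the example simply assumes each user's share equals their fraction of the pooled records, and the function and user names are hypothetical.

```python
from fractions import Fraction


def ownership_shares(contributions):
    """Map each contributor to their fraction of the pooled data.

    Assumption: a contribution is measured as a plain record count.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("pool must contain at least one contributed record")
    return {user: Fraction(count, total) for user, count in contributions.items()}


def split_reward(shares, reward):
    """Divide a single usage reward among contributors according to their shares."""
    return {user: share * reward for user, share in shares.items()}


# Hypothetical pool: three users contribute different amounts of data.
pool = {"alice": 500, "bob": 300, "carol": 200}
shares = ownership_shares(pool)
payouts = split_reward(shares, Fraction(10))

for user in pool:
    print(user, shares[user], payouts[user])
```

Exact fractions are used so that the shares always sum to exactly one, which keeps the split free of rounding drift.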
“From a developer’s perspective, now you can build these super personalized health applications that accurately consider what you eat, how you sleep, how you exercise,” Kazlauskas said. “These applications are not possible today because of the walled gardens of large tech companies.”
Crowdsourced, user-owned AI
Last year, a machine-learning engineer used Vana user data to train an AI model that can generate Reddit posts. More than 140,000 Vana users contributed their Reddit data, which included posts, comments, messages, and more. Users decided the terms under which the model could be used, and they retained ownership of the model after it was created.
Vana has enabled similar initiatives with user-contributed data from the social media platform X; sleep data from sources like Oura rings; and more. There are also collaborations that combine data pools to create a wider range of AI applications.
“Suppose users have Spotify data, Reddit data, and fashion data,” Kazlauskas explained. “Usually, Spotify isn’t going to collaborate with those types of companies, and there is actually regulation against that. But users can do it if they grant access, so these cross-platform datasets can be used to create truly powerful models.”
Vana has more than 1 million users and more than 20 live data DAOs. Users have proposed more than 300 additional data pools on Vana’s system, and Kazlauskas said many will go into production this year.
“I think there is a lot of promise in generalized AI models, personalized medicine, and new consumer applications, because it’s hard to combine all that data or get access to it in the first place,” Kazlauskas said.
These data pools allow communities of users to accomplish something that even the most powerful tech companies struggle with today.
“Today, big tech companies have built these data moats, so no one can use the best datasets,” Kazlauskas said. “It’s a collective action problem, where my data on its own isn’t that valuable, but a data pool with tens of thousands or millions of people is really valuable. Vana allows those pools to be built. It’s a win-win: Users get to benefit from the rise of AI because they own the models. Then you don’t end up in a scenario where a single company controls an all-powerful AI model. You get better technology, but everyone benefits.”