We explore how AI's data workers see their role evolving, and their ideas for what value they will add in the future.
The past year was filled with impressive examples of what AI can achieve, to the amusement of some and the discouragement of others. You could be playing around with image creation, bringing your latest sketches to life, or even getting support as you write a blog post to make sure it’s catchy enough. People of all ages are experiencing and interacting with AI technology more seamlessly than ever. We celebrate the most innovative companies around the world and build on their technological advances. Everyone is fighting to release the fastest and brightest LLM, and the race is tight. Articles comparing Claude, Gemini and GPT keep coming, and the availability of models on open-source platforms like Hugging Face makes it easy to start exploring what each has to offer for your business case.
In between all the press releases, it’s easy to lose sight of who made this possible. The crew behind the success of good models is large and diverse. Just like any Hollywood blockbuster, a huge number of people contributed. But the data workers who spent countless hours crafting the datasets on which models are trained and evaluated very rarely get any credit.
It’s much more likely that the role humans play in dataset creation and data enrichment will evolve into something different, as Jason Corso, co-founder of the popular computer vision tool Voxel51, points out in his recent article. At Sama, we’ve been working hand in hand with our associate teams for more than 10 years. We see the talent, dedication and expertise of our workers. We chatted with ML team members and data enrichment workers about the future value they see themselves bringing. In their words, we need to “Put more resources into new/emerging fields annotators and their teams can move into, this way we place ourselves as solution providers and SMEs when other firms catch up and start moving to new job fields.”
They anticipate that their roles will evolve, and they want to stay ahead of the curve so they are ready to offer the expertise that will be needed. Some also shared how we should start to “think of alternative ways to use human skill sets and start teaching/prepping our annotation teams in advance.” With this in mind, here is our exploration of the future roles they could play:
Humans can add great value to red teaming exercises. Since red teaming has to be iterative and help map a world of vulnerabilities for your AI system, doing this jointly with humans can be more effective. By selecting a diverse team of red teamers, you can also ensure different approaches are explored to expose vulnerabilities. Red teamers can help select which areas are worth digging deeper into, and help orchestrate the testing.
Validation tasks can look very similar to data labeling if you ask workers to simply look at your assets and either correct or confirm a label. But you can add much more value if you tap into human knowledge and data expertise.
With a two-step validation, you could get both high-level insights and a detailed, complete re-annotation of a diverse set of mistakes. You could also ask data workers to provide more context on subjective mistakes by correcting them multiple times according to different interpretations. For example, if a shape is blurry and could be enclosed either by a smaller bounding box or by a much longer one, they could provide both corrections and explain the two equally valid perspectives.
Spending time validating models by correcting annotations means that data workers gain a good understanding of where your model is failing and where it’s doing well. They can actively participate in identifying gaps in the data they’ve seen, and help create synthetic assets that can fill those gaps. This is an exciting area for up-skilling: working more closely with model outputs has the power to transform their role into a more creative one and give them an understanding of how AI evolves and where human intervention can have a positive impact.
Not all data is created equal, and with more and more data generated by AI, it will become harder to know what data holds the most value for your system, versus what is simply noise or repetition. Data enrichment workers can identify trends at a higher level in the data. One tangible example of this could be data curation, where workers leverage filters to identify trends when combining different dimensions of your data and spot-check your model performance on these assets.
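To make the curation example a little more concrete, here is a minimal Python sketch of what combining dimensions and spot-checking might look like. It is an illustration under stated assumptions, not a description of any particular tool: the field names (`weather`, `time_of_day`, `model_correct`), the helper names, and the sample records are all hypothetical.

```python
from collections import defaultdict


def slice_accuracy(records, dimensions):
    """Group records by the chosen dimensions; return {slice: (accuracy, count)}."""
    totals = defaultdict(lambda: [0, 0])  # slice key -> [correct, total]
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key][0] += int(record["model_correct"])
        totals[key][1] += 1
    return {key: (correct / total, total) for key, (correct, total) in totals.items()}


def weakest_slices(records, dimensions, min_count=2):
    """List slices with at least min_count assets, lowest accuracy first."""
    stats = slice_accuracy(records, dimensions)
    eligible = [(key, acc, n) for key, (acc, n) in stats.items() if n >= min_count]
    return sorted(eligible, key=lambda item: item[1])


# Made-up records: each one is an asset with two metadata dimensions and a
# flag saying whether a reviewer confirmed the model's prediction.
records = [
    {"weather": "rain", "time_of_day": "night", "model_correct": False},
    {"weather": "rain", "time_of_day": "night", "model_correct": False},
    {"weather": "rain", "time_of_day": "day", "model_correct": True},
    {"weather": "rain", "time_of_day": "day", "model_correct": False},
    {"weather": "clear", "time_of_day": "day", "model_correct": True},
    {"weather": "clear", "time_of_day": "day", "model_correct": True},
    {"weather": "clear", "time_of_day": "night", "model_correct": True},
]

for key, accuracy, count in weakest_slices(records, ["weather", "time_of_day"]):
    print(key, f"accuracy={accuracy:.2f}", f"assets={count}")
```

The slices at the top of the list are candidates for a closer look by data workers, who can judge whether the pattern reflects a real gap in the data or simply noise. The `min_count` threshold is an arbitrary choice here to avoid flagging slices with too few assets to say anything meaningful.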
More data generated by AI means it is getting harder to know whether content is authentic. We hear legal entities talking about the watermarking of AI-generated content. But identifying whether an image is real, whether an artist exists, whether a speech took place, or whether a photo was taken at a given time by a particular photographer becomes much more complex. This type of detective work could be a good fit for data workers, given the proper research tools. With the rise of deepfakes, it may become the most important work of all for warding off fraud and protecting artists, creators, and other professionals.
As AI continues to progress, we know humans will keep playing a key role, especially on the frontier of what we want to build next. They are and will remain the safeguard for any AI system, and will continue to help make today’s AI safer and more reliable.
We believe that giving data enrichment workers a larger role in the production chain of model development is a great opportunity, and we hope that these new roles will lead to more visibility and respect for the work they do. When we spoke with our team, they stressed how important it is to “expose people to the AI capabilities currently so they are caught up and aware of the way things are changing.”
Having workers involved from the initial stages all the way through model monitoring and testing can make models more resilient to change and reduce vulnerabilities. With new regulations coming in, data traceability will become essential. This includes being able to identify who worked on your data and when, and to follow the value chain. We want to make sure we involve data workers in crafting the future roles they see for themselves and in becoming advocates for the importance of their work.