We explore how AI's data workers see their role evolving, and their ideas for what value they will add in the future.
The past year was filled with impressive examples of what AI can achieve, to the amusement of some and the discouragement of others. You might be playing around with image generation, bringing your latest sketches to life, or getting support as you write your blog post to make sure it’s catchy enough. People of all ages are experiencing and interacting with AI technology more seamlessly than ever. We celebrate the most innovative companies around the world and build on their technological advances. Everyone is racing to release the fastest and brightest LLM, and the race is tight. Articles comparing Claude, Gemini, and GPT keep coming, and the availability of models on open platforms like Hugging Face makes it easy to start exploring what each has to offer for your business case.
In between all the press releases, it’s easy to lose sight of who made this possible. The crew behind a successful model is large and diverse. Just like any Hollywood blockbuster, a huge number of people contributed. But the data workers who spent countless hours crafting the datasets on which models are trained and evaluated very rarely get any credit.
This article is not about emphasizing how little recognition data workers get for their work. Many articles and books have already explored this topic in depth; Ghost Work by Mary L. Gray is a must-read on this. What we want to explore is how these workers see their role evolving, and their ideas for what value they will add in the future.
If we look back at some of the AI progress mentioned at the beginning of this article, we have to acknowledge that it will change the role of data workers. Like many parts of the workforce, they see their expertise being replicated by models, yet they still have to coach those models to improve.
In conversations with our colleagues, some mentioned they worry automated solutions might “lack empathy or special-case subjectivity that is usually a huge part in decision-making,” or “that the human touch might not be replicated.” After dedicating years of their lives to annotating data, they have now taught models so well that the very work they did is losing importance. It’s a fear some share: “my skills will become limited because the machine is learning and adapting.” This adaptation is particularly visible when team members leverage AI tools like SAM or Grounding DINO to pre-annotate data. The labeling industry has long focused on annotating data from scratch, a task that will become increasingly automatable as models keep improving.
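To make this shift concrete, here is a minimal sketch of what model-assisted pre-annotation could look like, using Hugging Face’s zero-shot object detection pipeline. The checkpoint, image path, label list, and confidence threshold are all illustrative stand-ins; a team working with SAM or Grounding DINO would follow the same pattern of generating proposals for annotators to verify or correct, rather than drawing every label from scratch.

```python
# A minimal pre-annotation sketch (illustrative, not a production pipeline):
# a zero-shot detector proposes boxes that annotators then verify or correct,
# instead of drawing every box from scratch.

from PIL import Image
from transformers import pipeline

# OWL-ViT is used here as a stand-in for tools like Grounding DINO or SAM.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("street_scene.jpg")                      # hypothetical asset
candidate_labels = ["car", "pedestrian", "traffic light"]   # hypothetical label set

proposals = detector(image, candidate_labels=candidate_labels)

# Keep only confident proposals as pre-annotations; everything else is
# left for the annotator to label manually.
pre_annotations = [
    {"label": p["label"], "box": p["box"], "score": round(p["score"], 3)}
    for p in proposals
    if p["score"] > 0.4
]
print(pre_annotations)
```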
So what now?
It’s much more likely that the role humans play in dataset creation and data enrichment will evolve into something different, as Jason Corso, co-founder of the popular computer vision company Voxel51, points out in his recent article. At Sama, we’ve been working hand in hand with our associate teams for more than 10 years. We see the talent, dedication, and expertise of our workers. We chatted with ML team members and data enrichment workers about the value they see themselves bringing in the future. For them, we need to “put more resources into new/emerging fields annotators and their teams can move into; this way we place ourselves as solution providers and SMEs when other firms catch up and start moving to new job fields.”
They anticipate that their roles will evolve and want to stay ahead of the curve, ready for the new expertise they can bring. Some also shared that we should start to “think of alternative ways to use human skill sets and start teaching/prepping our annotation teams in advance.” With this in mind, here is our exploration of the future roles they could play:
Humans can add great value to red teaming exercises. Since red teaming has to be iterative and help map the landscape of vulnerabilities for your AI system, doing it jointly with humans can be more effective. By selecting a diverse team of red teamers, you can also ensure different approaches are explored to expose vulnerabilities. Red teamers can help select which areas are worth digging into more deeply, and help orchestrate the testing.
Validation tasks can look very similar to data labeling if you simply ask workers to look at your assets and either correct or confirm a label. But you can add much more value if you tap into human knowledge and data expertise.
With a two-step validation, you could get both high-level insights and a detailed, complete re-annotation of a diverse set of mistakes. You could also ask data workers to provide more context on subjective mistakes by correcting them multiple times according to different interpretations. For example, if a shape is blurry and could plausibly fit either a smaller bounding box or a much longer one, they could provide both corrections and explain the two equally valid perspectives.
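As a sketch of what capturing those multiple interpretations could look like, the structure below records an original box together with several equally valid corrections, each carrying the worker’s rationale. The field names and values are hypothetical; the point is that the ambiguity itself becomes part of the feedback sent back to the model team.

```python
# A minimal, hypothetical record for one ambiguous asset: the original box
# plus multiple valid corrections, each explained by the reviewer.

from dataclasses import dataclass, field

@dataclass
class Correction:
    box: list        # [x_min, y_min, x_max, y_max] in pixels
    rationale: str   # the worker's explanation of this interpretation

@dataclass
class ReviewRecord:
    asset_id: str
    original_box: list
    corrections: list = field(default_factory=list)  # one entry per valid interpretation

record = ReviewRecord(
    asset_id="frame_0142",
    original_box=[120, 80, 260, 190],
    corrections=[
        Correction(box=[130, 85, 240, 180],
                   rationale="Tight box: only the clearly visible part of the object."),
        Correction(box=[110, 80, 320, 195],
                   rationale="Long box: includes the blurry region that likely belongs to the object."),
    ],
)
```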
Spending time validating models by correcting annotations means that data workers gain a good understanding of where your model is failing and where it’s doing well. They can actively participate in identifying gaps in the data they’ve seen, and help create synthetic assets that fill those gaps. This is an exciting area for upskilling: working more closely with model outputs can transform their role into a more creative one and give them an understanding of how AI evolves and where human intervention can have a positive impact.
Not all data is created equal, and with more and more data generated by AI, it will become harder to know which data holds the most value for your system and which is simply noise or repetition. Data enrichment workers can identify trends in the data at a higher level. One tangible example is data curation, where workers leverage filters to identify trends across different dimensions of your data and spot-check model performance on those assets.
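A minimal sketch of what this kind of slice-based curation could look like is shown below, assuming a hypothetical metadata table in which each asset carries a few attributes and a reviewer’s verdict on the model’s prediction. The column names and values are illustrative only.

```python
# A minimal slice-based curation sketch: combine metadata dimensions to
# surface slices where the model underperforms, so workers know where to
# focus spot-checks or re-annotation. Columns and values are hypothetical.

import pandas as pd

assets = pd.DataFrame({
    "asset_id":    ["a1", "a2", "a3", "a4", "a5", "a6"],
    "weather":     ["rain", "rain", "clear", "clear", "rain", "clear"],
    "time_of_day": ["night", "day", "night", "day", "night", "day"],
    "correct":     [0, 1, 1, 1, 0, 1],   # 1 = model prediction confirmed by a reviewer
})

slice_accuracy = (
    assets.groupby(["weather", "time_of_day"])["correct"]
          .agg(["mean", "count"])
          .rename(columns={"mean": "accuracy", "count": "n_assets"})
          .sort_values("accuracy")
)
print(slice_accuracy)
```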
More data generated by AI means it is getting harder to know whether content is authentic. We hear legal entities talking about watermarking AI-generated content. But identifying whether an image is real, an artist exists, a speech took place, or a photo was taken at a given time by a particular photographer becomes much more complex. This type of detective work could be a good fit for data workers, given the proper research tools. With the rise of deepfakes, it may become some of the most important work for warding off fraud and protecting artists, creators, and other professionals.
As AI continues to progress, we know humans will keep playing a key role, especially on the frontier of what we want to build next. They are and will remain the safeguard for any AI system, and will continue to help make today’s AI safer and more reliable.
We believe that giving data enrichment workers a larger role in the production chain of model development is a great opportunity, and we hope these new roles will lead to more visibility and respect for the work they do. When we spoke with our team, they stressed how important it is to “expose people to the AI capabilities currently so they are caught up and aware of the way things are changing.”
Having workers involved from the initial stages all the way through model monitoring and testing can make models more resilient to change and reduce vulnerabilities. With new regulations coming in, data traceability will become essential: being able to identify who worked on your data and when, and to follow the value chain. We want to make sure we involve data workers in crafting the future roles they see for themselves and in becoming advocates for the importance of their work.
Bryan Gachambi: Product Operations Coordinator, supporting and empowering users through their journey to better efficiency and data acumen. Music and coffee are his fuel of choice.
Claudel Rheault: Human-AI Interaction Lead, exploring data enrichment work and how to improve workers’ day-to-day experience. Can make a good risotto, and loves geology.
Jobbic Jivetti, Keli Kelvin & Eunice Njoroge graciously agreed to share their perspectives and ideas with us for this article.