At Sama, we’ve developed tools that streamline the data curation process so that every selected data sample aligns with your goals. Our data curation pipeline, typically run by an ML Applied Scientist, calls for rapid experimentation and frequent configuration adjustments, so we leverage the Valohai platform to boost efficiency and reduce costs, all without requiring DevOps support.
In supervised learning, particularly when dealing with deep learning models like those used for object detection in computer vision, the quality of your training data is everything. However, with vast amounts of unlabeled data and limited budgets for annotation, the challenge is to make sure every dollar spent contributes to improving the model's performance.
At Sama, we've created data curation tools and processes that integrate seamlessly into the annotation process. These help to select the most relevant data samples, feeding them directly into our annotation workflows. Moreover, our system continually refines its data selection based on feedback from the annotation process, creating a loop of constant improvement.
Let’s take automotive video data as an example. Our pipeline first cleans the data and then breaks the videos down into frames, which are processed using image embedding models and object detectors. These techniques help us identify the most relevant frames for annotation, especially those that align with the client’s specific goals, such as targeting certain object classes or ensuring scene diversity. To further refine the dataset, we apply frame de-duplication to eliminate frames that are too similar, such as consecutive frames where little to no change occurs. This ensures that only frames with high informational value are preserved.
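To make the de-duplication step concrete, here is a minimal sketch of one way it could work: compare each frame's embedding with the last frame that was kept, and drop the frame if the two are nearly identical. This illustrates the general idea and is not Sama's production code. The function name, the 0.95 cosine-similarity threshold, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import numpy as np


def deduplicate_frames(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of frames to keep, dropping near-duplicates of the last kept frame.

    `embeddings` has shape (n_frames, dim) and is in temporal order.
    `threshold` is a hypothetical cosine-similarity cutoff chosen for illustration.
    """
    if len(embeddings) == 0:
        return []
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept = [0]  # always keep the first frame
    for i in range(1, len(unit)):
        # Compare against the most recently kept frame rather than the previous
        # frame, so slow drift over many frames still registers as a change.
        if float(unit[i] @ unit[kept[-1]]) < threshold:
            kept.append(i)
    return kept


# Toy example: frames 0-2 are nearly identical, frame 3 is a new scene.
frames = np.array([[1.0, 0.0], [0.999, 0.01], [0.998, 0.02], [0.0, 1.0]])
kept_indices = deduplicate_frames(frames, threshold=0.95)
print(kept_indices)
```

In practice the threshold would need tuning per project: a lower value removes more frames and saves annotation budget, while a higher value preserves subtle scene changes.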
Embeddings, which summarize the key features of an image into a set of numbers, allow our system to quickly identify frames that are most likely to contain the objects of interest. We use models such as OpenAI’s CLIP, which are adept at understanding both images and text, to generate these embeddings. In addition, object detection models help us locate and identify specific objects within each frame, such as cars, pedestrians, or street signs.
This layered approach to data processing means that by the time we’re ready for annotation, we’ve curated a list of frames that align with the project’s objectives. Techniques like similarity searches, metadata filtering, and automatic clustering further refine this selection process.
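As a rough illustration of how two of these techniques can be combined, the sketch below filters frames by metadata and then ranks the remaining ones by cosine similarity to a query embedding. In a real pipeline the vectors would come from an embedding model such as CLIP; here they are small hand-written placeholders. The `time_of_day` field, the frame ids, and the function name are hypothetical, and clustering is left out to keep the example short.

```python
import numpy as np


def top_k_similar(query: np.ndarray, embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k rows of `embeddings` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                      # cosine similarity per frame
    order = np.argsort(-scores)         # highest similarity first
    return [int(i) for i in order[:k]]


# Hypothetical frame records and placeholder embeddings (one row per frame).
frames = [
    {"id": "a", "time_of_day": "day"},
    {"id": "b", "time_of_day": "night"},
    {"id": "c", "time_of_day": "night"},
    {"id": "d", "time_of_day": "day"},
]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.0])  # stands in for, e.g., a text or image query embedding

# Step 1: metadata filter (keep night-time frames only).
candidates = [i for i, f in enumerate(frames) if f["time_of_day"] == "night"]

# Step 2: similarity search restricted to the filtered candidates.
ranked = top_k_similar(query, embeddings[candidates], k=1)
selected_ids = [frames[candidates[i]]["id"] for i in ranked]
print(selected_ids)
```

Filtering before searching keeps the similarity computation small. At the scale of millions of frames, an approximate nearest-neighbor index would likely replace the brute-force dot product used here.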
A crucial part of this process is choosing the right models for generating embeddings and detecting objects. While off-the-shelf models can sometimes suffice, they often require fine-tuning to perform optimally. Fine-tuning these models is a delicate balance, requiring machine learning expertise and a willingness to experiment, mostly because the models need to operate in low-data regimes. When we don’t have any data to start with, we often begin with an open-vocabulary model like Grounding DINO to gain an initial understanding of the composition of the frames. This helps us identify key elements within the data early on, guiding our curation efforts. Our goal here isn’t to develop state-of-the-art models but to create models that efficiently support our data curation process, even with limited annotated data.
At Sama, we’ve found that while automation tools can help find the right model under the right configuration, there’s no substitute for the intuition and skill of an experienced ML Applied Scientist. That’s why we rely on Valohai, which not only speeds up our process but also ensures secure, scalable, and efficient management of big, unstructured data. Valohai is a technology-agnostic MLOps platform trusted by Fortune 500 companies and industry disruptors alike. It enables CI/CD for machine learning and end-to-end pipeline automation across on-premises, cloud, and hybrid environments.
With Valohai, we can quickly adapt to our clients’ needs, reducing reliance on DevOps and IT, and keeping our data curation process both effective and flexible. This approach not only improves the quality of the data that gets annotated for our clients but also ensures that every step of the process is aligned with their evolving objectives.
In one of our recent automotive data curation projects, our client provided us with tens of thousands of 2-minute videos at 24 frames per second, for a total of 6 TB of data. From this, we successfully extracted and annotated 70 thousand frames (around 0.04% of the original data) containing dozens of classes of interest, such as pedestrian crossing signs and deer crossing signs. We began by estimating the composition of these videos, essentially predicting the distribution of classes within the data, using off-the-shelf object detectors combined with manual curation from our launch team. This initial analysis revealed a highly imbalanced distribution, with certain objects significantly overrepresented compared to other low-frequency classes. To address this, we employed techniques like similarity searches using CLIP embeddings to uncover more instances of these rarer classes.
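The composition estimate described above can be pictured as a simple tally over detector outputs. The sketch below is an illustration under stated assumptions, not the actual analysis: the detection format (a list of `(label, score)` pairs per frame), the 0.5 confidence cutoff, and the toy detections are all hypothetical.

```python
from collections import Counter


def estimate_class_distribution(detections_per_frame, min_score=0.5):
    """Return, for each class, the share of frames in which it appears at least once.

    `detections_per_frame` is a list with one entry per frame; each entry is a
    list of (label, score) pairs from an object detector (assumed format).
    """
    total = len(detections_per_frame)
    if total == 0:
        return {}
    frame_counts = Counter()
    for detections in detections_per_frame:
        # A set counts each class once per frame, however many boxes it has.
        classes = {label for label, score in detections if score >= min_score}
        frame_counts.update(classes)
    return {label: count / total for label, count in frame_counts.items()}


# Toy detections for four frames.
detections = [
    [("car", 0.9), ("car", 0.8)],
    [("car", 0.7), ("pedestrian_crossing_sign", 0.6)],
    [("car", 0.95), ("deer_crossing_sign", 0.3)],  # low-confidence sign is ignored
    [],
]
distribution = estimate_class_distribution(detections)
print(distribution)
```

Even on this toy input, the imbalance shows up immediately: cars appear in most frames while the sign classes are rare. That kind of gap is what motivates a targeted search for more examples of the rare classes.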
With this foundation, we fine-tuned our models on the accumulated annotated data and then applied them to a large portion of the dataset to extract frames likely to contain the objects of interest. These frames were then sent to our expert teams for detailed, high-quality annotation. The initial automated curation wasn’t perfect, since not every selected frame contained the predicted objects, but it greatly streamlined the annotation process. As we continued gathering more annotated samples, we rapidly retrained our models using the Valohai platform, leading to increasingly accurate results.
This iterative process not only made our subsequent data curation more efficient but also allowed us to recalibrate our sampling distribution, focusing more on the low- and medium-frequency classes. By the project’s conclusion, we not only extracted the examples our client needed but also ensured a more balanced representation, particularly boosting the proportion of low- and medium-frequency classes. These classes hold more value than high-frequency ones because they provide richer, more diverse insights crucial for the robustness and adaptability of the final model.
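One common way to shift a sampling distribution toward rarer classes is to make each class's sampling probability proportional to its count raised to a power below one. The sketch below shows that general technique. It is an assumption for illustration rather than the method used in this project, and the exponent `alpha` and the class counts are made up.

```python
def rebalanced_sampling_probs(class_counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Return a target sampling probability per class, proportional to count ** alpha.

    alpha = 1.0 reproduces the original class frequencies;
    alpha = 0.0 samples every class equally often;
    values in between shift weight toward low-frequency classes.
    """
    weights = {c: n ** alpha for c, n in class_counts.items() if n > 0}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical class counts from an imbalanced first pass.
counts = {"car": 10000, "pedestrian_crossing_sign": 100, "deer_crossing_sign": 1}
probs = rebalanced_sampling_probs(counts, alpha=0.5)
for name, p in probs.items():
    original_share = counts[name] / sum(counts.values())
    print(f"{name}: original {original_share:.4f} -> target {p:.4f}")
```

With these made-up counts, the rarest class moves from roughly one in ten thousand samples to about one in a hundred, while the dominant class still makes up the majority.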
To learn more about Valohai and its reproducibility-first MLOps platform, visit valohai.com