We evaluated Llama 3.2, Gemini 1.5 Pro, and Claude 3.5 Sonnet for multi-modal image captioning. Which of these generative AI models came out on top?
Over the past few years, Generative AI models have continued to impress with their ever-improving capabilities. The technology has escaped the lab and gone mainstream: many consumers use it daily, and businesses are integrating it into their operations and offerings. However, few organizations have the means to build their own Generative AI models given the prohibitive training costs. For most, it is more practical to apply existing models to their specific use case. That exercise comes with its own challenges: innovators must choose the existing model best suited to their needs, craft prompts that achieve their objectives, set up an evaluation framework to track model performance, and more.
Model evaluation and selection can be particularly tricky. Even when the use case and the model's objective are clear, how does one thoroughly evaluate the model's ability to meet that objective? At Sama, we help enterprises evaluate and select models at scale, allowing you to confidently solve business-critical challenges. In this post, we demonstrate the process through an easy-to-understand example use case.
We chose to focus on a use case where models show promise but are still flawed: image captioning. Because the model output is long-form text, performance is also hard to measure accurately with programmatic metrics. Using LLMs to evaluate LLMs is a popular benchmarking technique, but it has many flaws of its own. In this workflow, we ask generative models to write a detailed caption of an image in under 250 words, highlighting the main elements in the image and its overall feel. You can imagine how this could be useful to a social media platform, an online retailer, or any other company for which describing the elements in an image is a costly but necessary exercise.
We used a sample from the validation set of the popular COCO dataset (a few example images are shown below the prompt) and obtained captions with the following prompt:
Describe the image in detail, allowing readers to visualize it clearly. Begin with a high-level overview of the scene. Then, focus on individual elements, moving from prominent features to smaller details. Conclude by describing the background. When mentioning people, include their approximate age and ethnicity. For any ambiguous elements, use phrases like "which could be" or "what seems to be" to indicate uncertainty. Provide a comprehensive yet concise description, limiting your response to a maximum of 250 words.
We evaluated three popular third-party models: Llama 3.2, Gemini 1.5 Pro, and Claude 3.5 Sonnet. We sent the same prompt to each model and had them caption just under 1,500 images.
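For readers who want to reproduce a similar workflow, here is a minimal sketch of a captioning harness. The `caption_image` wrapper, the model identifiers, and the folder and file names are illustrative assumptions, not the exact setup we used; the wrapper is left as a stub to be filled in with the provider SDK of your choice.

```python
import json
from pathlib import Path

# Illustrative model identifiers; the exact API versions you call may differ.
MODELS = ["llama-3.2-vision", "gemini-1.5-pro", "claude-3-5-sonnet"]

PROMPT = (
    "Describe the image in detail, allowing readers to visualize it clearly. "
    # ... full prompt as shown above ...
    "Provide a comprehensive yet concise description, limiting your response "
    "to a maximum of 250 words."
)

def caption_image(model: str, image_path: Path, prompt: str) -> str:
    """Hypothetical wrapper: send the image and prompt to the given provider's
    vision endpoint and return the caption text. Replace with real SDK calls."""
    raise NotImplementedError("plug in the relevant provider SDK here")

captions = []
for image_path in sorted(Path("coco_val_sample").glob("*.jpg")):
    for model in MODELS:
        captions.append({
            "image": image_path.name,
            "model": model,
            "caption": caption_image(model, image_path, PROMPT),
        })

# Persist all captions for the annotation step.
Path("captions.json").write_text(json.dumps(captions, indent=2))
```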
A team of 16 expert annotators then evaluated the models by comparing, for each image, the three model responses. For every image, they performed three activities: selecting the response they preferred overall, assigning each response a quality score from 0 (useless) to 5 (perfect), and flagging the specific errors present in each caption.
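One way to represent the output of those three activities is a simple record per image. The field names below are illustrative assumptions, not Sama's internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class ImageEvaluation:
    """One annotator's evaluation of the three captions for a single image.
    Field names are illustrative, not the actual annotation schema."""
    image_id: str
    preferred_model: str                                         # the caption chosen as best
    scores: dict[str, int] = field(default_factory=dict)         # model -> 0..5 quality score
    errors: dict[str, list[str]] = field(default_factory=dict)   # model -> flagged error types
```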
Using Sama’s Reporting offering, let’s take a look at what our team found. First, let’s examine their high-level preference across the models. Llama was preferred least often, while Claude and Gemini were chosen as best almost an equal number of times, with Claude slightly edging out Gemini. Claude also came out marginally ahead on the scores assigned. Here again, the results were close and not statistically different from each other, with average scores ranging from 2.79 (Llama) to 2.98 (Claude).
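With the evaluations flattened into a table, these headline numbers reduce to a couple of aggregations. The sketch below assumes one row per (image, model) pair with columns named `image_id`, `model`, `caption`, `score`, `preferred`, and `errors`; the file name and column names are assumptions for illustration.

```python
import pandas as pd

# One row per (image, model) pair with assumed columns:
# image_id, model, caption, score (0-5), preferred (bool), errors (list of flags)
df = pd.read_json("evaluations.json")

# How often each model was chosen as the best of the three.
preference_counts = df[df["preferred"]].groupby("model").size()

# Average 0-5 score per model (e.g. ~2.79 for Llama up to ~2.98 for Claude in our run).
mean_scores = df.groupby("model")["score"].mean().round(2)

print(preference_counts)
print(mean_scores)
```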
These high-level results are extremely close, but thanks to the rich output of our annotation process, we can dig much deeper. Let’s next examine the distribution of scores assigned. As seen below, there is some variance. For instance, Claude clearly has the most high-quality responses (4 or 5) as well as the fewest low-quality responses (0, 1, or 2).
Gemini had by far the highest proportion of perfect responses (a score of 5), with over 2.3% of its responses evaluated as such, compared to 0.1% for the competitors. However, it was the only model with responses the team evaluated as completely useless (a score of 0), and it also had the most responses that provided very little value (a score of 1).
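The score distribution itself is just a normalized count per model; a short sketch on the same assumed table:

```python
# Share of each 0-5 score per model, as percentages.
score_distribution = (
    df.groupby("model")["score"]
      .value_counts(normalize=True)
      .unstack(fill_value=0.0)
      .mul(100)
      .round(1)
)
print(score_distribution)  # rows: models, columns: scores 0 through 5
```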
Upon inspection, it turned out that the Gemini responses that scored a 0, and many of those that scored a 1, were ones where the model’s response was truncated very early and terminated with an “[UNKNOWN_REASON]” tag. This was potentially due to the model’s safety guardrails being unnecessarily triggered: in a large proportion of these responses, the first sentences of the caption describe young girls. For example, we found the following image and caption pair, which illustrates how overly stringent safety guardrails can render a model useless in certain cases.
Two young girls are seated on gray fenders on the deck of a sailboat, facing away from the viewer towards the deep blue ocean. The older girl, approximately [UNKNOWN_REASON]
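Truncated responses like this are easy to surface programmatically once the failure signature is known. The sketch below assumes the same table as above and treats the “[UNKNOWN_REASON]” marker as the tell; the model identifier string is an assumption.

```python
# Surface Gemini captions cut off with the "[UNKNOWN_REASON]" truncation marker.
truncated = df[
    (df["model"] == "gemini-1.5-pro")
    & df["caption"].str.contains("[UNKNOWN_REASON]", regex=False)
]
print(f"{len(truncated)} truncated Gemini captions")
print(truncated[["image_id", "score"]].head())
```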
One interesting slice of the data is the proportion of responses that would require only a small number of edits (scores of 3, 4, or 5) to be fully accurate. On this metric, Claude significantly outperforms Gemini and Llama in its ability to produce these high-value responses: approximately 83% of its responses fall into this category, compared to 75% for Gemini and 72% for Llama.
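This “usable with minimal edits” slice is simply the share of scores at 3 or above, computed per model on the same assumed table:

```python
# Proportion of captions per model scoring 3 or above (usable with minimal edits).
usable_share = (
    df.assign(usable=df["score"] >= 3)
      .groupby("model")["usable"]
      .mean()
      .mul(100)
      .round(1)
)
print(usable_share)  # roughly 83% for Claude, 75% for Gemini, 72% for Llama here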
To reach the deepest level of understanding of model performance, we can analyze the precise feedback annotators flagged on the model responses. On average, our team flagged 2.2 errors per Claude caption, 2.8 per Gemini caption, and 2.7 per Llama caption. The distribution of errors for each model is shown below; Gemini had more outlier captions with a high number of flagged errors.
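Average error counts fall out the same way, assuming the flagged errors are stored as a list per caption (the `errors` column is an assumption carried over from the schema sketch above):

```python
# Mean number of flagged errors per caption, per model.
errors_per_caption = (
    df.assign(n_errors=df["errors"].apply(len))
      .groupby("model")["n_errors"]
      .mean()
      .round(1)
)
print(errors_per_caption)  # ~2.2 Claude, ~2.8 Gemini, ~2.7 Llama in our run
```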
The most interesting finding about the feedback was how similar it was across the three models. All three struggled most heavily in the following areas:
For this particular use case, the top-performing models did not differ much in the types of mistakes they made, but rather in how frequently they made them. The types of issues flagged can be customized to the use case at hand. It is possible that, had we iterated on the feedback taxonomy by adding new classes or splitting existing classes into more specific ones, the annotation team would have found significant differences in the errors made by the models.
As the results show, picking one model that clearly outperforms the others is not trivial and depends on the business needs. At first glance, Claude and Gemini appear neck and neck for the best performance. However, a business that wants to avoid worthless responses, to limit reputational risk and loss of user trust, should clearly steer away from Gemini. For many businesses, it is reasonable to assume the goal is to maximize the frequency of usable responses that require minimal edits. On its ability to output captions scoring 3 or above (requiring less than 10% editing), Claude is the clear winner, achieving this threshold over 82% of the time.
With the help of our annotation team, we were able to compare the three models at a granular level and gain a deep understanding of the tradeoffs. The team was trained to thoroughly understand the use case and the main concerns we wanted to evaluate, so that they could handle edge cases properly. By using a diverse team, we were also able to source a broader set of opinions than an off-the-shelf evaluation solution would provide.
Assessing and comparing models is critical as Generative AI starts to drive value within businesses. Robust evaluation is necessary to avoid catastrophic model failures and to ensure that you get maximum value from your investment in the technology. Comparing models is useful when deciding which third-party foundation model to leverage, when assessing the results of a fine-tuning run, or even when editing the prompt or context sent to models. Building a robust understanding of model performance and flaws is essential to building user trust and ultimately to driving the value these tools provide within your company. As shown in this example, Sama can help you assess which models are best suited to your use case and be confident in their performance before deploying them widely.