
Evaluating AI products: The unseen craftsmanship

Models are everywhere. Every day brings a new breakthrough, a new headline-grabbing capability. Our social feeds are full of announcements, each model claiming superiority over the last. Yet, amid this wave of innovation, a critical piece of the puzzle receives little attention: product evaluations.

Consider the typical user experience with AI products: a user interacts with an app or a device, but frustration builds with each unpredictable response. Perhaps after a few tries, they abandon the product entirely. This scenario is more common than you might think.


AI products are inherently non-deterministic: they can give different outputs for the same input on different occasions. This makes them much harder to test than traditional software.

There are just too many things that can change the performance of our AI products. A new model, a new prompt, a new security wrapper and so on all affect what comes out at the end of the pipeline. Not to mention closed models (like GPT-4) being updated without their users ever being told. Things can get complicated very quickly.
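To make that concrete, here's a minimal sketch of why a traditional exact-match test breaks down. It's an illustration, not code from any real product: `call_llm` is a hypothetical stand-in for whatever model or pipeline the product calls.

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: a real call would hit a model endpoint and,
    # with temperature > 0, may phrase the answer differently every time.
    return random.choice([
        "Your order ships in 3-5 business days.",
        "Shipping usually takes three to five business days.",
    ])

def test_shipping_answer():
    # A classic unit-test assertion fails intermittently: the wording changes
    # between runs even though both answers are perfectly acceptable.
    assert call_llm("How long does shipping take?") == (
        "Your order ships in 3-5 business days."
    )
```

Both answers are fine, but the assertion only accepts one of them - and which one you get changes from run to run.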

So how do we know how our AI product performs?

This is where product evaluations—or evals—come into play. They are about understanding how AI products behave in diverse, real-world scenarios.

Model evals vs. product evals

The term ‘evals’ confuses even seasoned AI engineers and product managers. They often read an article about evals and reach straight for OpenAI’s evals - confusing model evals with product evals.

Those are two different things: 

  1. Model evaluations assess the performance of the model itself, in isolation.

  2. Product evaluations look at the entire application pipeline, from input to output, to measure the end-to-end user experience.

This distinction is important. A single tweak in the product pipeline can significantly alter the output, and with it the user experience. Evaluating the whole product, rather than just the model, is therefore essential.
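As a rough illustration of the difference (the helper names here are invented, not from the post): a model eval scores the raw model on a benchmark, while a product eval scores the end-to-end pipeline - system prompt, guardrails, formatting and all - against product-level test cases.

```python
from typing import Callable

Model = Callable[[str], str]  # any function that maps a prompt to a completion

def evaluate_model(model: Model, benchmark: list[tuple[str, str]]) -> float:
    # Model eval: score the raw model against (question, expected answer) pairs.
    hits = sum(model(question).strip() == answer for question, answer in benchmark)
    return hits / len(benchmark)

def product_pipeline(model: Model, user_input: str) -> str:
    # Product eval target: everything between the user's input and the
    # user-visible output (system prompt, guardrails, formatting, ...).
    prompt = f"You are our support assistant. Stay on topic.\n\nUser: {user_input}"
    return model(prompt)[:500]  # e.g. a crude length guardrail after the model

def evaluate_product(model: Model, test_inputs: list[str],
                     metric: Callable[[str, str], float]) -> float:
    # Product eval: score the end-to-end output for each test input.
    scores = [metric(user_input, product_pipeline(model, user_input))
              for user_input in test_inputs]
    return sum(scores) / len(scores)
```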

The first steps

How do we begin evaluating an AI product? We need two things: test cases and metrics.

The test cases simulate different scenarios, while the metrics tell us how well our product performs in each of them.
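One minimal way to represent these two ingredients in code - illustrative names only, not a prescribed framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    user_input: str   # the simulated user message
    scenario: str     # e.g. "typical question", "off-topic attempt"

# A metric scores the product's output for a given test case, e.g. from 0 to 1.
Metric = Callable[[TestCase, str], float]
```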

Here are some key questions to consider:

  • What metrics best capture the product's effectiveness?

  • Under what conditions does the product perform optimally?

  • How does the product respond to typical user frustrations?

  • What does a great user experience look like? What does a bad one look like?

When we have the answers to these questions, we create a set of assumptions and turn these into metrics - things that we can measure.

For example, when designing a chatbot we could ask: what scenario would cause reputational harm to our company?

We could say that if our chatbot does not stay on topic and users can make it do whatever they want, that can lead to reputational harm. For example, Chevrolet’s chatbot went viral because it answered all sorts of unrelated questions, including one about how Tesla compares to Chevrolet. The screenshot got 10k upvotes on Reddit. Ouch.

Once we’ve identified that it’s important for the chatbot to stay on topic, we can turn that requirement into a metric: does the chatbot actually stay on topic and refuse to answer questions outside its product scope?
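Here's one hedged way to implement such a metric, reusing the TestCase sketch from earlier. It asks a separate judge model whether the reply stayed in scope, with a cheap keyword heuristic as a fallback; `judge_llm` is a hypothetical helper, not a specific API.

```python
def stays_on_topic(test_case: TestCase, product_output: str, judge_llm=None) -> float:
    # Returns 1.0 if the chatbot stayed on topic (or refused), 0.0 otherwise.
    if judge_llm is not None:
        verdict = judge_llm(
            "The chatbot should only discuss our own products and services.\n"
            f"User said: {test_case.user_input}\n"
            f"Bot replied: {product_output}\n"
            "Did the bot stay on topic or politely refuse? Answer YES or NO."
        )
        return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
    # Cheap fallback heuristic: treat an explicit refusal as on-topic behaviour.
    refusal_markers = ("i can only help with", "outside my scope")
    return 1.0 if any(m in product_output.lower() for m in refusal_markers) else 0.0
```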

Once the metric is defined, we need to develop test cases (either manually or programmatically) in which we actively try to steer the chatbot off topic, and then run these test cases against our metric.

This way, we can see how our chatbot performs when the user - purposefully or not - changes the topic of the conversation, and we can tell whether our changes move the metrics in the right direction. The goal is to find the set of product parameters (models, prompts, data components etc.) that produces the best evaluation scores.
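A sketch of that loop, continuing the TestCase and Metric definitions from earlier (`run_pipeline` is a placeholder for your own product code):

```python
def evaluate_config(config: dict, test_cases: list[TestCase], metric: Metric,
                    run_pipeline) -> float:
    # Average metric score for one set of product parameters (model, prompt, ...).
    scores = [metric(tc, run_pipeline(config, tc.user_input)) for tc in test_cases]
    return sum(scores) / len(scores)

def pick_best_config(configs: list[dict], test_cases: list[TestCase],
                     metric: Metric, run_pipeline) -> dict:
    # Keep the configuration with the best average score on our eval set.
    return max(configs, key=lambda config: evaluate_config(
        config, test_cases, metric, run_pipeline))
```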

This is an iterative process and is shown in the diagram below.

The evolution of evals

It’s important to note that evals and test cases change as our product evolves. It’s impossible to know exactly what our users will do with our product in advance.

For this reason, once our product is released, we can start collecting data on how users actually use it. There are also products out there, like Maihem, that aim to solve this problem by testing AI products with AI-generated personas - but these are early efforts and not yet well understood.

Once we have either real-world or simulated data, we can add it to our test cases. The best AI products outsource the labelling step to the user and build it into the user experience, without users even realising it. This is not always possible, but when it works, it’s magical.
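A hedged sketch of that feedback loop: logged conversations that carry lightweight user feedback (say, a thumbs up or down built into the UI) become new labelled test cases. The log format below is invented for illustration and reuses the earlier TestCase sketch.

```python
def logs_to_test_cases(logged_interactions: list[dict]) -> list[TestCase]:
    # Keep only interactions the user implicitly labelled for us via the UI.
    new_cases = []
    for entry in logged_interactions:
        if entry.get("user_feedback") in ("thumbs_up", "thumbs_down"):
            new_cases.append(TestCase(
                user_input=entry["user_message"],
                scenario=f"production ({entry['user_feedback']})",
            ))
    return new_cases
```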

We’ll also likely discover new metrics over time that we’ll need to add to our evaluation set.

Buy vs Build

Alright, now that we understand that product evals are important, the question is whether to buy an existing solution or build our own.

The answer, as is so often the case, is: it depends.

Today the best eval products are Humanloop, Context.ai and Vellum.

They’re great for getting started with evals and putting AI products into production. All three have their own strengths - we might do a comparison in the future, but for now, it’s outside the scope of this post.

However, if our application is more complex - for example, because it has multimodal components - we need to build our own.

This usually takes multiple weeks and involves domain experts, commercial people and members of product and engineering teams.

It’s an expensive but essential step to create great AI products.

Where we are and where we’re heading

Right now, many AI product and engineering teams are deep in the process of figuring out how to evaluate their products.

We think that the standout products won't emerge from having the best foundational models. Instead, they'll come from teams dedicated to improving the user experience. These teams will put a lot of effort into transforming models into practical products and rigorously testing them to ensure they work well.

We believe that the individuals focusing on AI product evaluations are poised to be the unsung heroes behind the success of these great products. While evaluations might not make headlines on social media, they are a major topic of discussion at technical conferences like the AI Engineer World's Fair, highlighting their critical role in the development process.

Alright, that’s it for today.

Gabor