4 steps to a well-defined AI experiment

Running experiments with AI means you can test a hypothesis and deliver value quickly, or fail fast if necessary

In August, MIT published a report finding that 95% of generative AI experiments fail to achieve the rapid revenue growth expected of them.

Experiments often fail because they aren’t focused on where the greatest opportunity for value lies, like back-end transformation. And while generative AI tools can increase individual productivity, the gains are often limited in organisation-wide deployments if the tools aren’t integrated into actual workflows. It’s telling that pilots which focus on solving one well-defined problem are more likely to succeed.

Of course, a failed experiment is not a failure – provided the intention of that experiment was to learn. 

When it comes to implementing AI, it can be easy to fall into typical traps.

At dxw, we help organisations across the public sector and beyond solve real problems. We do it by starting small, testing with real users, and learning fast – with or without AI.

To experiment is to learn

AI experiments don’t need to be perfect. It can be easy to forget that experiments are learning vehicles when we are too focused on how to make the best use of limited resources. The goal of an experiment isn’t to build a finished product. The goal is to learn. 

Running an experiment means you can test a hypothesis and deliver value quickly, or fail fast if necessary. If it was small and cheap enough, a “failed” experiment that saves 6 months of wasted effort is a form of success.

To guide an experimentation approach, we can borrow principles from tried-and-tested product development approaches. Our experimentation framework has 4 steps:

  1. Find a real-world problem, rooted in real human need.
  2. Develop a testable hypothesis.
  3. Design and build the experiment.
  4. Evaluate and learn.

1. Find a real-world problem, rooted in real human need

For a problem to stand a chance as an AI experiment, it should be potentially “AI-solvable” and be a well-understood issue that doesn’t require in-depth discovery research. AI is, for now, bad at understanding nuance, handling ambiguity, and knowing when it’s wrong. It is generally better at repetitive, data-heavy tasks, and activities that are mindless toil. Drudgery and friction make excellent starting points. 

A good example of the kind of problem that works is tagging incoming customer service support tickets: it’s repetitive, data-heavy and a source of daily friction. We’ll use it as our worked example below.

There are other important factors to weigh up when selecting a project for an AI experiment.

2. Define a testable hypothesis

We use a template to fast-track the problem you selected in Step 1 into a testable hypothesis. This template is a secret weapon because it forces us to be specific. 

The purpose of the hypothesis is to turn assumptions into something we can test. A good hypothesis is precise and distinct, measurable and outcome-specific. This format sets out what you believe is worth building (or doing), who it is for, and what you expect or hope will happen.

There are other, more complicated ways of constructing hypotheses. Sometimes formulating “How Might We” statements is useful, or creating a theory of change. But at a fundamental level, these are the core elements that we care about.

Here’s an example of a hypothesis, using the customer service support ticket example that we mentioned earlier:

We believe that using GenAI to suggest 3 tags
for our customer service agents
will result in a 50% reduction in tagging time and 90% accuracy.

It’s not a statement that boils the ocean, and it’s not “build an AI-powered service desk.” “Suggest 3 tags” is small, testable and specific. It defines success even before we build anything; we’re testing for both speed and accuracy.
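
One way to keep yourself honest is to capture the hypothesis as structured data rather than prose. Here’s a minimal sketch in Python – the Hypothesis type and its field names are our own invention for illustration, not part of any framework – using the targets from the example above:

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        """What we believe is worth building, who it's for, and how we'll know it worked."""
        we_believe_that: str   # the intervention
        for_whom: str          # the people it serves
        will_result_in: dict   # measurable success criteria

    ticket_tagging = Hypothesis(
        we_believe_that="using GenAI to suggest 3 tags",
        for_whom="our customer service agents",
        will_result_in={
            "tagging_time_reduction": 0.50,  # 50% faster than the manual baseline
            "tag_accuracy": 0.90,            # 90% of suggested tags judged correct
        },
    )

Writing it down this way makes it awkward to leave a success criterion vague, which is exactly the point of the template.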

In some cases, hard numbers like “50% reduction in time” or “90% accuracy” may be too rigid, or difficult to set when there are no baselines or pre-existing measures – without a baseline, we can’t know what 50% truly means. In those cases, we suggest using proxy measures or qualitative indicators, such as asking the team for their perceived reduction in time or improvement in accuracy.

Another important measure is satisfaction and desirability. Sometimes, as humans, we enjoy a dull, mindless task, and we should have the agency to decide that’s what we would rather do. It’s all very well if the data shows that something was faster or more accurate, but do people trust and want it? The best-formed ideas fail when they come into contact with humans. The job here is not to take away work that people actually want to do, or find intrinsic value in. We want to solve the pain.

3. Design and build an experiment

The team is the unit of delivery for an experiment, and it doesn’t need to be massive. What matters most is a small blend of complementary skills, not headcount.

This doesn’t necessarily mean one person per skill; some of your team may have a blend of them. In our case, we’ve successfully placed a 3-person team who had all the skills we needed between them.

The most important thing to remember is that an AI experiment does not need to be a production-ready system. We’re building a prototype that we might throw away; essentially, we’re building just enough to learn about the context of the problem, and whether AI is a viable solution. This is why using tools already available to us is essential at this stage – use an off-the-shelf API, run a script on a spreadsheet. The front-end does not have to be beautiful (yet).
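
To make that concrete, here’s a minimal sketch of what a throwaway prototype could look like, assuming the OpenAI Python SDK; the model choice and the tag list are placeholders, not recommendations:

    # pip install openai
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical tag set - in practice, pull this from your ticketing system
    TAGS = ["billing", "login", "refund", "bug-report", "feature-request"]

    def suggest_tags(ticket_text: str) -> list[str]:
        """Ask the model for the 3 most relevant tags for one support ticket."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "You tag customer service tickets. Reply with exactly "
                            f"3 comma-separated tags chosen from: {', '.join(TAGS)}."},
                {"role": "user", "content": ticket_text},
            ],
        )
        return [tag.strip() for tag in response.choices[0].message.content.split(",")][:3]

    print(suggest_tags("I was charged twice this month and now I can't log in."))

Point a script like this at a spreadsheet export of real tickets and you have enough to start measuring speed and accuracy – no production infrastructure required.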

4. Evaluate and learn

Evaluation is a tricky process. Expect that it will be fuzzy, but you can use the measures defined in your hypothesis as your starting point. Then, compare the results of your tests to your success criteria.

If they’re not met, take the time to understand why. For example, at the end of the evaluation stage, you may be able to state: “We learned that using GenAI to suggest 3 tags for our customer service agents resulted in a 30% reduction in tagging time and an accuracy of 70%.” This would mean we’re on the right track, but may need to either refine the GenAI to improve accuracy, or revisit whether our measures were realistic or too ambitious, and why.
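
Checking measured results against the targets can be as mechanical as a few lines of Python – the numbers below are the illustrative ones from this example, not real data:

    # Targets from the hypothesis vs. what the experiment actually measured
    targets  = {"tagging_time_reduction": 0.50, "tag_accuracy": 0.90}
    measured = {"tagging_time_reduction": 0.30, "tag_accuracy": 0.70}

    for criterion, target in targets.items():
        verdict = "met" if measured[criterion] >= target else "not met"
        print(f"{criterion}: measured {measured[criterion]:.0%} "
              f"against a target of {target:.0%} -> {verdict}")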

This learning should lead to a decision: pivot (keep the goal but change the approach), persevere (keep iterating on what you have), kill (stop, having learned cheaply that it isn’t worth pursuing), or scale (invest in taking the idea further).

This entire process of experimentation is a form of de-risking. The goal is always to learn – to understand what work would be required to get the idea closer to where you want it to be. Then, decide as a team if it’s worth pursuing.