4 steps to a well-defined AI experiment

Running experiments with AI means you can test a hypothesis and deliver value quickly, or fail fast if necessary
MIT published a report in August on how 95% of generative AI experiments fail to achieve the rapid revenue growth that was expected.
Experiments often fail because they aren’t focused on where there’s the greatest opportunity for value, like back-end transformation. And while generative AI tools can increase individual productivity, gains are often limited in organisation-wide deployments if these tools aren’t integrated into actual workflows. It’s telling that pilots that focus on solving one well-defined problem are more likely to succeed.
Of course, a failed experiment is not a failure – provided the intention of that experiment was to learn.
When it comes to implementing AI, it can be easy to fall into any of these typical traps:
- chasing hype (“we need an AI strategy”) instead of solving a pain point
- launching disconnected pilots without discernible business or user value
- treating experiments as vanity projects, not learning vehicles
At dxw, we help organisations across the public sector and beyond solve real problems. We do it by starting small, testing with real users, and learning fast. With or without AI.
To experiment is to learn
AI experiments don’t need to be perfect. It can be easy to forget that experiments are learning vehicles when we are too focused on how to make the best use of limited resources. The goal of an experiment isn’t to build a finished product. The goal is to learn.
Running an experiment means you can test a hypothesis and deliver value quickly, or fail fast if necessary. If it’s small and cheap enough, a “failed” experiment that saves 6 months of wasted effort is a form of success.
To guide an experimentation approach, we can borrow principles from tried-and-tested product development approaches. Our experimentation framework has 4 steps:
1. Find a real-world problem, rooted in real human need.
2. Define a testable hypothesis.
3. Design and build the experiment.
4. Evaluate and learn.
Step 1: Find a real-world problem, rooted in real human need
For a problem to stand a chance as an AI experiment, it should be potentially “AI-solvable” and be a well-understood issue that doesn’t require in-depth discovery research. AI is, for now, bad at understanding nuance, handling ambiguity, and knowing when it’s wrong. It is generally better at repetitive, data-heavy tasks, and activities that are mindless toil. Drudgery and friction make excellent starting points.
Any of the following would be good problems:
- our senior engineers are spending 10 hours a week answering the same 3 questions from junior members of the team
- our caseworkers spend 2 days a month copying data from 5 spreadsheets to write one report
- our customer service team spends 2 hours a day manually tagging and routing support tickets
Other important factors in selecting a project for an AI experiment include:
- Size: keep it small. If it takes 12 months and a team of consultants, it’s no longer an experiment.
- Feedback: make sure there are short feedback loops for understanding success. An experiment that lasts a few weeks is not useful if you can’t measure its impact for months or years afterwards. A pragmatic approach is to choose a project where some measures are already in place.
- Safety: choose something safe and non-destructive. Generative AI is fraught with risks and dangers. It can make things up, share private data, delete things, recommend breaking the law, or cause harm through sycophancy. Choose a problem that won’t have catastrophic consequences if it goes wrong.
- Human oversight: because AI can get things wrong, it’s imperative to retain a human in the loop, or, as we prefer, a human in the lead. This means involving the right people from the inception of the experiment – people who know what “good” looks like, who can check, review or validate the outputs from AI for quality and accuracy.
- Availability: use tools you already have access to, such as Copilot or Gemini. It’s best not to spend a load of budget and time procuring new solutions when the objective is to learn quickly.
- Permission: establish a short accountability chain so that the experiment is not hampered by bureaucracy and red tape.
- Scope, time and cost: once you select a specific, well-articulated problem, ringfence a budget and timebox the experiment. Be ruthless about keeping the scope tight to ensure a higher chance of success.
Step 2: Define a testable hypothesis
We use a template to fast-track the problem you selected in Step 1 into a testable hypothesis. This template is a secret weapon because it forces us to be specific.
The purpose of the hypothesis is to turn assumptions into something we can test. A good hypothesis is precise and distinct, measurable and outcome-specific. The template sets out what you believe you can build (or do), who it is for, and what you expect, or hope, will happen.
There are other, more complicated ways of constructing hypotheses. Sometimes formulating “How Might We” statements or creating a theory of change is useful. But at a fundamental level, these are the core elements that we care about.
Here’s an example of a hypothesis, using the customer service support ticket example that we mentioned earlier:
We believe that using GenAI to suggest 3 tags
for our customer service agents
will result in a 50% reduction in tagging time and 90% accuracy.
It’s not a statement that boils the ocean, and it’s not “build an AI-powered service desk.” “Suggest 3 tags” is small, testable and specific. It defines success even before we build anything; we’re testing for both speed and accuracy.
In some cases, hard numbers like “50% reduction in time” or “90% accuracy” may be too rigid, or hard to set when there are no baselines or pre-existing measures – without a baseline, we can’t know what a 50% reduction really means. In these cases, we suggest using proxy measures or qualitative indicators, such as asking the team about their perceived reduction in time or improvement in accuracy.
Another important measure is satisfaction and desirability. Sometimes, as humans, we enjoy a dull, mindless task, and we should have the agency to decide that’s what we would rather do. It’s all very well if the data shows that something was faster or more accurate, but do people trust it and want it? The best-formed ideas can fail on contact with humans. The job here is not to take away the work that people actually want to do, or find intrinsic value in. We want to solve the pain.
Step 3: Design and build the experiment
The team is the unit of delivery for an experiment, and it doesn’t need to be massive. We believe the following skills are the most valuable:
- someone who can manage the roadmap and the course of the experiment, define and measure success metrics, and engage stakeholders
- someone who can define AI capabilities, and build something if it needs development
- someone who can gather insights from users, monitor the user experience and design any interfaces
- someone who can evaluate and design end-to-end journeys and understand how the experiment may fit into existing workflows
This doesn’t necessarily mean 4 individuals; some of your team may have a blend of skills. In our case, we’ve successfully placed a 3-person team who have all these skills between them.
The most important thing to remember is that an AI experiment does not need to be a production-ready system. We’re building a prototype that we might throw away; essentially, we’re building just enough to learn about the context of the problem and whether AI is a viable solution. This is why using tools already available to us is essential at this stage – use an off-the-shelf API, run a script on a spreadsheet. The front end does not have to be beautiful (yet).
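To make this concrete, here’s a rough sketch of what a throwaway prototype for the ticket-tagging example could look like: a short script that reads tickets from a spreadsheet export and asks an off-the-shelf LLM API to suggest 3 tags for each one. It’s illustrative rather than prescriptive – the OpenAI SDK, the model name, the tag list and the column names are stand-ins for whatever tools and data you already have.

```python
# Throwaway prototype: read support tickets from a spreadsheet export (CSV)
# and ask an off-the-shelf LLM API to suggest 3 tags for each one.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable;
# swap in whichever provider your organisation already has access to.
import csv

from openai import OpenAI

# Placeholder taxonomy: replace with the tags your team actually uses
ALLOWED_TAGS = ["billing", "login", "bug report", "feature request", "other"]

client = OpenAI()

def suggest_tags(ticket_text: str) -> list[str]:
    """Ask the model for exactly 3 tags drawn from the allowed list."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: use whatever model you already have
        messages=[
            {
                "role": "system",
                "content": (
                    "You tag customer support tickets. Reply with exactly 3 "
                    f"comma-separated tags from this list: {', '.join(ALLOWED_TAGS)}."
                ),
            },
            {"role": "user", "content": ticket_text},
        ],
    )
    return [tag.strip() for tag in response.choices[0].message.content.split(",")][:3]

# Column names ("ticket_id", "description") are hypothetical: match your own export
with open("tickets.csv", newline="") as f, open("suggested_tags.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["ticket_id", "suggested_tags"])
    for row in csv.DictReader(f):
        writer.writerow([row["ticket_id"], "; ".join(suggest_tags(row["description"]))])
```

A few dozen lines like this are enough to put real suggestions in front of real agents and start learning from how they respond.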
Step 4: Evaluate and learn
Evaluation is a tricky process. Expect it to be fuzzy, but use the measures defined in your hypothesis as your starting point. Then compare the results of your tests to your success criteria.
If the criteria aren’t met, take the time to understand why. For example, at the end of the evaluation stage, you may be able to state: “We learned that using GenAI to suggest 3 tags for our customer service agents resulted in a 30% reduction in tagging time and an accuracy of 70%.” This would mean we’re on the right track, but may need to either refine the GenAI approach to improve accuracy, or revisit whether our measures were realistic or too ambitious, and why.
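As an illustration of that comparison, here’s a minimal sketch assuming you’ve logged how long each ticket took to tag with the AI suggestions and whether the agent accepted them. The baseline figure and the logged values are made up to mirror the example above.

```python
# Compare logged results against the success criteria from the hypothesis:
# a 50% reduction in tagging time and 90% tagging accuracy.
# The baseline and the logged values below are illustrative, not real data.

baseline_seconds_per_ticket = 120  # average tagging time measured before the experiment

# (seconds spent tagging with AI suggestions, did the agent accept the suggested tags?)
results = [
    (80, True), (95, False), (70, True), (90, True), (85, False),
    (75, True), (100, True), (88, False), (82, True), (75, True),
]

avg_seconds = sum(seconds for seconds, _ in results) / len(results)
time_reduction = 1 - (avg_seconds / baseline_seconds_per_ticket)
accuracy = sum(1 for _, accepted in results if accepted) / len(results)

print(f"Time reduction: {time_reduction:.0%} (target: 50%)")  # 30% with this sample data
print(f"Accuracy:       {accuracy:.0%} (target: 90%)")        # 70% with this sample data
if time_reduction >= 0.5 and accuracy >= 0.9:
    print("Success criteria met")
else:
    print("Success criteria not met – time to understand why")
```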
This learning should lead to a decision: a choice to pivot, persevere, kill or scale the idea:
- Pivot: “The idea was wrong, but we found a new, more valuable problem.”
- Persevere: “We’re close. Let’s refine the hypothesis and test again.”
- Kill: “This is a dead end. Good. We only wasted 2 weeks.”
- Scale: “It works. Now we can talk about building it properly.”
This entire process of experimentation is a form of de-risking. The goal is always to learn: to understand what work would be required to get the idea closer to where you want it to be, and then to decide as a team whether it’s worth pursuing.