Experimenting with generative AI to support delivery officers at DfE

27 May 2026

The answer wasn’t a binary; there are times when an assistant is useful, and times when it isn’t

My most recent project was a 3-month experiment at the Department for Education (DfE), where we looked at the role generative AI might play in supporting internal delivery officers. A multidisciplinary team of dxw and DfE staff worked together to understand whether an AI assistant would help expert team members respond accurately, and faster, to questions from generalists.

The answer wasn’t a binary; there are times when an assistant is useful, and times when it isn’t.

This has deepened our real-world understanding of when and why an agent can help, freeing up people’s time for the tasks it can’t help with. We now know where to expand the approach that worked, and can identify similar use cases for further testing.

This test-and-learn approach aligns with the government’s strategy: building and testing an initial prototype quickly, and progressing in stages to wider roll-out where an approach is found to deliver.

Identifying the right use case

dxw has talked before about the 4 steps to a well-defined experiment:

find a real-world problem, rooted in real human need
develop a testable hypothesis
design and build the experiment
evaluate and learn

The Department identified a use case which would make a real impact: a central team is responsible for guidance in delivering policy. The team writes documentation, and spends a substantial amount of time answering questions from regional officers. Members of the team each have deep expertise in a specific subject area, and queries are passed to the best person available to answer them.

Some of these queries are answered in the guidance documentation, but there is a substantial amount of it, and traditional search isn’t up to the job of finding answers. Some answers may depend on nuance or local context which isn’t in the guidance documentation, and some need judgment. There’s a real need there – the less time the central team has to spend on answering questions which are in the guidance, the more they can spend improving those documents and answering the complex questions which arise when policy meets the real world.

In a 2-week Discovery phase we identified a handful of hypotheses we could test, and prioritised them based on estimates of the value of various outcomes and the effort required to test them. We proceeded with the hypotheses which would give the greatest benefit for the least cost.
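As a rough illustration of that kind of prioritisation, here is a minimal sketch in Python. The 1 to 5 scales, the value-divided-by-effort ratio and the hypothesis names are all assumptions made for the example; they are not the scoring method or the hypotheses used on the project.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    value: int   # assumed scale: estimated value of the outcome, 1 (low) to 5 (high)
    effort: int  # assumed scale: estimated effort to test, 1 (low) to 5 (high)


def prioritise(hypotheses):
    """Order hypotheses by estimated value per unit of effort, highest first."""
    return sorted(hypotheses, key=lambda h: h.value / h.effort, reverse=True)


# Hypothetical candidates, for illustration only
candidates = [
    Hypothesis("Hypothesis B", value=3, effort=3),
    Hypothesis("Hypothesis C", value=4, effort=5),
    Hypothesis("Hypothesis A", value=5, effort=2),
]

for h in prioritise(candidates):
    print(f"{h.name}: {h.value / h.effort:.2f}")
```

A simple ratio like this only gets a conversation started; the estimates themselves still come from the team.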

We framed these experiments using CAST’s AI Experiment Canvas, “a simple canvas that helps us think through any use of GenAI” with space to consider how we’d know if the experiment had given a positive result and ensure we’d considered gotchas like data privacy and methods of engagement.

Early prototypes

On this basis, we developed our first prototypes. We gave ourselves a technical constraint: our assistant was to be an agent, built in Copilot Studio. This meant we could move fast, but were restricted by the features of the software. Specifically, Copilot Studio’s web connector didn’t support URLs that contain query parameters or URLs that have more than 2 levels of depth. We had to add these documents manually. Additionally, its Instructions feature wasn’t easy to use collaboratively or with version control. We’d like to see deeper integration with GitHub so we can manage our code.

However, it was perfectly adequate for prototyping, and for some sorts of wider use. Importantly, it got us an assistant in Teams very quickly, and Teams is where our users spend much of their time.
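The web connector restriction described above is easy to check in advance. The sketch below is one way to flag pages that would need adding manually. It assumes that “levels of depth” means the number of path segments after the domain; that is our reading rather than a documented definition. The function name and example URLs are hypothetical.

```python
from urllib.parse import urlparse

MAX_DEPTH = 2  # assumption: depth is counted as path segments after the domain


def needs_manual_upload(url: str) -> bool:
    """Return True if a URL has query parameters or is deeper than MAX_DEPTH."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    return bool(parts.query) or len(segments) > MAX_DEPTH


# Hypothetical URLs, for illustration only
for url in [
    "https://example.com/guidance/topic",
    "https://example.com/guidance/topic/page",
    "https://example.com/search?q=funding",
]:
    print(url, needs_manual_upload(url))
```

Running a list of source pages through a check like this before configuring the agent would show how much manual work to expect.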

Throughout the project, we deepened our knowledge of the world in which the central team operated. Our experimental design was in 2 stages. First we ran moderated sessions where officers commented on the answers given by a prototype assistant, using it for historical queries. Then we improved the assistant, and asked officers to use it on queries as they came into the team, in a fortnight-long unmoderated diary study.

Learning and iterating

By the end of the project, we’d designed, built and tested 2 agents. Both answered queries on the basis of a set of knowledge, documents and intranet pages we’d identified and configured. They followed a set of instructions based on the CRISPE framework to define the agent’s Capacity and Role, the Insight it needed, the Statement that would form the core of its response, and its Personality. Together, that context forms the guardrails that make an agent more likely to be useful.

The last – the agent’s Personality – gave us an interesting piece of learning. We’d been led to deliver traceable guidance, with links to specific sections of relevant documents. The personality that gave us that authoritative advice was well suited to delivering accurate answers to the central team experts, but less suited to producing answers in a voice the generalists could quickly understand. That was one of our iterations; a later version of our agent used different voices for different parts of an answer.

When is an agent most likely to be useful?

In our final report, we distilled our advice about where an agent is most likely to be useful into 5 questions.

Are incoming questions answered in the documentation?

Agents cannot give answers requiring judgement, context, or discussion with other teams.

Is the documentation extensive?

Agents are good at finding and referencing relevant parts of long documents and guidance spread across multiple documents.

Is the documentation about one subject?

Agents seem to produce less useful answers when their source documents cover many parts of a complex problem domain.

Do questions take some time to answer?

Agents will not save time responding to simple queries where generalists can already recall an accurate response.

Are potential users knowledgeable, and reasonably confident in using AI?

They may need to edit incoming queries to make them easier for an AI to understand, and they will need to recognise inaccurate answers.
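The 5 questions above can be captured as a simple checklist. This is a minimal sketch, assuming each question gets a yes or no answer; counting the “yes” answers is our illustration, not a scoring method from the report.

```python
QUESTIONS = [
    "Are incoming questions answered in the documentation?",
    "Is the documentation extensive?",
    "Is the documentation about one subject?",
    "Do questions take some time to answer?",
    "Are potential users knowledgeable, and reasonably confident in using AI?",
]


def assess(answers: dict[str, bool]) -> tuple[int, list[str]]:
    """Return the number of 'yes' answers and the questions answered 'no'."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"Unanswered questions: {missing}")
    concerns = [q for q in QUESTIONS if not answers[q]]
    return len(QUESTIONS) - len(concerns), concerns


# Hypothetical use case: every answer is 'yes' except the third
answers = {q: True for q in QUESTIONS}
answers[QUESTIONS[2]] = False
score, concerns = assess(answers)
print(f"{score} of {len(QUESTIONS)} in favour; to discuss: {concerns}")
```

The questions answered “no” are the useful output here: they point at what to look into before testing an agent on a new use case.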

The human is still the decision maker

Nowadays, a search for ‘AI “3 pillars”’ returns over 50 million results in Google. Those 3 pillars are variously for business adoption, “Governance. Execution. Adoption”; for agents, “Context, Cognition, and Action.”; or even for AI search optimisation, “SEO, AEO, and GEO”.

Back when GenAI was young in 2025, I knew them from the work of the pioneers at Citizens Advice Stockport, Oldham, Rochdale and Trafford in the context of digital service delivery, and they were:

roll your own, don’t use a general service
train on trusted data, don’t rely on general knowledge
keep a human in the loop, don’t expose a tool to general use

This experiment adds some nuance to each of those pillars:

don’t use a general service, but a well-configured agent with good guardrails can work
don’t rely on general knowledge, but you can augment an agent with documents it can refer to, which guide its answers to good effect
don’t expose a tool to general use, and human in the loop is evolving as a principle. Increasingly, we are finding situations where an LLM is just as capable as a human. What’s interesting is that these situations are small parts of larger journeys

To take a specific example, we found an AI agent was not up to the task of giving a nuanced answer to questions which depended on knowledge of local situations. However, it was very helpful in equipping the expert who had that knowledge with the specific parts of national guidance that would set that local knowledge in context. We equipped our AI assistant with relevant documentation, made it point to the specific sources for its answer, and enabled expert officers to exercise their judgement in full command of the detail.

The human is still the decision maker; this AI is their researcher. The time saved is the time a human would otherwise spend combing the guidance for the specific detail they needed. In other cases, we found an AI couldn’t grasp the national guidance in sufficient detail to be of much help.

What next?

These findings enable us to make evidenced recommendations – being realistic where we need to be, and enthusiastic where we see there are real gains to be made. I’d like to thank DfE for giving us the opportunity to follow this approach. The next step is to roll out what worked, and scope another experiment.
