ConsultationXML: getting reusable data out of horrid PDFs

Over the last few months, we’ve been working with Steph Gray of the Department for Innovation, Universities and Skills on making consultation documents easier to reuse.

DIUS are doing some fantastic things with consultations. Typically, a formal consultation is a tedious process: a department will write up a big PDF document, print it, send it to some people, stick it on their website and wait for people to respond. The whole process is dated: it doesn’t really take advantage of the web, and is inaccessible to most people.

DIUS have started to make this process better. In July last year, they launched a consultation that tried a bit harder to involve people. They used a WordPress plugin, CommentPress, to allow people to comment on individual paragraphs in the consultation. They published a nice HTML version of the consultation document, with links and all. They even made a widget generator, so that people could embed questions from the consultation in their blogs.

Doing these things doubled the number of people who responded to the consultation, with very little extra marketing. Unfortunately, they were also time-consuming: turning a PDF into nice HTML is pretty laborious. They wanted to automate as much of this process as possible, to make it cheaper to deploy similar consultations in the future, and they asked us to help.

Creating all these consultation tools would be quite easy if the data existed in a format that could easily be reused. Unfortunately, PDF is certainly not that format. It is designed for print, and is difficult to repurpose. To make this easier, we wrote some tools to convert PDFs into very basic XML, and to allow people to extend that XML into something useful.
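
To give a rough idea of what "very basic XML" could mean, here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF as a plain string with blank lines between paragraphs. The function name, the element names (consultation, paragraph) and the id scheme are made up for illustration; they are not the real ConsultationXML schema or conversion code.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_basic_xml(text):
    """Wrap each blank-line-separated paragraph in a <paragraph> element.

    Element and attribute names are hypothetical, not the real schema.
    """
    root = ET.Element("consultation")
    # Split on blank lines, then collapse the hard line breaks and
    # stray whitespace that PDF extraction tends to leave behind.
    chunks = [" ".join(chunk.split()) for chunk in text.split("\n\n")]
    for number, chunk in enumerate((c for c in chunks if c), start=1):
        para = ET.SubElement(root, "paragraph", id=f"p{number}")
        para.text = chunk
    return root


# Stand-in for text extracted from a PDF (invented sample content).
extracted = "Skills matter to\neveryone.\n\nHow should funding be allocated?"
basic = paragraphs_to_basic_xml(extracted)
print(ET.tostring(basic, encoding="unicode"))
```

The output at this stage knows nothing about questions, answers or links: every block of text is just a paragraph, which is why the next step needs a person.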

This human intervention is really important. It allows semantic information to be added to these documents: questions and their possible answers can be identified, and explanatory paragraphs can be linked to questions. It also allows formatting and images lost during conversion to be added back into the document, and extra formatting like links to be added.
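
As an illustration of the kind of markup a person might add, the Python sketch below promotes a plain paragraph to a question, attaches its possible answers and links it to the paragraph that explains it. The element and attribute names (question, answer, explained-by) and the helper mark_as_question are assumptions invented for this example, not the actual schema or editor code.

```python
import xml.etree.ElementTree as ET

# Basic XML as it might look straight after conversion (invented sample).
basic = ET.fromstring(
    "<consultation>"
    '<paragraph id="p1">Funding is currently allocated by region.</paragraph>'
    '<paragraph id="p2">How should funding be allocated?</paragraph>'
    "</consultation>"
)


def mark_as_question(root, para_id, answers, explained_by):
    """Turn a plain paragraph into a question with answers and context links.

    All element and attribute names here are hypothetical.
    """
    para = root.find(f"./paragraph[@id='{para_id}']")
    if para is None:
        raise ValueError(f"no paragraph with id {para_id!r}")
    para.tag = "question"
    # Record which explanatory paragraphs belong with this question.
    para.set("explained-by", " ".join(explained_by))
    for label in answers:
        ET.SubElement(para, "answer").text = label
    return para


question = mark_as_question(basic, "p2", ["By region", "By need"], ["p1"])
print(ET.tostring(basic, encoding="unicode"))
```

Once questions are marked up like this, other tools (a comment form, an embeddable widget) can pick them out without having to guess which paragraphs are questions.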

So, with that in mind, we produced a web-based XML editor for staff in web publishing departments. The idea was to create an editor customised to the XML schema we’re using, so that people who are only just XML-literate can still use it. The editor automatically converts PDF documents to basic XML and then presents it for marking up, tweaking and generally-making-better. The result is awesome XML, usable by other tools to do neat things.

ConsultationXML is about to be deployed within DIUS, where it’ll be used by real people so we can get feedback and make it better. We’re hosting an installation here, so that you can play with it and give us your thoughts. It’s not quite finished yet, but it’s finished enough, so we’re getting it out there for people to try. It’ll be open source just as soon as the lawyers have done their thing.

Have a play with the beta ConsultationXML editor here.
Update: Steph has posted his writeup.