ConsultationXML: getting reusable data out of horrid PDFs

Over the last few months, we’ve been working with Steph Gray of the Department for Innovation, Universities and Skills on making consultation documents easier to reuse.

DIUS are doing some fantastic things with consultations. Typically, a formal consultation is a tedious process: a department will write up a big PDF document, print it, send it to some people, stick it on their website and wait for people to respond. The whole process is dated: it doesn’t really take advantage of the web, and is pretty inaccessible to most people.

DIUS have started to make this process better. In July last year, they launched a consultation that tried a bit harder to involve people. They used a WordPress plugin, CommentPress, to allow people to comment on individual paragraphs in the consultation. They published a nice HTML version of the consultation document, with links and all. They even made a widget generator, so that people could embed questions from the consultation in their blogs.

Doing these things doubled the number of people who responded to the consultation, with very little extra marketing. Unfortunately, they were also time-consuming: turning a PDF into nice HTML is pretty laborious. They wanted to automate as much of this process as possible, to make it cheaper to deploy similar consultations in the future, and they asked us to help.

Creating all these consultation tools would be quite easy, if the data existed in a format that could easily be reused. Unfortunately, PDF is certainly not that format. It is designed for print, and is difficult to repurpose. To make this easier, we wrote some tools to convert PDFs into very basic XML, and to allow people to extend that XML into something useful.
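
To give a flavour of what "very basic XML" could mean, here is a minimal sketch in Python. It is only an illustration, not the ConsultationXML converter: it assumes the text has already been extracted from the PDF by some other tool, and the element names (document, paragraph) and the id scheme are made up for the example rather than taken from the real schema.

```python
import re
import xml.etree.ElementTree as ET


def text_to_basic_xml(text: str) -> ET.Element:
    """Wrap blank-line-separated blocks of text in <paragraph> elements.

    The element names and ids are hypothetical, chosen for this example.
    """
    root = ET.Element("document")
    number = 0
    # Treat one or more blank lines as a paragraph boundary.
    for block in re.split(r"\n\s*\n", text):
        # Collapse the hard line breaks left over from the print layout.
        cleaned = " ".join(block.split())
        if not cleaned:
            continue
        number += 1
        paragraph = ET.SubElement(root, "paragraph", id=f"p{number}")
        paragraph.text = cleaned
    return root


if __name__ == "__main__":
    extracted = "A first paragraph that was\nwrapped for print.\n\nA second paragraph."
    print(ET.tostring(text_to_basic_xml(extracted), encoding="unicode"))
```

Output like this has no idea what is a question and what is just explanation, which is why the next step needs a person.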

This human intervention is really important. It allows semantic information to be added to these documents: questions and their possible answers can be identified, and explanatory paragraphs can be linked to questions. It also allows formatting and images lost during conversion to be added back into the document, and extra formatting like links to be added.
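
As a rough sketch of why that extra markup matters, the Python below reads a small marked-up document and pulls out each question with its possible answers and the paragraph that explains it. The sample XML is invented for this example: the element and attribute names (consultation, question, answer, explained-by) are assumptions, not the real ConsultationXML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration only.
SAMPLE = """<consultation>
  <paragraph id="p1">Some explanatory text about the proposal.</paragraph>
  <question id="q1" explained-by="p1">
    <text>Do you agree with the proposal?</text>
    <answer>Yes</answer>
    <answer>No</answer>
  </question>
</consultation>"""


def list_questions(xml_text: str) -> list:
    """Return each question with its answers and its linked paragraph."""
    root = ET.fromstring(xml_text)
    # Index paragraphs by id so questions can look up their explanation.
    paragraphs = {p.get("id"): p.text for p in root.iter("paragraph")}
    questions = []
    for question in root.iter("question"):
        questions.append({
            "id": question.get("id"),
            "text": question.findtext("text"),
            "answers": [a.text for a in question.findall("answer")],
            # None if the question is not linked to any paragraph.
            "context": paragraphs.get(question.get("explained-by")),
        })
    return questions


if __name__ == "__main__":
    for q in list_questions(SAMPLE):
        print(q["text"], q["answers"], q["context"])
```

Once questions, answers and their context are identified like this, something like an embeddable question widget becomes a simple loop over the data rather than a copy-and-paste job.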

So, with that in mind, we produced a web-based XML editor for staff in web publishing departments. The idea was to create an editor customised to the XML schema we’re using, so that people who are only just XML-literate can still use it. The editor automatically converts PDF documents to basic XML and then presents it for marking up, tweaking and generally-making-better. The result is awesome XML, usable by other tools to do neat things.

ConsultationXML is about to be deployed within DIUS, where it’ll be used by real people so we can get feedback and improve it. We’re hosting an installation here, so that you can play with it and give us your thoughts. It’s not quite finished yet, but it’s finished enough, so we’re getting it out there for people to try. It’ll be open source just as soon as the lawyers have done their thing.

Have a play with the beta ConsultationXML editor here.
Update: Steph has posted his writeup.

12 comments

  1. Comment by Podnosh Blog » Archive » New small tools for better government, horrid pdf’s and the Power of Information Report in Beta posted

    […] Harry Metcalfe on a simple site which should make future online consultation easier. It turns a “horrid” pdf into xml and html – which makes it much easier to use these cumbersome documents for much […]

  2. Comment by Freeing data, reducing pain at Helpful Technology posted

    […] Harry says in his write up: Typically, a formal consultation is a pretty tedious process: a department will write up a big PDF […]

  3. Comment by Public Policy Engagement with Commentariat « OUseful.Info, the blog… posted

    […] it seems like the DIUS folks are also trying to open things up at the document level? ConsultationXML: getting reusable data out of horrid PDFs. But I’m too tired to chase this through just now to find out exactly what they’re up […]

  4. Comment by New small tools for better government, horrid pdf’s and the Power of Information Report in Beta « Test Site posted

    […] Harry Metcalfe on a simple site which should make future online consultation easier. It turns a “horrid” pdf into xml and html – which makes it much easier to use these cumbersome documents for much […]

  5. Comment by Neil Williams posted

    Great work Harry and Steph. I’ve been keeping an eye on this one as you know.

    But, not being that much of a data geek (through lack of application not any avoidance of geekhood!) I think I need help with the next bit: things you can do with the xml that deliver better experiences for the user.

    Is your intention to throw it open to the community and see what people come up with, or are you working on a phase II?

    (Incidentally, please install the subscribe to comments plugin, it would be handy to get alerts when you reply…) Ta

  6. dxw staff member Comment by Harry posted

    Hi Neil – will do.

    The answer is that we’re doing both. The XML Schema is public, so anyone who wants to can take the XML and do stuff with it. I’ll be making something soon, and I think the Cabinet Office are having a crack as well.

    Certainly we want to be as open as possible, and let people do more or less as they like. Hopefully we can make it open source soon!

  7. Comment by ConsultationXML is now Open Source - Archive - The Dextrous Web posted

    […] pleased to announce that after a bit of wrangling, Steph Gray and I are able to release ConsultationXML as open source software under the GNU Affero license. The recent report on open source software in […]

  8. Comment by Barcamp London 6 « Zoombu’s Blog posted

    […] care about and it is easy to find people with shared interests. For instance during a session on “PDF to XML” not only did I learn from the speaker’s expertise, I met another guy who was also interested […]

  9. Comment by ConsultationXML: the mashups have landed - Archive - The Dextrous Web posted

    […] have already started doing interesting things with ConsultationXML. I have to admit – I couldn’t be more […]

  10. Comment by Selected links for 20th April through 27th April | Gavin Wray posted

    […] ConsultationXML: getting reusable data out of horrid PDFs – Archive – The Dextrous Web “Making consultation documents easier to use” […]

  11. Comment by Crowdsourcing policy, visualising debate and evolving consultation « Observations posted

    […] Metcalfe and Steph Gray of the Department for Business, Innovation & Skills (DBIS) came up with ConsultationXML, a way for web publishers to convert “horrid” PDF data into meaningful XML. The resulting XML […]

  12. Comment by Climbing the mountain at Helpful Technology posted

    […] ConsultationXML: Harry Metcalfe’s development of a practical tool to convert PDFs into semantically-rich data. Phase 2 is underway, looking at what XML can be turned into for practical benefit (think: WordPress plugins) […]

Comments are closed.