How to parse structured language and turn it into code


This is the first of 2 fairly technical blog posts…

We’ve been working with the Crown Commercial Service (CCS) on their Report Management Information project (CCS RMI) which, in brief, will allow suppliers who sell services via government frameworks to report their business to CCS and calculate the management charge they have to pay (if any).

Russell Garner and I have been working on a language (Framework Definition Language, or FDL) which will allow CCS to create new frameworks on RMI.

CCS need to be able to create their own frameworks on the RMI system. The way this is done currently is via dxw developers writing a representation of the framework (its name, management charge rate, required invoice fields etc) in Ruby and committing it to the codebase. The problem with this is that once dxw leaves the project, CCS will struggle to do this as they have no developers of their own.

The solution, initially devised by Russell, is to create a language for CCS to write frameworks in. The idea is that we create a domain-specific language for CCS, which is still fairly technical but easy enough for someone who is not a developer to get their head around. This language can then be consumed by the RMI platform and turned into frameworks.

How to create a language parser with Parslet

When I joined the RMI project, Russell had already presented his spike on how the team could get CCS creating their own frameworks in the RMI system. The solution he went with was using Parslet to parse a structured language file, developed for CCS, and turning it into a Ruby class (the model representation of the framework). Over the next couple of sprints I worked with Russell to turn his spike into something fully-formed that CCS could use.

Before we get into the nitty-gritty of the language design for CCS, I will try to explain how Parslet works and how you can use it to parse almost any structured text into Ruby.

Parslet

Parslet is “a small Ruby library for constructing parsers”, to quote its docs. It’s small and lightweight but very powerful. To get started with Parslet, just install it as you would any gem.

gem install parslet

Let’s start with a “Hello World” example. Say we want a parser which recognises the string “Hello World” but not any other string.

require 'parslet'
class MyParser < Parslet::Parser
  root(:helloworld)
  rule(:helloworld) { str('Hello World') }
end
MyParser.new.parse("Hello World") # => "Hello World"@0

This is simple and a bit pointless. The parser looks for the string “Hello World”, finds “Hello World” and returns “Hello World”.

Here’s what happens if you pass the parser another string:

MyParser.new.parse("Goodbye") # => raises Parslet::ParseFailed

So what exactly is happening here? The parser consists of two things: a named rule (helloworld) and a root instruction. The root instruction tells Parslet which rule it should apply first to the document.

The @0 on the output tells you where in the document (or, the string) the parser matched the rule – in this case, position 0.

Now let’s say we want to match more than the exact string “Hello World”. Maybe we want to match any string containing letters and spaces.

require 'parslet'
class MyParser < Parslet::Parser
  root(:string)
  rule(:string) { match(/[a-zA-Z\s]/).repeat(1) }
end
MyParser.new.parse("Hello World") # => "Hello World"@0
MyParser.new.parse("Goodbye") # => "Goodbye"@0
MyParser.new.parse("12345") # => raises Parslet::ParseFailed

The repeat(1) instruction on the match tells the parser to expect at least one letter or whitespace character.

Why did the third example fail? Because the regular expression in the rule ‘string’ only includes letters and whitespace. Parslet follows the Ruby rules for regular expressions – Rubular is a great resource for composing regular expressions in Ruby. This is fun, but a parser which simply consumes a string and spits it out again isn’t very useful.

Our ultimate aim here is to create a language that can be used to describe things, so we need a way to parse a structured string and understand the information contained within it.

In this example, let’s write a parser which will take in a string like “Hello <name>” and return the greeting and the person’s name.

require 'parslet'
class MyParser < Parslet::Parser
  root(:greeting)
  rule(:greeting) { str('Hello ') >> name }
  rule(:name) { match(/[a-zA-Z]/).repeat(1) }
end
MyParser.new.parse('Hello Laura') # => "Hello Laura"@0
MyParser.new.parse('Hello Laura!') # => raises Parslet::ParseFailed

The instruction >> can be interpreted as “followed by”. Also note how I have used a rule (‘name’) in another rule.

The second example fails because the parser is not expecting an exclamation mark. This parser is still a bit pointless because it’s very strict (maybe sometimes I’ll forget that I can’t have an exclamation mark) and it still only returns the input string.

Also, look at the match for “Hello” – it contains a trailing space. This is because the input string will contain a space between “Hello” and the person’s name. The match isn’t particularly nice.

Let’s improve all these things in our next iteration:

require 'parslet'
class MyParser < Parslet::Parser
  root(:sentence)
  rule(:sentence) { (greeting >> space >> name >> exclamation.maybe).as(:result) }
  rule(:greeting) { (str('Hello') | str('Goodbye')).as(:greeting) }
  rule(:name) { match(/[A-Za-z]/).repeat(1).as(:name) }
  rule(:space) { match(/\s/).repeat(1) }
  rule(:exclamation) { str('!') }
end

There’s a lot going on in this version. Let’s break it down from the bottom upwards.

The rule ‘exclamation’ is looking for a literal exclamation point.

The rule ‘space’ matches a whitespace character (\s in the regular expression), and the modifier repeat(1) means it is expected at least once.

The rule ‘name’ matches a word made up of one or more letters a-z (upper or lower case), and the resulting match is annotated as ‘name’.

The rule ‘greeting’ looks for exactly the string “Hello” OR “Goodbye”, and annotates it as ‘greeting’.

The rule ‘sentence’ looks for a greeting, followed by a space, followed by a name, followed by an exclamation mark – maybe. So there may be an exclamation mark, or there may not – either is fine. This sentence is annotated as ‘result’. I’ve named it ‘result’ because it is the top-level output from the parser.

And here is the final output:

{:result=>{:greeting=>"Hello"@0, :name=>"Laura"@6}}

Now we have a hash with the annotations (result, greeting and name) as keys and the parsed strings as the values.

The question now is what can we do with this? The answer is anything! For RMI, we have used the hash parsed from the FDL files to build classes, so let’s try building a simple class with this result tree.

Imagine we have a class called “Person” with two attributes – the person’s name (a string), and whether or not they’re in the building (a boolean).

Person(name: string, here: boolean)
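
The code that follows calls Person.new with two positional arguments and then asks the object for name and present?. Here is a minimal sketch of a class with that shape, assuming a Struct is good enough. This definition is hypothetical and the real class may well look different.

```ruby
# Hypothetical Person class for illustration; the real one may differ.
Person = Struct.new(:name, :here) do
  # True when the person is in the building.
  def present?
    here
  end
end

laura = Person.new('Laura', true)
puts laura.name       # Laura
puts laura.present?   # true
```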

We could use our parser to build a Person from the string “Hello Laura!”

result = MyParser.new.parse("Hello Laura!")[:result]
here = result[:greeting] == 'Hello' ? true : false
person = Person.new(result[:name], here)
if person.present?
  puts person.name + ' is here'
else
  puts person.name + ' is not here'
end
# => Laura is here

This works but line 2 is a bit clunky. It would be ideal if the result from the parser was something we could plug straight into the ‘Person’ object without fiddling with its values. Fortunately, Parslet also has a Transform engine to transmogrify “intermediary” results from the parser.

class Transformer < Parslet::Transform
  rule(name: simple(:n), greeting: simple(:g)) do
    { name: n.to_s, presence: (g == 'Hello' ? true : false) }
  end
end

This is a bit harder to understand, and to be honest I still find the Transform step very complicated. In short, it takes the intermediary tree, i.e.:

{:result=>{:greeting=>"Hello"@0, :name=>"Laura"@6}}

And the rule looks for a pattern (name: something, greeting: something) and then applies an action block to it (transforming the name to a string, and the greeting to a boolean). The result is:

{:result=>{:name=>"Laura", :presence=>true}}

The match position numbers have disappeared because we’ve now moved one step beyond the original input. We can now use the transformed result to build a ‘Person’ object without fiddling with the parser output:

raw = MyParser.new.parse("Hello Laura!")
transformed = Transformer.new.apply(raw)[:result]
# => {:name=>"Laura", :presence=>true}
person = Person.new(transformed[:name], transformed[:presence])

If you’re interested in seeing this class in full, I’ve put it in a gist.

Now you (hopefully) have an idea of what Parslet is and how it can be used to parse structured sentences into – well – almost anything. In my next post, we’ll look at the language we developed for RMI to build framework definitions with.