A Summary of Scale Summit

4 April 2018

Recently I attended my first Scale Summit, which is mostly a day of Open Spaces where people discuss how to run things at scale without having to cover the basics, because most attendees already work at scale. All of the discussions are held under the Chatham House Rule, so the Summit is more like “Operators Anonymous” than other conferences.

We started with a keynote by Meri Williams about how to scale both people and culture, since scaling is not all about technology.

My main takeaways were:

Humans need repetition in communication. While code shouldn’t repeat itself, you need to communicate an idea to humans seven times before they remember it.
Written communication is talking to the future, both to yourself and to others. Use Architecture Decision Records to communicate your decisions to the future.
There are several inflection points in your company’s growth.
- When you need more than two pizzas to feed the team
- Dunbar’s Number
- When you have more people than working days in a year.
Hire for cultural add, not cultural fit.
You have to work to craft an inclusive workplace.

We then moved on to the meat of the conference, the open spaces. I proposed sessions on Scalable Build Pipelines and Chaos Engineering. I also attended sessions on Infrastructure Testing, Container Orchestration in 2018 and Terraform in Automation. Participating in sessions can be hard work, which is why I skipped one and spent the time discussing issues with various people outside the sessions.

Scalable Build Pipelines

No one really felt that any of the current tools scale particularly well; people tend to pick the fastest one and then make it work for them. Some people were having success using Buildkite and scaling the worker nodes themselves. CircleCI, Travis CI, Codeship, Jenkins and Concourse were also mentioned.

At dxw we are currently experimenting with AWS CodePipeline and AWS CodeBuild to do our CI and CD for us on some of our new work.

Infrastructure Testing

In the infrastructure testing open space we discussed how and why to test infrastructure, and how infrastructure testing differs from both code testing and monitoring. About 30% of the room were doing infrastructure testing, using a variety of tools. InSpec, Serverspec and goss were what most people were using to test actual running infrastructure. Several people were testing their infrastructure as code with tools like test-kitchen.

Monitoring is different from testing in that it runs constantly, whereas testing only happens at certain points, e.g. when a Pull Request is made. Your monitoring may make use of your tests to check the state of the system.

Testing of infrastructure does not really lend itself to the fast feedback loops of software testing, especially when you are creating and destroying infrastructure to test it. It is better suited to behaviour-driven development tests: a machine being up is less useful than the system doing what you want.
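To make the last point concrete, here is a minimal sketch of a behaviour-style check in Python. It is not something shown at the session: the function names, the URL and the expected text are all hypothetical. The idea is simply that the check passes only when the service returns the page you expect, not merely when the machine responds.

```python
import urllib.request
from typing import Callable, Tuple

# A fetcher takes a URL and returns (status code, body text).
Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str) -> Tuple[int, str]:
    """Fetch `url` over HTTP and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


def service_behaves(url: str, expected_text: str, fetch: Fetch = http_fetch) -> bool:
    """Return True only if the service answers 200 and serves the expected text.

    Connection failures and HTTP error responses count as "not behaving";
    urllib raises them as subclasses of OSError.
    """
    try:
        status, body = fetch(url)
    except OSError:
        return False
    return status == 200 and expected_text in body
```

Because the fetcher is passed in, the same function could be called from a test run on a Pull Request or from a monitoring probe, for example `service_behaves("https://staging.example.com/", "Sign in")` with a hypothetical staging URL.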

Chaos Engineering

Very few people in the session were actually doing chaos engineering. I think this is because most people are afraid of the consequences. Causing a production outage by testing that you can survive things going wrong would be an interesting discussion to have with your bosses. However, if you have been doing it all along, you shouldn’t need to fear the consequences. As several people said, your supplier is probably being an agent of chaos anyway, so if you do chaos engineering you might at least be ready for it and panic less.

One way to lessen the fear is to run it against a non-production environment, ideally one that has production traffic replayed against it, but even then it might not reflect what actually happens in production.

Terraform in Automation

My last session of the day was probably the most relevant to the work I am doing at dxw. It was a session discussing how to automate your Terraform. We are currently building a new hosting platform which makes use of some of the things I picked up here.

To ensure that your Terraform code is consistent, it was recommended that you run `terraform fmt`, `terraform validate` and tflint over your code. `terraform fmt` makes sure that your code is formatted in a consistent style. In fact, one of my first pull requests at dxw was to apply `terraform fmt` to the Terraform in the teacher-vacancy-service repo.
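As a rough illustration of how these three checks might be wired into a CI step, here is a small Python sketch. It is an assumption about how you could do it, not what was shown in the session or what we run at dxw: the `run_checks` function is hypothetical, and passing `-check` and `-diff` to `terraform fmt` (so that it reports problems instead of rewriting files) is my own choice.

```python
import subprocess
from typing import Callable, List

# The three checks recommended in the session. The -check and -diff flags
# (an assumption, not from the session) make `terraform fmt` report badly
# formatted files rather than rewrite them. Note that recent Terraform
# versions need `terraform init` to have run before `terraform validate`.
CHECKS: List[List[str]] = [
    ["terraform", "fmt", "-check", "-diff"],
    ["terraform", "validate"],
    ["tflint"],
]


def run_checks(directory: str, runner: Callable = subprocess.run) -> List[str]:
    """Run every check inside `directory` and return the ones that failed.

    `runner` defaults to subprocess.run; it is a parameter so the function
    can be exercised without Terraform installed.
    """
    failed = []
    for command in CHECKS:
        result = runner(command, cwd=directory)
        if result.returncode != 0:
            failed.append(" ".join(command))
    return failed
```

A CI job could call `run_checks(".")` and exit with a non-zero status if the returned list is not empty, which runs all three checks and reports every failure rather than stopping at the first.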

To make use of external modules, people suggested using a Terrafile, with python-terrafile to install the modules. In practice we found that python-terrafile didn’t do what we wanted, so we implemented the rake task from the original blog post instead.
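For readers who have not seen the pattern, the sketch below shows the general idea in Python. It is neither python-terrafile nor the rake task we used. It assumes the Terrafile has already been parsed into a mapping of module name to `source` and `version` (that layout, the `clone_commands` function, the module names and the `vendor/modules` target directory are all assumptions for illustration), and it only builds the `git clone` commands that would pin each module to its version.

```python
from typing import List, Mapping


def clone_commands(
    terrafile: Mapping[str, Mapping[str, str]],
    target_dir: str = "vendor/modules",
) -> List[List[str]]:
    """Build one shallow `git clone` command per module in the Terrafile.

    Each module is checked out at the tag or branch named in `version`
    into `<target_dir>/<module name>`. Modules are processed in name
    order so the output is stable.
    """
    commands = []
    for name, spec in sorted(terrafile.items()):
        destination = f"{target_dir}/{name}"
        commands.append(
            ["git", "clone", "--depth", "1",
             "--branch", spec["version"], spec["source"], destination]
        )
    return commands


# Hypothetical Terrafile contents, as they might look once loaded from YAML.
example_terrafile = {
    "vpc": {"source": "https://git.example.com/modules/vpc.git", "version": "v1.2.0"},
    "ecs": {"source": "https://git.example.com/modules/ecs.git", "version": "v0.4.1"},
}
```

Each command returned by `clone_commands(example_terrafile)` could then be handed to `subprocess.run`, and your Terraform code would refer to the modules by their local path under the target directory.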

In conclusion

I enjoyed myself a lot and had interesting discussions all day. I was reminded about several tools that we are now using to help us build a new scalable hosting platform.