Building a resilient system: operational ownership and observability at Gearset

Gearset writes... As part of our focus on delivering the leading Salesforce DevOps platform, we’re always looking for ways to improve our internal processes and continue to build a resilient system that serves our users. One of the ways we’re doing this is through our new observability team.

Our internal DevOps team does a great job supporting our platform, and has been working on these kinds of problems for some time. The new observability team was set up to increase our capability in this area. Folks from different areas of engineering joined the team, and they continue to rotate in and out to gain skills and contribute to this important service. Read on to learn more about what observability means at Gearset, and the impact it has.

What do we mean by observability at Gearset?

For us, observability centres around the ability to understand the inner workings of our system solely based on observing and interrogating it with external tools. That means we want to be able to understand the state of our application at any time, even if it’s in a state that we hadn’t previously predicted. Ideally, that would mean at any given time we don’t need to ship new custom code to diagnose an issue that we’re already seeing in the application.

What’s the goal of the observability team at Gearset?

The goal of the team is to increase our capability to operate our systems at scale, while handling internal code updates and external changes to the environment our code operates in (driven by new customers or by existing customers changing their usage patterns).

We roughly view the four main challenges to keeping a system stable under the force of internal and external changes as:

Detecting there’s a problem with the system
Triaging the problem and assigning an owner
Figuring out the root cause of the issue
Fixing the problem

Asking why the problem happened, and how we can make sure it doesn’t happen again, is an explicit part of our write-up of each incident. For us, it’s part of “fixing the problem”, and it’s the way we do engineering at Gearset, but we realise that’s maybe not the way every company does it!

Our success at all four challenges can be improved by enhancing our observability. So a large part of the team’s early work focuses on improving our observability, with the aim of making us more effective at each of those four challenges.

We also have a remit to tackle all of the sociotechnical issues that arise when we’re tackling these problems. After all, there’s no point building great observability tools if nobody knows how to use them! So, we engage with other teams to make sure we’re building the right things, help them get up to speed with the tools, share knowledge, and act on any feedback from using the platform.

How does the team work?

The team is made up of a mixture of temporary and permanent roles. This allows us to look long term and think about what the team might be doing in a year or two, and adapt as the business continues to grow.

Engineers from other teams do a three- to six-month secondment in the observability team, which is a great way to encourage learning and development in new areas for the rest of the engineering team.

Developers who join this team bring new ideas and experiences from their current roles, and take the new skills they’ve learned back to their teams to make even more of a difference. Working in this collaborative, transparent way is one of the things that makes Gearset’s engineering culture unique, and encourages professional development in the engineering team.

Any engineer can ask to join the team – it’s an explicit goal to help spread some of the knowledge from the observability team, so you definitely don’t need to be an expert already. However, we do have to make sure there’s a balance on the team at any one time, and across the rest of the engineering teams.

What projects have the observability team worked on?

One of the big projects we’ve worked on is to start using OpenTelemetry tracing to improve our observability. We’ve started using a tool called Honeycomb to complement our existing observability stack, and this has already led to lots of small improvements in our operational capacity. For example:

We were quickly able to identify that a specific code change had led to a massive increase in latency on one of our key endpoints.
We’ve been able to apply a number of small optimisations that allow the app to run faster.
We now have more confidence around our use of queues, and where the slow points are.
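
To make the first example a little more concrete, here’s a minimal sketch of the idea in Python. It isn’t production tooling, and it doesn’t use the OpenTelemetry or Honeycomb APIs: the span records, field names, endpoint paths and build labels are all made up for illustration. The assumption is simply that every span records which endpoint it served and which build produced it, so durations can be grouped by build and compared.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records. In practice these would be queried from a
# tracing backend; the field names and values here are assumptions.
SPANS = [
    {"endpoint": "/api/compare", "build": "v1", "duration_ms": 120},
    {"endpoint": "/api/compare", "build": "v1", "duration_ms": 130},
    {"endpoint": "/api/compare", "build": "v1", "duration_ms": 110},
    {"endpoint": "/api/compare", "build": "v2", "duration_ms": 480},
    {"endpoint": "/api/compare", "build": "v2", "duration_ms": 510},
    {"endpoint": "/api/compare", "build": "v2", "duration_ms": 450},
    {"endpoint": "/api/deploy", "build": "v2", "duration_ms": 90},
]


def median_latency_by_build(spans, endpoint):
    """Return {build: median duration in ms} for spans on one endpoint."""
    durations = defaultdict(list)
    for span in spans:
        if span["endpoint"] == endpoint:
            durations[span["build"]].append(span["duration_ms"])
    return {build: median(values) for build, values in durations.items()}


def slower_builds(latency_by_build, baseline, threshold=2.0):
    """List builds whose median latency exceeds threshold x the baseline's."""
    base = latency_by_build[baseline]
    return [
        build
        for build, value in latency_by_build.items()
        if build != baseline and value > threshold * base
    ]


latencies = median_latency_by_build(SPANS, "/api/compare")
suspects = slower_builds(latencies, baseline="v1")
```

The point of the sketch is that no new diagnostic code has to be shipped to answer the question: as long as the build label is already attached to every span, the comparison is just a query over data that’s already there.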

The team has also reviewed some of our database usage, and introduced better observability tooling and techniques to improve how we examine and optimise it.

All these projects contribute to our overall goal: helping the whole team have more confidence in their understanding of the system. Over time, this will enable us not only to make small fixes like the ones we’re working on now, but also to recover faster from larger failures, and to move faster with new functionality because it’ll be easier to verify at each stage of rollout that things are behaving as expected.

What’s it like working on the observability team?

[Julian Wreford]: “I’ve really enjoyed working on the operational team, it’s been a bit of a change from my prior experience doing just software development, but I’ve learnt so much by having the opportunity to really focus on the operational aspects behind running our application. It’s a cool mixture of ad-hoc response to immediate issues, and longer term planning around how we improve our general approach to running the application, and I’ve really enjoyed that mixture.”

[Oli Lane]: “Founding this team has been a real learning experience for us all – which is great. Having a team focused just on operational and observability concerns has given us the space to learn much more about how best to instrument, operate, and ultimately improve the services we run, and that’s been really fulfilling. From tackling issues, to paying down technical debt, to empowering other teams to get more insights into how the code they write runs in production, it’s easy to see the impact of the work we’re doing. And I often don’t know what I’ll be working on when I get in each morning, which definitely keeps things fresh!”

Want to join the observability team?

Then why not take a look at our engineering roles? You could be part of making a real difference at Gearset and for our users.