Adobe Tech Blog

News, updates, and thoughts related to Adobe, developers, and technology.


Creating a Thriving On-Call Engineering Workflow by Embracing Healthy Team Habits

© Nuthawut — stock.adobe.com.

This post is the first part of a series in which we explore how we at Adobe Experience Platform continuously invest in engineering and operational excellence, and what organisational practices we embrace to meet our customers’ expectations amid increasing adoption of digitalised solutions.

An important contributor to the culture of engineering and operational excellence at Adobe is the continuous investment in making our systems reliable while enabling our customers to implement their use cases at a dramatically increasing scale.

To enable this mindset, we have adopted a Unified Engineering framework, where each team is responsible for the systems they develop and maintain — from the very first line of code to healthy production deployment.

Adopting a “you build it, you maintain it” mindset has enabled us to thrive, implementing the best operational practices and creating space for a continuous learning journey. Yet the overall process doesn’t come easily and tends to bring challenges even for teams with multiple years of experience in this operational model.

In this first post, we will outline a case study of how we tackled the on-call challenges within an engineering team responsible for one of the most crucial real-time services at Adobe. By the end, you will have a simple framework that we’ve used and that could be a good starting point for teams experiencing similar challenges.

We will cover:

  • a general overview of on-call at Adobe
  • our team on-call story: how it started and what we did to ensure a happy ending
  • key learnings from our two-year journey of adopting an Atomic Habits mindset

On-call at Adobe

On-call at Adobe means each team has one dedicated engineer per established time frame, at a regular frequency, whose sole responsibility is to respond to incoming system alerts while continuously overseeing the health of their team’s systems.

Before adopting the Unified Engineering framework, we used to have dedicated operations teams that acted as the first layer of on-call on behalf of the engineering teams, who sat lower in the escalation path. But with the rise of the digital era, the complexity of our systems and the level of our customers’ expectations rose dramatically, leading to an increased number of multiple-9s services.

As such, we have learned that the best team to troubleshoot and overcome any production issue is none other than the team that builds that system. By embracing this mindset, we created an organisational culture where each contributor gains an in-depth understanding of the system they develop and the value it provides to its end users, regardless of their affinity for either Operations or Development.

While each team leverages the same framework, we also benefit from a high degree of flexibility to ensure that we do whatever works best for our team. This means that, at Adobe, you might encounter different team implementations related to:

  • Daily on-call schedule — when the team is spread across multiple time zones, it can benefit from a “follow the sun” schedule, where the on-call duty is exercised only during each engineer’s local daytime hours (see the sketch after this list). Otherwise, a 24/7 on-call schedule might be required to oversee services deployed worldwide.
  • On-call shift duration — some teams rotate weekly, more often than weekly, or even every few weeks. The choice of duration depends largely on how frequently on-call interventions occur outside working hours.
  • On-call vs. development duty refers to how the on-call duty is embedded in the regular development rotations the team uses, whether sprints or any other frames. Teams with a high volume of on-call interventions usually do not combine the on-call responsibility with regular development tasks — the on-call engineer is taken out of the sprint for the duration of the on-call shift.
  • Primary with or without secondary on-call — teams responsible for multiple-9s services always have a designated secondary on-call engineer, who gets paged if the primary on-call engineer cannot respond. Having a second person also proves valuable when multiple issues happen at once and the team needs to distribute attention across them.
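To make the follow-the-sun idea concrete, here is a minimal sketch of how a scheduler could pick the engineer whose local business hours cover the current moment. The roster, time zones, and 09:00–18:00 window are purely illustrative assumptions; in practice this logic usually lives in the paging tool rather than in custom code.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Hypothetical roster: engineer -> home time zone (illustrative only).
ROSTER = {
    "engineer-emea": "Europe/Bucharest",
    "engineer-amer": "America/Los_Angeles",
    "engineer-apac": "Asia/Tokyo",
}

BUSINESS_START = time(9, 0)   # assumed local start of a daytime shift
BUSINESS_END = time(18, 0)    # assumed local end of a daytime shift


def follow_the_sun(now_utc: datetime) -> str:
    """Return the engineer whose local business hours cover `now_utc`.

    Falls back to the first roster entry if no one is in business hours;
    a real scheduler would handle that case with an explicit escalation policy.
    """
    for engineer, tz in ROSTER.items():
        local = now_utc.astimezone(ZoneInfo(tz))
        if BUSINESS_START <= local.time() < BUSINESS_END:
            return engineer
    return next(iter(ROSTER))  # placeholder fallback


if __name__ == "__main__":
    now = datetime.now(ZoneInfo("UTC"))
    print(f"{now.isoformat()} -> page {follow_the_sun(now)}")
```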

Living on the Edge, an on-call story

Our story showcases one of our Edge teams, which owns services that collect and process data in real time, worldwide, with 99.9% availability at a scale of ~60B requests/day.
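To put those figures in perspective, a quick back-of-the-envelope calculation turns them into an average throughput and a monthly error budget. The 30-day month used for the budget window is our simplification here, not the team’s actual SLO accounting.

```python
# Back-of-the-envelope numbers implied by the figures above
# (99.9% availability, ~60B requests/day).
requests_per_day = 60e9
availability_slo = 0.999

avg_rps = requests_per_day / 86_400                           # requests per second, on average
monthly_minutes = 30 * 24 * 60                                # assumed 30-day budget window
error_budget_min = monthly_minutes * (1 - availability_slo)   # minutes of full downtime allowed

print(f"average throughput  : {avg_rps:,.0f} req/s")
print(f"monthly error budget: {error_budget_min:.1f} minutes of full downtime")
```

That works out to roughly 694,000 requests per second on average, with only about 43 minutes of full downtime allowed per month before the 99.9% target is breached.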

Our team also has a special organisational context, being the only engineering team that develops and maintains these edge services. As such, we need to have 24/7 on-call shifts to cover the worldwide traffic sources that our services handle.

Our systems provide core functionality to our hundreds of customers and to all the Adobe Digital Experience solutions that rely on them to implement their own core business capabilities.

Needless to say, should a production outage occur, all eyes would be on us.

How it started: our team’s on-call vibe meter

Two years ago, on-call was the hot topic that never ceased to appear in day-to-day discussions, 1:1 conversations, and team retrospectives. The internal vibe was in total contrast with the maturity of our systems and the best practices we had developed for CI/CD, immutable infrastructure, canary analysis, and wave rollouts, which were helping other teams adopt the same mindset toward engineering and operational excellence.

Needless to say, there was a particular context in the team setup and its environment:

  • The beginning of the pandemic brought significant sources of anxiety and completely changed our way of working, with all of us switching to 100% remote work and collaboration
  • The dramatic increase in digitalisation forced by the pandemic brought a sudden rise in customer implementations and traffic volume in our systems, the latter almost doubling
  • Our group went through a complete reorganisation, and our team became the sole owner of the Edge services suite and infrastructure
  • Our team was in a forming phase, with half of its members still onboarding, myself included as the new team leader

Zooming into our on-call stats, every seven days of on-call accounted for roughly four interrupted nights for the primary on-call engineer. These issues followed the classic on-call procedure pattern: analyse some dashboards, apply the recommended steps, and verify that everything is back to normal. The overall process required an average of 15 minutes from open to resolved.

While these numbers were small compared to the complexity and scale of our systems, one particular metric ultimately influenced our team’s on-call vibe: the mean time to fall asleep again (MTFAA). It was the trickiest to measure and the biggest source of anxiety within our team.

Ultimately, it seemed we had all the right reasons to prioritise whatever it took to reduce the on-call pain points. And we had all the right tools, practices, and organisational support to do so. Yet this did not quite happen naturally in our team. And the ultimate question was: why?

Start with why

As Simon Sinek states in his book Start with Why, emotions trump reason every time. As long as we have a strong sense of purpose and we know why we are doing what we’re doing, we tend to do our best.

The Golden Circle — Source: https://simonsinek.com/books/start-with-why/.

As far as the WHY is concerned, the broader literature also teaches us that there are two fundamental needs behind our actions: security and satisfaction.

Our team’s WHY

Going back to our story, the team was experiencing a certain amount of fear hidden behind the on-call process.

Fear of lacking enough knowledge — at that time, half of the team consisted of new members joining from different areas and business domains. Our services and the business domain were so complex that even tenured members, with years of experience, found themselves learning something new every day.

Fear of being accountable for unknown territories — most of our codebase had been developed by team members who were no longer part of the team. Without prior hands-on experience and proper deep knowledge, the anxiety created by a potential night-time alert uncovering a long-hidden piece of code was hard to dismiss.

Our team’s HOW

Even though we had a weekly rotation for primary engineers, the team ultimately seemed to rely on subject-matter experts to handle specific on-call issues. Whenever such a problem was detected, the expert would either volunteer to take it over or be asked to by the on-call engineer. This handover was not transparent enough to leave space for explicitly addressing the knowledge gaps as part of the team’s planning process, such as updating the developer docs or running hands-on knowledge-transfer sessions.

While this continuous collaboration exercise fostered a healthy team culture, it also had a significant disadvantage — it concentrated even more experience and expertise in the team members who already had a solid knowledge base in that area, and it widened the knowledge gap among their peers.

There was also something special about brainstorming discussions. Whenever we started a session to explore what we could do better to address a hot source of on-call issues, there were always two sides:

  • The subject-matter expert would often highlight the need for re-architecting, refactoring, or even switching to a different set of technologies as the solution to the problems we were facing
  • The rest of the team, lacking the proper knowledge base, would seldom challenge the status quo or the proposals

As engineers, we are passionate about solving problems and sometimes more in love with the solution than with the problem. And this is why most of these brainstorming sessions would conclude with ambitious project plans spanning a minimum of three months, focused on a particular technology or architecture but not enough on the immediate impact on the on-call process.

Our team’s WHAT

Prioritising an engineering-driven project that spans multiple months and lacks a substantial problem statement is always prone to failure. When hundreds of customers drive the definition of the same backlog, as in our case, this becomes even more challenging.

As a result, we were continuously falling victim to our golden circle:

Once we went through this visualisation exercise together, it became apparent that we needed to transition to a greater sense of satisfaction and empowerment to be able to address the on-call pain points effectively.

We needed to switch our WHY from fear to inner power, from security to satisfaction.

The road to inner power and an effective on-call process

Looking back at all the iterations we have gone through, we managed to do one particular thing pretty well: we built a set of healthy team habits. Habits that everyone internalised and felt empowered to act on, regardless of the context. Habits that made the continuous improvement of on-call duty a must for everyone, irrespective of the priorities and the current roadmap.

And when it comes to building habits, my mind immediately goes back to a book I am genuinely fond of, Atomic Habits. In it, James Clear describes the internals of a habit as a continuous loop, split into four stages:

  1. Cue — triggers your brain to initiate a behaviour because it predicts a reward
  2. Craving — the motivational force behind every habit, the desire for the predicted end state
  3. Response — the actual habit you perform
  4. Reward — the end goal of every habit

For me, the book’s highlight has been the process of reverse-engineering these stages into a simple framework that everyone can use to transition towards a desired state: The Four Laws of Behaviour Change. Building a habit requires four types of actions:

  • The 1st law (Cue) — Make it obvious
  • The 2nd law (Craving) — Make it attractive
  • The 3rd law (Response) — Make it easy
  • The 4th law (Reward) — Make it satisfying

The 1st habit law: make it obvious

It was obvious to us, as a team, that we were facing recurring issues that constantly impacted our members. We needed to build a habit of continuously solving them. According to the Atomic Habits framework, an excellent strategy for creating a new habit is to pair it with an existing one. And so we did.

We already had one engineer serving as primary on-call for an entire week. That would be our existing habit.

Since our systems already had a high degree of stability, first-class monitoring, and prevention mechanisms, the on-call week did not involve on-call operations as a full-time job. Yet we still considered the on-call engineer out of the sprint, to reduce context switching and to ensure time for rest should a problem arise. Nonetheless, the on-call engineer would voluntarily pick up smaller tasks from the sprint whenever there was no on-call activity to handle. And that revealed an opportunity to form a new habit.

We paired the habit of being primary on-call with the habit of being on-duty, transforming our on-call rules of engagement in the following way:

  • On-call activity — 1st priority, 24/7: respond and act upon system alerts, closely monitor our dashboards and alerting channels
  • On-duty activity — 2nd priority, business hours: work on tasks that help us address the on-call rotation pain points

Subject-matter experts no longer acted upon the actual issues but rather guided the primary on-call engineer to handle them properly. Having all team members equally involved in day-to-day operations made our recurring issues and their complexity even more apparent and reduced the amount of implicit operational knowledge.

Last but not least, we started to leverage a designated backlog containing all the tasks eligible for the on-duty activity. We called it nothing else but the on-duty backlog. We continuously updated it with the most relevant tasks for improving our on-call rotations, each of which could be addressed within a week.

Our team’s on-duty backlog, which captures operational improvements scheduled for half-a-year cycles.

The on-duty backlog and the on-duty role helped us go through a mindset shift: from constantly thinking of an ideal on-call role to working towards reaching that state.

The 2nd habit law: make it attractive

When we introduced the on-duty backlog and activity, we also introduced the role of the On-duty Product Owner, which each team member would play in turn. Similar to a Scrum Product Owner, this new role had the goal of maximising the value of our team’s work toward a smoother on-call process.

The team was the one to continuously plan and prioritise the on-duty items that would ultimately have a positive impact on their daily on-call activities. These items could range from implementing specific changes in our systems to adding a new runbook, driving a knowledge-transfer session, or reviewing and adjusting our alerts.

The 3rd habit law: make it easy

To act continuously as both the on-duty engineer and product owner, the team ultimately needed simple ways to repeatedly answer the following questions:

  • What are the most impactful on-call operations right now, and why?
  • How do we continuously prioritise action items?
  • When is the right time to act on a particular item, considering the entire list?
  • How do we know we have done the prioritisation exercise properly?

Enter the on-call handover

You can think of this as a recurring post-mortem ceremony that the team holds internally to discuss the last week’s on-call rotation. The previous on-call engineer drives the discussion, with the goal of having the next on-call engineer fully enabled to start their rotation.

Each Monday morning, when the on-call rotation ends, we meet for half an hour with the following rules of engagement:

  • The previous on-call engineer presents the last week’s on-call highlights and the extracted action items, which ultimately become tasks in the on-duty backlog. Depending on the complexity of an issue, the on-call engineer might schedule a brainstorming session with the team beforehand to come up with a practical action item.
  • The current on-call engineer seeks clarity on the presented data and aligns with the team on what needs to be addressed next as part of their on-call rotation.

We consider an on-call handover successful when:

  • We have a list of on-duty backlog items that the next on-call (on-duty) engineer will work on, added to the current sprint (we use Scrumban)
  • The prioritisation is backed by the positive impact an item would bring to our on-call rotation

Data collection made simple

Needless to say, a 30-minute handover session would not go as planned without the previous on-call engineer doing some prep work beforehand. This prep work involves reflecting on the past week and proposing the next steps.

While we use several tools to manage on-call alerts and incidents, we have chosen a classic way to gather the data that facilitates the on-call handover: a simple, personalised survey that feeds a continuously evolving on-call log, as showcased in the following figure.

The survey our team uses to log on-call events, which are further reviewed during the on-call handover ceremony.

This new data-driven approach has proven to be a good reflection exercise, forcing us to synthesise the on-call activity so that it is easy to review, extract action items from, and even run data analysis on.

While some on-call operations are actions we need to take the same way every time, and we’re okay with that, others require follow-up. The survey outlines that: for each operation marked as a one-time operation (could be improved), the on-call engineer must attach either what they have done to address the issue (e.g., a PR, a link to an improved runbook) or a link to the on-duty backlog item the team needs to address to prevent the issue from recurring.
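For illustration, one entry in such an on-call log might look like the sketch below. The field names, the operation categories, and the validation rule are assumptions inferred from the description above, not our actual survey schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class OperationKind(str, Enum):
    RECURRING_BY_DESIGN = "recurring_by_design"   # expected operation, acceptable as-is
    ONE_TIME_IMPROVABLE = "one_time_improvable"   # should not recur; needs a follow-up


@dataclass
class OnCallLogEntry:
    """One survey response feeding the on-call log (hypothetical schema)."""
    occurred_at: datetime
    service: str                      # mandatory: the service that triggered the alert
    summary: str
    minutes_spent: int
    outside_business_hours: bool
    kind: OperationKind
    follow_up: Optional[str] = None   # PR, improved runbook, or on-duty backlog item

    def is_valid(self) -> bool:
        # Improvable operations must point at what was done or what is planned next.
        if self.kind is OperationKind.ONE_TIME_IMPROVABLE:
            return bool(self.follow_up)
        return True
```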

At the end of the day, we have a complete historical log that captures all survey responses and outlines what it means to do on-call in our team, what kind of effort we invest in on-call operations, and what we can do next to make it smoother. This captured data is valuable whenever prioritisation is required, or when reaching a particular milestone calls for a celebration.

When priorities conflict, let the data speak

It is no coincidence that the on-call handover survey contains one mandatory field: the service that triggered the alert. This property makes it straightforward to create a pivot chart showcasing the most impactful services in our day-to-day on-call operations. The following figure contains an example of such a chart, which we look at during our planning process.

A sample of our on-call chart that we leverage for prioritising the next on-duty backlog items to address
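As a sketch of how such a breakdown could be produced from the on-call log, the snippet below ranks services by intervention count and effort using pandas. The service names and columns are hypothetical, matching the illustrative schema sketched earlier rather than our real data.

```python
import pandas as pd

# Illustrative slice of an on-call log; column and service names are made up
# to match the hypothetical schema above, not the team's actual survey.
log = pd.DataFrame({
    "service": ["edge-gateway", "edge-gateway", "identity-resolver",
                "edge-gateway", "config-sync"],
    "minutes_spent": [15, 10, 25, 20, 5],
    "outside_business_hours": [True, False, True, True, False],
})

# Rank services by intervention count, total effort, and night pages --
# the shape of the pivot chart used to prioritise the on-duty backlog.
impact = (
    log.groupby("service")
       .agg(interventions=("minutes_spent", "size"),
            total_minutes=("minutes_spent", "sum"),
            night_pages=("outside_business_hours", "sum"))
       .sort_values("interventions", ascending=False)
)
print(impact)
```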

When the team goes through an on-duty backlog prioritisation exercise, the success criterion is to have the backlog ranking reflect the top services showcased in the chart. This makes our planning discussions much more straightforward, as everyone is instantly aligned on investing in the most painful areas.

The 4th habit law: make it satisfying

Going through a weekly review of the on-call historical log not only enabled us to decide what to act on next, but also offered us occasions for celebration.

This data-driven process has shifted gears in how we approach the on-call rotation, as showcased in the figure below. Each on-call rotation has turned into a quest that every team member is fully empowered to win by bringing the trend line of interventions down.

The ultimate on-call process we adopted and actively use today.

After two years of consistently going through all these process stages, the last on-call handover we had (at the time of writing this article) was kicked off with the following statement: “This on-call rotation was rather boring.” We took the time to call this out and celebrate, as it was a huge milestone for us.

Conclusion

Looking back at how we started, I believe the conscious transition of our WHY, from actions driven by a sense of insecurity to actions driven by a definite sense of empowerment, has completely reshaped our team’s culture or, better said, our golden circle:

Our systems will continue to evolve to meet our users’ expectations as they increasingly adopt digitalised solutions. But so will we, to meet our own needs, whether personal or professional.

Change is the only constant, they say. Yet our story showcases that engineering teams directly facing the dynamics of the business while doing on-call need to define another constant: developing a healthy on-call process. While the on-call process depends on each team’s setup, the on-call pain points are the same for everyone:

  • “me time”, family time, and friends time interrupted
  • sleep interruption and deprivation
  • anxiety caused by not knowing what you will get paged on next, or whether you should go for that bike ride or not

Keeping your systems healthy means investing just as much in keeping your team healthy. This is a mantra we, as a team, have been holding each other accountable to. And we have learned that the best way to start addressing it is to first look at the data and then build healthy team habits to act upon it.

Additional reading

  1. Adobe Experience Platform: https://www.adobe.com/experience-platform.html
  2. Spinnaker usage within AAM and Experience Edge: https://medium.com/adobetech/experiences-with-spinnaker-on-adobe-experience-platform-bae6cf351f34
  3. Using Akka Streams to Build Streaming Activation in Adobe Real-Time Customer Data Platform (Part 1): https://medium.com/adobetech/using-akka-streams-to-build-streaming-activation-in-adobe-real-time-customer-data-platform-part-1-3acd554ad71
  4. Using Akka Streams to Build Streaming Activation in Adobe Real-Time Customer Data Platform (Part 2): https://medium.com/adobetech/using-akka-streams-to-build-streaming-activation-in-adobe-real-time-customer-data-platform-part-2-66d8353f1c00
  5. Redesigning Siphon Stream — A story about Spark performance improvements: https://medium.com/@jaeness/1eca98fc85ad
  6. 70 Billion Events per Day — Adobe & Kotlin: https://www.youtube.com/watch?v=kKPBWIKrSOU
  7. Bringing Metadata Faster to Edge Infrastructures with Hollow: https://medium.com/adobetech/bringing-metadata-faster-to-edge-infrastructures-with-hollow-2a9a3bd646ea


Written by Bianca Costache (Teşilă)

Engineering Manager @Adobe | Emerging Tech Enthusiast | Space Nerd
