Game Days: How to Prepare Engineers for Production Incidents

June 20, 2024

DevOps, incident management, Game Days, SRE, Volkswagen, EV charging

"You build it, you run it!" That's one of the core mottos of DevOps culture. But there's a gap between adopting this philosophy and actually being ready for production incidents. At Elli, a provider of electric mobility solutions in the Volkswagen Group, we bridged that gap with Game Days.

The problem

We were running the backend for one of Europe's largest EV charging networks. Our engineers built the services, and they were on-call for them. But being on-call and being prepared for incidents are two different things.

Most engineers had never experienced a real high-severity incident. When one hit, the stress, the unfamiliar tools, and the unclear processes all compounded into slow response times and unnecessary escalations.

What are Game Days?

Game Days are controlled chaos. You deliberately inject failures into your systems and have teams respond to them as if they were real incidents. Think of it as a fire drill for your infrastructure.

The key difference from chaos engineering tools like Chaos Monkey is that Game Days are collaborative, educational events. The goal isn't just to test resilience; it's to train people.

How we ran them

We organized Game Days for 51 backend engineers across multiple teams. Here's what worked:

Planning phase:

Identified realistic failure scenarios based on past incidents and known risks
Prepared runbooks and ensured monitoring was in place
Set clear rules of engagement: what could be broken, what was off-limits
Assigned observers to document everything
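
To make the rules of engagement concrete, a planning checklist like the one above can be written down as data and checked before the event. The following is only a minimal sketch, not the tooling we used: the class names, field names, and service names are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scenario:
    """One planned failure injection (all fields are hypothetical)."""
    name: str
    target: str    # service the failure is injected into
    runbook: str   # where responders should look first
    observer: str  # person documenting the response


@dataclass
class RulesOfEngagement:
    """What may be broken during the Game Day and what is off-limits."""
    allowed_targets: set[str] = field(default_factory=set)
    off_limits: set[str] = field(default_factory=set)

    def permits(self, scenario: Scenario) -> bool:
        # Off-limits always wins, even if a target appears in both sets.
        if scenario.target in self.off_limits:
            return False
        return scenario.target in self.allowed_targets


def validate_plan(rules: RulesOfEngagement, scenarios: list[Scenario]) -> list[str]:
    """Return the names of scenarios that violate the rules of engagement."""
    return [s.name for s in scenarios if not rules.permits(s)]


# Illustrative plan with made-up service names.
rules = RulesOfEngagement(
    allowed_targets={"staging-session-service"},
    off_limits={"payment-gateway"},
)
plan = [
    Scenario("session service latency", "staging-session-service",
             "runbooks/session-latency", "observer-a"),
    Scenario("payment gateway outage", "payment-gateway",
             "runbooks/payment-outage", "observer-b"),
]
print(validate_plan(rules, plan))  # ['payment gateway outage']
```

Running a check like this during planning catches a scenario that targets an off-limits system before anyone injects a failure.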

Execution:

Teams responded to injected failures using real incident processes
Observers tracked response times, communication quality, and tool usage
We ran retrospectives immediately after each scenario

Results:

Uncovered 59 action items: gaps in monitoring, missing runbooks, unclear escalation paths, infrastructure vulnerabilities
Engineers gained hands-on experience with incident tools (PagerDuty, GCP Stackdriver)
Teams built confidence in their ability to handle real incidents
Cross-team communication improved significantly

Key lessons

Start small. Don't simulate a full datacenter outage on day one. Begin with single-service failures and build complexity over time.
Make it safe. Engineers need to feel comfortable making mistakes during Game Days. If people fear blame, they won't engage authentically.
Document everything. The action items from Game Days are gold. They reveal blind spots that no amount of code review or architecture discussion will surface.
Make it regular. One Game Day builds awareness. Regular Game Days build culture. We saw the biggest improvements in teams that participated multiple times.
Include everyone. Not just backend engineers, but also product managers, support teams, and leadership. Incidents are a whole-company concern.

The impact

After implementing the 59 action items, our mean time to detect (MTTD) and mean time to resolve (MTTR) improved measurably. More importantly, engineers wanted to be on-call. They felt prepared rather than anxious.
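
If you want to track the same two metrics, here is a minimal sketch of how they can be computed from incident timelines. It rests on assumptions: the `IncidentTimeline` type is hypothetical, both metrics are measured from the moment the failure starts (some teams measure time to resolve from detection instead), and the timestamps are invented for illustration, not taken from our Game Days.

```python
from datetime import datetime
from statistics import mean
from typing import NamedTuple


class IncidentTimeline(NamedTuple):
    """Key timestamps for one incident or Game Day scenario (hypothetical)."""
    started: datetime   # failure injected or incident began
    detected: datetime  # first alert acknowledged or problem noticed
    resolved: datetime  # service restored


def mean_minutes(durations):
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd_minutes(timelines):
    """Mean time to detect: average of (detected - started)."""
    return mean_minutes(t.detected - t.started for t in timelines)


def mttr_minutes(timelines):
    """Mean time to resolve: average of (resolved - started)."""
    return mean_minutes(t.resolved - t.started for t in timelines)


# Invented example data: two scenarios from one Game Day.
timelines = [
    IncidentTimeline(datetime(2024, 1, 1, 10, 0),
                     datetime(2024, 1, 1, 10, 4),
                     datetime(2024, 1, 1, 10, 30)),
    IncidentTimeline(datetime(2024, 1, 1, 11, 0),
                     datetime(2024, 1, 1, 11, 6),
                     datetime(2024, 1, 1, 11, 50)),
]
print(mttd_minutes(timelines))  # 5.0
print(mttr_minutes(timelines))  # 40.0
```

Comparing these numbers across successive Game Days gives a simple trend line, provided the start, detection, and resolution times are recorded the same way each time.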

I presented these findings at BigTechDay 2021. If you're building a DevOps culture and struggling with incident readiness, Game Days are one of the highest-ROI investments you can make.
