Wargaming in the AI Era

AI continues to dominate our discussions, including in incident response and SRE. The emergence of AI assistants, AI-generated code and self-remediation agents shows that SRE is very much embracing the AI boom. Nevertheless, it’s worth reflecting on how AI changes, or even eliminates, some traditional practices in fields such as incident response.

I’m fortunate enough to attend lots of developer events in this job, and therefore have lots of interesting conversations. The future of Wargaming for SREs is one such topic that popped up at a recent OOPs meetup, leading me to share my thoughts here.

In this piece I’ll explain why Wargaming isn’t dead quite yet, and why in the AI era it is in fact needed to make incident response routine, allowing us to focus on rarer and more complicated problems.

What is Wargaming?

Wargaming is the practice of running human-centred exercises against systems with simulated issues. It is sometimes known in Amazon circles as Game Days, and the Google SRE handbook refers to shorter versions known as The Wheel of Misfortune or Walk the Plank. These exercises help SREs, management and other relevant humans practice and build their decision-making and problem-solving skills in a simulated, pressurised environment. They can be based on theoretical or real-life problems in a large system, and can be modelled as either live exercises or table-top RPG-like simulations.

When you think about practicing for these events, it makes a lot of sense. As I write this article, I am on a flight. The pilot in the cockpit goes through regular training scenarios using flight simulators, not only to learn how to handle emergency situations such as a fault in one engine, but to maintain that knowledge so that they always know how to handle it.

Wargames help us maintain that knowledge in much the same way. They aim to:

Build human intuition of what might be wrong in a system for future events
Nurture team coordination and collaboration skills within a pressurised situation
Develop situational awareness, not just fault tolerance
Sharpen triaging skills
Practice formal communication skills, both on active bridges (or incident calls) and when updating relevant stakeholders on the progress of an outage

That’s Chaos Engineering, right?

Not quite. I have heard of Wargaming being covered as a Chaos Engineering practice. While there can be elements of chaos at times within a wargame scenario, Chaos Engineering is, in my mind, quite different. It refers to the practice of using technology to introduce faults (or as I like to call them, gremlins) into a live production system to assess whether the system is able to self-correct. Several tools are available to inject faults, such as Gremlin (nice name).

It’s not intended to cause outages for humans to fix all the time. Instead we create experiments where faults are injected, and then we validate the results to see if the system remained stable as we (gulp) hoped, as highlighted in this article.
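To make that experiment loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ToyService`, `steady_state` and `run_experiment` are hypothetical names, and the "system" is an in-memory stand-in rather than a real production environment or the API of any chaos tool. It only shows the shape of an experiment: check the system is healthy, inject one fault, then validate that it stayed stable.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: a pool of replicas."""
    replicas: int = 3
    healthy: set = field(init=False)

    def __post_init__(self):
        # Every replica starts out healthy.
        self.healthy = set(range(self.replicas))

    def kill_replica(self, replica_id: int) -> None:
        # The injected fault (our gremlin): take one replica out of service.
        self.healthy.discard(replica_id)

    def handle_request(self) -> bool:
        # A request succeeds while at least one replica is still healthy.
        return len(self.healthy) > 0


def steady_state(service: ToyService, requests: int = 100,
                 min_success_rate: float = 0.99) -> bool:
    """Hypothesis to validate: almost all requests succeed."""
    successes = sum(service.handle_request() for _ in range(requests))
    return successes / requests >= min_success_rate


def run_experiment(service: ToyService, rng: random.Random) -> str:
    # 1. Only experiment on a system that is healthy to begin with.
    if not steady_state(service):
        return "aborted"
    # 2. Inject a single fault into a randomly chosen replica.
    victim = rng.choice(sorted(service.healthy))
    service.kill_replica(victim)
    # 3. Validate: did the system stay stable without human help?
    return "passed" if steady_state(service) else "failed"
```

With three replicas the toy system should absorb the fault and the experiment passes; with a single replica it fails, which is exactly the kind of weakness an experiment is meant to surface before a real outage does.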

In fact, if teams are not careful and diligent with their Chaos Engineering practices, they run the risk of SRE burnout as individuals constantly fight fires and handle alerts in a system with poor fault tolerance. So be wary!

Why do we still need Wargaming?

In software engineering, we like to say that “X is dead” to make the (often controversial) claim that practice X is either no longer needed, or is superseded by Y. For me that clings to the hope of a single silver bullet to slay the software engineering werewolf, and much like Frederick P. Brooks in his seminal paper, I’m not convinced one exists.

Currently I see three key challenges that wargaming can help us with.

Building muscle memory for incidents

The adoption of AI tools is leading to very different outcomes for junior and senior engineers. While one study points to a four percent increase in productivity for engineers, it also found a disparity in the results between junior and senior developers. In fact, it found that while junior developers use AI up to thirty-seven percent more, the productivity gains are reaped exclusively by senior engineers.

One key advantage of wargaming is the ability to rehearse incident management practices outside of an incident. Be it practicing a musical instrument, gymnastics or sample exam papers, we gain comfort and expertise by repeatedly doing things in safe environments. This to me represents a potential opportunity to equalize experience and share knowledge.

I come from the traditional background of learning how to handle incidents in a real-life incident. As a junior engineer, I found my first incident to be terrifying. I had no idea what was going on (as practices were in typical fashion not well documented). I had support agents interrupting my train of thought as I tried to dig into code and logs to figure out the problem. I was high on a mixture of adrenalin and panic. I wouldn’t consider it to be a fantastic experience. But it was a strong learning experience that helped me build resilience and the skills I needed for triaging and handling incident bridges.

I wish wargaming had been an established practice back in 2011, because I see it as a way to receive training on these incident management skills without the time pressure of fixing an issue for angry and frustrated users. But it also helps us:

Find out where important things are such as runbooks and production logs
Identify limitations and inaccuracies in processes and documentation
Develop problem solving skills and intuition to find issues where the error may be masked by something else
Practice using the incident management tools that we may not use every day, including AI tools

Practicing with AI tools in an outage scenario is important to gain an understanding of how these tools work. In an outage scenario, AI tools with access to runbooks, logs, telemetry history and proprietary documentation can help us identify potential issues and fixes more quickly. Yes, we may be using LLMs and coding agents in building these systems. But we need to make sure we can ask the right questions of an LLM, detect problems that a self-healing agent may not be able to fix, and detect potential hallucinations in the outputs of AI assistants (which could be down to training data or even our own out-of-date runbook poisoning the LLM’s context).

The Leftover Principle

Wargaming is also important in ensuring we have the skills required to manage and rectify incidents, because the use of AI will likely reduce the number of simple, straightforward incidents. Using agents to perform common remediation tasks such as bouncing components, truncating old files from almost full disk locations, and performing deployments and rollbacks is very much a current rather than future thing. Teams are already doing this.

One talk at the OOPs meetup last week focused on the Leftover Principle. If you’re not familiar, this refers to agents handling the low-hanging fruit, or simple common issues. As the agent fixes the common and easy-to-remediate problems, the remaining issues that SREs manage become inherently more complex.
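A small worked example may help show the effect. The sketch below is purely illustrative: the incident list, the complexity scores and the `agent_can_fix` flag are made-up assumptions, not data from the talk or from any real system. It simply splits incidents between an agent and humans and compares the average complexity of what is left over.

```python
from statistics import mean
from typing import NamedTuple


class Incident(NamedTuple):
    description: str
    complexity: int      # hypothetical score, 1 (trivial) to 10 (very hard)
    agent_can_fix: bool  # assumption: a known, automatable remediation exists


# Invented incident log, for illustration only.
INCIDENTS = [
    Incident("bounce stuck component", 1, True),
    Incident("disk location almost full", 2, True),
    Incident("roll back bad deployment", 3, True),
    Incident("cascading cache failure", 8, False),
    Incident("data corruption after migration", 9, False),
]


def leftover_for_humans(incidents):
    """Return the incidents that remain once the agent takes the easy ones."""
    return [i for i in incidents if not i.agent_can_fix]


def mean_complexity(incidents):
    """Average complexity score across a list of incidents."""
    return mean(i.complexity for i in incidents)
```

In this made-up log, the average complexity across all five incidents is 4.6, but the two incidents left for humans average 8.5. Humans see fewer incidents, and every one of them is hard.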

Without wargaming in these scenarios, we’ll find that SREs, engineers and management become less versed in incident management, as they won’t have as many incidents to train on as I (sadly) did on some systems that I’ve worked on. They will become slower to fix issues, and slower still as they forget the procedures for handling incidents, where to find the production logs, how to communicate with stakeholders and the other skills that wargaming helps every team member build.

Code familiarity

Another aspect of handling these issues is that we may be less familiar or comfortable with the code as engineers increasingly use AI to generate it. We can argue that this makes us less familiar with the code that’s written, and that we struggle to keep track of changes due to the sheer amount of code that LLMs can rapidly generate.

As we use LLMs to write more code for us, teams are reporting an increase in the review burden as they struggle to understand code changes that often don’t have well documented justification for their approaches. There is a case to be made for context anchoring, as described by Rahul Garg earlier this year. Regardless of that context, much like alert fatigue, review fatigue is very much a real thing too. With review fatigue, there’s a potential to let through changes with unknown defects. And if we also generate tests with AI, it’s possible the test suite may not pick them up.

Of course, testing can’t prove the absence of bugs in our code. But the churn of AI code is leading to increases in usage on platforms such as GitHub, which has seen an uptick in incidents over 2025-2026 when compared to 2024. It’s possible that many software teams will likewise find that more defects slip through the cracks in their testing harnesses.

Conclusion

In this piece I’ve made the case that wargaming is not going away immediately in the age of AI. You could argue that with repair agents and assistants we have access to great tools to help us manage and fix issues. But the leftover principle and the code review burden mean we’ll need to handle ever more complex issues, and we can’t do that without maintaining our skills.

I am open to changing my mind. After all, committing to a career in software engineering is committing to change, and to the need to adapt and change one’s mind over time. But for the moment, I would say that wargaming is here to stay.

Thanks for reading!