When AI Becomes Your SRE: How Incident.io Is Automating Incident Response


When your site goes down, every second counts. For years, Incident.io has helped engineering teams coordinate through chaos—getting the right people in the room, keeping stakeholders informed, and restoring order fast.

Now, they’re building something new: an AI SRE that can actually help diagnose and respond to incidents.

In this episode, Teresa Torres talks with Lawrence Jones (Founding Engineer) and Ed Dean (Product Lead for AI) about how their team is teaching AI to think like a site reliability engineer. They share how they went from simple prototypes that summarized incidents to a multi-agent system that forms hypotheses, tests them, and even drafts fixes—all from within Slack.

You’ll hear how they:

Identify which parts of debugging can safely be automated
Combine retrieval, tagging, and re-ranking to find relevant context fast
Use post-incident “time travel” evals to measure how well their AI performed
Balance human trust and AI confidence inside high-stakes workflows

This is a masterclass in designing AI systems that think, reason, and collaborate like expert teammates.

Show Notes

Guests

Lawrence Jones, Founding Engineer at Incident.io
Ed Dean, Product Lead for AI at Incident.io

Key Takeaways

AI’s biggest impact comes from compressing time: identifying causes in minutes instead of hours.
Retrieval-augmented reasoning still benefits from simplicity: deterministic tagging and re-ranking often beat complex vector setups.
Post-incident “time travel” evals let teams score AI accuracy after they know what really happened.
Building trust in AI isn’t just about precision—it’s about showing reasoning and uncertainty in ways humans understand.
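
To make the "deterministic tagging and re-ranking" takeaway concrete, here is a minimal sketch of the general idea. It is not Incident.io's implementation: the `Doc` shape, the `TAG_RULES` keyword table, and the scoring function are all assumptions made up for illustration. Documents are tagged by plain keyword rules, candidates are filtered by shared tags, and the survivors are re-ranked by tag and word overlap, with no vector index involved.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """A past incident or runbook entry (hypothetical shape)."""
    doc_id: str
    text: str
    tags: set[str] = field(default_factory=set)


# Hypothetical keyword-to-tag rules; a real team would maintain its own.
TAG_RULES = {
    "database": ("postgres", "deadlock", "replica"),
    "deploy": ("deploy", "rollback", "release"),
    "network": ("timeout", "dns", "latency"),
}


def tag(text: str) -> set[str]:
    """Assign tags deterministically by substring keyword match."""
    lowered = text.lower()
    return {
        name
        for name, keywords in TAG_RULES.items()
        if any(keyword in lowered for keyword in keywords)
    }


def retrieve(query: str, docs: list[Doc], top_k: int = 3) -> list[Doc]:
    """Filter docs by shared tags, then re-rank by tag and word overlap."""
    query_tags = tag(query)
    query_words = set(query.lower().split())

    # Stage 1: keep only docs that share at least one tag with the query.
    candidates = [d for d in docs if d.tags & query_tags]

    # Stage 2: re-rank by number of shared tags, then shared words.
    def score(d: Doc) -> tuple[int, int]:
        shared_words = query_words & set(d.text.lower().split())
        return (len(d.tags & query_tags), len(shared_words))

    return sorted(candidates, key=score, reverse=True)[:top_k]


# Toy corpus, invented for this example.
docs = [
    Doc("a", "Postgres replica lag after deploy"),
    Doc("b", "DNS timeout in edge proxy"),
    Doc("c", "Rollback of release 12 fixed checkout errors"),
]
for d in docs:
    d.tags = tag(d.text)

results = retrieve("postgres deadlock after deploy", docs)
print([d.doc_id for d in results])
```

Because every step is a rule rather than a learned embedding, it is easy to explain why a given document was surfaced, which fits the episode's point about showing reasoning in ways humans understand.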

Mentioned Tools & Concepts

Slack as the interface for human-AI collaboration
PGVector and Postgres for retrieval experiments
RAG (Retrieval-Augmented Generation)
Multi-agent orchestration
“AI as your company’s immune system”

Chapters

00:00 Meet the Founders: Lawrence and Ed
00:41 Introduction to Incident.io
01:25 Evolution of Incident.io Products
02:14 Understanding SRE and Its Importance
04:01 Real-World Incident Management
05:51 The Role of AI in Incident Management
10:12 Challenges and Innovations in AI SRE
12:14 Prototyping and Iterating AI Solutions
16:25 Refining Retrieval Strategies
21:52 Balancing AI and Human Interaction
32:06 User Experience and Trust in AI Systems
36:08 Interactive Slack Integration
37:08 Understanding the AI Investigation Process
37:50 Parallel Checks and Data Sources
38:35 Building Hypotheses and Refining Findings
40:09 Human-Agent Collaboration
49:23 Evaluating AI Effectiveness
01:04:13 Future Developments and Integrations

Full Transcript

Podcast transcripts are only available to paid subscribers.
