
Staff Site Reliability Engineer - Confluent Incident Management & Reliability

IBM
Department: Education
Type: REMOTE
Region: Cardiff, Wales
Location: Markham, Wales, United Kingdom
Experience: Mid-Senior level
Estimated Salary: £120,000 - £180,000
Skills:
SRE, INCIDENT MANAGEMENT, RELIABILITY ENGINEERING, AWS, GCP, AZURE, ROOTLY, PAGERDUTY, JIRA, CONFLUENCE, SLACK, SLO, SLA, ERROR BUDGETS, OBSERVABILITY, METRICS, LOGGING, TRACING, KUBERNETES, CI/CD, KAFKA, EVENT STREAMING

Job Description

Posted on: May 15, 2026

Introduction

At IBM Software, we transform client challenges into solutions, building the world’s leading AI-powered, cloud-native products that shape the future of business and society. Our legacy of innovation creates endless opportunities for IBMers to learn, grow, and make an impact on a global scale.

Working in Software means joining a team fueled by curiosity and collaboration. You’ll work with diverse technologies, partners, and industries to design, develop, and deliver solutions that power digital transformation. With a culture that values innovation, growth, and continuous learning, IBM Software places you at the heart of IBM’s product and technology landscape. Here, you’ll have the tools and opportunities to advance your career while creating software that changes the world.

With Confluent, data doesn’t sit still. We put information in motion, streaming in near real time so organizations can react faster, build smarter, and deliver experiences as dynamic as the world around them.

Your Role And Responsibilities

About the Role: Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale: data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices. You'll be part of a global team with follow-the-sun coverage, with clean handoffs that keep everyone working sustainable hours.

This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes

Preferred Education

Master's Degree

Required Technical And Professional Expertise

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Preferred Technical And Professional Experience

  • Advanced Cloud Knowledge: Experience with cloud-based infrastructure and its application in reliability and resiliency engineering.
  • Specialized Scripting Skills: Proficiency in scripting languages and automation tools to optimize system reliability and performance.
Originally posted on LinkedIn
