Senior Site Reliability Engineer: AI & Data Systems, Platform team

About the role

As a Senior SRE on our Platform team, you'll ensure the reliability and performance of our cloud-native platform running entirely on Kubernetes. This role uniquely combines SRE excellence with data systems expertise, as you'll be responsible for everything from our core microservices to data pipelines and AI model serving infrastructure.

You'll lead the engineering of our data analytics and warehousing systems, both internal platforms and external integrations, while maintaining the operational excellence our enterprise customers demand. This is an opportunity to apply SRE principles across our entire modern stack - from Kubernetes workloads to real-time data streams - while building MLOps practices that enable rapid AI innovation.

What you’ll do

Enable AI in production: Partner with our AI team to deploy models reliably, building monitoring systems and rollback mechanisms for ML services on Kubernetes
Own platform reliability: Lead incident response, implement SLOs/SLAs, and drive post-mortems across all platform services running on our multi-cluster Kubernetes environment
Implement ML deployments: Build CI/CD for ML artifacts, containerize models, and deploy them as API endpoints on Kubernetes. Apply MLOps practices including feature stores, caching, A/B testing, and versioning
Maintain data systems reliability: Implement observability for data pipelines and databases, establish data quality monitoring, and collaborate on consistency patterns across environments
Build comprehensive observability: Deploy monitoring across microservices, data pipelines, and ML systems using OpenTelemetry and modern observability platforms
Automate everything: Write Python automation for operational runbooks and for infrastructure provisioning with Terraform, and contribute to data pipeline orchestration

Your background

4+ years of experience as an SRE, Platform Engineer, or similar role with a strong operational track record
Kubernetes expertise: You've operated production K8s clusters and understand cloud-native patterns
AWS proficiency: Deep hands-on experience with AWS services, especially EKS, RDS, S3, and data services
Data systems experience: You've built or operated data pipelines, worked with data warehouses, and understand data consistency challenges
Strong Python skills: You write clean automation code and can work with data processing frameworks
Database knowledge: Experience maintaining schema consistency and data integrity at scale
SRE mindset: You think in SLOs and error budgets, and you build reliable systems from day one

Bonus points if you have

Experience contributing to data warehouse or analytics platform projects
Experience with real-time data streaming (Kafka, Kinesis)
Experience with data quality monitoring and validation systems
Experience operating ML model serving infrastructure or MLOps practices
Experience implementing GitOps with ArgoCD
Experience with data governance and compliance requirements

Why you’ll love this role

Modern stack: Everything runs on Kubernetes and AWS - no legacy systems to maintain
Massive impact: Your work will improve the lives of millions of workers and transform how huge enterprises operate
Technical variety: From optimizing Kubernetes performance to designing MLOps practices
AI-native platform: Apply SRE principles to cutting-edge AI systems built for scale
Real-world impact: Your reliability work directly affects operations at major venues and hospitals

To apply

If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.

We encourage you to apply even if your experience doesn't exactly match our job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels like they belong.
