Senior Site Reliability Engineer: AI & Data Systems, Platform team

About the role

As a Senior SRE on our Platform team, you'll ensure the reliability and performance of our cloud-native platform running entirely on Kubernetes. This role uniquely combines SRE excellence with data systems expertise, as you'll be responsible for everything from our core microservices to data pipelines and AI model serving infrastructure.

You'll lead the engineering of our data analytics and warehousing systems, both internal platforms and external integrations, while maintaining the operational excellence our enterprise customers demand. This is an opportunity to apply SRE principles across our entire modern stack - from Kubernetes workloads to real-time data streams - while building MLOps practices that enable rapid AI innovation.

What you’ll do

  • Enable AI in production: Partner with our AI team to deploy models reliably, building monitoring systems and rollback mechanisms for ML services on Kubernetes
  • Own platform reliability: Lead incident response, implement SLOs/SLAs, and drive post-mortems across all platform services running on our multi-cluster Kubernetes environment
  • Implement ML deployments: Build CI/CD for ML artifacts, containerize models, and deploy them as API endpoints on Kubernetes. Apply MLOps practices including feature stores, caching, A/B testing, and versioning
  • Maintain data systems reliability: Implement observability for data pipelines and databases, establish data quality monitoring, and collaborate on consistency patterns across environments
  • Build comprehensive observability: Deploy monitoring across microservices, data pipelines, and ML systems using OpenTelemetry and modern observability platforms
  • Automate everything: Write Python automation for Terraform-based infrastructure provisioning and operational runbooks, and contribute to data pipeline orchestration

Your background

  • 4+ years of experience as an SRE, Platform Engineer, or similar role with a strong operational track record
  • Kubernetes expertise: You've operated production K8s clusters and understand cloud-native patterns
  • AWS proficiency: Deep hands-on experience with AWS services, especially EKS, RDS, S3, and data services
  • Data systems experience: You've built or operated data pipelines, worked with data warehouses, and understand data consistency challenges
  • Strong Python skills: You write clean automation code and can work with data processing frameworks
  • Database knowledge: Experience maintaining schema consistency and data integrity at scale
  • SRE mindset: You think in SLOs and error budgets, and you build reliable systems from day one

Bonus points if you have

  • Experience contributing to data warehouse or analytics platform projects
  • Experience with real-time data streaming (Kafka, Kinesis)
  • Experience with data quality monitoring and validation systems
  • Experience operating ML model serving infrastructure or MLOps practices
  • Experience implementing GitOps with ArgoCD
  • Experience with data governance and compliance requirements

Why you’ll love this role

  • Modern stack: Everything runs on Kubernetes and AWS - no legacy systems to maintain
  • Massive impact: Your work will improve the lives of millions of workers and transform how huge enterprises operate
  • Technical variety: Your work ranges from optimizing Kubernetes performance to designing MLOps practices
  • AI-native platform: Apply SRE principles to cutting-edge AI systems built for scale
  • Real-world impact: Your reliability work directly affects operations at major venues and hospitals

To apply

If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.

We encourage all candidates to apply, even if your experience doesn't exactly match our job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels like they belong.
