Lead Data SRE (Hybrid - Chennai, India)

Post Date

Mar 05, 2026

Location

Irvine, California

ZIP/Postal Code

92612

Job Type

Contract

Job Description

Role Overview
The Data SRE Lead is responsible for ensuring the reliability, scalability, performance, and operational excellence of the organization’s data platforms and pipelines. This role bridges Data Engineering and Site Reliability Engineering practices, applying SRE principles to modern data ecosystems (batch, streaming, warehousing, and ML data infrastructure). This is a hybrid role, on-site at the client’s Chennai, India location 3 days per week.

Key Responsibilities
Reliability & Operations
Define and own SLIs, SLOs, and SLAs for data platforms and pipelines
Design and implement monitoring, alerting, and observability solutions
Lead incident response, root cause analysis (RCA), and postmortems
Reduce toil through automation and self-healing infrastructure

Data Platform Stability
Ensure high availability of:
Data warehouses and lakehouses
Streaming systems
ETL/ELT pipelines
Orchestration frameworks
Implement capacity planning and performance tuning strategies
Improve data pipeline reliability, freshness, and latency metrics

Infrastructure & Automation
Manage infrastructure-as-code (IaC) frameworks
Improve CI/CD pipelines for data workflows
Implement automated testing and validation for data infrastructure
Drive resilience patterns such as retries, circuit breakers, and graceful degradation

Leadership & Strategy
Lead and mentor a team of Data SREs
Define operational standards and reliability roadmaps
Collaborate cross-functionally with Data, Engineering, and Product leadership
Drive a culture of reliability and operational excellence

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.

Required Skills & Experience

8+ years in Site Reliability Engineering, Platform Engineering, or Data Engineering
3+ years in a technical leadership role
Strong experience with:
Cloud platforms (AWS, GCP, or Azure)
Infrastructure as Code (Terraform, CloudFormation)
Monitoring tools (Prometheus, Datadog, Grafana)
Containerization & orchestration (Docker, Kubernetes)
Deep understanding of distributed systems and failure modes
Experience supporting large-scale data systems (batch & streaming)

Nice to Have Skills & Experience

Experience with modern data platforms (Snowflake, BigQuery, Databricks)
Experience with streaming systems (Kafka, Pub/Sub, Kinesis)
Knowledge of data quality frameworks and data observability
Familiarity with ML platform reliability

Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.