Job Description
Lead SRE project planning and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.
Responsibilities:
- Design and implement comprehensive SRE monitoring for distributed applications
- Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
- Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
- Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards set by enterprise teams
- Implement zero-code instrumentation for monitoring and traceability
- Experience defining and working with core SRE models such as SLIs, SLOs, and error budgets
- Design dashboards for reliability-focused metrics (latency, request rate, errors, duration, availability)
- Build service health dashboards with drill-down capabilities and error message analysis
- Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting
- Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams
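As a rough illustration of the SLO and error-budget work mentioned above, the following Python sketch computes how much of an error budget remains for a request-based availability SLO. The function name, the request-count inputs, and the 99.9% target are assumptions chosen for the example; they are not part of the role's actual tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    This is a hypothetical helper for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests allowed to fail under the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests in the window, 250 of them failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

With a 99.9% target, 1,000,000 requests allow about 1,000 failures, so 250 failures consume roughly a quarter of the budget.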
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
7+ years in SRE with proven implementation experience in Azure and GCP observability, the Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation.
- Technical: Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
- Logging & Tracing: Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
- Structured Logging: JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
- Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
- Infrastructure: CI/CD pipelines, AI tools such as GitHub Copilot
- Observability Tools & Query Languages: PromQL for querying metrics (Grafana)
- Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
- OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
- Alerting and incident management: implementing structured processes for handling failures and conducting reviews that focus on fixing system issues
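To make the tracing and structured-logging items above concrete, here is a minimal Python sketch that extracts the trace ID from a W3C `traceparent` header and emits a JSON log line with the fields named in this posting (trace_id, service.name, log.level, customer.id, request.id). The function names, the extra `message` field, and the sample values are assumptions for illustration. The parser accepts only the version `00` header layout; a production service would more likely rely on an OpenTelemetry SDK.

```python
import json
import re
from typing import Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_trace_id(traceparent: str) -> Optional[str]:
    """Return the trace ID from a traceparent header, or None if it is invalid."""
    match = TRACEPARENT_RE.match(traceparent.strip())
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero trace and parent IDs are invalid per the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id


def build_log_line(message: str, level: str, service_name: str,
                   traceparent: str, customer_id: str, request_id: str) -> str:
    """Serialize one structured log record as a JSON string (hypothetical helper)."""
    record = {
        "trace_id": parse_trace_id(traceparent),
        "service.name": service_name,
        "log.level": level,
        "customer.id": customer_id,
        "request.id": request_id,
        "message": message,  # assumed field, not listed in the posting
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Sample header value taken from the W3C Trace Context examples.
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(build_log_line("order accepted", "INFO", "orders-api",
                         header, "cust-123", "req-456"))
```

Keeping the trace ID in every log line is what lets a dashboard correlate a log entry with its trace and related metrics.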
Nice to Have Skills & Experience
- Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL; Azure-based monitoring; and on-call systems, including designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
- AI experience
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.