Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and continuous learning
Ensure the availability, reliability, and performance of critical services through proactive monitoring, capacity planning, and performance tuning
Design, implement, and maintain observability solutions using tools such as AppDynamics, Splunk, Prometheus, Grafana, or OpenTelemetry
Collaborate with software engineering, operations, and product teams to design and deploy scalable and resilient systems
Oversee incident management processes, ensuring timely resolution of incidents and minimizing downtime
Establish and monitor key performance indicators (KPIs) to measure system reliability and performance
Conduct post-incident reviews and implement lessons learned to prevent future occurrences
Stay current with industry trends and emerging technologies to continuously improve SRE practices
Manage budgets and resources effectively to support SRE initiatives and projects
Incident Management: Lead incident response efforts, perform root cause analysis (RCA), and drive post-mortem processes to improve system reliability
Automation & Infrastructure as Code (IaC): Develop automation with Terraform, Ansible, or Kubernetes to reduce manual operational tasks
CI/CD & Deployment Pipelines: Work closely with development teams to enhance deployment strategies and improve continuous integration/continuous deployment (CI/CD) workflows
Cloud & Kubernetes Operations: Manage and optimize cloud infrastructure (AWS, Azure, or GCP) and container platforms (Kubernetes, Docker)
Security & Compliance: Implement best practices for security, compliance, and cost optimization in cloud environments
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
7+ years of experience in site reliability engineering, DevOps, or a related field
5+ years of experience with cloud computing platforms (e.g., AWS, Azure, GCP) and container technologies (e.g., Kubernetes, Docker)
1+ year of experience in a leadership or management role, with a proven track record of managing high-performing teams
3+ years of experience with scripting and programming languages (e.g., Python, Go, Java)
3+ years of experience with monitoring and observability tools (e.g., Prometheus, Grafana, Splunk)
Familiarity with CI/CD pipelines and automation tools (e.g., Jenkins, GitLab CI)
Excellent communication and interpersonal skills, with the ability to collaborate effectively across teams
Strong problem-solving skills and a proactive approach to identifying and addressing issues
Ability to thrive in a fast-paced, dynamic environment and manage multiple priorities
Experience with Agile methodologies and DevOps practices
Benefit packages for this role start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.