Job Description
We are seeking a Senior Site Reliability Engineer (SRE) to play a leadership role in driving reliability, scalability, and observability across a modern, cloud-based application platform. This role will be instrumental in transitioning teams from reactive support to proactive engineering practices, establishing SRE standards, and mentoring junior team members.
The ideal candidate will bring deep expertise in cloud environments, application monitoring, and automation, combined with strong collaboration skills and the ability to influence engineering teams toward operational excellence.
Key Responsibilities
Design, implement, and continuously improve monitoring, alerting, and observability frameworks across production environments
Lead troubleshooting of complex production issues and drive thorough root cause analysis to prevent recurrence
Own and enhance cloud-based application and platform reliability at scale
Partner with and influence engineering teams to improve system performance, scalability, and resiliency through architectural guidance and best practices
Architect and implement automation of operational tasks and workflows to significantly reduce manual intervention
Drive improvements in incident response processes, playbooks, and on-call procedures to minimize downtime
Lead monitoring and administration of containerized environments (Kubernetes/AKS)
Establish and champion SRE best practices and standards across teams, mentoring mid-level and junior engineers in SRE principles
Help guide teams through the cultural and technical transition from reactive support to proactive SRE practices
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
6–8+ years of experience as a Site Reliability Engineer or in a similar role (10+ years of total IT experience)
Deep expertise working in Azure cloud environments at scale
Extensive hands-on experience with monitoring and observability tools (e.g., Elastic, Prometheus, Grafana, or similar), including designing and architecting monitoring strategies
Proven experience supporting production applications in complex, high-availability environments (application-focused SRE rather than infrastructure-only)
Strong knowledge of Kubernetes (AKS) for monitoring, alerting, administration, and troubleshooting
Ability to troubleshoot and debug applications at a deep level, including reading, understanding, and reviewing code
Solid experience with .NET/C# application environments
Experience with SQL and/or NoSQL databases (e.g., Cosmos DB, PostgreSQL)
Demonstrated ability to mentor engineers and drive SRE adoption across teams
Nice to Have Skills & Experience
Experience in multi-cloud or hybrid-cloud environments (Azure + AWS/GCP)
Exposure to EKS or non-Azure Kubernetes environments
Experience supporting single-page applications (e.g., Angular)
Strong background in automation and scripting (e.g., Python, Bash, PowerShell, Terraform)
Prior experience leading teams through the transition to SRE best practices
Experience working in global or distributed teams
Track record of reducing MTTR, improving SLOs/SLIs, and driving operational maturity
Benefit packages for this role start on the first day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.