Job Description
Role Overview
As a Senior Site Reliability Engineer, you will own the stability, reliability, and scalability of production systems. You’ll ensure services remain fast, resilient, and well-observed. This role demands a strong automation mindset, deep troubleshooting skills, and the ability to think like a developer while supporting infrastructure.
Day-to-Day
• Monitor and maintain system health using tools like New Relic, ensuring uptime and performance.
• Build and refine monitoring dashboards and alerts to improve observability.
• Troubleshoot production issues and lead incident response and postmortem processes.
• Automate infrastructure provisioning and deployment workflows using scripting and tooling.
• Define and enforce CI/CD pipeline standards to ensure consistent and reliable releases.
• Collaborate with developers and infrastructure teams to improve reliability and performance.
• Support release schedules and coordinate with QA and development teams.
• Participate in planning for scaling and future infrastructure needs.
• Contribute to building a developer-centric culture and internal tooling.
• Document operational processes, runbooks, and incident reports for transparency and knowledge sharing.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
Must-Have
• 5+ years of experience in Site Reliability Engineering
• Hands-on experience with monitoring tools like New Relic and/or Datadog
• Strong expertise in Windows infrastructure
• Proficient in PowerShell and Python scripting
• Familiarity with AWS or Azure cloud platforms (the team works more heavily in Azure)
• Experience with Infrastructure as Code (IaC) using Terraform and CDK for Terraform
Pluses
• GitHub experience
• Understanding of DevOps culture and practices
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.