Job Description
• Minimum of 5 years of experience in Site Reliability Engineering, IT operations, or related fields.
• Bachelor’s degree in computer science or engineering, or equivalent experience (2 additional years of experience in lieu of a degree).
• Technical expertise in system reliability, scalability, application design, and performance.
• Hands-on experience with observability and monitoring tools such as Grafana, AppDynamics, and Sumo Logic.
• Experience with automation platforms, particularly Ansible, for infrastructure and event-driven automation.
• Proven ability to mentor and guide engineers in adopting SRE practices and principles.
• Excellent communication and collaboration skills across diverse teams and vendors.
• Strong judgment and problem-solving capabilities.
• Experience working in multi-cloud environments.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
• Contribute to the SRE strategy and establish best practices for release management, automation, and system reliability.
• Mentor and guide SRE, Engineering, and Product teams in adopting core SRE principles such as service ownership, reducing toil, and continuous improvement.
• Lead initiatives across SLIs/SLOs, observability, incident management, and postmortem practices, ensuring insights and learnings are captured and acted upon.
• Champion SRE practices by implementing repeatable templates for logging, monitoring, and alerting frameworks.
• Drive observability and monitoring excellence using tools such as Grafana, AppDynamics (AppD), and Sumo Logic, ensuring proactive detection and resolution of issues.
• Partner with engineering to design reliable, fault-tolerant systems and reduce operational toil through automation.
• Implement and leverage the Ansible Automation Platform to help teams automate infrastructure provisioning, configuration management, and event-driven workflows.
• Enable teams to automate operational events and infrastructure changes, reducing manual intervention and improving system resilience.
• Exercise sound judgment to ensure operational compliance with security, privacy, audit, disaster recovery, and other company requirements.
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance; HSA, FSA, and DCFSA account options; and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.