We are seeking a Site Reliability Engineer to join our team. As a member of the team, you will play a critical role in ensuring the reliability, scalability, and performance of our large-scale distributed systems. You will drive operational excellence by proactively identifying and solving problems, improving system performance, and keeping our production environments resilient and efficient. Your experience in orchestrating and automating complex systems, combined with a focus on improving software release processes and managing large cloud environments, will be key to our ongoing success.
You'll be working across multiple cloud platforms, using tools such as Terraform, Ansible, Kubernetes, and Dynatrace, while contributing to the design and operational lifecycle of mission-critical applications. You will collaborate with engineering, development, and product teams to enhance the performance and stability of our production infrastructure, ensuring seamless, high-quality software delivery.
Our ideal candidate would be based in the Eastern Time Zone.
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to Human Resources Request Form. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
5+ years in Site Reliability Engineering, DevOps, or related roles.
Strong focus on tactical operations and experience managing large-scale distributed software applications.
Solid experience with infrastructure as code (Terraform, Ansible).
Proven experience with cloud environments such as GCP and AWS.
Expertise in managing and optimizing Kubernetes clusters for large-scale deployments.
Proficiency in one or more programming languages, such as Python, Java, C/C++, Ruby, or JavaScript.
Experience with monitoring and observability platforms like Dynatrace, Prometheus, or similar tools.
In-depth experience managing dynamic, scalable cloud infrastructure and distributed systems.
Comfortable with ambiguity and complex systems, with the ability to handle challenges with confidence.
Experience in CI/CD pipelines and automation tools.
Familiarity with incident response processes and post-mortem analysis.
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.