Site Reliability Engineer

Post Date

Apr 30, 2024

Location

Sacramento, California

ZIP/Postal Code

95814

Job Type

Contract, Perm Possible

Job Description

Insight Global is seeking a Remote Site Reliability Engineer to support a critical modernization effort with a large government healthcare customer. This individual will be a key member of stream-aligned teams responsible for ensuring the performance and reliability of applications developed by scrum teams. The role will focus on load/performance testing, monitoring, troubleshooting, error management, and other typical duties that enhance the stability and availability of applications. The ideal candidate will have a background in cloud-native DevOps and platform engineering.

Additional responsibilities include:

Conduct load/performance testing to assess application scalability and performance under various conditions, identify bottlenecks, and optimize system resources.

Ensure applications can handle expected loads and maintain optimal performance levels.

Implement and maintain monitoring solutions leveraging the platform toolset to track application health, performance metrics, SLAs, and system behavior in real time, proactively identifying and resolving issues before they impact users.

Ensure early detection and resolution of issues to minimize downtime and maintain high availability.

Investigate and troubleshoot incidents, outages, and performance issues, utilizing diagnostic tools and techniques to identify root causes and implement effective solutions.

Restore service functionality quickly and efficiently to minimize impact on users and business operations.

Design and implement error management strategies, including error handling, logging, and alerting mechanisms, to effectively capture and address application errors and anomalies.

Improve application stability and reliability by minimizing error rates and providing timely alerts for critical issues.

Work with platform and Scrum teams to develop and maintain automation scripts and tools to streamline repetitive tasks, automate deployment processes, and improve operational efficiency.

Increase operational efficiency, reduce manual intervention, and enhance consistency and reliability of deployment and configuration processes.

Lead incident response efforts during critical incidents, coordinating cross-functional teams, communicating updates to stakeholders, and conducting post-incident reviews to identify areas for improvement.

Minimize incidents' impact on business operations, ensure effective response and resolution, and drive continuous improvement in incident management processes.

Provide guidance and support during the development process to help coach developers on good software design patterns that will sustain proper site reliability and operations.

Reduce errors and outages due to improperly functioning code and solutions.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.

To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/ .

Required Skills & Experience

5+ years of experience in DevOps and/or Site Reliability Engineering

Proficiency in system-level testing tools (e.g., JMeter, Gatling) and techniques for assessing application scalability and performance.

Experience with monitoring tools (e.g., Datadog, Prometheus, Grafana, New Relic) for real-time monitoring and alerting.

Knowledge of error management strategies, including error handling, logging, and alerting techniques.

Experience with Helm and Terraform for automating deployment and infrastructure provisioning tasks.

Experience with automation frameworks, including scripting languages (e.g., Go, Bash) and configuration management tools (e.g., Ansible, Terraform).

Ability to collaborate and communicate effectively within stream-aligned teams and coordinate with other stakeholders.

Understanding of Agile and DevOps principles, focusing on continuous improvement and delivery.

Nice to Have Skills & Experience

Experience with GitOps tools (ArgoCD, Flux, Jenkins X, etc.)

Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.