Insight Global is seeking a talented and passionate Mid-to-Senior Level Site Reliability Engineer to join our dynamic engineering team. You will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical systems and applications running on Google Cloud Platform. You will leverage your deep understanding of SRE principles and GCP services, along with tools like Datadog, PagerDuty, ChaosSearch, and HashiCorp Terraform, to proactively identify and resolve potential issues, automate operational tasks, and continuously improve our infrastructure and deployment processes. As the newest member of the SRE team, you'll have the opportunity to work alongside high performers in a small, fast-paced environment, applying your existing expertise while learning new skills.
Responsibilities:
Design, implement, and manage scalable and highly available infrastructure on GCP, utilizing services such as Compute Engine, Kubernetes Engine (GKE), Cloud Storage, BigQuery, and Spanner.
Develop and maintain comprehensive monitoring, alerting, and logging solutions using Datadog and GCP Cloud Logging to provide deep visibility into system health and performance.
Utilize PagerDuty for effective incident management, ensuring timely response and resolution of critical issues.
Proactively identify potential bottlenecks and failure points through capacity planning and performance testing, leveraging ChaosSearch for log analysis.
Automate repetitive operational tasks using scripting languages (e.g., Python, Bash) and infrastructure-as-code tools, primarily HashiCorp Terraform, within the GCP ecosystem.
Participate in incident response, root cause analysis, and post-mortem reviews to drive continuous improvement and prevent future occurrences.
Collaborate closely with development teams to ensure that new services and features are designed, deployed, and operated with reliability and scalability in mind on GCP, including our Spanner database.
Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and track system reliability.
Contribute to the development and maintenance of CI/CD pipelines leveraging GCP services like Cloud Build and Artifact Registry.
Stay up to date with the latest GCP services and best practices, as well as advancements in Datadog, PagerDuty, ChaosSearch, and HashiCorp Terraform, and advocate for their adoption where appropriate.
Pay Rate: $70/hr
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to Human Resources Request Form. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field.
5+ years of experience in a Site Reliability Engineering, DevOps, or similar role.
Significant hands-on experience designing, deploying, and managing applications and infrastructure on Google Cloud Platform, including experience with Google Cloud Spanner.
Strong understanding of core SRE principles and practices, such as toil reduction, automation, monitoring, and incident management.
Proficiency in at least one scripting language (e.g., Python, Bash).
Extensive experience with HashiCorp Terraform for infrastructure-as-code.
Experience with containerization and orchestration technologies, particularly Docker and Kubernetes (GKE preferred).
Proven experience with monitoring and logging tools, specifically Datadog and GCP Cloud Logging.
Experience with PagerDuty for incident management.
Experience with Linux operating systems and a solid understanding of command-line tools (e.g., terraform, kubectl, helm).
Excellent problem-solving and troubleshooting skills in complex distributed systems.
Strong communication and collaboration skills.
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.