Job Description
As a Site Reliability Engineer on our team, you’ll work with the US government and its affiliates on the development of more robust systems by building a resilient infrastructure. You’ll build in redundancy, implement monitoring tools, and automate wherever possible. You’ll reduce toil by scripting routine tasks and automating self-repair. This is your chance to leverage your expertise in monitoring and observability while assisting junior engineers and broadening your knowledge base.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
• 5+ years of experience with information technology
• Experience with provisioning, operations, and management in AWS environments
• Experience with monitoring applications, network appliances, or infrastructure using open source, enterprise, or cloud-native tools, including Prometheus, Grafana, Splunk, ELK, Dynatrace, Datadog, CloudWatch, or OpenSearch
• Experience with applying core SRE practices and principles, including monitoring instrumentation, defining SLIs, SLOs, and error budgets, evaluating production readiness, conducting post-mortems, and reducing toil
• Experience with Cloud infrastructure and automation tools, including Terraform or CloudFormation
• Experience with Git repositories and CI/CD concepts and leveraging Infrastructure as Code (IaC) to configure AWS Cloud environments
• Experience with programming or scripting languages, including Bash, Python, or JavaScript for automation purposes and building a scalable infrastructure in AWS
• Experience with Agile methodologies, SDLC, and working in an Agile development environment
• Bachelor's degree
Nice to Have Skills & Experience
• Experience with making systems fully observable, manipulating and transforming telemetry, and monitoring complex distributed systems across separate regions
• Experience with applying advanced SRE practices and principles, including capacity planning, cost optimization, chaos engineering, self-healing architecture, and advanced alerting techniques
• Experience with integrating monitoring with ITSM tooling, including ServiceNow, Jira Service Desk, PagerDuty, VictorOps, Opsgenie, or Everbridge
• Experience with GitLab CI for CI/CD deployments
• Experience with serverless computing and container orchestration technologies, including AWS Lambda and Amazon ECS and EKS clusters
• Experience with Cloud computing concepts and AWS services, including network and security concepts and services, such as VPCs, Security Groups, VPNs, Firewalls, WAF, or TLS Certificates
• Ability to design, plan, and implement scalable and resilient systems and troubleshoot complex technical issues
• Possession of excellent documentation, problem-solving, and collaboration skills
• Possession of excellent verbal and written communication skills
• AWS Certification
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.