Job Description
Engineering a system to be more resilient and efficient frees up time and money to build more capabilities, whether you come from a background in network engineering, systems administration, or software development.
You'll work with the government and its affiliates to develop more robust systems by building resilient infrastructure. You'll build in redundancy, implement monitoring tools, and automate wherever possible. You'll reduce toil by scripting routine tasks and automating self-repair. This is your chance to leverage your expertise in monitoring and observability while assisting junior engineers and broadening your knowledge base.
Required Skills & Experience
You Have:
* 5+ years of experience with information technology
* 4+ years of experience with provisioning, operations, and management in AWS environments
* 2+ years of experience with monitoring applications, network appliances, or infrastructure using open source, enterprise, or cloud-native tools, including Prometheus, Grafana, Splunk, ELK, Dynatrace, Datadog, CloudWatch, or OpenSearch
* Experience with applying core SRE practices and principles, including monitoring instrumentation, defining SLIs, SLOs, and error budgets, evaluating production readiness, conducting post-mortems, and reducing toil
* Experience with cloud infrastructure and automation tools, including Terraform or CloudFormation
* Experience with Git repositories, CI/CD concepts, and leveraging Infrastructure as Code (IaC) to configure AWS Cloud environments
* Experience with programming or scripting languages, including Bash, Python, or JavaScript, for automation and for building scalable infrastructure in AWS
* Experience with Agile methodologies, SDLC, and working in an Agile development environment
* Ability to obtain a security clearance
* HS diploma or GED
Nice to Have Skills & Experience
Nice If You Have:
* Experience with making systems fully observable, manipulating and transforming telemetry, and monitoring complex distributed systems across separate regions
* Experience with applying advanced SRE practices and principles, including capacity planning, cost optimization, chaos engineering, self-healing architecture, and advanced alerting techniques
* Experience with integrating monitoring with ITSM tooling, including ServiceNow, Jira Service Desk, PagerDuty, VictorOps, OpsGenie, or Everbridge
* Experience with GitLab CI for CI/CD deployments
* Experience with serverless computing and container orchestration technologies, including AWS Lambda, Amazon ECS, and Amazon EKS clusters
* Experience with cloud computing concepts and AWS services, including network and security concepts and services, such as VPCs, Security Groups, VPNs, Firewalls, WAF, or TLS Certificates
* Ability to design, plan, and implement scalable and resilient systems and troubleshoot complex technical issues
* Possession of excellent documentation, problem-solving, and collaboration skills
* Possession of excellent oral and written communication skills
* AWS Certification
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.