Job Description
Build AI agentic flows that automate incident response and operational tasks.
Use LLMs to analyze alerts, logs, and SOPs, then decide the correct actions without human involvement.
Replace repetitive, manual incident work with automation that follows existing processes.
Improve system reliability through better alerting, observability, and automated remediation.
Integrate AI-driven automation with monitoring, logging, and cloud services.
Partner with SRE, DevOps, and platform teams to safely deploy and scale automation.
Continuously improve automation based on real production signals and outcomes.
Pay rate: $24-$28/hour
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
Strong understanding of LLMs and how they can be used to make decisions, trigger actions, and automate workflows.
Proven ability to turn manual SOPs and runbooks into automation, not just follow them.
Strong experience with automation using Python (Go is also acceptable).
Experience working in incident response, reliability, SRE, DevOps, or platform operations environments.
Comfort working in cloud-native systems, especially GCP.
Experience with production observability — knowing what signals matter, what’s breaking, and why.
Required Technical Experience
Google Cloud Platform (GCP)
Automation: Python (Go is acceptable)
Observability: Google Managed Prometheus (GMP), Grafana Enterprise, log configuration and analysis
Google Cloud Services: Kubernetes (GKE), Cloud Logging, BigQuery, Pub/Sub, Google Cloud Storage
General understanding of Google networking
Developer Tools: GitHub Copilot, GitHub Copilot for workflows
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.