In this role, you will be responsible for Event Monitoring & Detection, acting on alerts and events to prevent incidents and minimize business impact. You will work to reduce alert noise and implement and maintain event monitoring across data pipelines, ML model workflows, real-time data streams, and AI services. Integration with tools such as Azure Monitor, Datadog, AppDynamics, ELK, and Grafana will be essential to capture events across ingestion, transformation, model training, and inference. You will set up real-time alerts on SLO/SLI breaches, including data delays, model failures, prediction accuracy drops, and drift detection.
You will design Event Correlation & Analysis logic to reduce alert noise and surface actionable insights from thousands of daily events. Leveraging machine learning or rule-based anomaly detection, you will group related events from multiple sources, such as data pipeline latency and model scoring errors. Intelligent dashboards will be implemented to visualize the health and performance of AI/ML systems from an event-driven perspective.
In Event-Driven Incident Response, you will trigger and coordinate incident response workflows based on critical events impacting AI/ML services. Automation of escalation paths using ServiceNow Event Management or other tools will be key to reducing MTTR. You will lead post-event analysis (PEA) sessions for high-severity events and convert findings into long-term fixes or monitoring enhancements.
Proactive Observability Engineering will involve partnering with ML and Data Engineers to implement custom telemetry for jobs, feature stores, and batch/streaming data pipelines. You will continuously refine alert thresholds, runbooks, and automation scripts to pre-empt failures before they affect the business. You will also document and maintain standard operating procedures (SOPs) for event triage, classification, and escalation.
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
- 5-7+ years in event monitoring, detection, and incident response within data and/or AI/ML environments
- Proficiency in using monitoring tools such as Azure Monitor, Datadog, AppDynamics, ELK, and Grafana
- Experience with real-time data streams, data pipelines, and ML model workflows
- Experience with automating escalation paths using tools like ServiceNow Event Management
- Comprehensive knowledge of logs, traces, metrics, and alerts across cloud-native AI architectures
- Ability to refine alert thresholds, runbooks, and automation scripts to pre-empt failures
- AI/ML experience within observability and monitoring
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.