Job Description
Insight Global is seeking a Machine Learning Reliability Engineer for a large enterprise client modernizing and scaling its ML/AI platform. This role focuses on ensuring ML systems are reliable, observable, and cost-efficient at scale. The engineer will define SLOs, build robust Datadog monitoring, standardize incident response, and partner closely with FinOps and governance teams. This is a highly visible role critical to production ML success—ideal for an SRE who understands ML workloads and wants to own reliability, observability, and operational excellence across enterprise AI systems.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
• Strong background in Site Reliability Engineering (SRE) principles
• Hands-on Datadog experience (dashboards, metrics, logs, traces, alerting)
• Experience supporting ML/AI systems in production
• Ability to define and enforce SLOs / SLIs for distributed systems
• Experience monitoring availability, latency, accuracy, drift, and pipeline health
• Experience operating in cloud environments (Azure strongly preferred)
• Proven skills in performance tuning and cost optimization
• Incident response ownership (alerts, runbooks, escalation paths)
Nice to Have Skills & Experience
• ML-specific observability (model performance, drift, LLM monitoring)
• AI / LLM observability experience
• Snowflake and modern data platform monitoring
• FinOps partnership experience
• ServiceNow integration (incident & change management)
• Enterprise audit, governance, and compliance exposure
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.