We're looking for a technically skilled and proactive Agentic & AI Tech Ops Engineer to join our AI Center of Excellence. This role ensures the reliability, scalability, and efficiency of AI and Agentic AI systems in production. You'll work closely with AI developers, architects, and data scientists to deploy, monitor, and maintain AI infrastructure, while driving automation and operational excellence.
Key Responsibilities:
Deployment & Infrastructure
Deploy and manage AI models and agentic systems across cloud (GCP, AWS, Azure) and on-prem environments.
Implement CI/CD pipelines and optimize cloud resources for scalability and cost-efficiency.
Monitoring & Reliability
Build monitoring, logging, and alerting systems to ensure high availability and performance.
Identify and resolve system issues and performance bottlenecks.
Incident Management
Provide operational support, conduct root cause analysis, and maintain runbooks and SOPs.
Participate in on-call rotations for critical AI services.
Automation & Best Practices
Automate deployments and maintenance using scripting and tools.
Enforce security, compliance, and operational best practices.
Collaboration & Documentation
Partner with cross-functional teams to ensure smooth production transitions.
Maintain clear documentation and provide feedback on system performance.
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request via the Human Resources Request Form. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Qualifications:
Bachelor's degree in Computer Science or a related field.
4-7+ years in Tech Ops, DevOps, SRE, or MLOps roles.
Experience with cloud platforms (especially GCP/Vertex AI), CI/CD tools, scripting (Python, Bash), and containerization (Docker, Kubernetes).
Strong troubleshooting skills and familiarity with monitoring tools (e.g., Prometheus, Grafana).
Preferred Qualifications:
Master's degree and relevant cloud certifications.
Experience with MLOps/LLMOps, AI/ML frameworks (TensorFlow, PyTorch), and agentic AI systems.
Familiarity with vector databases, data pipelines (Airflow, Kubeflow), and agile environments.
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.