Job Description
Description:
We are seeking a Senior DevOps Engineer to join our specialized AI Engineering and Research team. This team is responsible for building Everse (an evaluation and simulation platform for AI agents) and advanced LLM data pipelines. Your role will focus on architecting the underlying infrastructure that allows our researchers and engineers to deploy, scale, and monitor complex AI models and web applications securely.
You will bridge the gap between AI research and production-grade stability, ensuring our Kubernetes clusters and CI/CD pipelines are optimized for high-performance AI workloads.
Key Responsibilities:
• Infrastructure as Code (IaC): Design, build, and maintain scalable cloud infrastructure using Terraform or CloudFormation.
• Kubernetes Orchestration: Manage and optimize secure Kubernetes clusters, specifically for hosting data-heavy React/Node.js applications and Python-based AI services.
• CI/CD Pipeline Development: Build and automate robust deployment pipelines to ensure rapid, high-frequency releases for the Everse platform.
• MLOps Support: Collaborate with AI scientists to streamline the deployment of LLM and RLHF workflows, managing the infrastructure required for model evaluation and simulation.
• Security & Compliance: Implement security best practices (OWASP, IAM roles) to ensure data privacy within our annotation and video surveillance tools.
• Monitoring & Observability: Establish deep visibility into system performance and track costs for cloud resources (AWS/GCP/Azure).
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
• 6+ years of DevOps/SRE experience in a cloud-native environment.
• Expert-level Kubernetes (K8s) knowledge, including cluster security, networking, and scaling.
• Strong proficiency in Python (for automation scripts and data pipeline support) and Shell scripting.
• Cloud Providers: Deep, hands-on expertise in AWS, GCP, or Azure.
• IaC Mastery: Proven experience with Terraform, Pulumi, or similar tools.
• Security Mindset: Experience securing applications in Kubernetes and familiarity with container security scanning.
Nice to Have Skills & Experience
• Prior experience supporting ML/AI teams or managing GPU-accelerated workloads.
• Experience with MLOps tools (e.g., Kubeflow, MLflow, or Weights & Biases).
• Familiarity with Vector Databases or high-scale data processing engines.
• Background in automating complex simulation environments or sandboxes.
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.