As a Senior AI Engineer, you'll be a core member of a pioneering team focused on developing foundational AI infrastructure tailored for power grid applications. You'll play a key role in architecting and refining the systems, data pipelines, and training workflows that support advanced AI research. Collaborating closely with research scientists, you'll help transform innovative ideas into scalable, high-performance implementations that accelerate the deployment of impactful AI solutions. This position demands hands-on expertise in distributed training, data engineering, and MLOps, along with a strong track record of building resilient and scalable AI systems. The ideal candidate is proactive, resourceful, and committed to delivering high-quality solutions with speed and precision.
Job Responsibilities
Architect and optimize large-scale training and fine-tuning workflows, from data ingestion to inference, with a focus on maximizing Model FLOPs Utilization (MFU) across compute clusters.
Collaborate closely with research teams to convert experimental models and algorithms into efficient, production-ready code.
Identify and resolve performance bottlenecks throughout the training stack, continuously improving speed and scalability.
Evaluate and select hardware, software, and cloud technologies to support the AI infrastructure platform.
Implement MLOps best practices using tools like MLflow and Weights & Biases to ensure reproducibility, reliability, and continuous model improvement.
Maintain comprehensive documentation of infrastructure and training processes, and stay informed on emerging strategies to enhance workflows and system performance.
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to Human Resources Request Form. The EEOC "Know Your Rights" Poster is available here.
To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/
Master's degree or higher in Computer Science, Engineering, or a related technical discipline.
Minimum of 5 years of experience as a Data & AI Engineer or Machine Learning Engineer, with a focus on infrastructure for large-scale machine learning systems. Candidates with more or less experience may be considered for different levels.
Strong hands-on experience with AI frameworks such as PyTorch, JAX, or PyTorch Lightning, and expertise in multi-node GPU training and optimization for large foundation models.
Proven ability to troubleshoot, debug, and optimize performance using data-driven approaches.
Solid communication and collaboration skills, with experience implementing MLOps practices for model tracking, evaluation, and deployment.
Active GitHub profile showcasing open-source contributions to data engineering or deep learning infrastructure projects.
Experience developing custom CUDA, Triton, or CUTLASS kernels.
Familiarity with performance monitoring and profiling tools for distributed training and data pipelines.
Benefit packages for this role will start on the 31st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401(k) retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.