Job Description
A customer of Insight Global is looking to hire a HPC Systems & Kubernetes Engineer to come onboard!
In this role, you will support a highly specialized high-performance computing (HPC) environment by executing extended integration testing, workload validation, and ensuring operational readiness after deployment. You’ll validate end-to-end behavior across compute, network fabric, and storage while running burn-in tests and performance benchmarks to confirm stability under sustained load. The work environment is hands-on and production-adjacent, requiring strong troubleshooting across multiple layers (systems, networking, and storage) and tight collaboration with cross-functional technical teams. Strong verbal and written communication skills are critical, as you’ll document findings, communicate risks, and drive resolution with stakeholders.
OVERVIEW
This position focuses on validating and stabilizing an HPC cluster in a real-world environment through advanced integration testing, workload simulations, and performance verification. You’ll help ensure the platform meets sustained throughput and reliability targets, while identifying and resolving issues across compute, networking, and storage. The role requires deep Kubernetes expertise (non-negotiable), strong Linux fundamentals, and experience with monitoring/automation to improve long-term operational confidence. Great verbal and written communication skills are required to clearly document results, explain technical root causes, and align on next steps.
DAY-TO-DAY RESPONSIBILITIES
-Execute post-deployment integration tests beyond baseline validation to confirm production readiness.
-Validate end-to-end system behavior across compute ↔ network ↔ storage, ensuring stable performance under real workloads.
-Run extended burn-in testing using synthetic workloads, emulators, and simulators (including workload proxies).
-Perform sustained performance testing (storage throughput, latency, stability) and verify cluster behavior under prolonged load.
-Execute advanced benchmarking and I/O test suites (e.g., Elbencho and application-level I/O validations) and interpret results.
-Diagnose issues across layers (CPU/memory, NICs, network fabric, storage) and perform root-cause analysis with clear documentation.
-Monitor network fabric health (ports, switches, NICs) to detect packet drops, marginal ports, and flapping links; improve observability.
-Partner with internal/external technical teams to coordinate remediation steps and ensure smooth transition from deployment to operations.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
-Kubernetes (5+ years): Deep hands-on experience operating, troubleshooting, and optimizing Kubernetes in production; must be able to support Kubernetes-driven HPC workloads.
-Containers (4+ years): Strong Docker/containerization background including image management, runtime troubleshooting, and orchestration fundamentals.
-Linux (5+ years): Advanced Linux administration and performance troubleshooting (process, memory, storage I/O, networking).
-HPC / Cluster Management (3+ years): Experience validating and supporting HPC clusters, workload scheduling/orchestration patterns, and cluster stability under load.
-High-speed Networking (3+ years): Experience with network fabric health/stability (packet drops, link flaps, marginal ports), NIC troubleshooting, and subnetting/configuration in performance-sensitive environments (e.g., 100G/200G/400G).
-Observability + Automation (3+ years): Grafana monitoring (dashboards/alerting) and Ansible automation for repeatable configuration/operations.
-Data + Cloud DB Tooling (2+ years): Kafka experience (deploy/operate/monitor) and familiarity with AWS database services (configuration, connectivity, performance considerations).
Nice to Have Skills & Experience
- Kubernetes depth is a hard blocker for this role.
- Emphasize candidates with HPC cluster validation, extended burn-in testing, and high-speed network fabric troubleshooting.
- Look for strong documentation habits and ability to communicate clearly across teams.
Compensation: $50/hr to $60/hr
**Exact compensation may vary based on several factors, including skills, experience, and education.
Employees in this role will enjoy a comprehensive benefits package starting on day one of employment, including options for medical, dental, and vision insurance. Eligibility to enroll in the 401(k) retirement plan begins after 90 days of employment. Additionally, employees in this role will have access to paid sick leave and other paid time off benefits as required under the applicable law of the worksite location.
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.