Principal Software Engineer, Distributed Systems Engineer - DGX Cloud
$248,000 – $396,750 USD
Responsibilities
- Be part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads.
- Work on custom software related to scheduling GPU resources on Kubernetes.
- Implement monitoring and health management capabilities to ensure industry-leading reliability, availability, and scalability of GPU assets.
- Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluate system failures and improve services based on a well-defined incident management process.
Soft Skills
Communication Skills, Cross-Functional Collaboration, Operational Excellence
Benefits
- Equity
Culture
Fast-Paced, Continuous Improvement, Cross-Functional Teams, Diverse Leadership, Inclusive Hiring
Requirements
Required: BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience
Regions: US
About NVIDIA
Industry: AI
Size: Enterprise
NVIDIA is a technology company focused on AI systems, building products like the NeMo Platform for developing, evaluating, deploying, and operating AI systems at scale.
Compensation
Base salary: $248,000 – $396,750 USD
Equity: Eligible for equity