Cluster Deployment Operations Engineer Jobs in Sunnyvale, CA

16 jobs (page 1)

Cluster Deployment Operations…

NVIDIA (Santa Clara, CA)

…the first people to make them operational in production? We are seeking a dedicated Cluster Deployment Operations Engineer to support product deployments ... team, acting as the link between engineering and the NVIS field team for cluster deployment and management solutions! We bridge the gap between product roadmaps… more

NVIDIA (12/18/25)
AI and ML HPC Cluster Engineer

NVIDIA (Santa Clara, CA)

…the world's most advanced computing workloads. NVIDIA is looking for an AI/ML HPC Cluster Engineer to join our MARS team. You will provide technical engagement ... and problem solving on the management of large-scale HPC systems including the deployment of compute, networking, and storage. You will be working with a team of… more

NVIDIA (01/10/26)
Principal, Software Engineer - Cloud…

Walmart (Sunnyvale, CA)

…remediation. **Automation & Observability** + Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and ... **Position Summary** We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage… more

Walmart (11/20/25)
AI Senior Staff Systems Engineer

Cadence Design Systems, Inc. (San Jose, CA)

…world of technology. We are seeking a highly skilled and experienced AI Systems Engineer to join our team. This is a hands-on, senior individual contributor role ... that will be pivotal in leading the development, operations, and support of our entire AI infrastructure. You...services on both GCP and Azure. + Hands-on GPU Cluster Management: Take a leadership role in the configuration,… more

Cadence Design Systems, Inc. (12/29/25)
Senior GPU and HPC Infrastructure Engineer…

NVIDIA (Santa Clara, CA)

…operations, and networking, familiarity with software testing and deployment, familiarity with distributed systems, and excellent communication and planning ... management systems (Kubernetes, SLURM). Hands-on experience in Machine Learning Operations. Hands-on experience with Bright Cluster Manager. + Hands-on… more

NVIDIA (01/08/26)
Senior MLOps Engineer, GenAI Framework

NVIDIA (Santa Clara, CA)

…Artifactory, Jira) in hybrid on-premise and cloud environments. + Assist with cluster operations and system administration (managing: servers, team accounts, ... dedicated and motivated senior build and continuous integration (CI/CD) engineer for its GenAI Frameworks (Megatron-LM (https://github.com/NVIDIA/Megatron-LM) and NeMo… more

NVIDIA (01/14/26)
Senior Software Development Engineer…

NVIDIA (Santa Clara, CA)

We are looking for a Senior Software Development Engineer in Test (SDET) to join our New GPU Integration (NPI) team for NVIDIA's Enterprise Compute SWQA team. Are you ... to have your skills on the team! As an engineer on this New Platform GPU Integration team, you...tools to significantly enhance our testing capabilities and streamlining operations for more efficient and accurate results. + Improve… more

NVIDIA (01/10/26)
Staff Software Engineer, AI/ML…

Google (Sunnyvale, CA)

Staff Software Engineer, AI/ML Infrastructure - Google - Kirkland, WA, USA; Sunnyvale, CA, USA **Advanced** Experience owning outcomes and ... experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning). **Preferred...on and is growing every day. As a software engineer, you will work on a specific project critical… more

Google (01/10/26)
Member of Technical Staff - Software…

Microsoft Corporation (Mountain View, CA)

…metrics visualization to support operational efficiency. + Manage GPU cluster operations (scheduling, isolation, utilization), high-performance computing (HPC), ... build the infrastructure that powers training, evaluation, and data platforms for reliable deployment of world-class foundational AI models. We are on a mission to… more

Microsoft Corporation (01/10/26)
Staff Software Engineer, AI Infrastructure

Google (Sunnyvale, CA)

Staff Software Engineer, AI Infrastructure - Google - Kirkland, WA, USA; Sunnyvale, CA, USA **Advanced** Experience owning outcomes and decision ... on and is growing every day. As a software engineer, you will work on a specific project critical...clusters using the latest technologies for AI acceleration and cluster interconnects and networking. We're building the AI infrastructure… more

Google (01/10/26)
Senior Firmware Engineer - CSP Engagements

NVIDIA (Santa Clara, CA)

…experience. Ways to stand out from the crowd: + Knowledge of cloud and cluster level deployment and management systems. + Experience with GPU computing (CUDA), ... NVIDIA is seeking a Senior Firmware Engineer to join our CSP Engagements team, focusing...hardware and software, driving technical solutions from concept through deployment. What you will be doing: + Design and… more

NVIDIA (12/31/25)
Senior Staff Software Engineer, SRE, ML…

Google (Sunnyvale, CA)

Senior Staff Software Engineer, SRE, ML Fleet Systems - Google - Sunnyvale, CA, USA; Kirkland, WA, USA; +2 more; +1 more **Advanced** Experience ... of resource management systems (e.g., compute infrastructure, Kubernetes, Flex), cluster management, and scheduling algorithms. + Familiarity with Machine Learning… more

Google (01/14/26)
Hybrid (San Jose or Raleigh) - Site Reliability…

Insight Global (San Jose, CA)

Job Description A large software and networking company is looking for a Site Reliability Engineer to join a growing team (ideally Hybrid in San Jose or RTP). This ... supporting Shell scripting - Build tools and automation to reduce manual operations and improve team efficiency - Integrate and maintain secrets management using… more

Insight Global (01/14/26)
Senior Site Reliability Engineer - Storage

NVIDIA (Santa Clara, CA)

…of artificial intelligence. Join our team at NVIDIA as a Senior Site Reliability Engineer focused on HPC storage and play a crucial role in designing, implementing, ... deploying distributed storage solutions, build automation tools, and ensuring the efficient operations of our growing IT ecosystem. You will collaborate closely with… more

NVIDIA (01/14/26)
Technical Lead, NPI Program Management

Meta (Fremont, CA)

…Introduction (NPI) Roadmap success. As the Network is essential to every mega cluster and Meta's data center success, its roadmap is rapidly expanding in volume, ... Engineering, Legal, Finance, Accounting, Compliance) across multiple domains. The NPI Operations Organization supports complex roadmaps to deliver NPI programs on… more

Meta (12/20/25)
Software Engineering Manager, Emerging On-prem AI…

Google (Sunnyvale, CA)

…**About the job** Like Google's own ambitions, the work of a Software Engineer goes beyond just Search. Software Engineering Managers have not only the technical ... multiple teams and locations, a large product budget and oversee the deployment of large-scale projects across multiple sites internationally. Google is looking to… more

Google (12/18/25)
