  • Cluster Deployment Operations

    NVIDIA (Santa Clara, CA)
    …the first people to make them operational in production? We are seeking a dedicated Cluster Deployment Operations Engineer to support product deployments ... team, acting as the link between engineering and the NVIS field team for cluster deployment and management solutions! We bridge the gap between product roadmaps… more
    NVIDIA (12/18/25)
  • AI and ML HPC Cluster Engineer

    NVIDIA (Santa Clara, CA)
    …the world's most advanced computing workloads. NVIDIA is looking for an AI/ML HPC Cluster Engineer to join our MARS team. You will provide technical engagement ... and problem solving on the management of large-scale HPC systems including the deployment of compute, networking, and storage. You will be working with a team of… more
    NVIDIA (01/10/26)
  • Principal, Software Engineer - Cloud…

    Walmart (Sunnyvale, CA)
    …remediation. **Automation & Observability** + Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and ... **Position Summary** We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage… more
    Walmart (11/20/25)
  • AI Senior Staff Systems Engineer

    Cadence Design Systems, Inc. (San Jose, CA)
    …world of technology. We are seeking a highly skilled and experienced AI Systems Engineer to join our team. This is a hands-on, senior individual contributor role ... that will be pivotal in leading the development, operations, and support of our entire AI infrastructure. You...services on both GCP and Azure. + Hands-on GPU Cluster Management: Take a leadership role in the configuration,… more
    Cadence Design Systems, Inc. (12/29/25)
  • Senior GPU and HPC Infrastructure Engineer

    NVIDIA (Santa Clara, CA)
    operations, and networking, familiarity with software testing and deployment, familiarity with distributed systems, and excellent communication and planning ... management systems (Kubernetes, SLURM). Hands-on experience in Machine Learning Operations. Hands-on experience with Bright Cluster Manager. + Hands-on… more
    NVIDIA (01/08/26)
  • Senior MLOps Engineer, GenAI Framework

    NVIDIA (Santa Clara, CA)
    …Artifactory, Jira) in hybrid on-premise and cloud environments. + Assist with cluster operations and system administration (managing: servers, team accounts, ... dedicated and motivated senior build and continuous integration (CI/CD) engineer for its GenAI Frameworks (Megatron-LM (https://github.com/NVIDIA/Megatron-LM) and NeMo… more
    NVIDIA (01/14/26)
  • Senior Software Development Engineer

    NVIDIA (Santa Clara, CA)
    We are looking for a Senior Software Development Engineer in Test (SDET) to join our New GPU Integration (NPI) team for NVIDIA's Enterprise Compute SWQA team. Are you ... to have your skills on the team! As an engineer on this New Platform GPU Integration team, you...tools to significantly enhance our testing capabilities and streamlining operations for more efficient and accurate results. + Improve… more
    NVIDIA (01/10/26)
  • Staff Software Engineer, AI/ML…

    Google (Sunnyvale, CA)
    Staff Software Engineer, AI/ML Infrastructure (Google; Kirkland, WA, USA; Sunnyvale, CA, USA) **Advanced** Experience owning outcomes and ... experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning). **Preferred...on and is growing every day. As a software engineer, you will work on a specific project critical… more
    Google (01/10/26)
  • Member of Technical Staff - Software…

    Microsoft Corporation (Mountain View, CA)
    …metrics visualization to support operational efficiency. + Manage GPU cluster operations (scheduling, isolation, utilization), high-performance computing (HPC), ... build the infrastructure that powers training, evaluation, and data platforms for reliable deployment of world-class foundational AI models. We are on a mission to… more
    Microsoft Corporation (01/10/26)
  • Staff Software Engineer, AI Infrastructure

    Google (Sunnyvale, CA)
    Staff Software Engineer, AI Infrastructure (Google; Kirkland, WA, USA; Sunnyvale, CA, USA) **Advanced** Experience owning outcomes and decision ... on and is growing every day. As a software engineer, you will work on a specific project critical...clusters using the latest technologies for AI acceleration and cluster interconnects and networking. We're building the AI infrastructure… more
    Google (01/10/26)
  • Senior Firmware Engineer - CSP Engagements

    NVIDIA (Santa Clara, CA)
    …experience. Ways to stand out from the crowd: + Knowledge of cloud and cluster level deployment and management systems. + Experience with GPU computing (CUDA), ... NVIDIA is seeking a Senior Firmware Engineer to join our CSP Engagements team, focusing...hardware and software, driving technical solutions from concept through deployment. What you will be doing: + Design and… more
    NVIDIA (12/31/25)
  • Senior Staff Software Engineer, SRE, ML…

    Google (Sunnyvale, CA)
    Senior Staff Software Engineer, SRE, ML Fleet Systems (Google; Sunnyvale, CA, USA; Kirkland, WA, USA; +2 more; +1 more) **Advanced** Experience ... of resource management systems (e.g., compute infrastructure, Kubernetes, Flex), cluster management, and scheduling algorithms. + Familiarity with Machine Learning… more
    Google (01/14/26)
  • Hybrid (San Jose or Raleigh) - Site Reliability…

    Insight Global (San Jose, CA)
    Job Description A large software and networking company is looking for a Site Reliability Engineer to join a growing team (ideally Hybrid in San Jose or RTP). This ... supporting Shell scripting - Build tools and automation to reduce manual operations and improve team efficiency - Integrate and maintain secrets management using… more
    Insight Global (01/14/26)
  • Senior Site Reliability Engineer - Storage

    NVIDIA (Santa Clara, CA)
    …of artificial intelligence. Join our team at NVIDIA as a Senior Site Reliability Engineer focused on HPC storage and play a crucial role in designing, implementing, ... deploying distributed storage solutions, build automation tools, and ensuring the efficient operations of our growing IT ecosystem. You will collaborate closely with… more
    NVIDIA (01/14/26)
  • Technical Lead, NPI Program Management

    Meta (Fremont, CA)
    …Introduction (NPI) Roadmap success. As the Network is essential to every mega cluster and Meta's data center success, its roadmap is rapidly expanding in volume, ... Engineering, Legal, Finance, Accounting, Compliance) across multiple domains. The NPI Operations Organization supports complex roadmaps to deliver NPI programs on… more
    Meta (12/20/25)
  • Software Engineering Manager, Emerging On-prem AI…

    Google (Sunnyvale, CA)
    …**About the job** Like Google's own ambitions, the work of a Software Engineer goes beyond just Search. Software Engineering Managers have not only the technical ... multiple teams and locations, a large product budget and oversee the deployment of large-scale projects across multiple sites internationally. Google is looking to… more
    Google (12/18/25)