HPC Reliability Engineer AI Jobs in San Jose, CA

37 jobs (page 1)

HPC Reliability Engineer…

xAI (Palo Alto, CA)

…to join its SuperComputing team. In this role, you will ensure the reliability and performance of HPC infrastructure while collaborating with cross-functional ... A cutting-edge AI company in Palo Alto is seeking a...Bachelor's degree and 3+ years of experience in site reliability engineering or systems engineering, alongside strong problem-solving and… more

job goal (01/14/26)
Staff/Principal Software Engineer, Control

PSI Quantum (Milpitas, CA)

…a system software context. Architecture and implementation of system software for HPC, robotics, AI, quantum computing, semiconductor fabrication, or control ... Experience with Nix. Experience with highly scalable distributed systems in a reliability and up-time critical environment. Familiarity with quantum control. PhD in… more

job goal (01/14/26)
Tech Lead, AI Compute Infrastructure

HeyGen (Palo Alto, CA)

…scaling of our distributed systems. We are looking for a highly motivated engineer with deep experience operating and optimizing AI infrastructure at scale. ... practical experience. 5+ years of full‑time industry experience in large‑scale MLOps, AI infrastructure, or HPC systems. Experience with data frameworks… more

job goal (01/14/26)
Machine Learning Engineer

Protogon Holdings, Inc. (Redwood City, CA)

…reduce detection and response time, minimize false positives, and improve system reliability. Who You Are: Experienced Machine Learning Engineer: Strong ... Protogon Research builds AI models with a deep understanding of the...Strong programming skills in Python Knowledge of high-performance computing (HPC) and distributed computing frameworks Experience with cloud platforms… more

job goal (01/14/26)
AI and ML HPC Cluster…

NVIDIA (Santa Clara, CA)

…that power some of the world's most advanced computing workloads. NVIDIA is looking for an AI/ML HPC Cluster Engineer to join our MARS team. You will provide ... be doing: + Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient… more

NVIDIA (01/10/26)
Site Reliability Engineer…

NVIDIA (Santa Clara, CA)

…foundational improvements and automation to improve engineer's productivity. As a Site Reliability Engineer, you are responsible for the big picture of how ... fueled by great technology-and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU… more

NVIDIA (01/13/26)
Senior GPU and HPC Infrastructure…

NVIDIA (Santa Clara, CA)

NVIDIA is hiring engineers to scale up its AI Infrastructure. We expect you to have a strong programming background, knowledge of datacenter hardware, operations, ... and planning abilities. Experience working with High Performance Computing (HPC), GPUs, and high-performance networking (RDMA, Infiniband, RoCE) are strongly… more

NVIDIA (01/08/26)
Staff Quality and Reliability…

Google (Sunnyvale, CA)

…architecture and its integration within AI/ML-driven systems. As a Quality and Reliability Engineer for Google Cloud, you will lead the development of ... Staff Quality and Reliability Engineer, Google Cloud Google...Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability… more

Google (12/30/25)
Principal Network Engineer - DC…

NVIDIA (Santa Clara, CA)

…a passionate engineer who will solve networking problems for scalable AI clusters. This is a hands-on network engineering position focused on the architecture, ... and deployment of global-scale DCs inter-connects and fabric for HPC, AI, and GPU computing clusters. +...reliability. + Partner with system, OS, GPU, and HPC teams to deliver scalable, highly available networks for… more

NVIDIA (01/10/26)
Consulting Member of Technical Staff - AI…

Oracle (Santa Clara, CA)

…and debug software programs for databases, applications, tools, networks etc. As an AI/ML Infrastructure Engineer on the GPU Strategic Customers Engineering team, ... or Scala + Proven experience designing, implementing, and managing infrastructure for AI/ML or HPC workloads. + Understanding machine learning frameworks and… more

Oracle (12/05/25)
Senior AI/ML Infrastructure…

General Motors (Sunnyvale, CA)

**Job Description** **About the Team:** The **AI Validation Platform** team owns the cloud-agnostic, reliable, and cost-efficient platform that powers GM's AV ... the Role:** We are seeking a Senior ML Infrastructure engineer to help build and scale robust Compute platforms...of cutting-edge GPUs, while also leveling up the platform's reliability. The successful candidate will have experience building and… more

General Motors (01/07/26)
Sr. System Development Engineer…

Amazon (Cupertino, CA)

…design, deliver, and operate next-generation infrastructure that powers breakthrough innovation in AI/ML and HPC workloads. If you're passionate about pushing ... Do you want to shape the future of Generative AI at AWS? Join the team building the foundation...problems. You will decompose big difficult server system testability, reliability and diagnosis problems into straightforward tasks, components or… more

Amazon (10/25/25)
Sr Hardware Development Engineer, High…

Amazon (Cupertino, CA)

…design, deliver, and operate next-generation infrastructure that powers breakthrough innovation in AI/ML and HPC workloads. If you're passionate about pushing ... Do you want to shape the future of Generative AI at AWS? Join the team building the foundation...product development disciplines such as, thermal, mechanical, power, FW/SW, reliability, and sustaining - Experience deploying and operating hardware… more

Amazon (11/05/25)
Software Development Engineer, Annapurna…

Amazon (Cupertino, CA)

Description We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations - the fundamental ... operations that enable AI to scale across multiple accelerators & servers. Most...systems is valued, and experience with high-speed networking or HPC interconnects is valued highly. If you like solving… more

Amazon (12/18/25)
Research Scientist, AI Networking (PhD)

Meta (Menlo Park, CA)

…are seeking for engineers to work on the space of GenAI/LLM scaling reliability and performance. **Required Skills:** Research Scientist, AI Networking (PhD) ... this role, you will be a member of the AI Networking Software team and part of the bigger...Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and… more

Meta (12/20/25)
Data Center Materials Engineer

NVIDIA (Santa Clara, CA)

…AI computing. You will define the materials backbone of NVIDIA's GW-scale AI Factories, enabling unmatched reliability, cleanliness, and longevity for the ... era of computing is built on the strength and reliability of NVIDIA's data centers! As we scale to...to stand out from the crowd: + Experience with AI/HPC and data center cooling systems (CRAHs,… more

NVIDIA (01/10/26)
Software Engineer, SystemML - Scaling…

Meta (Menlo Park, CA)

…we are seeking for engineers to work on the space of GenAI/LLM scaling reliability and performance. **Required Skills:** Software Engineer, SystemML - Scaling / ... this role, you will be a member of the Network. AI Software team and part of the bigger DC...Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and… more

Meta (12/20/25)
Senior Software SDET Test Development…

NVIDIA (Santa Clara, CA)

…GPU Computing. We are passionate about markets include gaming, automotive, vision, HPC, datacenters and networking in addition to our traditional OEM business. ... NVIDIA is also well positioned as the 'AI Computing Company', and NVIDIA GPUs are the brains...candidate must have enterprise server integration, strong Linux experience, reliability testing with various telemetries, scale out cluster, test… more

NVIDIA (01/10/26)
Signal and Power Integrity Engineer, PhD,…

Google (Sunnyvale, CA)

…designs, with a specific focus on TPU architecture and its integration within AI/ML-driven systems. As a Signal Integrity/Power Integrity Engineer, you will lead ... We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and… more

Google (12/16/25)
IC Package Mechanical FEA Engineer (R&D)

Broadcom (San Jose, CA)

…team developing high-performance package designs for ASICs for artificial intelligence (AI), networking, high-performance computing (HPC), and 5G base stations. ... **Job Description:** Broadcom is seeking an experienced package mechanical FEA engineer for very-large and complex packages for industry-leading ASICs. You will… more

Broadcom (12/02/25)
