- xAI (Palo Alto, CA)
- …to join its SuperComputing team. In this role, you will ensure the reliability and performance of HPC infrastructure while collaborating with cross-functional ... A cutting-edge AI company in Palo Alto is seeking a...Bachelor's degree and 3+ years of experience in site reliability engineering or systems engineering, alongside strong problem-solving and…
- PSI Quantum (Milpitas, CA)
- …a system software context. Architecture and implementation of system software for HPC, robotics, AI, quantum computing, semiconductor fabrication, or control ... Experience with Nix. Experience with highly scalable distributed systems in a reliability and up-time critical environment. Familiarity with quantum control. PhD in…
- HeyGen (Palo Alto, CA)
- …scaling of our distributed systems. We are looking for a highly motivated engineer with deep experience operating and optimizing AI infrastructure at scale. ... practical experience. 5+ years of full‑time industry experience in large‑scale MLOps, AI infrastructure, or HPC systems. Experience with data frameworks…
- Protogon Holdings, Inc. (Redwood City, CA)
- …reduce detection and response time, minimize false positives, and improve system reliability. Who You Are: Experienced Machine Learning Engineer: Strong ... Protogon Research builds AI models with a deep understanding of the...Strong programming skills in Python. Knowledge of high-performance computing (HPC) and distributed computing frameworks. Experience with cloud platforms…
- NVIDIA (Santa Clara, CA)
- …that power some of the world's most advanced computing workloads. NVIDIA is looking for an AI/ML HPC Cluster Engineer to join our MARS team. You will provide ... be doing: + Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient…
- NVIDIA (Santa Clara, CA)
- …foundational improvements and automation to improve engineer's productivity. As a Site Reliability Engineer, you are responsible for the big picture of how ... fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU…
- NVIDIA (Santa Clara, CA)
- NVIDIA is hiring engineers to scale up its AI Infrastructure. We expect you to have a strong programming background, knowledge of datacenter hardware, operations, ... and planning abilities. Experience working with High Performance Computing (HPC), GPUs, and high-performance networking (RDMA, InfiniBand, RoCE) are strongly…
- Google (Sunnyvale, CA)
- …architecture and its integration within AI/ML-driven systems. As a Quality and Reliability Engineer for Google Cloud, you will lead the development of ... Staff Quality and Reliability Engineer, Google Cloud Google...Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability…
- NVIDIA (Santa Clara, CA)
- …a passionate engineer who will solve networking problems for scalable AI clusters. This is a hands-on network engineering position focused on the architecture, ... and deployment of global-scale DCs inter-connects and fabric for HPC, AI, and GPU computing clusters. +...reliability. + Partner with system, OS, GPU, and HPC teams to deliver scalable, highly available networks for…
- Oracle (Santa Clara, CA)
- …and debug software programs for databases, applications, tools, networks etc. As an AI/ML Infrastructure Engineer on the GPU Strategic Customers Engineering team, ... or Scala + Proven experience designing, implementing, and managing infrastructure for AI/ML or HPC workloads. + Understanding machine learning frameworks and…
- General Motors (Sunnyvale, CA)
- **Job Description** **About the Team:** The **AI Validation Platform** team owns the cloud-agnostic, reliable, and cost-efficient platform that powers GM's AV ... the Role:** We are seeking a Senior ML Infrastructure engineer to help build and scale robust Compute platforms...of cutting-edge GPUs, while also leveling up the platform's reliability. The successful candidate will have experience building and…
- Amazon (Cupertino, CA)
- …design, deliver, and operate next-generation infrastructure that powers breakthrough innovation in AI/ML and HPC workloads. If you're passionate about pushing ... Do you want to shape the future of Generative AI at AWS? Join the team building the foundation...problems. You will decompose big difficult server system testability, reliability and diagnosis problems into straightforward tasks, components or…
- Amazon (Cupertino, CA)
- …design, deliver, and operate next-generation infrastructure that powers breakthrough innovation in AI/ML and HPC workloads. If you're passionate about pushing ... Do you want to shape the future of Generative AI at AWS? Join the team building the foundation...product development disciplines such as, thermal, mechanical, power, FW/SW, reliability, and sustaining - Experience deploying and operating hardware…
- Amazon (Cupertino, CA)
- Description We are seeking an experienced engineer to work on distributed AI/ML systems. This role involves working on collective operations - the fundamental ... operations that enable AI to scale across multiple accelerators & servers. Most...systems is valued, and experience with high-speed networking or HPC interconnects is valued highly. If you like solving…
- Meta (Menlo Park, CA)
- …are seeking engineers to work in the space of GenAI/LLM scaling reliability and performance. **Required Skills:** Research Scientist, AI Networking (PhD) ... this role, you will be a member of the AI Networking Software team and part of the bigger...Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and…
- NVIDIA (Santa Clara, CA)
- …AI computing. You will define the materials backbone of NVIDIA's GW-scale AI Factories, enabling unmatched reliability, cleanliness, and longevity for the ... era of computing is built on the strength and reliability of NVIDIA's data centers! As we scale to...to stand out from the crowd: + Experience with AI/HPC and data center cooling systems (CRAHs,…
- Meta (Menlo Park, CA)
- …we are seeking engineers to work in the space of GenAI/LLM scaling reliability and performance. **Required Skills:** Software Engineer, SystemML - Scaling / ... this role, you will be a member of the Network.AI Software team and part of the bigger DC...Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and…
- NVIDIA (Santa Clara, CA)
- …GPU Computing. We are passionate about markets include gaming, automotive, vision, HPC, datacenters and networking in addition to our traditional OEM business. ... NVIDIA is also well positioned as the 'AI Computing Company', and NVIDIA GPUs are the brains...candidate must have enterprise server integration, strong Linux experience, reliability testing with various telemetries, scale out cluster, test…
- Google (Sunnyvale, CA)
- …designs, with a specific focus on TPU architecture and its integration within AI/ML-driven systems. As a Signal Integrity/Power Integrity Engineer, you will lead ... We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and…
- Broadcom (San Jose, CA)
- …team developing high-performance package designs for ASICs for artificial intelligence (AI), networking, high-performance computing (HPC), and 5G base stations. ... **Job Description:** Broadcom is seeking an experienced package mechanical FEA engineer for very-large and complex packages for industry-leading ASICs. You will…