- NVIDIA (Santa Clara, CA)
- …at NVIDIA, you will lead the development of DGX Cloud strategy for GPU fleet lifecycle, health, observability and utilization monitoring, and remediation. You ... define and drive the technical implementation for DGX Cloud operations practice for GPU fleet lifecycle. + Collaborate on Cross Domain Disciplines: drive the… more
- Google (Sunnyvale, CA)
- … GPU Performance team is responsible for optimizing, modeling and evaluating GPU systems for comparative analysis and benchmarking for Google's internal ... GPU Performance Engineer...We strive for extracting maximum efficiency in Google's growing GPU fleet. The team identifies performance opportunities… more
- NVIDIA (Santa Clara, CA)
- …+ Understanding of performance, security and reliability in complex distributed systems. Familiarity with system level architecture, data synchronization, fault ... science of computer graphics. With the invention of the GPU - the engine of modern visual computing -...Cluster Manager. + Hands-on experience developing and/or operating hardware fleet management systems. Proven operational excellence in… more
- NVIDIA (Seattle, WA)
- … Engineer to join our DGX Cloud team and build the foundational systems that drive NVIDIA's high-performance GPU infrastructure. You will play a technical ... lead role in designing scalable cloud services that integrate with diverse systems including GPU telemetry in datacenters, and enabling operational automation… more
- NVIDIA (Seattle, WA)
- … Engineer to join our DGX Cloud team and build the foundational systems that drive NVIDIA's high-performance GPU infrastructure. You will play a critical ... role in designing scalable cloud services that integrate with diverse systems including GPU telemetry in datacenters, and enabling operational automation across… more
- Meta (Menlo Park, CA)
- …aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable and ... of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing,… more
- LinkedIn (Mountain View, CA)
- …algorithms, AI frameworks, data infra, compute software, and hardware to harness the power of our GPU fleet with thousands of latest GPU cards. The team also ... billions of user queries. Model Training Infrastructure: As an engineer on the AI Training Infra team, you will...compute efficient infra on top of native cloud, enable GPU based inference for a large variety of use… more
- LinkedIn (Mountain View, CA)
- …algorithms, AI frameworks, data infra, compute software, and hardware to harness the power of our GPU fleet with thousands of latest GPU cards. The team also ... billions of user queries. Model Training Infrastructure: As an engineer on the AI Training Infra team, you will...technical discipline + Experience building ML applications, LLM serving, GPU serving. + Experience with search systems… more
- LinkedIn (Mountain View, CA)
- …algorithms, AI frameworks, data infra, compute software, and hardware to harness the power of our GPU fleet with thousands of latest GPU cards. The team also ... billions of user queries. Model Training Infrastructure: As an engineer on the AI Training Infra team, you will...compute efficient infra on top of native cloud, enable GPU based inference for a large variety of use… more
- Microsoft Corporation (Mountain View, CA)
- …and operate at the intersection of AI algorithmic innovation, purpose-built AI hardware, systems, and software. We are a team of highly capable and motivated people ... Windows, Bing, SQL Server, and Dynamics. As a Principal Engineer on the team, you will have the opportunity...all levels of abstraction including kernel, model, algorithm and system level, monitor performance and drive efficiencies that contribute… more
- Microsoft Corporation (Redmond, WA)
- …is a distinctive major feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, ... AI Infra team is looking for a Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI. The scheduler...inventory and coordinating handoff of jobs for scheduling. Our system manages significant amount of GPU capacity… more
- Microsoft Corporation (Redmond, WA)
- …is a distinctive major feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, ... the AI Infra team is looking for a Software Engineer II - AI Infrastructure (Scheduler) - CoreAI, with...inventory and coordinating handoff of jobs for scheduling. Our system manages significant amount of GPU capacity… more
- Oracle (Oklahoma City, OK)
- …You should be both a rock-solid lead developer, curious problem solver, a distributed systems generalist and/or skilled Linux engineer with Systems triage ... deep into any part of the stack and low-level systems to design broad distributed system interactions....configure, secure, and validate server platforms across OCI's massive fleet of Compute and GPU Infrastructure. You… more
- Microsoft Corporation (Redmond, WA)
- …and operate at the intersection of AI algorithmic innovation, purpose-built AI hardware, systems, and software. We are a team of highly capable and motivated people ... Bing, SQL Server, and Dynamics. As a Senior Software Engineer on the team, you will have the opportunity...the art LLMs + Measure, benchmark performance on Nvidia/AMD GPUs and first party Microsoft silicon + Optimize and… more
- NVIDIA (TX)
- …on-call rotation. + Consult with and provide consultation for peer teams on systems design best practices. + Participate in a supportive culture of values-driven ... experience. + 5+ years of relevant experience in infrastructure and fleet management engineering. + Experience with infrastructure automation and distributed … more
- NVIDIA (Santa Clara, CA)
- NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More ... recently, GPU deep learning ignited modern deep learning - the...What you will be doing: + Drive next generation fleet management solutions for scaling AI infrastructure using GPUs… more
- DoorDash (San Francisco, CA)
- …Logistics, Fraud, and Search. About the Role We're looking for a Staff Software Engineer with deep expertise in ML model serving to drive the next generation of ... This is a highly technical, hands-on role: you'll design and build systems that power real-time predictions across millions of requests per second, tackling… more
- Oracle (Juneau, AK)
- …Support the high-level thermal design direction and data center strategy for complex systems ranging from advanced computing (HPC, GPU, FPGA Accelerators, etc.) ... **Job Description** As a Senior Principal Thermal Engineer, you will focus on the alignment of...fluid cooling assembly and fit-up engineering verification of various GPU and switch configurations in respective racks + Experience… more
- LinkedIn (Mountain View, CA)
- …across algorithms, AI frameworks, infrastructure software, and hardware to harness the power of our GPU fleet with thousands of latest GPU cards. The team ... work for Training infrastructure. As a Principal Staff Software Engineer on the AI Training Infra team, you will...models. + Improving the observability and understandability of various systems with a focus on improving developer productivity and… more
- NVIDIA (Santa Clara, CA)
We are seeking a Senior AI/ML Performance and Efficiency Engineer, GPU Clusters at NVIDIA to join our AI Efficiency efforts. As an Engineer, you will have a ... and application deficiencies, facilitating groundbreaking AI and ML research on GPU Clusters. Together, we can craft potent, effective, and scalable solutions… more