- Amazon (Cupertino, CA)
- …- Preferred previous software engineer expertise with PyTorch/JAX/TensorFlow, Distributed libraries and Frameworks, End-to-end Model Training. The group ... Web Services (AWS) is looking for a Software Development Engineer II to build, deliver, and maintain complex products...Stable Diffusion, Vision Transformers and many more. The ML Distributed Training team works side by side… more
- Amazon (Cupertino, CA)
- …as well as Stable Diffusion, Vision Transformers (ViT) and many more. The ML Distributed Training team works side by side with chip architects, compiler ... accelerators. This role is for a Senior Machine Learning Engineer in the Distributed Training team for...engineers and runtime engineers to create, build and tune distributed training solutions with Trainium instances. Experience… more
- Google (Sunnyvale, CA)
- Senior Software Engineer, Google Distributed Cloud, Kubernetes (Google; Sunnyvale, CA, USA) **Mid** Experience driving progress, solving ... year of experience with software design and architecture for distributed systems. **Preferred qualifications:** + Master's degree or PhD...on and is growing every day. As a software engineer, you will work on a specific project critical… more
- Rubrik (Palo Alto, CA)
- …heart of this transformation is our Atlas platform. We are looking for an experienced distributed systems engineer to guide us through the next stage of the ... the edge, or in the cloud. It is a distributed, scale-out, fault tolerant, performant, deduplicated user-space filesystem that...evolution of our data platform. As an engineer in the team, you'll design, develop and deliver… more
- Google (Sunnyvale, CA)
- Software Engineer III, Infrastructure, Google Distributed Cloud (Google; Sunnyvale, CA, USA) **Mid** Experience driving progress, solving ... + 2 years of experience with developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies,...on and is growing every day. As a software engineer, you will work on a specific project critical… more
- Google (Sunnyvale, CA)
- Site Reliability Engineer, Google Distributed Cloud, Connected SRE (Google; Sunnyvale, CA, USA) **Advanced** Experience owning outcomes and ... projects. + 3 years of experience designing, analyzing, and troubleshooting distributed systems. **Preferred qualifications:** + Master's degree in Computer Science… more
- Google (Sunnyvale, CA)
- Staff Systems Development Engineer, Google Distributed Cloud (Google; New York, NY, USA; Seattle, WA, USA; +2 more; +1 more) **Advanced** ... experience working with vendors or customers. + Experience as a Customer Solution Engineer. + Experience with physical servers, storage, and network devices, as well… more
- Google (Sunnyvale, CA)
- Senior Software Engineer, Google Distributed Cloud Hosted (Google; Sunnyvale, CA, USA) **Mid** Experience driving progress, solving ... and architecture. + 3 years of experience developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage or… more
- Palo Alto Networks (Santa Clara, CA)
- …Summary** At Palo Alto Networks, we are redefining cybersecurity. As a Distinguished Engineer on the Enterprise DLP team, you will be the foremost technical leader ... all network, cloud, and user vectors. **Key Responsibilities** As a Distinguished Engineer, you will own the long-term technical direction and execution for all… more
- Rubrik (Palo Alto, CA)
- …or Networking domain + Strong fundamentals in data structures, algorithms, and distributed systems design + Strong background in Systems Programming + Expertise in ... Proficient in Python, Go, and either C++, Java, or Scala + Large distributed systems design and development experience is preferred + Knowledge of Storage,… more
- LinkedIn (Mountain View, CA)
- …problems. + Designing, implementing, and optimizing the performance of large-scale distributed training for personalized recommendation as well as large ... LLMs, GNNs, Incremental Learning, Online Learning, and advanced LLM Agents work for Training infrastructure. As a Principal Staff Software Engineer on the AI… more
- NVIDIA (Santa Clara, CA)
- …product roadmaps. What you will be doing: + Design and maintain large-scale distributed training systems to support multi-modal foundation models for robotics. + ... NVIDIA is searching for a senior or principal engineer who specializes in building cutting-edge infrastructure for...and AI infrastructure; + Proven experience designing and optimizing distributed training systems with frameworks like PyTorch,… more
- LinkedIn (Mountain View, CA)
- …and resolve issues in popular libraries like Huggingface, Horovod and PyTorch, enable distributed training over 100s of billions of parameter models, debug and ... Online Learning and Serving performance optimizations across billions of user queries. Model Training Infrastructure: As an engineer on the AI Training … more
- LinkedIn (Mountain View, CA)
- …and resolve issues in popular libraries like Huggingface, Horovod and PyTorch, enable distributed training over 100s of billions of parameter models, debug and ... Online Learning and Serving performance optimizations across billions of user queries. Model Training Infrastructure: As an engineer on the AI Training … more
- LinkedIn (Mountain View, CA)
- …fundamentally believe top talent can come from anywhere, regardless of educational training or professional experience. REACH Program REACH is a multi-year program ... set and gain the experience needed to become an Engineer at LinkedIn. The time each apprentice spends in...dramatic growth in membership and products. You will utilize distributed systems and algorithms, develop applications at scale, learn… more
- Google (Sunnyvale, CA)
- Software Engineer, Threat Infrastructure and Detection, AI Security (Google; Sunnyvale, CA, USA) **Mid** Experience driving progress, solving ... + 2 years of experience with developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies,...or more of these programming languages as a back-end engineer: Java, Go. + Experience in one or more… more
- Meta (Menlo Park, CA)
- …NCCL has been integrated into PyTorch and is on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML ... full-stack distributed ML reliability and performance (e.g., Large-Scale GenAI/LLM training) from the trainer down to the inter-GPU and network communication layer.… more
- Google (Sunnyvale, CA)
- …Enhance model and system performance for both low-latency inference and large-scale distributed training workloads. + Develop post-training algorithms, such ... speed and reduce memory consumption on modern GPU and TPU architectures. + Engineer custom kernels to maximize training efficiency for memory-bound large models… more
- General Motors (Sunnyvale, CA)
- …model training performance analysis and optimization solutions to scale distributed training workflows and maximize resource utilization across heterogeneous ... experience + 3+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models + Strong programming… more