- NVIDIA (Santa Clara, CA)
- Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and ... techniques and Infrastructure as Code (IaC). + Deep understanding of Linux operating systems and TCP/IP fundamentals. + Expertise with at least one major cloud…
- Celonis (Redwood City, CA)
- …engineering and Site Reliability Engineering (SRE) principles to drive system reliability, scalability, and operational excellence across the organization. ... Engineering with modern Software Engineering practices to build resilient and scalable systems. + Lead reliability efforts for a fleet of 80+ FedRAMP-compliant…
- NVIDIA (Santa Clara, CA)
- …efficient and reliable systems is an imperative. We are looking for a System Reliability Engineer to join NVIDIA's existing Reliability Engineering ... Center Servers. What you'll be doing: + Provide expertise in Hardware Reliability Engineering for Electronics/Server Systems (graphics cards, server, rack,…
- Amazon (Cupertino, CA)
- …designs cutting AI platforms for the world's largest Cloud Services provider. As a Senior Reliability Engineer you will engage with an experienced ... * You will have a fundamental understanding of Reliability statistics/Reliability tests and/or solid understanding of computer systems to influence…
- NVIDIA (Santa Clara, CA)
- …drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of ... of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on…
- Palo Alto Networks (Santa Clara, CA)
- …insights into our systems' performance and health. **Your Impact** As a Senior SRE with the Cortex Cloud Security Posture Management team, you will: + Cloud ... including the design, implementation, and continuous enhancement of our comprehensive observability systems. To meet the opportunities that such a role provides, you…
- NVIDIA (Santa Clara, CA)
- …foundational improvements and automation to improve engineer's productivity. As a Site Reliability Engineer, you are responsible for the big picture of how ... our systems relate to each other, we use a breadth...comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency. + Develop, define…
- NVIDIA (Santa Clara, CA)
- …once they are live by measuring and monitoring availability, latency and overall system health. + Scale systems sustainably through mechanisms like automation, ... time enabling developers to make changes to the existing system through careful preparation and planning while keeping an... systems by pushing for changes that improve reliability and velocity + Practice sustainable incident response and…
- NVIDIA (Santa Clara, CA)
- …TensorFlow, JAX, and Ray. + A strong background in hardware health monitoring and system reliability. + Hands-on expertise in operating and scaling distributed ... the world. What You Will Be Doing: + Develop and maintain large-scale systems supporting critical use cases for AI Infrastructure, driving reliability,…
- Rubrik (Palo Alto, CA)
- …and services with the objective of achieving and exceeding availability and reliability goals * Manage and streamline monitoring systems to enhance ... enable teams at Rubrik to develop secure software and protect data and systems with appropriate security controls. Information Security also develops systems to…
- Palo Alto Networks (Santa Clara, CA)
- …insights into our systems' performance and health. **Your Impact** As a Senior Staff SRE with the Cortex Observability team, you will: + Cloud Expertise: Utilize ... including the design, implementation, and continuous enhancement of our comprehensive observability systems. To meet the opportunities that such a role provides, you…
- ServiceNow, Inc. (Santa Clara, CA)
- …unlock new work experiences in the future. **As a Senior Staff Machine Learning Engineer - Site Reliability Engineer you will:** + Contribute to the ... It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today -…
- Google (Sunnyvale, CA)
- …+ 7 years of experience building and developing infrastructure, distributed systems, or networks, or experience with compute technologies, storage, or hardware ... + Experience in building large-scale operations capabilities in Site Reliability Engineering. Google Cloud's software engineers develop the next-generation…
- SanDisk (Milpitas, CA)
- …to keep our world moving forward. **Job Description** We are seeking a Principal Engineer, Reliability Engineering to join our team in Milpitas, United States. ... fostering a culture of continuous improvement + Develop and implement comprehensive reliability programs for complex hardware and software systems + Collaborate…
- LinkedIn (Mountain View, CA)
- …leading architectural transformations at internet-scale companies + Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture ... based in Sunnyvale, CA or San Francisco, CA. Key Responsibilities + Serve as a senior technical leader driving the long-term reliability and observability strategy…
- Palo Alto Networks (Santa Clara, CA)
- …Networks runs a large infrastructure and is one of the largest GCP customers. As a Senior Staff DevOps Engineer for the App Services team, you will be part of ... of critical business and production issues **Your Experience** + 4+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering + 2+ years…
- Rubrik (Palo Alto, CA)
- …+ Minimum 1-3 years of experience as a Development, DevOps or Site Reliability Engineer + Willing to provide 24/7 coverage + Strong Documentation skills ... to talk to you! **About The Role:** Sr. Site Reliability Engineers at Rubrik are systems/software engineers...Polaris Cloud Platform + Good mix of software and system engineering skills + Participate in on-call rotations across continents,…
- SanDisk (Milpitas, CA)
- …RESPONSIBILITIES: Main responsibilities of the role focus on validation of memory system design on Sandisk's enterprise SSD products + In-depth understanding of NAND ... Design and development of test cases for new memory system firmware designs + Development and validation of data...of NAND management FW features + Perform end-of-life (EOL) reliability verification tests + Perform failure analysis on EOL…
- Amazon (Santa Clara, CA)
- Description As a Senior Systems Development Engineer of the Backbone Enterprise, and Regional Engineering (BERE) team, you will provide expertise to expand ... Team: - Meet Robert, VP: https://youtu.be/8v5i42FL02w - Meet Xiaonan, Sr. Systems Development Engineer: https://youtu.be/58Q1SyGZ4qg Key job responsibilities You…
- NVIDIA (Santa Clara, CA)
- …places to work in the world. We are now looking for a Senior System Power Validation & Applications Engineer in the Datacenter System Engineering Team. ... solutions of density, performance, transient response, manageability, scalability, manufacturability, reliability, security, protection, and cost. You will gain a…