Senior Software Engineer - AI Infrastructure Networking


Join Our Innovative Team!

The Oracle Cloud Infrastructure (OCI) AI Infrastructure Innovation team is leading the development of advanced AI/HPC networking solutions for GPU superclusters. Our mission is to deliver state-of-the-art RDMA-based networking across frontend and backend fabrics, enabling customers to achieve unparalleled performance for AI training and inference. In this role, you will architect, design, and develop networking software that extends RDMA capabilities for GPUs and optimizes storage access. If you are enthusiastic about large-scale distributed systems, high-speed networking, and AI workloads, we invite you to contribute to revolutionary advancements in the industry.

Your Responsibilities:
- Lead the architectural design and development of high-performance RDMA solutions within OCI's AI/HPC platforms, covering both frontend and backend networking.
- Improve network and TCP performance by identifying optimizations across the kernel, NICs, switches, transport protocols, and storage communications.
- Develop production-quality, high-performance software features while maintaining stringent standards for reliability, observability, and security.
- Define performance goals and success metrics; design benchmarks and run large-scale experiments to validate throughput, latency, and performance behavior.
- Collaborate with teams working on GPU platforms, storage systems, databases, and control systems to deliver end-to-end networking solutions, influencing OCI's network architecture and standards.
- Mentor fellow engineers, provide technical leadership, and contribute to long-term roadmaps and strategy.
Required Qualifications:
- A strong software engineering background with a solid grasp of data structures and algorithms, and proven expertise in optimizing large systems for high scale, low latency, and high throughput.
- Demonstrated experience developing, deploying, and maintaining high-performance production code.
- Ability to lead teams technically, mentor peers, and deliver results in ambiguous problem spaces.
- A BS/MS in Computer Science, Electrical/Computer Engineering, or equivalent practical experience.

Preferred Qualifications:
- Familiarity with RDMA networking (RoCE and/or InfiniBand), including congestion control, reliability, and performance tuning at scale.
- Strong understanding of AI/HPC technologies and workflows, including NCCL/RCCL/MPI, Slurm or similar schedulers, and GPU communication patterns.
- Hands-on experience integrating GPUDirect and NVMe-oF in production environments.
- Proficiency with performance and observability tools such as eBPF, perf, flame graphs, and NIC/switch telemetry, with a focus on SLO-driven operations at scale.

Compensation and Benefits:
We offer a competitive salary ranging from $96,800 to $223,400 per year, with opportunities for bonuses and equity. Oracle also provides a comprehensive benefits package including medical, dental, and vision insurance, retirement plans, and generous paid time off.

Join Oracle, where true innovation flourishes and every voice is valued. We are dedicated to creating an inclusive workforce that fosters opportunities for everyone. Apply now to help us shape the future of AI and cloud solutions that will make a positive impact on billions of lives.
Location:
Seattle
Category:
Computer And Mathematical Occupations
