An Open Source Model for Execution Graph Simulation for Distributed AI/ML Training.
Paid internship at Keysight Technologies Romania · 26/06/2023
  • Cloud computing
  • Networking
  • Location: București

Participate in building an Open Source Model for Execution Graph Simulation for Distributed AI/ML Training, along with the ecosystem of tools around it. You will implement an OpenAPI/protobuf model of the distributed AI/ML system; understand and build models of collective communication primitives such as AllReduce, Scatter, Broadcast, Gather, and AllToAll for various collective communication libraries; contribute to the simulation/emulation engine using TCP and RoCEv2 transport; and build curated models of AI/ML benchmarks and existing models.
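
For readers unfamiliar with the collective communication primitives named above, their data-movement semantics can be sketched in a few lines of Python. This is only an illustrative toy, not the project's model or any library's API: each rank's buffer is a plain list entry, no network transport is simulated, and all function names are hypothetical.

```python
from typing import Any, List, Optional

# Toy model: index i of each list stands for the buffer held by rank i.
# All names here are hypothetical and chosen only for illustration.

def broadcast(buffers: List[Any], root: int) -> List[Any]:
    """Every rank ends up with a reference to the root rank's value."""
    return [buffers[root] for _ in buffers]

def scatter(root_chunks: List[Any]) -> List[Any]:
    """The root holds one chunk per rank; rank i receives chunk i."""
    return list(root_chunks)

def gather(buffers: List[Any], root: int) -> List[Optional[List[Any]]]:
    """Only the root ends up with every rank's value; others get nothing."""
    return [list(buffers) if rank == root else None
            for rank in range(len(buffers))]

def all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]

def all_to_all(buffers: List[List[Any]]) -> List[List[Any]]:
    """buffers[i][j] is what rank i sends to rank j; the result is the transpose."""
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]
```

For example, `all_reduce([[1, 2], [3, 4], [5, 6]])` leaves every one of the three ranks holding `[9, 12]`. Real collective communication libraries implement these operations with ring, tree, or other algorithms over the network, which is the part an execution graph simulation would need to model.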

What you will gain:
  • Work as part of an agile team building the testing infrastructure for next-generation distributed systems that run AI/ML training at datacenter scale.
  • Understand the challenges in designing, building, and testing such systems.

Skills required:
  • Programming with Python on Linux
  • Working understanding of TCP/IP networking
  • Containerization