Domain-Specific Network Optimization for Distributed Deep Learning
Prof. Kai Chen
Department of Computer Science & Engineering, HKUST
Communication overhead poses a significant challenge to distributed DNN training. In this talk, I will overview existing efforts toward this challenge, examine their advantages and shortcomings, and present a novel solution that exploits the domain-specific characteristics of deep learning to reduce the communication overhead of distributed DNN training in a fine-grained manner. Our solution consists of several key innovations beyond prior work, including bounded-loss tolerant transmission, gradient-aware flow scheduling, and order-free per-packet load balancing, delivering up to 84.3% training acceleration over the best existing solutions. Our proposal is by no means the ultimate answer to this research problem; instead, we hope it inspires more critical thinking on the intersection between Networking and AI.
Kai Chen is an Associate Professor at HKUST, the Director of the Intelligent Networking Systems Lab (iSING Lab) and the HKUST-WeChat Joint Lab on Artificial Intelligence Technology (WHAT Lab), as well as the PC for an RGC Theme-based Project. He received his BS and MS from the University of Science and Technology of China in 2004 and 2007, respectively, and his PhD from Northwestern University in 2012. His research interests include Data Center Networking, Cloud Computing, Machine Learning Systems, and Privacy-preserving Computing. His work has been published in top venues such as SIGCOMM, NSDI, and TON, including a SIGCOMM best paper candidate. He is the Steering Committee Chair of APNet, serves on the Program Committees of SIGCOMM, NSDI, and INFOCOM, among others, and sits on the Editorial Boards of IEEE/ACM Transactions on Networking, Big Data, and Cloud Computing.
Join Zoom Meeting:
Enquiries: Ms. Karen Chan at Tel. 3943 8439