Special Session: Deep Neural Network Acceleration: From Training To Inference

Organizer: Bei Yu, Chinese University of Hong Kong

Talk 1: Training Large Neural Networks with Small Network Footprint

Hong Xu, City University of Hong Kong


Distributed machine learning (ML) systems using parameter servers are widely used in industry. With the rapid development of GPUs, training performance is often bottlenecked by the network communication needed to exchange gradients and parameters. In this talk, I will share our work on alleviating this communication bottleneck to speed up distributed ML training. First, I will motivate the problem with measurements on GPU clusters in Azure and EC2. Then I will present the design and implementation of our solution, a system called Stanza that separates the training of different layers of an ML model by exploiting their distinct characteristics. A prototype of Stanza is implemented on PyTorch. Our evaluation on Azure and EC2 shows that Stanza provides 1.25x to 13x speedups over the parameter server architecture when training common CNNs on ImageNet with Nvidia V100 GPUs and a 10GbE network.
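The layer-separation idea rests on a well-known property of classic CNNs: convolutional layers dominate computation while fully connected layers hold most of the parameters, and hence most of the gradient traffic a parameter server must carry each iteration. A minimal back-of-the-envelope sketch (the layer names and parameter counts below are rough AlexNet-style figures for illustration only, not Stanza's code or measurements):

```python
# Illustration only: in classic CNNs, fully connected (fc) layers hold
# most parameters, so they dominate per-iteration gradient traffic.
layers = {
    # name: (kind, parameter count) -- rough AlexNet-style figures
    "conv1": ("conv", 35_000),
    "conv2": ("conv", 307_000),
    "conv3": ("conv", 885_000),
    "conv4": ("conv", 663_000),
    "conv5": ("conv", 442_000),
    "fc6":   ("fc", 37_750_000),
    "fc7":   ("fc", 16_780_000),
    "fc8":   ("fc", 4_100_000),
}

conv_params = sum(n for kind, n in layers.values() if kind == "conv")
fc_params = sum(n for kind, n in layers.values() if kind == "fc")
total = conv_params + fc_params

print(f"conv share of parameters: {conv_params / total:.1%}")
print(f"fc share of parameters:   {fc_params / total:.1%}")

# If fc layers are trained where their parameters live and only conv
# gradients traverse the network, per-iteration traffic shrinks sharply.
bytes_per_param = 4  # fp32 gradients
print(f"all-layer gradient traffic: {total * bytes_per_param / 1e6:.0f} MB")
print(f"conv-only gradient traffic: {conv_params * bytes_per_param / 1e6:.0f} MB")
```

The numbers make the bottleneck concrete: the fc layers account for the overwhelming majority of the bytes exchanged, which is the asymmetry a layer-separating design can exploit.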


Hong Xu is an assistant professor in the Department of Computer Science, City University of Hong Kong. His research area is computer networking and systems, particularly data center networks and big data systems. He received the B.Eng. degree from The Chinese University of Hong Kong in 2007, and the M.A.Sc. and Ph.D. degrees from the University of Toronto in 2009 and 2013, respectively. He was the recipient of an Early Career Scheme Grant from the Hong Kong Research Grants Council in 2014. He has received several best paper awards, including the IEEE ICNP 2015 Best Paper Award. He is a senior member of the IEEE and a member of the ACM.

Talk 2: Performance Modeling and Optimization Framework for CNN Acceleration on FPGA

Wei Zhang, Hong Kong University of Science and Technology


FPGAs can significantly improve system performance at low energy consumption, attracting increasing attention in a wide variety of applications such as deep neural network acceleration. However, designing an efficient FPGA accelerator for CNN applications requires a long development time and a strong hardware background: implementation on FPGAs demands a deep understanding of the hardware architecture, and even with high-level synthesis (HLS), it relies on synthesis directives to generate designs that meet a given set of specifications. Consequently, an easy-to-use yet powerful framework for automatic CNN design generation is needed. In this work, we introduce a performance modeling and optimization framework for HLS based on C and OpenCL. The OpenCL-based framework is further extended to find efficient FPGA designs for CNN applications according to the device resource limits and the CNN specification. Our framework mainly consists of a LoopTree data structure that represents the design space, together with coarse-grained and fine-grained performance models that predict throughput. Finally, several performance metrics are developed to guide the search for the optimal design in the design space. Experimental results show the efficiency of the framework.
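To give a flavor of coarse-grained HLS performance modeling, here is a hypothetical sketch; the function names, the latency formula, and the resource budget are illustrative assumptions for this summary, not the framework's actual model. For a pipelined loop, latency is roughly ceil(trip count / unroll factor) x initiation interval + pipeline depth, and a design-space search keeps the directive setting with the best predicted throughput that fits the resource budget:

```python
def loop_latency_cycles(trip_count, unroll=1, ii=1, depth=10):
    """Rough cycle estimate for one pipelined, unrolled loop
    (illustrative formula, not the framework's actual model)."""
    iterations = -(-trip_count // unroll)  # ceil(trip_count / unroll)
    return iterations * ii + depth

def conv_layer_latency(out_h, out_w, out_c, in_c, k, unroll=1, ii=1):
    """Treat a convolution's flattened loop nest as one pipelined loop."""
    trips = out_h * out_w * out_c * in_c * k * k
    return loop_latency_cycles(trips, unroll=unroll, ii=ii)

# Sweep unroll factors and keep the fastest design that fits a
# made-up budget of 100 parallel multipliers (one per unrolled copy).
budget = 100
candidates = [u for u in (1, 2, 4, 8, 16, 32, 64, 128) if u <= budget]
best = min(candidates,
           key=lambda u: conv_layer_latency(27, 27, 64, 3, 5, unroll=u))
print("best unroll factor under budget:", best)
```

A real framework evaluates such models over a tree of nested loops (hence a structure like the LoopTree) and refines the coarse estimate with fine-grained effects such as memory port contention.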


Prof. Zhang received her Ph.D. degree in Electrical Engineering from Princeton University with the Wu Prize for research excellence. She joined the Hong Kong University of Science and Technology in 2013 and established the Reconfigurable Computing Systems Lab. Prof. Zhang currently serves as an Associate Editor for IEEE Transactions on VLSI Systems and as the Area Editor of Reconfigurable Computing for ACM Transactions on Embedded Computing Systems. She has served on many organization and technical program committees, including CASES, ISLPED, ASP-DAC, FPT, and FPL. Prof. Zhang has published more than 60 technical papers in refereed international journals and conferences and authored two book chapters. She received a Best Paper Award from the IEEE Computer Society Annual Symposium on VLSI and currently holds two international patents.

Talk 3: Neural Network Data-Flow Optimization and Scheduling

Bei Yu, Chinese University of Hong Kong


Deep neural networks (DNNs) have been applied to many learning tasks thanks to their significant accuracy improvements. However, DNNs suffer from a conflict between their resource-demanding nature and the resource constraints of hardware, especially for inference, which is usually deployed on resource-constrained platforms. In this talk, I will share our work on speeding up DNN inference under resource constraints. First, I will motivate this conflict with measurements on different hardware platforms. Then I will present the design and implementation of our work. We design a DNN inference system on FPGA that carefully optimizes data access patterns to reduce communication cost and achieve better performance. Further, we propose a scheduling algorithm for DNNs that makes full use of the available hardware resources.
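As a rough illustration of what scheduling DNN inference under resource constraints involves (this is a generic greedy list scheduler sketched for this summary, not the algorithm from the talk), consider mapping a small operator graph onto a fixed number of processing units while respecting data dependencies:

```python
def schedule(ops, deps, cost, num_units):
    """Greedy list scheduler: ops is a list of operator names, deps maps
    an op to the set of ops it depends on, cost maps an op to its cycle
    count. Returns (finish_times, makespan)."""
    indeg = {o: len(deps.get(o, ())) for o in ops}
    ready = [o for o in ops if indeg[o] == 0]
    units = [0] * num_units  # time at which each unit becomes free
    finish = {}
    while ready:
        op = ready.pop(0)
        # Earliest start: least-loaded unit is free AND all inputs done.
        start = max([min(units)] + [finish[p] for p in deps.get(op, ())])
        finish[op] = start + cost[op]
        units[units.index(min(units))] = finish[op]
        for o in ops:  # release successors whose inputs are now complete
            if op in deps.get(o, ()):
                indeg[o] -= 1
                if indeg[o] == 0:
                    ready.append(o)
    return finish, max(finish.values())

# A tiny inception-style diamond: conv1 feeds two branches that merge.
ops = ["conv1", "conv2a", "conv2b", "concat"]
deps = {"conv2a": {"conv1"}, "conv2b": {"conv1"},
        "concat": {"conv2a", "conv2b"}}
cost = {"conv1": 4, "conv2a": 3, "conv2b": 5, "concat": 1}
print("makespan, 2 units:", schedule(ops, deps, cost, 2)[1])
print("makespan, 1 unit: ", schedule(ops, deps, cost, 1)[1])
```

With two units the parallel branches overlap and the makespan drops relative to a single unit, which is the kind of trade-off a resource-aware DNN scheduler navigates at much larger scale.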


Prof. Bei Yu received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Texas at Austin in 2014. He is currently an Assistant Professor in the Department of Computer Science and Engineering, The Chinese University of Hong Kong. He served as TPC Chair of the ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) 2019, has served on the program committees of DAC, ICCAD, DATE, ASPDAC, and ISPD, and has served on the editorial boards of Integration, the VLSI Journal and IET Cyber-Physical Systems: Theory & Applications. He is Editor-in-Chief of the IEEE TCCPS Newsletter. He has received five Best Paper Awards, from Integration, the VLSI Journal in 2018, ISPD 2017, the SPIE Advanced Lithography Conference 2016, ICCAD 2013, and ASPDAC 2012; three other Best Paper Award nominations, at DAC 2014, ASPDAC 2013, and ICCAD 2011; and four ICCAD/ISPD contest awards.