Here we list only some of the systems work from the Husky Data Lab; our work on algorithms and theory can be found on the publications page. Our lab has also developed a number of large-scale systems with industry, including graph databases, big data platforms, cloud-scale data warehousing, deep learning frameworks and optimization, graph neural network systems, and large-scale cluster resource management.
- Cluster Resource Scheduler: ParSync [Best Paper Award of USENIX ATC 2021]
ParSync is a distributed resource scheduler framework designed for large-scale production clusters with 100K machines and billions of tasks per day on average. The core of ParSync is a fine-grained, staleness-aware state sharing design called partitioned synchronization, which effectively reduces conflicts among schedulers contending for resources, achieving both low scheduling delay and high scheduling quality. The simplicity of its design also allows it to maintain user transparency and backward compatibility in production clusters. ParSync has been deployed in Fuxi, the distributed cluster scheduler at Alibaba.
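The partitioned-synchronization idea can be sketched as follows. This is a toy illustration with hypothetical names, not ParSync's actual code: the cluster state is split into partitions, each scheduler refreshes one partition per round in a staggered round-robin order, and a scheduler prefers placing tasks on its most recently synced (freshest) partition, so concurrent schedulers rarely contend for the same machines.

```python
# Toy sketch of partitioned synchronization (hypothetical names, not
# ParSync's real code). The cluster state is split into NUM_PARTITIONS
# partitions; schedulers sync partitions in a staggered round-robin, so
# in any round each scheduler holds the freshest view of a *different*
# partition and preferentially places tasks there, reducing conflicts.

NUM_PARTITIONS = 4

def partition_to_sync(scheduler_id: int, round_no: int) -> int:
    """Which partition this scheduler refreshes in this round (staggered)."""
    return (scheduler_id + round_no) % NUM_PARTITIONS

def freshest_partition(scheduler_id: int, round_no: int) -> int:
    """The most recently synced partition -- preferred for task placement."""
    return partition_to_sync(scheduler_id, round_no)

# In round 0, four schedulers each prefer a distinct partition, so their
# placement decisions rarely collide on the same machines.
prefs = {s: freshest_partition(s, round_no=0) for s in range(4)}
```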
- Graph Neural Network Framework: Seastar+DGCL
Seastar is a high-performance GNN training framework. For easy programmability, Seastar offers a vertex-centric programming model that allows users to focus on the logic of a single vertex and program GNNs in native Python syntax. To achieve high performance, Seastar identifies abundant operator-fusion opportunities in the computational graphs of GNN training and generates high-performance fused kernels with novel designs such as feature-adaptive groups, locality-centric execution, and dynamic load balancing.
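To illustrate the vertex-centric style (a simplified sketch, not Seastar's real API): the user writes the logic of a single vertex in plain Python, and the framework applies it to every vertex. Here a toy GCN-like rule mean-aggregates neighbor features and adds the vertex's own feature; a real system would fuse such per-vertex operations into efficient GPU kernels.

```python
# Hypothetical illustration of vertex-centric GNN programming (not
# Seastar's actual API). The user only writes the per-vertex rule;
# the "framework" loop below applies it across the whole graph.

def gcn_vertex(vertex_feat, neighbor_feats):
    """Per-vertex rule: mean-aggregate neighbor features, add self feature."""
    if not neighbor_feats:
        return vertex_feat
    agg = [sum(col) / len(neighbor_feats) for col in zip(*neighbor_feats)]
    return [v + a for v, a in zip(vertex_feat, agg)]

def run_layer(features, adjacency):
    """Framework side: apply the per-vertex rule to every vertex."""
    return {v: gcn_vertex(features[v], [features[u] for u in adjacency[v]])
            for v in features}

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adjacency = {0: [1, 2], 1: [0], 2: [0, 1]}
out = run_layer(features, adjacency)
```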
To support distributed training, DGCL, an efficient Distributed Graph Communication Library, is used for communication planning when extending a single-GPU GNN system to distributed training. DGCL jointly considers fully utilizing fast links (e.g., NVLink), fusing communication, avoiding contention, and balancing loads across different links to reduce communication time and improve the scalability of distributed GNN training.
Seastar and DGCL are key components in our GNN open-source community project with MindSpore - Huawei's Deep Learning Training/Inference Framework. A simple demo video of Seastar is shown here and more details can be found in the main page of our GNN project.
- Resource Maximization in Big Data Analytics: Ursa
Ursa is a big data analytics framework that integrates resource scheduling and job execution. Ursa is designed to handle jobs with frequent fluctuations in resource usage, which are common in workloads such as OLAP, machine learning, and graph analytics. Ursa enables the job scheduler to capture accurate resource demands dynamically from the execution framework and to provide low-latency, fine-grained resource allocation, thereby achieving high resource utilization and improving both makespan and average job completion time (JCT). Ursa is the big data framework that we are currently actively maintaining and has also been deployed for commercial use.
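The feedback loop described above can be sketched with a toy allocator (names and policy are hypothetical, not Ursa's actual design): the execution framework reports each job's current demand, and the scheduler reallocates fine-grained shares of a fixed resource pool each round, so allocations track fluctuating demands and utilization stays high.

```python
# Toy sketch of demand-driven, fine-grained allocation (hypothetical,
# not Ursa's real policy). Jobs report current demands; the scheduler
# grants each job min(demand, fair residual share), water-filling style,
# re-run every scheduling round as demands fluctuate.

def allocate(capacity, demands):
    """Return per-job grants from a fixed capacity, given current demands."""
    alloc = {}
    remaining = capacity
    jobs = sorted(demands, key=demands.get)  # satisfy small demands first
    for i, job in enumerate(jobs):
        share = remaining // (len(jobs) - i)  # fair share of what's left
        alloc[job] = min(demands[job], share)
        remaining -= alloc[job]
    return alloc

# Demands fluctuate between rounds; the allocation follows them.
round1 = allocate(100, {"olap": 30, "ml": 80})  # ml capped by capacity
round2 = allocate(100, {"olap": 10, "ml": 80})  # olap shrank, ml gets more
```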
- Graph Database: Grasper
Grasper is an RDMA-enabled distributed graph database for OLAP on large graphs. Grasper is designed to address key challenges of processing OLAP workloads on graphs such as diverse query complexity, diverse data access patterns, and high communication and CPU costs. Grasper introduces an RDMA-enabled native graph store to reduce the high network communication cost for distributed query processing. The key design of Grasper is a novel query execution model, called Expert Model, which supports high concurrency in processing multiple queries and adaptive parallelism control within each query according to its workload. Expert model also enables tailored optimizations for different categories of query operations based on their own data access patterns. Grasper has been deployed for commercial use. The academic version can be found here.
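The adaptive-parallelism aspect of the Expert Model can be illustrated with a toy sizing rule (hypothetical names and numbers, not Grasper's implementation): the threads assigned to a query scale with its workload up to a budget, so heavy queries gain parallelism while light queries stay cheap and many queries can run concurrently.

```python
# Toy illustration of adaptive parallelism control per query (hypothetical,
# not Grasper's code): thread count grows with the query's workload,
# capped by a per-query budget so concurrent queries can share the CPUs.

def threads_for_query(num_items, max_threads=8, items_per_thread=100):
    """Scale threads with workload (ceiling division), capped by budget."""
    return max(1, min(max_threads, -(-num_items // items_per_thread)))

light = threads_for_query(50)     # small query: minimal parallelism
medium = threads_for_query(250)   # medium query: proportional threads
heavy = threads_for_query(10000)  # heavy query: capped at the budget
```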
- Graph Database: G-Tran
G-Tran is an RDMA-enabled distributed graph database for OLTP on large graphs with serializable and snapshot isolation support. G-Tran is designed to address key challenges of processing OLTP workloads on graphs, such as the irregularity of graph structures, skewed data distribution, and low throughput and high abort rate due to the large read/write sets in graph transactions. G-Tran introduces a graph-native data store to achieve good data locality and fast data access for transactional updates and queries. It adopts a fully decentralized architecture that leverages RDMA to process distributed transactions with the MPP model, which achieves high performance by utilizing all computing resources. G-Tran also features a new MV-OCC implementation to address the issue of large read/write sets in graph transactions.
- Stream Processing: Nova
Nova is a distributed stream processing system that targets complex streaming data analytics workloads such as dynamic graph analytics and online learning. Nova supports timestamped state sharing through a new system-wide state abstraction with well-defined time semantics. It introduces optimizations such as conflict-free message delivery, reply prioritization, and update progress acceleration to achieve much lower latency and higher throughput for advanced data stream analytics compared with state-of-the-art systems.
- Elastic Deep Learning: EDL
EDL is a framework that supports elastic deep learning in multi-tenant GPU clusters with low overhead. EDL is a lightweight coordination layer between a cluster scheduler and a DL framework (e.g., MindSpore, TensorFlow, PyTorch). The DL framework only needs to retrieve the metadata of a block of training data from EDL and notify EDL after finishing a mini-batch. The scheduler can instruct EDL to remove/add any worker for a training job using a simple API, e.g., scale_in() and scale_out(). EDL can benefit multi-tenant GPU cluster management in many ways, including improving resource utilization by adapting to load variations, maximizing the use of transient idle GPUs, performance profiling, straggler mitigation, and job migration.
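The coordination pattern above can be sketched as follows. The scale_in()/scale_out() names come from the text; everything else (the class and the block-assignment methods) is a hypothetical simplification: the DL framework pulls the next data-block assignment from the coordinator and reports back after each mini-batch, so workers can be added or removed at mini-batch boundaries without restarting the job.

```python
# Minimal sketch of the EDL-style coordination layer. Only scale_in /
# scale_out are named in the text; the rest is hypothetical illustration.

class ElasticCoordinator:
    def __init__(self, num_blocks, workers):
        self.pending = list(range(num_blocks))  # unprocessed data blocks
        self.workers = set(workers)

    def next_block(self, worker):
        """Called by the DL framework to fetch the next block's metadata."""
        if worker not in self.workers or not self.pending:
            return None
        return self.pending.pop(0)

    def report_done(self, worker, block):
        """Called after the worker finishes a mini-batch on the block."""
        pass  # a real system would track progress and re-queue on failure

    # Scheduler-facing API, as described in the text:
    def scale_out(self, worker):
        self.workers.add(worker)

    def scale_in(self, worker):
        self.workers.discard(worker)  # its unfinished blocks stay pending

coord = ElasticCoordinator(num_blocks=4, workers=["w0"])
coord.scale_out("w1")       # scheduler adds a worker mid-job
b = coord.next_block("w1")  # new worker starts pulling blocks immediately
coord.scale_in("w0")        # removed worker receives no more blocks
```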
- Big Data Framework: Tangram
Tangram is a distributed dataflow framework that presents a new programming model called MapUpdate to determine data mutability according to workloads, which not only brings good expressiveness but also enables a rich set of system features (e.g., asynchronous execution) and provides strong fault tolerance. Tangram achieves comparable performance with specialized systems for a wide variety of workloads including bulk processing, graph analytics, graph mining, iterative machine learning, and complex pipelines.
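The MapUpdate idea can be sketched with a toy word-count example (a simplified illustration, not Tangram's actual API): a map function emits keyed messages from input records, and an update function folds each message into a keyed state collection, whose mutability the system is then free to choose per workload.

```python
# Toy sketch of the MapUpdate programming model (hypothetical names,
# not Tangram's real API): map emits (key, value) messages; update folds
# each message into per-key state. Shown here on word count.

from collections import defaultdict

def map_fn(record):
    """Map side: emit a (word, 1) message for each word in the record."""
    for word in record.split():
        yield word, 1

def update_fn(state, value):
    """Update side: fold an incoming value into the key's state."""
    return state + value

def map_update(records, map_fn, update_fn, init=0):
    """Driver: route every emitted message to its key's state."""
    state = defaultdict(lambda: init)
    for record in records:
        for key, value in map_fn(record):
            state[key] = update_fn(state[key], value)
    return dict(state)

counts = map_update(["a b a", "b c"], map_fn, update_fn)
```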
- Geo-Distributed Data Center Management: Yugong
Yugong is a system that manages data placement and job placement in Alibaba's geo-distributed data centers (DCs). Each DC consists of thousands to tens of thousands of physical servers, stores EBs of data, and serves millions of batch analytics jobs every day. Yugong works seamlessly with MaxCompute (a fully fledged, multi-tenancy data processing platform) in very large-scale production environments and has significantly reduced the costly cross-DC bandwidth usage (70-80% total reduction).
- Big Data Framework: Husky
Husky is a general-purpose distributed computing system built on a high-performance kernel, supporting execution patterns such as fine-grained and coarse-grained, batch and streaming, and synchronous and asynchronous. Husky has been significantly improved in many aspects since its introduction in PVLDB 9(5), and a new version with more powerful features and better performance will be open-sourced soon.
- Distributed Machine Learning: FlexPS
FlexPS is a Parameter-Server-based system that introduces a novel multi-stage abstraction to support flexible parallelism control, which is crucial for the efficient execution of machine learning tasks with dynamic workloads. Distinctive system designs such as a stage scheduler, a stage-aware consistency controller, flexible model preparation, and direct model transfer are introduced to support efficient multi-stage machine learning. FlexPS includes optimizations such as customizable parameter partition, customizable KV-server, and repetitive-get avoidance to support general machine learning. FlexPS achieves significant speedups and resource savings compared with existing PS systems.
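The multi-stage abstraction can be sketched as follows (a hypothetical simplification, not FlexPS's real API): a job is expressed as a sequence of stages, each declaring its own degree of parallelism, so a task can, for example, run with many workers in early epochs and fewer later, instead of fixing one worker count for the whole job.

```python
# Toy sketch of the multi-stage abstraction (hypothetical, not FlexPS's
# API): each stage = (num_workers, work_fn), and each stage runs with its
# own parallelism. Workers are simulated sequentially here for clarity.

def run_multi_stage(stages):
    """Run stages in order, each with its declared degree of parallelism."""
    results = []
    for num_workers, work_fn in stages:
        results.append([work_fn(w) for w in range(num_workers)])
    return results

stages = [
    (4, lambda w: f"epoch0-worker{w}"),  # early stage: high parallelism
    (2, lambda w: f"epoch5-worker{w}"),  # later stage: fewer workers
]
out = run_multi_stage(stages)
```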
- Big Graph Data Platform: BigGraph@CUHK
BigGraph@CUHK is a fast and scalable open-source big graph platform for both offline graph analytics and online graph querying. Currently, BigGraph@CUHK consists of the following components:
- G-Miner: a distributed graph mining system that supports a wide range of graph mining jobs. G-Miner streamlines tasks so that CPU computation, network communication, and disk I/O can process their workloads concurrently, achieving orders-of-magnitude speedups over existing solutions and scaling to much larger graphs.
- Pregel+: an optimized Pregel implementation with novel techniques to significantly reduce communication cost and eliminate skewness in communication.
- Blogel: a novel block-centric framework that naturally resolves performance bottlenecks arising from adverse characteristics of real-world graphs, namely skewed degree distribution, (relatively) high density, and large diameter, achieving orders-of-magnitude performance improvements over state-of-the-art graph processing systems.
- Quegel: a distributed system supporting efficient online graph querying, with a Pregel-like API for user-friendly programming and a novel superstep-sharing execution model for better utilization of cluster resources.
- GraphD: a distributed Pregel-like system offering out-of-core support for processing very big graphs on a small cluster of commodity PCs, with performance comparable to state-of-the-art distributed in-memory graph systems.
- GTimer: a distributed system that provides a general programming framework for users to easily develop scalable and efficient algorithms for both offline analytics and online querying on massive dynamic, temporal graphs.
- LWCP: a fault tolerance mechanism for Pregel-like systems that is tens of times faster than conventional checkpointing mechanisms.
More details, including source code and examples of applications implemented on the BigGraph@CUHK platform, can be found on the BigGraph@CUHK homepage (http://www.cse.cuhk.edu.hk/systems/graph/). BigGraph@CUHK is still evolving to cover more graph-related tasks; please visit our site again to find new components as they are added.