Systems only (our lab has also built, or is building, a number of large-scale systems with industry partners, in areas such as graph databases, cloud-scale data warehousing, deep learning frameworks and optimization, graph neural networks, federated learning, and large-scale cluster resource management)
Tangram is a distributed dataflow framework that presents a new programming model, MapUpdate, which determines data mutability according to the workload; this not only brings good expressiveness but also enables a rich set of system features (e.g., asynchronous execution) and provides strong fault tolerance. Tangram achieves performance comparable to specialized systems on a wide variety of workloads, including bulk processing, graph analytics, graph mining, iterative machine learning, and complex pipelines.
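The MapUpdate idea can be sketched as follows. This is a hypothetical, single-machine simplification: the names (`map_fn`, `update_fn`, `run_map_update`) are illustrative only, not Tangram's actual API.

```python
# Toy sketch of a MapUpdate-style job: a "map" phase emits key-value
# pairs, and an "update" phase folds them into per-key state that the
# system may keep mutable (e.g., across iterations) or immutable,
# depending on the workload. All names here are illustrative.

from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for each word, as in classic word count.
    for word in line.split():
        yield word, 1

def update_fn(state, value):
    # Fold one emitted value into the per-key state.
    return state + value

def run_map_update(data, map_fn, update_fn, init=0):
    # Single-machine stand-in for the distributed runtime.
    table = defaultdict(lambda: init)
    for record in data:
        for key, value in map_fn(record):
            table[key] = update_fn(table[key], value)
    return dict(table)

counts = run_map_update(["a b a", "b c"], map_fn, update_fn)
```

Because the per-key update state is owned by the runtime, the same pair of functions can be executed synchronously (barrier per round) or asynchronously without changing user code.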
Yugong is a system that manages data placement and job placement in Alibaba's geo-distributed DCs. Each DC consists of thousands to tens of thousands of physical servers, stores EBs of data, and serves millions of batch analytics jobs every day. Yugong works seamlessly with MaxCompute (a fully fledged, multi-tenant data processing platform) in very large-scale production environments and has significantly reduced costly cross-DC bandwidth usage (a 70-80% total reduction).
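As a loose illustration of why joint data/job placement cuts cross-DC traffic, a naive greedy baseline might place each job in the DC that already holds most of its input. This is a toy stand-in for the problem setting only, not Yugong's actual algorithm.

```python
# Toy greedy job placement: assign each job to the data center (DC)
# holding the largest share of its input, so cross-DC bytes read are
# reduced. Illustrative stand-in only, not Yugong's actual algorithm.

def place_jobs(jobs, table_location, table_size):
    # jobs: {job: [tables it reads]}
    # table_location: {table: dc}, table_size: {table: bytes}
    placement, cross_dc_bytes = {}, 0
    for job, tables in jobs.items():
        # Bytes of this job's input that each DC already holds locally.
        local = {}
        for t in tables:
            dc = table_location[t]
            local[dc] = local.get(dc, 0) + table_size[t]
        best_dc = max(local, key=local.get)
        placement[job] = best_dc
        cross_dc_bytes += sum(table_size[t] for t in tables
                              if table_location[t] != best_dc)
    return placement, cross_dc_bytes

placement, cross = place_jobs(
    {"j1": ["t1", "t2"], "j2": ["t2", "t3"]},
    {"t1": "A", "t2": "A", "t3": "B"},
    {"t1": 10, "t2": 5, "t3": 1})
```

A production system must additionally move (and replicate) tables over time and balance compute load across DCs, which is what makes the joint placement problem hard.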
Husky is a general-purpose distributed computing system built on a high-performance kernel that supports execution patterns such as fine-grained and coarse-grained, batch and streaming, and synchronous and asynchronous. Husky has been significantly improved in many aspects since its introduction in PVLDB 9(5), and a new version with more powerful features and better performance will be open-sourced soon.
FlexPS is a Parameter-Server-based system that introduces a novel multi-stage abstraction to support flexible parallelism control, which is crucial for the efficient execution of machine learning tasks with dynamic workloads. Distinctive system designs, such as a stage scheduler, a stage-aware consistency controller, flexible model preparation, and direct model transfer, support efficient multi-stage machine learning, and optimizations such as customizable parameter partition, a customizable KV-server, and repetitive-get avoidance support general machine learning workloads. FlexPS achieves significant speedups and resource savings compared with existing PS systems.
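A rough single-process sketch of a PS-style pull/push loop with stage-wise parallelism control is below. All names (`KVStore`, `run_stage`, ...) are illustrative assumptions, not the actual FlexPS API, and a real system would also repartition and transfer the model between stages.

```python
# Minimal single-process stand-in for a Parameter-Server workflow in
# which each stage may run with a different number of workers, loosely
# modelled on the multi-stage idea. Names are illustrative only.

class KVStore:
    def __init__(self):
        self.params = {}

    def pull(self, keys):
        return [self.params.get(k, 0.0) for k in keys]

    def push(self, keys, grads, lr=0.1):
        # SGD-style update applied at the (conceptual) server side.
        for k, g in zip(keys, grads):
            self.params[k] = self.params.get(k, 0.0) - lr * g

def run_stage(store, num_workers, steps):
    # Each stage can use a different degree of parallelism; between
    # stages a real system would reschedule workers and move the model.
    for _ in range(steps):
        for w in range(num_workers):
            keys = [w % 2]                    # toy key assignment
            vals = store.pull(keys)
            grads = [v - 1.0 for v in vals]   # toy gradient: pull toward 1.0
            store.push(keys, grads)

store = KVStore()
run_stage(store, num_workers=4, steps=5)   # high-parallelism stage
run_stage(store, num_workers=1, steps=5)   # low-parallelism stage
```

The point of the abstraction is that the two `run_stage` calls above can use very different parallelism over the same model, which a fixed-parallelism PS job cannot express.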
BigGraph@CUHK is a fast and scalable open-source big graph platform for both offline graph analytics and online graph querying. Currently, BigGraph@CUHK consists of the following components:
- G-Miner: a distributed graph mining system that supports a wide range of graph mining jobs. G-Miner streamlines task processing so that CPU computation, network communication, and disk I/O handle their workloads concurrently; it achieves orders-of-magnitude speedups over existing solutions and scales to much larger graphs.
- Pregel+: an optimized Pregel implementation with novel techniques that significantly reduce communication cost and eliminate skewness in communication.
- Blogel: a novel block-centric framework that naturally overcomes performance bottlenecks arising from adverse characteristics of real-world graphs, namely skewed degree distribution, (relatively) high density, and large diameter, achieving orders-of-magnitude performance improvements over state-of-the-art graph processing systems.
- Quegel: a distributed system supporting efficient online graph querying, with a Pregel-like API for user-friendly programming and a novel superstep-sharing execution model for better utilization of cluster resources.
- GraphD: a distributed Pregel-like system offering out-of-core support for processing very big graphs on a small cluster of commodity PCs, with performance comparable to state-of-the-art distributed in-memory graph systems.
- Gtimer: a distributed system that provides a general programming framework for users to easily develop scalable and efficient algorithms for both offline analytics and online querying on massive, dynamic temporal graphs.
- LWCP: a fault tolerance mechanism for Pregel-like systems that is tens of times faster than conventional checkpointing mechanisms.
More details, including source code and example applications implemented on the BigGraph@CUHK platform, can be found on the BigGraph@CUHK homepage (http://www.cse.cuhk.edu.hk/systems/graph/). BigGraph@CUHK is still evolving to cover all sorts of graph-related tasks, so visit our site again to find the new components that will be added.
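The vertex-centric ("think like a vertex") model that Pregel+, Quegel, and GraphD build on can be sketched with a toy single-machine superstep driver. All names here are illustrative; real systems partition vertices across machines and exchange messages over the network.

```python
# Toy Pregel-style superstep loop: in each superstep, every active
# vertex processes its incoming messages, updates its value, and sends
# messages that activate their receivers; computation ends when no
# vertex has messages to process. Illustrative names only.

def pregel(graph, values, compute, max_supersteps=30):
    # graph: {vertex: [out-neighbors]}, values: {vertex: initial value}
    active = set(graph)                      # all vertices start active
    inbox = {v: [] for v in graph}
    for _ in range(max_supersteps):
        if not active:
            break                            # every vertex voted to halt
        outbox = {v: [] for v in graph}
        next_active = set()
        for v in active:
            values[v], msgs = compute(values[v], inbox[v], graph[v])
            for dst, m in msgs:
                outbox[dst].append(m)
                next_active.add(dst)         # a message reactivates dst
        inbox, active = outbox, next_active
    return values

def max_compute(value, messages, neighbors):
    # Classic example: propagate the maximum label through the graph.
    new_val = max([value] + messages)
    if messages and new_val == value:
        return new_val, []                   # no improvement: stay quiet
    # First superstep (no messages yet) or improved value: broadcast.
    return new_val, [(n, new_val) for n in neighbors]

labels = pregel({"a": ["b"], "b": ["a", "c"], "c": ["b"]},
                {"a": 3, "b": 1, "c": 5}, max_compute)
```

Systems in the list above keep this per-vertex programming interface while changing what happens underneath: Blogel runs a similar loop over whole blocks of vertices, Quegel shares supersteps across concurrent queries, and GraphD spills vertex state and messages to disk.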