Quegel: Query-Centric Pregel-Style Big Graph Querying

We decided to develop Quegel since existing Pregel-like systems are not suitable for processing queries on big graphs. However, data scientists in many companies (like those in online social media, e-commerce and telecom industries) need to issue queries on their huge graph-structured data to analyze business data for making better business decisions.

Weaknesses of Existing Pregel-Like Systems. Existing systems are designed for offline analytics workload where most vertices (if not all) participate in the processing for many supersteps, but a graph query usually accesses only a small fraction of the whole graph. If we write a vertex-centric algorithm for a generic query, we have to run a job for every incoming query. As a result, the network bandwidth is not fully utilized and a lot of synchronization barriers are incurred, not to mention that some systems such as Giraph tightly bind the expensive graph loading with each job.

Optionally, one may hard code a vertex-centric algorithm to process a batch of k queries to improve the efficiency, where k is an input argument. However, users need to differentiate the incoming messages and/or aggregators of different queries and update k vertex values in their program, and take care of additional details like when a vertex can be deactivated (e.g., when it should be halted for all the k queries). Moreover, since queries come on demand, one has to write additional programs to probe for incoming queries and then call the batch-evaluation job to answer them. Also, it is not clear how many queries one should wait for them to arrive before starting a batch-evaluation job. Last but not least, the batch-evaluation approach does not solve the problem of low utilization of network bandwidth, since in later stage when most queries finish their processing, only a small number of queries (or stragglers) are still being processed which generates insufficient amount of messages to saturate the network bandwidth.

What Quegel Does. Quegel solves all the above problems of existing systems, and can support interactive querying by a small number of users or high-throughput batch processing of a large number of graph queries. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model that better utilizes the cluster resources. Quegel further provides convenient interfaces for constructing indexes to significantly improve query performance, which existing graph-parallel systems lack. Our experiments show that Quegel is highly efficient in answering various types of queries and is up to orders of magnitude faster than existing systems.