G-thinker: A SubgraphCentric Framework for Efficient Subgraph Finding in a Big Graph

G-thinker is an efficient and novel subgraph-centric system, for developing scalable solutions to different subgraph finding problems in a unified framework.

Many distributed graph computation platforms have been developed recently to analyze big graphs created by modern applications, such as Pregel and GraphLab. These systems adopt a vertex centric framework where vertices exchange messages with each other iteratively to update their states, and thus data-intensive tasks like PageRank computation and graph traversal can be efficiently supported. In contrast, real applications often require computation intensive graph analytics such as graph matching and community detection, which aim to find from a big graph those subgraphs that satisfy certain requirements. These subgraph finding problems cannot be represented by a vertex-centric algorithm for efficient execution.

G-thinker overcomes all the above-mentioned problems and adopts a subgraph-centric programming interface that is natural for subgraph finding: users can design their algorithms to perform backtracking like in the serial algorithm counterparts, which avoids transmitting large amounts of intermediate data. To write a G-thinker program, a user only needs to specify how to grow a candidate subgraph by pulling more surrounding vertices, and how to process the subgraph; details like communication and distributed execution are hidden from the programming interface.

In G-thinker, each machine only keeps and processes a small batch of tasks in memory at any time (to achieve high throughput), and subgraphs that are waiting to be processed are buffered in a diskbased priority queue. The priority queue is organized with a minhashing based task scheduling strategy, to maximize the opportunity that the processing of different subgraphs can share common vertices (including their adjacent edges) that are cached in a local machine. As far as we know, G-thinker is the first framework for subgraph finding that supports vertex-sharing by different subgraphs, and that guarantees that the processing of a candidate subgraph will not block that of any other subgraph (whether it is on the same machine or not). We have used G-thinker to develop significantly more efficient and scalable solutions for a number of classic subgraph finding problems, including triangle counting, maximum clique finding, and graph matching.