Compiling and Running G-thinker Programs from the Console

 

We assume that a cluster is already set up as in the tutorial for G-thinker deployment. We now show how to run the graphmatch algorithm from the console.

Download the system code and extract it to a location of your choice, such as $HOME/gthinker; this is the root directory of our system code. Then download the source code of graphmatch and extract it into the folder $HOME/gthinker.

Before running the application code, we need to partition the graph. We provide three basic partitioning algorithms, as listed on the Download page. Here, we use LDGPartitioner for labeled graphs as an example.

 

Compilation

In the $HOME/gthinker directory, we need to write a makefile that links against the HDFS libraries. The C++11 option should also be enabled (i.e., -std=c++11).

Download the makefile sample and put it into $HOME/gthinker. In the sample makefile, the places surrounded by brackets need to be specified. Specifically, if you are using a 64-bit (or respectively, 32-bit) Linux, replace Lines 2 to 4 with PLATFORM=Linux-amd64-64 (or respectively, PLATFORM=Linux-i386-32).
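For reference, a minimal makefile along these lines might look like the following sketch. The paths, PLATFORM value, and library locations are assumptions that must match your Hadoop and Java installations; the downloadable sample makefile is authoritative:

```makefile
# Sketch of a G-thinker makefile (Hadoop 1.x layout assumed).
# Adjust HADOOP_HOME/JAVA_HOME paths and PLATFORM to your installation.
CCOMPILE = mpic++
PLATFORM = Linux-amd64-64
CPPFLAGS = -I$(HADOOP_HOME)/src/c++/libhdfs -std=c++11 -O2
LIB      = -L$(HADOOP_HOME)/c++/$(PLATFORM)/lib -lhdfs \
           -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm

run: run.cpp
	$(CCOMPILE) run.cpp $(CPPFLAGS) $(LIB) -o run

clean:
	rm -f run
```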

Then, replace run.cpp with the source code you want to run, and use the command make to compile the source code into the binary file run.

Step 1: Upload Data

To run the program, we first need to put the graph data under the HDFS path /toyFolder. Download the toy graph described on the Download page and put label_toy.txt onto HDFS as follows:

hadoop fs -mkdir /toyFolder

hadoop fs -put label_toy.txt /toyFolder

(For large data, we need to use G-thinker's put program instead, which will be introduced shortly.)

Step 2: Graph Partition

For each Partitioner.cpp, the main function takes two parameters: the input data path and the output path on HDFS. Here, the first parameter is /toyFolder, and we use /toyPart as the output folder on HDFS. Use the makefile to build the graph partitioning program, and rename the binary to PartRun.

Step 3: Run Application

For each Application.cpp, the main function takes three parameters: a local root path, the path of the partitioned graph on HDFS (here, /toyPart), and the output path on HDFS. The temporary local root path can be set as:

string local_root = "/home/hzchen/tmp/gthinker";

The local root path is a temporary directory that the system needs on each worker to store serialized data when main memory is not large enough to hold it. Users can set this property via parameters in a configuration file. Use the makefile to build the graphmatch program, and rename the binary to GmatchRun.

 

Process-to-Machine Mapping

Suppose that the cluster contains one master machine and N slave machines, where the i-th slave has hostname "slavei" (consistent with the conf file below). We need to prepare a configuration file that specifies how many computing processes are to be run on each machine.

For example, let us create a file conf under $HOME/gthinker with the following content:

master:1

slave1:4

slave2:4

......

slaveN:4

This file states that the master machine runs only one process (i.e., Process 0 which is the master process), while each slave runs 4 processes.
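For clusters with many slaves, writing this file by hand is tedious. The following sketch generates it, assuming hostnames slave1..slaveN and 4 processes per slave as in the example above (redirect the output to $HOME/gthinker/conf):

```shell
# Sketch: print the process-to-machine mapping for N slaves, 4 processes each.
# Hostnames slave1..slaveN are assumptions matching the example conf.
gen_conf() {
  n=$1
  echo "master:1"          # one master process (Process 0)
  i=1
  while [ "$i" -le "$n" ]; do
    echo "slave$i:4"       # 4 computing processes per slave
    i=$((i+1))
  done
}
```

For example, `gen_conf 2 > conf` produces the mapping for a 2-slave cluster.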

 

Program Distribution

Since the binaries PartRun and GmatchRun currently exist only on the master, it is necessary to distribute them to all the slave machines, under the same path $HOME/gthinker. For each slave slavei, run the following commands:

[Make sure directory $HOME/gthinker is already created on each slave]

scp $HOME/gthinker/PartRun {username}@slavei:$HOME/gthinker

scp $HOME/gthinker/GmatchRun {username}@slavei:$HOME/gthinker

Alternatively, one may use a shell script like this one for program distribution, using the commands ./distribute.sh $HOME/gthinker/PartRun N and ./distribute.sh $HOME/gthinker/GmatchRun N.
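A stand-in for such a distribution script can be sketched as below. The hostnames slave1..slaveN are assumptions matching the conf file, and DRY_RUN is a hypothetical variable (not part of the downloadable script) that previews the commands without copying:

```shell
# Sketch: copy a binary to slave1..slaveN under the same directory path.
# Usage: distribute <file> <N>
distribute() {
  file=$1
  n=$2
  i=1
  while [ "$i" -le "$n" ]; do
    # ${DRY_RUN:+echo} prints the command instead of running it when DRY_RUN is set
    ${DRY_RUN:+echo} scp "$file" "slave$i:$(dirname "$file")"
    i=$((i+1))
  done
}
```

For example, `distribute $HOME/gthinker/PartRun 2` copies PartRun to slave1 and slave2.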

 

Running the Program

Finally, use the following command to run the compiled program:

mpiexec -n number-of-processes -f process-to-machine-mapping-file compiled-binary-file other-arguments

In our case, we first run the graph partitioning program as follows, where the input graph is under the HDFS path /toyFolder, and the results are written under the HDFS path /toyPart. Here, N should be the total number of processes specified in the conf file (e.g., 1 + 4 × the number of slaves for the conf above):

mpiexec -n N -f $HOME/gthinker/conf $HOME/gthinker/PartRun /toyFolder /toyPart

Then, we execute the graphmatch program on the cluster as follows; the final results are written to the HDFS path /toyResult:

mpiexec -n N -f $HOME/gthinker/conf $HOME/gthinker/GmatchRun /home/hzchen/tmp/gthinker /toyPart /toyResult

If the program reports "Input path "/toyFolder" does not exist!", please edit the system file utils/ydhdfs1.h, changing hdfsConnect("default", 0) in function getHdfsFS() to hdfsConnect({your_NameNode_IP}, {your_NameNode_port}), as configured in $HADOOP_HOME/conf/core-site.xml.

If Hadoop 2.x is deployed instead of Hadoop 1.x, uncomment #define YARN in utils/ydhdfs.h, and update function getHdfsFS() in utils/ydhdfs2.h with the IP and port (specified in $HADOOP_HOME/etc/hadoop/core-site.xml).

 

Putting Large Files to HDFS

G-thinker requires that a large graph data is partitioned into smaller files under the same data folder, and these files are loaded by different computing processes during graph computing.

To achieve this goal, we cannot use the command hadoop fs -put {local-file} {HDFS-path}. Otherwise, the entire data file is loaded by one computing process, while the other processes simply wait for it to finish loading.

We remark that parallel loading only speeds up data loading and has no influence on the performance of graph computing. This is because, after all processes finish data loading, they need to exchange vertices so that each vertex reaches its owning process, which is decided by hashing the vertex ID.

To put a large graph onto HDFS, one needs to compile and run this data-putting program with two arguments: {local-file} and {HDFS-path}.