Compiling and Running Pregel+ Programs from the Console

 

We assume that a cluster has already been set up as described in the tutorial on Pregel+ deployment (Hadoop 1.x / Hadoop 2.x).

We now show how to run the Hash-Min algorithm for computing connected components from the console. See Section 3.2 of our technical report for the algorithm.
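
In short, Hash-Min propagates the minimum vertex ID within each connected component until no label changes. The following standalone C++ snippet is only a minimal, single-machine illustration of this min-label propagation; it is not the Pregel+ program itself, which runs the same logic vertex-centrically and in parallel.

// Minimal single-machine illustration of Hash-Min (min-label propagation).
// It only demonstrates the idea on a toy graph; the real system runs this logic in parallel.
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    // Toy undirected graph as an adjacency list: vertex v -> its neighbors
    vector<vector<int> > adj = {{1}, {0, 2}, {1}, {4}, {3}};
    vector<int> label(adj.size());
    for (size_t v = 0; v < adj.size(); ++v) label[v] = v;  // each vertex starts with its own ID

    bool changed = true;
    while (changed) {  // one round corresponds to one superstep
        changed = false;
        for (size_t v = 0; v < adj.size(); ++v)
            for (int u : adj[v])
                if (label[u] < label[v]) { label[v] = label[u]; changed = true; }
    }
    for (size_t v = 0; v < adj.size(); ++v)
        printf("vertex %zu -> component %d\n", v, label[v]);
    return 0;
}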

Download the system code (for Hadoop 1.x / for Hadoop 2.x) and extract it to a location of your choice, such as $HOME/pregelplus. This is the root directory of the system code.

Download the application code of Hash-Min (two files) and extract them to a location of your choice, such as $HOME/hashmin. This is the directory for the Hash-Min application.

Now, let us take a look at $HOME/hashmin/run.cpp. It runs Hash-Min over the graph data under HDFS path /toyFolder and writes the results under HDFS path /toyOutput.
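
Its overall structure is roughly of the following form. This is only a hypothetical skeleton for orientation: the header name and the function names below are assumptions, and the downloaded run.cpp is the authoritative version.

// Hypothetical skeleton of a Pregel+ application driver (all names are illustrative).
#include "pregel_app_hashmin.h"                  // assumed application header from the Hash-Min code

int main(int argc, char* argv[]) {
    init_workers();                              // assumed: start the computing processes
    pregel_hashmin("/toyFolder", "/toyOutput");  // assumed entry point: input and output HDFS paths
    worker_finalize();                           // assumed: shut down the computing processes
    return 0;
}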

To run the program, we need to put graph data under HDFS path /toyFolder. Download the toy graph described on the download page and put it onto HDFS as follows:

hadoop fs -mkdir /toyFolder

hadoop fs -put toy.txt /toyFolder

(For large data, we need to use Pregel+'s put program instead, which will be introduced shortly.)

 

Compilation

In the application directory, we need to write a makefile that refers to (1) the HDFS libraries and (2) the Pregel+ system code.

Download the makefile sample (for Hadoop 1.x / for Hadoop 2.x) and put it under the application directory $HOME/hashmin.

In the sample makefile, the places enclosed in brackets need to be filled in.

Specifically, if you are using a 64-bit (respectively, 32-bit) Linux, replace Lines 2 to 4 with PLATFORM=Linux-amd64-64 (respectively, PLATFORM=Linux-i386-32).

Also replace [Input the path for system code directory] with the system code directory, which is $HOME/pregelplus in our case.
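
As a rough illustration only (the variable names and include paths below are assumptions; the downloaded sample makefile is authoritative), the filled-in lines on a 64-bit machine with the system code at $HOME/pregelplus might look like:

# illustrative only; names and paths may differ in the actual sample makefile
PLATFORM=Linux-amd64-64
CPPFLAGS=-O2 -I$(HADOOP_HOME)/src/c++/libhdfs -I$(HOME)/pregelplus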

Then, use the command make to compile the source code into the binary file run.

 

Process-to-Machine Mapping

Suppose that the cluster contains one master machine and N slave machines, where the i-th slave has hostname "slavei" (i.e., slave1, slave2, ..., slaveN). We need to prepare a configuration file that specifies how many computing processes are to be run on each machine.

For example, let us create a file conf under $HOME/hashmin with the following content:

master:1
slave1:4
slave2:4
......
slaveN:4

This file states that the master machine runs only one process (i.e., Process 0, which is the master process), while each slave runs 4 processes, so 4N + 1 processes are launched in total.

 

Program Distribution

At this point the compiled binary of Hash-Min exists only on the master, so it is necessary to distribute the file to all the slave machines under the same path $HOME/hashmin. For each slave slavei, run the following command:

[Make sure the directory $HOME/hashmin already exists on each slave.]

scp $HOME/hashmin/run {username}@slavei:$HOME/hashmin

Alternatively, one may use a shell script like this one for program distribution, with the command ./distribute.sh $HOME/hashmin/run N.
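
A minimal sketch of what such a script could look like is given below (assuming the slaves are reachable as slave1 .. slaveN and the target directory already exists on each of them; the downloadable script may differ):

#!/bin/bash
# distribute.sh: copy a file to the same directory on slave1 .. slaveN
# Usage: ./distribute.sh <file-path> <number-of-slaves>
FILE=$1
N=$2
DIR=$(dirname "$FILE")
for ((i = 1; i <= N; i++)); do
    scp "$FILE" slave$i:"$DIR"
done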

 

Running the Program

Finally, use the following command to run the compiled program:

mpiexec -n number-of-processes -f process-to-machine-mapping-file compiled-binary-file other-arguments

In our case, the conf file above specifies 4N + 1 processes in total (one on the master and four on each of the N slaves), so we run Hash-Min as follows:

mpiexec -n {number-of-processes} -f $HOME/hashmin/conf $HOME/hashmin/run
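
For example, with N = 2 slaves and the conf file above, the total is 1 + 2 × 4 = 9 processes:

mpiexec -n 9 -f $HOME/hashmin/conf $HOME/hashmin/run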

Sometimes, automatic HDFS binding may fail and the program reports Input path "/toyFolder" does not exist! In this case, you need to edit the system file utils/ydhdfs.h to hardwire the connection: change hdfsConnect("default", 0) in function getHdfsFS() to hdfsConnect({your_NameNode_IP}, {your_NameNode_port}), as configured in $HADOOP_HOME/conf/core-site.xml (Hadoop 1.x) or $HADOOP_HOME/etc/hadoop/core-site.xml (Hadoop 2.x).
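
For example, if core-site.xml sets the default file system to hdfs://192.168.1.10:9000 (example values only; substitute your own NameNode host and port), the change would look like this (the surrounding line is sketched and may differ slightly in the actual file):

// In utils/ydhdfs.h, function getHdfsFS():
// hdfsFS fs = hdfsConnect("default", 0);         // original call (automatic binding)
hdfsFS fs = hdfsConnect("192.168.1.10", 9000);    // hardwired NameNode host and port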

 

Putting Large Files to HDFS

Pregel+ requires that a large graph be partitioned into smaller files under the same data folder, so that these files can be loaded by different computing processes during graph computation.

To achieve this, we cannot use the command hadoop fs -put {local-file} {HDFS-path}: otherwise, the whole data file would be loaded by one computing process, while the other processes simply wait for it to finish loading.

We remark that parallel loading only speeds up data loading and has no influence on the performance of graph computation. This is because, after all processes finish loading, they exchange vertices so that each vertex reaches the process it belongs to, which is decided by hashing the vertex ID.
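
For instance, a typical hash-based assignment rule looks like the following sketch (assuming integer vertex IDs; the actual hash function used by Pregel+ may differ):

// Sketch: decide which computing process a vertex belongs to by hashing its ID.
int process_of(int vertex_id, int num_processes) {
    return vertex_id % num_processes;  // simple modulo hash; Pregel+ may use a different function
}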

To put a large graph onto HDFS, one needs to compile and run this data-putting program with two arguments: {local-file} and {HDFS-path}.
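
For example, assuming the compiled binary is named put (the binary name and the file/folder names below are purely illustrative), a large graph file could be loaded with:

./put webgraph.txt /webFolder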