Compiling and Running GraphD Programs from the Console

 

We assume that a cluster is already set up as in the tutorial for GraphD deployment. We now show how to run the Hash-Min algorithm for computing connected components from the console.

Download the system code and extract it to a location of your choice, such as $HOME/graphd; this is the root directory of our system code. Then download the basic-mode application code of Hash-Min and extract it to a location of your choice, such as $HOME/hashmin; this is the directory for the Hash-Min application.

Let us rename the source file to run.cpp and take a look at its main function. We can see that the data-input path (respectively, the result-output path) on HDFS is given by the first (respectively, second) argument. There is also another path:

string local_root = "/home/yanda/tmp/iopregel";

It indicates the local directory where the job writes its edge and message streams; one may change it to $HOME/tmp/iopregel according to the value of $HOME on your machines.

To run the program, we first put a graph dataset under the HDFS path /toyFolder. Download the toy graph described on the download page and put it onto HDFS as follows:

hadoop fs -mkdir /toyFolder

hadoop fs -put toy.txt /toyFolder

(For large data, we need to use GraphD's put program instead, which will be introduced shortly.)
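After the put, one can check that the file is in place by listing the HDFS folder:

hadoop fs -ls /toyFolder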

 

Compilation

In the application directory, we need to write a makefile that refers to the libraries of (1) HDFS and (2) the GraphD system. The C++11 option should also be enabled (i.e., -std=c++11).

Download the makefile sample and put it under the application directory $HOME/hashmin. In the sample makefile, the places surrounded by brackets need to be specified. Specifically, if you are using 64-bit (respectively, 32-bit) Linux, replace Lines 2 to 4 with PLATFORM=Linux-amd64-64 (respectively, PLATFORM=Linux-i386-32). Also replace [Input the path for system code directory] with the system code directory, which is $HOME/graphd in our case.

Then, use the command make to compile the source code into the binary file run.
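For reference, the compilation command that make issues is roughly of the following shape. This is only an illustrative sketch: the mpic++ compiler wrapper and the brace-enclosed paths are assumptions, and the exact flags depend on your Hadoop and Java installations, so follow the sample makefile.

mpic++ run.cpp -o run -std=c++11 -I{system-code-directory} -I{libhdfs-include-directory} -L{libhdfs-library-directory} -lhdfs -L{libjvm-library-directory} -ljvm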

 

Process-to-Machine Mapping

Suppose that the cluster contains one master machine and N slave machines, where the slaves have hostnames slave1, slave2, ..., slaveN. We need to prepare a configuration file that specifies how many computing processes are to be run on each machine.

For example, let us create a file conf under $HOME/hashmin with the following content:

master:1

slave1:4

slave2:4

......

slaveN:4

This file states that the master machine runs only one process (i.e., Process 0, which is the master process), while each slave runs 4 processes, so the job uses 1 + 4N processes in total.

 

Program Distribution

Since the binary file of Hash-Min currently exists only on the master, it is necessary to distribute it to all the slave machines under the same path $HOME/hashmin. For each slave slavei, run the following command:

[Make sure the directory $HOME/hashmin already exists on each slave]

scp $HOME/hashmin/run {username}@slavei:$HOME/hashmin

Alternatively, one may use a shell script like this one for program distribution, using the command ./distribute.sh $HOME/hashmin/run N.
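A minimal sketch of what such a distribution script might look like is given below; it assumes the slave hostnames slave1, ..., slaveN and copies the given file to the same directory on every slave (the actual script from the link may differ):

#!/bin/bash
# Hypothetical distribute.sh: copy a file to the same directory on slave1..slaveN.
# Usage: ./distribute.sh {file-to-copy} {number-of-slaves}
FILE=$1
N=$2
DIR=$(dirname "$FILE")
for ((i = 1; i <= N; i++)); do
    scp "$FILE" slave$i:"$DIR"    # prepend {username}@ if your user name differs across machines
done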

 

Running the Program

Finally, use the following command to run the compiled program:

mpiexec -n number-of-processes -f process-to-machine-mapping-file compiled-binary-file other-arguments

In our case, we run basic-mode Hash-Min as follows, where the input graph is under HDFS path /toyFolder and the results are written under HDFS path /toyOutput; the number passed to -n should be the total number of processes specified in conf (i.e., 1 + 4N here):

mpiexec -n {number-of-processes} -f $HOME/hashmin/conf $HOME/hashmin/run /toyFolder /toyOutput
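For example, on a hypothetical cluster with N = 2 slaves, the conf file above specifies 1 + 2 × 4 = 9 processes in total, so the command would be:

mpiexec -n 9 -f $HOME/hashmin/conf $HOME/hashmin/run /toyFolder /toyOutput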

If the program reports "Input path "/toyFolder" does not exist!", please edit the system file utils/ydhdfs1.h and change hdfsConnect("default", 0) in function getHdfsFS() to hdfsConnect({your_NameNode_IP}, {your_NameNode_port}), as configured in $HADOOP_HOME/conf/core-site.xml.

If Hadoop 2.x is deployed instead of Hadoop 1.x, uncomment #define YARN in utils/ydhdfs.h, and update function getHdfsFS() in utils/ydhdfs2.h with the IP and port (specified in $HADOOP_HOME/etc/hadoop/core-site.xml).

 

Running Jobs in Recoded Mode

The recoded mode can be run similarly. First run the ID-recoding code, which recodes the graph on HDFS and writes the recoded graph to the local disks of the machines. Then run the recoded code, which reads the recoded graph from the local disks.

 

Putting Large Files to HDFS

GraphD requires that a large graph dataset be partitioned into smaller files under the same data folder; these files are then loaded by different computing processes during graph computation.

To achieve this, we cannot simply use the command hadoop fs -put {local-file} {HDFS-path}: otherwise, the whole data file would be loaded by one computing process, while the other processes simply wait for it to finish loading.

We remark that parallel loading only speeds up data loading and has no influence on the performance of graph computation. This is because, after all processes finish loading, they exchange vertices so that each vertex reaches the process it belongs to, which is decided by hashing the vertex ID.

To put a large graph dataset onto HDFS, one needs to compile and run this data-putting program with the two arguments {local-file} and {HDFS-path}.
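As a hypothetical usage sketch, assuming the data-putting program is compiled (with the same makefile setup as above) into a binary named put (a placeholder name), the invocation would look like:

./put {local-file} {HDFS-path}

For example, to upload a local file big_graph.txt (a hypothetical file) under the HDFS folder /bigFolder:

./put big_graph.txt /bigFolder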