Compiling and Running Quegel Programs from the Console

 

We assume that a cluster is already set up as in the tutorial for Quegel deployment.

We now show how to run BFS from the console to answer point-to-point shortest-path queries (s, t). See Section 5.1.1 of our technical report for the algorithm.

Download the system code and extract it to a location of your choice, such as $HOME/quegel. This is the root directory of our system code.

Download the application code of BFS and extract it to a location of your choice, such as $HOME/bfs. This is the directory for the BFS application.

Now, let us take a look at $HOME/bfs/bfs.cpp. It runs BFS over the data under HDFS path /toy_out, and query results are written under HDFS path /ol_out.

To run the program, we need to put graph data under HDFS path /toy_out, as described here for the Web-Stanford graph. Note that for large data, we need to use Quegel's put program instead, which we will describe shortly.
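
For a small graph such as Web-Stanford, the standard Hadoop upload command suffices. A possible command sequence is the following, where the local file name web-Stanford_adj.txt is only an illustrative assumption:

hadoop fs -mkdir /toy_out
hadoop fs -put web-Stanford_adj.txt /toy_out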

 

Compilation

In the application directory, we need to write a makefile that refers to the libraries of (1) HDFS and (2) the Quegel system.

Download the makefile sample and put it under the application directory $HOME/bfs.

In the sample makefile, the places surrounded by brackets need to be specified.

Specifically, if you are using a 64-bit (respectively, 32-bit) Linux, replace Lines 2 to 4 with PLATFORM=Linux-amd64-64 (respectively, PLATFORM=Linux-i386-32).

Also replace [Input the path for system code directory] with the system code directory, which is $HOME/quegel in our case.

Then, use the command make to compile the source code into the binary file run.
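
For example:

cd $HOME/bfs
make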

 

Putting Large Files to HDFS

Quegel requires that a large graph be partitioned into smaller files under the same data folder, so that the files can be loaded by different computing processes in parallel during graph computation.

For this purpose, we cannot use the command hadoop fs -put {local-file} {HDFS-path}: the graph would be uploaded as a single file, so all the data would be loaded by one computing process while the other processes simply wait for it to finish loading.

We remark that parallel loading only speeds up data loading and has no influence on the performance of graph computation. This is because, after all processes finish loading, they need to exchange vertices so that each vertex reaches the process it belongs to, which is decided by hashing the vertex ID.

To put a large graph onto HDFS, one needs to compile and run this data-putting program with two arguments: {local-file} {HDFS-path}. To avoid changing the makefile, one may rename put.cpp to run.cpp, call "make" to obtain the compiled program "run", and then rename "run" to "put" as a tool for putting large graphs onto HDFS, as shown below.
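
Concretely, assuming the sample makefile is placed alongside put.cpp, the renaming trick and an example invocation look as follows (the local file name is an illustrative assumption):

mv put.cpp run.cpp
make
mv run put
./put $HOME/web-Stanford_adj.txt /toy_out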

 

Process-to-Machine Mapping

Suppose that the cluster contains one master machine and N slave machines, where the i-th slave has hostname "slavei". We need to prepare a configuration file to specify the machines that a Quegel job runs on.

For example, assume that we only use the slave machines for running a distributed job; then we may create a file conf under $HOME/bfs with the following content:

slave1

slave2

......

slaveN

This file states that the processes of a job will be distributed evenly to all the N slave machines.
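
If the slaves follow this naming scheme, the file can also be generated with a short bash loop (N=16 is an example value):

N=16
for ((i=1; i<=N; i++)); do echo "slave$i"; done > $HOME/bfs/conf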

 

Program Distribution

Since the binary file run currently exists only on the master, it needs to be distributed to all the slave machines under the same path $HOME/bfs. For each slave slavei, run the following command:

[Make sure directory $HOME/bfs is already created on each slave]

scp $HOME/bfs/run {username}@slavei:$HOME/bfs

Alternatively, one may use a shell script like this one for program distribution, using the command ./distribute.sh $HOME/bfs/run N.
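
A minimal sketch of such a distribution script is given below. It assumes hostnames slave1, ..., slaveN, passwordless scp, and the same user name and directory layout on every machine; the referenced script may differ in details.

#!/bin/bash
# distribute.sh: copy a file to the same absolute path on slave1..slaveN
# usage: ./distribute.sh <file-path> <N>
FILE=$1
N=$2
for ((i=1; i<=N; i++)); do
    scp "$FILE" slave$i:"$FILE"
done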

 

Running the Server-Side Program

We remark that the application code is for the server-side program only. We may start the compiled server-side program by running the following command:

mpiexec -n number-of-processes -f process-to-machine-mapping-file compiled-binary-file other-arguments

In our case, we start the BFS query server as follows:

mpiexec -n N -f $HOME/bfs/conf $HOME/bfs/run

If the program reports Input path "/toy_out" does not exist!, check utils/ydhdfs1.h (for Hadoop 1.x) or utils/ydhdfs2.h (for Hadoop 2.x) and hardwire the hostname and port in getHdfsFs() to those configured in $HADOOP_HOME/conf/core-site.xml.
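
For reference, the configured hostname and port can be read off from the fs.default.name property (Hadoop 1.x) or fs.defaultFS (Hadoop 2.x), whose value typically has the form hdfs://master:9000 (hostname master, port 9000):

grep -A 1 'fs.default' $HADOOP_HOME/conf/core-site.xml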

Once the input graph is loaded, the server-side program keeps processing queries from client programs until a client program closes the server.

 

Preparing the Client-Side Program

In Quegel, the client-side program is just responsible for sending query strings to the Quegel server, and the code is fixed.

There are two types of clients, whose code can be downloaded from here: "console_version/run.cpp" is the client code that lets users type their queries in the console, while "batchFile_version/run.cpp" is the client code that lets users specify the name of a file containing a batch of queries.

After obtaining the compiled program "run", one may rename it to "client" or "batch", respectively, to serve as the client-side program.
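
For example, assuming each version is compiled with the same sample makefile as the server program (placed in its directory):

cd console_version
make && mv run client
cd ../batchFile_version
make && mv run batch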

 

Running the Client-Side Program

The first process of the Quegel server program is responsible for receiving queries from clients. In our example, this process runs on slave1, since slave1 is written in the first line of $HOME/bfs/conf.

Quegel uses the IPC message queue of Linux: a client program puts query strings into the queue, and the server program fetches them from it. Therefore, we should start a client on the machine that runs the first process of the distributed server program (i.e., slave1 in our example).

For the console client, one may start it as "./client" and then input query strings (e.g., "100 1000" to get the distance from vertex 100 to vertex 1000); the console running the server program will show the progress.
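
A console session may thus look as follows (the vertex IDs are from the example above; "exit" is explained under "Terminating Programs" below):

./client
100 1000
exit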

For the batch-file client, one may start it as "./batch inputfile_name C outputfile_name", where inputfile_name refers to the input file containing a batch of queries (one query per line), C is the maximum number of queries that are allowed to be processed simultaneously, and outputfile_name specifies the file for recording the processing time of each query (batch_out.txt if not explicitly specified). One may run this command multiple times to process different batches of queries.
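
For example, assuming a query file queries.txt with one query per line, and allowing at most 8 queries to be processed simultaneously (both the file name and the value 8 are illustrative):

./batch queries.txt 8

Since no output file is specified here, the per-query processing times are recorded in batch_out.txt.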

The dumped results of the queries can be found in the output folder on HDFS (i.e., "/ol_out" in our example).

 

Terminating Programs

To terminate the server program from a client, type "server_exit".

To terminate a client program from console, type "exit".

The server-side program may exit abnormally, due to an invalid query string or bugs in the query-processing algorithm. In this case, the IPC message queue is not freed on the machine running the first process (i.e., slave1 in our example), as can be observed by running the command ipcs.

Since a Quegel server will exit if it detects an already existing IPC message queue, one needs to free the leftover queue before restarting the server. One may either use the ipcrm command or run this script by calling "./kill.py".
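
For example, on slave1, one may list the message queues and remove the leftover one by its ID (replace <msqid> with the queue ID reported by ipcs):

ipcs -q
ipcrm -q <msqid>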