Compiling and Running Gtimer Programs from the Console

 

We assume that a cluster is already set up as in the tutorial for Gtimer deployment.

We now show how to run our system and a user application from the console.

Download the system code (seven directories) and extract it to a location of your choice, such as $HOME/Gtimer. This is the root directory of our system code.

Users can compile our system in the tgs/ directory using the command make, and run the program using the mpiexec command. Then, users have three choices for analyzing the temporal graph:

(1) Users can compile client.cpp in the client/ directory and run the client program; the program will prompt you to input your query. When you input a query, the system handles it and the result is dumped to HDFS.

(2) Users can put a list of queries in a file and run them in batch using batchClient.cpp provided in the client/ directory.

(3) In the frontend/ directory, users can write their own application code based on the high-level operators provided by Gtimer.

 

Upload Temporal Graphs

To run the program, we need to put the temporal graph data under the HDFS path /toyFolder. Download the toy graph described on the download page and put it onto HDFS as follows:

hadoop fs -mkdir /toyFolder

hadoop fs -put toy.txt /toyFolder
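To check that the data is in place, you can list the folder with the standard HDFS command:

hadoop fs -ls /toyFolder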

(For large data, we need to use Gtimer's put program instead, which will be introduced shortly.)

 

Compilation

In the tgs/ directory, we need to write a makefile that refers to the libraries of (1) HDFS and (2) the Gtimer system.

Download the makefile sample and put it under the application directory $HOME/Gtimer/tgs/.

In the sample makefile, the places surrounded by brackets need to be specified.

Specifically, if you are using a 64-bit (or respectively, 32-bit) Linux, replace Lines 2 to 4 with PLATFORM=Linux-amd64-64 (or respectively, PLATFORM=Linux-i386-32).
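If you are unsure which platform you are on, the standard command uname -m prints the machine architecture (x86_64 indicates a 64-bit system; i686 or i386 indicates a 32-bit one):

uname -m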

Also replace [Input the path for system code directory] with the system code directory, which is $HOME/Gtimer in our case.

Then, use the command make to compile the source code into the binary file Gtimer.

The default makefile is also included in the downloadable system code, but it may not be compatible with your machine.
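Putting the compilation steps together, a typical run looks like the following (using the example path $HOME/Gtimer):

cd $HOME/Gtimer/tgs
make
ls Gtimer    # the compiled binary Gtimer should now be present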

 

Process-to-Machine Mapping

Suppose that the cluster contains one master machine and N slave machines, where the i-th slave has hostname "slavei". We need to prepare a configuration file to specify how many computing processes are to run on each machine.

For example, let us create a file conf under $HOME/Gtimer/tgs/ with the following content:

master:1

slave1:4

slave2:4

......

slaveN:4

This file states that the master machine runs only one process (i.e., Process 0 which is the master process), while each slave runs 4 processes.
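For instance, with N = 2 slaves the conf file would contain:

master:1

slave1:4

slave2:4

This amounts to 1 + 2 × 4 = 9 processes in total; this total is the number-of-processes value passed to mpiexec below.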

 

Program Distribution

Since the binary file Gtimer currently exists only on the master, it is necessary to distribute it to all the slave machines under the same path $HOME/Gtimer/tgs/. For each slave slavei, run the following command:

[Make sure directory $HOME/Gtimer/tgs/ is already created on each slave]

scp $HOME/Gtimer/tgs/Gtimer {username}@slavei:$HOME/Gtimer/tgs/Gtimer

Alternatively, one may use a shell script (such as distribute.sh) for program distribution, running it with the command ./distribute.sh $HOME/Gtimer/tgs/Gtimer.
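The distribution script itself is not reproduced on this page; a minimal sketch of such a distribute.sh, assuming slave hostnames slave1 ... slaveN and passwordless SSH under the same username on every machine, could look like this:

#!/bin/bash
# Usage: ./distribute.sh <file>
# Copies <file> to the same path on every slave machine.
FILE=$1
N=2            # number of slave machines (adjust to your cluster)
for i in $(seq 1 $N); do
    ssh slave$i "mkdir -p $(dirname $FILE)"    # make sure the target directory exists
    scp $FILE slave$i:$FILE
done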

 

Running the System

Finally, use the following command to run the compiled program:

mpiexec -n number-of-processes -f process-to-machine-mapping-file compiled-binary-file /input_path /output_path

In our case, we run Gtimer as follows, where {number-of-processes} is the total number of processes specified in conf (i.e., 1 + 4N for the mapping above):

mpiexec -n {number-of-processes} -f $HOME/Gtimer/tgs/conf $HOME/Gtimer/tgs/Gtimer /toyFolder /output_path

Sometimes, automatic HDFS binding may fail and the program will report Input path "/toyFolder" does not exist! In this case, you need to edit the system source file utils/ydhdfs.h to hardwire the connection: in the function getHdfsFS(), change hdfsConnect("default", 0) to hdfsConnect({your_NameNode_IP}, {your_NameNode_port}), using the values configured in $HADOOP_HOME/conf/core-site.xml.

Note: Since Gtimer provides indexes to support efficient and scalable operations on a large temporal graph, it takes some additional time to process the input temporal graph before querying, as discussed below.

 

Querying Gtimer with One Query at a Time

When Gtimer is running in the background, users can run the client program in the client/ directory on worker1 to query the temporal graph one query at a time.

Users can compile client.cpp in the client/ directory and run the client program with the command ./client from that directory. The program will prompt you to input your query; simply type the query on the command line, and the system will handle it and dump the result to HDFS.

The input of a query begins with an integer that represents the id of the query type. For example, if we want to ask an earliest-arrival query, we begin the query with 7, which indicates that this is an earliest-arrival query.

 

Querying Gtimer with a Batch of Queries

When Gtimer is running in the background, users can run the batchClient program in the client/ directory on worker1 to query the temporal graph with a batch of queries.

Users can compile batchClient.cpp in the client/ directory, put a list of queries in a file such as input_file, and then run the batchClient program with the command ./batchClient input_file. Gtimer will handle the queries one by one and dump the results to HDFS.

The query format is the same as above.

 

Querying Gtimer with User-defined Application

Additionally, we also provide user-friendly APIs for users to query temporal graphs.

Users can design their own applications based on Gtimer's high-level APIs; refer to Gtimer's High-level APIs for the API details. Users can put their code in the frontend/ directory and compile it using the command make. Then, simply use the command ./main to run the program and analyze the temporal graph.
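Assuming the application code has been placed in the frontend/ directory and Gtimer is already running, the steps reduce to the following (using the example path $HOME/Gtimer):

cd $HOME/Gtimer/frontend
make      # compile the user application against Gtimer's high-level APIs
./main    # run the application on the temporal graph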

 

Putting Large Files to HDFS

Gtimer requires that large graph data be partitioned into smaller files under the same data folder, so that these files can be loaded by different computing processes during graph computation.

To achieve this goal, we cannot use the command hadoop fs -put {local-file} {HDFS-path}; otherwise, the whole data file is loaded by one computing process while the other processes simply wait for it to finish loading.

We remark that parallel loading only speeds up data loading and has no influence on the performance of graph computing. This is because, after all processes finish data loading, they need to exchange vertices so that each vertex reaches its owning process, which is determined by hashing the vertex ID.

To put large graph data onto HDFS, one needs to compile and run Gtimer's data-putting program with two arguments: {local-file} and {HDFS-path}.
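For example, assuming the compiled data-putting binary is named put and is run from the directory containing it (the file and folder names below are placeholders), uploading a large graph would look like:

./put large_graph.txt /largeFolder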