Pregel+ Deployment in a Multi-Node Cluster

 

Dependencies

Pregel+ is implemented in C/C++, and the current version runs on Linux. Pregel+ reads/writes data from/to files on the Hadoop Distributed File System (HDFS) through libhdfs, and sends messages using MPI.

We require the following dependencies.

Unless stated otherwise, the following operations are performed on each machine in the cluster.

 

G++ Installation

If G++ is not installed on your Linux system, install it first. The command depends on your Linux distribution.

If you are using Ubuntu, install G++ using the following command.

sudo apt-get install g++

[type your root password]

If you are using Fedora, install G++ using the following command.

su -c "yum install gcc-c++"

[type your root password]
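You can verify the installation by checking the compiler version:

g++ --version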

 

MPI Installation

We now show how to deploy MPICH3 on each machine in the cluster. MPICH3 is highly recommended over other MPI implementations such as OpenMPI, due to its superior performance.

Download MPICH3 from mpich.org and extract the contents of the MPICH package to a temporary location before compiling the source code.

cd {the-directory-containing-downloaded-mpich-package}

tar xzf mpich-3.1.tar.gz

cd mpich-3.1

Choose an installation directory for MPICH3, such as /usr/local/mpich, then configure, compile, and install it.

./configure --prefix=/usr/local/mpich

[optionally add --disable-f77 --disable-fc to skip the Fortran bindings, e.g., if no Fortran compiler is installed]

make

sudo make install [Ubuntu]   /   su -c "make install" [Fedora]

[type your root password]

Add the following lines to the end of the file $HOME/.bashrc. Here, $HOME can be used interchangeably with /home/{username} and ~.

export MPICH_HOME=/usr/local/mpich

export PATH=$PATH:$MPICH_HOME/bin

Apply the new settings with the command source $HOME/.bashrc.
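As a quick sanity check (after sourcing $HOME/.bashrc), verify that the MPICH binaries are found and that mpiexec can launch processes locally:

which mpicc mpiexec
mpiexec -n 4 hostname
[the local hostname should be printed 4 times]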

 

JDK Installation

Hadoop requires a working Java 5+ (aka Java 1.5+) installation. We now describe the installation of OpenJDK 7.

sudo apt-get install openjdk-7-jdk [Ubuntu]   /   su -c "yum install java-1.7.0-openjdk-devel" [Fedora]

[type your root password]

It is also necessary to add the following lines to the end of the file $HOME/.bashrc.

[For 64-bit Linux]

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server

[For 32-bit Linux]

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/i386/server

Apply the new settings with the command source $HOME/.bashrc.
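A quick check that the JDK and the environment variables are in place; the last command verifies that libjvm.so, which libhdfs will need later, can be found under LD_LIBRARY_PATH:

java -version
echo $JAVA_HOME
ls $JAVA_HOME/jre/lib/*/server/libjvm.so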

 

SSH Configuration

Both Hadoop and MPI require password-less SSH connections between the master and all the slaves. If an SSH server is not installed on your machines, install it first.

sudo apt-get install openssh-server [Ubuntu]   /   su -c "yum install openssh-server" [Fedora]

If the SSH server is not running, start it with the command sudo /etc/init.d/ssh start.

Suppose that the cluster contains one master machine with IP "192.168.0.1" and hostname "master", and N slave machines, where the i-th slave has IP "192.168.0.(i+1)" and hostname "slavei". Then, on each machine, we need to add the following lines to the end of the file /etc/hosts.

192.168.0.1 master

192.168.0.2 slave1

192.168.0.3 slave2

......

192.168.0.(N+1) slaveN
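You can verify that the hostnames resolve correctly, for example:

ping -c 1 slave1
[the output should show that slave1 resolves to 192.168.0.2]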

We now show how to configure the password-less SSH connection. First, create an RSA key pair with an empty password on the master:

ssh-keygen -t rsa

[Press "Enter" for all questions]

This creates the files id_rsa and id_rsa.pub under the default directory $HOME/.ssh/. Then, we append the public key id_rsa.pub to the file authorized_keys:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Finally, copy the file authorized_keys to the directory $HOME/.ssh of all slaves using "scp".

scp $HOME/.ssh/authorized_keys {username}@slave1:$HOME/.ssh/

[type {username}'s password]

scp $HOME/.ssh/authorized_keys {username}@slave2:$HOME/.ssh/

[type {username}'s password]

......

scp $HOME/.ssh/authorized_keys {username}@slaveN:$HOME/.ssh/

[type {username}'s password]
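Note that the directory $HOME/.ssh must already exist on each slave (create it with mkdir -p $HOME/.ssh if necessary). Alternatively, the standard ssh-copy-id utility copies the key to the remote authorized_keys (creating the remote $HOME/.ssh if needed); a sketch assuming the hostnames configured above:

for host in slave1 slave2 slaveN    # list all your slaves here
do
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub {username}@$host
done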

Now, you should be able to connect from the master to any machine using SSH without being asked for a password.

ssh master

[no password is required]

exit

ssh slave1

[no password is required]

exit

 

Hadoop Deployment

Download Hadoop (a stable version like Hadoop 1.2.1 is preferred) to the master and extract the contents of the Hadoop package to /usr/local. We will copy it to the slaves using "scp" after configuration.

tar xzf hadoop-1.2.1.tar.gz

sudo mv hadoop-1.2.1 /usr/local [Ubuntu]   /   su -c "mv hadoop-1.2.1 /usr/local" [Fedora]

[type your root password]

sudo chown -R {username}:{usergroup} /usr/local/hadoop-1.2.1 [Ubuntu]   /   su -c "chown -R {username}:{usergroup} /usr/local/hadoop-1.2.1" [Fedora]

[type your root password]

In the above setting, /usr/local/hadoop-1.2.1 is the root directory of Hadoop, which we call HADOOP_HOME. Make sure that {username} is the owner of HADOOP_HOME itself and of all files under it. One can find the group(s) of {username} using the command groups {username}.

It is also necessary to add the following lines to the end of the file $HOME/.bashrc.

export HADOOP_HOME=/usr/local/hadoop-1.2.1

export PATH=$PATH:$HADOOP_HOME/bin

[For 64-bit Linux] export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-amd64-64/lib

[For 32-bit Linux] export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-i386-32/lib

for i in $HADOOP_HOME/*.jar

do

    CLASSPATH=$CLASSPATH:$i

done

for i in $HADOOP_HOME/lib/*.jar

do

    CLASSPATH=$CLASSPATH:$i

done

export CLASSPATH

Apply the new settings with the command source $HOME/.bashrc. To save effort, one may edit the file $HOME/.bashrc only on the master, copy it to the slaves using "scp", and then connect to each slave using "ssh" and source the file there, as sketched below.
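A possible sketch of this, assuming the hostnames slave1, ..., slaveN configured earlier and that password-less SSH is already set up:

for host in slave1 slave2 slaveN    # list all your slaves here
do
    scp $HOME/.bashrc {username}@$host:$HOME/.bashrc
done
# the new settings take effect in subsequent login shells on each slave;
# in an already-open session there, run "source $HOME/.bashrc" manually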

Next, we need to configure the files under the folder $HADOOP_HOME/conf. Referring to the cluster setting described earlier, the content of the file $HADOOP_HOME/conf/masters should be updated to:

master

The content of the file $HADOOP_HOME/conf/slaves should be updated to:

slave1

slave2

......

slaveN

Add the JAVA_HOME environment variable to the end of the file $HADOOP_HOME/conf/hadoop-env.sh.

[For 64-bit Linux]

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

[For 32-bit Linux]

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

Create a directory to hold the data of HDFS (Hadoop Distributed File System), such as /app/hdfs, and set the required ownership and permissions for {username}.

sudo mkdir -p /app/hdfs [Ubuntu]   /   su -c "mkdir -p /app/hdfs" [Fedora]

[type your root password]

sudo chown {username}:{usergroup} /app/hdfs [Ubuntu]   /   su -c "chown {username}:{usergroup} /app/hdfs" [Fedora]

sudo chmod 755 /app/hdfs [Ubuntu]   /   su -c "chmod 755 /app/hdfs" [Fedora]

Add the following properties inside the <configuration> element of the file $HADOOP_HOME/conf/core-site.xml:

<property>

    <name>hadoop.tmp.dir</name>

    <value>/app/hdfs</value>

</property>

<property>

    <name>fs.default.name</name>

    <value>hdfs://master:9000</value>

</property>

This indicates that HDFS data are stored under /app/hdfs, and that the HDFS master (aka the NameNode) resides on master at port 9000.
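For reference, after this edit the whole file should look roughly as follows (the <configuration> element is already present in the file shipped with Hadoop):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hdfs</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>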

Add the following property inside the <configuration> element of the file $HADOOP_HOME/conf/mapred-site.xml:

<property>

    <name>mapred.job.tracker</name>

    <value>hdfs://master:9001</value>

</property>

This specifies the location of the JobTracker, which manages MapReduce/Giraph jobs.

Add the following property inside the <configuration> element of the file $HADOOP_HOME/conf/hdfs-site.xml:

<property>

    <name>dfs.replication</name>

    <value>3</value>

</property>

This configuration asks HDFS to keep three replicas of each data block, so that data are not lost even if a machine goes down. Note that the replication factor should not exceed the number of DataNodes (i.e., slaves) in the cluster.

Other Hadoop properties can also be configured in these files according to your needs. Now that Hadoop is properly configured on the master, we use "scp" to copy it to the slaves. For each slave slavei, we perform the following operations on the master (a scripted version for clusters with many slaves is sketched after the commands):

(The commands below are for Ubuntu; on Fedora, use su -c "command" instead.)

sudo scp -r /usr/local/hadoop-1.2.1 {username}@slavei:$HOME

[type your root password]

ssh slavei

sudo mv hadoop-1.2.1 /usr/local/

[type your root password]

sudo chown -R {username}:{usergroup} /usr/local/hadoop-1.2.1

sudo mkdir -p /app/hdfs

sudo chown {username}:{usergroup} /app/hdfs

sudo chmod 755 /app/hdfs

exit
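If the cluster has many slaves, the per-slave steps above can be scripted from the master. A rough sketch (Ubuntu commands, hostnames as configured earlier; scp needs no sudo here because {username} already owns the Hadoop files):

for host in slave1 slave2 slaveN    # list all your slaves here
do
    scp -r /usr/local/hadoop-1.2.1 {username}@$host:~
    ssh -t {username}@$host "sudo mv ~/hadoop-1.2.1 /usr/local/ && \
        sudo chown -R {username}:{usergroup} /usr/local/hadoop-1.2.1 && \
        sudo mkdir -p /app/hdfs && \
        sudo chown {username}:{usergroup} /app/hdfs && \
        sudo chmod 755 /app/hdfs"
done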

Before we start the newly configured Hadoop cluster, we must format HDFS via the NameNode. Type the following command on master to format HDFS:

hadoop namenode -format

Warning: formatting an existing Hadoop cluster erases all data in HDFS!

Now, you are ready to start the Hadoop cluster. To start both HDFS and JobTracker, type the following command on master:

$HADOOP_HOME/bin/start-all.sh

To check whether the Hadoop cluster started properly, you may use the "jps" command. On the master, you should see the processes "Jps", "SecondaryNameNode", "JobTracker", and "NameNode". On a slave, you should see "DataNode", "Jps", and "TaskTracker". You can now access http://master:50070 to view the status of HDFS, and http://master:50030 to view the status of MapReduce/Giraph jobs.
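For example, running jps on the master should print something like the following (the process IDs are illustrative and will differ on your machines):

12005 NameNode
12284 SecondaryNameNode
12393 JobTracker
12598 Jps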

To stop the Hadoop cluster, type the following command on master:

$HADOOP_HOME/bin/stop-all.sh

If you are not using Hadoop MapReduce or Giraph, but only Pregel+, there is no need to start the JobTracker; you only need to start HDFS using the following command:

$HADOOP_HOME/bin/start-dfs.sh
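Once HDFS is up, a simple smoke test is to create a directory, list the root directory, and print the cluster report (the directory name /test is just an example):

hadoop fs -mkdir /test
hadoop fs -ls /
hadoop dfsadmin -report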

You may stop HDFS later on using the following command:

$HADOOP_HOME/bin/stop-dfs.sh