HODV4

From CAC Documentation wiki

Hadoop On Demand on V4

Hadoop On Demand (HOD) is a way to dynamically allocate a Hadoop cluster and submit jobs to it. It was primarily intended to allow allocation of small Hadoop clusters on an existing resource so that multiple users can share it, but it also provides an excellent means to dynamically allocate a Hadoop system on an existing HPC system. HOD can either 1) talk to an existing HDFS system, 2) start an already-deployed HDFS binary, or 3) deploy and start the HDFS binary itself. For this instance, we'll be using method 3, which requires no local installation of anything on the compute nodes (but requires the most things to be lined up correctly).

First of all, links to (sometimes sparse) documentation:

  • http://hadoop.apache.org/core/docs/r0.17.2/hod_admin_guide.html HOD Admin Guide : walks you through an overview of the HOD architecture, prerequisites, installing the various components and dependent software, and configuring HOD to get it up and running.
  • http://hadoop.apache.org/core/docs/r0.17.2/hod_user_guide.html HOD User Guide : covers how to get started running HOD, its various features, command line options, and detailed troubleshooting help.
  • http://hadoop.apache.org/core/docs/r0.17.2/hod_config_guide.html HOD Configuration Guide : discusses configuring HOD, describing the various configuration sections, parameters, and their purpose in detail.


To get started, we need a Hadoop tarball in our home directory. I'm working with hadoop-0.18.3 and have it in ~/hadoop/hadoop-0.18.3.tar.gz . Let's unpack it and copy hod out:


$ tar xvzf hadoop-0.18.3.tar.gz
$ cp -r hadoop-0.18.3/contrib/hod hod

NOTE: We'll refer to the hod directory as $HOD_HOME


Next, we'll need to set up the hodrc file, which contains all of the options HOD needs to run. Before we do that, let's examine what HOD has to do and what its components are.

  • A user calls "hod allocate", which submits a job to the local resource manager with a generated batch script; the script allocates the requested number of nodes and starts a Torque job there.
  • The master node of the job (mother superior) executes the generated batch script, which starts a process called Ringmaster, basically a master control app.
  • Ringmaster uses pbsdsh to start up processes on all of the nodes in the job. These processes are called hodrings.
  • Each hodring starts the HDFS and MapReduce daemons on its node and reports their locations back to the Ringmaster, which returns them to "hod allocate", which writes an XML file with all the necessary information to submit Hadoop jobs to the ring.

A stub hodrc file is provided in $HOD_HOME/conf/hodrc. Most of the values in it you can leave alone. Below, I've copied the default hodrc file (from 0.18.3) and shown the modifications made to get HOD working.

[hod]
stream                          = True
java-home                       = /usr/java/default  WAS ${JAVA_HOME}
cluster                         = nate_hod           WAS ${CLUSTER_HOME}
cluster-factor                  = 1.8
xrs-port-range                  = 32768-65536
debug                           = 3
allocate-wait-time              = 3600
temp-dir                        = /tmp/hod

[ringmaster]
register                        = True
stream                          = False
temp-dir                        = /tmp/hod
http-port-range                 = 8000-9000
work-dirs                       = /tmp/hod/1,/tmp/hod/2
xrs-port-range                  = 32768-65536
debug                           = 4                  WAS 3

[hodring]
stream                          = False
temp-dir                        = /tmp/hod
register                        = True
java-home                       = /usr/java/default  WAS ${JAVA_HOME}
http-port-range                 = 8000-9000
xrs-port-range                  = 32768-65536
debug                           = 4                  WAS 3

[resource_manager]
queue                           = v4                 WAS ${RM_QUEUE}
batch-home                      = /usr/local         WAS ${RM_HOME}
id                              = torque
env-vars                       = HOD_PYTHON_HOME=/usr/local/bin/python2.5      WAS commented out

[gridservice-mapred]
external                        = False
pkgs                            = ${HADOOP_HOME}
tracker_port                    = 8030
info_port                       = 50080

[gridservice-hdfs]
external                        = False
pkgs                            = ${HADOOP_HOME}
fs_port                         = 8020
info_port                       = 50070

NOTE: You'll notice I tend not to use environment variables in the config file. As a system administrator, I don't want to have to ensure users have the appropriate environment settings; that's just a personal choice. There are obviously cases where this approach isn't appropriate, but I want to be clear about my bias!

NOTE: The resource_manager:batch-home value may look odd. $TORQUE_HOME is probably something like /var/spool/torque, but what HOD actually wants is the directory such that $batch-home/bin/[qsub | qstat | qdel | pbsdsh | qalter] exists.


Start HOD

The hod binary is a little strange: it's a combined shell/Python script. It needs Python 2.5 and reads its location from an environment variable. You'll also need to identify the location of the hodrc conf file. In this example, we'll start up a 5-node Hadoop cluster for 60 minutes (3600 seconds). A successful allocation creates an XML file that describes the Hadoop cluster and allows submission of Hadoop jobs; we'll keep that in ~/hadoop/cluster.

$ export HOD_PYTHON_HOME=/usr/local/bin/python2.5
$ export HOD_CONF_DIR=${HOD_HOME}/conf
$ cd ${HOD_HOME}/bin
$ ./hod allocate -d ~/hadoop/cluster -n 5 -t ~/hadoop/hadoop-0.18.3.tar.gz -A acctName -l 3600
INFO - Cluster Id 2938.scheduler.v4linux
INFO - HDFS UI at http://compute-3-42.v4linux:54072
INFO - Mapred UI at http://compute-3-43.v4linux:52010
INFO - hadoop-site.xml at /home/gfs01/naw47/hadoop/cluster
$ ./hod list -d ~/hadoop/cluster
INFO - alive 2938.scheduler.v4linux /home/gfs01/naw47/hadoop/cluster

At this point, we have a working Hadoop cluster that we can submit jobs to. We can make sure everything works by loading a little data into DFS and running one of the example programs provided by Hadoop.

$ export HADOOP_CONF_DIR=~/hadoop/cluster
$ cd hadoop-0.18.3/bin
$ ./hadoop fs -mkdir input
$ ./hadoop fs -mkdir output
$ ./hadoop fs -put ~/somefile input/words
$ ./hadoop jar ~/path/to/hadoop-0.18.3-examples.jar wordcount input/words output/wordcount
$ ./hadoop fs -get output/wordcount ~/wc_results

NOTE: Make sure you export HADOOP_CONF_DIR to point at the cluster directory (the path you named with the -d parameter to hod allocate). Otherwise, everything will act like it's working, but you'll be doing all your work locally!

Once you're done working with Hadoop, you'll want to shut down the cluster (otherwise, the job continues to run and you'll continue to get charged!).

$ cd $HOD_HOME/bin
$ ./hod list -d ~/hadoop/cluster
INFO - alive 2938.scheduler.v4linux /home/gfs01/naw47/hadoop/cluster
$ ./hod deallocate -d ~/hadoop/cluster/
$ ./hod info -d ~/hadoop/cluster/
CRITICAL - Invalid hod.clusterdir(--hod.clusterdir or -d). /home/gfs01/naw47/hadoop/cluster : Not tied to any allocated cluster