Hadoop on Demand under Moab
Hadoop on Demand (HOD) is a system for dynamically creating and using Hadoop clusters. As distributed it is designed to integrate with the Torque resource manager, but it is relatively easy to make it work with other resource managers. This document describes how to integrate with Moab and provides the needed source changes to do so. If you install this, please drop us a line and let us know how things went. We'd be happy to help you get it working if it lets us clean up any bugs in our modules.
NOTE: If you use Torque as the resource manager under Moab, you might be able to run HOD stock, but it depends on your configuration. The information provided here interfaces with Moab directly (as opposed to using the Torque PBS/style tools), so it WILL work with any reasonable Moab installation.
If you don't really care how this happens, but just want to get HOD working under Moab, follow these directions. A better description of what actually happens is provided below. They are enjoyable and a better understanding of how HOD works will make for a better administrator. Really. You'll need to rename the files when you download them, I named them so you would be less likely to overwrite them as two of them are moab.py! While we get storage configured for this wiki, these links are external, so sorry about the jump!
- Download sch_moab.py and place in hod/hodlib/Schedulers/moab.py
- Download np_moab.py and place in hod/hodlib/NodePools/moab.py
- Replace the existing hod/hodlib/Common/nodepoolutil.py with nodepoolutil.py
- In your hodrc file set resource_manager.id to "moab" (it's "torque" by default)
- You may want to take a quick look at the HODV4 page for instructions on configuring HOD in general.
HOD is reasonably modular and provides an interface to Torque commands through the hodlib/Scheduler/torque.py classes. In order to work for Moab, we'll need to provide a new Schedulers interface and register this name with hodlib/Common/nodepoolutil.py. I've added a dynamic import of modules based on the resource_manager.id provided in hodrc, which maps to classes added to the distribution. This allows the addition of other resource manager interfaces, without editing any existing HOD code. So replacing this file makes it much easier to add more modules.
from hodlib.NodePools.torque import TorquePool class NodePoolUtil: def getNodePool(nodePoolDesc,cfg, log): """returns a concrete instance of a NodePool as configured by 'cfg'""" npd = nodePoolDesc name = npd.getName() if name == 'torque': return TorquePool(npd,cfg,log) #We're going to try a dynamic import here based on the name, this may fail else: try: modName = "hodlib.NodePools.%s" % name className = "%s%sPool" % (name.upper(),name[1:].lower()) mod = __import__(modname,fromlist=[classname]) klass = getattr(mod,className) obj = klass(npd,cfg,log) return obj except: #Just a catch-all except, if we didn't create this class, we're going to die log.critical("Unable to dynamically create Resource Manager interface for '%s'" % name) sys.exit() getNodePool = staticmethod(getNodePool)
Next we turn to the scheduler interface in hodlib/Scheduler/torque.py. This basically provides an interface to basic job submission and querying commands (qsub,qdel, qstat, qalter) and the pbs distributed shell pbsdsh. pbsNodes is also provided but it not ever used, so we can ignore it. We create a Scheduler/moab.py that maps to the Moab commands that perform the same function.
- qsub --> msub
- qdel --> mjobctl -c
- qstat --> checkjob
- qalter --> mjobctl
qsub -> msub msub exactly replicates the qsub functionality, so we don't actually have to make any changes to this call, we just point qsub to msub. Both qsub and moab output the jobID to STDOUT after job submission, but with slightly different formats. Moab returns a shortened ID (1234), whereas qsub returns a "namespaced" id (1234.clustername). I had some problems when using the shortened ID, so after job submission, I use checkjob to lookup the full name and return the full name instead of the short name.
qdel --> canceljob The interface supports qdel jobid and qdel -p jobid (the force option). mjobctl -c jobid performs a job cancel, but no force option is available. This shouldn't really matter, the force cancel of a job is pretty bad mojo anyways and is only needed in extreme cases where a node/mom has died during a job. So we just map both qdel calls to mjobctl -c.
qstat --> checkjob The torque interface calls qstat -f -1 jobid, but this is the same as qstat -f jobid (-1 is not compatible with -f, at least on my version of Torque), which lists all information about the job in a <KEY>=<VALUE> output. However, HOD only uses the 'job_state' and 'exec_host' values from this. HOD is expecting PBS-style states which are single capital letters where Q=Queued, R=Running, W=WAIT, etc. However, it only actually wants to know whether it's queued or not. Moab returns more complicated state, which will be one of [Idle, Starting, Running, Completed]. We can just map Running to 'R', Idle/Starting/etc to 'Q', and Completed to none. Unfortunately, the checkjob -A jobid which returns a simple <KEY>=<VALUE>; list doesn't contain the node list, so we have to parse the checkjob --format=xml jobid information to extract those two fields.
qalter --> mjobctl HOD's qalter command is a qalter -W which sets extended attributes on the job. It's not used for setting any core job attributes (setting/removing holds etc), so this effects things like attempting to set dependencies, setting stage-in/stage-out, etc. It's referenced in a the checklimits.sh batch script, but not anywhere else. mjobctl -m <ATTR>=<VAL> allows access to this type of thing, so I've simply set it to that and I'll wait for a bug report around this.
pbsdsh This leaves pbsdsh, which is a bit tricky. pbsdsh is a nice tool that grabs the $PBS_NODEFILE and executes a requested command on all of the nodes in that list. However, this isn't really a resource manager job per se. As an external scheduler Moab certainly has no functionality to do this. There are a number of ways of dealing with this:
- dsh is a general purpose tool that could be used to perform the same function.
- call multiple ssh commands to all hosts
- use mpiexec to spawn the processes on remote nodes.
For the moab module, I elected to just use a simple ssh method. On almost any imaginable HPC system where this would be used, this should work. The correct answer is to remove this from the resource manager responsibility and allow seperate selection of the method. This is a bit messy as you've got to spawn a daemon process over ssh (perhaps mpiexec IS the way to go), which in order to work you have to redirect all of the connected streams. You also have to provide the $HOD_PYTHON_HOME directory so that you ensure that hodring can find python2.5. All in all, you're line ends up looking like this:
/usr/bin/ssh compute-0-0 "env HOD_PYTHON_HOME=/usr/local/bin/python2.5 hodring -options </dev/null 1>/dev/null 2>/dev/null & "
This hasn't been tested on a non-Torque Moab installation. It is equally valid to write modules against the resource manager Moab is using on a given cluster rather than this, but the hope here is to not NEED to do that.