Red Cloud with MATLAB/Tutorial

Introduction

This tutorial assumes that you’ve already installed the CAC MATLAB client software according to the installation instructions and successfully run cac_runtests. The client software enables your MATLAB client to run your code on Red Cloud, which runs up to 64 simultaneous MATLAB sessions on 64 separate cores.

Basic Workflow

Within a MATLAB session, submitting jobs to Red Cloud with MATLAB follows a general pattern:

  1. The user selects cacscheduler as the MATLAB parallel scheduler. This causes parallel/distributed jobs to be submitted to Red Cloud with MATLAB for execution, rather than run on the local machine.
    • Installing the CAC client code makes this scheduler available for use.
  2. A CAC execution queue is selected for job submission.
    • By default, the Default queue is assumed by the CAC client.
    • Different queues can be chosen by executing ClusterInfo.setQueueName('QUEUE_NAME'). New jobs will be submitted to this queue until another one is selected.
    • Queues may have different execution policies or use different hardware resources.
      • The Quick queue has a policy where jobs must be limited to 10 minutes of execution or less. That is to say, the user must specify ClusterInfo.setWallTime(10) in order to be allowed into this queue. In return, these short-lived jobs may be prioritized over jobs with a higher execution time limit.
      • The GPU queue executes on machines that have GPU hardware available to accelerate processing.
  3. Individual jobs are created, configured, and submitted by the user.
    • Jobs may be composed of one or more individual tasks.
    • If jobs consist of one's own MATLAB code, all files and associated job dependencies should be uploaded in advance to Red Cloud via gridFTP().
  4. Job status and results are stored in Red Cloud indefinitely until destroyed by the user using the destroy(job) function. A user can log in and view his or her past and present jobs from any client at any time.
    • The exact ending time of a job depends on the nature of the tasks that have been defined within it and the available resources.
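
The four steps above can be sketched as one short session. This is only a minimal sketch: it assumes cacscheduler is already the active scheduler, that 'Quick' is the exact name string of the Quick queue, and that my_function.m is a hypothetical file of your own. Every function called here appears elsewhere in this tutorial.

```matlab
% Step 2: choose a queue (the Quick queue requires a 10-minute wall time).
% Assumption: the queue name string is 'Quick'.
ClusterInfo.setQueueName('Quick');
ClusterInfo.setWallTime(10);

% Step 3: upload your code, then create, configure, and submit a job.
cacftp = gridFTP();
cacftp.put('my_function.m', 'my_function.m');   % hypothetical file name
job = createJob();
createTask(job, @my_function, 1, {42});         % 1 output, 1 input argument
submit(job);

% Step 4: wait for completion, fetch the results, then remove the stored job.
wait(job);
results = getAllOutputArguments(job);
destroy(job);
```

Calling destroy(job) is optional; as noted above, job status and results otherwise remain on Red Cloud until you remove them.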

Creating your jobs

The MATLAB documentation describes a wide array of options for running your code with the Parallel Computing Toolbox (PCT). The execution strategy depends on how jobs are configured. There are three recommended choices for creating jobs:

  1. createJob (Distributed Job). Submits tasks independently to available processing nodes. Typically used for parametric sweeps (running the same code with different inputs).
  2. createMatlabPoolJob (Pool Job). For code that requires one of the workers to distribute work to the other workers. Such code would be run locally with the batch command and might include, e.g., parfor loops.
  3. createParallelJob (Parallel Job). For code that would be run locally in an spmd block or in pmode. The code typically makes use of explicit message passing functions such as labBarrier, labindex, etc. Such code could also define one or more codistributed arrays as a means of handling data too large to fit into the memory of any one machine.

Most of our users have found that Distributed Jobs work best for running parametric sweeps. Users who had been using a parfor loop have opted to convert their code from Pool Jobs to Distributed Jobs. One advantage of a Distributed Job is that it tends to be scheduled ahead of Pool Jobs on the CAC cluster. This tutorial covers Distributed and Pool job submission.
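
As a rough illustration of the conversion from a parfor loop to a Distributed Job mentioned above, the sketch below writes the same parametric sweep both ways. It uses the built-in sqrt as a stand-in for real work so that nothing needs to be uploaded; the variable names are illustrative only, and cacscheduler is assumed to be the active scheduler.

```matlab
params = [1 4 9 16];

% Pool-style version: a single body of code containing a parfor loop.
out_parfor = zeros(size(params));
parfor k = 1:numel(params)
    out_parfor(k) = sqrt(params(k));
end

% Distributed-style version: each parameter value becomes an independent task.
job = createJob();
for k = 1:numel(params)
    createTask(job, @sqrt, 1, {params(k)});   % 1 output, 1 input argument
end
submit(job);
wait(job);

% getAllOutputArguments returns one cell row per task; flatten to a row vector.
out_tasks = cell2mat(getAllOutputArguments(job))';
```

The loop body moves into the task function, and the loop index becomes the task's input argument.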

Simple Distributed Job

For this example we’ll consider a trivial function that takes any input and waits 10 seconds before returning the input value. We will run this job locally, then compare with distributed execution on Red Cloud.

First, create a file simple_distributed_job.m containing the following:

function output = simple_distributed_job(input)
pause(10);
output = input;
end

Running simple_distributed_job in your local MATLAB window should yield:

>> simple_distributed_job(1)

ans =

     1

>> 

To run this code on Red Cloud, we first need to upload the “simple_distributed_job.m” file to the CAC server. To upload code to the server, use the CAC-provided gridFTP command.

>> cacftp = gridFTP();
>> cacftp.put('simple_distributed_job.m','simple_distributed_job.m');
>> cacftp.list('simple_distributed_job.m'); %verify the upload
file  userID       79 20-Mar-2011 17:20:26 simple_distributed_job.m
>> 

To run on the CAC cluster we need to use the createJob function. Here is an example of running the same code on the CAC cluster:

>> job = createJob();
>> task = createTask(job, @simple_distributed_job, 1, {1});
>> submit(job);
>> wait(job);
Downloading completed job: Job60.
>> getAllOutputArguments(job)

ans = 

    [1]

>>

This by itself is not that interesting, but suppose we want to run simple_distributed_job 10 times in a row. Running locally in serial, this should take 10 x 10 = 100 seconds.

>> tic; for i=1:10; simple_distributed_job(1); end; toc
Elapsed time is 100.003877 seconds.
>> 

Using a distributed job, we can make it faster.

>> job = createJob();
>> for i=1:10
createTask(job,@simple_distributed_job, 1, {i});
end
>> tic; submit(job); wait(job); toc
Downloading completed job: Job61.
Elapsed time is 69.754727 seconds.

The distributed job finishes in about 30% less time than the serial run. One might expect a better speed-up, but there is overhead associated with submitting the job and retrieving the results from the remote server. Calling getAllOutputArguments on the job object yields the results of the 10 tasks.

>> getAllOutputArguments(job)

ans = 

    [ 1]
    [ 2]
    [ 3]
    [ 4]
    [ 5]
    [ 6]
    [ 7]
    [ 8]
    [ 9]
    [10]

>> 
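
The timings reported above can be turned into a speed-up and a rough overhead estimate. The sketch below only does arithmetic on the two measured times; the overhead figure rests on the assumption, not confirmed by the output above, that all 10 tasks ran at the same time.

```matlab
t_serial      = 100.003877;   % seconds, serial loop on the local machine
t_distributed = 69.754727;    % seconds, submit + wait for the 10-task job
t_task        = 10;           % seconds of pause() inside each task

speedup   = t_serial / t_distributed;        % about 1.43x
reduction = 1 - t_distributed / t_serial;    % about 0.30, i.e. 30% less time

% Assumption: all 10 tasks ran concurrently, so the ideal wall time would be
% a single task's 10 seconds; everything beyond that is treated as overhead.
overhead = t_distributed - t_task;           % about 59.75 seconds

fprintf('speedup %.2fx, reduction %.0f%%, overhead %.1f s\n', ...
        speedup, 100*reduction, overhead);
```

Under that assumption, most of the 70 seconds is spent on submission, scheduling, and result retrieval, so tasks that run much longer than 10 seconds should see a speed-up closer to the number of tasks.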

If you need help with any of the commands please be sure to make use of MATLAB’s built-in help function.

>> help gridFTP
   gridFTP is a CAC object that provides simple stateless access to the Red Cloud
   storage facility to enable you to examine file in your home directory as
   well as upload and download files to the system.  This tool is primarily
   useful for uploading or downloading single files or for ensuring the
   location of files on the storage system.

Running a pool job

For this example we’ll consider another trivial function that waits for 100 seconds before returning the input value. Create a file simple_pool_job.m containing the following:

function output = simple_pool_job(input)
parfor i=1:100
  pause(1);
end
output=input
end

Running the function locally without a pool, we would expect it to take 100 seconds, because the parfor loop then executes serially.

>> tic; simple_pool_job(1); toc

output =

     1

Elapsed time is 100.098521 seconds.

Running with two MATLAB labs would result in a speed-up.

>> matlabpool local 2
Starting matlabpool using the 'local' configuration ... connected to 2 labs.
>> tic; simple_pool_job(1); toc

output =

     1

Elapsed time is 50.275047 seconds.
>> matlabpool close
Sending a stop signal to all the labs ... stopped.
>>

Don’t forget to run “matlabpool close” after running this example.

As with the distributed job, the first step in running on the CAC cluster is to upload a copy of the “simple_pool_job.m” file to the CAC server (see the distributed job example).

To run on the CAC cluster we need to use the createMatlabPoolJob function from the MATLAB PCT. Here is an example of running the same code on the CAC cluster:

>> ClusterInfo.setQueueName('Default');
>> job = createMatlabPoolJob();
>> task = createTask(job, @simple_pool_job, 1, {1});
>> job.MinimumNumberOfWorkers = 8;
>> job.MaximumNumberOfWorkers = 8;
>> tic; submit(job); wait(job); toc
Downloading completed job: Job64.
Elapsed time is 89.065123 seconds.

But wait, the CAC cluster is slower than running it locally with two labs! Again, there is overhead involved in submitting the job and retrieving the results. In this example, 7 labs were running in the cluster. This is one fewer than the number of requested workers, because one worker oversees the work done on the other 7.

Note: if we were to submit this same job to the local configuration, we would get 8 labs for the pool job. The reason for this difference is that the MATLAB GUI is already running on the local machine, so there is no need to assign one worker as an overseer. (Try it—but don’t forget to switch back to cacscheduler when you’re done!)
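
To see how much of the measured time is overhead, the pool timings above can be compared against an idealized model. The sketch below assumes, as a simplification, that parfor splits the 100 iterations evenly across the labs and that nothing else costs time; the variable names are illustrative.

```matlab
n_iter = 100;   % parfor iterations in simple_pool_job
t_iter = 1;     % seconds per iteration (pause(1))

% Idealized model (an assumption): wall time = ceil(n_iter / labs) * t_iter.
ideal = @(labs) ceil(n_iter / labs) * t_iter;

t_local2 = ideal(2);    % 50 s; the measured local time with 2 labs was 50.28 s
t_cloud7 = ideal(7);    % 15 s; 8 workers requested, 7 of them execute the loop

measured_cloud = 89.065123;                  % seconds, from the cluster run above
overhead_cloud = measured_cloud - t_cloud7;  % roughly 74 s not spent in the loop
```

Under this model the cluster run only pays off when the loop's serial run time is much larger than the roughly 74 seconds of overhead.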

Simple Debugging

Here are a few examples of common problems and how to detect them. One common mistake is forgetting to upload your code. This will result in something like:

>> job = createJob();
>> createTask(job, @does_not_exist_function, 1, {});
>> submit(job);
>> wait(job);
Downloading completed job: Job65.
>> job

job =

Job ID 65 Information
=====================

                  UserName : apb18
                     State : finished
                SubmitTime : Tue Oct 25 13:16:00 GMT-05:00 2011
                 StartTime : Tue Oct 25 14:16:22 EDT 2011
          Running Duration : 0 days 0h 0m 1s

- Data Dependencies

          FileDependencies : {}
          PathDependencies : \\matlabstorage01.cac.cornell.edu\matlab\apb18

- Associated Task(s)

           Number Pending  : 0
           Number Running  : 0
           Number Finished : 1
          TaskID of errors : 1
>>

Note the last line "TaskID of errors". This indicates that one of the tasks had an error. Inspecting the task object associated with the job, we can see details of the error.

>> job.tasks(1)

ans =

Task ID 1 from Job ID 65 Information
====================================

                     State : finished
                  Function : @does_not_exist_function
                 StartTime : Tue Oct 25 14:16:22 EDT 2011
          Running Duration : 0 days 0h 0m 1s

- Task Result Properties

           ErrorIdentifier : MATLAB:UndefinedFunction
              ErrorMessage : Undefined function or variable 'does_not_exist_function'.
>>
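
For jobs with many tasks, the same inspection can be done in a loop. The helper below is a hypothetical convenience function, not part of the CAC client; it relies only on the tasks field and on the ErrorMessage and ErrorIdentifier properties shown in the output above. Save it as report_task_errors.m.

```matlab
function failed = report_task_errors(job)
% REPORT_TASK_ERRORS  Print the error of every failed task in a job.
%   Hypothetical helper, not part of the CAC client. Returns the indices
%   of the tasks whose ErrorMessage is non-empty.
tasks  = job.tasks;              % same field used in job.tasks(1) above
failed = [];
for k = 1:numel(tasks)
    msg = tasks(k).ErrorMessage;
    if ~isempty(msg)
        fprintf('Task %d: %s (%s)\n', k, msg, tasks(k).ErrorIdentifier);
        failed(end+1) = k; %#ok<AGROW>
    end
end
end
```

After wait(job) returns, calling report_task_errors(job) lists each failing task so you do not have to inspect the task objects one by one.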

Long running jobs

Does your MATLAB computation require hours or even days to run? You'll be glad to know you can exit MATLAB completely and re-open it later to check on the status of a job that takes a long time to complete. For the sake of example, let’s create a job that waits for 5 minutes then returns. For this example we will just call the built-in pause function and will not upload any code.

>> job = createJob();
>> createTask(job, @pause, 0, {300});
>> submit(job);
>> pause(10);
>> job

job =

Job ID 66 Information
=====================

                  UserName : apb18
                     State : running
                SubmitTime : Tue Oct 25 13:24:46 GMT-05:00 2011
                 StartTime : 
          Running Duration : 

- Data Dependencies

          FileDependencies : {}
          PathDependencies : \\matlabstorage01.cac.cornell.edu\matlab\apb18

- Associated Task(s)

           Number Pending  : 1
           Number Running  : 0
           Number Finished : 0
          TaskID of errors : 
>> 

Note that the job is in the state “running”. It is now possible to exit MATLAB and start a new session. To retrieve the job in the new session, use the findResource and findJob functions. Remember to record the “Job ID” before exiting the previous session.

>> sched = findResource();
>> job = findJob(sched, 'Name', 'Job66')

job =

Job ID 66 Information
=====================

                  UserName : apb18
                     State : running
                SubmitTime : Tue Oct 25 13:24:46 GMT-05:00 2011
                 StartTime : 
          Running Duration : 

- Data Dependencies

          FileDependencies : {}
          PathDependencies : \\matlabstorage01.cac.cornell.edu\matlab\apb18

- Associated Task(s)

           Number Pending  : 1
           Number Running  : 0
           Number Finished : 0
          TaskID of errors : 
>> 
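
The retrieval step can be wrapped in a short check that you run each time you reopen MATLAB. This is a minimal sketch using only functions shown in this tutorial; the jobId variable is illustrative and must hold the Job ID you recorded earlier.

```matlab
jobId = 66;                                   % recorded in the earlier session
sched = findResource();
job   = findJob(sched, 'Name', sprintf('Job%d', jobId));

if strcmp(job.State, 'finished')
    results = getAllOutputArguments(job);     % cell array, one row per task
else
    fprintf('Job%d is still %s; check again later.\n', jobId, job.State);
end
```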

If you decide later that you do not want the job to complete you can cancel the job using the cancel function.

>> cancel(job);
>> job

job =

Job ID 66 Information
=====================

                  UserName : apb18
                     State : finished
                SubmitTime : Tue Oct 25 13:24:46 GMT-05:00 2011
                 StartTime : 
          Running Duration : 

- Data Dependencies

          FileDependencies : {}
          PathDependencies : \\matlabstorage01.cac.cornell.edu\matlab\apb18

- Associated Task(s)

           Number Pending  : 0
           Number Running  : 0
           Number Finished : 1
          TaskID of errors : 1
>> 

If you forgot the “Job ID” you can inspect the “jobs” field of the “sched” object that was returned by findResource (the following output is for R2011b and newer).

>> sched.jobs

ans =

    Jobs: 46-by-1
    =============

  #  Job ID        State       FinishTime  UserName  #tasks
 ----------------------------------------------------------
  1       1       queued                -     apb18       2
  2       2       queued                -     apb18       2
  3       5     finished  Oct 21 13:47:11     apb18       2
  4       8     finished  Oct 21 12:50...     apb18       2
  5       9     finished  Oct 21 12:50...     apb18       2
  6      10     finished  Oct 21 13:50:54     apb18       1
  7      13     finished  Oct 21 13:02...     apb18       2
  8      14     finished  Oct 21 13:03...     apb18       2
  9      15     finished  Oct 21 14:04:00     apb18       1
 10      18     finished  Oct 21 13:06...     apb18       2
 11      19     finished  Oct 21 13:07...     apb18       2
 12      20     finished  Oct 21 14:08:13     apb18       1
 13      23     finished  Oct 21 13:10...     apb18       2
 14      24     finished  Oct 21 13:11...     apb18       2
 15      25     finished  Oct 21 14:11:49     apb18       1
 16      28     finished  Oct 21 13:14...     apb18       2
 17      29     finished  Oct 21 13:14...     apb18       2
 18      30     finished  Oct 21 14:15:22     apb18       1
 19      33     finished  Oct 21 13:17...     apb18       2
 20      34     finished  Oct 21 13:18...     apb18       2
 21      35     finished  Oct 21 14:18:42     apb18       1
 22      38     finished  Oct 24 10:44...     apb18       2
 23      39     finished  Oct 24 10:45...     apb18       2
 24      40     finished  Oct 24 11:45:27     apb18       1
 25      43     finished  Oct 24 10:51...     apb18       2
 26      44     finished  Oct 24 10:51...     apb18       2
 27      45     finished  Oct 24 11:51:49     apb18       1
 28      46     finished  Oct 24 13:16:41     apb18       1
 29      47     finished  Oct 24 13:21:02     apb18       1
 30      48     finished  Oct 24 13:27:17     apb18      20
 31      49     finished  Oct 24 13:31:28     apb18      10
 32      50       failed                -     apb18       8
 33      51       failed                -     apb18       8
 34      54       queued                -     apb18     101
 35      55     finished  Oct 24 15:11...     apb18       8
 36      56     finished  Oct 24 16:20:21     apb18       8
 37      57     finished  Oct 24 16:04...     apb18       8
 38      58     finished  Oct 24 17:05:55     apb18       8
 39      59      pending                -     apb18       1
 40      60     finished  Oct 25 13:27:41     apb18       1
 41      61     finished  Oct 25 13:35:42     apb18      10
 42      62     finished  Oct 25 13:02...     apb18       8
 43      63     finished  Oct 25 13:05...     apb18       8
 44      64     finished  Oct 25 13:10...     apb18       8
 45      65     finished  Oct 25 14:16:23     apb18       1
 46      66     finished  Oct 25 13:29...     apb18       1

 
>> 
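
Instead of scanning the table by eye, the job list can also be filtered by state. The sketch below assumes that findJob accepts a 'State' property-value pair with cacscheduler in the same way it accepts 'Name' above; the variable names are illustrative.

```matlab
sched    = findResource();
finished = findJob(sched, 'State', 'finished');   % assumption: 'State' filter is supported
running  = findJob(sched, 'State', 'running');
fprintf('%d finished, %d running\n', numel(finished), numel(running));
```

The returned job objects can then be inspected individually, or passed to destroy(job) once their results are no longer needed.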

Next Steps

We recommend that you work through this guide to understand the basic mechanics of submitting jobs to the Red Cloud with MATLAB resource. If you have any further questions, please contact: help@cac.cornell.edu