Tips
Queue for fast throughput
A quick-turnaround queue is available for high-throughput jobs that need 10 minutes or less.
Prerequisites: You must have version 0.3.1 or 0.3.2 of the client software installed. It can be downloaded from:
http://www.cac.cornell.edu/matlab/downloads/
Limits: 8 cores, 10 minutes. Submission will fail if a job requests more than either limit.
Usage: Issue the command ClusterInfo.setQueueName('Quick'); before you submit your job. The setting applies to all subsequent job submissions and, like setWallTime, persists even after you restart MATLAB.
Example:
>> ClusterInfo.setWallTime(10);
>> ClusterInfo.setQueueName('Quick');
To return to the normal queue, issue:
>> ClusterInfo.setQueueName('Default');
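The limits above can be checked before a queue is chosen. The following is a minimal sketch, not part of the client software: pickQueue is a hypothetical helper that returns 'Quick' only when a request fits within 8 cores and 10 minutes, and 'Default' otherwise. The limit values come from the text above; the function name and its arguments are assumptions made for illustration.

```matlab
function queueName = pickQueue(nCores, wallMinutes)
% PICKQUEUE  Hypothetical helper (not part of the CAC client software).
% Returns 'Quick' when the request fits the quick-queue limits stated
% above (8 cores, 10 minutes), and 'Default' otherwise.
maxCores   = 8;    % quick-queue core limit
maxMinutes = 10;   % quick-queue wall-time limit, in minutes
if nCores <= maxCores && wallMinutes <= maxMinutes
    queueName = 'Quick';
else
    queueName = 'Default';
end
end
```

Saved as pickQueue.m, it could be used as ClusterInfo.setQueueName(pickQueue(4, 10)); before submitting a job.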
Queue for GPU nodes
Notes:
- In the current configuration, the GPU queue works only for distributed jobs, not for pool/parallel jobs. The GPU queue is configured to allow only one job per node, which prevents scheduling problems.
- You must first request access to the GPU node pool before you can submit jobs.
- You must be running R2011a.
Create a function that uses the GPU. For example, this simple function compares the time to compute the FFT of a vector of 100 million random numbers on the CPU and on the GPU.
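function [nongpu,gpu] = gputest()
% Time an FFT of 100 million random numbers on the CPU and on the GPU.
t = rand(100000000,1);
tic; fft(t); nongpu=toc;                          % CPU time in seconds
tic; gt = gpuArray(t); gather(fft(gt)); gpu=toc;  % GPU time in seconds, including transfers to and from the device
end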
Specify the GPU queue, then create and run the job. The result shows that, for a vector of 100 million numbers, the GPU FFT is more than twice as fast as the CPU version (about 2.5 seconds versus 7.5 seconds in this run).
>> ClusterInfo.setQueueName('GPU');
>> sched = findResource('scheduler','configuration','cacscheduler');
>> j = createJob(sched);
>> createTask(j,@gputest,2);
>> j.FileDependencies = {'gputest.m'};
>> submit(j);
>> wait(j);
Downloading completed job: Job1.
>> j.getAllOutputArguments
ans =
[7.4863] [2.5106]
Email notification of job status
The cac_sendmail function lets you send email from your job. For example, you may want to send an email when your job begins and ends. Note that you must replace the 'from address' and 'to address' placeholders with valid email addresses.
Example for a distributed job:
Add a cac_sendmail call to your job as a standalone task to get an email notification when the job starts, and another one for when it ends.
t = createTask(j,@cac_sendmail,0,{'from address','to address','job started'});
. . .
t = createTask(j,@cac_sendmail,0,{'from address','to address','job ending'});
Example for a parallel job:
For a parallel job, it's best to call cac_sendmail from only one worker; otherwise you’ll get one message for each core requested.
if labindex == 1
    cac_sendmail('from address','to address','parallel job started');
end
Renew your certificate - it expires after 12 hours.
A certificate is required for all operations, and it expires after 12 hours. If your certificate expires during a run, you will see:
Certificate not valid:certificate expired on 20100222090150GMT+00:00
??? Error while evaluating TimerFcn for timer 'timer-1'
There is a workaround to get a new certificate in mid-stride. If you see this message, issue these two lines:
>> CM = edu.cornell.cac.tuc.littlejohn.globus.CertManager.getInstance();
>> CM.clearCreds();
This brings up the dialog to enter your username and password again, after which operations will continue to work.
Then call again:
>> waitForJobState(jobobject);
We expect to have a more robust solution soon.
As an alternative for long-running jobs, one that also shortens execution time a little, note that you don’t need to leave MATLAB open polling the status function every 10 seconds for 18 hours. You can close MATLAB, restart it in the morning, and then check on the state of your job.
See LITTLEJOHNHOME/examples/cacNonBlocksubmit.m for an example.
All users should use the full path to remote files
All users should use the full path to their home directory when saving files, so that they know the destination and can retrieve the files later.
The full path to your home directory is \\matlabstorage01\matlab\username for Windows, or /home/matlab/username for Linux.
For example:
>> save(['\\matlabstorage01\matlab\abc23\' 'fileout.mat'], 'datamat'); % save a file
>> copyDataFromCAC(pwd,'~/fileout.mat'); % retrieve the file
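Building the full path in one place helps avoid typos in the storage location. The following is a minimal sketch, not part of the client software: remoteFile is a hypothetical helper, and its name and arguments are assumptions for illustration. The two root locations are the ones given in the text above.

```matlab
function p = remoteFile(username, filename, useWindowsPath)
% REMOTEFILE  Hypothetical helper (not part of the CAC client software).
% Builds the full path to a file in a user's home directory, using the
% Windows or Linux form of the location described above.
if useWindowsPath
    p = ['\\matlabstorage01\matlab\' username '\' filename];
else
    p = ['/home/matlab/' username '/' filename];
end
end
```

Saved as remoteFile.m, it could be used as save(remoteFile('abc23','fileout.mat',true), 'datamat'); with the last argument chosen to match the platform, as described above.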
Be sure the software is in your path each time you start MATLAB
Each time you start MATLAB, remember to set the path to the software. You can issue
>> addpath('LITTLEJOHNHOME'); % replace LITTLEJOHNHOME with the appropriate path
Or you can use the File | Set Path menu in MATLAB. Be warned that you will need administrative privileges.
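Another way to avoid retyping the addpath command is to put it in a startup.m file, which MATLAB runs automatically at startup when the file is on the search path or in the startup folder. The following is a minimal sketch under that assumption; 'LITTLEJOHNHOME' remains a placeholder for the actual install folder, and the variable name and warning identifier are chosen here for illustration.

```matlab
% startup.m: MATLAB runs this file automatically when it starts.
% 'LITTLEJOHNHOME' is a placeholder; replace it with the actual folder.
littlejohnDir = 'LITTLEJOHNHOME';
if exist(littlejohnDir, 'dir') == 7
    addpath(littlejohnDir);   % add the client software to the path
else
    warning('startup:missingDir', 'Folder not found: %s', littlejohnDir);
end
```

With the placeholder left unchanged, the script only prints the warning, which serves as a reminder to fill in the real location.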
How to use a config file rather than cacsched
What if I prefer to use a config file instead of running cacsched?
- Import CACconfig.mat into MATLAB using Parallel | Manage Configurations | File | Import (e.g., browse to C:\Projects\matlab\CACconfig.mat)
- Change >username< in three places
- Change the number of workers in three places (optional)
- Run a test job with CACconfig.mat selected. Enter your Certificate username/password when prompted.
Error: Used createTask call with incorrect number of input arguments
The MATLAB worker will fail silently if an incorrect number of input arguments is provided in the createTask call. If you have jobs that seem to start OK but then fail right away with no error message and no CommandWindowOutput, double-check that the number of arguments in the createTask call matches the function signature.
For example:
t=createTask(j, @myFunc, 1, {1,2,3,4}); % 4 input arguments and 1 output argument
function out = myFunc(arg1,arg2,arg3,arg4)
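Because the failure is silent, it can help to compare the argument counts on the client before creating the task. The following is a minimal sketch, not part of the client software: checkTaskArgs is a hypothetical helper that relies on the standard nargin(fcn) and nargout(fcn) queries, which return the number of declared inputs and outputs (negative when the function uses varargin or varargout). It assumes the function file is visible on the client path.

```matlab
function checkTaskArgs(fcn, numOut, inputArgs)
% CHECKTASKARGS  Hypothetical pre-flight check (not part of the CAC
% client software). Raises an error if the argument counts intended for
% createTask do not match the declared signature of fcn.
declaredIn  = nargin(fcn);    % negative if fcn uses varargin
declaredOut = nargout(fcn);   % negative if fcn uses varargout
if declaredIn >= 0 && numel(inputArgs) ~= declaredIn
    error('checkTaskArgs:inputCount', ...
          'Function declares %d inputs but %d were supplied.', ...
          declaredIn, numel(inputArgs));
end
if declaredOut >= 0 && numOut > declaredOut
    error('checkTaskArgs:outputCount', ...
          'Function declares %d outputs but %d were requested.', ...
          declaredOut, numOut);
end
end
```

For the example above, checkTaskArgs(@myFunc, 1, {1,2,3,4}); would pass, while supplying only three input values would raise an error before anything is submitted.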
Error: Used fast throughput queue with invalid core or time limits
If you have set ClusterInfo.setQueueName to Quick and you set ClusterInfo.setWallTime to more than 10 minutes, or request more than 8 cores, submission will fail with an error like this:
Submission of Task1 failed, retrying in 5 seconds...
??? Error using ==> distcomp.genericscheduler.pSubmitJobCommon at 64
Job submission did not occur because the user supplied SubmitFcn (cacNonSharedSimpleSubmitFcn)
errored.
Error using ==> cacNonSharedSimpleSubmitFcn at 200
LITTLEJOHN:JobSubmission Failure
Error in ==> distcomp.genericscheduler.pSubmitJobCommon at 48
feval(submitFcn, scheduler, job, setprop, args{:});
Error in ==> distcomp.genericscheduler.pSubmitJob at 16
scheduler.pSubmitJobCommon( job, scheduler.SubmitFcn );
Error in ==> distcomp.abstractjob.submit at 48
scheduler.pSubmitJob(job);
Error in ==> cacpwd at 5
submit(j);