Red Cloud with MATLAB/Tips

From CAC Documentation wiki
Revision as of 16:23, 4 April 2016 by Srl6 (talk | contribs) (merged last edit with the stdout and stderr info already present)

E-mail notification of job status

The cac_sendmail function allows you to send email from your job. For example, you may want to send an email when your job begins and ends. Note that you must supply a valid email address in each of the two places indicated.

Example for a distributed job

Begin your job with the following standalone task to get email notification when it starts:

t = createTask(j,@cac_sendmail,0,{'from address','to address','job started'});
% . . . define further tasks that actually do the work . . .

By appending a final call to cac_sendmail in your task function(s), you can also get email notification when each task ends. Additional logic would be required in order to make whichever task finishes last send a single email that the whole job has ended.
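One way to get a per-task completion email without editing each task function is to wrap it. The sketch below is an illustration only: runThenNotify is a hypothetical helper, not part of the CAC software, and the notifier is passed in as a function handle so that the same wrapper can call cac_sendmail on the cluster or a stand-in during local testing.

```matlab
function varargout = runThenNotify(workFcn, notifyFcn, varargin)
% runThenNotify  Hypothetical wrapper (not part of the CAC software).
%   Runs workFcn(varargin{:}) and passes its outputs through unchanged,
%   then calls notifyFcn with a short completion message.
%   If workFcn throws an error, no notification is sent.
[varargout{1:nargout}] = workFcn(varargin{:});
notifyFcn(sprintf('task finished: %s', func2str(workFcn)));
end
```

On the client, the wrapper might be used along these lines, where myWork and x stand in for your own task function and its input:

notify = @(msg) cac_sendmail('from address','to address',msg);
t = createTask(j,@runThenNotify,1,{@myWork,notify,x});

The file runThenNotify.m would have to be made available to the workers in the same way as any other task function.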

Example for a parallel job

For a parallel job, it's best to call cac_sendmail from only one worker; otherwise you'll get one message for each core requested.

if labindex == 1
    cac_sendmail('from address','to address','parallel job started');
end

Since a parallel job consists of only one task, it is easier to add similar commands at the end of the task function to send email notification when the job ends.

Recovering standard job data from a PCT job

Whenever you use the Parallel Computing Toolbox, your PCT jobs produce a standard set of files or "job data" in a standard location determined by sched.DataLocation. The job data include one or more standard error (.er) and standard out (.ou) files, plus an assortment of .mat files. The contents of these files are often helpful for troubleshooting problems that arise.

But there's a wrinkle: when your jobs are scheduled on a distributed computing resource like Red Cloud, the worker processes and the client do not share the same file system. The job data must be synced back to the client from time to time. Usually, the CAC client-side software will take care of this for you automatically.

However, sometimes job data (and other, user-defined output from a job) must be recovered manually—perhaps because the job failed in such a way that it did not return data, or perhaps because it completed successfully but the files did not transfer back for some reason. To recover the data from such a "lost" job, CAC provides a number of scripts to simplify the information retrieval process. Most of these can be found in the contrib folder. For details, type "help commandname" at the prompt, or simply read the relevant .m files.

Note that in order to use many of these special commands, the job object "job" must already be defined in your current MATLAB client session. If the job object was defined in a previous MATLAB session, it can be retrieved as follows. This example retrieves the job information for a job named "Job66".

>> sched = findResource('scheduler', 'configuration', 'cacscheduler')
>> job = findJob(sched, 'Name', 'Job66')

Ordinarily you can trigger a download of the job files, including the job's output and error messages, just by making a request for

job.State  % this request may need to be repeated one time

or

wait(job);

Then you can simplify the process of viewing all your messages by using the CAC-provided scripts getErrors and getOutput.

getErrors

Normally MATLAB will capture any errors from the task functions and store them in a set of objects of class MException. These are incorporated into your job object job as part of the output that is normally downloaded for all tasks in job. The function getErrors collects and pretty-prints the error messages and stack traces contained in job. If you want to look at the message and the stack for just a single task, rather than for all of the tasks together, this can be done manually as follows:

>> getErrors(job);            % all tasks, pretty-printed
>> ts = get(job,'Tasks');
>> es = get(ts(1),'Error');   % the MException for task 1 only
>> es.message
>> es.stack(1)

getOutput

This function returns the command window output of all tasks in a job. As with getErrors, this output is retrieved from Red Cloud and stored in the local job object; then getOutput collects and pretty-prints the output for you.

>> getOutput(job);    % nicely prefaced with task numbers
>> at = get(job,'Tasks')
>> out = get(at,'CommandWindowOutput')    % not prefaced with task numbers 

For this command to be useful, your job must have been submitted with CaptureCommandWindowOutput enabled for all its tasks. If this property is true for a given task, then MATLAB will return the results of fprintf statements and other console output (e.g., from statements lacking a semicolon) from that task to your client.

alltasks = get(job, 'Tasks');
set(alltasks, 'CaptureCommandWindowOutput', true)

Stdout, stderr

When debugging, it is often useful to examine the stdout and stderr files. These can hold stack trace information from the MATLAB processes (rather than exceptions from within MATLAB itself). If these files do not get downloaded automatically as part of the normal job output, or via downloadJob, you can retrieve them using gridFTP. Let's say Job 36 in your R2012a client has failed for unknown reasons. You can use gridFTP to check the directory R2012a/Job36 (and its subdirectories, if any) and look for files ending with .er (error) and .ou (output). Use "get" to retrieve the files and examine their contents for clues.

  • A parallel or pool job named JobN produces just two files, called JobN.ou and JobN.er for standard output and error files respectively, in the JobN folder.
  • In a distributed job, each defined task stores its own output and errors. Therefore, in the JobN folder, examine TaskM.ou and TaskM.er to see the stdout and stderr messages associated with task M.
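The naming rules above can be turned into remote paths for gridFTP. The following is a minimal sketch: the release folder, job number, and task count are example values, and the assumption that each path begins with the release folder (as in the R2012a/Job36 example) should be checked against your own account.

```matlab
release  = 'R2012a';   % example client release folder
jobNum   = 36;         % example job number
numTasks = 3;          % example task count for a distributed job

jobDir = sprintf('%s/Job%d', release, jobNum);

% Parallel or pool job: a single pair of files named after the job.
parallelFiles = {sprintf('%s/Job%d.ou', jobDir, jobNum), ...
                 sprintf('%s/Job%d.er', jobDir, jobNum)};

% Distributed job: one pair of files per task.
distributedFiles = cell(numTasks, 2);
for m = 1:numTasks
    distributedFiles{m, 1} = sprintf('%s/Task%d.ou', jobDir, m);
    distributedFiles{m, 2} = sprintf('%s/Task%d.er', jobDir, m);
end

disp(parallelFiles');
disp(distributedFiles);
```

Each resulting string can then be supplied as the remote file name in a gridFTP "get" call.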


downloadJob

This function retrieves files from a job that either hung or failed in such a way that the files can't be downloaded correctly.

downloadJob(sched,job)

If you try this and see an error message that a certain zip file can't be found, then repeat the command with the forceNoZip option, which is selected via an optional argument:

downloadJob(sched,job,0,1)

The forceNoZip option takes longer, but it may succeed if the other way fails.

gridFTP

This function opens an ftp connection to Red Cloud for manual retrieval of specific files. It has uses beyond recovering the stdout and stderr files as described above.

  • You can use gridFTP to retrieve a data file, if your job created one in your Red Cloud with MATLAB home directory.
  • You may just want to use gridFTP to shortcut downloadJob, which downloads all of the files (several per task). If you need to spot the error in a large parallel job, for example, you can get just Task1.out.mat and it will very likely contain the exception you need, because the error is almost always found in either all Tasks or the master (Task1).
>> ftp = gridFTP();
>> ftp.get('T1out.mat','Job4/Task1.out.mat');
>> ftp.close();

As you may have guessed, these files are where the job object—and thereby the getErrors(job) and getOutput(job) functions—get their information after the job is downloaded.

  • For a distributed job, Task1.out.mat simply contains the output for the first task, Task2.out.mat for the second task, etc.
  • A parallel job is only allowed to have one task, so Task1.out.mat contains (confusingly) the output for the first lab, Task2.out.mat for the second lab, etc.
  • For a pool job, Task1.out.mat has the output for the master task, while Task2.out.mat has the output for the first lab, etc.

Remember, gridFTP does not have a "cd" command, so complete paths (relative to home) must be supplied; see "help gridFTP" for full details.

Common errors

Silent errors due to incorrect task parameters

There are some conditions under which a MATLAB job appears to fail silently. Perhaps the most common is specifying the wrong number of arguments in a task. Continuing the simple_distributed_job.m example from the tutorial (Red_Cloud_with_MATLAB/Tutorial#Simple Distributed Job), suppose we make an error:

>> job = createJob();
>> task = createTask(job, @simple_distributed_job, 1, {}); %The function expects one argument, but we give it none!
>> submit(job)
>> wait(job)
Downloading completed job: Job10.
>> getAllOutputArguments(job)

ans = 

   Empty cell array: 1-by-0

As you can see, there is no obvious error message, yet we know this did not succeed. Individual tasks can be explored directly by examining the contents of the job object. In particular, the job object contains a Tasks array that holds detailed information from the evaluation of each task:

>> job.Tasks(1)

ans =

Task ID 1 from Job ID 10 Information
====================================

                     State : finished
                  Function : @simple_distributed_job
                 StartTime : Tue Dec 06 13:03:44 EST 2011
          Running Duration : 0 days 0h 0m 10s

- Task Result Properties

           ErrorIdentifier : MATLAB:minrhs
              ErrorMessage : Not enough input arguments.
               Error Stack : \\matlabstorage01.cac.cornell.edu\matlab\apb18\simple_d...

Now we can see the nature of the error: "Not enough input arguments."
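With many tasks, checking each one by hand becomes tedious. The sketch below loops over a task array and reports every task whose ErrorMessage is not empty. Note that listTaskErrors is a hypothetical helper, not one of the CAC-provided scripts (getErrors, described earlier, serves a similar purpose); it assumes only that each element exposes the ErrorIdentifier and ErrorMessage properties shown in the listing above.

```matlab
function failed = listTaskErrors(tasks)
% listTaskErrors  Hypothetical helper (not part of the CAC software).
%   Prints one line for each task with a nonempty ErrorMessage and
%   returns the indices of those tasks as a row vector.
failed = [];
for k = 1:numel(tasks)
    msg = tasks(k).ErrorMessage;
    if ~isempty(msg)
        fprintf('Task %d: [%s] %s\n', k, tasks(k).ErrorIdentifier, msg);
        failed(end+1) = k; %#ok<AGROW>
    end
end
end
```

After wait(job) returns, a call such as failed = listTaskErrors(job.Tasks) would flag the task in the example above, even though getAllOutputArguments gave no hint of a problem.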

Insufficient time

The following error was caused by accidentally setting ClusterInfo.setWallTime(10) for a job that was supposed to take much longer than 10 minutes:

>> wait(job);
Downloading completed job: Job36.
>> job

job =

Job ID 36 Information
=====================
                  UserName : xxxx
                     State : failed
                             ...
           Number Finished : 1
          TaskID of errors : 1
>> job.tasks(1)

ans =

Task ID 1 from Job ID 36 Information
====================================

                     State : failed
                             ...
           ErrorIdentifier : distcomp:task:Canceled
              ErrorMessage : job ended unexpectedly - see messages from oth
                           : er Tasks, or contact help@cac for further assi
                           : stance: The task's job has been canceled. Plea
                           : se see the Job details for additional informat
                           : ion.

To prevent this error, make sure ClusterInfo.getWallTime() returns a result that is easily long enough to allow your job to run to completion.
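A simple habit is to pad your own run-time estimate before submitting. The sketch below is an illustration under stated assumptions: paddedWallTime is a hypothetical helper, the estimate and safety factor are values you choose, and ClusterInfo.getWallTime() is assumed to return a number of minutes (or empty if unset), matching the minutes passed to setWallTime in the example above.

```matlab
function minutes = paddedWallTime(estimatedMinutes, currentMinutes, safetyFactor)
% paddedWallTime  Hypothetical helper (not part of the CAC software).
%   Returns the wall time to request: the current setting if it already
%   covers the padded estimate, otherwise the padded estimate itself.
needed = ceil(estimatedMinutes * safetyFactor);
if isempty(currentMinutes) || currentMinutes < needed
    minutes = needed;
else
    minutes = currentMinutes;
end
end
```

Before submit(job), it might be applied like this for a job expected to take about 90 minutes:

ClusterInfo.setWallTime(paddedWallTime(90, ClusterInfo.getWallTime(), 1.5));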

Wrong project name

If you set ClusterInfo.setProjectName incorrectly, or to a valid project name which does not list your username as a project member, then any jobs you submit will return an error:

Error using ==> distcomp.genericscheduler.pSubmitJobCommon at 64
Job submission did not occur because the user supplied SubmitFcn
(cac_distributedSubmitFcn) errored.

Error using ==> cac_distributedSubmitFcn at 190
Failed to pass job submission filter. Error: [...etc.]

The "job submission filter" checks to make sure the project name is an acceptable one for your username.

Inappropriate parameters for queue

If you set ClusterInfo.setWallTime to more than 10 minutes and you have set ClusterInfo.setQueueName to 'Quick', job submission will fail:

Error using distcomp.genericscheduler/pSubmitJobCommon (line 64)
Job submission did not occur because the user supplied SubmitFcn
(cac_distributedSubmitFcn) errored.

Error using cac_distributedSubmitFcn (line 211)
Job template validation failed: The value of property RuntimeSeconds is out of range.
Update the job and try again.

Error in distcomp.genericscheduler/pSubmitJobCommon (line 48)
    feval(submitFcn, scheduler, job, setprop, args{:});

Error in distcomp.genericscheduler/pSubmitJob (line 16)
scheduler.pSubmitJobCommon( job, scheduler.SubmitFcn );

Error in distcomp.abstractjob/submit (line 48)
    scheduler.pSubmitJob(job);



Java errors

The MATLAB user interface is built on Java, as is a portion of the CAC client-side software; therefore, you may occasionally see Java errors in the console.

Errors referring to "swing" or "awt"

Errors of this type mean that MATLAB has encountered a problem using one of Java's graphics libraries. Most often, such errors come from trying to run a GUI or show a plot on the compute nodes. This is not allowed. The solution is to comment out any commands that are trying to display something.
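As an alternative to commenting out display commands, the code can test whether it is running in an interactive desktop session. This sketch assumes that usejava('desktop') returns false on the compute nodes, which is the usual behavior for MATLAB workers but has not been verified on Red Cloud; the data and variable names are placeholders.

```matlab
x = linspace(0, 2*pi, 50);
y = sin(x);

hasDesktop = usejava('desktop');   % true only in an interactive MATLAB desktop
if hasDesktop
    plot(x, y);                    % a display is available
    title('sin(x)');
else
    % No display: skip the plot and report a summary as text instead.
    [yMax, iMax] = max(y);
    fprintf('max(y) = %.4f at x = %.4f\n', yMax, x(iMax));
end
```

Written this way, the same file runs unchanged on the client and in a submitted job.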

NullPointerException at java.util.logging.Logger

This error message comes from your local installation of MATLAB. It stems from security updates to Java that were issued in early 2013. Please see this link for the official fix from MathWorks:

http://www.mathworks.com/matlabcentral/answers/62496

Authentication failure due to SSL errors, etc.

In setting up your Red Cloud with MATLAB configuration, you may get an error message like this during authentication, which typically involves the com.claymoresystems.ptls Java classes:

Authentication failed. Caused by Failure unspecified at GSS-API level.
Caused by COM.claymoresystems.ptls.SSLThrewAlertException: Bad certificate
(java.security.NoSuchAlgorithmException: no such algorithm:
SHA-1/RSA/PKCS#1 for provider Cryptix)
at COM.claymoresystems.ptls.SSLConn.alert(SSLConn.java:243)
[...]
edu.cornell.cac.cacscheduler.tools.CertReg.registerCert(CertReg.java:215)
[2013-04-24 13:34:00,770] Failed to open connection!

Errors of this type are often due to problems with your javaclasspath, which is set in the classpath.txt file. The following steps will help you to diagnose the problem from your MATLAB client:

>> javaclasspath  % lists the paths of all of the jar files that MATLAB knows about
>> which classpath.txt  % are you using the classpath.txt file you expected?
>> updateContrib();
>> cacCheckClassPath();  % tests that classpath.txt is set up properly

The CAC-supplied Java classes need to appear close to the beginning of the javaclasspath.

Renewing a certificate

Certificates are required for many operations and they do expire after 12 hours. There is no way to lengthen the lifetime of a certificate. However, you can clear your current credentials and get a fresh certificate good for the next 12 hours. To do this, first issue these two lines:

>> CM = edu.cornell.cac.tuc.cacscheduler.globus.CertManager.getInstance();
>> CM.clearCreds();

Then issue a command that requires a valid certificate to run, e.g.,

>> ftp = gridFTP();

The usual dialog box will pop up to request your username and password. After you complete it, a message will appear bearing the timestamp for your new certificate.

What happens if your certificate expires during a long run? Let's say you are waiting on a job's state, for instance. Then you might see something like:

[2011-12-14 18:50:55,063] /var/folders/II/III4FcH+Gj01xajF1Y4le++++TQ/-Tmp-//x509up_u503 is an expired certificate: certificate expired on 20111215005054GMT+00:00
[2011-12-14 18:50:55,064] Attempting to retrieve new certificate for you...

The username/password dialog box will pop up at this point. If this occurs, just enter the information and grab a new certificate in mid-stride; after that, all operations should continue to work normally.

Note, however, that there is a simpler alternative to waiting for long-running jobs. Perhaps you don’t really need to leave MATLAB open, causing "wait" to poll on the job state every 10 seconds for 18 hours. You might prefer just to close MATLAB down and restart it in the morning. Then, if you simply instantiate a new "sched" object, you will (1) prompt a fresh certificate to be issued and (2) trigger a check on the job states for jobs that have not yet reached a terminal state. This will be followed by an automatic download of the results from any completed jobs.

Speeding up submission of large parallel jobs

By default, MATLAB Distributed Computing Server will upload and download its standard job files over the network singly, one file at a time. For large parallel jobs, the cumulative latency costs from numerous individual job files can cause long waits for commands like submit(). To make the communication process more efficient, CAC provides zip-file functionality on both the client and server sides. It allows all job files to be bundled up and travel together. The functionality is switched on and off through the commands:

>> cac_enable_zip()
>> cac_disable_zip()

These functions are available in the "contrib" folder of the CAC client software distribution. Note, zip mode does not persist between sessions; you have to re-issue the cac_enable_zip() command each time you start up MATLAB (append it to startup.m if you like).
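A fragment along the following lines could be appended to startup.m. It is only a sketch: the check with which() is an added precaution, on the assumption that the "contrib" folder may not be on the MATLAB path in every session, and the variable name zipRequested is arbitrary.

```matlab
% Fragment for startup.m: switch on zip mode if cac_enable_zip can be
% found on the MATLAB path; otherwise print a note and carry on.
zipRequested = ~isempty(which('cac_enable_zip'));
if zipRequested
    cac_enable_zip();
else
    fprintf('cac_enable_zip not found on the path; zip mode left off.\n');
end
```

Because startup.m runs as a script, zipRequested remains in the base workspace afterward, where it can be inspected or cleared.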

MEX-files

If your jobs require one or more MEX-files, they must be compiled specifically for CAC's software environment. The best way to do this is to compile the MEX-files directly on Red Cloud with MATLAB. This ensures compatibility in all respects, e.g., the availability of any compiler runtimes. Red Cloud with MATLAB is currently based on Microsoft's 64-bit Windows Server 2008 platform, so CAC provides Microsoft Visual C++ 2010 Express to enable the compilation of MEX-files on this platform.

To assist you with compilation, two scripts are provided in the "examples" directory: cacMex.m is a task function that invokes mex for you, while cacMexSubmitter.m is a client-side function that sets up the cacMex job for Red Cloud and submits it. Note that your cacMex job must be submitted to the Quick queue.

The supplied scripts will have to be customized and edited in several places in order to work properly for your MEX-file. In particular, cacMexSubmitter.m needs to know the exact names of your source files, because it will copy the source files to CAC and copy back the resulting .mexw64 file. Therefore, make copies of the original scripts to a working directory of your choice and make the following specific edits to your scripts.

cacMexSubmitter.m

  1. Comment out all sendFileToCAC commands, then add sendFileToCAC commands for your own files
  2. Comment out fileList on two lines, then add fileList = {'myfile1','myfile2'...}; for your files after those two lines
  3. Change set(j,'PathDependencies',{a{3}}); to this: set(j,'PathDependencies',{a{3}(1:end-7)}); %should be home directory at CAC
  4. On Mac and Linux, change [path,name,ext] = fileparts(a{1}); to this: [path,name,ext] = fileparts(strrep(a{1},'\',filesep));

cacMex.m

  1. Comment out the mopts line, then add mopts = fullfile(matlabroot,'bin','win64','mexopts','msvc100freeopts.bat');
    • For R2012a and R2013a, add instead the line: mopts = fullfile(fileparts(matlabroot),'R2011b','bin','win64','mexopts','msvc100freeopts.bat');
    • Note that R2012a and R2013a have to use the R2011b settings because the appropriate settings file was not included in later releases
  2. Following the comment line %Clean the temporary directory, add the line: delete(fullfile(tDir,'*'));
  3. Change mexFilePath = fullfile(jobLocation,f); to this: mexFilePath = fullfile(jobHome,f);

When you've edited the scripts, you can do a quick test of the compilation process by making a quick copy-and-paste clone of avg.c, the simple 30-line example code that is available in the CAC Virtual Workshop module on MATLAB Programming. Edit cacMexSubmitter.m so that avg.c appears in both sendFileToCAC and fileList (in two places). You can then demonstrate that your resulting MEX-file works by submitting a second job that calls the aavg function, defined as follows:

function a = aavg(invec)
a = avg(invec);
end

Don't forget to upload aavg.m and avg.mexw64 to CAC (avg.mexw64 should be there already, but it's in an odd location).