Red Cloud with MATLAB/Storage


Users with a subscription to Red Cloud with MATLAB are currently provided with 50GB of storage space for MATLAB scripts, job data, or other resources. This storage is persistent in Red Cloud and is accessible to all distributed/parallel MATLAB jobs run by the same user.

Directory Structure

Your MATLAB data (including Parallel Computing Toolbox (PCT) job data) appears in a specific directory structure on Red Cloud.

/ (or "root")

Your root directory is the normal landing place for gridFTP(). Root is also included in the default MATLAB search path, so if your task functions are defined in .m files in this directory, they will be found by all MATLAB worker instances in Red Cloud. Thus, your MATLAB scripts should be uploaded here via gridFTP() before they can be used in parallel or distributed jobs. Raw data for processing may also be uploaded here. Using gridFTP (or even commands in your MATLAB jobs), you are free to create directories within root to organize your data and results, as sketched below.
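
For instance, an input-data directory could be created from the local client before uploading files into it. This is a minimal sketch: put() and list() appear in the gridFTP section below, but the exact name of the directory-creation method is an assumption here; run help gridFTP to confirm it.

cacftp = gridFTP();
cacftp.mkdir('/myData');   % method name assumed; see help gridFTP
cacftp.list('');           % verify that /myData now appears under root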

/R2011b

Underneath the root directory is a directory for each MATLAB client version that you might use to connect to the service; suppose your client is R2011b. Data for all jobs run by workers of that MATLAB version are found in this directory. When you destroy a job locally, the corresponding job data are also deleted from this location.

/R2011b/JobN

Underneath the MATLAB version directory is a directory for each job, containing the job and task metadata, outputs, logs, and other data uniquely associated with that job. Each job directory is named after the job it represents, so files associated with Job64 may be found in /R2011b/Job64.
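
From the local client, these job directories can be inspected with the list method of gridFTP(), described in the next section. A short sketch, assuming list() accepts full Linux-style paths like the other gridFTP() operations:

cacftp = gridFTP();
cacftp.list('/R2011b');        % one subdirectory per job, e.g. Job64
cacftp.list('/R2011b/Job64');  % metadata, outputs, and logs for Job64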

gridFTP

The gridFTP() function is used within the local MATLAB client to upload, download, create, or delete files and directories in Red Cloud. A user will typically want to use gridFTP() to upload all the .m files containing code to be executed in Red Cloud. It is also recommended to use gridFTP() to transfer any data files associated with a job, especially when the input or output files are large.

Unlike traditional FTP, gridFTP() is stateless: there is no concept of a current directory other than root. Consequently, all remote files and directories must be referred to by their full Linux-style paths relative to root, e.g., /myData/images/img1.bmp

Typical usage resembles:

cacftp = gridFTP();

% Perhaps we need to upload some data to analyze; put it somewhere logical
cacftp.put('myLocalCopyOfData.dat','/oneOfMyFoldersOnRedCloud/data.dat');

% Put the MATLAB code in the root directory
cacftp.put('my_function.m', 'my_function.m');

% Confirm that the file is present in the root directory
cacftp.list('');

% Later on, the uploaded code can be used in tasks submitted to Red Cloud
task = createTask(job, @my_function, 1, {'arg'});
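
After the job runs, any result files written to the shared storage can be pulled back to the local machine the same way. The sketch below involves an assumption: this page says gridFTP() can download files, but the method name get and its argument order are not confirmed here; check help gridFTP.

% Download a result file (hypothetical get method; arguments assumed
% to be remote path first, local path second)
cacftp.get('/oneOfMyFoldersOnRedCloud/results.dat', 'myLocalResults.dat');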

More help

For complete capabilities, syntax, and usage information, please refer to

>> help gridFTP

Access by a Red Cloud Job

We have seen that from your local client, your CAC storage area is accessed via gridFTP, where it appears as a Linux-style directory tree that starts at root (/). From your Red Cloud with MATLAB workers, however, the view is somewhat different:

  1. Because your Red Cloud workers will run under the Windows Server 2008 OS, you must refer to files using Windows-style paths when you write your MATLAB task function(s).
  2. Since your workers may be running on multiple servers at CAC, they must access your CAC storage through a specific network share, \\matlabstorage01\matlab\myid.

Note that this network share is not the current working directory when your workers start. If you want the shared "root" directory to be the working directory for your workers, you must cd to it using a Windows-style path:

>> cd \\matlabstorage01\matlab\myid  % you might want to put this line in your task function

where myid is your CAC username. If you don't want to insert this cd command, then your input and output filenames should include the full shared network location, e.g., \\matlabstorage01\matlab\myid\inputfile.
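
Putting these pieces together, a task function might begin by switching to the shared storage before reading or writing files. This is a minimal sketch; the function name, file names, and process() are placeholders, and myid stands for your CAC username:

function result = my_function(arg)
% Work out of the shared storage so relative file names resolve there
cd('\\matlabstorage01\matlab\myid');
data = load('inputfile.mat');       % input uploaded earlier via gridFTP
result = process(data, arg);        % process() is a placeholder for your analysis
save('outputfile.mat', 'result');   % output lands on the share for later download
end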

Another possibility is to read and write files in your workers' default working directory, which is C:\Users\myid on each worker. However, this folder does not correspond to a network share; rather, it is a local folder on each server that hosts one or more workers. Therefore, using the default folder(s) is not recommended: there is no way to access your files before your job starts or after it ends.

Setting PathDependencies

If you create a subdirectory under your main directory on the network share, and you want your Red Cloud job to find your files there, be sure to set PathDependencies appropriately for the job:

set(job,'PathDependencies',{'\\matlabstorage01\matlab\myid\mydir'})

This setting should be made on your local client prior to submitting your job. Note that the PathDependencies parameter must be a cell array of strings. By default, the cacscheduler configuration predefines PathDependencies to be just your home directory, \\matlabstorage01\matlab\myid.
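
In context, the setting fits into job creation as in the following sketch, which assumes the pre-R2012a Parallel Computing Toolbox API used elsewhere on this page (createJob, createTask, submit); sched, my_function, mydir, and myid are placeholders:

% sched is assumed to have been obtained from the cacscheduler
% configuration (e.g., via findResource)
job = createJob(sched);

% Make the subdirectory on the network share visible to the workers
set(job, 'PathDependencies', {'\\matlabstorage01\matlab\myid\mydir'});

task = createTask(job, @my_function, 1, {'arg'});
submit(job);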

Notice once again that the path is a fully-qualified Windows-style path to a specific network share. This is because CAC's Red Cloud with MATLAB servers run Windows; regardless of your local client's OS, the workers operate in a Windows environment, so the job's settings must be made accordingly.

Contrast with FileDependencies

Another way to move files to Red Cloud would be to set the FileDependencies property for a job; the pattern is sketched after this list. However, this method is not recommended for several reasons:

  1. FileDependencies are not retained in persistent storage, so large input files might need to be transferred again and again.
  2. Because FileDependencies are merely placed into the temporary storage assigned to each worker, they become unavailable after a job ends.
  3. Each worker receives a separate copy of all the FileDependencies when the job starts; workers do not utilize a single set of files in a common location.
  4. Consequently, in most cases, a job's startup is greatly slowed by FileDependencies.
  5. The gridFTP mechanisms are generally faster and more efficient at transferring files.
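
For reference, the discouraged pattern is simply another job property set before submission (the file names here are placeholders):

% Every worker receives its own temporary copy of these files at job
% startup; the copies are discarded when the job ends
set(job, 'FileDependencies', {'my_function.m', 'bigInput.dat'});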