TACCSpurGuide

From Cornell CAC Documentation


Most of the following is taken from TACC's Spur User Guide. There are updates and some detail for those less experienced with Linux.


System Overview

Spur (spur.tacc.utexas.edu), a Sun Visualization Cluster, contains 128 compute cores, 1 TB of aggregate memory, and 32 GPUs. Spur shares the InfiniBand interconnect and Lustre parallel file system of Ranger, TACC's Sun Constellation Linux Cluster. Spur is therefore not only a powerful stand-alone visualization system; it also lets researchers visualize Ranger-produced data without migrating it to another file system, and integrate simulation and rendering tasks on a single network fabric.

Spur is an 8-node visualization cluster (plus a login node); each node contains 16 cores, at least 128 GB of RAM, and 4 NVIDIA Quadro FX 5600 GPUs. The individual nodes are configured as follows:

  • spur: login node (no graphics resources)
  • visbig: Sun Fire X4600M2 server with
    • 8 dual-core AMD Opteron processors
    • 256 GB RAM
    • 2 NVIDIA QuadroPlex 1000 Model IV (2 FX 5600 each)
  • vis1: Sun Fire X4400 server with
    • 4 quad-core AMD Opteron processors
    • 128 GB RAM
    • 2 NVIDIA QuadroPlex 1000 Model IV (2 FX 5600 each)
  • vis2-7: Sun Fire X4400 servers, each with
    • 4 quad-core AMD Opteron processors
    • 128 GB RAM
    • 1 NVIDIA QuadroPlex 2100 S4 (4 FX 5600)


Login and Get a Visualization Node

Graphics applications on Spur are served through a remote VNC desktop. VNC desktops are launched via an SGE job script, and visualization applications are then run directly within the VNC desktop.

Visualization and data analysis (VDA) applications must be run on a vis node (vis*.ranger.tacc), allocated to you by SGE. No VDA applications should be run on the Spur login node (spur.tacc). VDA applications running on the login node may be terminated without notice, and repeated violations may result in your account being suspended. Please submit a consulting ticket at https://portal.tacc.utexas.edu/consulting/ if you have any questions regarding this policy.

To launch a VNC desktop, execute the following commands (a consolidated sketch of the whole sequence appears after this list):

  • ssh to spur: ssh spur.tacc.utexas.edu. At login, a banner prints your account name (shown here as TG-MyAcct); make a note of it, as you will need it for job submission.
  • If this is your first time connecting to spur, you must run vncpasswd to create a password for your VNC servers. This should NOT be your login password! This mechanism only deters unauthorized connections; it is not fully secure, as only the first eight characters of the password are saved. All VNC connections are tunneled through ssh for extra security, as described below.
  • Launch a vnc desktop via SGE:
    cp /share/sge/default/pe_scripts/job.vnc job.vnc
    Edit job.vnc to add -A TG-MyAcct to the job directives, so that they look something like the following (line 3 is the added directive):
  1. #$ -V                             # Inherit the submission environment
  2. #$ -cwd                           # Start job in submission dir
  3. #$ -A TG-MyAcct                   # Added this line with my account
  4. #$ -N vncserver                   # Job name
  5. #$ -j y                           # Combine stderr and stdout into stdout
  6. #$ -o $HOME/$JOB_NAME.out         # Name of the output file
  7. #$ -pe 16way 16                   # Request 1 Vis node
  8. #$ -q vis                         # Queue name
  9. #$ -l h_rt=4:00:00                # runtime (hh:mm:ss) - 4 hours
If you haven't used a Unix text editor, try SimpleViDirections. At present, the command-line option "-l h_rt" is not read properly by qsub, so you must set the runtime by editing job.vnc; the maximum runtime is 24 hours. To specify a particular desktop size when you submit the job, use:
qsub job.vnc -geometry 1440x900
On success, the submission output ends with your job number:
------------------------------------------------------------------------
 Welcome to TACC's Spur Visualization System, an NSF TeraGrid Resource
------------------------------------------------------------------------

  --> Submitting 16 tasks...
  --> Submitting 16 tasks/host...
  --> Submitting exclusive job to 1 hosts...
  --> Verifying HOME file-system availability...
  --> Verifying WORK file-system availability...
  --> Verifying SCRATCH file-system availability...
  --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
  --> Requesting valid memory configuration (mt=31.3G)...
  --> Checking ssh keys...
  --> Checking file existence and permissions for passwordless ssh...
  --> Verifying accounting...
  --> Validating against Spur allocations
  --> Using queue vis ...
  --> Using parallel environment 16way ...
  --> Using project TG-MyAcct ...
Your job 581332 ("vncserver") has been submitted

The default window manager is twm, a spartan window manager that reduces connection overhead. Gnome is available if your connection speed is sufficient to support it. To use Gnome, open the file ~/.vnc/xstartup and replace twm with gnome-session.
  • Once the job launches, connection info will be written to vncserver.out in your home directory. You can check job status with showq -u. When you first run the job, you can ls in your home directory to look for vncserver.out. Once it exists, a convenient way to follow it is the tail -f command:
spur% tail -f ~/vncserver.out
job execution at: Sat Mar 7 15:03:17 CST 2009
got VNC display vis3.ranger.tacc.utexas.edu:1
VNC display number is 1
local (compute node) VNC port is 5901
got spur vnc port 5931
Your VNC server is now running!
To connect via VNC client:  SSH tunnel port 5931 to spur.tacc.utexas.edu:5931
                            Then connect to localhost:5931

Stop tail -f by typing Ctrl-C. In this example, VNC is running on vis3.ranger.tacc.utexas.edu; a VNC server listens on port 5901 for display :1, 5902 for display :2, and so on. On Spur, however, you can't connect directly to the vis[1-7] nodes. Instead you ssh to the head node, spur, and your connection is automatically port-forwarded to the vis[1-7] node; the port spur:5931 corresponds to VNC display :31 on spur.
If the vncserver.out file includes the following lines, then either you are looking at output from a previous job or your VNC server job has already exited. Check the "job execution at" timestamp.

 TACC: Cleaning up after job: 581332
 TACC: Done.
  • For security, TACC requires that you tunnel your VNC session through ssh. You can set up a tunnel on a Unix command line or with a Windows ssh client. You will need to select an arbitrary VNC port number on your local machine for the tunnel; some VNC clients expect a local port in the range 5901 to 5999. For Windows, see VNCTunnelWindows.
    On a unix command line, where 59xx is your local port and 59yy is the port on spur specified in your vncserver.out file, use the command:
    ssh -L 59xx:spur.tacc.utexas.edu:59yy <username>@spur.tacc.utexas.edu
    for instance, to tunnel from 5911 on the local machine to 5901 on spur, use:
    ssh -L 5911:spur.tacc.utexas.edu:5901 <username>@spur.tacc.utexas.edu

    You can also open the tunnel from within your existing ssh session, without opening a new xterm. If you are coming from Linux, use the ssh escape sequence ~C to ask your client to open a tunnel. Press <Enter> first (the escape is only recognized at the start of a line), then type ~C.
spur%
spur%
spur%
spur% ~C
ssh> -L5931:spur.tacc.utexas.edu:5931
Forwarding port.
spur%
  • Once the ssh tunnel has been established, use a VNC client to connect to the local port you created, which will then be tunneled to your VNC server on spur. Connect to localhost:59xx, where 59xx is the local port you chose for your tunnel. In the example above, we would connect the VNC client to localhost::5911 (the double-colon form specifies a port number); some VNC clients instead accept the display form localhost:11.
    If you do not have a VNC client, you can download one from any of several sites that provide free VNC viewers.
  • What if you cannot connect with VNC?

First see if your job is still running with showq -u. If it is, then you can check whether VNC is listening for connections on spur. From the spur command prompt, telnet to the VNC port on spur. (Don't telnet to localhost, which will always fail. Telnet to spur.) Replacing 5941 with your desired port, it looks as follows:

spur% telnet spur 5941 < /dev/null
Trying 129.114.50.162...
Connected to spur.tacc.utexas.edu (129.114.50.162).
Escape character is '^]'.
RFB 003.008
Connection closed by foreign host.

If you see RFB, then VNC is listening and you most likely mistyped the tunnel specification; set up the tunnel again. If telnet fails with "Unable to connect...", then VNC is not listening; kill your job with qdel <jobnumber> and resubmit it. The resubmitted job usually runs fine.

  • After connecting your VNC client to your VNC server on spur, you may use visualization applications directly on the remote desktop without launching other SGE jobs. Any application that uses OpenGL library calls must be launched using vglrun, as described in the following sections. Open a new xterm so you don't accidentally end your session:
[tgusername@vis3 ~]$ xterm&

The xterm first appears, under this window manager, as an outline. It will wait for you to click your mouse to tell it where to place the window. Use the new xterm to start any visualization, as described below.

  • When you are finished with your VNC session, you should kill your VNC server by typing exit in the black xterm window titled "*** Exit this window to kill your VNC server ***". Merely closing your VNC client will NOT kill your VNC server job on spur, and you will continue to be billed for time usage until the job ends. If you close your VNC client, you can reconnect to your VNC server at any time until the server job ends.
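For reference, here is a consolidated sketch of the whole sequence above, from login to connecting a VNC client. The account name, ports, and geometry are the placeholders used in the examples; substitute your own values.

# On spur (run vncpasswd only the first time)
ssh spur.tacc.utexas.edu
vncpasswd
# Copy the VNC job script, edit it to add your "#$ -A TG-MyAcct" line, then submit it
cp /share/sge/default/pe_scripts/job.vnc job.vnc
qsub job.vnc -geometry 1440x900
# When the job starts, read the spur port (59yy) from vncserver.out
tail -f ~/vncserver.out
# On your local machine: tunnel a free local port (59xx) to that spur port
ssh -L 59xx:spur.tacc.utexas.edu:59yy username@spur.tacc.utexas.edu
# Finally, point your VNC client at localhost:59xx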

Running Visualization Applications

To access the vis stack on spur, load the vis module if it is not already loaded: module load vis. Then list the available modules with module avail; the visualization applications are listed under /opt/apps/vis/modulefiles. You must load an application's module, and possibly prerequisite modules, before launching the application.

To run applications that make use of the OpenGL library, you must precede the executable with vglrun. For example: vglrun visit. If you receive the error "Xlib: extension GLX missing on display" when launching an application, you must precede the executable with vglrun.
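In outline, the pattern for any OpenGL-based tool is the following sketch, where <application> stands for whichever module you need (plus any prerequisites its module reports):

module load vis              # make the visualization modules visible
module avail                 # applications appear under /opt/apps/vis/modulefiles
module load <application>    # load the application's module (and prerequisites)
vglrun <application>         # vglrun is required for applications that use OpenGL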

Running Parallel VisIt on Spur

After connecting to a VNC server on spur, as described above, do the following:

  • VisIt was compiled under the Intel v10 compiler and the mvapich v1.0.1 MPI stack. These must be loaded prior to running VisIt. Also, the default module 'CTSSV4' is incompatible with the VisIt server and must be removed. From the default environment, execute the following:
module delete mvapich mvapich2
module delete CTSSV4
module load mvapich/1.0.1
  • If the vis module is not yet loaded, you must load it. Then load visit and run it:
module load vis
module load visit
vglrun visit
  • Configure a parallel run engine:
    • Open the host profile: < Ctrl-H > or Options -> Host Profiles
    • Click the button "New Profile"
    • Under the "Selected profile" tab
      • Name the profile, e.g. "spur parallel"
      • Remote host name will be the current vis node: vis5.tacc.utexas.edu
      • Host name aliases: vis*.ranger.tacc.utexas.edu.
      • Check the "Parallel computation engine" box. This activates the "Parallel options" tab.
    • Under the "Parallel options" tab:
      • Check the "Parallel launch method" box, and select poe. Be careful! There are several 'qsub' options, and you may need to scroll down to find "poe"
      • Set the "Default number of processors" field to a value greater than one. The exact value is ignored, but it must be 2 or more to avoid automatic launch of the serial engine. The actual number of processors requested is controlled by the number of nodes you requested for your VNC session.
    • Under the "Advanced options" tab:
      • Check the box "Use VisIt script to set up parallel environment"
      • Check the box "Tunnel data connections through SSH"
    • Click the button "Apply"
    • Click the button "Dismiss"
    • Save your configuration! Select Options -> Save Settings
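As a sketch of how the engine size is actually set (assuming the vis queue permits multi-node VNC jobs; check current TACC limits if you need more than one node), requesting two vis nodes for a 32-process engine would mean editing the -pe directive in job.vnc:

#$ -pe 16way 32              # 16 tasks per host, 32 slots total = 2 vis nodes
#$ -q vis                    # vis queue, unchanged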

For more information on VisIt, see https://wci.llnl.gov/codes/visit/home.html

Running Parallel VisIt on Ranger

VisIt can be run in batch mode on Ranger using software rendering, usually through VisIt’s python scripting interface. You must also load Mesa, a software-rendering implementation of OpenGL, because Ranger compute nodes lack GPUs. Here is a sample job script that launches visit in command-line mode and runs a python script:

#!/bin/bash
 
#$ -V				      # Inherit the submission environment
#$ -cwd 			      # Start job in submission dir
#$ -N visit.job			      # Job name
#$ -j y				      # stderr and stdout into stdout
#$ -o $HOME/$JOB_NAME.o$JOB_ID	      # Name of the output file
#$ -pe 16way 16			      # Request 1 Ranger node
#$ -q normal                          # Queue name
#$ -A TG-MyTGAcct		      # Account
#$ -l h_rt=01:00:00		      # runtime (hh:mm:ss) - 1 hour
 
# configure environment for visit
module purge
module load TACC
module delete pgi mvapich CTSSV4
module load intel mvapich
module load vis mesa visit/1.10.0
# run visit
visit -cli -nowin -s myVisItScript.py

To open a parallel engine in your python script, use the following:

launchArguments = ("-par", "-np", "16", "-l", "poe")
OpenComputeEngine("localhost", launchArguments)

As when using VisIt on Spur, the number of parallel engine processes launched is controlled by the job script '-pe' option, not by the engine '-np' argument; however, the -np argument must be 2 or more so that a parallel engine is used.
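To run the batch job, submit the script above with qsub and watch its output file; a minimal sketch (visit_batch.sge is a placeholder filename for the job script):

qsub visit_batch.sge            # submit the VisIt batch job to SGE
showq -u                        # check whether it is queued or running
tail -f $HOME/visit.job.o*      # follow VisIt's output once the job starts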

A manual for VisIt’s python interface is at: https://wci.llnl.gov/codes/visit/manuals.html


Running Parallel ParaView on Spur

After connecting to a VNC server on spur, as described above, do the following:

  • ParaView was compiled under the Intel v10 compiler and the OpenMPI v1.3 MPI stack. These must be loaded prior to running ParaView, and they are not loaded by default. (If you simply type module load paraview, the module program will tell you what else must be loaded first.) The full sequence is:
[username@vis5 ~]$ module delete mvapich mvapich2
[username@vis5 ~]$ module load openmpi/1.3
[username@vis5 ~]$ module load vis
[username@vis5 ~]$ module load paraview
[username@vis5 ~]$ vglrun paraview

ParaView may take a while to launch.

  • Launch a parallel ParaView server:
    • In a separate xterm, load the modules as described above
    • After all modules have been loaded, launch the ParaView server with parallel rendering enabled (this may open additional rendering windows):
[username@vis5 ~]$ ibrun vglrun pvserver
TACC: Setting up parallel environment for OpenMPI mpirun.
TACC: Setup complete. Running job script.
TACC: starting parallel tasks...
Listen on port: 11111
Waiting for client...

If additional windows open, you can minimize them, but do not close them.

  • Connect to the server from within the ParaView client:
    • Click the "Connect" button, or select File -> Connect
    • Click "Add Server"
    • Enter a "Name", e.g. "manual launch"
    • Click "Configure"
    • For "Startup Type", select "Manual"
    • Click "Save"
    • Select the name of your server configuration, and click "Connect"
    • In the xterm where you launched ParaView server, you should see "Client connected."

For more information on ParaView, see http://www.paraview.org/

Running Amira on Spur

Amira runs only on node vis6 of spur. You must request this node explicitly in qsub, either in your script or on the command line using the argument "-l h=ivis6". Note the leading 'i'!

qsub -l h=ivis6 -A TG-MyAcct /share/sge/default/pe_scripts/job.vis

After connecting to a VNC server on spur, as described above, load the vis module if it is not already loaded, then load Amira and run it with vglrun:

module load vis
module load amira
vglrun amira

Running IDL on Spur

To run IDL interactively in a VNC session, connect to a VNC server on spur as described above, then do the following:

module load vis
module load idl

Then either launch IDL directly or launch the IDL virtual machine.

idl
 - or -
idl -vm

If you are running IDL in scripted form, without interaction, simply submit an SGE job to the 'vis' queue that loads IDL and runs your script. The 'vis' queue only allocates to spur's vis nodes.
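A minimal sketch of such a batch job script, modeled on the Ranger example above (TG-MyAcct and the IDL batch file myscript.pro are placeholders; adjust the runtime to your needs):

#!/bin/bash
#$ -V                              # Inherit the submission environment
#$ -cwd                            # Start job in submission dir
#$ -N idl_batch                    # Job name
#$ -j y                            # Combine stderr and stdout
#$ -o $HOME/$JOB_NAME.o$JOB_ID     # Name of the output file
#$ -pe 16way 16                    # Request 1 vis node
#$ -q vis                          # The vis queue allocates spur's vis nodes
#$ -A TG-MyAcct                    # Your account
#$ -l h_rt=01:00:00                # runtime (hh:mm:ss) - 1 hour

module load vis
module load idl
# Run the IDL batch file non-interactively (myscript.pro is a placeholder)
idl < myscript.pro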

If you need to run IDL interactively outside of a VNC session, you will need an SGE job running in the vis queue so that you are allocated a vis node. The vncserver job is an easy way to do this, though submitting a script with 'sleep N' where N is the amount of time you will use the node would work just as well. The point is to allocate a vis node to you.
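For example, a bare-bones node-holding script might look like the following sketch (the two-hour figure is arbitrary; match h_rt and the sleep duration to the time you need):

#!/bin/bash
#$ -N holdnode                     # Job name
#$ -j y                            # Combine stderr and stdout
#$ -o $HOME/$JOB_NAME.o$JOB_ID     # Name of the output file
#$ -pe 16way 16                    # Request 1 vis node
#$ -q vis                          # Queue name
#$ -A TG-MyAcct                    # Your account
#$ -l h_rt=02:00:00                # Hold the node for 2 hours
sleep 7200                         # Do nothing; just keep the allocation alive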

Once you have a node allocated, you can ssh to it (ssh -X vis[1-7,big]). However, these are not reachable from outside the Ranger/Spur firewall, so you must first ssh to a login node (spur, login3.ranger, login4.ranger).

To summarize: once you have a vis node allocated, you can ssh through a login node to the allocated vis node.

Once you're done, you can let the SGE job expire naturally (though you will be charged for the time used), or you can qdel it to terminate it immediately.

Frequently Asked Questions (FAQs)

If you do not see your question answered here, please submit a consulting ticket at https://portal.tacc.utexas.edu/consulting/.

General Questions

Why VNC? Can I use X-forwarding?

VNC is more responsive than X-forwarding because only keyboard, mouse, and screen updates are sent over the remote connection. However, X-forwarding will work on Spur.

You may X-forward terminals, editor sessions, etc from the login node. Visualization and data analysis apps may be X-forwarded as well, but they must be run on a vis node (vis*.ranger), allocated by SGE, not the login node (spur). The vncserver job is an easy way to allocate a vis node, though submitting a job script with 'sleep N' where N is the amount of time you will use the node would work just as well. The point is to allocate a vis node to you.

Once you have a node allocated, you can ssh to it (ssh -X vis[1-7,big]). However, these are not reachable from outside the Ranger/Spur firewall, so you must first ssh to a login node (spur, login3.ranger, login4.ranger).
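As a sketch of the two hops (vis3 stands for whichever node your job was actually allocated; check showq -u or your vncserver.out):

# On your local machine: X-forward to the login node
ssh -X username@spur.tacc.utexas.edu
# On spur: find the vis node your SGE job holds, then hop to it
showq -u
ssh -X vis3                        # replace vis3 with your allocated node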

Must I ssh-tunnel my VNC connection?

No, but VNC communications are not encrypted, so your VNC password is visible and keystrokes can be logged. Never type a password inside an unencrypted VNC session! Tunneling the VNC connection through ssh encrypts the communication with negligible effect on performance, removing this security vulnerability.

My argument to qsub or vncserver was not recognized

Arguments to qsub must be placed before the script filename; arguments to vncserver must be placed after it. See the qsub -geometry example under "Login and Get a Visualization Node", above.

My application fails with GL errors

Xlib: extension "GLX" missing on display ":1.0"

Your application uses the OpenGL library and must be run with vglrun on a vis node: vglrun <app>

'vglrun' command not found

vglrun is installed only on the vis nodes of spur (vis*.ranger), not on the login node (spur). If your prompt contains "spur", you are on the login node. You must be in a VNC session on a vis node to use vglrun and OpenGL-based applications.

How can I get more RAM for my application? It has hit a memory limit.

SGE sets per-process memory limits based on the 'wayness' of the SGE job, as specified in the '-pe' argument of your job script. The default script uses '-pe 16way 16', which permits 16 parallel processes, each with a memory limit of 1/16th of the available RAM (8 GB on vis1-7, 16 GB on visbig). Selecting a lower wayness permits fewer parallel processes, each with a higher memory limit. For example, '-pe 4way 16' permits 4 parallel processes, each with a memory limit of 1/4 of the available RAM (32 GB on vis1-7, 64 GB on visbig). Note that the memory limit is enforced at the kernel level, while the parallel process limit applies only to MPI environments created by the TACC ibrun script. Multi-threaded programs not using ibrun are subject only to the memory limit.
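For example, to give each process a quarter of the node's RAM, change the -pe directive in your job script from the default:

#$ -pe 16way 16                    # 16 processes, each limited to 1/16th of RAM

to:

#$ -pe 4way 16                     # 4 processes, each limited to 1/4 of RAM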

Amira cannot find a valid license

Amira only runs on vis6. Make sure that you are running on vis6 (look at the prompt in your VNC session terminals, or see your vncserver.out file). You must explicitly request vis6 through qsub, as described in "Running Amira on Spur", above.
