FAQ

From CAC Documentation wiki
Revision as of 10:29, 31 July 2017 by Shl1 (talk | contribs)

Account

How can I obtain a CAC account?

See How to Start a Project.

How can I determine the number of hours I have left before I reach my project limit?

Check links from the CAC Projects page.

My account is locked.

If it was locked after repeated password failures, it should automatically unlock after 30 minutes. Otherwise: Contact Support

I forgot my password, or have problems with a new password, or need a password reset.

Contact Support

Are my login id and password the same for all machines?

Yes. For an ssh connection give your login id at the prompt. With a Windows GUI, specify the username as CTC_ITH\<login_id> or <login_id>@tc.cornell.edu.

When I try to use a Remote Desktop client to connect to winlogin, it tells me that my username/password are incorrect.

Make sure that you are logging in using the CTC_ITH domain. If you put only your username in the "username" box, the client will try to log you into winlogin as a local user, which won't work. Put CTC_ITH\<username> in the "username" box.


Files

How can I copy files to my desktop from H:?

Use an SSH client to transfer the files with sftp. See File_Transfer_To_Clusters.

Can't use scp to transfer files to the CAC.

Use sftp.

Problems using WinSCP.

Use sftp.

Needed to share a file with a colleague outside the university. This is typically available only to CAC personnel.

Showed how to use outgoing ftp folder and sent detailed instructions by email.

Can't access files.

System problem. Send email to consult@tc.cornell.edu.

Can see files in explorer, but sees files only in home directory with dir at command prompt.

User had navigated to Start | Run, then typed "command". Use "cmd" instead.

How Do I Transfer Files To and From CAC Machines?

  1. Use a program to send them - SecureShell
    • Faster over slower connections.
    • Less hassle.
  2. Make your CAC home directory look like a local drive - FileAccess
    • Works fine on campus.
    • Convenient for editing.

If you have any questions, please contact us through the web site, send email, or call 607.254.8686.

Why use a temporary directory?

It is faster to perform local file I/O and copy complete data files to/from $HOME at the beginning and the end of the job, rather than perform I/O over the network ($HOME is network mounted on the compute nodes).

  • Torque creates a uniquely named directory (/tmp/$PBS_JOBID) when a job starts and stores the path of this directory in the $TMPDIR environment variable. This directory is cleaned up when the job exits.
    • To use this feature, reference $TMPDIR
  • You may also create directories for file reads and writes elsewhere in /tmp, outside /tmp/$PBS_JOBID. Leaving data there is risky, however: it may be deleted at any time if we see /tmp getting full.
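
As a rough sketch of the staging pattern described above: copy the input to local scratch, do the I/O there, and copy the result back. The file names input.dat and output.dat, the function name run_in_scratch, and the tr command standing in for a real application are all hypothetical. The fallback to mktemp is only there so the function can be tried outside a Torque job, where $TMPDIR may be unset.

```shell
# Sketch: stage data to node-local scratch, compute there, copy results back.
# run_in_scratch, input.dat, output.dat and the "tr" step are placeholders.
run_in_scratch() {
    local src_dir="$1"
    # Inside a Torque job, $TMPDIR points at /tmp/$PBS_JOBID.
    # Outside a job, fall back to a fresh temporary directory.
    local scratch="${TMPDIR:-$(mktemp -d)}"

    # Stage in: copy the complete input file to local disk.
    cp "$src_dir/input.dat" "$scratch/" || return 1

    # Do the work with local file I/O only (placeholder computation).
    ( cd "$scratch" && tr 'a-z' 'A-Z' < input.dat > output.dat ) || return 1

    # Stage out: copy the complete result back to the network-mounted directory.
    cp "$scratch/output.dat" "$src_dir/"
}
```

In a job script this would be called after the PBS directives, for example as run_in_scratch "$HOME/project", where the project directory name is again only an illustration.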

Red Cloud

How secure is Red Cloud?

Red Cloud Security

Red Cloud, CAC's Infrastructure-as-a-Service cloud, runs the Eucalyptus cloud management software. Because Eucalyptus implements a private cloud compatible with Amazon Web Services (AWS), Red Cloud's security model closely follows that of AWS.

User Interface and API

User Authentication

Red Cloud accepts two types of user authentication: passwords and AWS-style keys consisting of 2 randomly generated strings. Users log into the web management console using passwords. The user name and password are authenticated against CAC's Active Directory via Kerberos. For making AWS-compatible API calls, users can obtain their keys from the web console. All API calls are SSL encrypted, as are web console sessions.

User Access Management

Eucalyptus fully implements AWS's Identity and Access Management (IAM) features. Group and user policies can be used to control access on a per-resource and per-API-call basis. See AWS's IAM documentation for details.

Instance Access Control

Red Cloud runs Eucalyptus in "Managed" mode to implement the security group and elastic IP address features described below. In Managed mode, all user data passed within the cloud infrastructure is VLAN tagged according to its security group. The network switch connecting the cloud controller and the physical nodes running the instances performs layer 2 switching, which guarantees network isolation between security groups. Instances have no access to network packets belonging to instances outside their own security groups.

To provide elastic IP addresses, Eucalyptus configures iptables running on the cloud controller host to perform the required source and destination network address translation (SNAT and DNAT).

These features are implemented in Red Cloud infrastructure, independent of the configurations by the users on their instances.

Security Group

Each instance (virtual machine) is assigned a security group at launch time. A security group is a private network in the cloud where network access between instances in the same security group is unrestricted.

Access to an instance from outside its security group is subject to the group's access rules. Users can define the access rules by protocol, source IP address and destination port.

Instances have unrestricted outbound access to the Internet.

Elastic IP Address

Each instance is assigned a private IP address belonging to its security group at launch time. An ephemeral routable public IP address is also assigned so the instance can be accessed from the Internet. Users can optionally reserve fixed public IP addresses that they can assign to their instances. Assigning a reserved public address to a running instance takes just a few seconds and does not require rebooting the instance.

Cloud Infrastructure

Cloud infrastructure hosts (cloud controller, storage controller, and the physical nodes running the instances) run the CentOS 6 Linux distribution on a private network isolated from cloud user traffic.

How do I access the EBS volume that I have attached to an instance?

Attached volumes show up as block devices (i.e. directly attached disks) from inside the instance.

  1. You can see the attached volume using the "lsblk" command inside the instance.
  2. Then you can format the disk with the file system of your choice like this: mkfs -t <file system> <block device name>, e.g. mkfs -t ext3 /dev/vdc.
  3. Mount the file system: mkdir /mnt/data0; mount /dev/vdc /mnt/data0

How do I give my virtual server a domain name?

A virtual server in Red Cloud is randomly assigned an IP address from 128.84.8.101 to 128.84.8.196 every time it is booted. If you want to create a domain name for your virtual server (e.g. mycloudserver.cac.cornell.edu) that stays consistent, follow the instructions on the Using Dynamic DNS with Red Cloud page.

Why won't ssh let me log in to my virtual server?

  • You may not have given your instance a keypair for root access when you started it up. You should always use the -k option to assign one of the keypairs named in euca-describe-keypairs to your instance:
euca-run-instances -n 1 -k mykey [...etc...]
  • If you get a response that looks like this:
-sh-3.2$ ssh -X -i mykey.private root@128.84.8.105
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is...

...most likely this means that the numeric IP address for your instance (128.84.8.105 in the above example) has been assigned to you previously for a different instance. For a typical Linux ssh client, the way to fix this is to edit ~/.ssh/known_hosts on your machine, deleting the line that contains the re-used numeric IP address together with its old RSA fingerprint. For an ssh client on Windows or Mac, you might need to consult the documentation for that particular client.
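
One way to make that edit from the command line is sketched below. The function name forget_host is made up for illustration, and it only matches plain (unhashed) entries whose first field is the address itself or a comma-separated list containing it. OpenSSH's own "ssh-keygen -R <address>" does the same job and also handles hashed entries.

```shell
# Sketch: drop every known_hosts line whose host field names the given address.
# forget_host is a hypothetical helper, not part of OpenSSH.
forget_host() {
    local addr="$1"
    local file="${2:-$HOME/.ssh/known_hosts}"
    local tmp
    tmp=$(mktemp) || return 1
    # Field 1 is a comma-separated host list; keep lines that do not list addr.
    awk -v a="$addr" '{
        n = split($1, hosts, ",")
        keep = 1
        for (i = 1; i <= n; i++) if (hosts[i] == a) keep = 0
        if (keep) print
    }' "$file" > "$tmp" && cat "$tmp" > "$file"
    local status=$?
    rm -f "$tmp"
    return $status
}
```

For the example above, the call would be forget_host 128.84.8.105. The next ssh connection then asks you to accept the new host key.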

How do I run the latest version of my favorite Linux distro in Red Cloud?

First you will need to create a bootable OS image of the distro/version in Red Cloud. Specific instructions will vary depending on the distro/version. You will need to be handy with the euca2ools command line tools and basic Linux system administration. Here are some sample distros:

  • Ubuntu
  • Fedora: Fedora cloud images from here should work out-of-the-box.

Red Cloud and Amazon Web Services (AWS)

How do I migrate an Amazon Web Services (AWS) EC2 image to Red Cloud?

  1. Download the bundle from AWS and decrypt it. You will end up with an image file:

    ec2-download-bundle -b <S3 bucket name> -d .

    ec2-unbundle -s . -d . <manifest>

  2. Mount this image somewhere using "mount -o loop" option.
    1. Edit <image mount point>/etc/fstab and <image mount point>/etc/grub.conf such that the root disk is /dev/vda instead of /dev/xvde used by AWS.
    2. Download the tarball corresponding to your Linux distribution here. Unpack the tarball in /lib/modules

      cd <image mount point>/lib/modules; tar jxvf <path to the tarball>

    3. Unmount image
  3. Bundle image for Red Cloud:

    euca-bundle-image -i <path to image file> -d <working directory> --kernel <eki> --ramdisk <eri>

    Use the following <eki> and <eri> according to your Linux distribution:
    • CentOS 5.10: eki-CE97382C and eri-91003AD3
    • CentOS 6.5: eki-921637A4 and eri-52B4381E
  4. Upload the bundle to Red Cloud:

    euca-upload-bundle -b <bucket name> -m <manifest from the previous euca-bundle-image command>

  5. Register the image in Red Cloud:

    euca-register -a x86_64 <bucket name>/<manifest>

How do I migrate a Red Cloud (instance store) image to Amazon Web Services (AWS)?

  1. Download the bundle from Red Cloud and decrypt it. You will end up with an img file.

    euca-download-bundle -b <bucket name> -d .

    euca-unbundle -s . -d . <manifest>

  2. Mount this image somewhere using "mount -o loop" option.
    1. Edit <image mount point>/etc/fstab and <image mount point>/etc/grub.conf such that the root disk is /dev/xvde instead of /dev/vda as on Red Cloud.
    2. Make sure your instance store image has the latest CentOS 6 kernel installed. If not, run:

      yum --installroot <mount point of the image> install kernel

    3. Check <mount point of the image>/etc/grub.conf to make sure it looks right to you. Add "console=hvc0" to the end of the kernel line (reference: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UserProvidedKernels.html)
    4. Unmount image
    5. Create an AWS bundle and upload it using "ec2-bundle-image" and "ec2-upload-bundle" commands.
  3. According to this article, you will want to register the image with kernel aki-919dcaf8 in your ec2-register -k command, assuming you want to run it in the us-east-1 region. Otherwise, select the appropriate aki ID for the region where you want to run it.

euca-describe-instances or the web console says my instance is running. Why is it not responding to ping or ssh connections?

External access from the Internet to Red Cloud instances is blocked by default. Follow the instructions in the Manage Network Access section to enable network access to the instance.

How do I move an image from one Red Cloud account to another account or another cloud?

  1. Pre-requisites:
    1. A host with euca2ools installed: you can use a Red Cloud instance running the standard CentOS 7 image (which has euca2ools installed).
    2. AWS-style credentials to use with euca2ools for both accounts.
  2. Use the following instructions for moving an instance store image or EBS image.

Linux Batch

Scheduler Frequently Asked Questions

Contact Support

Why are you using Maui and Torque now?

We have switched to using a nationally recognized resource manager and scheduler in order to make the usage of our systems align more closely with the national community. This also allows us to leverage the considerable capabilities of the Maui software to ensure optimal and flexible use of our systems.

When's my job going to run?

If you have already submitted your job, use the showstart command to find its estimated start time. If you are trying to decide where to run your job so that it starts soonest, examine the output of the showbf command, which lets you search for when a job with particular resource requirements could run.

Why is my job stuck in the queue?

Sometimes your job doesn't run, even though it looks like it should. Maybe there are few jobs running in the cluster, and your job still won't run.

  1. Find your jobids with "showq -u username"
  2. Use "checkjob -v jobid" to examine one of the jobs. The page Examining Checkjob -v discusses how to read this output.

Jobs in the "Batch Hold" state initiate emails to the system administrators. For other problems, contact CAC help.

Why is my job deferred?

There can be several reasons for a job to defer. Sometimes, when the Maui scheduler's queue is full, two jobs attempt to start on a node at the same time, and one of them switches to being deferred. In that case, if you type "checkjob -v <jobid>", you will see the following message at the bottom:

Message[0] job rejected by RM 'scheduler' - job started on hostlist
compute-3-40.v4linux,compute-3-37.v4linux,compute-3-35.v4linux,compute-3-34.v4linux
at time 13:11:22_07/20, job reported idle at time 13:11:53_07/20 (see RM logs for details)

In this case, the only way to make this job run is to notify help at CAC.

What are the queues/affiliations?

Affiliations was the term used by the vsched scheduler for the queue that jobs were submitted to. Most schedulers use the term "queue" (the scheduler also uses the term "class" for the same entity), so you can substitute the word you prefer. V4 queues are listed on the v4 Linux Cluster page.

When I try to run mpdboot I get an error regarding bad python version

This type of message goes on to say, "You can't run mpdboot on ['compute-3-44.v4linux'] version of python must be >= 2.4, current..." Mpdboot uses python and ssh to start MPI daemons on all nodes of your job. It begins by using ssh to ask what version of python is running on each node.

Usually, this error means that ssh is having a problem establishing communication for the mpds. First, make sure you added "-r ssh" to your mpdboot line. If that looks OK, then try renaming (mv) the .ssh directory in your home directory to something like .ssh_bak. Log out, and log back in. A new .ssh directory should be created for you automatically (you can verify with "ls -la"), and it should have valid keys in it.

You may also get this error if you are using a version of Python that does not work with mpdboot. In general, mpdboot needs Python 2.4 or newer, as the error message states, but it is also very picky about versions newer than 2.4. If you are trying to run Python 2.5 or 2.6 from your own directory, mpdboot will sometimes find only older versions when it uses ssh to reach the other nodes in your job (because a non-interactive ssh can have a different path). One way to make mpdboot run properly in this case is to ensure it uses the system copy of Python. In bash, you can set PATH for a single command by placing the assignment in front of it; here that makes mpdboot use the system Python:

PATH=/usr/bin:/bin:/opt/intel/impi/3.1/bin64/ mpdboot ...
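
The effect of such a one-command assignment can be checked with a harmless stand-in for mpdboot (here "sh -c", used purely for illustration): the child process sees the modified PATH, while the PATH of the calling shell is left as it was.

```shell
# Sketch: a VAR=value prefix changes the environment of that one command only.
before="$PATH"

# The child shell prints the PATH it was started with.
seen_by_child=$(PATH=/usr/bin:/bin sh -c 'printf "%s" "$PATH"')

echo "child saw:   $seen_by_child"
echo "shell still: $PATH"
```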

What variables does PBS define in the job script?

Some of the variables are listed in qsub documentation but a good way to see the working environment is to submit a batch job which just does "env>variables.txt" and look for the ones starting in "PBS_".
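
A minimal sketch of that approach follows. The helper name list_pbs_vars is hypothetical; inside a batch script you would call it and redirect its output to a file of your choosing.

```shell
# Sketch: print only the environment variables whose names start with PBS_.
# list_pbs_vars is a made-up helper name for illustration.
list_pbs_vars() {
    env | grep '^PBS_' | sort
}
```

For example, putting list_pbs_vars > variables.txt in the job script leaves just the PBS_ entries in variables.txt once the job has run.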

No Job Control Warning for CSH and TCSH

The output file from the script starts with the error:

 Warning: no access to tty (Bad file descriptor).
 Thus no job control in this shell.

This warning means that fg, bg, and the ampersand (&) will not work in your script files. If your default user shell is csh or tcsh, the job will try to execute your script using csh or tcsh, and you'll get this warning. Bash doesn't have this problem.

You can force your script to start with the Bash shell using a PBS directive:

#PBS -S /bin/bash

When Torque starts your job, it will now use Bash, but it won't actually read your .bashrc. If you have startup files that modify the path or set other variables, you can add the following to the start of your script, after the PBS directives:

source ~/.bashrc

Another nice way to ensure your favorite variables are defined is to submit the script with the -V option:

nsub -V batch.sh

This option copies whatever environment variables are defined in the shell where you submit the job into the environment of the script when it runs. In short, if you could run something interactively, it should run when the scheduler executes the job.

Mpiexec Won't Accept -ppn Argument

The default MPI, Intel MPI, requires that you put the -ppn argument before the -np argument. The nodes have at least three versions of mpiexec installed. The default is Intel MPI under /opt/intel. If you modify your shell's path, in .bashrc or .cshrc, to put /usr/local/bin before the default path, then you may be getting the OSC mpiexec. This version does not depend on mpdboot. It talks directly with Torque to start jobs. A drawback is that the OSC mpiexec, on our system, cannot start more than one job per node. That's why it's not the default one to use.

I cannot find my output file

If you do not specify an output file when submitting a batch script, then it will automatically produce a file with a name like 110432.scheduler.v4linux.OU in the directory which was the working directory when you submitted your job. If you specify an output file with a command like "#PBS -o out.txt", then that file will be in your $HOME directory. This behavior has changed in recent versions of the scheduler.


Microsoft Visual Studio

Has CAC installed Visual Studio and the Intel compilers on winlogin?

No, not at the present time. This section of the FAQ pertains to Red Cloud users who have installed this software.

Where is nmake?

C:\Program Files\Microsoft Visual Studio\VC98\bin\nmake. Call setup_visualc.bat

How can you find the cl compiler?

Call setup_visualc.bat

Can't find uuid.lib.

It's in C:\Program Files\Microsoft SDK\lib.

LINK fatal error LNK1201: error writing to program database H:\users\...\some.pdb; check for insufficient disk space, invalid path, or insufficient privilege.

Suspicion is that there is an older version of the file some.pdb. Delete that file and rebuild.

How do I use Intel Fortran at the command line?

First, call setup_intelf32.bat. The compilation command is ifort.

Fortran program gives an access violation. What to do? forrtl: severe (157): Program Exception - access violation

Segmentation fault. Look for a place where you are writing past the end of an array.

Fortran program gives stack overflow. What to do? forrtl: severe (170): Program Exception - stack overflow

Increase the space available on the stack with the flag /F<n>, where <n> is the size of the stack in bytes. The default is 1000000. Try /F10000000. Increase as necessary.

What is the command line syntax to compile a Fortran code with OpenMP?

See the info provided by "ifort -h". There are 4 options beginning with /Qopenmp.

Fortran program gives convergence errors when compiled with /O1, /O2, /O3.

Add /Op flag to enable better floating point precision.

For a Fortran code, how do I set up debugging, either for the Release version in VS or at a command prompt?

Let's say you would like to debug an optimized Intel Fortran code, created either as a Release version in Visual Studio (VS) or at a command prompt with /O2. A Debug version in VS sets the correct debugging flags, but disables optimization. Add the command-line flags /Zi /debug:full /traceback to the Release version. Specify the linker option /pdbfile:filename.pdb to create the program database file. This file and the executable must be copied into the same directory when you run the program.

Can the Intel C compiler handle makefile dependencies without having to use cygwin's makedepend?

Yes. You can use the /QMM compiler option, which is OFF by default.
  • /QM - Generates makefile dependency lines for each source file, based on the #include lines found in the source file.
  • /QMD - Preprocess and compile. Generate output file (.d extension) containing dependency information.
  • /QMF file - Generate makefile dependency information in file. Must specify /QM or /QMM.
  • /QMG - Similar to /QM, but treats missing header files as generated files.
  • /QMM - Similar to /QM, but does not include system header files.
  • /QMMD - Similar to /QMD, but does not include system header files.