FAQ
Account
How can I determine the number of hours left on my allocation?
- Check the account management page at [1].
- When logged into one of the v4 linuxlogin nodes, you can run 'showbalance' to view remaining compute time. (If you have jobs currently running, showbalance deducts the time requested for those jobs, and adjusts to the actual time used once they complete.)
How can I obtain a CAC account?
See Project Requests.
My account is locked.
If it was locked after repeated password failures, it should automatically unlock after 30 minutes. Otherwise: Contact Support
I changed my password and now I'm locked out / forgot my password / have problems with my new password / need a password reset.
Contact Support to request a password reset.
Having trouble logging in even though you know your username and password are correct?
To resolve, clear your browser cache and then log in again.
Are my login id and password the same for all machines?
Yes. For an ssh connection, give your login ID at the prompt. With a Windows GUI, specify the User Name as <login_id>@tc.cornell.edu or CTC_ITH\<login_id>.
When I use a Remote Desktop Client to connect to winx64login, it says that my username/password are incorrect.
Make sure that you are logging in using the CTC_ITH domain. If you just put your username in the "username" box, it will try to log you into winx64login as a local user, which won't work. Put CTC_ITH\<username> in the "username" box.
Login
Which machines are the login nodes?
ctclogina, ctcloginb, ctcloginc, ctclogind.
Can't use login machine because of compute-bound processes on the machine.
Can't get to login node with RDC. Times out.
rdesktop gives an error message:
$ rdesktop ctclogina.tc.cornell.edu
ERROR: connect: Connection timed out
A firewall may be blocking outgoing connections.
Can't connect to login node.
Can't log in using ssh.
Try a different type of connection to see whether you need to change your password. Otherwise, send email to useracct@tc.cornell.edu and ask to have your password reset.
Could not get to the scicenter2 machine today, though it worked yesterday.
Use a terminal server (Remote Desktop) connection to a login node to change your password.
Can't get to ctclogina; the error message says "The specified remote computer could not be found."
Use the complete name ctclogina.tc.cornell.edu.
mpirun command not found on login node.
This is expected. Don't run jobs on the login nodes.
I connected from a login node to a batch node, then disconnected from the login node. On reconnecting, the session hung; I can't close the window or log off.
I have a disconnected session on a login node. When I reconnect, the login screen is blank. What should I do?
Press Ctrl+Shift+Esc to bring up Task Manager. Select the Applications tab, then New Task. Enter "explorer" and click OK. A normal desktop should reappear. If it doesn't, send e-mail to consult@tc.cornell.edu and ask to be logged off.
I have a login process on ctcloginb that I cannot log off.
Can I debug with Visual Studio on the login nodes?
No; debugging is not permitted on the login nodes. Consider using the collaboratory instead.
Can't use rdc to login node.
Can't close command windows on login node.
Why does RDC to ctclogina fail?
It could be that you need to use the fully qualified name ctclogina.tc.cornell.edu.
How Do I Connect to CAC Machines to Run Programs?
There are two options:
- Command-line access - SecureShell (Windows or Linux clusters, from any computer)
  - More efficient over slower network connections.
  - X-Windows provides pointing-and-clicking, if you want it.
- Work with a desktop of the CAC computer - RemoteDesktop (to Windows), VNCAccess (to Linux)
  - May be more familiar for moving files and editing.
  - You will still have to use the command line to submit jobs.
Batch
My batch job includes vii0047, but I can't login and MPI/Pro says: MPI/Pro error: Failed to login the user on server: vii0047.tc.cornell.edu System Error: Logon failure: the user has not been granted the requested logon type at this com.
Output from batch is not copied to H:.
Allocated 2 nodes, only allowed to use remote desktop connection to master node.
MPI/Pro Error:SocketException System Error:No connection could be made because the target machine actively refused it.
Copied files to T:\%USERNAME%, but job doesn't give output.
You must cd to T:\%USERNAME% before running the job.
Can't move or delete some files in T: on some batch nodes.
Batch jobs that just disappear from queue, having done nothing.
The user had set some parameters with a space before and after the = sign, putting a trailing space on the parameter value. Remove the spaces.
What can users do about the long time it takes for jobs to clear?
See the "MPI Cleanup" tip at http://www.tc.cornell.edu/services/support/batch/faster_cleanup.asp
Is there a way to make the copyback.bat file (which copies the output files periodically) copy output from all the nodes to the H: drive?
Yes:
start /b mpirun -np N parallel_copyback.bat
I need to have different files for each process. How do I do this? I had problems doing it with a system call in a C++ program.
As part of your setup file, use these commands:
cd /D T:
del /Q T:\%USERNAME%
mkdir T:\%USERNAME%\%MSTI_RANK%
copy files.* T:\%USERNAME%\%MSTI_RANK%
Jobs are stuck clearing.
How can I direct jobs from remote machines to CAC for batch processing? This requires software on the CAC batch nodes.
We can do this; contact us and we will explain how.
What do I need to do to use v3?
See http://www.tc.cornell.edu/Services/Policies/Pages/usage.htm
Copy of executable and input files failed on vi0004.
System problem. Contact Support.
I have an error about the path when connecting to a batch machine.
Check your userlogin.bat file. There may be a reference to Visual Studio (VS) in userlogin.bat, but VS is not on the batch nodes. Change the syntax to "call setup_visualc.bat" or call a different setup file as appropriate.
Can I telnet to batch machines?
No. You need to use a remote desktop connection from a login node to login interactively to a machine on which you have a job running.
Files
How can I copy files to my desktop from H:?
Use an SSH client to sftp the files. See File_Transfer_To_Clusters.
Can't use scp to transfer files to the CAC.
Use sftp.
Problems using WinSCP.
Use sftp; a generic session is sketched below.
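A generic sftp session looks like this (the host is whichever login node you normally use; file names are hypothetical):
sftp <login_id>@<login node>
sftp> put local_file.dat
sftp> get results.dat
sftp> quit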
We can also show you how to use the outgoing ftp folder; detailed instructions are available by email.
Can't access files.
System problem. Send email to consult@tc.cornell.edu.
I can see files in Explorer, but at a command prompt dir shows files only in my home directory.
The user had navigated Start | Run, then typed command. Use cmd instead.
How Do I Transfer Files To and From CAC Machines?
- Use a program to send them - SecureShell
  - Faster over slower connections.
  - Less hassle.
- Make your CAC home directory look like a local drive - FileAccess
  - Works fine on campus.
  - Convenient for editing.
If you have any questions, please contact us through the Web site, send email, or call 607.254.8686.
Why use a temporary directory?
It is faster to perform local file I/O and copy complete data files to/from $HOME at the beginning and the end of the job, rather than perform I/O over the network ($HOME is network mounted on the compute nodes).
- Torque creates a uniquely named directory (/tmp/$PBS_JOBID) when a job starts and stores the path of this directory in the $TMPDIR environment variable. This directory is cleaned up when the job exits.
- To use this feature, reference $TMPDIR in your job script.
- You may create directories for file reads/writes outside your /tmp/$PBS_JOBID directory in /tmp, but you risk losing any data left there; it may be deleted at any time we see /tmp getting full. A minimal staging sketch follows this list.
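Here is a minimal job-script sketch of the staging pattern described above (program and file names are hypothetical):
#PBS -N tmpdir-staging
# Stage input from network-mounted $HOME to the node-local $TMPDIR
cp $HOME/input.dat $TMPDIR/
cd $TMPDIR
# Fast local I/O while the program runs
./my_program input.dat > output.dat
# Copy results back before the job exits and Torque removes $TMPDIR
cp output.dat $HOME/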
Red Cloud
How secure is Red Cloud?
Red Cloud Security
Red Cloud, CAC's infrastructure as a service cloud, runs Eucalyptus cloud management software. Because Eucalyptus implements an Amazon Web Service (AWS) compatible private cloud, Red Cloud's security model follows closely after AWS.
User Interface and API
User Authentication
Red Cloud accepts two types of user authentication: passwords and AWS-style keys consisting of 2 randomly generated strings. Users log into the web management console using passwords. The user name and password are authenticated against CAC's Active Directory via Kerberos. For making AWS-compatible API calls, users can obtain their keys from the web console. All API calls are SSL encrypted, as are web console sessions.
User Access Management
Eucalyptus fully implements AWS's Identity and Access Management (IAM) features. Group and user policies can be used to control access on a per-resource and per-API-call basis. See AWS's IAM documentation for details.
Instance Access Control
Red Cloud runs Eucalyptus in "Managed" mode to implement the security group and elastic IP address features described below. In Managed mode, all user data passed within the cloud infrastructure is VLAN tagged according to the security groups. The network switch connecting the cloud controller and the physical nodes running the instances performs layer 2 switching, guaranteeing network isolation between security groups. Instances have no access to network packets belonging to instances outside their own security groups.
To provide elastic IP addresses, Eucalyptus configures iptables running on the cloud controller host to perform the required source and destination network address translation (SNAT and DNAT).
These features are implemented in the Red Cloud infrastructure, independent of any configuration users apply to their instances.
Security Group
Each instance (virtual machine) is assigned a security group at launch time. A security group is a private network in the cloud where network access between instances in the same security group is unrestricted.
Access to an instance from outside its security group is subject to the group's access rules. Users can define the access rules by protocol, source IP address and destination port.
Instances have unrestricted outbound access to the Internet.
Elastic IP Address
Each instance is assigned a private IP address belonging to its security group at launch time. An ephemeral routable public IP address is also assigned so the instance can be accessed from the Internet. Users can optionally reserve fixed public IP addresses that they can assign to their instances. Assigning a reserved public address to a running instance takes just a few seconds and does not require rebooting the instance.
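With euca2ools, for example, reserving and assigning an address looks like this (the instance ID and address shown are hypothetical):
euca-allocate-address
ADDRESS 128.84.8.150
euca-associate-address -i i-12345678 128.84.8.150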
Cloud Infrastructure
Cloud infrastructure hosts (cloud controller, storage controller, and the physical nodes running the instances) run CentOS 6 Linux distribution on a private network isolated from cloud user traffic.
I have attached a volume to my instance. How do I see it?
Attached volumes show up as block devices (i.e. directly attached disks) from inside the instance.
- You can see the attached volume using the "lsblk" command inside the instance.
- Then you can format the disk with the file system of your choice like this: mkfs -t <file system> <block device name>, e.g. mkfs -t ext3 /dev/vdc.
- Mount the file system: mkdir /mnt/data0; mount /dev/vdc /mnt/data0
How do I give my virtual server a domain name?
A virtual server in Red Cloud is randomly assigned an IP address from 128.84.8.101 to 128.84.8.196 every time it is booted. If you want a consistent domain name for your virtual server (e.g. mycloudserver.cac.cornell.edu), follow the instructions on the Using Dynamic DNS with Red Cloud page.
Why won't ssh let me log in to my virtual server?
- You may not have given your instance a keypair for root access when you started it up. You should always use the -k option to assign one of the keypairs named in euca-describe-keypairs to your instance:
euca-run-instances -n 1 -k mykey [...etc...]
- If you get a response that looks like this:
-sh-3.2$ ssh -X -i mykey.private root@128.84.8.105
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is...
...most likely this means that the numeric IP address for your instance (128.84.8.105 in the above example) has been assigned to you previously for a different instance. For a typical Linux ssh client, the way to fix this is to edit ~/.ssh/known_hosts on your machine, deleting the line that contains the re-used numeric IP address together with its old RSA fingerprint. For an ssh client on Windows or Mac, you might need to consult the documentation for that particular client.
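On most OpenSSH clients there is also a one-command shortcut for removing the stale entry (using the example IP above):
ssh-keygen -R 128.84.8.105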
Can I create clones of my office workstation in Red Cloud?
Yes, if you have a Linux workstation. Your goal is to take an image of your hard disk drive and combine it with a kernel and a ramdisk (from an official Eucalyptus image) that give it virtio and iscsi support. Here is an outline of what to do:
- Check euca-describe-images. If your preferred Linux kernel doesn't appear in our pre-registered list of kernels, please contact us.
- Make sure the appropriate kernel modules for your chosen kernel and ramdisk are preloaded in your /lib/modules directory.
- Use dd to capture your hard disk into an image (a sketch follows this list). Here are some helpful links: Linux Backup: Hard Disk Clone with "dd", and Image Your Hard Drive using dd.
- Transfer your workstation's root disk image to cloud-login. This could take some time, assuming your disk image has a typical size in the multi-GB range.
- Follow the procedure for Uploading a Root Disk Image in the Red Cloud user guide.
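As a sketch of the dd capture step above (the source device /dev/sda is an assumption; verify yours with lsblk, and capture from a live/rescue environment so the disk is not in use):
dd if=/dev/sda of=workstation-root.img bs=4M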
How do I migrate my data and image from Eucalyptus 3?
When to migrate
If there is little customization in your instance or image, it may be easier to copy over files using rsync or sftp, which have widely available documentation online and in your system's man pages. However, if your instance has been highly customized and you think it may take more than a few hours to recreate your system, then you may wish to follow the migration instructions below. If you wish to migrate but do not wish to perform the migration yourself, consulting is available.
Migrate Image from Euca 3
EBS Image
In Euca 3 Cloud
- Find the snapshot ID of the EBS image you are migrating:
[shl1@localhost shl1]$ euca-describe-images emi-5A38465A
IMAGE emi-5A38465A 448419271023/centos-6-ebs 448419271023 available public x86_64 machine ebs hvm
BLOCKDEVICEMAPPING /dev/sdb
BLOCKDEVICEMAPPING /dev/sda snap-F3FA421D 10 true
- If you want to migrate an existing EBS instance,
- Stop your EBS instance, and
- Take a snapshot of the EBS volume that hosts your instance's root disk.
- Create a volume from the snapshot:
[shl1@localhost shl1]$ euca-create-volume --snapshot snap-F3FA421D -z redcloud
VOLUME vol-42E23E2E 10 snap-F3FA421D redcloud creating 2014-10-02T18:52:51.868Z
[shl1@localhost shl1]$ euca-describe-volumes vol-42E23E2E
VOLUME vol-42E23E2E 10 snap-F3FA421D redcloud available 2014-10-02T18:52:51.868Z standard
- Attach the volume to an Euca 3 instance:
[shl1@localhost shl1]$ euca-attach-volume -i i-0DBC41A2 -d /dev/vdb vol-42E23E2E
ATTACHMENT vol-42E23E2E i-0DBC41A2 /dev/vdb attaching 2
- ssh into the Euca 3 instance to which the volume is attached.
[shl1@localhost shl1]$ ssh -i ~/euca3/mykey.private root@128.84.9.138
Last login: Thu Apr 3 14:22:37 2014 from stevenleemac.cac.cornell.edu
[root@euca-172-16-173-49 ~]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0  100G  0 disk
├─vda1 253:1    0   10G  0 part /
├─vda2 253:2    0 89.5G  0 part
└─vda3 253:3    0  512M  0 part [SWAP]
vdb    253:16   0   10G  0 disk
└─vdb1 253:17   0   10G  0 part
[root@euca-172-16-173-49 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.9G  1.2G  8.3G  13% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
/dev/vda2        89G  184M   84G   1% /mnt/ephemeral0
- If your volume is from a snapshot of a live instance's root disk running CentOS 6 or 7, or Ubuntu, mount the root file system on the EBS volume somewhere in the instance and remove its etc/udev/rules.d/70-persistent-net.rules file, or your migrated image will not be able to boot the instance in Euca 4 (a sketch follows). Unmount the root file system on the EBS volume before proceeding to the next step.
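A sketch of that cleanup step, assuming the root file system is the first partition of the attached volume (/dev/vdb1; verify with lsblk):
[root@euca-172-16-173-49 ~]# mkdir /mnt/ebsroot
[root@euca-172-16-173-49 ~]# mount /dev/vdb1 /mnt/ebsroot
[root@euca-172-16-173-49 ~]# rm -f /mnt/ebsroot/etc/udev/rules.d/70-persistent-net.rules
[root@euca-172-16-173-49 ~]# umount /mnt/ebsroot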
- Create an img file of the attached EBS volume:
[root@euca-172-16-173-49 ~]# cd /mnt/ephemeral0/
[root@euca-172-16-173-49 ephemeral0]# dd if=/dev/vdb of=./centos-6.img
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 332.985 s, 32.2 MB/s
- Copy the .img file to an Euca 4 instance using the "rsync --sparse" command:
[root@euca-172-16-173-49 ephemeral0]# rsync --sparse centos-6.img shl1@128.84.8.100:/localdisk/shl1/
- Note: to log in as root with ssh, you will want to continue using the same key that was used in Euca 3, as this is not updated in the above process.
In Euca 4
- Create an EBS volume matching the size of the EBS image:
[shl1@ip-128-84-11-214 shl1]$ euca-create-volume -s 10 -z redcloud-ith
VOLUME vol-317a4980 10 redcloud-ith creating 2014-10-02T19:49:45.283Z
[shl1@ip-128-84-11-214 shl1]$ euca-describe-volumes vol-317a4980
VOLUME vol-317a4980 10 redcloud-ith available 2014-10-02T19:49:45.283Z standard
- Attach the EBS volume to an Euca 4 instance:
[shl1@ip-128-84-11-214 shl1]$ euca-attach-volume -i i-dfbb062f -d /dev/vdc vol-317a4980
ATTACHMENT vol-317a4980 i-dfbb062f /dev/vdc attaching 2014-10-03T14:05:01.694Z
[shl1@ip-128-84-11-214 shl1]$ ls /dev/vd*
/dev/vda /dev/vda1 /dev/vda2 /dev/vdb /dev/vdb1 /dev/vdb2 /dev/vdc
- As root in the Euca 4 instance, use dd to restore the image from Euca 3 to the attached volume. Detach the volume from the instance when done.
-bash-4.1# dd if=./centos-6.img of=/dev/vdc
and then:
[shl1@ip-128-84-11-214 shl1]$ euca-detach-volume vol-317a4980
ATTACHMENT vol-317a4980 i-dfbb062f /dev/vdc detaching 2014-10-03T14:05:01.697Z
[shl1@ip-128-84-11-214 shl1]$ euca-describe-volumes vol-317a4980
VOLUME vol-317a4980 10 redcloud-ith available 2014-10-02T19:49:45.283Z standard
- Take a snapshot of the volume:
[shl1@ip-128-84-11-214 shl1]$ euca-create-snapshot -d "CentOS 6 Image from Euca 3" vol-317a4980
SNAPSHOT snap-cdcc5768 vol-317a4980 pending 2014-10-03T14:38:11.727Z 904951369483 10 CentOS 6 Image from Euca 3
[shl1@ip-128-84-11-214 shl1]$ euca-describe-snapshots snap-cdcc5768
SNAPSHOT snap-cdcc5768 vol-317a4980 completed 2014-10-03T14:38:11.727Z 100% 904951369483 10 CentOS 6 Image from Euca 3
- Register the image:
[shl1@ip-128-84-11-214 shl1]$ euca-register -n centos-6-ebs-from-euca-3 -a x86_64 -d "CentOS 6 EBS image from Euca 3" -b /dev/sdb=ephemeral0 -s snap-cdcc5768
IMAGE emi-655661f2
[shl1@ip-128-84-11-214 shl1]$ euca-describe-images
IMAGE emi-655661f2 904951369483/centos-6-ebs-from-euca-3 904951369483 available private x86_64 machine ebs hvm
BLOCKDEVICEMAPPING EPHEMERAL /dev/sdb ephemeral0
BLOCKDEVICEMAPPING EBS /dev/sda snap-cdcc5768 10 true
Instance Store Image
In Euca 3 Cloud
- On a host with euca2ools 3.0.x installed, or an m1.small instance running emi-E2A53625 (CentOS 6.6 with euca2ools), download and unbundle the instance-store image you want to migrate to Euca 4. You should get a .img file after running the euca-unbundle command.
source <path to your Euca 3 credentials>/eucarc
euca-download-bundle -b <bucket name> -m <manifest> -d <working directory>
euca-unbundle -m <manifest> -s <working directory> -d <working directory>
- Copy the .img file to an instance running the same image (or launch a new one with the same image).
- Log into that instance as root.
- Mount the .img file you just copied over somewhere.
mkdir /mnt/target
mount -o loop <image file> /mnt/target
- Install the kernel version for your distribution as listed in the following table. For example, for CentOS 5:
yum --installroot=/mnt/target install kernel-2.6.18-400.1.1.el5.x86_64
Distribution | Kernel Version
CentOS 5 | 2.6.18-400.1.1.el5.x86_64
CentOS 6 | 2.6.32-504.8.1.el6.x86_64
CentOS 7 | 3.10.0-123.20.1.el7.x86_64
Ubuntu 14.04 | 3.13.0-44-generic
- Unmount the image.
umount /mnt/target
In Euca 4 Cloud
- Copy the image to a host running euca2ools 3.1.2 to upload to Euca 4 cloud. This host could be an instance in Euca 4 running emi-4404c688 image.
- Bundle and upload the image to Euca 4. Use the eki and eri IDs listed in the following table:
euca-bundle-image -i <image file> --kernel <eki> --ramdisk <eri> -r x86_64 -d <working directory path>
euca-upload-bundle -b <bucket name> -m <manifest>
euca-register -n <image name> -a x86_64 <manifest path from the euca-upload-bundle command>
Distribution | Kernel Image | RAM Disk Image
CentOS 5 | eki-076589e4 | eri-819f562a
CentOS 6 | eki-a04f8296 | eri-683a4412
CentOS 7 | eki-08e13f6f | eri-f0e9b392
Ubuntu 14.04 | eki-8609afc5 | eri-01ed05dd
- Launch your Euca 4 instance with the migrated image. The image will need to be converted the first time it is run, so this will take a few minutes. euca-describe-instances <instance ID> will show the progress of the image conversion in the instance's tags:
TAG instance i-daba2262 euca:image-conversion-state active
TAG instance i-daba2262 euca:image-conversion-status Converting images
Red Cloud and Amazon Web Services (AWS)
How do I migrate an Amazon Web Services (AWS) EC2 image to Red Cloud?
- Download the bundle from AWS and decrypt it. You will end up with an image file:
ec2-download-bundle -b <S3 bucket name> -d .
ec2-unbundle -s . -d . <manifest>
- Mount this image somewhere using the "mount -o loop" option.
- Edit <image mount point>/etc/fstab and <image mount point>/etc/grub.conf such that the root disk is /dev/vda instead of /dev/xvde used by AWS.
- Download the tarball corresponding to your Linux distribution here. Unpack the tarball in /lib/modules:
cd <image mount point>/lib/modules; tar jxvf <path to the tarball>
- Unmount the image.
- Bundle image for Red Cloud:
euca-bundle-image -i <path to image file> -d <working directory> --kernel <eki> --ramdisk <eri>
- Use the following <eki> and <eri> according to your Linux distribution:
- CentOS 5.10: eki-CE97382C and eri-91003AD3
- CentOS 6.5: eki-921637A4 and eri-52B4381E
- Upload the bundle to Red Cloud:
euca-upload-bundle -b <bucket name> -m <manifest from the previous euca-bundle-image command>
- Register the image in Red Cloud:
euca-register -a x86_64 <bucket name>/<manifest>
How do I migrate a Red Cloud (instance store) image to Amazon Web Services (AWS)?
- Download the bundle from Red Cloud and decrypt it. You will end up with an img file.
euca-download-bundle -b <bucket name> -d .
euca-unbundle -s . -d . <manifest>
- Mount this image somewhere using the "mount -o loop" option.
- Edit /etc/fstab and /etc/grub.conf such that the root disk is /dev/xvde instead of /dev/vda like on Red Cloud.
- Make sure your instance store image has the latest and greatest CentOS 6 kernel installed. If not, run:
yum --installroot <mount point of the image> install kernel
- Check <mount point of the image>/etc/grub.conf to make sure it looks right to you. Add "console=hvc0" to the end of the kernel line (reference: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UserProvidedKernels.html)
- Unmount the image.
- Create an AWS bundle and upload it using "ec2-bundle-image" and "ec2-upload-bundle" commands.
- According to this article, you will want to register the image with kernel aki-919dcaf8 in your ec2-register -k command, assuming you want to run it in the us-east-1 region. Otherwise, select the appropriate aki ID for the region where you want to run.
euca-describe-instances or the web console says my instance is running, but why is it not responding to ping or ssh connections?
External access from the Internet to Red Cloud instances is blocked by default. Follow the instructions in the Manage Network Access section to enable network access to the instance.
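For example, with euca2ools you can open ssh (TCP port 22) to all instances in a security group; the group name "default" here is only an example:
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 default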
Red Cloud with Matlab
Why does the cacscheduler configuration seem to fail MATLAB's built-in validation test?
MATLAB allows you to "validate" a configuration via the Parallel > Manage Configurations menu. If you do this for cacscheduler, the first few steps will work fine. But upon reaching the Parallel section of the procedure the validation will appear to fail with a message like, "Please set the maximum number of workers to a finite value prior to submission of a parallel job." This is expected behavior. The CAC parallel configuration purposely specifies a ClusterSize of Inf (infinity). This allows for total flexibility in adding or subtracting hardware from the resource and/or its various queues over time.
In spite of this alarming-looking message from MATLAB's built-in test, it is not at all a sign that your setup is somehow defective. To make the full validation procedure work, simply do the following: in the Parallel menu, go to Manage Configurations, then double-click "cacscheduler". This will allow you to edit the CAC configuration directly. Change the ClusterSize parameter from Inf to (say) 4 and click "Save". Re-run the validation; you should find that cacscheduler now passes. When you're done with the test, change the ClusterSize back to Inf and save again.
From the above, it should be clear that in any parallel job you submit, you'll want to set job.MaximumNumberOfWorkers appropriately for the queue to which you are submitting your job.
Generally, if you are concerned about whether you have a working configuration, it's best to try running cac_runtests. This will test more aspects of Red Cloud's functionality.
How many MATLAB workers can you use at a time?
The answer depends on both the job type and the queue to which you submit.
For a parallel job, the workers must all be able to communicate with each other; therefore, the max size is limited to the number of cores that are present in your chosen queue. In the Default queue, there are 52 cores, so you could have up to 52 parallel workers in there. In the Quick queue, the max is 4; in the GPU queue, it's 8.
For a pool job, the max is again limited by the number of cores. But bear in mind that one worker must take the place of your local MATLAB session, and its only role is to run the main job function for you. This means that the matlabpool size will be 1 less than the number of workers. Therefore, in the Default queue the max is 51 labs (52 workers), in Quick it's 3 labs, and in GPU it's 7 labs.
For a distributed job, there is no limit, essentially. All tasks are independent, and they will make their way through any of the queues singly--task by task by task--until the list is exhausted. However, at any given instant, a maximum of 52 of these tasks could be running simultaneously in the Default queue, etc.
How do I save extremely large arrays in my pool job?
In a pool job, if a distributed array is too large to gather() into the memory of the master process, you can save() it piecemeal from within a spmd or parfor loop, using the technique described here. Subsequently, you may reassemble the array on your local workstation, after transferring its various pieces via gridFTP.
Linux Batch
Scheduler Frequently Asked Questions
Why are you using Maui and Torque now?
We have switched to using a nationally recognized resource manager and scheduler in order to make the usage of our systems align more closely with the national community. This also allows us to leverage the considerable capabilities of the Maui software to ensure optimal and flexible use of our systems.
When's my job going to run?
If you have already submitted your job, use the showstart command to find its estimated start time. If you are trying to decide where to run your job so that it starts soonest, examine the showbf command, which lets you search for when a job with particular resource requirements could run. An example follows.
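For example (the job ID and resource numbers are hypothetical):
showstart 110432          # estimated start time for job 110432
showbf -n 4 -d 2:00:00    # windows where 4 nodes are free for 2 hours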
Why is my job stuck in the queue?
Sometimes your job doesn't run, even though it looks like it should. Maybe there are few jobs running in the cluster, and your job still won't run.
- Find your jobids with "showq -u username"
- Use "checkjob -v jobid" to examine one of the jobs. Examining Checkjob -v discusses how to read this output.
Jobs in the "Batch Hold" state initiate emails to the system administrators. For other problems, contact CAC help.
Why is my job deferred?
There can be several reasons for a job to be deferred. Sometimes when the Maui scheduler's queue is full, two jobs attempt to start on a node at the same time, and one of them is switched to deferred. In that case, if you type "checkjob -v <jobid>", you will see, at the bottom, the message:
Message[0] job rejected by RM 'scheduler' - job started on hostlist compute-3-40.v4linux,compute-3-37.v4linux,compute-3-35.v4linux,compute-3-34.v4linux at time 13:11:22_07/20, job reported idle at time 13:11:53_07/20 (see RM logs for details)
In this case, the only way to make this job run is to notify help at CAC.
What are the queues/affiliations?
Affiliations was the term used by the vsched scheduler for the name of the queue that jobs were submitted to. Most schedulers use the term queue (this scheduler also uses the term "class" for the same entity), so you can substitute whichever word you prefer. V4 queues are listed on the v4 Linux Cluster page.
When I try to run mpdboot I get an error regarding bad python version
This type of message goes on to say, "You can't run mpdboot on ['compute-3-44.v4linux'] version of python must be >= 2.4, current..." mpdboot uses python and ssh to start MPI daemons on all nodes of your job. It begins by using ssh to ask what version of python is running on each node.
Usually, this error means that ssh is having a problem establishing communication for the mpds. First, make sure you added "-r ssh" to your mpdboot line. If that looks OK, then try to rename (mv) the .ssh directory in your home directory to something like .ssh_bak. Log out, and log back in. A new .ssh directory should be recreated for you automatically (you can verify with "ls -la") which should have valid keys in it.
You may also get this error if you are using a version of Python which does not work with mpdboot. In general, mpdboot needs python 2.3 or newer, but it gets very picky about versions newer than 2.4, as well. If you are trying to run Python 2.5 or 2.6 from your own directory, sometimes mpdboot will find only older versions when it does ssh to the other nodes in your job (because a non-interactive ssh can have a different path). One way to ensure mpdboot runs properly in this case is to make it use the system copy of python. In bash, you can set the path for a single command by prefixing the invocation, as shown here, so that the system Python is used:
PATH=/usr/bin:/bin:/opt/intel/impi/3.1/bin64/ mpdboot ...
What variables does PBS define in the job script?
Some of the variables are listed in the qsub documentation, but a good way to see the working environment is to submit a batch job which just does "env > variables.txt" and look for the variables starting with "PBS_". A minimal sketch follows.
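A minimal sketch of such a job script (the output file name is arbitrary):
#PBS -N show-env
env | grep '^PBS_' > $HOME/pbs_variables.txt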
No Job Control Warning for CSH and TCSH
The output file from the script starts with the error:
Warning: no access to tty (Bad file descriptor). Thus no job control in this shell.
This warning means that fg, bg, and the & (background) operator will not work in your script. If your default user shell is csh or tcsh, the job will try to execute your script using csh or tcsh, and you'll get this warning. Bash doesn't have this problem.
You can force your script to start with the Bash shell using a PBS directive:
#PBS -S /bin/sh
When Torque starts your job, it will now use Bash, but it won't actually call your .bashrc. If you have any startup files to modify the path or set other variables, you can add to the start of your script, after the PBS directives:
source ~/.bashrc
Another nice way to ensure your favorite variables are defined is to submit the script with the -V option:
nsub -V batch.sh
This option copies whatever environment variables you have defined in your command-line session to the script's environment when it runs. In short, if you can run something interactively, it should run when the scheduler executes the job.
Mpiexec Won't Accept -ppn Argument
The default MPI, Intel MPI, requires that you put the -ppn argument before the -np argument (see the example below). The nodes have at least three versions of mpiexec installed. The default is Intel MPI under /opt/intel. If you modify your shell's path, in .bashrc or .cshrc, to put /usr/local/bin before the default path, then you may be getting the OSC mpiexec. That version does not depend on mpdboot; it talks directly with Torque to start jobs. A drawback is that the OSC mpiexec, on our system, cannot start more than one job per node. That's why it's not the default.
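For example, to run 16 ranks with 4 per node under the default Intel MPI (the program name is hypothetical):
mpiexec -ppn 4 -np 16 ./my_program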
I cannot find my output file
If you do not specify an output file when submitting a batch script, it will automatically produce a file with a name like 110432.scheduler.v4linux.OU in the directory that was the working directory when you submitted your job. If you specify an output file with a directive like "#PBS -o out.txt", then that file will be placed in your $HOME directory. This behavior has changed in recent versions of the scheduler.
Compilers
Where is nmake?
C:\Program Files\Microsoft Visual Studio\VC98\bin\nmake. Call setup_visualc.bat
How can you find the cl compiler?
Call setup_visualc.bat
forrtl: severe (157): Program Exception - access violation
This is a segmentation fault. Look for a place where your code writes more data than was declared, e.g. past the end of an array.
Trouble with stack overflow in a Compaq Visual Fortran program.
Increase the stack reserve quota, through a flag to nmake or using editbin.
Intel 8.1 compiler gives a stack overflow where Intel 7.1 was fine. What to do? The error is: 0: forrtl: severe (170): Program Exception - stack overflow
Increase the space available on the stack with the flag /F<n>, where <n> is the size of the stack in bytes. The default is 1000000. Try /F10000000 and increase as necessary.
Can't find uuid.lib.
It's in C:\Program Files\Microsoft SDK\lib on the login nodes.
LINK fatal error LNK1201: error writing to program database H:\users\...\some.pdb; check for insufficient disk space, invalid path, or insufficient privilege.
Most likely there is a stale older version of the file some.pdb. Delete that file and rebuild.
How do I use Intel Fortran at the command line?
First, call setup_intelf32.bat. The compilation command is ifort.
What is the command line syntax to compile with OpenMP?
See the info provided by "ifort -h". There are 4 options beginning with /Qopenmp.
Does the CAC have a tutorial on OpenMP with Fortran?
No, we don't. The focus is on MPI.
Getting convergence errors with Intel 8.1 Fortran with /O1, /O2, or /O3, but the answer comes out OK and performance is not obviously degraded. How can I fix this so that I don't get the errors?
Add the /Op flag to enable better floating-point precision. The convergence errors should disappear.
I would like to debug an optimized Intel Fortran code, compiled with a flag such as /O2 , created either as a Release version in Visual Studio (VS) or at a command prompt. A Debug version in VS sets the correct debugging flags, but disables optimization. How do I set the appropriate debugging environment for a Release version in VS or at a command prompt?
Add the command-line flags /Zi /debug:full /traceback. Specify the linker option /pdbfile:filename.pdb to create the program database file. This file and the executable must be copied to the same directory on T: when you run the program. An example command line follows.
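A command-line sketch combining those flags (program and .pdb names are hypothetical; check the ifort documentation for exact option placement on your compiler version):
ifort /O2 /Zi /debug:full /traceback /pdbfile:myprog.pdb myprog.f90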
Can the Intel C compiler handle makefile dependencies without having to use cygwin's makedepend?
Yes. You can use the /QMM compiler option, which is OFF by default.
- /QM - Generates makefile dependency lines for each source file, based on the #include lines found in the source file.
- /QMD - Preprocess and compile. Generate output file (.d extension) containing dependency information.
- /QMF file - Generate makefile dependency information in file. Must specify /QM or /QMM.
- /QMG - Similar to /QM, but treats missing header files as generated files.
- /QMM - Similar to /QM, but does not include system header files.
- /QMMD - Similar to /QMD, but does not include system header files.
Network Drive
H: Network Drive
Mapping H:, can't see files.
Make sure that the DNS settings are correct. Look under Home_Directory_Access for DNS instructions.
Can't map H: any more. Nothing changed.
Could be that the password had expired. Connect to login node with RDC to change password, then map drive.
Can't find H:.
Send e-mail to consult@tc.cornell.edu.
Problems mapping H:. Can see files in CAC Tools but not home directory.
Disconnect H: and remap.
At home, I can see my home directory, but no files.
Only certain domains can map H:; from off campus you need the VPN.
Can't see the files in one of his directories.
Permissions problem. Send email to useracct@tc.cornell.edu
Mapping H: with correct DNS settings, but can't see files.
Send email to consult@tc.cornell.edu.
Cannot see files on H:
Send email to consult@tc.cornell.edu.
User can now map the drive but cannot enter the directory. The files are located on ctcfsrv8\tc_k.
The user needs correct DNS settings; this was resolved by pointing to 128.84.5.28 (ctcfsrv8) in the hosts file.
Can't map H: with DFS in Mac OS X.
Mac users need to obtain Thursby software to map H:.
Using rover, pointing to the CTC WINS server does not allow files to be seen when mapping H:.
Try mapping ctcfsrv8 directly, which is where the files are; this worked. The DNS settings can't be used with rover unless you use the VPN, since rover addresses aren't trusted the way Cornell IP addresses are.
Web services
User wants access to CAC web space for a personal web page.
This is available only for CAC personnel.
Old links break on new CAC web site.
Navigate from the CAC home page.