Red Cloud Windows GPU instances

From CAC Documentation wiki
Revision as of 09:31, 19 November 2021 by Cjc73

This page describes how to set up CUDA-enabled pytorch on a Red Cloud GPU instance running Windows. These instructions were tested with the Windows Server 2019 base image on the c4.t1.m20 Red Cloud VM flavor and with the Windows Server 2016 base image on the c14.g1.m60 Red Cloud VM flavor. The boot volume should be increased above the 50 GB default size: 100 GB was plenty, and 80 GB might have been sufficient.

Before you begin, please note:

1. This procedure requires an account with administrator privileges.

2. At the reboot steps, use the Horizon web interface to POWER OFF the instance and power it on again to guarantee that the memory is completely cleared. A simple reboot is not always sufficient.

Install Visual Studio 2019 (Optional)

Installing Visual Studio is optional but prevents installation warnings during the CUDA step. The CUDA installer wants Visual Studio 2019 (version 2022 does not satisfy the check as of 2021-11-18). Locate the installer at docs.microsoft.com under Visual Studio 2019.

Install the NVIDIA Drivers for the GPU

Install the NVIDIA drivers from the NVIDIA driver download page. Select the following options depending on the instance flavor:

NVIDIA Download Selection Options

Option             c4.t1.m20                      c14.g1.m60
Product Type:      Data Center / Tesla            Data Center / Tesla
Product Series:    T-Series                       V-Series
Product:           Tesla T4                       Tesla V100
Operating System:  Windows Server 2016 or 2019    Windows Server 2016 or 2019
                   (match the instance OS)        (match the instance OS)
CUDA Toolkit:      11.4                           11.4
Language:          English (US)                   English (US)

Once the installer has downloaded, run the installer. The "express" option, which installs all available items, seems to work.

Restart and power cycle the instance

Begin by rebooting the virtual machine using the operating system on the VM. Allow any boot-time installations to complete. Once the machine is fully rebooted, use the Red Cloud web console (Horizon) to shut down the instance. When this step has completed, use the web console to power on the instance.
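If the OpenStack command-line client is installed and configured for your Red Cloud project, the same power cycle can be done from a shell instead of the Horizon web console. This is a sketch only; "my-gpu-instance" is a placeholder for your instance's actual name or ID.

```shell
# Power off the instance completely (not a soft reboot).
# "my-gpu-instance" is a placeholder for your instance name or ID.
openstack server stop my-gpu-instance

# Wait until the status reads SHUTOFF before continuing:
openstack server show my-gpu-instance -c status

# Then power the instance back on:
openstack server start my-gpu-instance
```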

Install the CUDA Drivers and tools

Install the CUDA drivers and tools from https://developer.nvidia.com/cuda-downloads. These instructions were tested with CUDA Toolkit 11.5.

CUDA Download Selection Options
Operating System: Windows
Architecture: x86_64
Version: Server 2016 or Server 2019 (match instance OS)
Installer Type: exe (network)

Once the installer has downloaded, run the installer. The "express" option, which installs all available items, seems to work. If the installer complains about not finding Visual Studio, select the option to proceed anyway.
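To confirm the CUDA toolkit installed correctly, open a new command prompt and query the CUDA compiler version. This assumes the installer added the CUDA bin directory to PATH, which the "express" option normally does.

```shell
# Prints the installed CUDA compiler (nvcc) version,
# which should match the toolkit you selected (e.g. 11.5).
nvcc --version
```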

Allow access to the GPU performance counters (probably optional)

In the NVIDIA Control Panel (accessible through Control Panel > Hardware > NVIDIA Control Panel), select "Manage GPU Performance Counters" in the sidebar and allow access to the GPU performance counters for all users. This step may make GPU usage information more accessible to non-admin processes.

Install Miniconda

Download the latest Miniconda3 installer for Windows 64-bit and run the installer.

At this point, open the newly installed Anaconda Prompt and run `nvidia-smi`. If it shows GPU information, the driver and CUDA installation was successful.

Create a conda environment and install packages

The strict channel priority and the use of a dedicated environment (called cforge in these directions) help prevent the base install from being corrupted. The mamba package installer is a faster drop-in replacement for the conda installer.

Specifying the exact version numbers for pytorch is a work-around for mamba, which will otherwise install the CPU-only pytorch. You may need to look up the current pytorch version numbers (the version numbers below are correct as of 2021-11-18). Note that we install python 3.9 in the steps below and CUDA 11.5 in the steps above, so we need a pytorch binary that matches our python version (3.9) and uses a CUDA version 11.x where 11.x <= 11.5. The following command will list all available versions.

$ mamba search pytorch -c pytorch
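As an illustration of the matching rule (not part of the original instructions), the build strings printed by the search command can be filtered programmatically. The helper below is hypothetical and the sample build strings are examples only; real `mamba search` output will differ over time.

```python
import re

# Hypothetical helper: given build strings of the form printed by
# `mamba search pytorch -c pytorch`, keep only those built for our
# python version and a CUDA version at or below the installed toolkit.
def is_compatible(build, py="3.9", max_cuda=(11, 5)):
    m = re.match(r"py([\d.]+)_cuda(\d+)\.(\d+)_", build)
    if not m:
        return False  # cpu-only or otherwise non-matching build
    py_ver = m.group(1)
    cuda_ver = (int(m.group(2)), int(m.group(3)))
    return py_ver == py and cuda_ver <= max_cuda

# Example build strings for illustration only:
builds = [
    "py3.9_cuda11.3_cudnn8_0",
    "py3.9_cpu_0",
    "py3.8_cuda11.3_cudnn8_0",
]
print([b for b in builds if is_compatible(b)])
# prints ['py3.9_cuda11.3_cudnn8_0']
```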

To set up an environment and install the pytorch packages, enter the following in the Anaconda Prompt, one input line at a time. The $ represents the command prompt (and the start of an input line) and should not be typed.

$ conda config --set channel_priority strict  
$ conda create --name cforge  
$ conda activate cforge  
$ conda config --add channels conda-forge  
$ conda config --env --set channel_priority strict

Next, install python and mamba into the newly created (and active) environment.

$ conda install python=3.9 mamba pip

The next set of packages install a fairly complete data science setup. These are optional but you will probably want them!

$ mamba install pandas scikit-learn matplotlib \
jupyterlab plotnine nodejs tqdm regex dask \
statsmodels bokeh networkx ipywidgets jupytext

Finally, install pytorch, the Huggingface transformers (optional) and fast.ai (optional). If needed, use mamba search pytorch -c pytorch to identify available packages and version numbers as described above.

$ mamba install -c pytorch pytorch==1.10.0=py3.9_cuda11.3_cudnn8_0
$ mamba install -c pytorch torchvision torchaudio
$ mamba install transformers
$ mamba install -c fastchan fastai

Verify that Pytorch can see the GPU

In the Anaconda Prompt, with the cforge environment active, execute the following command to see if CUDA is available to pytorch. It will return True or False:

$ python -c "import torch; print(torch.cuda.is_available())"

The command should return

True
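If the command prints False instead, a slightly longer sketch can help distinguish a CPU-only pytorch build from a driver problem. This is an illustrative snippet, assuming only the packages installed above; it guards the import so it also runs in environments without pytorch.

```python
import importlib.util

# Only attempt the CUDA queries if pytorch is actually importable.
if importlib.util.find_spec("torch") is None:
    print("pytorch is not installed in this environment")
else:
    import torch
    print("pytorch version:", torch.__version__)
    # torch.version.cuda is None for CPU-only builds; if it is None here,
    # revisit the pytorch install step and its exact version string.
    print("built with CUDA:", torch.version.cuda)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
```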