Machine Learning
Set up a conda environment for machine learning.
This page gives brief guidance on setting up a new conda environment for running machine learning jobs on Perlmutter. It also covers some general topics:
  • how to change default conda settings (e.g. the environment directory and package directory, to avoid out-of-quota problems in the home directory)
  • how to clone an existing conda environment
  • an example Slurm script to submit a job to Perlmutter

Put conda on PATH

First, log in to Perlmutter (do not type the leading $; the same applies below):
$ ssh <username>@cori.nersc.gov # replace <username> with your NERSC user name
$ ssh perlmutter
By default, conda is not on PATH. Get it on PATH by:
$ module load python

Change default conda settings (optional)

Change environment directory

By default, a new environment is stored in the $HOME directory (e.g. /global/homes/m/<username>/.conda/envs). Each of us has a quota of 40G for $HOME, and conda environments can get quite big, which can cause out-of-quota problems. So, let's change the default environment directory to avoid this.
You should have access to /global/common/software/matgen/ (or /global/common/software/jcesr, depending on the account you have access to). Create a directory under your username (to store all your software), e.g.
$ cd /global/common/software/matgen
$ mkdir <username> # change <username> to your NERSC user name
$ chmod g-w <username> # remove group write access to avoid others changing it
Within your directory, create a directory to store conda environments (assuming we want to store them at .../<username>/conda/envs):
$ cd <username>
$ mkdir conda && mkdir conda/envs
Then, configure conda to prepend the directory we've just created to envs_dirs:
$ conda config --prepend envs_dirs /global/common/software/matgen/<username>/conda/envs
This is all you need to do.
To confirm that it worked, you can view the conda settings with:
$ conda config --show
You will find something like:
envs_dirs:
- /global/common/software/matgen/<username>/conda/envs
- /global/homes/m/<username>/.conda/envs
- /global/common/software/nersc/pm-2021q4/sw/python/3.9-anaconda-2021.11/envs
Alternatively, you can open ~/.condarc to see all the changes you've made. You can even directly edit it to remove the changes or add new ones.
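If you would rather check the result from a script than read it by eye, the following is a minimal sketch that pulls the envs_dirs entries out of text shaped like the output above. The function name parse_envs_dirs is made up for illustration, and the sample text is copied from this page rather than from a live conda call.

```python
def parse_envs_dirs(config_text: str) -> list[str]:
    """Return the entries listed under `envs_dirs:` in conda-config-style text."""
    dirs = []
    in_section = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped == "envs_dirs:":
            in_section = True
        elif in_section and stripped.startswith("- "):
            dirs.append(stripped[2:].strip())
        elif in_section and stripped:
            break  # reached the next key, so the envs_dirs list is over
    return dirs


# Sample text modelled on the output shown on this page (paths are placeholders).
sample = """\
envs_dirs:
- /global/common/software/matgen/<username>/conda/envs
- /global/homes/m/<username>/.conda/envs
pkgs_dirs:
- /global/homes/m/<username>/.conda/pkgs
"""

dirs = parse_envs_dirs(sample)
# The first entry is where new environments are created by default.
print(dirs[0])
```

In practice you would feed it the text printed by $ conda config --show. The first entry should be your directory under /global/common/software, not the one under $HOME.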

Change package directory

When you install a package, it is first downloaded to $HOME (e.g. /global/homes/m/<username>/.conda/pkgs). You can change the default package storage directory as well:
$ mkdir /global/common/software/matgen/<username>/conda/pkgs
$ conda config --prepend pkgs_dirs /global/common/software/matgen/<username>/conda/pkgs
Again, you may need to change matgen to the account you have access to, and, of course, change <username> to your username.
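To see how much of your $HOME quota the default conda directories currently take, a small sketch like the one below can help. It simply sums file sizes under a directory. The helper names dir_size_bytes and report are made up for illustration, and the ~/.conda/envs and ~/.conda/pkgs locations are the defaults mentioned above. Hard-linked files are counted once per link, so the number may overestimate real disk usage.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report(path: str) -> str:
    """Format the size of `path` in GiB; a missing directory reports 0.00."""
    return f"{path}: {dir_size_bytes(path) / 1024**3:.2f} GiB"


if __name__ == "__main__":
    for sub in ("envs", "pkgs"):
        print(report(os.path.expanduser(os.path.join("~", ".conda", sub))))
```

If the numbers are close to the 40G quota, moving the directories as described above is worth doing before installing anything large.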

Clone an existing conda environment

Let's create an environment called matml, short for materials machine learning.
We can do $ conda create --name matml to create it, and then $ conda install python <other_package_name> to install packages. But if you intend to use PyTorch or TensorFlow, there are pre-installed versions on Perlmutter that are built from source. These could be optimized for the hardware, so it would be better to use them. NERSC staff have already put them in a conda environment, and we just need to clone it.
First, check what versions are available (using pytorch as an example, but it is similar for tensorflow):
$ module spider pytorch
Then pick the version we want to use (e.g. pytorch/1.10.0; jot it down, as you will use it later) and load it:
$ module load pytorch/1.10.0
$ which python
The purpose of which python is to find out the path to the pytorch conda environment, and you will see something like /global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python. This means the conda environment is at /global/common/software/nersc/shasta2105/pytorch/1.10.0 (NOTE, /bin/python is excluded).
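The path manipulation described above (dropping the trailing /bin/python) can be sketched in a few lines. The function name env_prefix is made up for illustration, and the example path is the one quoted on this page.

```python
from pathlib import PurePosixPath


def env_prefix(python_path: str) -> str:
    """Return the conda environment directory for a `.../bin/python` path."""
    p = PurePosixPath(python_path)
    if p.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/<python>, got {python_path}")
    # Drop the executable name and the bin directory.
    return str(p.parent.parent)


prefix = env_prefix("/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python")
print(prefix)  # /global/common/software/nersc/shasta2105/pytorch/1.10.0
```

Inside a running interpreter, sys.prefix normally reports the same directory, which gives a quick cross-check.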
Finally, clone the pytorch conda environment using its path to create the matml environment:
$ conda create --name matml --clone /global/common/software/nersc/shasta2105/pytorch/1.10.0

Submit a job

Test it works

Assume you have finished the above steps, and logged out of perlmutter.
The next time you log back in, the first thing to do is load the necessary modules:
$ module load python
$ module load pytorch/1.10.0
and activate your environment:
$ conda activate matml
Note that the order of the above three commands matters. The purpose is to ensure the correct python is on your PATH. Make sure to load python first, then the torch module (e.g. pytorch/1.10.0), and finally activate your own conda environment. You can check it with $ which python and make sure it is from the matml environment (e.g. /global/common/software/matgen/<username>/conda/envs/matml/bin/python).
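The same check can be done from within Python, for instance at the top of a job script, so that a job started with the wrong interpreter is easy to spot. This is only a sketch: belongs_to_env is a made-up helper, and it assumes the usual <env>/bin/python layout shown above.

```python
import sys
from pathlib import PurePosixPath


def belongs_to_env(python_path: str, env_name: str) -> bool:
    """True if `python_path` looks like <...>/<env_name>/bin/<python>."""
    p = PurePosixPath(python_path)
    return p.parent.name == "bin" and p.parent.parent.name == env_name


# sys.executable is the interpreter actually running this script.
print(sys.executable)
print(belongs_to_env(sys.executable, "matml"))
```

If the second line prints False after activating matml, the modules were probably loaded in a different order than described above.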
To test that everything works:
$ python
>>> import torch
>>>
>>> torch.cuda.is_available()
True
>>>
>>> device = torch.device("cuda") # use gpu
>>> a = torch.tensor([1., 1.], device=device)
>>> b = torch.tensor([2., 2.], device=device)
>>> a+b
tensor([3., 3.], device='cuda:0')

Submit a batch job

Create a python script, named, e.g. my_first_torch_job.py:
import torch

device = torch.device("cuda") # use gpu
a = torch.tensor([1., 1.], device=device)
b = torch.tensor([2., 2.], device=device)

print(a + b)
Create a slurm batch script for it, named, e.g. submit.sh:
#!/bin/bash -l

#SBATCH --job-name=test_job
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 # 2 cpus for the job
#SBATCH --gpus=1 # 1 gpu for the job
#SBATCH --constraint=gpu
#SBATCH --qos=regular # use the `regular` queue
#SBATCH --account=matgen_g # don't forget the `_g`; you may want to use `jcesr_g`

module load python
module load pytorch/1.10.0

conda activate matml

python my_first_torch_job.py
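If you submit many similar jobs, it can be handy to generate the batch script instead of editing it by hand. Below is a minimal sketch that reproduces the submit.sh above from a few parameters. The function render_sbatch and its parameter names are made up for illustration; the #SBATCH options and module names are exactly the ones used on this page, and nothing here has been checked against current NERSC recommendations.

```python
def render_sbatch(
    job_name: str,
    account: str,
    script: str,
    env_name: str = "matml",
    torch_module: str = "pytorch/1.10.0",
    time_limit: str = "0-00:10:00",
    cpus_per_task: int = 2,
    gpus: int = 1,
    qos: str = "regular",
) -> str:
    """Build the text of a Slurm batch script like the submit.sh on this page."""
    lines = [
        "#!/bin/bash -l",
        "",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --gpus={gpus}",
        "#SBATCH --constraint=gpu",
        f"#SBATCH --qos={qos}",
        f"#SBATCH --account={account}",
        "",
        # Same order as in the interactive session: python, torch module, then the env.
        "module load python",
        f"module load {torch_module}",
        "",
        f"conda activate {env_name}",
        "",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


text = render_sbatch("test_job", "matgen_g", "my_first_torch_job.py")
print(text)
```

Writing the returned text to submit.sh gives the same script as above, minus the inline comments.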
Submit your job. Ensure no conda environment is activated; otherwise module load python in submit.sh won't work. Use $ conda deactivate to deactivate if you are in an environment. Alternatively, you can log out and log in again before running the command below:
$ sbatch submit.sh

Contact

If you find something wrong, please fix it. If you are unsure, let Mingjian Wen know and he can probably help.