Machine Learning (ML)
Set up a conda environment for machine learning.
This page gives brief guidance on setting up a new conda environment for running machine learning jobs on perlmutter. You may also find general info on:
- how to change default conda settings (e.g. the environment and package directories, to avoid exceeding your home directory quota)
- how to clone an existing conda environment
- an example slurm script to submit a job to perlmutter
First, log in to perlmutter (type the commands without the leading $, here and below):
$ ssh <username>@cori.nersc.gov # replace <username> with your NERSC user name
$ ssh perlmutter
By default, conda is not on PATH. Get it on PATH by:
$ module load python
Let's create an environment called matml, short for materials machine learning. We can do
$ conda create --name matml
to create it, and then (with matml activated)
$ conda install python <other_package_name>
to install packages. But if you intend to use PyTorch or TensorFlow, there are pre-installed versions on perlmutter that are built from source. These may be optimized for the hardware, so it is better to use them. NERSC staff have already put them in conda environments, and we just need to clone one.
First, check what versions are available (using pytorch as an example; tensorflow is similar):
$ module spider pytorch
Then pick the version you want to use (e.g. pytorch/1.10.0; jot it down, you will use it later) and load it:
$ module load pytorch/1.10.0
$ which python
The purpose of which python is to find the path to the pytorch conda environment. You will see something like /global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python. This means the conda environment is at /global/common/software/nersc/shasta2105/pytorch/1.10.0 (NOTE: the trailing /bin/python is excluded).
Finally, clone the pytorch conda environment using its path to create the matml environment:
$ conda create --name matml --clone /global/common/software/nersc/shasta2105/pytorch/1.10.0
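The step of turning the which python output into an environment path can be sketched in Python. This is only an illustration: the helper name env_prefix_from_python is made up for this page, and it assumes the usual <env>/bin/python layout described above.

```python
from pathlib import PurePosixPath


def env_prefix_from_python(python_path: str) -> str:
    """Return the conda environment directory for a given interpreter path.

    Assumes the interpreter lives at <env>/bin/<python>, so the environment
    directory is two levels above the executable.
    """
    path = PurePosixPath(python_path)
    if path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/<python>, got {python_path!r}")
    return str(path.parent.parent)


if __name__ == "__main__":
    # Example path taken from the text above.
    example = "/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python"
    print(env_prefix_from_python(example))
```

The printed directory is the value to pass to --clone.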
Assume you have finished the above steps and logged out of perlmutter. The next time you log back in, the first thing to do is load the necessary modules:
$ module load python
$ module load pytorch/1.10.0
and activate your environment:
$ conda activate matml
Note that the order of the above three commands matters; it ensures that the correct python is on your PATH. Make sure to load python first, then the torch environment (e.g. pytorch/1.10.0), and finally activate your own conda environment. You can check with $ which python and make sure it points into the matml environment (e.g. /global/common/software/matgen/<username>/conda/envs/matml/bin/python).
To test that everything works:
$ python
>>> import torch
>>>
>>> torch.cuda.is_available()
True
>>>
>>> device = torch.device("cuda") # use gpu
>>> a = torch.tensor([1., 1.], device=device)
>>> b = torch.tensor([2., 2.], device=device)
>>> a+b
tensor([3., 3.], device='cuda:0')
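The which python check can also be done from inside Python. The sketch below is a hedged illustration: is_from_env is a hypothetical helper, and it assumes the environment was created by name, so that it sits under an envs/ directory as in the example path above. An environment created at a custom prefix would not match.

```python
import sys
from pathlib import PurePosixPath


def is_from_env(python_path: str, env_name: str) -> bool:
    """True if python_path looks like .../envs/<env_name>/bin/<python>."""
    parts = PurePosixPath(python_path).parts
    # The three components before the executable name must be
    # "envs", the environment name, and "bin".
    return len(parts) >= 4 and parts[-4:-1] == ("envs", env_name, "bin")


if __name__ == "__main__":
    # sys.executable is the interpreter that is actually running.
    print(sys.executable, is_from_env(sys.executable, "matml"))
```

If this reports False after conda activate matml, the modules were probably loaded in a different order than described above.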
Create a Python script named, e.g., my_first_torch_job.py:
import torch
device = torch.device("cuda") # use gpu
a = torch.tensor([1., 1.], device=device)
b = torch.tensor([2., 2.], device=device)
print(a + b)
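The script above assumes a GPU is always present. A possible variation, shown only as a sketch, picks the device at run time so the same file can also be tried on a machine without a GPU. The helper pick_device_name is hypothetical; it returns "cpu" when PyTorch cannot be imported so that it is safe to call anywhere, although the job itself still needs PyTorch.

```python
def pick_device_name() -> str:
    """Return "cuda" when PyTorch is importable and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    name = pick_device_name()
    print(f"selected device: {name}")
    # In the job script one could then write:
    #   device = torch.device(name)
```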
Create a slurm batch script for it named, e.g., submit.sh:
#!/bin/bash -l
#SBATCH --job-name=test_job
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 # 2 cpus for the job
#SBATCH --gpus=1 # 1 gpu for the job
#SBATCH --constraint=gpu
#SBATCH --qos=regular # use the `regular` queue
#SBATCH --account=matgen_g # don't forget the `_g`; you may want to use `jcesr_g`
module load python
module load pytorch/1.10.0
conda activate matml
python my_first_torch_job.py
Submit your job. Ensure no conda environment is activated, otherwise module load python in submit.sh won't work. Use $ conda deactivate to deactivate if you are in an environment; alternatively, log out and log in again before running the command below:
$ sbatch submit.sh
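On success, sbatch prints a line of the form "Submitted batch job <id>". If you submit from a Python workflow, the job ID can be captured as in the sketch below. The function names parse_job_id and submit are illustrative, not part of any NERSC tooling, and submit only works on a machine where sbatch is available.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the numeric job ID from sbatch's default output."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script: str = "submit.sh") -> int:
    """Run sbatch on the given script and return the job ID."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


if __name__ == "__main__":
    # Parsing only; no job is submitted here.
    print(parse_job_id("Submitted batch job 123456"))
```

The returned ID can then be passed to commands such as squeue or scancel.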
If you find something wrong, please fix it. If you are unsure, let Mingjian Wen know and he can probably help.