Machine Learning (ML)

Set up a conda environment for machine learning.

Lawrencium "Condo" Nodes

The group has some dedicated GPU nodes on Lawrencium that you can use. More information here.

ML with GPUs on Perlmutter

This page gives brief guidance on setting up a new conda environment for running machine learning jobs on Perlmutter. You may also find general information here on:

how to change default conda settings (e.g. the environment directory and package directory, to avoid running out of quota in your home directory)
how to clone an existing conda environment
an example Slurm script for submitting a job

Put conda on PATH

First, log in to Perlmutter. (In the commands on this page, the leading $ is the shell prompt; do not type it.)

By default, conda is not on PATH. Get it on PATH by loading the python module: $ module load python

Clone an existing conda environment

Let's create an environment called matml, short for materials machine learning.

We can do $ conda create --name matml to create it, and then $ conda install python <other_package_name> to install packages. But if you intend to use PyTorch or TensorFlow, there are pre-installed versions on Perlmutter that are built from source. These may be optimized for the hardware, so it is better to use them. NERSC folks have already put them in a conda environment, and we just need to clone it.

First, check which versions are available with $ module avail pytorch (using pytorch as an example; tensorflow is similar).

Then pick the version we want to use (e.g. pytorch/1.10.0; jot it down, as you will use it later), load it with $ module load pytorch/1.10.0, and run $ which python.

The purpose of which python is to find out the path to the pytorch conda environment, and you will see something like /global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python. This means the conda environment is at /global/common/software/nersc/shasta2105/pytorch/1.10.0 (NOTE, /bin/python is excluded).
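The path manipulation described above (dropping the trailing /bin/python) can be sketched in a few lines of Python. The function name env_prefix_from_python is hypothetical and only illustrates the rule; it is not part of conda or any NERSC tooling.

```python
from pathlib import PurePosixPath


def env_prefix_from_python(python_path: str) -> str:
    """Return the conda environment directory for a given python executable path.

    Assumes the usual layout <env>/bin/python, as in the example above.
    """
    path = PurePosixPath(python_path)
    if path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/<python>, got {python_path!r}")
    # Drop the executable name and the bin directory.
    return str(path.parent.parent)


if __name__ == "__main__":
    example = "/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python"
    print(env_prefix_from_python(example))
```

The printed directory is what you pass to conda when cloning.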

Finally, clone the pytorch conda environment using its path to create the matml environment, e.g. $ conda create --name matml --clone /global/common/software/nersc/shasta2105/pytorch/1.10.0

Submit a job

Test it works

Assume you have finished the above steps and logged out of Perlmutter.

The next time you log back in, first load the necessary modules with $ module load python followed by $ module load pytorch/1.10.0, and then activate your environment with $ conda activate matml.

Note that the order of these three commands matters: it ensures that the correct python is on your PATH. Make sure to load python first, then the torch environment (e.g. pytorch/1.10.0), and finally activate your own conda environment. You can check the result with $ which python, and make sure it is from the matml environment (e.g. /global/common/software/matgen/<username>/conda/envs/matml/bin/python).
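The same check can be done from inside Python, which is handy at the top of a job script. This is a minimal sketch: python_belongs_to_env is a hypothetical helper, and it assumes the <env>/bin/python layout shown in the example path above.

```python
import sys
from pathlib import PurePosixPath


def python_belongs_to_env(python_path: str, env_name: str = "matml") -> bool:
    """True if python_path looks like <...>/<env_name>/bin/<python executable>."""
    path = PurePosixPath(python_path)
    return path.parent.name == "bin" and path.parent.parent.name == env_name


if __name__ == "__main__":
    # sys.executable is the interpreter that is actually running this script.
    if python_belongs_to_env(sys.executable):
        print("OK: running python from matml:", sys.executable)
    else:
        print("WARNING: python is not from matml:", sys.executable)
```

If the warning is printed, revisit the order in which the modules were loaded and the environment was activated.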

To test that everything works, run a short check with the python from the matml environment.
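One way to do such a check is the small script below. It is a sketch, not a NERSC-provided test: environment_report is a hypothetical name, and the script only relies on torch.__version__ and torch.cuda.is_available(). On a login node without a GPU allocation, CUDA may legitimately be reported as unavailable.

```python
import importlib.util
import sys


def environment_report() -> dict:
    """Collect basic facts about the running interpreter and PyTorch."""
    report = {
        "python": sys.executable,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If torch_installed is False, the interpreter is most likely not the one from the cloned environment.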

Submit a batch job

Create a Python script named, e.g., my_first_torch_job.py
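A minimal sketch of what my_first_torch_job.py could contain is shown below. The computation is purely illustrative (a small matrix product), and the import guard exists only so the script fails gracefully when run outside the matml environment.

```python
try:
    import torch
except ImportError:  # e.g. the matml environment is not activated
    torch = None


def run_job():
    """Multiply two small matrices on the GPU if one is available, else on the CPU."""
    if torch is None:
        return None
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.ones(3, 3, device=device)
    b = a @ a  # every entry of the 3x3 product equals 3
    return device.type, b.sum().item()


if __name__ == "__main__":
    result = run_job()
    if result is None:
        print("PyTorch is not importable; is the matml environment activated?")
    else:
        device_type, total = result
        print(f"device: {device_type}, sum of entries: {total}")
```

When the job runs on a GPU node, the reported device should be cuda.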

Create a Slurm batch script for it named, e.g., submit.sh

Submit your job with $ sbatch submit.sh. Ensure no conda environment is activated, otherwise module load python in submit.sh won't work. Use $ conda deactivate to deactivate if you are in an environment. Alternatively, you can log out and log in again before submitting.
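As a rough illustration of what submit.sh might contain, the helper below renders a candidate script as text. Everything in the #SBATCH header is an assumption (job name, account placeholder m0000, GPU constraint, node and GPU counts, time limit), and your allocation may need other directives such as a QOS; check the NERSC documentation for the values that apply to you. The body follows the load order described on this page: python module, then the pytorch module, then the matml environment.

```python
def render_submit_script(
    account: str,
    script: str = "my_first_torch_job.py",
    env_name: str = "matml",
    torch_module: str = "pytorch/1.10.0",
    time_limit: str = "00:10:00",
) -> str:
    """Return the text of a candidate Slurm batch script (all header values are assumptions)."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=my_first_torch_job",
        f"#SBATCH --account={account}",
        "#SBATCH --constraint=gpu",
        "#SBATCH --nodes=1",
        "#SBATCH --gpus=1",
        f"#SBATCH --time={time_limit}",
        "",
        "# Load order matters: python, then pytorch, then your own environment.",
        "module load python",
        f"module load {torch_module}",
        f"conda activate {env_name}",
        "",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "m0000" is a placeholder account name, not a real allocation.
    with open("submit.sh", "w") as handle:
        handle.write(render_submit_script(account="m0000"))
```

Writing the script by hand works just as well; the helper only makes the assumed pieces explicit.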

Contact

If you find something wrong, please fix it. If you are unsure, let Mingjian Wen know and he can probably help.
