# Machine Learning (ML)

## Lawrencium "Condo" Nodes

The group has some dedicated GPU nodes on Lawrencium that you can use. More information is available [here](https://materialsproject.gitbook.io/persson-group-handbook/computing/high-performance-computing/lrc-lawrencium#persson-group-es1-gpu-node-specs).

## ML with GPUs on Perlmutter

This page gives brief guidance on setting up a new `conda` environment for running machine learning jobs on `perlmutter`. Along the way, you may also find general information on:

* how to change default `conda` settings (e.g. the environment and package directories, to avoid running out of quota in your home directory)
* how to clone an existing `conda` environment
* an example Slurm script for submitting a job to `perlmutter`

## Put conda on PATH

First, log in to `perlmutter` (do not type the leading `$`; the same applies below):

```
$ ssh <username>@perlmutter.nersc.gov   # replace <username> with your NERSC user name
```

By default, `conda` is not on `PATH`. Put it on `PATH` by loading the `python` module:

```
$ module load python
```

## Clone an existing conda environment

Let's create an environment called `matml`, short for materials machine learning.

We could run `conda create --name matml` to create it and then `conda install python <other_package_name>` to install packages. But if you intend to use `PyTorch` or `TensorFlow`, `perlmutter` provides pre-installed versions that are built from source. These may be optimized for the hardware, so it is better to use them. The `NERSC` staff have already put them in conda environments, and we just need to clone one.

First, check which versions are available (using `pytorch` as an example; **the procedure is similar for `tensorflow`**):

```
$ module spider pytorch
```

Then pick the version you want to use (e.g. `pytorch/1.10.0`; jot it down, because you will need it later) and load it:

```
$ module load pytorch/1.10.0
$ which python 
```

The purpose of `which python` is to find the path to the pytorch conda environment. You will see something like `/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python`, which means the conda environment is at `/global/common/software/nersc/shasta2105/pytorch/1.10.0` (note that `/bin/python` is excluded).
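If you prefer not to trim the path by hand, the step above can be sketched in Python. This is only an illustration: the helper name `env_prefix` is hypothetical, and it assumes the interpreter sits at `<prefix>/bin/python`, as in the example path.

```python
from pathlib import PurePosixPath


def env_prefix(python_path: str) -> str:
    """Return the environment root for an interpreter located at <prefix>/bin/python."""
    path = PurePosixPath(python_path)
    if path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/<python>, got {python_path!r}")
    # Drop the interpreter name and the bin directory.
    return str(path.parent.parent)


if __name__ == "__main__":
    example = "/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python"
    print(env_prefix(example))
```

The printed value is the path to pass to `conda create --clone`.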

Finally, clone the pytorch conda environment by its path to create the `matml` environment:

```
$ conda create --name matml --clone /global/common/software/nersc/shasta2105/pytorch/1.10.0
```

## Submit a job

### Test it works

Assume you have finished the above steps and logged out of `perlmutter`.

The next time you log in, the first thing to do is load the necessary modules:

```
$ module load python
$ module load pytorch/1.10.0 
```

and activate your environment:

```
$ conda activate matml
```

Note that the order of the three commands above matters: it ensures that the correct python is on your `PATH`. Load `python` first, then the torch module (e.g. `pytorch/1.10.0`), and finally activate your own conda environment. You can check with `which python` that the interpreter comes from the `matml` environment (e.g. `/global/common/software/matgen/<username>/conda/envs/matml/bin/python`).

To test that everything works:
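The same check can be made from inside Python, which is handy at the top of a job script. The sketch below is a rough heuristic, not an official conda API: the helper name `in_conda_env` is hypothetical, and it assumes the environment lives under a directory named `envs`, as in the example path.

```python
import sys
from pathlib import PurePosixPath


def in_conda_env(executable: str, env_name: str) -> bool:
    """True if `executable` looks like <...>/envs/<env_name>/bin/<python>."""
    parts = PurePosixPath(executable).parts
    return (
        len(parts) >= 4
        and parts[-4] == "envs"
        and parts[-3] == env_name
        and parts[-2] == "bin"
    )


if __name__ == "__main__":
    # sys.executable is the interpreter that is actually running this script.
    print(sys.executable)
    print(in_conda_env(sys.executable, "matml"))
```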

```python
$ python
>>> import torch 
>>>
>>> torch.cuda.is_available()
True
>>>
>>> device = torch.device("cuda")  # use gpu
>>> a = torch.tensor([1., 1.], device=device)
>>> b = torch.tensor([2., 2.], device=device)
>>> a+b
tensor([3., 3.], device='cuda:0')
```
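For a non-interactive version of this check, a small script along the following lines may help, for example as the first step of a batch job. It is a sketch under assumptions: the function name `gpu_report` is hypothetical, and it reports a missing `torch` instead of failing with an import error.

```python
import importlib.util


def gpu_report() -> dict:
    """Collect basic facts about the PyTorch/GPU setup without failing if torch is absent."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch  # imported lazily so the script still runs without torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, re-check the module loading order described above and confirm that you are on a GPU node.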

### Submit a batch job

Create a python script named, e.g., `my_first_torch_job.py`:

```python
import torch 

device = torch.device("cuda")  # use gpu
a = torch.tensor([1., 1.], device=device)
b = torch.tensor([2., 2.], device=device)

print(a + b)
```

Create a Slurm batch script for it named, e.g., `submit.sh`:

```bash
#!/bin/bash -l

#SBATCH --job-name=test_job
#SBATCH --time=0-00:10:00  
#SBATCH --ntasks=1      
#SBATCH --cpus-per-task=2     # 2 cpus for the job
#SBATCH --gpus=1              # 1 gpu for the job 
#SBATCH --constraint=gpu
#SBATCH --qos=regular         # use the `regular` queue
#SBATCH --account=matgen_g    # don't forget the `_g`; you may want to use `jcesr_g`

module load python
module load pytorch/1.10.0

conda activate matml

python my_first_torch_job.py
```
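If you submit many similar jobs, you could generate this script instead of editing it by hand. The following is a minimal sketch, not part of any NERSC tooling: the function `render_submit_script` and its parameter names are hypothetical, and the defaults simply mirror the `submit.sh` shown above.

```python
def render_submit_script(
    script: str = "my_first_torch_job.py",
    account: str = "matgen_g",
    walltime: str = "0-00:10:00",
    gpus: int = 1,
    cpus_per_task: int = 2,
    pytorch_module: str = "pytorch/1.10.0",
    env_name: str = "matml",
) -> str:
    """Return the text of a Slurm batch script mirroring submit.sh above."""
    lines = [
        "#!/bin/bash -l",
        "",
        "#SBATCH --job-name=test_job",
        f"#SBATCH --time={walltime}",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --gpus={gpus}",
        "#SBATCH --constraint=gpu",
        "#SBATCH --qos=regular",
        f"#SBATCH --account={account}",
        "",
        "module load python",
        f"module load {pytorch_module}",
        "",
        f"conda activate {env_name}",
        "",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a variant charged to the other account mentioned above.
    print(render_submit_script(account="jcesr_g"), end="")
```

Saving this as, say, `make_submit.py` (a hypothetical file name) and running `python make_submit.py > submit.sh` would write the script to disk.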

Submit your job. Make sure no conda environment is activated; otherwise `module load python` in `submit.sh` won't work. If you are in an environment, run `conda deactivate` first. Alternatively, log out and log back in before running the command below:

```
$ sbatch submit.sh
```

## Contact

If you find something wrong, please fix it. If you are unsure, let Mingjian Wen know and he can probably help.
