# Machine Learning (ML)

## Lawrencium "Condo" Nodes

The group has some dedicated GPU nodes on Lawrencium that you can use. More information [here](https://materialsproject.gitbook.io/persson-group-handbook/computing/high-performance-computing/lrc-lawrencium#persson-group-es1-gpu-node-specs).

## ML with GPUs on Perlmutter

This page gives brief guidance on setting up a new `conda` environment for running machine learning jobs on `perlmutter`. You may also find general information on:

* how to change default `conda` settings (e.g. environment directory and package directory to avoid home directory out-of-quota problem)
* how to clone an existing `conda` environment
* example slurm script to submit a job to `perlmutter`

## Put conda on PATH

First, log in to `perlmutter` (do not type the leading `$`, here and below):

```
$ ssh <username>@perlmutter.nersc.gov   # replace <username> with your NERSC user name
```

By default, `conda` is not on `PATH`. Get it on `PATH` with:

```
$ module load python
```
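To confirm that `conda` is now visible, a small check like the following can help. It is a minimal sketch that uses only the Python standard library; the function name `find_conda` is made up for illustration.

```python
import shutil


def find_conda():
    """Return the full path of the `conda` executable, or None if it is not on PATH."""
    return shutil.which("conda")


if __name__ == "__main__":
    conda_path = find_conda()
    if conda_path is None:
        print("conda not found on PATH; try `module load python` first")
    else:
        print(f"conda found at {conda_path}")
```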

## Clone an existing conda environment

Let's create an environment called `matml`, short for materials machine learning.

We could run `$ conda create --name matml` to create it, and then `$ conda install python <other_package_name>` to install packages. But if you intend to use `PyTorch` or `TensorFlow`, there are pre-installed versions on `perlmutter` that are built from source. These may be optimized for the hardware, so it is better to use them. The `NERSC` staff have already put them in a conda environment, so we just need to clone it.

First, check what versions are available (using `pytorch` as an example, but **it is similar for `tensorflow`**):

```
$ module spider pytorch
```

Then pick the version you want to use (e.g. `pytorch/1.10.0`; jot it down, you will use it later) and load it:

```
$ module load pytorch/1.10.0
$ which python 
```

The purpose of `which python` is to find the path to the pytorch conda environment. You will see something like `/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python`. This means the conda environment is at `/global/common/software/nersc/shasta2105/pytorch/1.10.0` (note that the trailing `/bin/python` is excluded).

Finally, clone the pytorch conda environment using its path to create the `matml` environment:
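The path manipulation above can be sketched in Python as follows. The helper name `env_prefix_from_python` is hypothetical, and the sketch assumes the interpreter sits at `<environment>/bin/python`, as in the example path.

```python
from pathlib import PurePosixPath


def env_prefix_from_python(python_path):
    """Strip the trailing `bin/python` to get the conda environment directory."""
    path = PurePosixPath(python_path)
    if path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/python, got {python_path!r}")
    return str(path.parent.parent)


print(env_prefix_from_python(
    "/global/common/software/nersc/shasta2105/pytorch/1.10.0/bin/python"
))
# prints /global/common/software/nersc/shasta2105/pytorch/1.10.0
```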

```
$ conda create --name matml --clone /global/common/software/nersc/shasta2105/pytorch/1.10.0
```

## Submit a job

### Test it works

Assume you have finished the above steps and logged out of `perlmutter`.

The next time you log back in, first load the necessary modules:

```
$ module load python
$ module load pytorch/1.10.0 
```

and activate your environment:

```
$ conda activate matml
```

Note that the order of the above three commands matters: it ensures that the correct python is on your `PATH`. Load `python` first, then the torch environment (e.g. `pytorch/1.10.0`), and finally activate your own conda environment. You can check the result with `$ which python`, which should point into the `matml` environment (e.g. `/global/common/software/matgen/<username>/conda/envs/matml/bin/python`).

To test that everything works:
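The same check can be done from inside Python by inspecting `sys.executable`. This is a minimal sketch; the helper name `python_is_from_env` is made up for illustration, and it assumes environments live under a directory named `envs`, as in the example path above.

```python
import sys
from pathlib import PurePosixPath


def python_is_from_env(env_name, executable=sys.executable):
    """Return True if `executable` looks like <...>/envs/<env_name>/.../python."""
    parts = PurePosixPath(executable).parts
    # Look for the consecutive path components "envs" and env_name.
    return any(
        parts[i] == "envs" and parts[i + 1] == env_name
        for i in range(len(parts) - 1)
    )


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("from matml: ", python_is_from_env("matml"))
```

If the second line reports `False`, the modules and environment were probably loaded in a different order than described above.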

```python
$ python
>>> import torch 
>>>
>>> torch.cuda.is_available()
True
>>>
>>> device = torch.device("cuda")  # use gpu
>>> a = torch.tensor([1., 1.], device=device)
>>> b = torch.tensor([2., 2.], device=device)
>>> a+b
tensor([3., 3.], device='cuda:0')
```
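The interactive test above can also be saved as a small script. The sketch below collects the same information with standard PyTorch calls and degrades gracefully when PyTorch is not importable; the function name `gpu_report` is hypothetical.

```python
import importlib.util


def gpu_report():
    """Collect basic facts about the PyTorch installation and visible GPUs."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the function also works without PyTorch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        # Name of the first visible GPU.
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Note that `cuda_available` is only expected to be `True` where a GPU is actually visible to the process.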

### Submit a batch job

Create a Python script named, e.g., `my_first_torch_job.py`:

```python
import torch 

device = torch.device("cuda")  # use gpu
a = torch.tensor([1., 1.], device=device)
b = torch.tensor([2., 2.], device=device)

print(a + b)
```

Create a Slurm batch script for it named, e.g., `submit.sh`:

```bash
#!/bin/bash -l

#SBATCH --job-name=test_job
#SBATCH --time=0-00:10:00  
#SBATCH --ntasks=1      
#SBATCH --cpus-per-task=2     # 2 cpus for the job
#SBATCH --gpus=1              # 1 gpu for the job 
#SBATCH --constraint=gpu
#SBATCH --qos=regular         # use the `regular` queue
#SBATCH --account=matgen_g    # don't forget the `_g`; you may want to use `jcesr_g`

module load python
module load pytorch/1.10.0

conda activate matml

python my_first_torch_job.py
```

Submit your job. Make sure no conda environment is activated, otherwise `module load python` in `submit.sh` won't work. If you are in an environment, run `$ conda deactivate` first; alternatively, log out and log in again before running the command below:

```
$ sbatch submit.sh
```
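If you want to submit from a Python script, for example to launch several jobs in a loop, the submission can be wrapped as sketched below. The helper names `parse_job_id` and `submit` are made up for illustration; the sketch relies on `sbatch` printing a line of the form `Submitted batch job <id>` on success.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script="submit.sh"):
    """Run `sbatch <script>` and return the job ID (requires Slurm on PATH)."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

Calling `submit()` from the directory containing `submit.sh` returns the job ID as an integer, which you can then pass to other Slurm commands.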

## Contact

If you find something wrong, please fix it. If you are unsure, let Mingjian Wen know and he can probably help.

