High Performance Computing
This page describes how to set up and run calculations on various computing resources that our group has access to.
Our group’s main computing resources are:
  • NERSC (the LBNL supercomputing center, one of the biggest in the world)
  • Lawrencium / Berkeley Lab Research Computing
  • Savio / Berkeley Research Computing
  • Peregrine (the NREL supercomputing center)
  • Argonne Leadership Computing Facility (sometimes)
  • Oak Ridge Leadership Computing Facility (sometimes)
At any time, if you feel you are computing-limited, please contact Kristin so she can work with you on finding solutions.

NERSC

To get started with calculations at NERSC:

  1. Ask Kristin whether you will be running at NERSC and, if so, which account / repository to charge.
  2. Request a NERSC account through the NERSC homepage (Google “NERSC account request”).
  3. A NERSC Liaison or PI Proxy will validate your account and assign you an initial allocation of computing hours.
  4. At this point, you should be able to log in, check CPU-hour balances, etc. through the “NERSC NIM” and “My NERSC” portals.
  5. To log in and run jobs on the various machines at NERSC, review the NERSC documentation.
  6. To load and submit scripts for various codes (VASP, ABINIT, Quantum Espresso), NERSC has lots of information to help. Try Google, e.g. “NERSC VASP”.
    • Note that for commercial codes such as VASP, there is an online form where you enter your VASP license; NERSC will confirm it and then grant you access.
  7. Please make a folder inside your project directory and submit all your jobs there, as your home folder has only about 40GB of space. For example, for the matgen project, your work folder path should be something like the following:
    /global/project/projectdirs/matgen/YOUR_NERSC_USERNAME
  8. You can also request a Mongo database for your project to be hosted on NERSC. Google “MongoDB on NERSC” for instructions. Donny Winston or Patrick Huck can also help you get set up and provide you with a preconfigured database suited for running Materials Project style workflows.

Running Jobs on NERSC

This tutorial provides a brief overview of setting yourself up to run jobs on NERSC. If any information is unclear or missing, feel free to edit this document or contact Kara Fong.

Setting up a NERSC account:

Contact the group’s NERSC Liaison (currently Eric Sivonxay, see Group Jobs list). They will help you create an account and allocate you computational resources. You will then receive an email with instructions to fill out the Appropriate Use Policy form, set up your password, etc.
Once your account is set up, you can manage it at the NERSC Information Management (NIM) website.

Logging on (Setup):

Connecting with SSH:

You must use the SSH protocol to connect to NERSC. Make sure you have SSH installed on your local computer (you can check this by typing which ssh) and that you have a directory named $HOME/.ssh on your local computer (if not, make it).
You will also need to set up multi-factor authentication with NERSC. This will allow you to generate "one time passwords" (OTPs). You will need to append an OTP to the end of your NIM password each time you log on to a NERSC cluster.
We also advise you to configure an ssh socket so that you have to log into NERSC with an OTP only once per session (helpful if you are scp-ing things). To do this:
  1. Create the directory ~/.ssh/sockets if it doesn't already exist.
  2. Open your ssh config file ~/.ssh/config (or create one if it doesn't exist) and add the following:
    Host *.nersc.gov
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600
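The two steps above can also be scripted. The following is a minimal sketch, not an official NERSC tool: setup_ssh_sockets is a hypothetical helper name, and the ControlPath value assumes the common OpenSSH form %r@%h-%p (remote user, host, port). It appends the stanza only if one for the same host pattern is not already present, so it is safe to run more than once.

```shell
# setup_ssh_sockets: hypothetical helper (an assumption of this sketch).
# Creates the sockets directory and appends a connection-sharing stanza
# for the given host pattern, unless that pattern is already configured.
setup_ssh_sockets() {
    local host_pattern="$1"            # e.g. "*.nersc.gov"
    local ssh_dir="${2:-$HOME/.ssh}"   # second argument lets you try it on a scratch directory
    local config="$ssh_dir/config"

    mkdir -p "$ssh_dir/sockets"
    chmod 700 "$ssh_dir"
    touch "$config"
    chmod 600 "$config"

    # Do nothing if a stanza for this exact host pattern already exists.
    if grep -qxF "Host $host_pattern" "$config"; then
        return 0
    fi

    cat >> "$config" <<EOF

Host $host_pattern
    ControlMaster auto
    ControlPath $ssh_dir/sockets/%r@%h-%p
    ControlPersist 600
EOF
}

# Example (commented out so that sourcing this file changes nothing):
#   setup_ssh_sockets "*.nersc.gov"
```

Passing a second argument (any empty directory) is a convenient way to inspect the generated config before touching your real ~/.ssh.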
You should now be ready to log on!
You must store your SSH public key on the NERSC NIM database. Go to the NIM website, navigate to “My Stuff” -> “My SSH Keys”. Click on the SSH Keys tab. Copy your key (from id_rsa.pub) into the website’s text box, click Add.

Logging on:

Log on to Cori, for example, by submitting the following command in the terminal:
ssh <your_username>@cori.nersc.gov
You will be prompted to enter your passphrase, which is your NIM password+OTP (e.g. <your_password><OTP> without any spaces). This will log you onto the cluster and take you to your home directory. You may also find it useful to set up an alias for signing on to HPC resources. To do this, add the following line to your bash_profile:
alias cori="ssh <your_username>@cori.nersc.gov"
Now you will be able to initiate an SSH connection to Cori just by typing cori in the command line and pressing enter.

Transferring files to/from NERSC:

For small files, you can use SCP (secure copy). To get a file from NERSC, use:
scp [email protected]:/remote/path/myfile.txt /local/path
To send a file to NERSC, use:
scp /local/path/myfile.txt [email protected]:/remote/path
To move a larger quantity of data using a friendlier interface, use Globus Online.

Mounting NERSC's file system locally:

If you regularly work with files from NERSC and you find it annoying to scp them, you could mount your NERSC home folder locally. First you need to install sshfs by downloading the stable releases of macFUSE and SSHFS from this website. Next, make a folder to mount onto:
mkdir ~/sshfs
mkdir ~/sshfs/cori
Next, add this to your bash profile, making sure to add your own username to replace NERSC_USERNAME:
alias sshfs-cori="sshfs -o 'IdentityFile=~/.ssh/nersc,reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,auto_cache,Compression=no,follow_symlinks' [email protected]: ~/sshfs/cori"
function unmount {
    pkill -9 sshfs
    sudo umount -f "$HOME/sshfs/cori"
}
Source your bash profile or restart your terminal. Try it out by running
sshfs-cori
Then you should have access to your Cori home folder within the ~/sshfs/cori/ folder. There are 4 data transfer nodes, so if the command is hanging or timing out, try a different data transfer node.
To terminate the connection, run
unmount

Running and monitoring jobs:

The following instructions are for running on Cori. Analogous information for running on Edison can be found here.
Most jobs are run in batch mode, in which you prepare a shell script telling the batch system how to run the job (number of nodes, time the job will run, etc.). NERSC’s batch system software is called SLURM. Below is a simple batch script example, copied from the NERSC website:
#!/bin/bash -l

#SBATCH -N 2 #Use 2 nodes
#SBATCH -t 00:30:00 #Set 30 minute time limit
#SBATCH -q regular #Submit to the regular QOS
#SBATCH -L scratch #Job requires $SCRATCH file system
#SBATCH -C haswell #Use Haswell nodes

srun -n 32 -c 4 ./my_executable
Here, the first line specifies which shell to use (in this case bash). The keyword #SBATCH is used to start directive lines (click here for a full description of the sbatch options you can specify). The word “srun” starts execution of the code.
To submit your batch script, use sbatch myscript.sl in the directory containing the script file.
Below are some useful commands to control and monitor your jobs:
sqs -u username    (Lists jobs for your account)
scancel job_id     (Cancels a job from the queue)
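When you script submissions, it helps to capture the job ID so it can be passed on to the monitoring and cancelling commands. The sketch below relies on sbatch's --parsable option, which prints just the job ID (optionally followed by ";cluster") instead of a full sentence. The helper name submit_job is hypothetical.

```shell
# submit_job: hypothetical helper (not a NERSC or SLURM command).
# Submits a batch script and prints only the numeric job ID.
submit_job() {
    local script="$1" jobid
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    jobid=$(sbatch --parsable "$script") || return 1
    # Strip the optional ";cluster" suffix.
    jobid=${jobid%%;*}
    echo "$jobid"
}

# Example use on a login node (commented out; requires SLURM):
#   id=$(submit_job myscript.sl) && squeue -j "$id"
#   scancel "$id"
```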

Choosing a QOS (quality of service):

You specify which queue to use in your batch file. Use the debug queue for small, short test runs, the regular queue for production runs, and the premium queue for high-priority jobs.

Choosing a node type (haswell vs knl):

You may also specify, within your batch file, the resource type you would like your job to run on. When running on Cori, there are two CPU architectures available: Haswell and Knights Landing (known as KNL). Running on a Haswell node will afford you high individual core performance with up to 32 cores per node (or 64 threads per node). A KNL node provides a large core count (68 cores per node or 272 threads per node), which is suitable for programs capable of effectively utilizing multithreading. On Cori, there are 2388 Haswell nodes and 9688 KNL nodes.
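As a worked example of how these thread counts relate to the srun options in a batch script, the sketch below computes the task count for srun -n from the number of nodes and the logical CPUs per task (srun -c). It assumes you want to fill every hardware thread quoted above (64 per Haswell node, 272 per KNL node); tasks_for is a hypothetical helper name, and NERSC may recommend leaving some KNL cores free.

```shell
# tasks_for: hypothetical helper (an assumption of this sketch).
# Prints the task count for `srun -n`, given the node count, the logical
# CPUs per task (`srun -c`), and the node type.
tasks_for() {
    local nodes="$1" cpus_per_task="$2" node_type="$3" threads_per_node
    case "$node_type" in
        haswell) threads_per_node=64 ;;    # 32 cores x 2 threads
        knl)     threads_per_node=272 ;;   # 68 cores x 4 threads
        *) echo "unknown node type: $node_type" >&2; return 1 ;;
    esac
    # Integer division: tasks per node, times number of nodes.
    echo $(( nodes * (threads_per_node / cpus_per_task) ))
}

tasks_for 2 4 haswell   # prints 32
```

With 2 Haswell nodes and -c 4 this gives 32 tasks, i.e. srun -n 32 -c 4.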

Automatic job submission on NERSC: crontab

In order to automatically manage job submission at NERSC, you can use crontab. You can submit jobs periodically even when you are not signed in to any NERSC systems and perhaps reduce the queue time from 5-10 days to a few hours. This is possible because of the way jobs are managed in atomate/fireworks. Please make sure you feel comfortable submitting individual jobs via atomate before reading this section.
In atomate, if you set, for example, --maxloop 3 in the rocket_launch line of your my_qadapter.yaml, a running job that still finds no READY Fireworks in your LaunchPad after 3 tries (about one per minute) will stop, so that computing resources are not wasted. Conversely, if you have been using crontab for a few days and READY Fireworks are available, the jobs you submitted days ago will pull those Fireworks as soon as they start running on NERSC, reducing the turnaround from a few days to a few hours. To set up crontab:
  1. ssh to the node where you want to set up the crontab. Choose one that is easy to remember, such as cori01 or edison01. To log in to a specific node, run, for example, “ssh cori01” after you log in to the system (Cori in this example).
  2. Type and enter: crontab -e
  3. Now you can set up the following command in the vi editor that opens. It runs the SCRIPT.sh file every 2 hours (at minute 0 of every second hour), every day. Note that a step such as */120 in the minute field would not do this, because the minute field only spans 0-59:
    0 */2 * * * /bin/bash -l PATH_TO_SCRIPT.sh >> PATH_TO_LOGFILE
  4. Set up your SCRIPT.sh like the following (as a suggestion, you can simply put this file, and the log file that records the submission states, in your home folder):
    source activate YOUR_PRODUCTION_CONDA_ENVIRONMENT
    export FW_CONFIG_FILE=PATH_TO_CONFIG_DIR/FW_config.yaml
    cd PATH_TO_YOUR_PRODUCTION_FOLDER
    qlaunch --fill_mode rapidfire -m 1000 --nlaunches 1
    The last line of this file is what actually submits your job inside your production folder, using the settings in your FW_config.yaml file. See the atomate documentation for more info.
  5. Please make sure to set your PRODUCTION_FOLDER under /global/project/projectdirs/, which has much more space than your home folder and is also backed up. Keep an eye on how close you are to the disk space and file number limits by checking https://my.nersc.gov/ periodically.
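If you prefer not to edit the crontab by hand, the entry can be installed with a small filter. This is only a sketch: add_cron_entry is a hypothetical helper name, and it assumes a two-hour schedule written as 0 */2 * * *. It reads an existing crontab on standard input and writes the new one to standard output, replacing any earlier entry that mentions the same script, so running it repeatedly never duplicates the job.

```shell
# add_cron_entry: hypothetical helper (an assumption of this sketch).
# stdin:  the current crontab (may be empty)
# stdout: the same crontab with exactly one entry for the given script
add_cron_entry() {
    local script="$1" logfile="$2"
    # Keep every existing line that does not mention the script.
    # grep exits non-zero when nothing is printed, which is fine here.
    grep -vF -- "$script" || true
    # Minute 0 of every second hour; stderr is appended to the same log.
    printf '0 */2 * * * /bin/bash -l %s >> %s 2>&1\n' "$script" "$logfile"
}

# Example use on the login node that holds your crontab (commented out):
#   crontab -l 2>/dev/null | add_cron_entry "$HOME/SCRIPT.sh" "$HOME/cron.log" | crontab -
```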

Running Jupyter Notebooks on Cori

Jupyter notebooks are quickly becoming an indispensable tool for doing computational science. In some cases, you might want to (or need to) harness NERSC computing power inside of a jupyter notebook. To do this, you can use NERSC’s new Jupyterhub system at https://jupyter-dev.nersc.gov/. These notebooks are run on a large memory node of Cori and can also submit jobs to the batch queues (see http://bit.ly/2A0mqrl for details). All of your files and the project directory will be accessible from the Jupyterhub, but your conda envs won’t be available before you do some configuration.
To set up a conda environment so it is accessible from the Jupyterhub, activate the environment and set up an ipython kernel. To do this, run the command “pip install ipykernel”. More info can be found at http://bit.ly/2yoKAzB.

Automatic Job Packing with FireWorks

DISCLAIMER: Only use job packing if you have trouble with typical job submission. The following tip is not 100% guaranteed to work, and is based on limited, subjective experience on Cori. Talk to Alex Dunn ([email protected]) for help if you have trouble.
The Cori queue system can be unreasonably slow when submitting many (e.g., hundreds, thousands) of small (e.g., single node or 2 nodes) jobs with qos-normal priority on Haswell. In practice, we have found that the Cori job scheduler will give your jobs low throughput if you have many jobs in queue, and you will often only be able to run 5-30 jobs at a time, while the rest wait in queue for far longer than originally expected (e.g., weeks). While there is no easy way to increase your queue submission rate (AFAIK), you can use FireWorks job-packing to “trick” Cori’s SLURM scheduler into running many jobs in serial on many parallel compute nodes with a single queue submission, vastly increasing throughput.
You can use job packing with the “multi” option to rlaunch. This command launches N parallel python processes on the Cori scheduling node, each of which runs a job using M compute nodes.
The steps to job packing are:
  1. Edit your my_qadapter.yaml file to reserve N * M nodes for each submission. For example, if each of your jobs takes M = 2 nodes, and you want an N = 10x speedup, reserve 20 nodes per queue submission.
  2. Change your rlaunch command to:
    rlaunch -c /your/config multi N
To have each FireWorks process run as many jobs as possible in serial before the walltime, use the --nlaunches 0 option. To prevent FireWorks from submitting jobs with little walltime left (causing jobs to frequently get stuck as “RUNNING”), set the --timeout option. Make sure --timeout is set so that even a long running job submitted at the end of your allocation will not run over your walltime limit. Your my_qadapter.yaml should then have something similar to the following lines:
rocket_launch: rlaunch -c /your/config multi 10 --nlaunches 0 --timeout 169200
nodes: 20
Typically, setting N <= 10 will give you a good N-times speedup with no problems. There are no guarantees, however, when N > 10-20. Use N > 50 at your own risk!
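The arithmetic behind these two settings can be sketched as follows. The helper packing_settings is hypothetical (it is not part of FireWorks), and the rule used for --timeout is an assumption: the queue walltime minus the runtime of your longest job, in seconds, so that no job is started too late to finish. With N = 10, M = 2, a 48-hour walltime and jobs of at most 1 hour, it reproduces the values shown above.

```shell
# packing_settings: hypothetical helper (an assumption of this sketch).
# Prints the two my_qadapter.yaml lines for FireWorks job packing.
packing_settings() {
    local n_parallel="$1" nodes_per_job="$2" walltime_hours="$3" longest_job_hours="$4"
    # Reserve N * M nodes per queue submission.
    local nodes=$(( n_parallel * nodes_per_job ))
    # Stop launching new jobs once a long job could no longer finish in time.
    local timeout=$(( (walltime_hours - longest_job_hours) * 3600 ))
    echo "rocket_launch: rlaunch -c /your/config multi ${n_parallel} --nlaunches 0 --timeout ${timeout}"
    echo "nodes: ${nodes}"
}

packing_settings 10 2 48 1
```

Adjust the last two arguments to your own queue walltime and worst-case job length.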

Jobs on Premium QOS

By default, premium QOS access is turned off for everyone in the group. When there is a scientific emergency (for example, you need to complete a calculation ASAP for a meeting with collaborators the next day), premium QOS can be used. In such cases, please contact Howard ([email protected] or on Slack) to request premium QOS access. The access will then be turned off automatically after three weeks or after the emergency has been dealt with.

Berkeley Research Computing

Berkeley Research Computing (BRC) hosts the Savio supercomputing cluster. Savio operates on a condo computing model, in which many PIs and researchers purchase nodes to add to the system. Nodes are accessible to all who have access to the system, though priority access is given to the contributors of the specific nodes. BRC provides several types of allocations, including:
  • Condo: priority access for nodes contributed by the condo group.
  • Faculty Computing Allowance (FCA): limited computing time provided to each faculty member using Savio.

Setting up a BRC account

Please make sure you will actually be performing work on Savio before requesting an account. To get an account on Savio, fill out the form linked below, making sure to select the appropriate allocation. Typically, most students and postdocs will be running on co_lsdi. For more information, visit Berkeley Research Computing (http://research-it.berkeley.edu/services/high-performance-computing).
After your account is made, you'll need to set up 2-factor authentication. This will allow you to generate "one time passwords" (OTPs). You will need to append an OTP to the end of your password each time you log on to Savio. We recommend using Google Authenticator, although any OTP manager will work.

Logging on (Setup):

You must use the SSH protocol to connect to BRC. Make sure you have SSH installed on your local computer (you can check this by typing which ssh). Make sure you have a directory named $HOME/.ssh on your local computer (if not, make it).
We also advise you to configure an ssh socket so that you have to log into BRC with an OTP only once per session (helpful if you are scp-ing things). To do this:
  1. Create the directory ~/.ssh/sockets if it doesn't already exist.
  2. Open your ssh config file ~/.ssh/config (or create one if it doesn't exist) and add the following:
    Host *.brc.berkeley.edu
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600
You should now be ready to log on!

Logging on to BRC

To access your shiny new Savio account, you'll want to SSH onto the system from a terminal.
You will be prompted to enter your passphrase+OTP (e.g. <your_password><OTP> without any spaces). This will take you to your home directory. You may also find it useful to set up an alias for signing on to HPC resources. To do this, add the following line to your bash_profile:
alias savio="ssh [email protected]"
Now you will be able to initiate an SSH connection to Savio just by typing savio in the command line and pressing enter.

Running on BRC

Under the condo account co_lsdi, we have exclusive access to 28 KNL nodes. Additionally, we can run on other nodes in low-priority mode.

Accessing Software binaries

Software within BRC is managed through modules. You can access precompiled, preinstalled software by loading the desired module.
module load <module_name>
To view a list of currently installed programs, use the following command:
module avail
To view the currently loaded modules, use the command:
module list
Software modules can be removed by using either of the following two commands:
module unload <module_name>
module purge

Accessing In-House software packages

The Persson Group maintains its own versions of popular codes such as VASP, GAUSSIAN, QCHEM and LAMMPS. To access these binaries, ensure that you have the proper licenses and permissions, then append the following line to the .bashrc file in your home directory:
export MODULEPATH=${MODULEPATH}:/global/home/groups/co_lsdi/sl7/modfiles

Using Persson Group Owned KNL nodes

To run on the KNL nodes, use the following job script, replacing <executable> with the desired job executable name. To run VASP after loading the proper module, use vasp_std, vasp_gam, or vasp_ncl.
#!/bin/bash -l
#SBATCH --nodes=1 #Use 1 node
#SBATCH --ntasks=64 #Use 64 tasks for the job
#SBATCH --qos=lsdi_knl2_normal #Set job to normal qos
#SBATCH --time=01:00:00 #Set 1 hour time limit for job
#SBATCH --partition=savio2_knl #Submit to the KNL nodes owned by the Persson Group
#SBATCH --account=co_lsdi #Charge to co_lsdi account
#SBATCH --job-name=savio2_job #Name for the job

mpirun --bind-to core <executable>

Running on Haswell Nodes (on Low Priority)

To run on Haswell nodes, use the following slurm submission script, replacing <executable> with the desired job executable name:
#!/bin/bash -l
#SBATCH --nodes=1 #Use 1 node
#SBATCH --ntasks-per-core=1 #Use 1 task per core on the node
#SBATCH --qos=savio_lowprio #Set job to low priority qos
#SBATCH --time=01:00:00 #Set 1 hour time limit for job
#SBATCH --partition=savio2 #Submit to the haswell nodes
#SBATCH --account=co_lsdi #Charge to co_lsdi account
#SBATCH --job-name=savio2_job #Name for the job

mpirun --bind-to core <executable>

LBNL Laboratory Research Computing

Lawrence Berkeley National Laboratory's Laboratory Research Computing (LRC) hosts the Lawrencium supercomputing cluster. LRC operates on a condo computing model, in which many PIs and researchers purchase nodes to add to the system. Nodes are accessible to all who have access to the system, though priority access is given to the contributors of the specific nodes. LRC provides several types of allocations, including:
  • Condo: priority access for nodes contributed by the condo group.
  • PI Computing Allowance (PCA): limited computing time provided to each PI using Lawrencium.

Persson Group ES1 GPU Node Specs:

GPU Computing Node
  • Processors: Dual-socket, 4-core, 2.6GHz Intel 4112 processors (8 cores/node)
  • Memory: 192GB (8 X 8GB) 2400MHz DDR4 RDIMMs
  • Interconnect: 56Gb/s Mellanox ConnectX5 EDR Infiniband interconnect
  • GPU: 2 ea. Nvidia Tesla v100 accelerator boards
  • Hard Drive: 500GB SSD (Local swap and log files)

Setting up a LRC account

Please make sure you will actually be performing work on Lawrencium before requesting an account. To get an account on Lawrencium, fill out this form and the user agreement to obtain your account and a one-time password token generator. You will also need to set up an MFA token for your account.

Before logging on (setup)

You must use the SSH protocol to connect to Lawrencium. Make sure you have SSH installed on your local computer (you can check this by typing which ssh). Make sure you have a directory named $HOME/.ssh on your local computer (if not, make it).
After your account is made, you'll need to set up 2-factor authentication. We recommend using Google Authenticator, although any OTP manager will work.
You should now be ready to log on!

Logging on to LRC

To access your shiny new Lawrencium account, you'll want to SSH onto the system from a terminal.
ssh <your_username>@lrc-login.lbl.gov
You will be prompted to enter your pin+OTP (e.g. <your_pin><OTP> without any spaces). This will take you to your home directory. You may also find it useful to set up an alias for signing on to HPC resources. To do this, add the following line to your bash_profile:
alias lawrencium="ssh <your_username>@lrc-login.lbl.gov"
Now you will be able to initiate an SSH connection to Lawrencium just by typing lawrencium in the command line and pressing enter.

Running on LRC

Under the condo accounts condo_mp_cf1 (56 cf1 nodes) and condo_mp_es1 (1 GPU node), we have exclusive access to certain Lawrencium nodes. If you do not know which of these node groups you are supposed to be running on, you probably shouldn't be running on Lawrencium. Additionally, we can run on ES1 GPU nodes in low-priority mode (es_lowprio).

Accessing Software binaries

Software within LRC is managed through modules. You can access precompiled, preinstalled software by loading the desired module.
module load <module_name>
To view a list of currently installed programs, use the following command:
module avail
To view the currently loaded modules, use the command:
module list
Software modules can be removed by using either of the following two commands:
module unload <module_name>
module purge

Using Persson Group Owned LRC nodes

To run on the nodes, use the following job scripts, replacing <executable> with the desired job executable name. For the cf1 nodes:
#!/bin/bash
# Job name:
#SBATCH --job-name=<job_name>
#
# Partition:
#SBATCH --partition=cf1
#
# QoS:
#SBATCH --qos=condo_mp_cf1
#
# Account:
#SBATCH --account=lr_mp
#
# Nodes (IF YOU CHANGE THIS YOU MUST CHANGE ntasks too!!!):
#SBATCH --nodes=1
#
# Processors (MUST BE 64xNUM_NODES ALWAYS!!!):
#SBATCH --ntasks=64
#
# Wall clock limit:
#SBATCH --time=24:00:00

## Run command

module load vasp/6.prerelease-vdw
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=1 # NEVER CHANGE THIS!!

mpirun --bind-to core <executable>
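Because the cf1 script requires --ntasks to be exactly 64 times --nodes, it can be worth checking a script before submitting it. The following is a minimal sketch; check_cf1_tasks is a hypothetical helper name, and it assumes the directives are written in the long form used above (#SBATCH --nodes=... and #SBATCH --ntasks=...).

```shell
# check_cf1_tasks: hypothetical helper (an assumption of this sketch).
# Returns 0 only if the script's --ntasks equals 64 x --nodes.
check_cf1_tasks() {
    local script="$1" nodes ntasks
    # Extract the first numeric value of each directive.
    nodes=$(sed -n 's/^#SBATCH --nodes=\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    ntasks=$(sed -n 's/^#SBATCH --ntasks=\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    if [ -z "$nodes" ] || [ -z "$ntasks" ]; then
        echo "could not find --nodes and --ntasks in $script" >&2
        return 2
    fi
    if [ "$ntasks" -ne $(( 64 * nodes )) ]; then
        echo "--ntasks=$ntasks, expected $(( 64 * nodes )) for --nodes=$nodes" >&2
        return 1
    fi
}

# Example use (commented out; requires SLURM):
#   check_cf1_tasks job.sh && sbatch job.sh
```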
For the ES1 GPU node under the condo QoS (condo_mp_es1):
#!/bin/bash
# Job name:
#SBATCH --job-name=<job_name>
#
# Partition:
#SBATCH --partition=es1
#
# QoS:
#SBATCH --qos=condo_mp_es1
#
# Account:
#SBATCH --account=lr_mp
#
# GPUs:
#SBATCH --gres=gpu:2
#
# CPU cores:
#SBATCH --cpus-per-task=8
#
# Constraints:
#SBATCH --constraint=es1_v100
#
# Wall clock limit:
#SBATCH --time=24:00:00

export CUDA_VISIBLE_DEVICES=0,1
module load cuda/10.0
For ES1 GPU nodes at low priority (es_lowprio):
#!/bin/bash
# Job name:
#SBATCH --job-name=<job_name>
#
# Partition:
#SBATCH --partition=es1
#
# QoS:
#SBATCH --qos=es_lowprio
#
# Account:
#SBATCH --account=lr_mp
#
# GPUs:
#SBATCH --gres=gpu:2
#
# CPU cores:
#SBATCH --cpus-per-task=8
#
# Constraints:
#SBATCH --constraint=es1_v100
#
# Wall clock limit:
#SBATCH --time=24:00:00

export CUDA_VISIBLE_DEVICES=0,1
module load cuda/10.0

Interactive Jobs on the Group GPU Condo Node

To run an interactive session on the GPU node, use the following two commands to provision and log in to the node:
salloc --time=24:00:00 --nodes=1 -p es1 --gres=gpu:2 --cpus-per-task=8 --qos=condo_mp_es1 --account=lr_mp
srun --pty -u bash -i

Back-Up Data Frequently

Mongo DB

You should back up your Mongo DB data frequently. The Mongo DB that NERSC offers is not backed up automatically, so it's important to run regular backups during the course of your research. For Mongo DB you can:
  • Get a free education license for Studio 3T, right click your database, and click export, or
  • use the “mongodump” command line tool (tutorials are available online); it’s a one line command:
module load mongodb; mongodump --host=<host e.g. mongo01.nersc.gov> --port=20717 --username=<username> --password="<password>" --db="<db_name>" --authenticationDatabase="<db_name (same as --db flag)>" --out="<path to directory>"
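To make regular backups easier, the one-line command can be wrapped so that each dump lands in its own dated directory. This is only a sketch: mongo_backup is a hypothetical helper name, and reading the password from a MONGO_PASSWORD environment variable is an assumption made here to keep it out of your shell history. The mongodump flags are the same ones used in the command above.

```shell
# mongo_backup: hypothetical wrapper around mongodump (an assumption of
# this sketch). Writes the dump to <backup_root>/<db>-<YYYY-MM-DD>.
# The password is taken from the MONGO_PASSWORD environment variable.
mongo_backup() {
    local host="$1" port="$2" user="$3" db="$4" backup_root="$5"
    local out_dir
    out_dir="$backup_root/$db-$(date +%Y-%m-%d)"
    mkdir -p "$out_dir"
    mongodump --host="$host" --port="$port" \
        --username="$user" --password="$MONGO_PASSWORD" \
        --db="$db" --authenticationDatabase="$db" \
        --out="$out_dir"
}

# Example use after `module load mongodb` (commented out; needs a real database):
#   MONGO_PASSWORD='<password>' mongo_backup <host> <port> <username> <db_name> "$HOME/mongo_backups"
```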

NERSC Cori, BRC Savio, LRC Lawrencium

You should back up your "scratch" directory data frequently.
Each cluster has different methods of deleting old data on your scratch directory:
You can use one of the following popular tools for backing up raw data in our group:
Both of these tools can help back up your raw calculation files to Google Drive, Box Drive, or even external hard drives.

Additional resources

Other Persson group members and the NERSC website are both excellent resources for getting additional help. If that fails, you can reach out to the NERSC Operations staff:
  • 1-800-666-3772 (or 1-510-486-8600)
  • Computer Operations = menu option 1 (24/7)
  • Account Support = menu option 2, [email protected]
  • HPC Consulting = menu option 3, or [email protected]
Special thanks to the original authors of this page: Kara Fong, Eric Sivonxay, and John Dagdelen