Fireworks: Advanced Usage
This page builds upon the Fireworks + Atomate Setup page to provide practical tips and advice for more advanced use cases, which may be helpful once you start running a larger number of calculations.
Note: atomate2 is a young code that is still in development. These research codes are under continuous improvement, so if something here isn't working, please ask! It is likely the documentation needs to be updated.
For more information on best practices for running VASP on multiple nodes (i.e. how to set `vasp_cmd` in `my_fworker.yaml` based on the number of nodes requested in `my_qadapter.yaml`), see the NERSC VASP training.
Follow this flowchart to decide how many nodes to request, how many parallel processes to use, and how many cores to assign to each parallel process. The rule of thumb to remember is that you should use all of the cores on every node that you request. For example, if there are 256 cores per node and you request `N` nodes, `n` processes, and `c` cores per process, then the following relationship must hold:

n × c = 256 × N

Specify the number of nodes `N` in `my_qadapter.yaml`, and specify `n` and `c` (e.g. via the parallelization flags of `vasp_cmd`) in `my_fworker.yaml` within your configuration folder.
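As a sanity check on the bookkeeping above, the valid process/core splits can be scripted. The helper below is a hypothetical sketch (not part of atomate or FireWorks) that enumerates the (n, c) pairs satisfying n × c = 256 × N for a requested node count:

```python
# Hypothetical helper (not part of atomate/FireWorks): enumerate the
# (processes, cores_per_process) pairs that use every core on N nodes.
CORES_PER_NODE = 256  # cores per node, as in the example above


def valid_splits(n_nodes, cores_per_node=CORES_PER_NODE):
    """Return all (n, c) pairs with n * c == cores_per_node * n_nodes."""
    total = cores_per_node * n_nodes
    return [(n, total // n) for n in range(1, total + 1) if total % n == 0]


if __name__ == "__main__":
    for n, c in valid_splits(1):
        print(f"{n} processes x {c} cores/process")
```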
In research you may need to run different types of calculations and want to switch easily between various setups. This section builds upon the previous content to show how to set up and run fireworks with multiple configurations.

For demonstration purposes, here we will illustrate how to set up two fireworks configurations for switching between running calculations with 1 node vs. 4 nodes. We will make a few other assumptions based on common best practices:
- You have applied tags (`"1N"` or `"4N"`) to specify which fireworks should be run with the 1N vs. 4N fireworks configuration
- You have created separate Python environments (`cms_1N` and `cms_4N`) for running each type of calculation
These basic steps can be adapted for making many more different custom fireworks configurations for every type of calculation you need!
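To see why tagging works, recall that each FireWorker's `query` is a MongoDB filter on the firework spec. The snippet below is an illustrative sketch (plain Python, no FireWorks import) of how the `$all` query used in `my_fworker.yaml` selects only fireworks whose spec carries the matching tag:

```python
import json

# The query string from the 1N my_fworker.yaml:
query = json.loads('{"spec.tags": {"$all": ["1N"]}}')


def matches(doc, query):
    """Minimal sketch of MongoDB's $all operator for dotted field names."""
    for dotted_field, cond in query.items():
        value = doc
        for key in dotted_field.split("."):
            value = value.get(key, {})
        if not all(tag in value for tag in cond["$all"]):
            return False
    return True


print(matches({"spec": {"tags": ["1N"]}}, query))  # matched by the 1N worker
print(matches({"spec": {"tags": ["4N"]}}, query))  # skipped by the 1N worker
```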
Based on the Getting Started Fireworks Configuration section, you should have a directory for your fireworks configuration files named `fw_config` in your home directory. The `db.json` file in your `fw_config` directory, set up in the previous section, can be used as-is, assuming you want the results from all of your fireworks configurations saved in the same database.

We will need to create a directory for each fireworks configuration holding the specific files that define the settings for that configuration (e.g. `FW_config.yaml`, `my_fworker.yaml`, `my_launchpad.yaml`, `my_qadapter.yaml`). We will have a directory named `1N` for the single-node configuration and another directory named `4N` for the four-node configuration, so the `fw_config` directory in your home directory should contain the following:

- db.json (file)
- 1N (directory)
- 4N (directory)
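For reference, this layout can be created in one step (a sketch assuming `fw_config` lives directly in your home directory, as in the previous section):

```shell
# Create the per-configuration directories alongside the existing db.json
mkdir -p "$HOME/fw_config/1N" "$HOME/fw_config/4N"
ls "$HOME/fw_config"
```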
Now within the `1N` and `4N` directories you will need to create the following fireworks configuration files with the specific settings you want for each configuration. Note that given the variations between different supercomputers, specifics such as file paths, VASP commands, etc. have been omitted here and the format for NERSC has been used. If needed, see the Getting Started Fireworks Configuration section for examples of appropriate configuration files (`FW_config.yaml`, `my_fworker.yaml`, `my_launchpad.yaml`, `my_qadapter.yaml`) for each specific supercomputer system.

For the single-node (`1N`) configuration:

**FW_config.yaml**

```yaml
CONFIG_FILE_DIR: /<home_directory_path>/fw_config/1N/
ECHO_TEST: "1N fw_config activated!"
```

**my_fworker.yaml**

```yaml
name: 1N_fworker
category: ''
query: '{"spec.tags":{"$all":["1N"]}}'
env:
    db_file: /<home_directory_path>/fw_config/db.json
    vasp_cmd: '<appropriate VASP command>'
    gamma_vasp_cmd: '<appropriate VASP command>'
    scratch_dir: /<scratch_directory_path>/
    incar_update:
```

**my_launchpad.yaml**

```yaml
host: mongodb07.nersc.gov
port: 27017
name: 'fw_db_name'
username: '<admin_username>'
password: '<admin_password>'
logdir: null
strm_lvl: DEBUG
user_indices: []
wf_user_indices: []
```

**my_qadapter.yaml**

```yaml
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -c /<home_directory_path>/fw_config/1N/ rapidfire --nlaunches 1 --max_loops 3
nodes: 1
walltime: '24:00:00'
account: matgen
job_name: 1N
signal: [email protected]
qos: regular
constraint: 'knl'
pre_rocket: |
    source activate cms_1N
    module load vasp-tpc/5.4.4-knl
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=1
```
And similarly for the four-node (`4N`) configuration:

**FW_config.yaml**

```yaml
CONFIG_FILE_DIR: /<home_directory_path>/fw_config/4N/
ECHO_TEST: "4N fw_config activated!"
```

**my_fworker.yaml**

```yaml
name: 4N_fworker
category: ''
query: '{"spec.tags":{"$all":["4N"]}}'
env:
    db_file: /<home_directory_path>/fw_config/db.json
    vasp_cmd: '<appropriate VASP command>'
    gamma_vasp_cmd: '<appropriate VASP command>'
    scratch_dir: /<scratch_directory_path>/
    incar_update:
```

**my_launchpad.yaml**

```yaml
host: mongodb07.nersc.gov
port: 27017
name: 'fw_db_name'
username: '<admin_username>'
password: '<admin_password>'
logdir: null
strm_lvl: DEBUG
user_indices: []
wf_user_indices: []
```

**my_qadapter.yaml**

```yaml
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -c /<home_directory_path>/fw_config/4N/ rapidfire --nlaunches 1 --max_loops 3
nodes: 4
walltime: '48:00:00'
account: matgen
job_name: 4N
signal: [email protected]
qos: regular
constraint: 'knl'
pre_rocket: |
    source activate cms_4N
    module load vasp-tpc/5.4.4-knl
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=1
```
Now that you have your fireworks configuration files set up, we need a way to specify which configuration to use when launching queue jobs. This is done via the value of the bash environment variable `FW_CONFIG_FILE`. You can print the current value of this variable with the following command:

```bash
echo $FW_CONFIG_FILE
```

Note that in the Getting Started Fireworks Configuration instructions, you were instructed to modify your .bashrc or .bashrc.ext file located in your home directory to automatically activate a fireworks configuration by default. Here you should modify that file so you can type a command such as `1N_fw_config` or `4N_fw_config` to activate a specific fireworks configuration, by adding the following lines:

**No Default fw_config**

```bash
alias 1N_fw_config='export FW_CONFIG_FILE="/<home_directory_path>/fw_config/1N/FW_config.yaml"'
alias 4N_fw_config='export FW_CONFIG_FILE="/<home_directory_path>/fw_config/4N/FW_config.yaml"'
```

**Default fw_config Specified**

```bash
export FW_CONFIG_FILE='/<home_directory_path>/fw_config/FW_config.yaml'
alias 1N_fw_config='export FW_CONFIG_FILE="/<home_directory_path>/fw_config/1N/FW_config.yaml"'
alias 4N_fw_config='export FW_CONFIG_FILE="/<home_directory_path>/fw_config/4N/FW_config.yaml"'
```
Now you should be ready to use either fireworks configuration to launch jobs into the supercomputer queue! Here are some examples of what you might do:

**Launch 10 Jobs with 1N_fw_config**

```bash
conda activate cms_1N
1N_fw_config
echo $FW_CONFIG_FILE
qlaunch rapidfire --nlaunches 10
```

**Launch 2 Jobs with 4N_fw_config**

```bash
conda activate cms_4N
4N_fw_config
echo $FW_CONFIG_FILE
qlaunch rapidfire --nlaunches 2
```
We recommend using a tmux or screen session when submitting jobs, to preserve queue submission when the main ssh session is terminated. This allows you to keep the queue saturated with jobs. For example, with tmux:

```bash
module load tmux
tmux new -s background_launcher
source activate cms
mkdir FireworksTest
cd FireworksTest
```

To exit the tmux session, press the "control" and "b" keys together, followed by "d".

To re-enter the tmux session:

```bash
tmux attach -t background_launcher
```
Use the `qlaunch` command to reserve jobs with SLURM's scheduler. `qlaunch` has 3 modes: singleshot, rapidfire, and multi. You will want to use rapidfire. Some useful flags to set are:

- `-m` to specify the maximum number of jobs in the queue at any given time
- `--nlaunches` to specify how many fireworks to run

**Maintain 50 jobs in the queue at all times**

```bash
qlaunch rapidfire --nlaunches infinite -m 50
```

**Add Fill Mode (submit even if there are no ready fws)**

```bash
qlaunch -fm rapidfire --nlaunches infinite -m 50
```
When you are running many fireworks in rapidfire mode, jobs may hang and not complete. This is referred to as "lost runs" in fireworks. To check for lost runs, use the command:

```bash
lpad detect_lostruns
```

To rerun these jobs, add the `--rerun` flag to the command:

```bash
lpad detect_lostruns --rerun
```
You can easily reset all fizzled fireworks back to the ready state to be run again with the following command in the terminal:

```bash
lpad rerun_fws -s FIZZLED
```
When you are running many fireworks, sometimes a calculation that fizzled the first time will complete successfully if you just rerun it. You can check the error message for why a firework fizzled in the MongoDB `launches` collection, under the following path in the document: `action -> stored_data -> _exception -> _stacktrace`. Checking the error can help you decide whether simply rerunning the firework may work.
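For example, once you have fetched a launch document (e.g. with pymongo's `find_one` on the `launches` collection), the stack trace can be pulled out along that path. The walker below is a plain-Python sketch against a made-up launch document:

```python
def get_stacktrace(launch_doc):
    """Follow action -> stored_data -> _exception -> _stacktrace; None if absent."""
    node = launch_doc
    for key in ("action", "stored_data", "_exception", "_stacktrace"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up launch document for illustration only:
doc = {
    "fw_id": 123,
    "action": {
        "stored_data": {
            "_exception": {"_stacktrace": "Traceback (most recent call last): ..."}
        }
    },
}
print(get_stacktrace(doc))
print(get_stacktrace({"fw_id": 124}))  # a launch with no recorded exception
```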
If you have hundreds or thousands of calculations to run, you may be interested in "job packing" to take advantage of the big job discount offered by NERSC, or to improve your queue wait times by reducing the number of jobs you submit to Slurm (e.g. one 1024-node job will be scheduled faster than 1024 single-node jobs). The group has Python scripts that will automatically distribute running fireworks when multiple nodes are requested for a job. If you are interested in setting up job packing for running calculations, consult Jason Munro for access to these GitHub job packing scripts and an introduction.