Training on GPU(s)

Training neural networks on GPUs has exploded in popularity in recent years. We have therefore built mentat-lss with full GPU compatibility on both Linux and macOS (arm64). The hardest part for you is making sure your environment is correctly set up to use your hardware.

Environment troubleshooting

As a first step, you can (and should!) check after installation that PyTorch can see any available GPUs. To do so, run the following commands in a Python terminal:

import torch
# If on Linux, run this command
torch.cuda.is_available()  # <- should return True if correctly built for GPU
# If on macOS, run this command
torch.backends.mps.is_available()  # <- should return True if correctly built for GPU

If the relevant command returns False and you do have a GPU available, you probably installed the CPU-only build of PyTorch. On Linux this can happen if CUDA is not installed; you can check by running nvidia-smi in the terminal, which should list your GPU(s) and driver. If CUDA is properly installed, follow the PyTorch local installation instructions to install a GPU-enabled build.
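The nvidia-smi check can also be scripted, which is handy on remote machines. The following is a minimal sketch using only the Python standard library; the helper name nvidia_smi_status is hypothetical and not part of mentat-lss. It only reports whether nvidia-smi is on your PATH and what it prints, and it says nothing about which PyTorch build you have.

```python
import shutil
import subprocess


def nvidia_smi_status(timeout=10):
    """Return (found, output) for the nvidia-smi command.

    Hypothetical helper for illustration only; not part of mentat-lss.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on PATH (expected on macOS or CPU-only machines)
        return False, "nvidia-smi not found on PATH"
    try:
        # "-L" asks nvidia-smi to list the GPUs it can see, one per line
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        return True, f"nvidia-smi failed to run: {err}"
    return True, (result.stdout or result.stderr).strip()


found, output = nvidia_smi_status()
print("nvidia-smi found:", found)
print(output)
```

If nvidia-smi is found and lists your GPU but torch.cuda.is_available() still returns False, the PyTorch build is the most likely culprit.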

Once you’ve verified that PyTorch can see your GPU, you can simply run our example training script with python ./scripts/train_emulator.py!

Multiple GPUs

The example script above will also attempt to train the emulator on multiple GPUs at once, potentially saving a significant amount of time. It does this by assigning each sub-network in the emulator to a single GPU and periodically synchronizing the results across all GPUs. If you are running the script above, no extra work should be required to use more than one GPU.
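To make the "one sub-network per GPU" idea concrete, here is a minimal sketch of one possible assignment scheme. It assumes a simple round-robin mapping, which may differ from what mentat-lss actually does internally; the function name assign_devices is hypothetical, and the device strings follow PyTorch's "cuda:<index>" convention.

```python
def assign_devices(n_subnetworks, n_gpus):
    """Map each sub-network index to a device string.

    Assumption: round-robin assignment. This is an illustration only and
    is not taken from the mentat-lss source.
    """
    if n_gpus < 1:
        # No GPUs available: every sub-network falls back to the CPU
        return ["cpu"] * n_subnetworks
    # Cycle through the GPUs so each sub-network lives on exactly one device
    return [f"cuda:{i % n_gpus}" for i in range(n_subnetworks)]


print(assign_devices(4, 2))  # ['cuda:0', 'cuda:1', 'cuda:0', 'cuda:1']
```

With more sub-networks than GPUs, several sub-networks share a device, but no single sub-network is ever split across devices.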

Example Slurm script

You will most likely want to train your emulator on an HPC system. To facilitate this, we’ve provided an example Slurm script below, based on running on University of Arizona systems.

#!/bin/bash
#SBATCH --job-name=<name of job>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=8gb
# number of GPUs PER NODE
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00
#SBATCH --partition=<queue name>
#SBATCH --account=<account name>
#SBATCH --output=<name_of_output_file>.out

# load any system-specific modules (ex: anaconda)
module load <modules>
source /path/to/home/.bashrc

# replace with your specific environment
conda activate <anaconda_environment>

cd /path/to/repo/scripts

config_file="/path/to/net/config/file.yaml"
echo "starting job"

python train_emulator.py "$config_file"

Note that the line #SBATCH --gres=gpu:2 requests 2 GPUs on the same node. Support for GPUs spread across different nodes will be added in a future release; until then, attempt multi-node training at your own risk.
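As a quick sanity check inside a job, you can inspect which GPUs were exposed to it. The sketch below assumes your cluster sets the CUDA_VISIBLE_DEVICES environment variable for jobs that request GPUs, which is common with Slurm but depends on site configuration; the helper name visible_gpu_ids is hypothetical and not part of mentat-lss.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration. An empty list means the variable
    is unset or empty, not necessarily that no GPU exists on the node.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # The variable is a comma-separated list such as "0,1"
    return [token.strip() for token in raw.split(",") if token.strip()]


print("GPUs visible to this job:", visible_gpu_ids())
```

With --gres=gpu:2 on a cluster configured this way, you would expect two entries in the list.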

Optimizing Hyperparameters

There are many different methods for optimizing network hyperparameters, all with varying levels of complexity and efficiency. In v1.1, we have added a script that utilizes Optuna, a powerful library that performs hyperparameter optimization far more efficiently than a basic grid search. We highly recommend using this script to optimize your emulator, as it will likely save you a significant amount of time and computational resources.