VASP
Introduction
VASP is a plane-wave density functional theory code that solves the quantum-mechanical Schrödinger equation from first principles. VASP can exploit graphics processing units (GPUs) to accelerate calculations.
Warning
Licensing: Usage of VASP is subject to licensing restrictions. Users are responsible for ensuring that their Principal Investigator (PI) has a valid VASP license before using VASP on NSCC’s systems.
Support
Log a ticket with help@nscc.sg should you face any issues with using VASP on NSCC’s systems, e.g., job submission script issues, compilation, performance-related issues.
Usage
ASPIRE 2A CPU
To run a CPU batch job on ASPIRE 2A, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A CPU nodes
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ncpus=128:mpiprocs=128:ompthreads=1:mem=440gb
#PBS -l walltime=1:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
# modules used
module swap PrgEnv-cray PrgEnv-intel
module swap craype-x86-rome craype-x86-milan
module load mkl/2024.0
module load cray-hdf5-parallel
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
mpirun -np 256 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
### Notes:
# 1. ncpus = mpiprocs * ompthreads. For instance, if you want to use 2 ompthreads
# running on 128 CPU cores per node, you would use the following line:
#
# #PBS -l select=1:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb
#
# Make sure that OMP_NUM_THREADS is set according to the number of threads
# requested.
#
# 2. Modules used here may differ based on your choice of compiler toolchain
# and libraries (MPI, math libraries, ...).
#
# 3. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb => XX = 128
#
# mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
#
# 4. Reminder for PBS resource request:
# "select": number of "chunks" of resource requested. The following parameters
# refer to the amount of resource that is requested per chunk.
# E.g. select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb
# requests for 2 chunks of 128 CPU cores, 64 MPI procs and 2 ompthreads per chunk.
# In this example, each chunk = 1 CPU node.
###
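Putting notes 1, 3 and 4 together, here is a minimal sketch of a hybrid MPI/OpenMP run on 2 CPU nodes with 64 MPI processes and 2 OpenMP threads each; the module setup is assumed to be the same as in the sample script above, and vasp_std is used only as an example executable.

#PBS -l select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2   # must match ompthreads in the select line

# XX = select * mpiprocs = 2 * 64 = 128
mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_std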
ASPIRE 2A GPU
To run a GPU batch job on ASPIRE 2A, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A GPU nodes (Cray MPICH)
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ngpus=4:mpiprocs=4
#PBS -l walltime=3:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
# required for multi-node runs (see note 3 below)
unset CUDA_VISIBLE_DEVICES
export MPICH_GPU_SUPPORT_ENABLED=1
# modules used
module swap PrgEnv-cray PrgEnv-nvhpc
module swap craype-x86-rome craype-x86-milan
module load craype-accel-nvidia80
module swap nvhpc nvhpc/23.7
module swap cuda cuda/11.8.0
module load mkl/2024.0
module load hdf5/1.12.1-nvhpc
module rm cray-libsci
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
# Warning: Cray MPICH is known to be problematic for multi-node runs. Crashes have
# been observed for hybrid functional runs!
mpirun -np 8 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
### Notes
# 1. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ngpus=4:mpiprocs=4 => XX = 8
#
# mpirun -np 8 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
#
# Note that in this example we spawn 1 MPI process per GPU requested.
#
# 2. Number of OpenMP threads. 16 CPU cores are automatically assigned per GPU
# requested. OpenMP threads can be used to run the parts of the calculation that
# still execute on the CPUs efficiently. The optimal number of OpenMP threads
# can be tuned if needed by setting the variable OMP_NUM_THREADS.
#
# 3. unset CUDA_VISIBLE_DEVICES
# export MPICH_GPU_SUPPORT_ENABLED=1
#
# Above two lines are required to enable multi-node GPU runs.
#
# 4. Modify modules used according to your compilation.
#
###
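For comparison, a minimal sketch of a single-node variant of the same job, under the same module and VASP_DIR assumptions as above; per note 3, the multi-node-specific environment settings are not needed here.

#PBS -l select=1:ngpus=4:mpiprocs=4

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16

# No need to unset CUDA_VISIBLE_DEVICES or set MPICH_GPU_SUPPORT_ENABLED
# for a single-node run (see note 3 above).

# XX = select * mpiprocs = 1 * 4 = 4
mpirun -np 4 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam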
Sample PBS job script for ASPIRE 2A GPU nodes (Open MPI)
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ngpus=4:mpiprocs=4
#PBS -l walltime=3:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
# modules used
module load openmpi/5.0.5-nv22.11
module load mkl/2024.0
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
mpirun -np 8 -hostfile $PBS_NODEFILE --map-by ppr:4:node:PE=$OMP_NUM_THREADS \
-x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS $VASP_DIR/vasp_std
### Notes
# 1. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ngpus=4:mpiprocs=4 => XX = 8
#
# mpirun -np 8 ... $VASP_DIR/vasp_std
#
# Note that in this example we spawn 1 MPI process per GPU requested.
#
# 2. Number of OpenMP threads. 16 CPU cores are automatically assigned per GPU
# requested. OpenMP threads can be used to run the parts of the calculation that
# still execute on the CPUs efficiently. The optimal number of OpenMP threads
# can be tuned if needed by setting the variable OMP_NUM_THREADS.
#
# 3. Modify modules used according to your compilation.
#
# 4. For Open MPI, some environment variables need to be exported via
# -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS so that all processes
# can find the executable and its libraries.
###
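As a further illustration of note 1, here is a sketch of how the request and launch line change when scaling the same Open MPI job to 4 GPU nodes; the per-node mapping (ppr:4:node) stays the same and only -np changes. Values are assumptions to adapt to your own job.

#PBS -l select=4:ngpus=4:mpiprocs=4

# XX = select * mpiprocs = 4 * 4 = 16
mpirun -np 16 -hostfile $PBS_NODEFILE --map-by ppr:4:node:PE=$OMP_NUM_THREADS \
       -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS $VASP_DIR/vasp_std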
ASPIRE 2A+
Info
ASPIRE 2A+ is an NVIDIA H100 GPU system and is not intended for CPU workloads.
Info
At present, ASPIRE 2A+ is primarily used for AI workloads. The information provided here is for reference when running VASP on advanced GPU architectures.
To run a batch job on ASPIRE 2A+, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A+
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=1:ngpus=8:mpiprocs=8:ompthreads=14
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=14
unset CUDA_VISIBLE_DEVICES
export OMP_PLACES=threads
export OMP_PROC_BIND=close
### Load relevant modules used for VASP.
### ...
###
mpirun -np 8 -hostfile $PBS_NODEFILE --map-by ppr:8:node:PE=$OMP_NUM_THREADS \
-x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -x OMP_PLACES -x OMP_PROC_BIND \
vasp_std
# For multi-node jobs using Open MPI, the relevant environment variables need to be
# passed in using -x; otherwise, other compute nodes will complain, e.g., "executable not found".
Resource sizing recommendations
Info
Based on our benchmarks, we provide some rules of thumb below to estimate the compute resources to request for a given problem size.
- ASPIRE 2A CPU
    - Regular DFT: every 100 atoms ~ 1 CPU node per k-point
    - Hybrid DFT: every 100 atoms ~ 4 CPU nodes per k-point
- ASPIRE 2A GPU
    - Regular DFT: every 200 atoms ~ 1 A100 card per k-point
    - Hybrid DFT: every 100 atoms ~ 4 A100 cards per k-point

Using GPUs for VASP, the expected benefits include:

- Faster time-to-solution:
    - The performance of 1 A100 card is equivalent to 2 - 2.5 CPU nodes on ASPIRE 2A.
    - Faster time-to-solution translates to cost savings and more energy-efficient computing.
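As a worked example of these rules of thumb: a regular DFT calculation on a roughly 300-atom cell with 4 k-points would suggest about (300/100) x 4 = 12 CPU nodes, or (300/200) x 4 = 6 A100 cards (rounded up to 2 full GPU nodes). Hypothetical resource requests, following the select-line formats shown earlier:

# ~300 atoms, 4 k-points, regular DFT -- CPU estimate: ~12 nodes
#PBS -l select=12:ncpus=128:mpiprocs=128:ompthreads=1:mem=440gb

# Same system on GPUs -- estimate: ~6 A100 cards, rounded up to 2 full GPU nodes
#PBS -l select=2:ngpus=4:mpiprocs=4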
Tips and tricks
Optimising parallelisation parameters
The two main parallelisation parameters in the INCAR file are NCORE and KPAR. Users are encouraged to perform their own testing to find the optimal settings for their calculation. Note that the optimal parameters depend on the amount of HPC resources requested and on the size of the system (number of electrons, cell size). Prior experience may suggest a go-to set of parallelisation parameters, but it is good to check the best settings before running a calculation, especially long geometry relaxations.
A 5% speedup from tuning the parallelisation parameters can translate into large cost savings. Consider a job running on 1,024 CPU cores for roughly 20 hours, with about 100 such instances required throughout the course of a project (roughly 2 million CPU core hours in total). A 5% saving translates to 102,400 CPU core hours!
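As an illustration, below is a minimal, hypothetical sketch of how such a test could be scripted: run a short, fixed number of SCF steps for a few NCORE/KPAR combinations and compare the loop timings reported in OUTCAR. The parameter values, input file list and launch line are assumptions to adapt to your own calculation and resource request.

# Hypothetical NCORE/KPAR sweep -- adjust values, inputs and the mpirun line to your setup
for KPAR in 1 2 4; do
  for NCORE in 8 16 32; do
    dir=test_kpar${KPAR}_ncore${NCORE}
    mkdir -p $dir && cp INCAR POSCAR POTCAR KPOINTS $dir && cd $dir
    echo "KPAR = $KPAR"   >> INCAR
    echo "NCORE = $NCORE" >> INCAR
    echo "NELM = 5"       >> INCAR    # limit to a few SCF steps for timing purposes
    mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_std
    grep "LOOP:" OUTCAR | tail -1     # time per SCF step; compare across runs
    cd ..
  done
done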
Monitoring GPU activity
To monitor whether the GPUs are active during a calculation, you can use the nvidia-smi tool. This utility shows whether the executables are running on the GPUs (under Processes at the bottom of the output) and how efficiently they are using the GPUs (the GPU-Util column; higher is better).
- Take note of the job ID of your batch job.
- Export the job ID as an environment variable:

  export PBS_JOBID=job_id

- Find out the compute node hostname:

  $ qstat -n1
                                                              Req'd  Req'd   Elap
  Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
  -------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
  ***                  ***      g3       ***           ***   1  64  440gb 02:00 R 00:01 hostname/0*64

- ssh into the compute node and run nvidia-smi:

  ssh hostname
  nvidia-smi
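The steps above can also be combined into a short, hypothetical snippet that looks up the first execution host of your job and queries its GPUs; the exec_host parsing may need adjusting to the exact qstat output on the system.

export PBS_JOBID=job_id
# Extract the first hostname from the job's exec_host entry, then run nvidia-smi on it
node=$(qstat -f $PBS_JOBID | grep exec_host | awk '{print $3}' | cut -d/ -f1)
ssh $node nvidia-smi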
Sample nvidia-smi output
$ nvidia-smi
Fri Jun 6 15:47:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:03:00.0 Off | 0 |
| N/A 63C P0 291W / 400W | 8010MiB / 40960MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| N/A 56C P0 285W / 400W | 8082MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 62C P0 306W / 400W | 8082MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
| N/A 58C P0 285W / 400W | 8010MiB / 40960MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3475239 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8002MiB |
| 1 N/A N/A 3475240 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8074MiB |
| 2 N/A N/A 3475241 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8074MiB |
| 3 N/A N/A 3475242 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8002MiB |
+---------------------------------------------------------------------------------------+
Accelerating the wavefunction initialisation process
Users running large systems may have noticed that the job gets stuck for a long time before entering the SCF iteration cycle. This can be attributed to the wavefunction initialisation process, for which the default algorithm is serial. There is an undocumented INCAR tag, RANDOM_GENERATOR, which allows switching to a parallel version of the wavefunction initialisation. To use it, set in your INCAR:

RANDOM_GENERATOR = pcg_32
Wavefunction initialisation generally takes up only a small fraction of the entire calculation, but every second counts and ultimately translates to cost savings, enabling you to do more calculations!
Do check your total energies as a sanity check after convergence to make sure that the results don't change. For those interested in the details, you may check out the following source files:
wave.F -- WFINIT
random.F -- random_reader
This feature is available from VASP 6.4.0 onwards.
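As a quick way to carry out the sanity check mentioned above, compare the final reported total energies of a run with and without the tag; the directory names below are placeholders.

# Compare the last total energy reported with and without RANDOM_GENERATOR = pcg_32
grep TOTEN run_default/OUTCAR | tail -1
grep TOTEN run_pcg_32/OUTCAR  | tail -1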
References to best practices
https://www.nsc.liu.se/support/Events/VASP_workshop_2024/
Building VASP
Users need to upload their own copy of the source code to NSCC’s systems for compilation. Kindly remember that access to the source code is only available to users who are tagged to a valid VASP license!
ASPIRE 2A CPU
Build instructions (VASP 6.4.3)
# setup environment
module swap PrgEnv-cray PrgEnv-intel
module swap craype-x86-rome craype-x86-milan
module load mkl/2024.0
module load cray-hdf5-parallel
module rm cray-libsci # cray-libsci may interfere with math libs
makefile.include:
# Adopt from makefile.include.intel_ompi_mkl_omp
# Replace mpif90 with ftn, icc with cc, icpc with CC
# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf \
-D_OPENMP
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = ftn -qopenmp
FCL = ftn
FREE = -free -names lowercase
FFLAGS = -assume byterecl -w
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = CC
LLIBS = -lstdc++
##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
# added by ftn
#VASP_TARGET_CPU ?= -xHOST
#FFLAGS += $(VASP_TARGET_CPU)
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL += -qmkl
LLIBS += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS = -I$(MKLROOT)/include -I$(MKLROOT)/include/fftw
# HDF5-support (optional but strongly recommended)
CPP_OPTIONS+= -DVASP_HDF5
LLIBS += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS += -I$(HDF5_ROOT)/include
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
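If the build succeeds, the three executables should appear under bin/ in the source tree; a quick check (path assumed to match the VASP_DIR used in the job scripts above):

ls $HOME/software/vasp.6.4.3/bin/
# expected: vasp_gam  vasp_ncl  vasp_std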
ASPIRE 2A GPU
Build instructions (VASP 6.4.3)

For multi-node GPU jobs with VASP, both Open MPI and Cray MPICH produce builds that work on ASPIRE 2A. However, hybrid functionals with KPAR > 1 are broken with Cray MPICH. Moreover, the performance of Open MPI is better than that of Cray MPICH, as can be seen in the benchmark timings below.

These instructions use the Open MPI shipped with HPC-X. To use HPC-X, see the release notes for more details on how to load the environment for HPC-X use.
# setup environment
module swap PrgEnv-cray PrgEnv-nvhpc
module swap craype-x86-rome craype-x86-milan
module load craype-accel-nvidia80
module swap nvhpc nvhpc/22.11
module swap cuda cuda/11.8.0
module rm cray-libsci # cray-libsci may interfere with math libs
module load mkl/2024.0
module rm cray-mpich
source /app/apps/nvhpc/22.11/Linux_x86_64/22.11/comm_libs/hpcx/latest/hpcx-init.sh
hpcx_load
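After hpcx_load, it is worth confirming that the MPI compiler wrappers from HPC-X are the ones being picked up (the ASPIRE 2A+ instructions below do a similar check):

which mpif90
mpif90 --version   # should report the NVIDIA compiler bundled with nvhpc/22.11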
makefile.include:
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
-DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENMP \
-D_OPENACC \
-DUSENCCL -DUSENCCLP2P
CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC = mpicc -acc -gpu=cc80,cuda11.8 -mp
FC = mpif90 -acc -gpu=cc80,cuda11.8 -mp
FCL = mpif90 -acc -gpu=cc80,cuda11.8 -mp -c++libs
FREE = -Mfree
FFLAGS = -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
LLIBS = -cudalib=cublas,cusolver,cufft,nccl -cuda
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o minimax_dependence.o
SOURCE_O2 := pead.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = $(CC)
CFLAGS_LIB = -O -w
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = mpiCC --no_warnings
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp zen3
FFLAGS += $(VASP_TARGET_CPU)
# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
OFLAG_IN = -fast -Mwarperf
SOURCE_IN := nonlr.o
# Software emulation of quadruple precision (mandatory)
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
LLIBS_MKL = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
INCS += -I$(MKLROOT)/include/fftw
LLIBS += $(LLIBS_MKL)
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
ASPIRE 2A+
Build instructions (VASP 6.4.3)
Environment:
module load nvhpc/24.9-nompi
module load cuda/12.6.2
source /app/apps/nvhpc/24.9/Linux_x86_64/24.9/comm_libs/12.6/hpcx/latest/hpcx-init.sh
hpcx_load
echo "Checking mpif90:"
mpif90 --version
module load fftw/3.3.10
# Needed for cuSolverMP, see <https://docs.nvidia.com/hpc-sdk/archive/24.9/hpc-sdk-release-notes/index.html#known-limitations>
export LD_LIBRARY_PATH=$NVHPC_ROOT/comm_libs/12.6/hpcx/latest/ucc/lib:$NVHPC_ROOT/comm_libs/12.6/hpcx/latest/ucx/lib:$LD_LIBRARY_PATH
makefile.include:
# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
-DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENMP \
-D_OPENACC \
-DUSENCCL
CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC = mpicc -acc -gpu=cc90,cuda12.6 -mp
FC = mpif90 -acc -gpu=cc90,cuda12.6 -mp
FCL = mpif90 -acc -gpu=cc90,cuda12.6 -mp -c++libs
FREE = -Mfree
FFLAGS = -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
LLIBS = -cudalib=cublas,cusolver,cufft,nccl -cuda
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o minimax_dependence.o
SOURCE_O2 := pead.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = $(CC)
CFLAGS_LIB = -O -w
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = nvc++ --no_warnings
##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp sapphirerapids
FFLAGS += $(VASP_TARGET_CPU)
# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
OFLAG_IN = -fast -Mwarperf
SOURCE_IN := nonlr.o
# Software emulation of quadruple precision (mandatory)
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd
# BLAS (mandatory)
BLAS = -L/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/lib -lblas
INCS += -I/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/include
# LAPACK (mandatory)
LAPACK = -L/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/lib -llapack
# scaLAPACK (mandatory)
SCALAPACK = -L/home/users/ORG/ORG2/USERID/software/scalapack-2.2.0-gpu/ -lscalapack
LLIBS += $(SCALAPACK) $(LAPACK) $(BLAS)
# FFTW (mandatory)
FFTW_ROOT ?= /home/users/ORG/ORG2/USERID/local/fftw/fftw-3.3.10
LLIBS += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS += -I$(FFTW_ROOT)/include
# Use cusolvermp (optional)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
CPP_OPTIONS+= -DCUSOLVERMP -DCUBLASMP
LLIBS += -cudalib=cusolvermp,cublasmp -lnvhpcwrapcal
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
Performance data
GPU (ASPIRE 2A/2A+)
Benchmark system: HfO2 used by NVIDIA blog post (hybrid functional calculation)
Specs of reference system: 8x A100-SXM4-80GB, 8x NVIDIA ConnectX-6 HDR InfiniBand network interface cards (NICs), 2x AMD EPYC 7742 CPUs
In NVIDIA’s blog post, the timing was projected from one SCF iteration to the full calculation which takes about 40 SCF iterations. The same methodology was used to calculate the estimated walltime needed.
ASPIRE 2A/2A+ timings
| GPU cards | Projected walltime (min) on A100 (ASPIRE 2A) | Projected walltime (min) on H100 (ASPIRE 2A+) |
|---|---|---|
| 1 | 323 | 166 |
| 2 | - | 85 |
| 4 | 85 | 44 |
| 8 | 45 | 24 |
| 16 | 23 | 13 |
| 32 | 13 | 8 |
Roughly 2x speedup for H100 compared with A100.
GPU effective utilisation
Info
GPU effective utilisation is defined as the percentage of time at which GPUs are active.
Based on our benchmarks, a baseline GPU effective utilisation is established for reference to assess if a VASP job is running effectively:
- ASPIRE 2A GPU (A100): 70%
- ASPIRE 2A+ (H100): 50%