VASP
Introduction
VASP is a plane-wave density functional theory code that solves the quantum-mechanical Schrödinger equation from first principles. VASP can exploit graphics processing units (GPUs) to accelerate calculations.
Warning
Licensing: Usage of VASP is subject to licensing restrictions. Users are responsible for ensuring that their Principal Investigator (PI) has a valid VASP license before using VASP on NSCC’s systems.
Support
Log a ticket with help@nscc.sg should you face any issues with using VASP on NSCC’s systems, e.g., job submission script issues, compilation, performance-related issues.
Usage
ASPIRE 2A CPU
To run a CPU batch job on ASPIRE 2A, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A CPU nodes
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ncpus=128:mpiprocs=128:ompthreads=1:mem=440gb
#PBS -l walltime=1:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
# modules used
module swap PrgEnv-cray PrgEnv-intel
module swap craype-x86-rome craype-x86-milan
module load mkl/2024.0
module load cray-hdf5-parallel
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
mpirun -np 256 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
### Notes:
# 1. ncpus = mpiprocs * ompthreads. For instance, if you want to use 2 ompthreads
# running on 128 CPU cores per node, you would use the following line:
#
# #PBS -l select=1:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb
#
# Make sure that OMP_NUM_THREADS is set according to the number of threads
# requested.
#
# 2. Modules used here may differ based on your choice of compiler toolchain
# and libraries (MPI, math libraries, ...).
#
# 3. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb => XX = 128
#
# mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
#
# 4. Reminder for PBS resource request:
# "select": number of "chunks" of resource requested. The following parameters
# refer to the amount of resource that is requested per chunk.
# E.g. select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb
# requests for 2 chunks of 128 CPU cores, 64 MPI procs and 2 ompthreads per chunk.
# In this example, each chunk = 1 CPU node.
###
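Putting notes 1, 3 and 4 together, here is a minimal sketch of a hybrid MPI/OpenMP run on 2 CPU nodes with 64 MPI processes and 2 OpenMP threads each; the module setup is assumed to be the same as in the sample script above, and vasp_std is used only as an example executable.

#PBS -l select=2:ncpus=128:mpiprocs=64:ompthreads=2:mem=440gb

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2   # must match ompthreads in the select line

# XX = select * mpiprocs = 2 * 64 = 128
mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_std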
ASPIRE 2A GPU
To run a GPU batch job on ASPIRE 2A, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A GPU nodes (Cray MPICH)
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ngpus=4:mpiprocs=4
#PBS -l walltime=3:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
# required for multi-node runs (see note 3 below)
unset CUDA_VISIBLE_DEVICES
export MPICH_GPU_SUPPORT_ENABLED=1
# modules used
module swap PrgEnv-cray PrgEnv-nvhpc
module swap craype-x86-rome craype-x86-milan
module load craype-accel-nvidia80
module swap nvhpc nvhpc/23.7
module swap cuda cuda/11.8.0
module load mkl/2024.0
module load hdf5/1.12.1-nvhpc
module rm cray-libsci
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
# Warning: Cray MPICH is known to be problematic for multi-node runs. Crashes have
# been observed for hybrid functional runs!
mpirun -np 8 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
### Notes
# 1. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ngpus=4:mpiprocs=4 => XX = 8
#
# mpirun -np 8 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam
#
# Note that in this example we spawn 1 MPI process per GPU requested.
#
# 2. Number of OpenMP threads. 16 CPU cores are automatically assigned per GPU
# requested. OpenMP threads can be used to run the parts of the calculation that
# still execute on the CPUs efficiently. The optimal number of OpenMP threads
# can be tuned if needed by setting the variable OMP_NUM_THREADS.
#
# 3. unset CUDA_VISIBLE_DEVICES
# export MPICH_GPU_SUPPORT_ENABLED=1
#
# Above two lines are required to enable multi-node GPU runs.
#
# 4. Modify modules used according to your compilation.
#
###
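For comparison, a minimal sketch of a single-node variant of the same job, under the same module and VASP_DIR assumptions as above; per note 3, the multi-node-specific environment settings are not needed here.

#PBS -l select=1:ngpus=4:mpiprocs=4

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16

# No need to unset CUDA_VISIBLE_DEVICES or set MPICH_GPU_SUPPORT_ENABLED
# for a single-node run (see note 3 above).

# XX = select * mpiprocs = 1 * 4 = 4
mpirun -np 4 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_gam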
Sample PBS job script for ASPIRE 2A GPU nodes (Open MPI)
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=2:ngpus=4:mpiprocs=4
#PBS -l walltime=3:00:00
#PBS -j oe
# Set up
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
# modules used
module load openmpi/5.0.5-nv22.11
module load mkl/2024.0
# Change according to the location of your VASP executable.
VASP_DIR=$HOME/software/vasp.6.4.3/bin/
# Run the job
mpirun -np 8 -hostfile $PBS_NODEFILE --map-by ppr:4:node:PE=$OMP_NUM_THREADS \
-x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS $VASP_DIR/vasp_std
### Notes
# 1. number of MPI processes (-np XX):
#
# XX = select * mpiprocs
#
# For example,
#
# #PBS -l select=2:ngpus=4:mpiprocs=4 => XX = 8
#
# mpirun -np 8 ... $VASP_DIR/vasp_std
#
# Note that in this example we spawn 1 MPI process per GPU requested.
#
# 2. Number of OpenMP threads. 16 CPU cores are automatically assigned per GPU
# requested. OpenMP threads can be used to run the parts of the calculation that
# still execute on the CPUs efficiently. The optimal number of OpenMP threads
# can be tuned if needed by setting the variable OMP_NUM_THREADS.
#
# 3. Modify modules used according to your compilation.
#
# 4. For Open MPI, some environment variables need to be exported via
# -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS so that all processes
# can find the executable and its libraries.
###
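As a further illustration of note 1, here is a sketch of how the request and launch line change when scaling the same Open MPI job to 4 GPU nodes; the per-node mapping (ppr:4:node) stays the same and only -np changes. Values are assumptions to adapt to your own job.

#PBS -l select=4:ngpus=4:mpiprocs=4

# XX = select * mpiprocs = 4 * 4 = 16
mpirun -np 16 -hostfile $PBS_NODEFILE --map-by ppr:4:node:PE=$OMP_NUM_THREADS \
       -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS $VASP_DIR/vasp_std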
ASPIRE 2A+
Info
ASPIRE 2A+ is an NVIDIA H100 GPU system and is not intended for CPU workloads.
Info
At present, ASPIRE 2A+ is primarily used for AI workloads. The information provided here is for reference when running VASP on advanced GPU architectures.
To run a batch job on ASPIRE 2A+, prepare a batch job script (see below) and submit it to the scheduler,
$ qsub batch.pbs
Sample PBS job script for ASPIRE 2A+
#!/bin/sh
#PBS -N vasp
#PBS -P <project-id>
#PBS -l select=1:ngpus=8:mpiprocs=8:ompthreads=14
#PBS -l walltime=1:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=14
unset CUDA_VISIBLE_DEVICES
export OMP_PLACES=threads
export OMP_PROC_BIND=close
### Load relevant modules used for VASP.
### ...
###
mpirun -np 8 -hostfile $PBS_NODEFILE --map-by ppr:8:node:PE=$OMP_NUM_THREADS \
-x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -x OMP_PLACES -x OMP_PROC_BIND \
vasp_std
# For multi-node jobs using Open MPI, the relevant environment variables need to be
# passed in using -x; otherwise, other compute nodes will complain, e.g., "executable not found".
Resource sizing recommendations
Info
Based on our benchmarks, we provide some rules of thumb below to estimate the compute resources to request for a given problem size.
- ASPIRE 2A CPU
    - Regular DFT: every 100 atoms ~ 1 CPU node per k-point
    - Hybrid DFT: every 100 atoms ~ 4 CPU nodes per k-point
- ASPIRE 2A GPU
    - Regular DFT: every 200 atoms ~ 1 A100 card per k-point
    - Hybrid DFT: every 100 atoms ~ 4 A100 cards per k-point

Using GPUs for VASP, the expected benefits include:

- Faster time-to-solution:
    - The performance of 1 A100 card is equivalent to 2 - 2.5 CPU nodes on ASPIRE 2A.
    - Faster time-to-solution translates to cost savings and more energy-efficient computing.
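As a worked example of these rules of thumb: a regular DFT calculation on a roughly 300-atom cell with 4 k-points would suggest about (300/100) x 4 = 12 CPU nodes, or (300/200) x 4 = 6 A100 cards (rounded up to 2 full GPU nodes). Hypothetical resource requests, following the select-line formats shown earlier:

# ~300 atoms, 4 k-points, regular DFT -- CPU estimate: ~12 nodes
#PBS -l select=12:ncpus=128:mpiprocs=128:ompthreads=1:mem=440gb

# Same system on GPUs -- estimate: ~6 A100 cards, rounded up to 2 full GPU nodes
#PBS -l select=2:ngpus=4:mpiprocs=4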
Tips and tricks
Optimising parallelisation parameters
The two main parallelisation parameters in the INCAR file are NCORE and KPAR. Users are encouraged to perform their own testing to find the optimal settings for their calculation. Note that the optimal parameters depend on the amount of HPC resources requested and on the size of the system (number of electrons, cell size). Prior experience may suggest a go-to set of parallelisation parameters, but it is good to check the best settings before running a calculation, especially long geometry relaxations.
A 5% speedup from tuning the parallelisation parameters can translate into large cost savings. Consider a job running on 1,024 CPU cores for roughly 20 hours, with about 100 such instances required throughout the course of a project (roughly 2 million CPU core hours in total). A 5% saving translates to 102,400 CPU core hours!
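As an illustration, below is a minimal, hypothetical sketch of how such a test could be scripted: run a short, fixed number of SCF steps for a few NCORE/KPAR combinations and compare the loop timings reported in OUTCAR. The parameter values, input file list and launch line are assumptions to adapt to your own calculation and resource request.

# Hypothetical NCORE/KPAR sweep -- adjust values, inputs and the mpirun line to your setup
for KPAR in 1 2 4; do
  for NCORE in 8 16 32; do
    dir=test_kpar${KPAR}_ncore${NCORE}
    mkdir -p $dir && cp INCAR POSCAR POTCAR KPOINTS $dir && cd $dir
    echo "KPAR = $KPAR"   >> INCAR
    echo "NCORE = $NCORE" >> INCAR
    echo "NELM = 5"       >> INCAR    # limit to a few SCF steps for timing purposes
    mpirun -np 128 --cpu-bind depth -d $OMP_NUM_THREADS $VASP_DIR/vasp_std
    grep "LOOP:" OUTCAR | tail -1     # time per SCF step; compare across runs
    cd ..
  done
done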
Monitoring GPU activity
To monitor whether the GPUs are active during a calculation, you can use the nvidia-smi tool. This utility shows whether the executables are running on the GPUs (under Processes at the bottom of the output) and how efficiently they are using the GPUs (the GPU-Util column; higher is better).
- Take note of the job ID of your batch job.
- Export the job ID as an environment variable:

  export PBS_JOBID=job_id

- Find out the compute node hostname:

  $ qstat -n1
                                                              Req'd  Req'd   Elap
  Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
  -------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
  ***                  ***      g3       ***           ***   1  64  440gb 02:00 R 00:01 hostname/0*64

- ssh into the compute node and run nvidia-smi:

  ssh hostname
  nvidia-smi
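The steps above can also be combined into a short, hypothetical snippet that looks up the first execution host of your job and queries its GPUs; the exec_host parsing may need adjusting to the exact qstat output on the system.

export PBS_JOBID=job_id
# Extract the first hostname from the job's exec_host entry, then run nvidia-smi on it
node=$(qstat -f $PBS_JOBID | grep exec_host | awk '{print $3}' | cut -d/ -f1)
ssh $node nvidia-smi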
Sample nvidia-smi output
$ nvidia-smi
Fri Jun 6 15:47:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:03:00.0 Off | 0 |
| N/A 63C P0 291W / 400W | 8010MiB / 40960MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| N/A 56C P0 285W / 400W | 8082MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 62C P0 306W / 400W | 8082MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
| N/A 58C P0 285W / 400W | 8010MiB / 40960MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3475239 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8002MiB |
| 1 N/A N/A 3475240 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8074MiB |
| 2 N/A N/A 3475241 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8074MiB |
| 3 N/A N/A 3475242 C ...c0-8de7-bc7e8d4ef790/files/vasp_std 8002MiB |
+---------------------------------------------------------------------------------------+
Accelerating the wavefunction initialisation process
Users running large systems may have noticed that the job gets stuck for a long time before entering the SCF iteration cycle. This can be attributed to the wavefunction initialisation process, for which the default algorithm is serial. There is an undocumented INCAR tag, RANDOM_GENERATOR, which allows switching to a parallel version of the wavefunction initialisation. To use it, set in your INCAR:

RANDOM_GENERATOR = pcg_32
Wavefunction initialisation generally takes up only a small fraction of the entire calculation, but every second counts and ultimately translates to cost savings, enabling you to do more calculations!
Do check your total energies as a sanity check after convergence to make sure that the results don't change. For those interested in the details, you may check out the following source files:
wave.F -- WFINIT
random.F -- random_reader
This feature is available from VASP 6.4.0 onwards.
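As a quick way to carry out the sanity check mentioned above, compare the final reported total energies of a run with and without the tag; the directory names below are placeholders.

# Compare the last total energy reported with and without RANDOM_GENERATOR = pcg_32
grep TOTEN run_default/OUTCAR | tail -1
grep TOTEN run_pcg_32/OUTCAR  | tail -1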
References to best practices
https://www.nsc.liu.se/support/Events/VASP_workshop_2024/
Building VASP
Users need to upload their own copy of the source code to NSCC’s systems for compilation. Kindly remember that access to the source code is only available to users who are tagged to a valid VASP license!
ASPIRE 2A CPU
Build instructions (VASP 6.4.3)
# setup environment
module swap PrgEnv-cray PrgEnv-intel
module swap craype-x86-rome craype-x86-milan
module load mkl/2024.0
module load cray-hdf5-parallel
module rm cray-libsci # cray-libsci may interfere with math libs
makefile.include:
# Adopt from makefile.include.intel_ompi_mkl_omp
# Replace mpif90 with ftn, icc with cc, icpc with CC
# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf \
-D_OPENMP
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = ftn -qopenmp
FCL = ftn
FREE = -free -names lowercase
FFLAGS = -assume byterecl -w
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = CC
LLIBS = -lstdc++
##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
# added by ftn
#VASP_TARGET_CPU ?= -xHOST
#FFLAGS += $(VASP_TARGET_CPU)
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL += -qmkl
LLIBS += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS = -I$(MKLROOT)/include -I$(MKLROOT)/include/fftw
# HDF5-support (optional but strongly recommended)
CPP_OPTIONS+= -DVASP_HDF5
LLIBS += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS += -I$(HDF5_ROOT)/include
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
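If the build succeeds, the three executables should appear under bin/ in the source tree; a quick check (path assumed to match the VASP_DIR used in the job scripts above):

ls $HOME/software/vasp.6.4.3/bin/
# expected: vasp_gam  vasp_ncl  vasp_std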
ASPIRE 2A GPU
Build instructions (VASP 6.4.3)

For multi-node GPU jobs with VASP, both Open MPI and Cray MPICH produce builds that work on ASPIRE 2A. However, hybrid functionals with KPAR > 1 are broken with Cray MPICH. Moreover, the performance of Open MPI is better than that of Cray MPICH, as can be seen in the benchmark timings below.

These instructions use the Open MPI shipped with HPC-X. To use HPC-X, see the release notes for more details on how to load the environment for HPC-X use.
# setup environment
module swap PrgEnv-cray PrgEnv-nvhpc
module swap craype-x86-rome craype-x86-milan
module load craype-accel-nvidia80
module swap nvhpc nvhpc/22.11
module swap cuda cuda/11.8.0
module rm cray-libsci # cray-libsci may interfere with math libs
module load mkl/2024.0
module rm cray-mpich
source /app/apps/nvhpc/22.11/Linux_x86_64/22.11/comm_libs/hpcx/latest/hpcx-init.sh
hpcx_load
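After hpcx_load, it is worth confirming that the MPI compiler wrappers from HPC-X are the ones being picked up (the ASPIRE 2A+ instructions below do a similar check):

which mpif90
mpif90 --version   # should report the NVIDIA compiler bundled with nvhpc/22.11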
makefile.include:
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
-DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENMP \
-D_OPENACC \
-DUSENCCL -DUSENCCLP2P
CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC = mpicc -acc -gpu=cc80,cuda11.8 -mp
FC = mpif90 -acc -gpu=cc80,cuda11.8 -mp
FCL = mpif90 -acc -gpu=cc80,cuda11.8 -mp -c++libs
FREE = -Mfree
FFLAGS = -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
LLIBS = -cudalib=cublas,cusolver,cufft,nccl -cuda
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o minimax_dependence.o
SOURCE_O2 := pead.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = $(CC)
CFLAGS_LIB = -O -w
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = mpiCC --no_warnings
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp zen3
FFLAGS += $(VASP_TARGET_CPU)
# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
OFLAG_IN = -fast -Mwarperf
SOURCE_IN := nonlr.o
# Software emulation of quadruple precision (mandatory)
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
LLIBS_MKL = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
INCS += -I$(MKLROOT)/include/fftw
LLIBS += $(LLIBS_MKL)
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
ASPIRE 2A+
Build instructions (VASP 6.4.3)
Environment:
module load nvhpc/24.9-nompi
module load cuda/12.6.2
source /app/apps/nvhpc/24.9/Linux_x86_64/24.9/comm_libs/12.6/hpcx/latest/hpcx-init.sh
hpcx_load
echo "Checking mpif90:"
mpif90 --version
module load fftw/3.3.10
# Needed for cuSolverMP, see <https://docs.nvidia.com/hpc-sdk/archive/24.9/hpc-sdk-release-notes/index.html#known-limitations>
export LD_LIBRARY_PATH=$NVHPC_ROOT/comm_libs/12.6/hpcx/latest/ucc/lib:$NVHPC_ROOT/comm_libs/12.6/hpcx/latest/ucx/lib:$LD_LIBRARY_PATH
makefile.include:
# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
-DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENMP \
-D_OPENACC \
-DUSENCCL
CPP = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC = mpicc -acc -gpu=cc90,cuda12.6 -mp
FC = mpif90 -acc -gpu=cc90,cuda12.6 -mp
FCL = mpif90 -acc -gpu=cc90,cuda12.6 -mp -c++libs
FREE = -Mfree
FFLAGS = -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
LLIBS = -cudalib=cublas,cusolver,cufft,nccl -cuda
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o minimax_dependence.o
SOURCE_O2 := pead.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = $(CC)
CFLAGS_LIB = -O -w
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = nvc++ --no_warnings
##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp sapphirerapids
FFLAGS += $(VASP_TARGET_CPU)
# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')
## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
OFLAG_IN = -fast -Mwarperf
SOURCE_IN := nonlr.o
# Software emulation of quadruple precision (mandatory)
QD ?= $(NVROOT)/compilers/extras/qd
LLIBS += -L$(QD)/lib -lqdmod -lqd
INCS += -I$(QD)/include/qd
# BLAS (mandatory)
BLAS = -L/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/lib -lblas
INCS += -I/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/include
# LAPACK (mandatory)
LAPACK = -L/app/apps/nvhpc/24.9/Linux_x86_64/24.9/compilers/lib -llapack
# scaLAPACK (mandatory)
SCALAPACK = -L/home/users/ORG/ORG2/USERID/software/scalapack-2.2.0-gpu/ -lscalapack
LLIBS += $(SCALAPACK) $(LAPACK) $(BLAS)
# FFTW (mandatory)
FFTW_ROOT ?= /home/users/ORG/ORG2/USERID/local/fftw/fftw-3.3.10
LLIBS += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS += -I$(FFTW_ROOT)/include
# Use cusolvermp (optional)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
CPP_OPTIONS+= -DCUSOLVERMP -DCUBLASMP
LLIBS += -cudalib=cusolvermp,cublasmp -lnvhpcwrapcal
make:
make DEPS=1 -j8 std
make DEPS=1 -j8 gam
make DEPS=1 -j8 ncl
Performance data
GPU (ASPIRE 2A/2A+)
Benchmark system: HfO2 used by NVIDIA blog post (hybrid functional calculation)
Specs of reference system: 8x A100-SXM4-80GB, 8x NVIDIA ConnectX-6 HDR InfiniBand network interface cards (NICs), 2x AMD EPYC 7742 CPUs
In NVIDIA’s blog post, the timing was projected from one SCF iteration to the full calculation which takes about 40 SCF iterations. The same methodology was used to calculate the estimated walltime needed.
ASPIRE 2A/2A+ timings
| GPU cards | Projected walltime (min) on A100 (ASPIRE 2A) | Projected walltime (min) on H100 (ASPIRE 2A+) |
|---|---|---|
| 1 | 323 | 166 |
| 2 | - | 85 |
| 4 | 85 | 44 |
| 8 | 45 | 24 |
| 16 | 23 | 13 |
| 32 | 13 | 8 |
Roughly 2x speedup for H100 compared with A100.
GPU effective utilisation
Info
GPU effective utilisation is defined as the percentage of time at which GPUs are active.
Based on our benchmarks, a baseline GPU effective utilisation is established for reference to assess if a VASP job is running effectively:
- ASPIRE 2A GPU (A100): 70%
- ASPIRE 2A+ (H100): 50%