NVIDIA HPC Benchmarks

Benchmarking NVIDIA HPC Workloads on the Kempner AI Cluster

Overview

This benchmarking study aims to evaluate the floating-point compute performance, memory bandwidth, and system scalability of the Kempner AI Cluster using standard NVIDIA HPC Benchmark containers. The benchmarks cover four key workloads: HPL (High-Performance Linpack), HPL-MxP (mixed-precision Linpack), HPCG (High-Performance Conjugate Gradient), and STREAM (memory bandwidth test). Tests are conducted on NVIDIA Hopper GPUs (H100) in configurations ranging from single-GPU up to multi-GPU (1–8 GPUs) and multi-node scenarios. Results will provide a robust baseline for performance tuning, user expectations, and future hardware procurement decisions. Additional tests using Grace Hopper (GH200) nodes may be added to compare next-generation CPU-GPU memory architectures.

Prerequisites

The following prerequisites must be met before running the benchmarks:

  • Access to the Kempner AI Cluster with valid user credentials and appropriate SLURM job submission permissions.
  • At least one allocated NVIDIA Hopper (H100) GPU per test; multi-GPU and multi-node runs require the correct GPU partition and InfiniBand network access.
  • Podman installed locally (or on a build node) to pull official NVIDIA HPC benchmark container images.
  • Singularity (or Apptainer) installed and configured for container execution in the cluster environment.
  • Basic familiarity with SLURM job submission (srun, sbatch), resource allocation flags (--gres, --ntasks), and container runtime options; a minimal example follows this list.
  • Sufficient scratch storage or a working directory to store benchmark input files, result logs, and Singularity image files (.sif).
  • Access to an NVIDIA NGC account for pulling the most up-to-date benchmark containers. (A copy of the container images can also be provided by the cluster administrators.)
  • Recommended: Familiarity with HPL benchmarking concepts, including problem size (N), process grid (P x Q), and performance metrics (GFLOPS, efficiency).
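
As a concrete starting point, a single-GPU interactive allocation might look like the sketch below. The partition name, CPU count, and memory are placeholders; substitute the values appropriate for your Kempner allocation:

    # Request an interactive shell with one H100 GPU via SLURM.
    # "gpu" is a placeholder partition name; adjust resources to your allocation.
    srun --partition=gpu --gres=gpu:1 --ntasks=1 --cpus-per-task=16 \
         --mem=64G --time=01:00:00 --pty bash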

NVIDIA HPC Container

For this benchmarking effort, we use the official NVIDIA HPC Benchmark container, which conveniently packages industry-standard workloads, including HPL, HPL-MxP, HPCG, and STREAM, with tuned libraries and drivers to deliver consistent, reproducible performance. This container ensures compatibility with the latest NVIDIA architectures and provides an easy way to run demanding HPC tests without building complex software stacks from scratch. Below, we sketch how we pull, convert, and deploy this container on the Kempner AI Cluster.
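
The basic workflow looks like the following; the image tag (24.03) is an assumption for illustration, so check the NGC catalog for the current release:

    # Pull the NVIDIA HPC-Benchmarks image from NGC with Podman
    # (the 24.03 tag is an example; check NGC for the current release)
    podman pull nvcr.io/nvidia/hpc-benchmarks:24.03

    # Export the image and convert it to a Singularity/Apptainer .sif file
    podman save nvcr.io/nvidia/hpc-benchmarks:24.03 -o hpc-benchmarks.tar
    singularity build hpc-benchmarks_24.03.sif docker-archive://hpc-benchmarks.tar

The resulting .sif file can then be copied to scratch storage and executed on the compute nodes with singularity run --nv.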

NVIDIA HPL Benchmark

HPL (High-Performance Linpack) is a widely used benchmark for measuring the floating-point compute performance of supercomputers. It solves a dense system of linear equations and is the basis for the TOP500 list of supercomputers. In this section, we will run the HPL benchmark using the NVIDIA HPC container on the Kempner AI Cluster.
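
A single-GPU run can be submitted with a batch script along the following lines. This is a sketch rather than our exact script: the hpl.sh wrapper and its --dat option follow NVIDIA's published examples for this container, but in-container paths can differ between releases, and HPL.dat stands for whatever input file you have staged:

    #!/bin/bash
    #SBATCH --job-name=hpl-1gpu
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1      # one MPI rank per GPU
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=16
    #SBATCH --time=01:00:00

    # Run HPL inside the container; --nv exposes the host GPUs.
    # /workspace/hpl.sh and the HPL.dat path may need adjusting for your image.
    srun singularity run --nv hpc-benchmarks_24.03.sif \
        /workspace/hpl.sh --dat $PWD/HPL.dat

Multi-GPU runs scale --ntasks-per-node and --gres=gpu: together with the P x Q process grid in HPL.dat.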

Here is the summary of the HPL benchmark runs on the Kempner AI Cluster:

# Nodes | GPUs/Node | N      | NB   | P | Q | Time (s) | GFLOPS (per-GPU GFLOPS)
1       | 1         | 92160  | 1024 | 1 | 1 | 12.40    | 4.208e+04 (4.208e+04)
1       | 2         | 136192 | 1024 | 2 | 1 | 20.28    | 8.304e+04 (4.152e+04)
1       | 4         | 190464 | 1024 | 2 | 2 | 27.35    | 1.684e+05 (4.210e+04)
2       | 4         | 264192 | 1024 | 4 | 2 | 37.49    | 3.279e+05 (4.099e+04)
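
For reference, the first row of the table corresponds to an HPL.dat along these lines. The N, NB, P, and Q values match the table; the remaining algorithm parameters are the stock defaults from the reference HPL input file, shown only as a sketch (the sample .dat files shipped inside the container are the better starting point):

    HPLinpack benchmark input file
    Kempner AI Cluster, single H100 GPU
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    92160        Ns
    1            # of NBs
    1024         NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    1            Qs
    16.0         threshold
    1            # of panel fact
    2            PFACTs (0=left, 1=Crout, 2=Right)
    1            # of recursive stopping criterium
    8            NBMINs (>= 1)
    1            # of panels in recursion
    2            NDIVs
    1            # of recursive panel fact.
    2            RFACTs (0=left, 1=Crout, 2=Right)
    1            # of broadcast
    1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
    1            # of lookahead depth
    1            DEPTHs (>=0)
    2            SWAP (0=bin-exch,1=long,2=mix)
    64           swapping threshold
    0            L1 in (0=transposed,1=no-transposed) form
    0            U  in (0=transposed,1=no-transposed) form
    1            Equilibration (0=no,1=yes)
    8            memory alignment in double (> 0)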

NVIDIA HPL-MxP Benchmark

HPL-MxP (High-Performance Linpack - Mixed Precision) is an enhanced version of the traditional HPL benchmark that leverages NVIDIA’s Tensor Core acceleration for mixed-precision matrix operations, delivering significantly higher performance on supported GPUs. HPL-MxP demonstrates how modern hardware can use mixed precision arithmetic combined with iterative refinement to solve dense linear systems faster while maintaining double-precision accuracy. In this section, we run the HPL-MxP benchmark using the NVIDIA HPC container on the Kempner AI Cluster.
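
The container drives this benchmark through an hpl-mxp.sh wrapper that takes the problem geometry on the command line rather than a .dat file. The sketch below mirrors the 1-node, 4-GPU configuration from the table that follows; the wrapper name and its flags are assumptions based on NVIDIA's published examples for this container, so verify them against your image (e.g., with --help):

    #!/bin/bash
    #SBATCH --job-name=hpl-mxp-4gpu
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4      # one MPI rank per GPU
    #SBATCH --gres=gpu:4
    #SBATCH --time=01:00:00

    # N, NB, and the 2 x 2 process grid match the 1-node/4-GPU column below.
    srun singularity run --nv hpc-benchmarks_24.03.sif \
        /workspace/hpl-mxp.sh --n 190464 --nb 1024 \
        --nprow 2 --npcol 2 --nporder row --gpu-affinity 0:1:2:3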

Here is a summary of the HPL-MxP benchmark runs (using FP16 for LU factorization) on the Kempner AI Cluster using NVIDIA Hopper GPUs (H100):

Metric              | 1 Node, 1 GPU | 1 Node, 2 GPUs | 1 Node, 4 GPUs | 2 Nodes, 4 GPUs each
Problem Size (N)    | 92160         | 136192         | 190464         | 264192
Block Size (NB)     | 1024          | 1024           | 1024           | 1024
Grid Size (P x Q)   | 1 x 1         | 2 x 1          | 2 x 2          | 4 x 2
GFLOPS              | 6.1567e+04    | 6.6512e+04     | 1.9768e+05     | 8.1169e+05
GFLOPS (per GPU)    | 61566.51      | 33255.97       | 49420.46       | 101461.07
LU GFLOPS           | 2.3786e+05    | 4.4344e+05     | 8.7595e+05     | 1.9651e+06
LU GFLOPS (per GPU) | 237855.39     | 221721.54      | 218987.25      | 245636.39

NVIDIA HPCG Benchmark

HPCG (High Performance Conjugate Gradients) is designed to complement HPL by providing a more realistic benchmark for modern HPC systems. Unlike HPL, which is compute-bound, HPCG stresses memory system performance, data movement, and interconnect efficiency, resembling the sparse linear algebra workloads found in many real-world scientific applications.
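
The container exposes HPCG through an hpcg.sh wrapper driven by a small input file. A minimal sketch, assuming the documented --dat interface; first, an input file (hpcg.dat), where the grid size and duration are illustrative placeholders rather than the settings behind the results below:

    HPCG benchmark input file
    Kempner AI Cluster test run
    256 256 256
    1800

The first two lines are free-form comments; the third gives the local grid dimensions (nx ny nz) per MPI rank, and the fourth the target run time in seconds. The run itself, on 4 GPUs of one node with one rank per GPU, then looks like:

    # Paths inside the container may differ between releases
    srun --nodes=1 --ntasks-per-node=4 --gres=gpu:4 \
        singularity run --nv hpc-benchmarks_24.03.sif \
        /workspace/hpcg.sh --dat $PWD/hpcg.dat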

Here is a summary of the HPCG benchmark runs on the Kempner AI Cluster using NVIDIA Hopper GPUs (H100):

Compute Config       | HPCG GFLOP/s
1 Node, 1 GPU        | 515.269
1 Node, 2 GPUs       | 991.491
1 Node, 4 GPUs       | 1969.54
2 Nodes, 4 GPUs each | 3762.01

NVIDIA STREAM Benchmark

The STREAM benchmark is a simple yet powerful memory bandwidth test originally designed to measure sustainable memory throughput for CPUs. NVIDIA’s GPU-accelerated version extends this to modern GPUs, measuring the achievable bandwidth of on-device memory (HBM) by performing simple vector operations like COPY, SCALE, ADD, and TRIAD. It does not include PCIe or interconnect transfers; it focuses purely on how fast the GPU can read and write data within its own high-bandwidth memory.
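
Invocation is the simplest of the four workloads, since STREAM needs no input file. The entry-point name varies across container releases, so the script name below is an assumption; list the image's /workspace directory to find the exact one:

    # Inspect the container for the STREAM entry point (names vary by release)
    singularity exec --nv hpc-benchmarks_24.03.sif ls /workspace

    # Run on a single GPU; substitute the script name found in the listing above
    srun --ntasks=1 --gres=gpu:1 singularity run --nv hpc-benchmarks_24.03.sif \
        /workspace/stream-gpu-test.sh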

Here is the summary of the STREAM benchmark runs on the Kempner AI Cluster using NVIDIA Hopper GPUs (H100):

Function | FP32 Bandwidth (MB/s) | % Peak | FP64 Bandwidth (MB/s) | % Peak | ECC
COPY     | 3065453.4495          | 91.44  | 3071120.9703          | 91.61  | Off
SCALE    | 3065829.6062          | 91.45  | 3059058.0968          | 91.25  | Off
ADD      | 3119722.3842          | 93.06  | 3125260.4738          | 93.22  | Off
TRIAD    | 3121150.5004          | 93.10  | 3127058.6779          | 93.28  | Off

Done!