Scaling Workloads Across Multiple GPUs with CUDA C++ (NSWAMGC-OD)

Writing CUDA C++ applications that efficiently and correctly utilize all available GPUs on a node drastically improves performance over single-GPU code and makes the most cost-effective use of compute nodes with multiple GPUs. In this workshop you will learn to utilize multiple GPUs on a single node by:

  • Launching kernels on multiple GPUs, each working on a subsection of the required work
  • Using concurrent CUDA Streams to overlap memory copies with computation on multiple GPUs
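Combined, the two techniques above look roughly like the following. This is a hedged sketch, not workshop material: the `scale` kernel, the chunking scheme, and all names are illustrative assumptions. One non-blocking stream per device queues the host-to-device copy, the kernel, and the device-to-host copy, so the GPUs work concurrently on their own subsections.

```cuda
// Sketch: divide an array of N floats across all GPUs on the node,
// with one stream per device so copies and compute on different GPUs overlap.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel: scale each element, using a grid-stride loop
// so any grid size covers this device's chunk.
__global__ void scale(float *data, int n, float factor) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

void scaleAcrossGpus(float *h_data, int N) {
    int numGpus;
    cudaGetDeviceCount(&numGpus);
    int chunk = (N + numGpus - 1) / numGpus;       // ceiling division

    std::vector<float *> d_data(numGpus);
    std::vector<cudaStream_t> streams(numGpus);

    for (int d = 0; d < numGpus; ++d) {
        cudaSetDevice(d);                           // subsequent calls target GPU d
        int offset = d * chunk;
        int n = std::min(chunk, N - offset);        // last chunk may be short
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&d_data[d], n * sizeof(float));
        // Work queued in this device's stream returns immediately on the host,
        // so the loop fans work out to every GPU before any of it finishes.
        cudaMemcpyAsync(d_data[d], h_data + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[d]);
        scale<<<256, 256, 0, streams[d]>>>(d_data[d], n, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data[d], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[d]);
    }
    for (int d = 0; d < numGpus; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);          // wait for this GPU's work
        cudaFree(d_data[d]);
        cudaStreamDestroy(streams[d]);
    }
}
```

For full copy/compute overlap, `h_data` would need to be pinned (allocated with `cudaMallocHost`); with pageable memory the async copies fall back to synchronous behavior.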

Upon completion, you will be able to build robust and efficient CUDA C++ applications that can leverage all available GPUs on a single node.

Prerequisites

  • Professional experience programming CUDA C/C++ applications, including the use of the nvcc compiler, kernel launches, grid-stride loops, host-to-device and device-to-host memory transfers, CUDA Streams, copy/compute overlap, and CUDA error handling.
  • Familiarity with the Linux command line.
  • Experience using Makefiles to compile C/C++ code.
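As a self-check against the prerequisites above, the expected CUDA error-handling pattern can be sketched as follows. The `CUDA_CHECK` macro name is an assumption (any equivalent wrapper works); the two-step check after a launch is the standard idiom, since kernel launches return errors asynchronously.

```cuda
// Sketch of the CUDA error-handling pattern assumed as a prerequisite:
// wrap runtime calls and check kernel launches explicitly.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void emptyKernel() {}

int main() {
    emptyKernel<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
    return 0;
}
```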

Suggested Resources to Satisfy Prerequisites

Tools, Libraries, and Frameworks Used

  • CUDA C++
  • nvcc
  • Nsight Systems