What Happens When You Run a CUDA Kernel: From Source to SASS

What Happens When You Run a CUDA Kernel: From Source to SASS

Launching a CUDA kernel involves a complex orchestration between the host CPU, the NVIDIA user-mode driver, the kernel-mode driver, and the GPU hardware. The process transforms high-level C++ code into a stream of hardware-specific commands and SASS instructions, utilizing a specialized memory-mapped doorbell system to trigger execution across thousands of parallel threads.

The Compilation Pipeline: PTX and SASS

Executing a CUDA program requires multiple compilation stages to bridge the gap between device-agnostic code and hardware-specific machine instructions.

nvcc acts as a driver program that coordinates several compilers. The host code is sent to a standard host compiler, while the device code follows a specific path: cicc (an LLVM-based compiler) generates PTX (Parallel Thread Execution), which is then converted by ptxas into SASS (Streaming Assembler).

PTX vs. SASS

  • PTX is a virtual Instruction Set Architecture (ISA). It is device-agnostic and uses an infinite number of typed registers. It serves as a forward-compatibility fallback; if a binary is run on a GPU architecture not covered by the pre-compiled SASS, the driver can JIT-compile the PTX into fresh SASS at load time.
  • SASS is the actual machine code that runs on the GPU. It maps virtual registers to a finite set of physical registers and fuses complex PTX operations into single hardware instructions (e.g., IMAD.WIDE).

The final binary is a "fatbin," which bundles both the SASS (in an ELF container) and the compressed PTX. This fatbin is embedded within the host executable's .nv_fatbin section.

Triggering the GPU from the Host

Because the GPU resides across the PCIe bus, the CPU cannot simply "call" a GPU function. Instead, it must communicate via a driver-managed work queue.

The Host Launch Stub

When nvcc compiles a kernel launch (e.g., vadd<<<4096, 256>>>), it replaces the expression with a generated host launch stub. This stub packs kernel arguments into a buffer in host memory at specific byte offsets. It then calls __cudaLaunch, which uses the host-side function pointer as a lookup key to find the corresponding device-side symbol in the fatbin.

The Driver Bridge

The CUDA runtime interacts with the closed-source user-mode driver (libcuda.so), which in turn communicates with the kernel-mode driver (nvidia.ko) via ioctl calls on device files like /dev/nvidiactl.

Since CUDA 12.2, module loading is lazy by default. The driver defers uploading the SASS cubin to the GPU's memory until the first time that specific kernel is launched.

The Hardware Launch Mechanism

GPU execution is driven by a stream of commands read from host memory, coordinated through a "channel" consisting of three primary structures:

  1. Pushbuffer: A region of host memory where the driver writes "methods" (register addresses and values) that define GPU actions.
  2. GPFIFO: A ring buffer of pointers that tells the GPU which spans of the pushbuffer to read.
  3. USERD: A small structure in device memory containing cursors (GP_GET and GP_PUT) to track work consumption.

The Doorbell

Modern GPUs (Turing and later) do not snoop the GP_PUT cursor. Instead, the driver rings a doorbell—a memory-mapped register—by writing a work-submit token. This signals the GPU's host engine to fetch the updated GP_PUT and pull methods from the pushbuffer via DMA.

Queue Meta Data (QMD)

One of the most critical methods is the streaming of the Queue Meta Data (QMD). The QMD is the launch descriptor that tells the GPU:

  • The grid and block dimensions (e.g., 4096 blocks of 256 threads).
  • The required registers per thread and shared memory.
  • The memory address of the SASS code.
  • The address of the constant bank holding the kernel arguments.

Execution on the Streaming Multiprocessor (SM)

Once the QMD is handed to the compute work distributor (GigaThread Engine), the GPU maps the linear SASS instructions to the available Streaming Multiprocessors (SMs).

Resource Constraints and Occupancy

On an RTX 4090 (AD102), an SM is limited by thread capacity and register file size. For a kernel using 16 registers per thread and blocks of 256 threads:

  • Register Capacity: $65,536 / (256 \times 16) = 16$ blocks.
  • Thread Capacity: $1,536 / 256 = 6$ blocks.

Thread capacity is the tighter bottleneck, meaning each SM can hold at most 6 resident blocks (48 warps).

Warp Eligibility and Latency Hiding

Unlike CPUs, GPUs do not use complex out-of-order execution logic. Instead, they hide latency by switching between many resident warps. A warp is eligible to run based on control-code payloads packed into the SASS instructions by ptxas:

  • Static Stall Count: For fixed-latency operations, the compiler encodes exactly how many cycles the warp must park.
  • Yield Hint: A bit suggesting the scheduler prioritize other warps.
  • Dependency-Barrier Indices: For variable-latency operations (like global memory loads LDG), the hardware uses six physical scoreboard barriers. A warp is ineligible until the specific barrier it is waiting on is cleared.

Memory Hierarchy and Data Transfer

When a warp issues a load (LDG.E), the SM's load/store unit performs request coalescing. If 32 threads in a warp access consecutive 4-byte floats, the hardware merges these into four 32-byte sector requests to minimize bus traffic.

Data flows through the L1 Data Cache $\rightarrow$ L2 Cache $\rightarrow$ Memory Controllers $\rightarrow$ GDDR6X VRAM. In kernels with low arithmetic intensity (like vector addition), the performance is typically bound by the DRAM bus bandwidth rather than compute power.

Returning Results to the CPU

When the final block retires, the GPU posts a completion semaphore (defined in the QMD). The GPU's copy engine then performs a DMA transfer of the results from the L2 cache (or VRAM) back to host memory. Once the copy finishes, it posts its own semaphore, allowing the host's cudaMemcpy call to return and the CPU to resume execution (e.g., calling printf).

Sources