NanoEuler: A GPT-2 Scale LLM Built from Scratch in C and CUDA

NanoEuler: A GPT-2 Scale LLM Built from Scratch in C and CUDA

NanoEuler is a research and educational project that implements a GPT-2-class language model entirely from scratch using C and CUDA. By eschewing high-level ML libraries like PyTorch or autograd, the project provides a transparent, end-to-end pipeline covering everything from a hand-written byte-level BPE tokenizer to supervised fine-tuning (SFT), all validated by rigorous gradient checks.

Core Architecture and Design

NanoEuler implements a decoder-only transformer architecture utilizing several modern LLM optimizations to maximize efficiency on consumer hardware. The model is designed without biases to simplify the architecture and improve training stability.

Technical Specifications

  • RMSNorm: Used as pre-normalization without bias.
  • Rotary Position Embeddings (RoPE): Applied to queries and keys to handle positional information.
  • SwiGLU Feed-Forward: Implements the down(silu(gate(x)) * up(x)) activation pattern.
  • Grouped-Query Attention (GQA): Optimizes memory and compute by sharing a smaller set of key/value heads across multiple query heads.
  • Multi-Token Prediction (MTP): Employs K output heads to predict the next K tokens, which enhances learned representations and supports speculative decoding (though generation primarily uses head 0).
  • Byte-level BPE Tokenizer: A custom implementation using GPT-2-style pretokenization, ensuring spaces are not wasted as standalone tokens.

Model Configurations

Version Dimension Q/KV Heads Layers Context Vocab Parameters
Small (CPU) 128 4 / 2 4 128 512 ~1.05M
GPU Pipeline 768 12 / 4 16 512 4096 ~116M

The CUDA Engine and Performance

To enable training on a single consumer GPU (such as an RTX 4070), NanoEuler includes a full CUDA port that handles forward and backward passes, training, and inference.

Kernel Implementations

The GPU engine utilizes cuBLAS for matrix multiplications via TF32 tensor cores. A critical performance optimization is the hand-written FlashAttention kernel, which uses tiling and online softmax to avoid storing the $T \times T$ matrix in memory. This implementation resulted in a training speed increase of approximately 3$ imes$.

Verification and Stability

Because hand-written backpropagation is prone to subtle errors, NanoEuler employs a strict verification process. Every analytic gradient is compared against a central finite difference in double precision to prevent floating-point cancellation from masking errors. The GPU kernels are further validated against a CPU reference, with gradients matching to within $1e-6$.

Training Pipeline: Pretraining to SFT

NanoEuler demonstrates the complete lifecycle of a language model, moving from raw data to a functional (though small) chat interface.

Data Acquisition

The project avoids Python dependencies for data collection, using shell scripts and the DuckDB CLI to gather:

  • Books: Public-domain classics from Project Gutenberg.
  • Web: A high-quality educational slice of the FineWeb-Edu dataset.
  • Instructions: The Alpaca dataset for supervised fine-tuning.

Two-Stage Training Process

  1. Pretraining: The ~116M parameter base model is trained on the combined books and web corpus. This stage teaches the model grammar, register, and basic language patterns.
  2. Supervised Fine-Tuning (SFT): The pretrained base is fine-tuned using the Alpaca dataset. The loss is masked to response tokens only, meaning the model is trained to predict the answer while ignoring the prompt and padding. This teaches the model the form of an assistant (instruction $\rightarrow$ response) without requiring the massive scale needed for deep world knowledge.

Theoretical Foundation: The "Euler" Connection

The project is named after Leonhard Euler because of the mathematical relationship between residual networks and ordinary differential equations (ODEs).

In a residual block, the operation $x = x + f(x)$ mirrors the forward-Euler method for numerical integration, where $x(t+\Delta t) = x(t) + \Delta t \cdot f(x(t))$. With a step size of $\Delta t = 1$, a deep residual network can be viewed as a discretized ODE where depth represents integration time, and each layer integrates the hidden state forward by one Euler step.

Sources