LLaDA: a large-scale masked diffusion language model that rivals autoregressive LLMs in performance

What it solves

LLaDA addresses the fact that high-performing large language models (LLMs) are almost exclusively autoregressive, generating text by predicting the next token. It tests whether a diffusion-based approach to language modeling can match models like LLaMA3, while potentially mitigating issues such as the "reversal curse" and offering a different theoretical foundation for text generation.

How it works

Unlike standard LLMs that generate text token by token from left to right, LLaDA is a masked diffusion model. It uses a Transformer architecture, but it is trained to recover the original tokens from a partially masked sequence. During training, the masking ratio is sampled at random between 0 and 1, which makes the training objective an upper bound on the model's negative log-likelihood. As a result, LLaDA works as a generative model capable of in-context learning and instruction following.
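
The training step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the `MASK` token, the function names, and the uniform stand-in predictor are all hypothetical, and the 1/t weighting follows the usual masked-diffusion formulation rather than anything stated in this summary.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token, chosen for illustration


def mask_sequence(tokens, t, rng):
    """Replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, log_prob_fn, t):
    """Negative log-likelihood summed over masked positions only,
    weighted by 1/t (assumed weighting, see the note above)."""
    total = 0.0
    for i, (orig, cur) in enumerate(zip(tokens, noisy)):
        if cur == MASK:
            # The predictor sees the whole noisy sequence, not just a prefix.
            total -= log_prob_fn(noisy, i, orig)
    return total / t


def uniform_log_prob(noisy, position, target, vocab_size=4):
    """Stand-in for a Transformer: uniform over a toy vocabulary."""
    return -math.log(vocab_size)


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
t = rng.uniform(1e-3, 1.0)  # masking ratio drawn at random; avoid t == 0
noisy = mask_sequence(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, uniform_log_prob, t)
```

In a real model, `log_prob_fn` would be a bidirectional Transformer that predicts all masked positions at once. The key contrast with next-token training is that the loss is computed at randomly chosen positions using context from both sides.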

Who it’s for

This project is primarily for AI researchers and developers interested in non-autoregressive generation and diffusion models for text, and for those exploring the theoretical upper limits of language modeling architectures.

Highlights

  • Scale: An 8B-parameter model trained from scratch that rivals LLaMA3 8B in performance.
  • Architectural Flexibility: Uses a standard Transformer architecture without needing time $t$ as an input.
  • Variants: Includes an Instruct version for chat, a vision-language version (LLaDA-V), and a Mixture-of-Experts version (LLaDA-MoE) that uses only ~1B active parameters during inference.
  • Efficiency Improvements: The iLLaDA version improves both benchmark performance and generation efficiency.
