Leveraging Geometry in Robot Learning: Stanford Robotics Seminar

The Tension Between Model-Based and Generalist Robotics

Robot learning is currently split between two extremes: hand-coded geometric models and generalist Vision-Language-Action (VLA) models. While traditional model-based planning is highly data-efficient—sometimes requiring only a single demonstration (e.g., the YODO "You Only Demonstrate Once" approach)—it often fails when the model's assumptions do not reflect reality. Conversely, modern VLAs learn directly from data, overcoming the rigidity of hand-coded models, but they require massive amounts of training data to achieve proficiency.

The core thesis of this research is that a middle ground exists: machine learning models that incorporate geometric, mechanical, or physical priors. By structuring models to respect the laws of physics—specifically symmetry and equivariance—it is possible to retain the flexibility of learning from data while achieving the data efficiency of model-based systems.

Embedding Symmetry via Equivariance

To incorporate physical knowledge into neural networks, researchers can build symmetries directly into the architecture. The motivation is Noether's theorem, which ties each continuous symmetry of a physical system to a conservation law (e.g., spatial translation symmetry corresponds to the conservation of momentum).

Equivariant Neural Network Layers

An equivariant function is one where transforming the input (e.g., rotating an image) produces a corresponding, predictable transformation of the output. In robotics, if the transition dynamics and reward of a system are rotation invariant, an optimal policy can be chosen to be rotation equivariant: rotating the scene rotates the optimal action in the same way.
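Stated as a formula, the definition above reads as follows. The symbols are notation introduced here for illustration, not taken from the talk: G is the symmetry group, and \rho_in and \rho_out describe how a group element g acts on inputs and outputs.

```latex
% Equivariance of a function f under a group G
f\big(\rho_{\mathrm{in}}(g)\,x\big) = \rho_{\mathrm{out}}(g)\,f(x)
\qquad \text{for all } g \in G .

% Invariance is the special case in which \rho_out(g) is the identity
f\big(\rho_{\mathrm{in}}(g)\,x\big) = f(x) .

% Applied to a policy \pi acting on state s
\pi(g \cdot s) = g \cdot \pi(s) .
```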

By constraining the weights of convolutional kernels to follow specific patterns, models can be forced to be equivariant. In the example given, a 3x3 convolution with 18 free variables is reduced to five free parameters once it is constrained to the C4 group (rotations in 90-degree increments). For reference, a single-channel 3x3 kernel has nine weights, so the count of 18 presumably covers more than one channel. The constraint ensures that if the input is rotated by a multiple of 90 degrees, the output is automatically rotated as well, so the model does not have to "re-learn" the same task at different orientations.
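The effect of weight tying can be checked on a toy case. The sketch below is not the layer used in the talk: it assumes the simplest setting, a single-channel 3x3 kernel whose weights are tied so that the kernel is unchanged by 90-degree rotations. In that setting the nine weights collapse to three (center, edge, corner), which is a different count from the figures quoted above. All function names are hypothetical.

```python
def rot90(grid):
    """Rotate a square grid (list of rows) by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def c4_kernel(center, edge, corner):
    """Build a 3x3 kernel that is unchanged by 90-degree rotations."""
    return [[corner, edge, corner],
            [edge, center, edge],
            [corner, edge, corner]]


def count_free_parameters(n):
    """Count C4 orbits of an n x n grid: one free weight per orbit."""
    seen, orbits = set(), 0
    for r in range(n):
        for c in range(n):
            if (r, c) in seen:
                continue
            orbits += 1
            p = (r, c)
            for _ in range(4):  # visit the four rotated copies of this cell
                seen.add(p)
                p = (n - 1 - p[1], p[0])
    return orbits


def correlate_valid(image, kernel):
    """Plain 2D cross-correlation with no padding."""
    n, k = len(image), len(kernel)
    m = n - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(m)] for i in range(m)]


image = [[5 * r + c for c in range(5)] for r in range(5)]
tied = c4_kernel(center=2, edge=-1, corner=3)   # 3 free parameters
untied = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]      # arbitrary, not symmetric

# Equivariance check: filtering a rotated image equals rotating the filtered image.
tied_ok = correlate_valid(rot90(image), tied) == rot90(correlate_valid(image, tied))
untied_ok = correlate_valid(rot90(image), untied) == rot90(correlate_valid(image, untied))

print(count_free_parameters(3), tied_ok, untied_ok)  # prints: 3 True False
```

Practical equivariant layers typically use richer constructions with several orientation channels, which is one reason parameter counts differ from this toy case.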

Four Geometric Representation Strategies

Professor Platt presents four distinct methods for leveraging geometry to improve policy learning, primarily benchmarked on the MimicGen dataset.

1. Equivariant Diffusion Policy

This approach encodes the world as a point cloud and utilizes an equivariant point cloud transformer and a U-Net output. It is equivariant with respect to translation and a finite subgroup of SO(2).

  • Key Result: Achieved a 10x improvement in data efficiency. The model trained on 100 demonstrations outperformed a standard diffusion policy trained on 1,000 demonstrations.
  • Strength: Exceptional generalization over pose in high-variation tasks.
  • Weakness: Computationally expensive for large discrete groups and less precise than RGB-based methods due to point cloud sparsity.

2. Image-to-Sphere Embedding

To handle RGB images, this method projects image patches onto a two-sphere, allowing the application of SO(3) rotations.

  • Mechanism: It uses spherical harmonics (a Fourier basis for functions over a sphere) and Wigner D-matrices to perform convolutions in Fourier space, then maps the result back onto a discrete subgroup of SO(3).
  • Key Result: Outperformed baselines by a factor of 2 in data efficiency.
  • Insight: By removing the need for the model to learn generalization over pose, the model can dedicate its capacity to learning the actual task logic (e.g., observing how many beans remain in a scoop).
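The role of the spherical harmonics and Wigner D-matrices in the mechanism above can be written out as follows. This is the standard textbook relationship rather than the specific formulation used in the talk; the band limit L and the coefficient notation are assumptions, and sign and ordering conventions vary between references.

```latex
% A band-limited signal on the sphere, expanded in spherical harmonics
f(x) = \sum_{l=0}^{L} \sum_{m=-l}^{l} c_{l}^{m} \, Y_{l}^{m}(x),
\qquad x \in S^{2} .

% A rotation R \in SO(3) acts on the signal by (R \cdot f)(x) = f(R^{-1} x).
% In Fourier space this is a linear map on each degree-l block of coefficients,
% given by the (2l+1) x (2l+1) Wigner D-matrix:
c_{l} \;\longmapsto\; D^{l}(R) \, c_{l},
\qquad c_{l} = \big(c_{l}^{-l}, \dots, c_{l}^{l}\big)^{\top} .
```

Because a rotation only mixes coefficients within the same degree l, operations that respect this block structure commute with rotations, which is what makes convolution in Fourier space equivariant.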

3. Raven: 3D Ray Representations

Raven represents image patches as 3D rays—vectors pointing from the camera origin to the patch center—each associated with a coordinate frame.

  • Geometric Transform Attention (GTA): Instead of standard attention, GTA transforms queries, keys, and values into a common reference frame before performing the attention operation, then transforms them back.
  • Strength: Offers a conceptually consistent way to combine multiple views and modalities (e.g., pixels, points, and force data).
  • Weakness: Requires precise camera calibration.
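The idea behind Geometric Transform Attention can be illustrated with a toy example. The sketch below is not the Raven implementation: it assumes 2D vectors and planar rotations in place of 3D rays and frames, a single attention head, and no learned projections. Each token carries a frame angle, and its query, key, and value are expressed in that local frame. The function names are hypothetical.

```python
import math


def rotate(theta, v):
    """Rotate a 2D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def frame_aware_attention(frames, queries, keys, values):
    """Attention computed in a shared frame, with outputs returned in local frames."""
    # Step 1: move every token's query, key, and value into the common frame.
    q_w = [rotate(t, q) for t, q in zip(frames, queries)]
    k_w = [rotate(t, k) for t, k in zip(frames, keys)]
    v_w = [rotate(t, v) for t, v in zip(frames, values)]

    outputs = []
    for t_i, q in zip(frames, q_w):
        # Step 2: ordinary softmax attention in the common frame.
        scores = [q[0] * k[0] + q[1] * k[1] for k in k_w]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        o_w = (sum(w * v[0] for w, v in zip(weights, v_w)) / total,
               sum(w * v[1] for w, v in zip(weights, v_w)) / total)
        # Step 3: express the result back in the querying token's own frame.
        outputs.append(rotate(-t_i, o_w))
    return outputs


frames = [0.0, 0.8, -1.3]
queries = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
keys = [(0.2, 0.9), (1.0, -0.3), (-0.4, 0.6)]
values = [(1.0, 2.0), (-1.0, 0.5), (0.3, -0.7)]

out_a = frame_aware_attention(frames, queries, keys, values)
# Re-expressing every frame relative to a different world origin changes nothing.
shifted = [t + 0.7 for t in frames]
out_b = frame_aware_attention(shifted, queries, keys, values)
max_diff = max(abs(a - b) for oa, ob in zip(out_a, out_b) for a, b in zip(oa, ob))
print(max_diff < 1e-9)
```

The check at the end shows the property this construction is meant to provide: the local-frame outputs depend only on the relative poses of the tokens, not on the choice of world frame.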

4. Pix2Act: Planar Trajectories and Triangulation

This ongoing work infers keypoint trajectories directly in the image planes of multiple in-hand cameras and then triangulates them back into 3D space.

  • Data Augmentation: To force the model to ignore global structure and focus on local image features, the researchers apply an augmentation in which each camera is virtually rotated about its own optical axis, independently of the others.
  • Key Result: Outperformed pre-trained LBM models (which use CLIP encoders) despite having no pre-training of its own.
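The triangulation step can be sketched with the midpoint method, which returns the point halfway between the closest points on two viewing rays. The summary does not say which triangulation method the researchers use, so this is a swapped-in textbook technique. It also assumes each 2D keypoint has already been converted into a ray (a camera origin and a direction) using known calibration. The function name is hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Return the 3D point closest to two rays o1 + s*d1 and o2 + t*d2."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    # Ray parameters that minimize the distance between the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * x for o, x in zip(o1, d1)]
    p2 = [o + t * x for o, x in zip(o2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]


# Two cameras 0.5 m apart observing the same keypoint.
target = [0.2, -0.1, 1.5]
cam1, cam2 = [0.0, 0.0, 0.0], [0.5, 0.0, 0.0]
ray1 = [p - o for p, o in zip(target, cam1)]
ray2 = [p - o for p, o in zip(target, cam2)]

estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
error = max(abs(x - y) for x, y in zip(estimate, target))
print(error < 1e-9)
```

Applying this per time step to matched keypoints turns a pair of planar trajectories into a single 3D trajectory. With noisy predictions the rays no longer intersect exactly, and the midpoint serves as a compromise between them.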

Shifting the Scaling Law

Scaling laws in AI typically take the form of a power law: performance improves as a power-law function of dataset size. The goal of incorporating geometric priors is not to replace data, but to "shift the scaling curve to the left."

By biasing the model to fit the physical world (incorporating knowledge of translation and rotation invariance), the model becomes more "intelligent" in its baseline state. This means that for any given amount of data, a geometrically aware model should achieve higher performance than a generalist model. This approach effectively manages the bias-variance tradeoff by using physical constraints as a beneficial bias, reducing the amount of data required to reach a specific success rate.
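A small numerical sketch makes the "shift to the left" concrete. It assumes an idealized power law for task error, with made-up constants a and b, and models the geometric prior as a constant data-efficiency multiplier. None of these values come from the talk, and the factor of 10 simply mirrors the 100-versus-1,000-demonstration comparison reported for the equivariant diffusion policy.

```python
def baseline_error(num_demos, a=1.0, b=0.5):
    """Hypothetical power-law error curve: error = a * N^(-b)."""
    return a * num_demos ** (-b)


def error_with_prior(num_demos, efficiency=10.0, a=1.0, b=0.5):
    """Same curve shifted left: each demo counts as `efficiency` baseline demos."""
    return baseline_error(efficiency * num_demos, a, b)


# With a 10x shift, 100 demonstrations land where the baseline needs 1,000.
matches = abs(error_with_prior(100) - baseline_error(1000)) < 1e-12
# At every dataset size the shifted curve has lower error than the baseline.
always_better = all(error_with_prior(n) < baseline_error(n) for n in (10, 100, 1000))
print(matches, always_better)
```

On a log-log plot the two curves are parallel lines separated horizontally by log10(efficiency), which is the sense in which the prior moves the curve rather than changing its slope.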
