Moebius: 0.2B Lightweight Image Inpainting Framework

Moebius is a lightweight image inpainting framework that reports performance comparable to 10B-parameter models while using only 0.22 billion parameters. By combining a restructured diffusion backbone with an adaptive distillation strategy, Moebius delivers high-fidelity image completion and object removal while running more than 15× faster in total inference time than industrial generalist models.

Extreme Efficiency and Performance

Moebius reduces the computational overhead of high-quality inpainting, making the technology viable for consumer-grade and edge devices. Its primary performance metrics include:

  • Parameter Reduction: Moebius uses 0.22B (226M) parameters, which is less than 2% of the 11.9B parameters of the FLUX.1-Fill-Dev model.
  • Inference Speed: The model achieves an inference latency of 26.01 ms per step on a single GPU, resulting in a total runtime acceleration of more than 15× compared to 10B-level models.
  • Quality Benchmarks: Across six benchmarks covering natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ), Moebius performs on par with or surpasses state-of-the-art generalist models such as FLUX.1-Fill-Dev and SD3.5 Large-Inpainting, particularly in complex textures and facial plausibility.
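The headline figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the list. The step count of 20 is a hypothetical value chosen for illustration, since the text does not state how many denoising steps Moebius runs, and the latency helper covers the denoising loop only.

```python
# Figures quoted in the text above.
MOEBIUS_PARAMS = 226e6        # 0.22B (226M) parameters
FLUX_FILL_PARAMS = 11.9e9     # FLUX.1-Fill-Dev
LATENCY_MS_PER_STEP = 26.01   # Moebius, single GPU


def param_ratio(model_params: float, reference_params: float) -> float:
    """Return the model size as a fraction of the reference model size."""
    return model_params / reference_params


def denoising_latency_ms(per_step_ms: float, num_steps: int) -> float:
    """Time spent in the denoising loop only.

    Ignores VAE encoding/decoding and any other fixed overhead.
    """
    return per_step_ms * num_steps


ratio = param_ratio(MOEBIUS_PARAMS, FLUX_FILL_PARAMS)
print(f"parameter ratio: {ratio:.2%}")  # 1.90%

ASSUMED_STEPS = 20  # hypothetical step count, not taken from the source
loop_ms = denoising_latency_ms(LATENCY_MS_PER_STEP, ASSUMED_STEPS)
print(f"denoising loop at {ASSUMED_STEPS} steps: {loop_ms:.1f} ms")
```

The ratio comes out just under 2%, which matches the claim in the list.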

Core Technical Innovations

Moebius overcomes the representation bottleneck typically caused by extreme structural compression through two synergistic innovations: the LλMI block and adaptive multi-granularity distillation.

Local-λ Mix Interaction (LλMI) Block

To bypass the quadratic computational overhead of standard attention mechanisms, Moebius introduces the LλMI block. This architecture reformulates both self- and cross-attention by condensing spatial contexts and global semantic priors into fixed-size linear matrices. This allows the model to preserve complex latent interactions while drastically reducing the total number of parameters.
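To illustrate what "condensing context into a fixed-size matrix" can look like, here is a minimal sketch of a generic linear-attention (lambda-style) layer in plain Python. It is not the LλMI block itself: the text does not give the exact formulation, so the function names, the softmax normalisation over positions, and the omission of any local or mixing component are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_matrix(keys, values):
    """Condense n (key, value) pairs into one d_k x d_v matrix.

    Each key channel is normalised with a softmax across the n
    positions, so the result has the same shape for any n.
    """
    n, d_k, d_v = len(keys), len(keys[0]), len(values[0])
    weights = [softmax([keys[i][c] for i in range(n)]) for c in range(d_k)]
    return [
        [sum(weights[c][i] * values[i][j] for i in range(n)) for j in range(d_v)]
        for c in range(d_k)
    ]


def apply_context(queries, ctx):
    """Each query reads from the fixed-size matrix.

    Cost is O(n * d_k * d_v), linear in the number of positions n,
    instead of the O(n^2) pairwise scores of standard attention.
    """
    d_k, d_v = len(ctx), len(ctx[0])
    return [
        [sum(q[c] * ctx[c][j] for c in range(d_k)) for j in range(d_v)]
        for q in queries
    ]
```

In this sketch, doubling the number of tokens doubles the work in `context_matrix` and `apply_context` but leaves the shape of the context matrix unchanged. That is the property the paragraph above describes.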

Adaptive Multi-Granularity Distillation

Moebius utilizes a distillation strategy to transfer representational capacity from a larger teacher model, PixelHacker, to the Moebius student model. Key features of this strategy include:

  • Latent Space Operation: Distillation occurs strictly within the latent space, avoiding the high computational cost of pixel-space decoding.
  • Multi-Granularity Supervision: The process aligns supervision ranging from microscopic intermediate features to macroscopic diffusion trajectories.
  • Gradient Norm Adaptive Weighting: A dynamic mechanism balances training losses to ensure the student model absorbs maximum semantic reasoning without reaching representation saturation.
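The text does not spell out how the gradient-norm weighting is computed. One common heuristic for balancing several losses is to weight each loss inversely to the norm of its gradient, so that no single term dominates the update. The sketch below shows that heuristic in plain Python. The function names and the normalisation are assumptions for illustration and should not be read as the Moebius training code.

```python
import math


def grad_norm(grad):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_weights(grads, eps=1e-8):
    """Weight each loss inversely to its gradient norm.

    Weights are rescaled to sum to the number of losses, so the
    overall loss scale stays comparable to an unweighted sum.
    """
    inverse = [1.0 / (grad_norm(g) + eps) for g in grads]
    scale = len(inverse) / sum(inverse)
    return [w * scale for w in inverse]


def combined_loss(losses, weights):
    """Weighted sum of the individual distillation losses."""
    return sum(w * loss for w, loss in zip(weights, losses))


# Hypothetical example: a feature-level loss with large gradients and a
# trajectory-level loss with small gradients.
feature_grad = [3.0, 4.0]      # norm 5.0
trajectory_grad = [0.3, 0.4]   # norm 0.5
weights = adaptive_weights([feature_grad, trajectory_grad])
total = combined_loss([2.0, 0.1], weights)
```

With these inputs the trajectory loss receives the larger weight, and after weighting both terms contribute gradients of roughly equal norm.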

Practical Applications and Community Feedback

Moebius is designed as a task-specific specialist, on the premise that explicitly defined tasks such as inpainting do not require the parameter bloat of generalist foundation models.

Community discussions and early tests have highlighted several practical considerations:

  • Deployment: The model's small size enables browser-based deployment. One developer successfully ported Moebius to ONNX for an interactive web demo with a ~1.3GB download.
  • Limitations: Some users noted that inpainted regions can appear visibly smoother than the surrounding areas and that the model is currently limited to 512×512 output resolution.
  • Visual Artifacts: Critical observers pointed out potential "structural confusion" in specific samples, such as the elongation of objects in natural scene showcases.

"While it is very impressive for 0.2B model it would be very hard to convince me that this matches with 10B models. It did work reasonably well with natural images but inpainted regions were visibly smoother than surroundings, and performed very badly on novel objects."

"I got this working with ONNX... and now I have an interactive demo of the model running entirely in the browser."
