ZeroFS: A Log-Structured Filesystem for S3-Compatible Storage

ZeroFS exposes S3-compatible object storage as a POSIX filesystem or a raw block device. Built on a log-structured engine, it lets users treat cloud buckets as primary storage, bridging the scalability of object storage and the interfaces that traditional filesystems expect.

Architecture and Storage Engine

ZeroFS uses a log-structured approach to manage data in S3-compatible buckets. Instead of rewriting data in place, it writes new data as immutable objects, and a compaction process reclaims the space left behind by deleted data.
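
The idea in the paragraph above can be illustrated with a toy model. This is a minimal sketch of a generic log-structured store, not ZeroFS's actual implementation: the class name LogStore, the segment_capacity parameter, and the in-memory dictionaries standing in for bucket objects are all hypothetical.

```python
class LogStore:
    """Toy log-structured store. Sealed segments are never modified."""

    def __init__(self, segment_capacity=4):
        self.segment_capacity = segment_capacity  # entries per segment (hypothetical)
        self.segments = {}   # segment id -> tuple of (key, value); stands in for bucket objects
        self.index = {}      # key -> (segment id, slot) of the live copy
        self.buffer = []     # writes not yet sealed into a segment
        self.next_id = 0

    def write(self, key, value):
        # An overwrite never touches an old segment; it only appends.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.segment_capacity:
            self.flush()

    def flush(self):
        """Seal buffered writes into a new immutable segment."""
        if not self.buffer:
            return
        seg_id, self.next_id = self.next_id, self.next_id + 1
        self.segments[seg_id] = tuple(self.buffer)
        for slot, (key, _) in enumerate(self.buffer):
            self.index[key] = (seg_id, slot)  # later slots win for repeated keys
        self.buffer = []

    def read(self, key):
        for k, v in reversed(self.buffer):  # newest unsealed write wins
            if k == key:
                return v
        seg_id, slot = self.index[key]
        return self.segments[seg_id][slot][1]

    def delete(self, key):
        """Drop the index entry; the dead data stays until compaction."""
        self.index.pop(key, None)
        self.buffer = [(k, v) for k, v in self.buffer if k != key]

    def compact(self):
        """Repack live entries into fresh segments and drop the old ones."""
        self.flush()
        old_ids = list(self.segments)
        live = [(k, self.segments[s][i][1]) for k, (s, i) in self.index.items()]
        for key, value in live:
            self.write(key, value)
        self.flush()
        for seg_id in old_ids:
            del self.segments[seg_id]
```

In this model an overwrite or delete leaves stale entries behind in sealed segments, and only compact() removes them, which is why storage use shrinks after compaction rather than at delete time.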

Key technical properties of the engine include:

  • Immutable Segments: File data is organized into 32 KiB extents within immutable segment objects. A separate metadata index tracks these extents, ensuring that checkpoints and read replicas maintain a consistent view of the bucket.
  • Encryption and Compression: All data is encrypted using XChaCha20-Poly1305 before upload. The data key is wrapped with a key derived from a password via Argon2id. Before encryption, data is compressed using either zstd or lz4; the codec is detected on read, allowing for changes in compression settings without requiring data migration.
  • Caching: To mitigate the high latency of S3 round-trips (typically 50–300 ms), ZeroFS implements configurable memory and disk caches. Warm reads from these caches can return in microseconds.
  • TRIM Support: Discard commands from a filesystem or ZFS pool free the corresponding extents. Compaction then repacks live data and deletes empty segments from S3 to reduce storage costs.
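
Two of the properties above lend themselves to a short illustration: mapping a byte range onto fixed 32 KiB extents, and detecting the compression codec at read time. The sketch below is illustrative only. The one-byte codec tag is an assumption (the text does not describe the real on-disk format), and zlib and lzma from the Python standard library stand in for zstd and lz4.

```python
import lzma
import zlib

EXTENT_SIZE = 32 * 1024  # 32 KiB, as stated in the text


def extent_range(offset: int, length: int) -> range:
    """Return the extent indices touched by a byte range."""
    if length <= 0:
        return range(0)
    first = offset // EXTENT_SIZE
    last = (offset + length - 1) // EXTENT_SIZE
    return range(first, last + 1)


# Hypothetical one-byte tags; zlib and lzma are stand-ins for zstd and lz4.
CODECS = {
    0x01: (zlib.compress, zlib.decompress),
    0x02: (lzma.compress, lzma.decompress),
}


def encode(payload: bytes, codec_tag: int) -> bytes:
    """Compress with the chosen codec and prefix the tag."""
    compress, _ = CODECS[codec_tag]
    return bytes([codec_tag]) + compress(payload)


def decode(blob: bytes) -> bytes:
    """Pick the decompressor from the stored tag, not from current settings."""
    _, decompress = CODECS[blob[0]]
    return decompress(blob[1:])
```

Because decode() reads the tag from the data itself, blobs written under an old compression setting remain readable after the setting changes, which is the property the text describes as avoiding data migration.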

Access Protocols and Interfaces

ZeroFS provides three primary ways to interact with the storage backend, all running within a single userspace process:

POSIX Filesystems (NFS and 9P)

  • NFS: Enables mounting from major operating systems (macOS, Linux, Windows, BSD) using each system's built-in NFS client, with no additional client-side software.
  • 9P: Offers closer adherence to POSIX semantics than NFS. It includes a bundled FUSE client that allows mounting without root privileges and supports automatic reconnection.

Raw Block Devices (NBD)

  • NBD (Network Block Device): Serves buckets as raw block devices. These devices can host ext4 filesystems, ZFS pools, or VM boot disks. New devices can be added at runtime without restarting the server.

High Availability and Data Integrity

ZeroFS implements several features to ensure durability and consistency:

  • Honest fsync: A successful fsync operation confirms that every acknowledged write is durable in S3. If a failover occurs and unflushed writes are lost, the subsequent fsync returns an error rather than a false success.
  • High Availability (HA): An optional standby instance tracks the leader via the same bucket. The standby holds writes acknowledged by the leader but not yet flushed, allowing it to take over automatically while preserving those writes during a failover.
  • Checkpoints: Named checkpoints allow the filesystem to be captured at a specific point in time and opened as read-only.
  • Read Replicas: Multiple read-only instances can serve the same bucket, automatically picking up changes made by a single writer.
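
The "honest fsync" contract described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not ZeroFS code: the class HonestFsyncFile and the simulate_failover method are hypothetical, and the choice of EIO as the error code and of reporting the loss exactly once are assumptions made for illustration.

```python
import errno


class HonestFsyncFile:
    """Toy model: fsync succeeds only if every acknowledged write is durable."""

    def __init__(self):
        self.durable = []         # writes confirmed in the object store
        self.pending = []         # acknowledged but not yet flushed
        self.lost_writes = False  # set when a failover dropped pending writes

    def write(self, data: bytes) -> None:
        self.pending.append(data)

    def simulate_failover(self, standby_has_pending: bool) -> None:
        # With an HA standby holding the unflushed writes, nothing is lost.
        if self.pending and not standby_has_pending:
            self.pending.clear()
            self.lost_writes = True

    def fsync(self) -> None:
        if self.lost_writes:
            self.lost_writes = False  # assumption: the loss is reported once
            raise OSError(errno.EIO, "acknowledged writes were lost before becoming durable")
        self.durable.extend(self.pending)
        self.pending.clear()
```

The point of the model is the failure branch: an application that checks the result of fsync learns about the lost writes instead of assuming they reached the bucket.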

Verification and Testing

ZeroFS employs an extensive public CI pipeline to validate its stability and performance:

  • POSIX Compliance: The pjdfstest suite is run on every change to verify permissions, ownership, links, and rename behavior.
  • Kernel Validation: The xfstests suite (used to validate ext4 and XFS) is run across NFS, 9P, and FUSE.
  • End-to-End ZFS Testing: CI builds a ZFS pool on ZeroFS block devices, extracts the Linux kernel source tree onto it, and runs a full scrub to confirm that there are no checksum errors.
  • Stress Testing: The system is tested using stress-ng and parallel Linux kernel compilations (make -j$(nproc)).
  • Model-Based Checking: Jepsen's local-fs suite checks the filesystem's behavior against a model using random operation histories, including crash-recovery scenarios.

Community Insights and Counterpoints

While the project demonstrates high technical ambition, community discussion on Hacker News highlights several points of caution:

  • Latency Concerns: Some users argue that sub-millisecond write claims are misleading because they likely measure network or kernel latency rather than the time required for data to be durable in S3.

"The sub-millisecond writes with data in S3 is false and impossible. If you look at the benchmark the fsync is not timed, so this is just the latency of either the network or in kernel file operations..."

  • Abstraction Overhead: Critics suggest that abstracting S3 behind a filesystem is inherently inefficient due to the difference in how object stores and filesystems operate. Some recommend making applications "object store-aware" rather than using a filesystem abstraction.

  • Performance vs. Ceph: Some users claim that in local S3 environments, ZeroFS performance is significantly lower than alternatives like Ceph, particularly for small-IO operations.

  • Metadata Management: Questions were raised regarding how metadata is handled during failover and whether it is now stored entirely within the bucket to simplify high availability.
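
The latency objection in the first point above turns on whether fsync falls inside the timed region. The following sketch shows one way to measure both cases on any mounted path. It is a generic micro-benchmark using only the Python standard library, not the project's own benchmark; the function name time_writes and the payload size and write count are arbitrary choices for illustration.

```python
import os
import statistics
import tempfile
import time


def time_writes(path, payload, count, with_fsync):
    """Return per-write latencies in seconds, optionally including fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    latencies = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            if with_fsync:
                os.fsync(fd)  # durability cost is inside the timed region
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # A temporary directory is used here; point `target` at the mount under test instead.
    with tempfile.TemporaryDirectory() as target:
        path = os.path.join(target, "probe.bin")
        for with_fsync in (False, True):
            samples = time_writes(path, b"x" * 4096, 50, with_fsync)
            label = "write+fsync" if with_fsync else "write only"
            print(f"{label}: median {statistics.median(samples) * 1e6:.1f} us")
```

On a filesystem where fsync must wait for an object-store round-trip, the two medians would be expected to differ by orders of magnitude, which is the distinction the quoted comment draws.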
