DwarFS

The Deduplicating Warp-speed Advanced Read-only File System.

A fast, high-compression, read-only file system for Linux and Windows.

Overview
History
Building and Installing
Usage
Using the Libraries
Windows Support
- Building on Windows
macOS Support
- Building on macOS
Use Cases
- Astrophotography
Dealing with Bit Rot
Extended Attributes
Comparison
Performance Monitoring
Other Obscure Features
Stargazers over Time

Overview

Windows Screen Capture

Linux Screen Capture

DwarFS is a read-only file system with a focus on achieving very high compression ratios, in particular for very redundant data.

This probably doesn't sound very exciting, because if it's redundant, it should compress well. However, I found that other read-only, compressed file systems don't do a very good job of making use of this redundancy. See the Comparison section for a comparison with other compressed file systems.

DwarFS also doesn't compromise on speed, and for my use cases I've found it to be on par with or better than SquashFS. For my primary use case, DwarFS compression is an order of magnitude better than SquashFS compression, building the file system is 6 times faster, accessing files on DwarFS is typically faster, and it uses fewer CPU resources.

To give you an idea of what DwarFS is capable of, here's a quick comparison of DwarFS and SquashFS on a set of video files with a total size of 39 GiB. The twist is that each unique video file has two sibling files with a different set of audio streams (this is an actual use case). So there's redundancy in both the video and audio data, but as the streams are interleaved and identical blocks are typically very far apart, it's challenging to make use of that redundancy for compression. SquashFS essentially fails to compress the source data at all, whereas DwarFS is able to reduce the size by almost a factor of 3, which is close to the theoretical maximum:

$ du -hs dwarfs-video-test
39G     dwarfs-video-test
$ ls -lh dwarfs-video-test.*fs
-rw-r--r-- 1 mhx users 14G Jul  2 13:01 dwarfs-video-test.dwarfs
-rw-r--r-- 1 mhx users 39G Jul 12 09:41 dwarfs-video-test.squashfs
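As a rough sanity check on the "almost a factor of 3" figure, the sizes listed above can be turned into a ratio. The upper bound of 3 used here is an assumption derived from the description (each unique video has two siblings, so in the best case only one of every three files would need to be stored); because the audio streams differ, the real bound is somewhat lower.

```sh
# Sizes in GiB as shown by du/ls above; -h rounds them, so the result is approximate.
source_gib=39
dwarfs_gib=14
# Assumed best case: three sibling files per unique video, stored once.
siblings=3

ratio=$(awk -v s="$source_gib" -v d="$dwarfs_gib" 'BEGIN { printf "%.2f", s / d }')
echo "achieved ratio: $ratio (assumed upper bound: $siblings)"
```

By the same arithmetic, the SquashFS image at 39G corresponds to a ratio of about 1, i.e. no effective compression.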

Furthermore, when mounting the SquashFS image and performing a random-read throughput test using fio-3.34, both squashfuse and squashfuse_ll top out at around 230 MiB/s:

$ fio --readonly --rw=randread --name=randread --bs=64k --direct=1 \
      --opendir=mnt --numjobs=4 --ioengine=libaio --iodepth=32 \
      --group_reporting --runtime=60 --time_based
[...]
   READ: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=13.5GiB (14.5GB), run=60004-60004msec

In comparison, DwarFS manages to sustain random read rates of 20 GiB/s:

  READ: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1212GiB (1301GB), run=60001-60001msec

Distinct features of DwarFS are:

- Clustering of files by similarity using a similarity hash function. This makes it easier to exploit the redundancy across file boundaries.
- Segmentation analysis across file system blocks in order to reduce the size of the uncompressed file system. This saves memory when using the compressed file system and thus potentially allows for higher cache hit rates as more data can be kept in the cache.
- Categorization framework to categorize files or even fragments of files and then process individual categories differently. For example, this allows you to not waste time trying to compress incompressible files or to compress PCM audio data using FLAC compression.
- Highly multi-threaded implementation. Both the file system creation tool and the FUSE driver are able to make good use of the many cores of your system.
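To see why the ordering of similar data matters at all, here is a minimal illustration using gzip rather than DwarFS. It is not how DwarFS works internally; it only shows the general effect that a compressor with a limited match window (32 KiB for gzip's DEFLATE) can exploit duplicate data when the copies are adjacent, but not when they are far apart. The file names and sizes are arbitrary choices for the demo.

```sh
# Illustration only: gzip stands in for any compressor with a limited match window.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two incompressible payloads: A (20 KiB) and B (64 KiB).
head -c 20480 /dev/urandom > "$tmp/a"
head -c 65536 /dev/urandom > "$tmp/b"

# "Clustered" keeps both copies of A adjacent; "scattered" puts B between them,
# which pushes the second copy of A outside the 32 KiB window.
clustered=$(( $(cat "$tmp/a" "$tmp/a" "$tmp/b" | gzip -9 | wc -c) ))
scattered=$(( $(cat "$tmp/a" "$tmp/b" "$tmp/a" | gzip -9 | wc -c) ))

echo "clustered: $clustered bytes"
echo "scattered: $scattered bytes"
```

The clustered stream should come out roughly one copy of A smaller than the scattered one, even though both contain exactly the same data.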

History

I started working on DwarFS in 2013 and my main use case and major motivation was that I had several hundred different versions of Perl that were taking up something around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.

Up until then, I had been using Cromfs for squeezing them into a manageable size. However, I was getting more and more annoyed by the time it took to build the filesystem image and, to make things worse, more often than not it was crashing after about an hour or so.

I had obviously also looked into SquashFS, but never got anywhere close to the compression rates of Cromfs.

This alone wouldn't have been enough to get me into writing DwarFS, but at around the same time, I was pretty obsessed with the recent developments and features of newer C++ standards and really wanted a C++ hobby project to work on. Also, I'd wanted to do something with FUSE for quite some time. Last but not least, I had been thinking about the problem of compressed file systems for a bit and had some ideas that I definitely wanted to try.

The majority of the code was written in 2013, then I did a couple of cleanups, bugfixes and refactors every once in a while, but I never really got it to a state where I would feel happy releasing it. It was too awkward to build with its dependency on Facebook's (quite awesome) folly library and it didn't have any documentation.

Digging out the project again this year, things didn't look as grim as they used to. Folly now builds with CMake and so I just pulled it in as a submodule. Most other dependencies can be satisfied from packages that should be widely available. And I've written some rudimentary docs as well.

Building and Installing

Note to Package Maintainers

DwarFS should usually build fine with minimal changes out of the box. If it doesn't, please file an issue. I've set up CI jobs using Docker images for Ubuntu (22.04 and 24.04), Fedora Rawhide and Arch that can help with determining an up-to-date set of dependencies. Note that building from the release tarball requires fewer dependencies than building from the git repository. Notably, the ronn tool as well as Python and the mistletoe Python module are not required when building from the release tarball.

There are some things to be aware of:

- There's a tendency to try and unbundle the folly and fbthrift libraries that are included as submodules and are built along with DwarFS. While I agree with the sentiment, it's unfortunately a bad idea. Besides the fact that folly does not make any claims about ABI stability (i.e. you can't just dynamically link a binary built against one version of folly against another version), it's not even possible to safely link against a folly library built with different compile options. Even subtle differences, such as the C++ standard version, can cause run-time errors. See this issue for details. Currently, it is not even possible to use external versions of folly/fbthrift, as DwarFS builds minimal subsets of both libraries; these are bundled in the dwarfs_common library and are strictly used internally, i.e. none of the folly or fbthrift headers are required to build against DwarFS' libraries.
- Similar issues can arise when using a system-installed version of GoogleTest. GoogleTest itself recommends that it be downloaded as part of the build. However, you can use the system-installed version by passing -DPREFER_SYSTEM_GTEST=ON to the cmake call. Use at your own risk.
- For other bundled libraries (namely fmt, parallel-hashmap, range-v3), the system-installed version is used as long as it meets the minimum required version. Otherwise, the preferred version is fetched during the build.

Prebuilt Binaries

Each release has pre-built, statically linked binaries for Linux-x86_64, Linux-aarch64 and Windows-AMD64 available for download. These should run without any dependencies and can be useful especially on older distributions where you can't easily build the tools from source.

Universal Binaries

In addition to the binary tarballs, there's a universal binary available for each architecture. These universal binaries contain all tools (mkdwarfs, dwarfsck, dwarfsextract and the dwarfs FUSE driver) in a single executable. These executables are compressed using upx, so they are much smaller than the individual tools combined. However, it also means the binaries need to be decompressed each time they are run, which can have a significant overhead. If that is an issue, you can either stick to the "classic" individual binaries or you can decompress the universal binary, e.g.:

upx -d dwarfs-universal-0.7.0-Linux-aarch64

The universal binaries can be run through symbolic links named after the proper tool, e.g.:

$ ln -s dwarfs-universal-0.7.0-Linux-aarch64 mkdwarfs
$ ./mkdwarfs --help

This also works on Windows if the file system supports symbolic links:

> mklink mkdwarfs.exe dwarfs-universal-0.7.0-Windows-AMD64.exe
> .\mkdwarfs.exe --help

Alternatively, you can select the tool by passing --tool=<name> as the first argument on the command line:

> .\dwarfs-universal-0.7.0-Windows-AMD64.exe --tool=mkdwarfs --help
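The two selection mechanisms above follow the usual "multi-call binary" pattern. The following is only a sketch of that pattern in shell, not the actual DwarFS code: select_tool is a hypothetical helper, and the order of the checks (invocation name first, then --tool=) is an assumption.

```sh
# Sketch of multi-call dispatch; select_tool is hypothetical, not part of DwarFS.
select_tool() {
    invoked_as=$(basename "$1")      # name the binary was started under
    invoked_as=${invoked_as%.exe}    # tolerate a Windows-style suffix
    first_arg=${2-}                  # first command-line argument, if any

    # 1. A symlink named after a tool selects that tool.
    case "$invoked_as" in
        mkdwarfs|dwarfsck|dwarfsextract|dwarfs) echo "$invoked_as"; return 0 ;;
    esac
    # 2. Otherwise, look for --tool=<name> as the first argument.
    case "$first_arg" in
        --tool=*) echo "${first_arg#--tool=}"; return 0 ;;
    esac
    echo "unknown"
    return 1
}

select_tool ./mkdwarfs --help
select_tool ./dwarfs-universal-0.7.0-Linux-aarch64 --tool=dwarfsck
```

The first call resolves the tool from the link name, the second from the --tool= argument.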

Note that just like the dwarfs.exe Windows binary, the universal Windows binary depends on the winfsp-x64.dll from the WinFsp project. However, for the universal binary, the DLL is loaded lazily, so you can still use all other tools without the DLL. See the Windows Support section for more details.

Dependencies

DwarFS uses CMake as a build tool.

It uses both Boost and Folly, though the latter is included as a submodule since very few distributions actually offer packages for it. Folly itself has a number of dependencies, so please check the Folly documentation for an up-to-date list.

It also uses Facebook Thrift, in particular the frozen library, for storing metadata in a highly space-efficient, memory-mappable and well defined format. It's also included as a submodule, and we only build the compiler and a very reduced library that contains just enough for DwarFS to work.

Other than that, DwarFS really only depends on FUSE3 and on a set of compression libraries that Folly already depends on (namely lz4, zstd and liblzma).

The dependency on googletest will be automatically resolved if you build with tests.

A good starting point for apt-based systems is probably:

$ apt install \
    gcc \
    g++ \
    clang \
    git \
    ccache \
    ninja-build \
    cmake \
    make \
    bison \
    flex \
    fuse3 \
    pkg-config \
    binutils-dev \
    libacl1-dev \
    libarchive-dev \
    libbenchmark-dev \
    libboost-chrono-dev \
    libboost-context-dev \
    libboost-filesystem-dev \
    libboost-iostreams-dev \
    libboost-program-options-dev \
    libboost-regex-dev \
    libboost-system-dev \
    libboost-thread-dev \
    libbrotli-dev \
    libevent-dev \
    libhowardhinnant-date-dev \
    libjemalloc-dev \
    libdouble-conversion-dev \
    libiberty-dev \
    liblz4-dev \
    liblzma-dev \
    libzstd-dev \
    libxxhash-dev \
    libmagic-dev \
    libparallel-hashmap-dev \
    librange-v3-dev \
    libssl-dev \
    libunwind-dev \
    libdwarf-dev \
    libelf-dev \
    libfmt-dev \
    libfuse3-dev \
    libgoogle-glog-dev \
    libutfcpp-dev \
    libflac++-dev \
    nlohmann-json3-dev
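After installing the packages, a quick check that the main build tools are actually on your PATH can save a failed configure run. This is just a convenience sketch: missing_tools is a hypothetical helper, and the command names are my assumption of what the packages above provide (for instance, the ninja-build package installs a command called ninja).

```sh
# Print every given command that cannot be found on PATH (hypothetical helper).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed to come from the packages listed above.
missing=$(missing_tools cmake ninja pkg-config bison flex git)
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```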

Note that when building with gcc, the optimization level will be set to -O2 instead of the CMake default of -O3 for release builds. At least with versions up to gcc-10, the -O3 build is up to 70% slower than a build with