
高压缩比且快速读取的只读文件系统
DwarFS是一款专注于实现高压缩比的只读文件系统,尤其适合处理冗余数据。该系统在保持高速读取的同时,提供了优于SquashFS等压缩文件系统的压缩效果。DwarFS的特色功能包括文件相似度聚类、跨块分段分析和文件分类框架,可充分利用多核系统资源。支持Linux和Windows平台,适用于需要高压缩率和快速访问的应用场景。
The Deduplicating Warp-speed Advanced Read-only File System.
A fast high compression read-only file system for Linux and Windows.


DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data.
This probably doesn't sound very exciting, because if it's redundant, it should compress well. However, I found that other read-only, compressed file systems don't do a very good job at making use of this redundancy. See here for a comparison with other compressed file systems.
DwarFS also doesn't compromise on speed and for my use cases I've found it to be on par with or perform better than SquashFS. For my primary use case, DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources.
To give you an idea of what DwarFS is capable of, here's a quick comparison of DwarFS and SquashFS on a set of video files with a total size of 39 GiB. The twist is that each unique video file has two sibling files with a different set of audio streams (this is an actual use case). So there's redundancy in both the video and audio data, but as the streams are interleaved and identical blocks are typically very far apart, it's challenging to make use of that redundancy for compression. SquashFS essentially fails to compress the source data at all, whereas DwarFS is able to reduce the size by almost a factor of 3, which is close to the theoretical maximum:
$ du -hs dwarfs-video-test
39G dwarfs-video-test
$ ls -lh dwarfs-video-test.*fs
-rw-r--r-- 1 mhx users 14G Jul 2 13:01 dwarfs-video-test.dwarfs
-rw-r--r-- 1 mhx users 39G Jul 12 09:41 dwarfs-video-test.squashfs
Furthermore, when mounting the SquashFS image and performing a random-read
throughput test using fio-3.34, both
squashfuse and squashfuse_ll top out at around 230 MiB/s:
$ fio --readonly --rw=randread --name=randread --bs=64k --direct=1 \
--opendir=mnt --numjobs=4 --ioengine=libaio --iodepth=32 \
--group_reporting --runtime=60 --time_based
[...]
READ: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=13.5GiB (14.5GB), run=60004-60004msec
In comparison, DwarFS manages to sustain random read rates of 20 GiB/s:
READ: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1212GiB (1301GB), run=60001-60001msec
Distinct features of DwarFS are:
Clustering of files by similarity using a similarity hash function. This makes it easier to exploit the redundancy across file boundaries.
Segmentation analysis across file system blocks in order to reduce the size of the uncompressed file system. This saves memory when using the compressed file system and thus potentially allows for higher cache hit rates as more data can be kept in the cache.
Categorization framework to categorize files or even fragments of files and then process individual categories differently. For example, this allows you to not waste time trying to compress incompressible files or to compress PCM audio data using FLAC compression.
Highly multi-threaded implementation. Both the file system creation tool as well as the FUSE driver are able to make good use of the many cores of your system.
I started working on DwarFS in 2013 and my main use case and major motivation was that I had several hundred different versions of Perl that were taking up something around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.
Up until then, I had been using Cromfs for squeezing them into a manageable size. However, I was getting more and more annoyed by the time it took to build the filesystem image and, to make things worse, more often than not it was crashing after about an hour or so.
I had obviously also looked into SquashFS, but never got anywhere close to the compression rates of Cromfs.
This alone wouldn't have been enough to get me into writing DwarFS, but at around the same time, I was pretty obsessed with the recent developments and features of newer C++ standards and really wanted a C++ hobby project to work on. Also, I've wanted to do something with FUSE for quite some time. Last but not least, I had been thinking about the problem of compressed file systems for a bit and had some ideas that I definitely wanted to try.
The majority of the code was written in 2013, then I did a couple of cleanups, bugfixes and refactors every once in a while, but I never really got it to a state where I would feel happy releasing it. It was too awkward to build with its dependency on Facebook's (quite awesome) folly library and it didn't have any documentation.
Digging out the project again this year, things didn't look as grim as they used to. Folly now builds with CMake and so I just pulled it in as a submodule. Most other dependencies can be satisfied from packages that should be widely available. And I've written some rudimentary docs as well.
DwarFS should usually build fine with minimal changes out of the box.
If it doesn't, please file a issue. I've set up
CI jobs
using Docker images for Ubuntu (22.04
and 24.04),
Fedora Rawhide
and Arch
that can help with determining an up-to-date set of dependencies.
Note that building from the release tarball requires less dependencies
than building from the git repository, notably the ronn tool as well
as Python and the mistletoe Python module are not required when
building from the release tarball.
There are some things to be aware of:
There's a tendency to try and unbundle the folly
and fbthrift libraries that
are included as submodules and are built along with DwarFS.
While I agree with the sentiment, it's unfortunately a bad idea.
Besides the fact that folly does not make any claims about ABI
stability (i.e. you can't just dynamically link a binary built
against one version of folly against another version), it's not
even possible to safely link against a folly library built with
different compile options. Even subtle differences, such as the
C++ standard version, can cause run-time errors.
See this issue
for details. Currently, it is not even possible to use external
versions of folly/fbthrift as DwarFS is building minimal subsets of
both libraries; these are bundled in the dwarfs_common library
and they are strictly used internally, i.e. none of the folly or
fbthrift headers are required to build against DwarFS' libraries.
Similar issues can arise when using a system-installed version
of GoogleTest. GoogleTest itself recommends that it is being
downloaded as part of the build. However, you can use the system
installed version by passing -DPREFER_SYSTEM_GTEST=ON to the
cmake call. Use at your own risk.
For other bundled libraries (namely fmt, parallel-hashmap,
range-v3), the system installed version is used as long as it
meets the minimum required version. Otherwise, the preferred
version is fetched during the build.
Each release has pre-built,
statically linked binaries for Linux-x86_64, Linux-aarch64 and
Windows-AMD64 available for download. These should run without
any dependencies and can be useful especially on older distributions
where you can't easily build the tools from source.
In addition to the binary tarballs, there's a universal binary
available for each architecture. These universal binaries contain
all tools (mkdwarfs, dwarfsck, dwarfsextract and the dwarfs
FUSE driver) in a single executable. These executables are compressed
using upx, so they are much smaller than
the individual tools combined. However, it also means the binaries need
to be decompressed each time they are run, which can have a signficant
overhead. If that is an issue, you can either stick to the "classic"
individual binaries or you can decompress the universal binary, e.g.:
upx -d dwarfs-universal-0.7.0-Linux-aarch64
The universal binaries can be run through symbolic links named after the proper tool. e.g.:
$ ln -s dwarfs-universal-0.7.0-Linux-aarch64 mkdwarfs
$ ./mkdwarfs --help
This also works on Windows if the file system supports symbolic links:
> mklink mkdwarfs.exe dwarfs-universal-0.7.0-Windows-AMD64.exe
> .\mkdwarfs.exe --help
Alternatively, you can select the tool by passing --tool=<name> as
the first argument on the command line:
> .\dwarfs-universal-0.7.0-Windows-AMD64.exe --tool=mkdwarfs --help
Note that just like the dwarfs.exe Windows binary, the universal
Windows binary depends on the winfsp-x64.dll from the
WinFsp project. However, for the
universal binary, the DLL is loaded lazily, so you can still use all
other tools without the DLL.
See the Windows Support section for more details.
DwarFS uses CMake as a build tool.
It uses both Boost and Folly, though the latter is included as a submodule since very few distributions actually offer packages for it. Folly itself has a number of dependencies, so please check here for an up-to-date list.
It also uses Facebook Thrift,
in particular the frozen library, for storing metadata in a highly
space-efficient, memory-mappable and well defined format. It's also
included as a submodule, and we only build the compiler and a very
reduced library that contains just enough for DwarFS to work.
Other than that, DwarFS really only depends on FUSE3 and on a set of compression libraries that Folly already depends on (namely lz4, zstd and liblzma).
The dependency on googletest will be automatically resolved if you build with tests.
A good starting point for apt-based systems is probably:
$ apt install \
gcc \
g++ \
clang \
git \
ccache \
ninja-build \
cmake \
make \
bison \
flex \
fuse3 \
pkg-config \
binutils-dev \
libacl1-dev \
libarchive-dev \
libbenchmark-dev \
libboost-chrono-dev \
libboost-context-dev \
libboost-filesystem-dev \
libboost-iostreams-dev \
libboost-program-options-dev \
libboost-regex-dev \
libboost-system-dev \
libboost-thread-dev \
libbrotli-dev \
libevent-dev \
libhowardhinnant-date-dev \
libjemalloc-dev \
libdouble-conversion-dev \
libiberty-dev \
liblz4-dev \
liblzma-dev \
libzstd-dev \
libxxhash-dev \
libmagic-dev \
libparallel-hashmap-dev \
librange-v3-dev \
libssl-dev \
libunwind-dev \
libdwarf-dev \
libelf-dev \
libfmt-dev \
libfuse3-dev \
libgoogle-glog-dev \
libutfcpp-dev \
libflac++-dev \
nlohmann-json3-dev
Note that when building with gcc, the optimization level will be
set to -O2 instead of the CMake default of -O3 for release
builds. At least with versions up to gcc-10, the -O3 build is
up to 70% slower than a
build with


AI一键生成PPT,就用博思AIPPT!
博思AIPPT,新一代的AI生成PPT平台,支持智能生成PPT、AI美化PPT、文本&链接生成PPT、导入Word/PDF/Markdown文档生成PPT等,内置海量精美PPT模板,涵盖商务、教育、科技等不同风格,同时针对每个页面提供多种版式,一键自适应切换,完美适配各种办公场景。


AI赋能电商视觉革命,一站式智能商拍平台
潮际好麦深耕服装行业,是国内AI试衣效果最好的软件。使用先进AIGC能力为电商卖家批量提供优质的、低成本的商拍图。合作品牌有Shein、Lazada、安踏、百丽等65个国内外头部品牌,以及国内10万+淘宝、天猫、京东等主流平台的品牌商家,为卖家节省将近85%的出图成本,提升约3倍出图效率,让品牌能够快速上架。


企业专属的AI法律顾问
iTerms是法大大集团旗下法律子品牌,基 于最先进的大语言模型(LLM)、专业的法律知识库和强大的智能体架构,帮助企业扫清合规障碍,筑牢风控防线,成为您企业专属的AI法律顾问。


稳定高效的流量提升解决方案,助力品牌曝光
稳定高效的流量提升解决方案,助力品牌曝光


最新版Sora2模型免费使用,一键生成无水印视频
最新版Sora2模型免费使用,一键生成无水印视频


实时语音翻译/同声传译工具
Transly是一个多场景的AI大语言模型驱动的同声传译、专业翻译助手,它拥有超精准的音频识别翻译能力,几乎零延迟的使用体验和支持多国语言可以让你带它走遍全球,无论你是留学生、商务人士、韩剧美剧爱好者,还是出国游玩、多国会议、跨国追星等等,都可以满足你所有需要同传的场景需求,线上线下通用,扫除语言障碍,让全世界的语言交流不再有国界。


选题、配图、成文,一站式创作,让内容运营更高效
讯飞绘文,一个AI集成平台,支持写作、选题、配图、排版和发布。高效生成适用于各类媒体的定制内容,加速品牌传播,提升内容营销效果。


AI辅助编程,代码自动修复
Trae是一种自适应的集成开发环境(IDE),通过自动化和多元协作改变开发流程。利用Trae,团队能够更快速、精确地编写和部署代码,从而提高编程效 率和项目交付速度。Trae具备上下文感知和代码自动完成功能,是提升开发效率的理想工具。


最强AI数据分析助手
小浣熊家族Raccoon,您的AI智能助手,致力于通过 先进的人工智能技术,为用户提供高效、便捷的智能服务。无论是日常咨询还是专业问题解答,小浣熊都能以快速、准确的响应满足您的需求,让您的生活更加智能便捷。


像人一样思考的AI智能体
imini 是一款超级AI智能体,能根据人类指令,自主思考、自主完成、并且交付结果的AI智能体。
最新AI工具、AI资讯
独家AI资源、AI项目落地

微信扫一扫关注公众号