dwarfs

dwarfs

高压缩比且快速读取的只读文件系统

DwarFS是一款专注于实现高压缩比的只读文件系统,尤其适合处理冗余数据。该系统在保持高速读取的同时,提供了优于SquashFS等压缩文件系统的压缩效果。DwarFS的特色功能包括文件相似度聚类、跨块分段分析和文件分类框架,可充分利用多核系统资源。支持Linux和Windows平台,适用于需要高压缩率和快速访问的应用场景。

DwarFS文件系统压缩读取LinuxGithub开源项目

Latest Release Total Downloads DwarFS CI Build Build Status Codacy Badge codecov OpenSSF Best Practices

DwarFS

The Deduplicating Warp-speed Advanced Read-only File System.

A fast high compression read-only file system for Linux and Windows.

Table of contents

Overview

Windows Screen Capture

Linux Screen Capture

DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data.

This probably doesn't sound very exciting, because if it's redundant, it should compress well. However, I found that other read-only, compressed file systems don't do a very good job at making use of this redundancy. See here for a comparison with other compressed file systems.

DwarFS also doesn't compromise on speed and for my use cases I've found it to be on par with or perform better than SquashFS. For my primary use case, DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources.

To give you an idea of what DwarFS is capable of, here's a quick comparison of DwarFS and SquashFS on a set of video files with a total size of 39 GiB. The twist is that each unique video file has two sibling files with a different set of audio streams (this is an actual use case). So there's redundancy in both the video and audio data, but as the streams are interleaved and identical blocks are typically very far apart, it's challenging to make use of that redundancy for compression. SquashFS essentially fails to compress the source data at all, whereas DwarFS is able to reduce the size by almost a factor of 3, which is close to the theoretical maximum:

$ du -hs dwarfs-video-test
39G     dwarfs-video-test
$ ls -lh dwarfs-video-test.*fs
-rw-r--r-- 1 mhx users 14G Jul  2 13:01 dwarfs-video-test.dwarfs
-rw-r--r-- 1 mhx users 39G Jul 12 09:41 dwarfs-video-test.squashfs

Furthermore, when mounting the SquashFS image and performing a random-read throughput test using fio-3.34, both squashfuse and squashfuse_ll top out at around 230 MiB/s:

$ fio --readonly --rw=randread --name=randread --bs=64k --direct=1 \
      --opendir=mnt --numjobs=4 --ioengine=libaio --iodepth=32 \
      --group_reporting --runtime=60 --time_based
[...]
   READ: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=13.5GiB (14.5GB), run=60004-60004msec

In comparison, DwarFS manages to sustain random read rates of 20 GiB/s:

  READ: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1212GiB (1301GB), run=60001-60001msec

Distinct features of DwarFS are:

  • Clustering of files by similarity using a similarity hash function. This makes it easier to exploit the redundancy across file boundaries.

  • Segmentation analysis across file system blocks in order to reduce the size of the uncompressed file system. This saves memory when using the compressed file system and thus potentially allows for higher cache hit rates as more data can be kept in the cache.

  • Categorization framework to categorize files or even fragments of files and then process individual categories differently. For example, this allows you to not waste time trying to compress incompressible files or to compress PCM audio data using FLAC compression.

  • Highly multi-threaded implementation. Both the file system creation tool as well as the FUSE driver are able to make good use of the many cores of your system.

History

I started working on DwarFS in 2013 and my main use case and major motivation was that I had several hundred different versions of Perl that were taking up something around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.

Up until then, I had been using Cromfs for squeezing them into a manageable size. However, I was getting more and more annoyed by the time it took to build the filesystem image and, to make things worse, more often than not it was crashing after about an hour or so.

I had obviously also looked into SquashFS, but never got anywhere close to the compression rates of Cromfs.

This alone wouldn't have been enough to get me into writing DwarFS, but at around the same time, I was pretty obsessed with the recent developments and features of newer C++ standards and really wanted a C++ hobby project to work on. Also, I've wanted to do something with FUSE for quite some time. Last but not least, I had been thinking about the problem of compressed file systems for a bit and had some ideas that I definitely wanted to try.

The majority of the code was written in 2013, then I did a couple of cleanups, bugfixes and refactors every once in a while, but I never really got it to a state where I would feel happy releasing it. It was too awkward to build with its dependency on Facebook's (quite awesome) folly library and it didn't have any documentation.

Digging out the project again this year, things didn't look as grim as they used to. Folly now builds with CMake and so I just pulled it in as a submodule. Most other dependencies can be satisfied from packages that should be widely available. And I've written some rudimentary docs as well.

Building and Installing

Note to Package Maintainers

DwarFS should usually build fine with minimal changes out of the box. If it doesn't, please file a issue. I've set up CI jobs using Docker images for Ubuntu (22.04 and 24.04), Fedora Rawhide and Arch that can help with determining an up-to-date set of dependencies. Note that building from the release tarball requires less dependencies than building from the git repository, notably the ronn tool as well as Python and the mistletoe Python module are not required when building from the release tarball.

There are some things to be aware of:

  • There's a tendency to try and unbundle the folly and fbthrift libraries that are included as submodules and are built along with DwarFS. While I agree with the sentiment, it's unfortunately a bad idea. Besides the fact that folly does not make any claims about ABI stability (i.e. you can't just dynamically link a binary built against one version of folly against another version), it's not even possible to safely link against a folly library built with different compile options. Even subtle differences, such as the C++ standard version, can cause run-time errors. See this issue for details. Currently, it is not even possible to use external versions of folly/fbthrift as DwarFS is building minimal subsets of both libraries; these are bundled in the dwarfs_common library and they are strictly used internally, i.e. none of the folly or fbthrift headers are required to build against DwarFS' libraries.

  • Similar issues can arise when using a system-installed version of GoogleTest. GoogleTest itself recommends that it is being downloaded as part of the build. However, you can use the system installed version by passing -DPREFER_SYSTEM_GTEST=ON to the cmake call. Use at your own risk.

  • For other bundled libraries (namely fmt, parallel-hashmap, range-v3), the system installed version is used as long as it meets the minimum required version. Otherwise, the preferred version is fetched during the build.

Prebuilt Binaries

Each release has pre-built, statically linked binaries for Linux-x86_64, Linux-aarch64 and Windows-AMD64 available for download. These should run without any dependencies and can be useful especially on older distributions where you can't easily build the tools from source.

Universal Binaries

In addition to the binary tarballs, there's a universal binary available for each architecture. These universal binaries contain all tools (mkdwarfs, dwarfsck, dwarfsextract and the dwarfs FUSE driver) in a single executable. These executables are compressed using upx, so they are much smaller than the individual tools combined. However, it also means the binaries need to be decompressed each time they are run, which can have a signficant overhead. If that is an issue, you can either stick to the "classic" individual binaries or you can decompress the universal binary, e.g.:

upx -d dwarfs-universal-0.7.0-Linux-aarch64

The universal binaries can be run through symbolic links named after the proper tool. e.g.:

$ ln -s dwarfs-universal-0.7.0-Linux-aarch64 mkdwarfs
$ ./mkdwarfs --help

This also works on Windows if the file system supports symbolic links:

> mklink mkdwarfs.exe dwarfs-universal-0.7.0-Windows-AMD64.exe
> .\mkdwarfs.exe --help

Alternatively, you can select the tool by passing --tool=<name> as the first argument on the command line:

> .\dwarfs-universal-0.7.0-Windows-AMD64.exe --tool=mkdwarfs --help

Note that just like the dwarfs.exe Windows binary, the universal Windows binary depends on the winfsp-x64.dll from the WinFsp project. However, for the universal binary, the DLL is loaded lazily, so you can still use all other tools without the DLL. See the Windows Support section for more details.

Dependencies

DwarFS uses CMake as a build tool.

It uses both Boost and Folly, though the latter is included as a submodule since very few distributions actually offer packages for it. Folly itself has a number of dependencies, so please check here for an up-to-date list.

It also uses Facebook Thrift, in particular the frozen library, for storing metadata in a highly space-efficient, memory-mappable and well defined format. It's also included as a submodule, and we only build the compiler and a very reduced library that contains just enough for DwarFS to work.

Other than that, DwarFS really only depends on FUSE3 and on a set of compression libraries that Folly already depends on (namely lz4, zstd and liblzma).

The dependency on googletest will be automatically resolved if you build with tests.

A good starting point for apt-based systems is probably:

$ apt install \
    gcc \
    g++ \
    clang \
    git \
    ccache \
    ninja-build \
    cmake \
    make \
    bison \
    flex \
    fuse3 \
    pkg-config \
    binutils-dev \
    libacl1-dev \
    libarchive-dev \
    libbenchmark-dev \
    libboost-chrono-dev \
    libboost-context-dev \
    libboost-filesystem-dev \
    libboost-iostreams-dev \
    libboost-program-options-dev \
    libboost-regex-dev \
    libboost-system-dev \
    libboost-thread-dev \
    libbrotli-dev \
    libevent-dev \
    libhowardhinnant-date-dev \
    libjemalloc-dev \
    libdouble-conversion-dev \
    libiberty-dev \
    liblz4-dev \
    liblzma-dev \
    libzstd-dev \
    libxxhash-dev \
    libmagic-dev \
    libparallel-hashmap-dev \
    librange-v3-dev \
    libssl-dev \
    libunwind-dev \
    libdwarf-dev \
    libelf-dev \
    libfmt-dev \
    libfuse3-dev \
    libgoogle-glog-dev \
    libutfcpp-dev \
    libflac++-dev \
    nlohmann-json3-dev

Note that when building with gcc, the optimization level will be set to -O2 instead of the CMake default of -O3 for release builds. At least with versions up to gcc-10, the -O3 build is up to 70% slower than a build with

编辑推荐精选

TRAE编程

TRAE编程

AI辅助编程,代码自动修复

Trae是一种自适应的集成开发环境(IDE),通过自动化和多元协作改变开发流程。利用Trae,团队能够更快速、精确地编写和部署代码,从而提高编程效率和项目交付速度。Trae具备上下文感知和代码自动完成功能,是提升开发效率的理想工具。

热门AI工具生产力协作转型TraeAI IDE
蛙蛙写作

蛙蛙写作

AI小说写作助手,一站式润色、改写、扩写

蛙蛙写作—国内先进的AI写作平台,涵盖小说、学术、社交媒体等多场景。提供续写、改写、润色等功能,助力创作者高效优化写作流程。界面简洁,功能全面,适合各类写作者提升内容品质和工作效率。

AI助手AI工具AI写作工具AI辅助写作蛙蛙写作学术助手办公助手营销助手
问小白

问小白

全能AI智能助手,随时解答生活与工作的多样问题

问小白,由元石科技研发的AI智能助手,快速准确地解答各种生活和工作问题,包括但不限于搜索、规划和社交互动,帮助用户在日常生活中提高效率,轻松管理个人事务。

聊天机器人AI助手热门AI工具AI对话
Transly

Transly

实时语音翻译/同声传译工具

Transly是一个多场景的AI大语言模型驱动的同声传译、专业翻译助手,它拥有超精准的音频识别翻译能力,几乎零延迟的使用体验和支持多国语言可以让你带它走遍全球,无论你是留学生、商务人士、韩剧美剧爱好者,还是出国游玩、多国会议、跨国追星等等,都可以满足你所有需要同传的场景需求,线上线下通用,扫除语言障碍,让全世界的语言交流不再有国界。

讯飞智文

讯飞智文

一键生成PPT和Word,让学习生活更轻松

讯飞智文是一个利用 AI 技术的项目,能够帮助用户生成 PPT 以及各类文档。无论是商业领域的市场分析报告、年度目标制定,还是学生群体的职业生涯规划、实习避坑指南,亦或是活动策划、旅游攻略等内容,它都能提供支持,帮助用户精准表达,轻松呈现各种信息。

热门AI工具AI办公办公工具讯飞智文AI在线生成PPTAI撰写助手多语种文档生成AI自动配图
讯飞星火

讯飞星火

深度推理能力全新升级,全面对标OpenAI o1

科大讯飞的星火大模型,支持语言理解、知识问答和文本创作等多功能,适用于多种文件和业务场景,提升办公和日常生活的效率。讯飞星火是一个提供丰富智能服务的平台,涵盖科技资讯、图像创作、写作辅助、编程解答、科研文献解读等功能,能为不同需求的用户提供便捷高效的帮助,助力用户轻松获取信息、解决问题,满足多样化使用场景。

模型训练热门AI工具内容创作智能问答AI开发讯飞星火大模型多语种支持智慧生活
Spark-TTS

Spark-TTS

一种基于大语言模型的高效单流解耦语音令牌文本到语音合成模型

Spark-TTS 是一个基于 PyTorch 的开源文本到语音合成项目,由多个知名机构联合参与。该项目提供了高效的 LLM(大语言模型)驱动的语音合成方案,支持语音克隆和语音创建功能,可通过命令行界面(CLI)和 Web UI 两种方式使用。用户可以根据需求调整语音的性别、音高、速度等参数,生成高质量的语音。该项目适用于多种场景,如有声读物制作、智能语音助手开发等。

咔片PPT

咔片PPT

AI助力,做PPT更简单!

咔片是一款轻量化在线演示设计工具,借助 AI 技术,实现从内容生成到智能设计的一站式 PPT 制作服务。支持多种文档格式导入生成 PPT,提供海量模板、智能美化、素材替换等功能,适用于销售、教师、学生等各类人群,能高效制作出高品质 PPT,满足不同场景演示需求。

讯飞绘文

讯飞绘文

选题、配图、成文,一站式创作,让内容运营更高效

讯飞绘文,一个AI集成平台,支持写作、选题、配图、排版和发布。高效生成适用于各类媒体的定制内容,加速品牌传播,提升内容营销效果。

AI助手热门AI工具AI创作AI辅助写作讯飞绘文内容运营个性化文章多平台分发
材料星

材料星

专业的AI公文写作平台,公文写作神器

AI 材料星,专业的 AI 公文写作辅助平台,为体制内工作人员提供高效的公文写作解决方案。拥有海量公文文库、9 大核心 AI 功能,支持 30 + 文稿类型生成,助力快速完成领导讲话、工作总结、述职报告等材料,提升办公效率,是体制打工人的得力写作神器。

下拉加载更多