DwarFS, a file system designed to reduce redundant data

Marcus Holland-Moritz, a software engineer at Facebook, has published the first releases of DwarFS, a read-only file system designed to maximize compression and reduce redundant data.

The file system uses the FUSE mechanism and runs in user space. The code is written in C++ and distributed under the GPLv3 license.

About DwarFS

DwarFS serves the same purposes as file systems such as SquashFS, cramfs, and CromFS. It can be used to create live images and to shrink collections of files that contain a lot of duplicated data (for example, stored virtual machine images or collections of different versions of the same programs).

In terms of data access speed, DwarFS is roughly on par with SquashFS, but it is several times ahead in compression efficiency and image creation speed.

The project was developed to optimize the storage of many different versions of Perl (the author of DwarFS takes part in maintaining CPAN).

Initially, the author tried to use CromFS for compression, but building the image took too long and stability left a lot to be desired. SquashFS was stable and built images noticeably faster, but its compression level was unacceptable.

Most of the DwarFS code was written in 2013. This year, the author found time to release the code publicly and write documentation. DwarFS uses the Boost and Folly libraries.

Metadata is stored using the "frozen" format from Facebook's Thrift library. Other dependencies include FUSE3 and the lz4, zstd, and liblzma compression libraries.

In a test image containing 1,139 Perl installations that cover 284 distinct Perl versions, DwarFS compressed the data about 8 times better than SquashFS and built the image about 4 times faster.

DwarFS shrank the benchmark data from 47 GB to 582 MB (1.1% of the original size), while the resulting SquashFS image was 4.7 GB. SquashFS took 69 minutes to create the image, while DwarFS completed the job in 15 minutes.

Both file systems used the ZSTD algorithm for compression. With LZMA, the DwarFS image shrank by a further 18% (to approx. 479 MB), but access to that image was significantly slower.

Tests on data with fewer duplicates showed a smaller but still notable advantage for DwarFS. For example, the image of the Raspberry Pi OS root file system was 298 MB with DwarFS and 364 MB with SquashFS, and the build times were 1 minute 36 seconds and 1 minute 54 seconds, respectively.
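To see how the headline ratios follow from the raw numbers, the figures quoted above can be plugged into a few lines of Python. This is only a back-of-the-envelope check: the variable and function names are made up for this example, and it assumes that 1 GB equals 1000 MB, which the article does not specify.

```python
# Figures quoted in the article. Assumption: 1 GB = 1000 MB.
PERL_MB = {"dwarfs_zstd": 582, "dwarfs_lzma": 479, "squashfs_zstd": 4700}
PERL_BUILD_MIN = {"dwarfs": 15, "squashfs": 69}
RPI_MB = {"dwarfs": 298, "squashfs": 364}
RPI_BUILD_SEC = {"dwarfs": 96, "squashfs": 114}  # 1 min 36 s and 1 min 54 s


def ratio(larger: float, smaller: float) -> float:
    """How many times bigger (or slower) one value is than the other."""
    return larger / smaller


def saving_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` compared with `baseline`, in percent."""
    return 100 * (1 - improved / baseline)


# Perl benchmark: DwarFS versus SquashFS, both using ZSTD.
perl_size_ratio = ratio(PERL_MB["squashfs_zstd"], PERL_MB["dwarfs_zstd"])
perl_time_ratio = ratio(PERL_BUILD_MIN["squashfs"], PERL_BUILD_MIN["dwarfs"])

# Extra gain from switching the DwarFS image from ZSTD to LZMA.
lzma_saving = saving_percent(PERL_MB["dwarfs_zstd"], PERL_MB["dwarfs_lzma"])

# Raspberry Pi OS root file system: far fewer duplicates, smaller gap.
rpi_size_saving = saving_percent(RPI_MB["squashfs"], RPI_MB["dwarfs"])
rpi_time_saving = saving_percent(RPI_BUILD_SEC["squashfs"], RPI_BUILD_SEC["dwarfs"])

print(f"Perl image: {perl_size_ratio:.1f}x smaller, built {perl_time_ratio:.1f}x faster")
print(f"LZMA instead of ZSTD: {lzma_saving:.0f}% smaller")
print(f"Raspberry Pi OS: {rpi_size_saving:.0f}% smaller, {rpi_time_saving:.0f}% faster build")
```

The gap between the two benchmarks is the point: the more duplicated data a collection contains, the more DwarFS gains over SquashFS.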

The key features of DwarFS include:

  • Powerful redundancy elimination: similar data is grouped together (regardless of file boundaries), using locality-sensitive hashing (LSH) to identify similar objects.
  • Segmentation analysis across file system blocks, which reduces the size of the uncompressed file system and improves cache efficiency, since more of the required data fits in the cache.
  • Multi-threaded implementation of both the image creation utility and the FUSE module, which can use all available CPU cores.
  • Experimental support for plugging in Lua scripts that can be used to filter and order content.
  • A repacking mode that lets you change the compression algorithm of an existing image (for example, you can repack using LZMA or LZ4 instead of ZSTD).
  • Images are created using the mkdwarfs utility and mounted using the dwarfs utility.
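
The idea of grouping similar data with an LSH function can be illustrated with a small Python sketch. It is not the DwarFS implementation: the article does not say which LSH function DwarFS uses, so SimHash is swapped in here purely for illustration, and the function names, the shingle size, and the greedy ordering are all assumptions. The sketch gives each file a 64-bit fingerprint in which similar contents differ in only a few bits, then orders the files so that each one is followed by its nearest remaining neighbour. A compressor that processes files in this order sees similar data close together.

```python
import hashlib


def simhash(data: bytes, shingle: int = 4, bits: int = 64) -> int:
    """SimHash fingerprint: similar inputs give fingerprints that differ in few bits."""
    counts = [0] * bits
    for i in range(max(len(data) - shingle + 1, 1)):
        # Hash every overlapping `shingle`-byte window to a `bits`-wide integer.
        digest = hashlib.blake2b(data[i:i + shingle], digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # Each output bit is the majority vote of that bit over all windows.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def order_by_similarity(files: dict[str, bytes]) -> list[str]:
    """Greedy ordering: start with the first name, then always pick the nearest file."""
    if not files:
        return []
    prints = {name: simhash(data) for name, data in files.items()}
    remaining = sorted(prints)
    order = [remaining.pop(0)]
    while remaining:
        last = prints[order[-1]]
        nearest = min(remaining, key=lambda n: (hamming(prints[n], last), n))
        remaining.remove(nearest)
        order.append(nearest)
    return order


# Hypothetical input: two near-identical text files and one unrelated blob.
text = "".join(f"line {i}: value {i * i}\n" for i in range(200)).encode()
blob = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(100))
files = {"a.txt": text, "b.bin": blob, "c.txt": text + b"one extra line\n"}

order = order_by_similarity(files)
print(order)
```

The greedy nearest-neighbour pass compares every pair of fingerprints, which is fine for a demonstration but would not scale to very large file collections without an index over the fingerprints.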

Finally, if you want to know more about this file system or would like to compile it from source, you can find the documentation and the source code at the following link.

