Asuran

Overview

Asuran is a new archive format, and an implementation (in rust) of both a read/write library as well as a command line tool.

Features Overview

Deduplication

Deduplication is achieved through a combination of content defined slicing and content addressable storage. The CAS backend will only store each chunk submitted to it once. The content defined slicing provides a reasonable assurance that objects will be broken up into chunks in shuch a way that duplicate chunks will occur, if possible, and not be stored twice.

Encryption

The encryption backend is plugable and can be changed on a chunk-by-chunk basis. Changing the default encryption will only change the encryption method for new chunks.

Each chunk is encrypted and then put in an Object along side a tag indicating the encryption algorithm used, as well as the IV if the algorithm requires one.

TODO Supported Encryption types

The supported encryption types are as follows. All types are authenticated with an HMAC tag, whether or not the algorithm includes authentication.

Key generation and storage

They keys used by the encryption and HMAC are generated randomly at repository creation, and are stored on disk encrypted with a key derived from a user supplied passphrase. The current KDF used is bcrypt, however this will be swapped out for a more modern one in the near future, most likely argon2id.

Compression

The compression backend is pluggable and supports a variety of compression algorithms. Compression takes place before encryption, and each chunk is tagged with the compression algorithm used to compress it¹.

TODO Intelligent Compression

Asuran will support (but currently does not) intelligent encrypt on, where it will do a trial compression of the first chunk or chunks of a file with lz4, and if reasonable compression is detected, use the selected compression algorithm too compress the chunks of the file, otherwise applying no compression.

Authentication

All data stored in the repository is authenticated through an encrypt-then-mac construction, with HMAC-SHA256 and Blake2b currently supported. Asuran will sternly refuse to unpack a chunk if the HMAC does not pass verification.

Content keys used by the CAS are generated with an HMAC of the plain text, using a different key than the HMAC used for verification. These are currently not verified.

Internals Overview

Repository

The repository is a low level key-value store that is commited to disk. All values (refered to as “chunks”) in the repository are encrypted, compressed, and HMACed with algorithms configurable on a per-value basis.

The repository only understands keys and values, and effectively operates as content addressable storage, all other data structures are implemented on top of the repository.

The repository structure itself is storage-independent. The repository object itself simply views the world as list of segments, which themselves are lists of sized cells containing values.

Repositories are not strictly required to have multiple segments, and segments are not strictly required to contain multiple chunks. This allows simple mapping of any random access storage as (possibly) a single segment, or an object type store (such as S3) as a number of segments each containing one or many chunks.

The repository has special methods for pulling the manifest and index out of itself, and it may or may not treat these pieces of data as special, depending on the backend implementation in use. Typically, the manifest will be stored as a normal Chunk with a special key that is all zero.

Chunks

A chunk is the representation of a value in the repository.

It is a compressed and encrypted sequence of bytes, along with a set of tags describing the encryption, compression, and HMAC algorithms used, as well as any IVs those algorithms require.

Chunks contain two HMAC values, id and hmac.

Compression and encryption are swappable on a per chunk basis.

TODO ID

The ID of the Chunk is the HMAC of its plain text content, ideally using a different key than hmac, but currently uses the same key. (Will be changed in a future version).

ID is used for deduplication, and is the key used to reference the chunk in the repository.
HMAC

The HMAC of the chunk is, as the name implies, an HMAC of the chunk’s encrypted contents. This is used for authentication and data integrity verification.

Repository Backend

The repository backend is responsible for translating the repository’s “list of lists” segment/chunk view onto whatever storage backend is desired. The backend is additionally responsible for providing a map from keys to (Segment, Offset) pairs.

Segments are stored as the concatenation of the bytes making up the MessagePack representation of their chunks.

As long as the methods return what they should, Asuran places no restrictions on how the underlying mapping occurs, or what side effects the methods should perform.

These methods are extremely likely to be side effect prone in any implementation, and, generally, should not be called directly by the consumer, and instead used indirectly through the repository API.

Filesystem Backend

The filesystem back end uses a configurable segment size², storing segments in folders with a configurable limit on the number of segments in a folder³ (to avoid filesystem operations bogging down).

Manifest

The manifest is the root of the repository’s object graph, and is the primary object through which repository access is managed.

The manifest contains a list of reference to Archive objects within the repository, as well as methods for managing them. The manifest also contains utility methods, that when paired with a Target driver, can be used to backup objects to and restore objects from a repository.

The manifest additionally contains a timestamp of its last modification, as well as the ability to load and commit itself from/to the repository.

Targets

Targets abstract the operation of creating and restoring archive to/from various types of storage. The API is written primarily to cater to the typical “files stored on a filesystem” use case, but is by no means limited to it.

As long as the target storage has objects, that can be serialized into a byte stream, and the “location” of those objects can be mapped to unix path style strings, then a valid target implementation can be written for the storage.

BackupTarget and RestoreTarget

BackupTarget and RestoreTarget are the traits that targets must be able to implement in order to backup data to and restore data from an archive, respectively. Most, if not all, targets will implement both traits.
- BackupTarget
  BackupTarget contains the following methods:
  1. Paths Returns a list of objects to be stored, as well as their paths
  2. Object Returns a reader for the object given its path (from the Paths method)
  3. Listing Returns a serialized listing of all the objects stored. Typically stored in the archive at “archive:listing”
- RestoreTarget
  RestoreTarget contains the following methods:
  1. Load listing Parses the listing produced by BackupTarget::Paths
  2. Object Returns a writer to the object’s real location on the storage
  3. Listing Provides a list of all paths to be restore.
TODO Sparse Data

The target API is written to support the concept of sparse data, but currently no targets actually have support for sparse data.

Once complete, dense data will just be handled as the degenerate case of sparse data that has only one contiguous chunk. This will be implemented through describing BackupObjects and RestoreObjects as lists of pre-seeked readers and writers, and dense data will simply be the case where those lists only have one element.

TODO Target Drivers

The target driver trait specifies a collection of methods for writing objects to and reading objects from storage. The driver should handle the process of reading and writing the objects in their entirety, with the consumer only having to supply the repository, the archive, the root path to restore relative too, and the target object path.

Chunker

The chunker is responsible for dividing objects into chunks of bytes, using some well-defined method.

The chunker framework is pluggable, and while support is planned for several chunkers, both special and general purpose, is planned, currently Asuran only implements one, a content defined chunker based on the BuzHash algorithm.

BuzHash

The buzhash chunker used a modification of the buzhash rolling hash algorithm to perform content defined slicing.

It runs a rolling hash down the data, and slices it when the last n bits of the hash are 0, as long as other requirements are met.

This chunker has three settings:

Window Size Adjusts the sizof the data window considered by the rolling hash
Mask Bits How many bits of the hash have to be 0 to determine a slice.

With a Mask Bits value of n, the chunker will not split the data if it would result in a chunk less than 2^{n - 2} bytes in size, and will always split the data if the chunk is about to exceed 2^{n + 2} in size
Nonce This implementation randomizes the buzhash table to help prevent chunk size based fingerprinting attacks. The Nonce is the seed used for the random number generator that fills the table.

TODO FastCDC

The FastCDC chunker has the potential to be up to 100x faster than our buzhash implementation

TODO Static Size

The static size chunker will always may the chunks the same, configurable, size

TODO Disk Image Chunker

The disk image chunker will understand disk image formats, and chunk them in an intelligent way.

TODO Raw Image Chunker

The raw image chunker will attempt to detect raw disk images (e.g. iso, img, etc..) and put any metadata in its own chunks, and then attempt to make the chunk size match up with the block size of the image.
TODO VMA Chunker

This chunker should understand the Proxmox VMA format and be able to chunk it intelligently to maximize dedeuplication.

Development Process / Contributing

As it is only me developing at the moment, the current development model isn’t very structured. In the future it will consist of a branch-per-featured model with branches being required to past a minimum set of tests before being merged into master.

Pull requests and issues are welcome, and by contributing to this project you agree to license your work under the MIT license.

Inspiration/Motivation

This project is inspired by both Borg and Restic. Both are very good pieces of software, and perfectly suitable for many use cases, but my use case seems to lie in between the two.

In many ways, this project is intended to be a mashup of what I consider to be the best features of the two applications, while attempting to make a modifiable and extendable framework that can be embedded in other applications easily.

Features Borg has that Restic is missing

Performance Borg generally has way better performance than Restic, in my work load I have personally found this to be to a disturbing extent.
Optional/Switchable Encryption Don’t get me wrong, being able to safely store sensitive data on untrusted storage is really nice, but sometimes you really are backing up to trusted storage (e.g. an external hard drive that is already encrypted at the file system or drive level), and double encryption is just extra overhead.
Optional/Switchable Compression Restic doesn’t support compression at all, which, in my opinion, makes it a no-go for many workloads

Features Restic has that Borg is missing

Switchable Storage Backends This one is a big deal for me. As a home gamer, being able to directly backup my datahoarder levels of files to an unlimited GDrive or the like is a huge deal. This is also the only reason I use Restic for some of my backups
Multiple computers writing to the same repository Borg’s repository locking and chunk cache mechanisms make writing to the same repository with multiple computers a huge pain in the ass. Not having all your computers backing up to the same repository decreases deduplication by an extremely large factor and is just generally not good.

Features I want that neither has

Tar import and export

This isn’t entirely true, borg has tar export and is working on tar import, but it lacks one feature that is critical to my workflow, reproducing the same tar file. My work flow involves a program that produces backups as tar files, and when restoring them looks for a special file that must be the first in the tar. I would like the ability to import and export tars and keep the metadata of the tar the same, while still being able to take the tar apart and deduplciate the individual files within it, and use the compression defined by the repository.
Good multithreading

While borg is python based and doesn’t really used threads, restic has multithreading, but in my opinion, doesn’t use it well

Comparison with rdedup

rdedup is a very good tool, but falls sort in several areas for me.

No built in directory traversal

rdedup depends on external tools like tar to make backups. In my experience this makes for a poor deduplication rate compared to borg in my workflow.
No current support for cloud backends

This one is almost cheating because asuran does not currently have support for cloud backends, but asuran was designed from the ground up to be storage-agnostic.
No intelligent chunking

rdedup has good support for choosing from a few good content defined slicers, but lacks the framework for intelligent slicing of known data types, such as disk images that can be sliced blockwise, or intelligent picking apart of backups emitted by other applications in a way to maximize deduplication.
Little/no integration support

This complaint also somewhat applies to borg and restic, but to a lesser extent. libasuran is designed to be called into from other applications, such as a carbonite style automatic backup utility, allowing the easy creation of end user friendly applications that support the full suite of asuran features.