Asuran is a new archive format, and an implementation (in rust) of both a read/write library as well as a command line tool.
Deduplication is achieved through a combination of content defined slicing and content addressable storage. The CAS backend will only store each chunk submitted to it once. The content defined slicing provides a reasonable assurance that objects will be broken up into chunks in shuch a way that duplicate chunks will occur, if possible, and not be stored twice.
The encryption backend is plugable and can be changed on a chunk-by-chunk basis. Changing the default encryption will only change the encryption method for new chunks.
Each chunk is encrypted and then put in an Object along side a tag indicating the encryption algorithm used, as well as the IV if the algorithm requires one.
TODO Supported Encryption types
The supported encryption types are as follows. All types are authenticated with an HMAC tag, whether or not the algorithm includes authentication.
- No Encryption (Passthrough cipher)
Key generation and storage
They keys used by the encryption and HMAC are generated randomly at repository creation, and are stored on disk encrypted with a key derived from a user supplied passphrase. The current KDF used is bcrypt, however this will be swapped out for a more modern one in the near future, most likely argon2id.
The compression backend is pluggable and supports a variety of compression algorithms. Compression takes place before encryption, and each chunk is tagged with the compression algorithm used to compress it1.
TODO Supported Compression types
Supported Compression types are as follows
- No Compression
TODO Intelligent Compression
Asuran will support (but currently does not) intelligent encrypt on, where it will do a trial compression of the first chunk or chunks of a file with lz4, and if reasonable compression is detected, use the selected compression algorithm too compress the chunks of the file, otherwise applying no compression.
All data stored in the repository is authenticated through an encrypt-then-mac construction, with HMAC-SHA256 and Blake2b currently supported. Asuran will sternly refuse to unpack a chunk if the HMAC does not pass verification.
Content keys used by the CAS are generated with an HMAC of the plain text, using a different key than the HMAC used for verification. These are currently not verified.
TODO Supported Hashes for HMAC
The supported HMAC algorithms are as follows
The repository is a low level key-value store that is commited to disk. All values (refered to as “chunks”) in the repository are encrypted, compressed, and HMACed with algorithms configurable on a per-value basis.
The repository only understands keys and values, and effectively operates as content addressable storage, all other data structures are implemented on top of the repository.
The repository structure itself is storage-independent. The repository object itself simply views the world as list of segments, which themselves are lists of sized cells containing values.
Repositories are not strictly required to have multiple segments, and segments are not strictly required to contain multiple chunks. This allows simple mapping of any random access storage as (possibly) a single segment, or an object type store (such as S3) as a number of segments each containing one or many chunks.
The repository has special methods for pulling the manifest and index out of itself, and it may or may not treat these pieces of data as special, depending on the backend implementation in use. Typically, the manifest will be stored as a normal Chunk with a special key that is all zero.
A chunk is the representation of a value in the repository.
It is a compressed and encrypted sequence of bytes, along with a set of tags describing the encryption, compression, and HMAC algorithms used, as well as any IVs those algorithms require.
Chunks contain two HMAC values, id and hmac.
Compression and encryption are swappable on a per chunk basis.
- TODO ID
The ID of the Chunk is the HMAC of its plain text content, ideally using a different key than hmac, but currently uses the same key. (Will be changed in a future version).
ID is used for deduplication, and is the key used to reference the chunk in the repository.
The HMAC of the chunk is, as the name implies, an HMAC of the chunk’s encrypted contents. This is used for authentication and data integrity verification.
The repository backend is responsible for translating the repository’s “list of lists” segment/chunk view onto whatever storage backend is desired. The backend is additionally responsible for providing a map from keys to (Segment, Offset) pairs.
Segments are stored as the concatenation of the bytes making up the MessagePack representation of their chunks.
As long as the methods return what they should, Asuran places no restrictions on how the underlying mapping occurs, or what side effects the methods should perform.
These methods are extremely likely to be side effect prone in any implementation, and, generally, should not be called directly by the consumer, and instead used indirectly through the repository API.
The manifest is the root of the repository’s object graph, and is the primary object through which repository access is managed.
The manifest contains a list of reference to Archive objects within the repository, as well as methods for managing them. The manifest also contains utility methods, that when paired with a Target driver, can be used to backup objects to and restore objects from a repository.
The manifest additionally contains a timestamp of its last modification, as well as the ability to load and commit itself from/to the repository.
An archive is conceptually a collection of objects stored in a repository. This is the most common entry and exit point for data.
An archive object contains a name4, a list of the objects in the archive (stored as a HashMap mapping the path of the object to a list of its chunks and the offsets of the chunk within the object), as well as the timestamp of the archive’s creation.
The timestamp is primarily intended to prevent replay attacks, but also serves to provide the user with additional information, as well as allowing the user to distinguish multiple archives with identical names.
Object paths are unix-path style “/” delimited lists of tokens, and while they usually will map directly to paths, they are not required to, thus the individual tokens are allowed to contain any unicode character except “/”. The interpretation of the paths is left up to the target driver.
Archives are commited to a manifest by MessagePacking them and storing the result as a Chunk in the repository. The resulting ID is then wrapped in a Stored Archive object alongside the metadata (name, creation date, etc…), and the StoredArchive is then added to the manifest list.
Archives are namespaced, allowing multiple objects with the same path to be contained in the same archive, so long as they are in different namespaces.
Namespaces are described as colon delimited lists of tokens with a trailing colon, in order of increasing specificity (e.g. 1:2: would describe a namspace named “2” inside of a namespace named “1”).
The complete path of a specific object in a repository is described by appending the path of the object to its namespace string. For example, a file “/usr/share/example” stored in the root namespace of an archive would be referenced by the string “:/usr/share/example”, where as the file’s metadata might be referenced by “metadata:/usr/share/example”, and auditing information might be referenced by “metadata:audit:/user/share/example”.
Targets abstract the operation of creating and restoring archive to/from various types of storage. The API is written primarily to cater to the typical “files stored on a filesystem” use case, but is by no means limited to it.
As long as the target storage has objects, that can be serialized into a byte stream, and the “location” of those objects can be mapped to unix path style strings, then a valid target implementation can be written for the storage.
- BackupTarget and RestoreTarget
BackupTarget and RestoreTarget are the traits that targets must be able to implement in order to backup data to and restore data from an archive, respectively. Most, if not all, targets will implement both traits.
BackupTarget contains the following methods:
- Paths Returns a list of objects to be stored, as well as their paths
- Object Returns a reader for the object given its path (from the Paths method)
- Listing Returns a serialized listing of all the objects stored. Typically stored in the archive at “archive:listing”
RestoreTarget contains the following methods:
- Load listing Parses the listing produced by BackupTarget::Paths
- Object Returns a writer to the object’s real location on the storage
- Listing Provides a list of all paths to be restore.
- TODO Sparse Data
The target API is written to support the concept of sparse data, but currently no targets actually have support for sparse data.
Once complete, dense data will just be handled as the degenerate case of sparse data that has only one contiguous chunk. This will be implemented through describing BackupObjects and RestoreObjects as lists of pre-seeked readers and writers, and dense data will simply be the case where those lists only have one element.
TODO Target Drivers
The target driver trait specifies a collection of methods for writing objects to and reading objects from storage. The driver should handle the process of reading and writing the objects in their entirety, with the consumer only having to supply the repository, the archive, the root path to restore relative too, and the target object path.
The chunker is responsible for dividing objects into chunks of bytes, using some well-defined method.
The chunker framework is pluggable, and while support is planned for several chunkers, both special and general purpose, is planned, currently Asuran only implements one, a content defined chunker based on the BuzHash algorithm.
The buzhash chunker used a modification of the buzhash rolling hash algorithm to perform content defined slicing.
It runs a rolling hash down the data, and slices it when the last n bits of the hash are 0, as long as other requirements are met.
This chunker has three settings:
- Window Size Adjusts the sizof the data window considered by the rolling hash
Mask Bits How many bits of the hash have to be 0 to determine a slice.
With a Mask Bits value of n, the chunker will not split the data if it would result in a chunk less than 2n - 2 bytes in size, and will always split the data if the chunk is about to exceed 2n + 2 in size
- Nonce This implementation randomizes the buzhash table to help prevent chunk size based fingerprinting attacks. The Nonce is the seed used for the random number generator that fills the table.
The FastCDC chunker has the potential to be up to 100x faster than our buzhash implementation
TODO Static Size
The static size chunker will always may the chunks the same, configurable, size
TODO Disk Image Chunker
The disk image chunker will understand disk image formats, and chunk them in an intelligent way.
- TODO Raw Image Chunker
The raw image chunker will attempt to detect raw disk images (e.g. iso, img, etc..) and put any metadata in its own chunks, and then attempt to make the chunk size match up with the block size of the image.
- TODO VMA Chunker
This chunker should understand the Proxmox VMA format and be able to chunk it intelligently to maximize dedeuplication.
Development Process / Contributing
As it is only me developing at the moment, the current development model isn’t very structured. In the future it will consist of a branch-per-featured model with branches being required to past a minimum set of tests before being merged into master.
Pull requests and issues are welcome, and by contributing to this project you agree to license your work under the MIT license.
Release 0.1.0 should be a somewhat usable product. It will still only operate in append only mode, but will have support for an array of encryption, compression, and HMAC algorithm types. The repository should be able to verify itself as a dedicated operation. The filesystem target should handle sparse data correctly.
By this point, the on disk format should be more or less nailed down. Breaking changes should not occur without a compelling technical reason, a version bump (minor if before 1.0.0, major if afterwards), and a two way conversion utility being provided.
- TODO libasuran
libasuran 0.1.0 should have the following features:
- Somewhat stable on disk format
- Support for zlib, lzma, and lz4 compression
- Support for chacha20-poly1305 encryption
- Should have cargo benchmarks
- Should have a working sparse data API
- Should have a method for verifiying the integreity of the repo
- Repository should be transactionalized
- TODO asuran
asuran 0.1.0 should have the following features:
- Support for setting compression type/level
- Support for setting encryption type
- Support for setting HMAC algorithm
- Runtime tests/benchmarks
- Repository verification command
- Support for setting the chunking algorithm
In many ways, this project is intended to be a mashup of what I consider to be the best features of the two applications, while attempting to make a modifiable and extendable framework that can be embedded in other applications easily.
Features Borg has that Restic is missing
- Performance Borg generally has way better performance than Restic, in my work load I have personally found this to be to a disturbing extent.
- Optional/Switchable Encryption Don’t get me wrong, being able to safely store sensitive data on untrusted storage is really nice, but sometimes you really are backing up to trusted storage (e.g. an external hard drive that is already encrypted at the file system or drive level), and double encryption is just extra overhead.
- Optional/Switchable Compression Restic doesn’t support compression at all, which, in my opinion, makes it a no-go for many workloads
Features Restic has that Borg is missing
- Switchable Storage Backends This one is a big deal for me. As a home gamer, being able to directly backup my datahoarder levels of files to an unlimited GDrive or the like is a huge deal. This is also the only reason I use Restic for some of my backups
- Multiple computers writing to the same repository Borg’s repository locking and chunk cache mechanisms make writing to the same repository with multiple computers a huge pain in the ass. Not having all your computers backing up to the same repository decreases deduplication by an extremely large factor and is just generally not good.
Features I want that neither has
Tar import and export
This isn’t entirely true, borg has tar export and is working on tar import, but it lacks one feature that is critical to my workflow, reproducing the same tar file. My work flow involves a program that produces backups as tar files, and when restoring them looks for a special file that must be the first in the tar. I would like the ability to import and export tars and keep the metadata of the tar the same, while still being able to take the tar apart and deduplciate the individual files within it, and use the compression defined by the repository.
While borg is python based and doesn’t really used threads, restic has multithreading, but in my opinion, doesn’t use it well
Comparison with rdedup
rdedup is a very good tool, but falls sort in several areas for me.
No built in directory traversal
rdedup depends on external tools like tar to make backups. In my experience this makes for a poor deduplication rate compared to borg in my workflow.
No current support for cloud backends
This one is almost cheating because asuran does not currently have support for cloud backends, but asuran was designed from the ground up to be storage-agnostic.
No intelligent chunking
rdedup has good support for choosing from a few good content defined slicers, but lacks the framework for intelligent slicing of known data types, such as disk images that can be sliced blockwise, or intelligent picking apart of backups emitted by other applications in a way to maximize deduplication.
Little/no integration support
This complaint also somewhat applies to borg and restic, but to a lesser extent. libasuran is designed to be called into from other applications, such as a carbonite style automatic backup utility, allowing the easy creation of end user friendly applications that support the full suite of asuran features.