This module implements the canonical FastCDC algorithm as described in the paper by Wen Xia, et al., in 2020.
The algorithm incorporates a simplified hash judgement using the fast Gear hash, sub-minimum chunk cut-point skipping, normalized chunking to produce chunks of a more consistent length, and “rolling two bytes each time”. According to the authors, this should be 30-40% faster than the 2016 version while producing the same cut points. Benchmarks on several large files on an Apple M1 show about a 20% improvement, but results may vary depending on CPU architecture, file size, chunk size, etc.
There are two ways to use the `FastCDC` struct defined in this module. One is to invoke `cut()` directly while managing your own `start` and `remaining` values. The other is to use the struct as an `Iterator` that yields `Chunk` structs, which represent the offset and size of the chunks.
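The two usage styles described above can be sketched with a toy chunker. This is a minimal illustration using only the standard library, not this module's implementation: `ToyChunker`, `ToyChunk`, `gear`, and `chunks_via_cut` are hypothetical names, the Gear values come from an ad hoc mixing function rather than a real Gear table, the return shape of `cut` (hash plus end offset) is an assumption, and normalized chunking and two-byte rolling are left out.

```rust
/// Hypothetical stand-in for a Gear table lookup: a splitmix64-style mix of the byte.
fn gear(byte: u8) -> u64 {
    let mut z = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// A chunk described by its hash, offset, and length (illustrative type).
#[derive(Debug, PartialEq)]
pub struct ToyChunk {
    pub hash: u64,
    pub offset: usize,
    pub length: usize,
}

/// Toy Gear-hash chunker over an in-memory slice (illustrative type).
pub struct ToyChunker<'a> {
    source: &'a [u8],
    min_size: usize,
    max_size: usize,
    mask: u64,
    processed: usize, // used only by the Iterator implementation
}

impl<'a> ToyChunker<'a> {
    pub fn new(source: &'a [u8], min_size: usize, avg_size: usize, max_size: usize) -> Self {
        // About log2(avg_size) mask bits give a cut roughly every avg_size bytes.
        let mask = ((avg_size as u64).next_power_of_two() - 1) << 16;
        ToyChunker { source, min_size, max_size, mask, processed: 0 }
    }

    /// Style 1: the caller supplies `start` and `remaining`.
    /// Returns (hash, end offset of the chunk); this shape is an assumption.
    pub fn cut(&self, start: usize, remaining: usize) -> (u64, usize) {
        let mut hash: u64 = 0;
        if remaining <= self.min_size {
            return (hash, start + remaining);
        }
        let end = start + remaining.min(self.max_size);
        // Skip the first min_size bytes, then roll the hash one byte at a time.
        for i in (start + self.min_size)..end {
            hash = (hash << 1).wrapping_add(gear(self.source[i]));
            if hash & self.mask == 0 {
                return (hash, i + 1);
            }
        }
        (hash, end)
    }

    /// Drives `cut()` by hand, tracking `start` and `remaining` explicitly.
    pub fn chunks_via_cut(&self) -> Vec<ToyChunk> {
        let mut out = Vec::new();
        let mut start = 0;
        let mut remaining = self.source.len();
        while remaining > 0 {
            let (hash, end) = self.cut(start, remaining);
            out.push(ToyChunk { hash, offset: start, length: end - start });
            remaining -= end - start;
            start = end;
        }
        out
    }
}

/// Style 2: the struct tracks its own position and yields chunks.
impl<'a> Iterator for ToyChunker<'a> {
    type Item = ToyChunk;

    fn next(&mut self) -> Option<ToyChunk> {
        let remaining = self.source.len() - self.processed;
        if remaining == 0 {
            return None;
        }
        let (hash, end) = self.cut(self.processed, remaining);
        let chunk = ToyChunk { hash, offset: self.processed, length: end - self.processed };
        self.processed = end;
        Some(chunk)
    }
}
```

In this sketch, `ToyChunker::new(&data, 64, 256, 1024).chunks_via_cut()` and `ToyChunker::new(&data, 64, 256, 1024).collect::<Vec<_>>()` produce the same chunk list; the only difference is who keeps track of the current position.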
Note that attempting to use both `cut()` and `Iterator` on the same `FastCDC` instance will yield incorrect results.
The `cut()` function returns the 64-bit hash of the chunk, which may be useful in scenarios involving chunk size prediction using historical data, such as in RapidCDC or SuperCDC. This hash value is also given in the `hash` field of the `Chunk` struct. While this value has rather low entropy, it comes at no extra computational cost and can be put to some use with additional record keeping.
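As one illustration of the "additional record keeping" mentioned above, the following sketch remembers which chunk length followed a given chunk hash on an earlier pass and offers it as a hint on a later pass. It is an assumption-laden toy, not a rendition of RapidCDC or SuperCDC: `SizeHistory`, `record`, and `predict_next` are hypothetical names, and the input is simply a list of `(hash, length)` pairs in chunk order.

```rust
use std::collections::HashMap;

/// Hypothetical history table: maps a chunk hash to the length of the
/// chunk that followed it the last time that hash was seen.
#[derive(Default)]
pub struct SizeHistory {
    next_len: HashMap<u64, usize>,
}

impl SizeHistory {
    /// Record one pass over a file, given its (hash, length) pairs in order.
    pub fn record(&mut self, chunks: &[(u64, usize)]) {
        for pair in chunks.windows(2) {
            let (prev_hash, _) = pair[0];
            let (_, following_len) = pair[1];
            // A later observation overwrites an earlier one for the same hash.
            self.next_len.insert(prev_hash, following_len);
        }
    }

    /// Return a length hint for the chunk expected after `hash`, if any.
    pub fn predict_next(&self, hash: u64) -> Option<usize> {
        self.next_len.get(&hash).copied()
    }
}
```

Because the hash has low entropy, unrelated chunks can share a value, so a hint from such a table would need to be verified (for example against a strong digest of the chunk contents) before being trusted.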
The `StreamCDC` implementation is similar to `FastCDC` except that it will read data from a `Read` into an internal buffer of `max_size` and produce `ChunkData` values from the `Iterator`.
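The buffering pattern described above (read from a `Read` into a buffer capped at `max_size`, then hand out owned chunk data) can be sketched with the standard library alone. This is not the `StreamCDC` implementation: `ToyStream`, `ToyChunkData`, and `gear` are hypothetical names, the cut-point test is a simplified Gear-style hash without normalized chunking, and error handling is reduced to passing `io::Error` through.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a Gear table lookup: a splitmix64-style mix of the byte.
fn gear(byte: u8) -> u64 {
    let mut z = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// A chunk that owns its bytes (illustrative type).
pub struct ToyChunkData {
    pub offset: u64,
    pub length: usize,
    pub data: Vec<u8>,
}

/// Toy streaming chunker (illustrative type).
pub struct ToyStream<R: Read> {
    source: R,
    buffer: Vec<u8>, // never holds more than max_size bytes
    min_size: usize,
    max_size: usize,
    mask: u64,
    offset: u64,
    eof: bool,
}

impl<R: Read> ToyStream<R> {
    pub fn new(source: R, min_size: usize, avg_size: usize, max_size: usize) -> Self {
        let mask = ((avg_size as u64).next_power_of_two() - 1) << 16;
        ToyStream {
            source,
            buffer: Vec::with_capacity(max_size),
            min_size,
            max_size,
            mask,
            offset: 0,
            eof: false,
        }
    }

    /// Top the buffer up to max_size bytes, or until the reader is exhausted.
    fn fill(&mut self) -> io::Result<()> {
        let mut tmp = [0u8; 4096];
        while !self.eof && self.buffer.len() < self.max_size {
            let want = (self.max_size - self.buffer.len()).min(tmp.len());
            let n = self.source.read(&mut tmp[..want])?;
            if n == 0 {
                self.eof = true;
            } else {
                self.buffer.extend_from_slice(&tmp[..n]);
            }
        }
        Ok(())
    }

    /// Length of the next chunk within the buffered bytes.
    fn cut_len(&self) -> usize {
        let len = self.buffer.len();
        if len <= self.min_size {
            return len;
        }
        let mut hash: u64 = 0;
        for i in self.min_size..len {
            hash = (hash << 1).wrapping_add(gear(self.buffer[i]));
            if hash & self.mask == 0 {
                return i + 1;
            }
        }
        len // no cut point found: emit everything buffered (at most max_size)
    }
}

impl<R: Read> Iterator for ToyStream<R> {
    type Item = io::Result<ToyChunkData>;

    fn next(&mut self) -> Option<Self::Item> {
        // A caller should stop iterating on the first error.
        if let Err(e) = self.fill() {
            return Some(Err(e));
        }
        if self.buffer.is_empty() {
            return None;
        }
        let length = self.cut_len();
        let data: Vec<u8> = self.buffer.drain(..length).collect();
        let chunk = ToyChunkData { offset: self.offset, length, data };
        self.offset += length as u64;
        Some(Ok(chunk))
    }
}
```

Capping the buffer at `max_size` is what bounds memory use in this sketch: since no chunk may be longer than `max_size`, the next cut point always lies within the buffered bytes.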
Structs§
- An async-streamable version of the FastCDC chunker implementation from 2020 with streaming support.
- Represents a chunk returned from the FastCDC iterator.
- Represents a chunk returned from the StreamCDC iterator.
- The FastCDC chunker implementation from 2020.
- The FastCDC chunker implementation from 2020 with streaming support.
Enums§
- The error type returned from the `StreamCDC` iterator.
- The level for the normalized chunking used by FastCDC.
Constants§
- Largest acceptable value for the average chunk size.
- Smallest acceptable value for the average chunk size.
- Largest acceptable value for the maximum chunk size.
- Smallest acceptable value for the maximum chunk size.
- Largest acceptable value for the minimum chunk size.
- Smallest acceptable value for the minimum chunk size.