
//
// Copyright (c) 2023 Nathan Fiedler
//

//! This crate implements multiple versions of the FastCDC content-defined
//! chunking algorithm in pure Rust. A critical aspect of the behavior of this
//! algorithm is that it returns exactly the same results for the same input.
//!
//! To learn more about content-defined chunking and its applications, see
//! [FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data
//! Deduplication](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)
//! from 2016, as well as the subsequent improvements described in the
//! [paper](https://ieeexplore.ieee.org/document/9055082) from 2020.
//!
//! ## Migration from pre-3.0
//!
//! If you were using this crate before the 3.0 release, you will need to make
//! a small adjustment to continue using the same implementation as before.
//!
//! Before the 3.0 release:
//!
//! ```no_run
//! # use fastcdc::ronomon as fastcdc;
//! # use std::fs;
//! # let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = fastcdc::FastCDC::new(&contents, 8192, 16384, 32768);
//! ```
//!
//! After the 3.0 release:
//!
//! ```no_run
//! # use std::fs;
//! # let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = fastcdc::ronomon::FastCDC::new(&contents, 8192, 16384, 32768);
//! ```
//!
//! The cut points produced will be identical to previous releases, as the
//! `ronomon` implementation has not changed in that respect. Note, however,
//! that the other implementations _will_ produce different results.
//!
//! ## Implementations
//!
//! This crate started as a translation of a variation of FastCDC implemented
//! in the [ronomon/deduplication](https://github.com/ronomon/deduplication)
//! repository, written by Joran Dirk Greef. That variation makes several
//! changes to the original algorithm, primarily to accommodate JavaScript. The
//! Rust version of this variation is found in the `ronomon` module in this
//! crate.
//!
//! For a canonical implementation of the algorithm as described in the 2016
//! paper, see the `v2016` module.
//!
//! For a canonical implementation of the algorithm as described in the 2020
//! paper, see the `v2020` module. This implementation produces the same cut
//! points as the 2016 version, but does so a bit faster.
//!
//! If you are using this crate for the first time, the `v2020` implementation
//! would be the most appropriate. It uses 64-bit hash values and tends to be
//! faster than both the `ronomon` and `v2016` versions.
//!
//! ## Examples
//!
//! A short example of using the fast chunker is shown below:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020;
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = v2020::FastCDC::new(&contents, 4096, 16384, 65535);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```
//!
//! The example above is using normalized chunking level 1 as described in
//! section 3.5 of the 2020 paper. To use a different level of chunking
//! normalization, replace `new` with `with_level` as shown below:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020::{FastCDC, Normalization};
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = FastCDC::with_level(&contents, 8192, 16384, 32768, Normalization::Level3);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```
//!
//! Notice that the minimum and maximum chunk sizes were changed in the example
//! using the maximum normalized chunking level. This is due to the behavior of
//! normalized chunking, in which the generated chunks tend to be closer to the
//! expected chunk size. It is not necessary to change the min/max values; it
//! is just something to be aware of. With lower levels of normalized chunking,
//! the size of the generated chunks will vary more. See the documentation of
//! the `Normalization` enum for more detail, as well as the FastCDC paper.
//!
//! ## Minimum and Maximum
//!
//! The values you choose for the minimum and maximum chunk sizes will depend on
//! the input data to some extent, as well as the normalization level described
//! above. Depending on your application, you may want to have a wide range of
//! chunk sizes in order to improve the overall deduplication ratio.
//!
//! Note that changing the minimum chunk size will almost certainly result in
//! different cut points. It is best to pick a minimum chunk size for your
//! application that can remain relevant indefinitely, lest you produce
//! different sets of chunks for the same data.
//!
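//! For instance, chunking the same input with two different minimum sizes will
//! generally produce different offset lists. A sketch using the `v2020` module
//! and the same fixture file as the examples above:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020::FastCDC;
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! // Same data, average, and maximum sizes, but different minimum sizes.
//! let offsets_a: Vec<_> = FastCDC::new(&contents, 4096, 16384, 65535).map(|c| c.offset).collect();
//! let offsets_b: Vec<_> = FastCDC::new(&contents, 8192, 16384, 65535).map(|c| c.offset).collect();
//! // The two offset lists will almost certainly differ for non-trivial input.
//! println!("{} chunks vs {} chunks", offsets_a.len(), offsets_b.len());
//! ```
//!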
//! Similarly, setting the maximum chunk size too small may result in cut
//! points that are determined by the maximum size rather than the data itself.
//! Ideally you want cut points that are determined by the input data. However,
//! this is application dependent and your situation may be different.
//!
//! ## Large Data
//!
//! If processing very large files, the streaming version of the chunkers in the
//! `v2016` and `v2020` modules may be a suitable approach. They both allocate a
//! byte vector equal to the maximum chunk size, draining and resizing the
//! vector as chunks are found. However, using a crate such as `memmap2` can be
//! significantly faster than the streaming chunkers. See the examples in the
//! `examples` directory for how to use the streaming versions as-is, versus the
//! non-streaming chunkers which read from a memory-mapped file.
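//!
//! As a sketch, the streaming chunker in the `v2020` module reads from any
//! `std::io::Read` source, and its iterator yields `Result` values that must
//! be checked for I/O errors:
//!
//! ```no_run
//! use std::fs::File;
//! use fastcdc::v2020::StreamCDC;
//! let source = File::open("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = StreamCDC::new(source, 4096, 16384, 65535);
//! for result in chunker {
//!     // Each item is a Result wrapping the chunk data read from the source.
//!     let chunk = result.unwrap();
//!     println!("offset={} length={}", chunk.offset, chunk.length);
//! }
//! ```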

pub mod ronomon;
pub mod v2016;
pub mod v2020;