This specification describes a mechanism for splitting a byte stream into blocks of varying size with split boundaries based solely on the content of the input. It also describes a mechanism for organizing those blocks into a (probabilistically) balanced tree whose shape is likewise determined solely by the content of the input.
The general technique has been used by various systems such as:
...and many others. The technique permits the efficient representation of slightly different versions of the same data (e.g. successive revisions of a file in a version control system), since changes in one part of the input generally do not affect the boundaries of any but the adjacent blocks.
However, the exact functions used by these systems differ in details, and thus do not produce identical splits, making interoperability for some use cases more difficult than it should be.
The role of this specification is therefore to fully and formally describe a concrete function on which future systems may standardize, improving interoperability.
This section discusses notation used in this specification.
We define the following sets:
All arithmetic operations in this document are implicitly performed modulo . We use standard mathematical notation for addition, subtraction, multiplication, and exponentiation. Division always denotes integer division, i.e. any remainder is dropped.
Numerals staring with the prefix 0x
are hexadecimal, e.g. 0xfe
for the (decimal) number 254
We use the notation to denote an ordered sequence of values.
denotes the length of the sequence , i.e. the number of elements it contains.
We also use the following operators and functions:
We use standard mathematical notation for summation. For example:
denotes the sum of integers in the range .
We define a similar notation for exclusive or:
denotes the bitwise exclusive or of the integers in , i.e.
Finally, we define the “prefix” of a non-empty sequence with respect to a given predicate to be the initial subsequence of up to and including the first member that makes true. And we define the “remainder” to be everything left after removing the prefix.
Formally, given a sequence and a predicate ,
for the smallest integer such that:
or if no such integer exists. (I.e., if nothing satisfies , the prefix is the whole sequence.) And:
where .
Note that when , .
The primary result of this specification is to define a family of functions:
...which is parameterized by a configuration , consisting of:
The configuration must satisfy .
We define the constant , which we call the "window size," to be 64.
We define the predicate on a non-empty byte sequence with respect to a configuration to be:
We define recursively, as follows:
If sequence and sequence are largely the same, will produce mostly the same chunks, choosing the same locations for chunk boundaries except in the vicinity of whatever differences there are between and .
This has obvious benefits for storage and bandwidth, as the same chunks can represent both and with few exceptions. But while only a small number of chunks may change, the sequence of chunks may get totally rewritten, as when a difference exists near the beginning of and and all subsequent chunks have to “shift position” to the left or right. Representing the two different sequences may therefore require space that is linear in the size of and .
We can do better, requiring space that is only logarithmic in the size of and , by organizing the chunks in a tree whose shape, like the chunk boundaries themselves, is determined by the content of the input. The trees representing two slightly different versions of the same input will differ only in the subtrees in the vicinity of the differences.
A “chunk” is a member of the sequence produced by .
The “hashval” of a byte sequence is:
(i.e., the hash of the last bytes of ).
A “node” in a hashsplit tree at non-negative “height” is a sequence of children. The children of a node at height 0 are chunks. The children of a node at height are nodes at height .
A “tier” of a hashsplit tree is a sequence of nodes at a given height .
The function on a node produces the “rightmost leaf chunk” defined recursively as follows:
The “level” of a given chunk is , where is the largest integer such that
(i.e., the level is the number of trailing zeroes in the hashval in excess of the threshold needed to produce the prefix chunk ).
The level of a given node is the level of its rightmost leaf chunk:
The predicate on a sequence of chunks or of nodes with respect to a height is defined as:
For conciseness, define
This section contains two descriptions of hashsplit trees: an algebraic description for formal reasoning, and a procedural description for practical construction.
The tier of hashsplit tree nodes for a given byte sequence is equal to
The tier of hashsplit tree nodes for a given byte sequence is equal to
(I.e., each node in the tree has as its children a sequence of chunks or lower-tier nodes, as appropriate, up to and including the first one whose “level” is greater than the node’s height.)
The root of the hashsplit tree is for the smallest value of such that
For this description we use to denote a single node at height . The algorithm must keep track of the “rightmost” such node for each tier in the tree.
To compute a hashsplit tree from a byte sequence , compute its “root node” as follows.
To “close” a node :
The cp32
hash function is based on cyclic polynomials. The family of related functions is sometimes also called "buzhash." cp32
is the recommended hash function for use with hashsplit; use it unless you have clear reasons for doing otherwise.
We define the function as:
Where and the sequence is defined in the appendix.
The sequence was chosen at random. Note that , so is always defined.
can be computed in a rolling fashion; for sequences
and
Given , and , we can compute as:
.
Note that the splitting algorithm only computes hashes on sequences of size , and since 64 is a multiple of 32 this means that for the purposes of splitting, the above can be simplified to:
.
The rrs
family of checksums is based on an algorithm first used in rsync, and later adapted for use in bup and perkeep. rrs
was originally inspired by the adler-32 checksum. The name rrs
was chosen for this specification, and stands for rsync rolling sum
.
A concrete rrs
checksum is defined by the parameters:
Given a sequence of bytes and a choice of and , the rrs
hash of the sub-sequence is , where:
The concrete hash called rrs1
uses the values:
rrs1
is used by both Bup and Perkeep, and implemented by the Go package go4.org/rollsum
.
rrs
is a family of rolling hashes. We can compute hashes in a rolling fashion by taking advantage of the fact that, for :
So, a typical implementation will work like this:
In all cases the ring buffer should initially contain all zero bytes, reflecting the use of for in "Splitting", above.
Choosing has the advantages of simplicity and efficiency, as it allows to be computed using only shifts and bitwise operators:
The definition of as used by is:
0x6b326ac4, 0x13f8e1bd, 0x1d61066f, 0x87733fc7, 0x37145391, 0x1c115e40,
0xd2ea17a3, 0x8650e4b1, 0xe892bb09, 0x408a0c3a, 0x3c40b72c, 0x2a988fb0,
0xf691d0f8, 0xb22072d9, 0x6fa8b705, 0x72bd6386, 0xdd905ac3, 0x7fcba0ba,
0x4f84a51c, 0x1dd8477e, 0x6f972f2c, 0xaccd018e, 0xe2964f13, 0x7a7d2388,
0xebf42ca7, 0xa8e2a0a2, 0x8eb726d3, 0xccd169b6, 0x5444f61e, 0xe178ad7a,
0xd556a18d, 0xbac80ef4, 0x34cb8a87, 0x7740a1a9, 0x62640fe1, 0xb1e64472,
0xdee2d6c8, 0x27849114, 0xb6333f4b, 0xbb0b5c1d, 0x57e53652, 0xfde51999,
0xef773313, 0x1bbaf941, 0x2e9aa084, 0x37587ab8, 0xa61e7c54, 0xb779be61,
0xd8795bfd, 0x1707c1f6, 0x50fe9c54, 0x32ff3685, 0x94f55c22, 0x2a32ce1a,
0x0b9076ab, 0x14363079, 0xae994b2c, 0x4a8da881, 0x4770b9c4, 0xf4d143dd,
0x70a90c0b, 0xa094582a, 0x4b254d10, 0x2454325e, 0x1725a589, 0x9a3380da,
0x948eeade, 0x79f88224, 0x7b8dc378, 0xc2090db6, 0x41f7a7ac, 0xd4d9528c,
0x7f0bace7, 0xd3157814, 0xd7757bc4, 0xb428db06, 0x2e2b1d02, 0x0499bcf5,
0x310f963e, 0xe5f31a83, 0xe0cd600f, 0x8b48af14, 0x568eb23a, 0x01d1150b,
0x33f54023, 0xa0e59fdf, 0x8d17c2dd, 0xfb7bd347, 0x4d8cd432, 0x664db8de,
0xd48f2a6c, 0x16c3412d, 0x873a32fc, 0x10796a21, 0xed40f0f8, 0x5ca8e9b2,
0x0f70d259, 0x0df532c2, 0x016d73aa, 0x45761aa5, 0x189b45a7, 0x4accd733,
0x641f90e3, 0x592ed9ee, 0x4b1d72ad, 0x42ff2cd4, 0x0654b609, 0x799012c0,
0x595f36a4, 0x082bdbd6, 0x0375ddd3, 0xc16c1fb5, 0x57492df8, 0xa2d56a98,
0xdfb2aa28, 0x3728f35f, 0xdc49ea71, 0x9aee8377, 0xd62de2ab, 0x2c3aa155,
0x407d9eed, 0xbc5b3832, 0x42961924, 0x1498172a, 0xc7126716, 0x95494b56,
0xd40442fb, 0xb22a3ed1, 0x0ad3e0ae, 0x77a6136a, 0xfb1bc3f0, 0x1a715c38,
0xccbbd21d, 0x061ff037, 0x85d700cb, 0x8a8fb396, 0x956bbe48, 0xf2556ed8,
0x3319c88b, 0xe0d6d3e9, 0x4783b316, 0x03a73543, 0x253be5ed, 0x41322aea,
0xdfc00c7a, 0x972b9413, 0xccca42f5, 0x0a1cdf35, 0xa2dc31b8, 0xf48397eb,
0xbe3f2b3e, 0xd2950b9f, 0xccd269cf, 0x51a64ca9, 0xea46d96e, 0xcaec892e,
0x3fae3a62, 0xf12e53db, 0x3753464c, 0x214fbd91, 0x609ce2f7, 0x6158b44c,
0xa74b8027, 0x79f36912, 0x16cac162, 0x5e76df4f, 0xbc4184fb, 0x912cac7d,
0xf97e5704, 0x664dd25f, 0x7d837805, 0x5386cfe0, 0x4e585d77, 0xa0fa527e,
0xeb5c8401, 0xa186cc51, 0x05ef3f1f, 0xc1efc774, 0x38730c2c, 0xad9c5539,
0x27cd4938, 0x7317b4f2, 0x852c186f, 0xa4c9b0f4, 0xf592f010, 0xf6fe86f3,
0xb14ba86c, 0x07109a27, 0x0d00568d, 0xd92ee49f, 0xdc643eb3, 0x8d81c333,
0xcd1d7bbd, 0x87ff9cda, 0x80fa4285, 0x25258d5b, 0xd9e4065a, 0x78955c18,
0x84874c2a, 0xfdae136b, 0x48eeb3d3, 0xc2623958, 0x5a74f96d, 0x0bcb49f5,
0x3041cefc, 0xa5b0a1a8, 0x2d29bae6, 0x916ace93, 0x0e70564d, 0xa24894ae,
0x9897044d, 0xcba97c2a, 0x52a313b1, 0x318ec481, 0xc4729ec1, 0xd90ad78a,
0x55eb9f90, 0x4f159fda, 0xa90fbd44, 0xd0ca6208, 0x5c597269, 0xe05a471e,
0x26a5e224, 0x97144944, 0xece2c486, 0xf65c9a9e, 0x82a3fbbb, 0x925d1a62,
0xd6c4c29b, 0x61b9292d, 0x161529c9, 0x37713240, 0x68ec933b, 0xed80a4e5,
0x02b2db41, 0x47cfd676, 0xbfe26b41, 0x5e8468bb, 0x6e0d15a4, 0x40383ef4,
0x81e622fb, 0x194b378c, 0x0c503af5, 0x8e0033a7, 0x003aaa5e, 0x9d7b6723,
0x0702e877, 0x34b75166, 0xd1ba98d8, 0x9b9f1794, 0xe8961c84, 0x9d773b17,
0xf9783ee9, 0xdff11758, 0x49bea2cf, 0xa0e0887f