Bitfield
What is a bitfield?
Space-efficient data structure used to figure out which data you have and what data you don't. Meant to always be kept in memory because it's small enough.
At its core, bitfields are a way of efficiently describing sets of numbers. You can think of them as a series of bits. When a number at a position is 1, it means that position is in the set. If a number is 0, that position isn't.
Example
bits: 00101
index: 01234
The set above contains 2
and 4
. It's stored left to right, because that's
the way it's enumerated.
In Dat we use bitfields to describe which pieces of data we have, and which pieces of data we don't. Also it's used internally to index data structures, such as the Merkle Tree.
Indexed Bitfields
The most common operations in Dat for bitfields is to either find a piece of data that's missing, or checking if we have a piece of data.
Checking if we have a piece of data is straightforward, as all we have to do is
look in the bitfield in the position of the data and see if it's a 1
.
Finding a piece of data we're missing is a bit more tricky. Basically it'll require a linear scan of the whole bitfield. In order to speed this up, we use an Indexed Bitfield.
Structure
At a high level Indexed Bitfields are a binary tree structure where each node is made up out of 2 bits.
11
indicates all bits under this node are1
s.00
indicates all bits under this node are0
s.10
indicates bits under this node are a mixture of1
s and0
s.01
is currently unused and reserved for (possible) future purposes.
We call this the Tree Index Scheme.
Consider this Indexed Bitfield, written as a sequence of bits:
01011101000000
Because the bits are indexed as a flat-tree, we can print it as a tree structure:
01
01 00
01 11 00 00
By looking at the root node we can tell that there's nodes in the tree, but not yet which nodes are in the tree. By going one level lower however, it becomes clear that there's nodes in one side of the tree, but no nodes in the other side of the tree. This means we only need to check the children of the left node to find out exactly which nodes we have.
A fun fact here is also: a completely zeroed-out buffer is a valid Indexed Bitfield - it just means it's completely empty.
Optimizing the Structure
Looking at a byte and looking at a bit is the same cost in a computer. You want to optimize for getting the most information possible when looking at a byte.
Therefore in order to get the most performance out of our structure, we want to construct our tree using bytes instead of pairs of bits.
Consider the following scheme. Given two bytes: A
B
. Take each of them, and
split each of them into pairs of two bits. We'll use a1 a2 a3 a4
to indicate
the pairs in A
. And b1 b2 b3 b4
to indicate the pairs in B
.
The parent of A
and B
is C
. C
is constructed by applying the Tree Index
Scheme to each pair of bits.
[a1 + a2, a3 + a4, b1 + b2, b3 + b4]
[a1, a2, a3, a4] [b1, b2, b3, b4]
In the example above, we use the +
operator to indicate the application of the
Tree Index Scheme.
In the future we might make this even more efficient using SIMD
instructions,
which can operate on more bits at the same time.
Lookup Tables
An efficient implementation of the previous scheme can be done using lookup tables for values between between 0 and 256.
This is all solely for performance and completely optional. The important part is that the indexing scheme is followed.
Types of Bitfields
We have 3 bitfields:
- Data Bitfield: Indicates which data you have, and which data you don't.
- Indexed Bitfield: Helps efficiently search through the Data Bitfield using the Tree Index Scheme.
- Merkle Tree Bitfield: Indicates which nodes in the Merkle Tree you have, and which nodes you don't.
This means that whenever you update the Data Bitfield, you must also update the Indexed Bitfield.
Updating a Byte
If we want to set an index in a bitfield to false
, it would mean we needed to
flip a bit to 0
. Because we can only operate on bytes, the easiest way to
achieve this is to apply a bitmask.
Consider the following lookup table, in binary notation:
# #![allow(unused_variables)] #fn main() { let data_update = vec![ 0b01111111, // 127 0b10111111, // 191 0b11011111, // 223 0b11101111, // 239 0b11110111, // 247 0b11111011, // 251 0b11111101, // 253 0b11111110, // 254 ]; #}
There are 8 entries in this table, all of which have a different position of which bit is set to zero. When you want to flip a bit to zero, you take the index of the bit you want to flip, look up the entry in the table, and bitwise AND the two numbers.
Serialization
For every piece of data there's going to be 1 bit in the Data Bitfield. And 2 bits in the Merkle Tree Bitfield because there's a parent node and a leaf node. There are going to be as many parents as there will be leaves.
Every time there's 16 bits in the Data Bitfield, the Indexed Bitfield needs 2
bits to indicate if it's all 1
s, 0
s, or a mixture. And 2 bits for the Tree
Index Scheme, totalling 4 bits in the Indexed Bitfield.
So this translates to the following ratios:
- Data: 1024 bytes.
- Merkle Tree: 2048 bytes.
- Indexed Tree: 256 bytes.
Run Length Encoding
When sending data over the wire, we want to compress the bitfields further. An efficient way of doing this is by using Run Length Encoding (RLE). TODO: explain the module. For now read the README.