Rust implementation of leveldb memdb source code analysis

preface
The memdb module in leveldb uses skiplist as a kv memory storage, and the related code implementation is very beautiful. The following is described above:

  • Compare the implementation of query, insert and delete in c + + and golang
  • Analyze the places in the golang version that can be optimized, and then optimize the t rust version

Then in this article

  • How to reference the version of goleveldb and rewrite memdb (arena version) with trust
  • Use t rust to rewrite a non arena version of memdb, which is the classic implementation of linked list structure

arena implementation

reference resources goleveldb DB Considering concurrency security, DB and DB structures are defined in the Rust implementation

  • db contains all members, non thread safe, and provides query related methods;
  • db adds Mutex encapsulation to db, which is thread safe and provides core insertion, deletion and more query functions;

The details are as follows:

db
https://github.com/kingeaster...

struct db<T: Comparer> {
    cmp: T,  // Comparator for comparing key s
    // rnd:rand
    // Store the data of the actual key and value
    kv_data: Vec<u8>,
    // Used to record key and value in kV_ The index in data. The format of each node is as follows, where level represents the number of layers of the current node, followed by level numbers, respectively indicating that the next node of each layer in the level layers of the current node is on the node_ Location in data
    // kvOffset1|len(key1)|len(value1)|level|next_node_1|...|next_node_i|...|next_node_h|
    node_data: Vec<usize>, // The first 16 bytes correspond to the head node
    // In the query process, the nodes experienced in the search process (actually the position of the node in node_data) are recorded, which are mainly used for insertion and deletion
    prev_node: RefCell<[usize; MAX_HEIGHT]>,
    // The number of layers of the current skiplist
    max_height: usize,
    // Number of nodes in skiplist
    n: usize,
    // Total data size
    kv_size: usize,
}

This db definition is very similar to that of goleveldb, without much complexity.

However, it should be noted that the prev in db here_ Node member variable is used to record the forward nodes of each layer during query or deletion. It is a common array in golveldb prevNode In our Rust definition, there is an array RefCell < [use; max_height] >, which is encapsulated by RefCell. The reason is that there are two usage scenarios for db search method. One is for pure search query, so the current db is read-only and uses constant borrowing & self. If it is used for insertion or deletion and needs to insert data into RefCell, db becomes variable, Variable borrowing & mut self is required. In order to keep db borrowing semantics unchanged, RefCell is used to provide internal variable features. So why do you want to keep db borrowing unchanged and use variable borrowing directly regardless of pure query or modified query?

Because constant borrowing in Rust can be shared, but variable borrowing cannot be shared. If only variable borrowing & mut self is used directly, the usage scenario of pure query operation will be limited. Even if an operation is only query, db should be declared as mut.

Another method is prev_node is not used as a member variable of db, but as an additional function during query. For details, please refer to Practice of node version

Encapsulates the reading and setting of the next node

github address

In order to improve the readability and maintainability of the code, the next node in the level layer will be obtained_ The operation of the position in data and the operation of setting the next node are encapsulated:

// Calculate the node node in the level layer and the next node in the node_data to improve the readability of the code
    fn next_node(&self, node: usize, i: usize) -> usize {
        // +NNEXT means at node_ In data, start from the node position, skip the four bytes kvOffset1|len(key1)|len(value1)|level| and move i positions later to reach next_node_i got it
        self.node_data[node + NNEXT + i]
    }

    fn set_next_node(&mut self, node: usize, i: usize, next: usize) {
        self.node_data[node + NNEXT + i] = next;
    }

Encapsulates the data to read key or value

github address

In order to provide the feasibility and maintainability of the code, the operations of obtaining the real data of key and obtaining the real number of value are encapsulated

//According to node_data, find the position in kv_data to get the key

fn get_key_data(&self, node: usize) -> &[u8] {
    let offset = self.node_data[node];
    &self.kv_data[offset..offset + self.node_data[node + NKEY]]
}
    // According to node_data, find the position in kV_ The offset and length in data to get value
    fn get_value_data(&self, node: usize) -> &[u8] {
        let key_offset = self.node_data[node] + self.node_data[node + NKEY];
        &self.kv_data[key_offset..key_offset + self.node_data[node + NVAL]]
    }

Query nodes with a specific key greater than or equal to

github address

You can see that after encapsulation, find_ great_ or_ The implementation of equal fits more closely with the algorithm description of skiplist.

// save_pre flag whether to record traversed nodes during search
    pub fn find_great_or_equal(&self, key: &internalKey, save_pre: bool) -> (usize, bool) {
        let mut node = 0;
        // Search from top to bottom
        let mut i = self.max_height - 1;
        // println!("max_height {}", i);

        loop {
            // The next node is on node_ Location in data
            let next = self.next_node(node, i);
            let mut cmp = Ordering::Greater;
            // There is no end in the current linked list
            if next > 0 {
                // key comparison with the next node
                cmp = self.cmp.compare(self.get_key_data(next), key.data());
            }

            // Greater than the next node, continue to jump right along the current layer
            if cmp == Ordering::Less {
                node = next;
            } else {
                // Less than or equal to the next node or the next node is empty
                // if save_pre {
                //     //For the search for insertion or deletion, even if the same one is encountered, continue to the next level of comparison, and cannot return immediately
                //     //Therefore, the judgment here should precede cmp == Ordering::Equal
                //     self.prev_node.borrow_mut()[i] = node;
                // } else if cmp == Ordering::Equal {
                //     // find_great_or_equal and find_ One difference of less is that next is returned here
                //     return (next, true);
                // }

                // Change to the following way, which is more readable
                if (!save_pre) && cmp == Ordering::Equal {
                    return (next, true);
                }
                if save_pre {// If you need to keep the forward node, record it to pre_ In node
                    self.prev_node.borrow_mut()[i] = node;
                }

                if i == 0 {
                    return (next, cmp == Ordering::Equal);
                }

                i -= 1;
            }
        }
    }

In the above implementation, the processing of the current node less than or equal to the next node is reconstructed compared with the writing method of golang. The reference golang is written as follows:

// Less than or equal to the next node or the next node is empty
                if save_pre {
                     // For the search for insertion or deletion, even if the same one is encountered, continue to the next level of comparison, and cannot return immediately
                     // Therefore, the judgment here should precede cmp == Ordering::Equal
                     self.prev_node.borrow_mut()[i] = node;
                 } else if cmp == Ordering::Equal {
                     // find_great_or_equal and find_ One difference of less is that next is returned here
                     return (next, true);
                 }

The main meaning of the code here is: if it is only a pure query operation, you can directly return the matching one; However, if the query is for insertion or deletion, even if a matching node is found, it must jump to the next layer until the lowest layer can be returned. After understanding the meaning of the code, we rewrite it
//Change to the following way, which is more readable

                if (!save_pre) && cmp == Ordering::Equal {
                    return (next, true);
                }
                if save_pre {
                    self.prev_node.borrow_mut()[i] = node;
                }

other

find_lessthanfind_last The Rust implementation of these two method s is consistent with goleveldb. I won't talk more. You can directly click to see the source code.

Db

github Db

pub struct Db<T: Comparer> {
    db: sync::RwLock<db<T>>,
}

Db is mainly used to provide search methods. It is non thread safe. Db performs insertion and deletion. It is thread safe.

Insert put

github insert

First, obtain the write lock. In Rust, Sync:: rwlock < T > obtains the write lock by calling the write() method:

let mut db = self.db.write().unwrap();

Search key for the location of the insert

let (node, exist) = db.find_great_or_equal(key, true);

If the key already exists, you can reuse the previous node_data

let (node, exist) = db.find_great_or_equal(key, true);
        if exist {
            // TODO Optimization: if the length of the new value is less than or equal to the old value, the direct value is directly overwritten without reallocation
            // If the key already exists, directly overwrite the previous value
            // Data is appended, so the current kV_ The length of data is the length of the node in Skv_ New offset on data
            let offset = db.kv_data.len();
            // Append key and value data
            db.kv_data.append(&mut key.data().to_vec());
            let mut v_len = 0;
            if let Some(value) = value {
                v_len = value.len();
                db.kv_data.append(&mut value.to_vec());
            }
            // Update offset of node
            db.node_data[node] = offset;
            // The length of value may also change
            // Previous length
            let v_old = db.node_data[node + NVAL];
            // Update to new length
            db.node_data[node + NVAL] = v_len;
            // Total update data size
            db.kv_size += v_len - v_old;
            return Ok(());
        }

For newly inserted points, the assigned floor height is calculated first

let h = db.rand_heigth();

Similarly, if the allocated layer height is higher than the maximum layer height of the current skiplist, it should be given to pre_node compensates for the missing forward node, that is, the head node is added.

// Processing head nodes
        if h > db.max_height {
            for i in db.max_height..h + 1 {
                db.prev_node.borrow_mut()[i] = 0;
            }
            db.max_height = h;
            println!("height {}", h);
        }

Get new node data in kV_ Start offset in data, and then append key and value data to kV_ Behind the data

// The new node is in kV_ Start offset in data
        let kv_offset = db.kv_data.len();
        // Append key and value data
        db.kv_data.append(&mut key.data().to_vec());
        let mut v_len = 0;
        if let Some(value) = value {
            v_len = value.len();
            db.kv_data.append(&mut value.to_vec());
        }

Record current node_ The length of data, that is, the new node is on the node_ The starting offset in data, and then in node_ Add a new node to data,

// Create a new node. Because it is an append method, the current node_ The length of data is the length of the new node on the node_ Location of data
        let node = db.node_data.len();
        // Add new node
        db.node_data.push(kv_offset); // At kV_ Offset in data
        db.node_data.push(key.data().len()); // Length of key
        db.node_data.push(v_len);// Length of value
        db.node_data.push(h); // Floor height of current node

Then perform linked list insertion according to the writing method of goleveldb, which is written in goleveldb

// Traverse the forward node of each layer
        for i, n := range p.prevNode[:h] {
            m := n + nNext + i // n node is the next node in layer i
            p.nodeData = append(p.nodeData, p.nodeData[m]) // The next node of layer n of the current node points to m
            p.nodeData[m] = node // The next node of n node in layer i points to the current node node
        }

Traversal prev_ Insert the forward node saved in node, that is, point the current node to the next node of the forward node, and then point the next node of the forward node to the current node:

// Traverse the forward node of each layer, iter() will only return Item, and the subscript can be returned at the same time by encapsulating enumerate
         for (i,n )in db.prev_node.borrow()[0..h].iter().enumerate(){
             let next = db.next_node(*n, i);
             db.node_data.push(next);
             db.set_next_node(*n, i, node);
         }

Compiler error:

error[E0502]: cannot borrow `db` as mutable because it is also borrowed as immutable
            --> src/memdb/memdb.rs:343:13
             |
         341 |         for (i,n )in db.prev_node.borrow()[0..h].iter().enumerate(){
             |                      ---------------------
             |                      |
             |                      immutable borrow occurs here
             |                      a temporary with access to the immutable borrow is created here ...
         342 |             let next = db.next_node(*n, i);
         343 |             db.node_data.push(next);
             |             ^^ mutable borrow occurs here
         344 |
         345 |         }

In the scope of the for loop, there are both immutable borrowing and variable borrowing of db, so you can only separate them; In addition, in order to improve the readability of the code, give node in advance_ The data extension h length is used to store h next nodes:

db.node_data.resize(node + NNEXT + h, 0);

The code for executing node insertion is rewritten as

// Traverse each layer
        for i in 0..h {
            let n = db.prev_node.borrow()[i];// Forward node
            let next = db.next_node(n, i); // The forward node is the next node in layer i
            db.set_next_node(node, i, next); // The next node of layer i of the current node points to next
            db.set_next_node(n, i, node); // The forward node points to the current node at the next node of layer i
        }

Update statistics

// Update data size and number
        db.kv_size += key.data().len() + v_len;
        db.n += 1;

delete

github delete

// delete
    pub fn delete(&mut self, key: &internalKey) -> Option<()> {
        let mut db = self.db.write().unwrap();
        let (node, exist) = db.find_great_or_equal(key, true);
        if !exist {
            return None;
        }

        // How many layers does the current node have
        let h = db.node_data[node + NHEIGHT];
        // Start deletion. Let the previous node refer to the next node of the next node of the previous node. Pre - > next = pre - > next - > next
        for i in 0..h {
            let pre = db.prev_node.borrow()[i];
            // let pre_next = db.next_node(pre, i);
            // if pre_next != node {
            //     print!("{}:{}", pre_next, node);
            // }

            // let next_next = db.next_node(pre_next, i);
            // db.set_next_node(pre, i, next_next);

            // Because the next node of the previous node is pre_node is the current node, so the above code can be optimized as
            let next_next = db.next_node(node, i);
            db.set_next_node(pre, i, next_next);
        }

        db.kv_size -= db.node_data[node + NKEY] + db.node_data[node + NVAL];
        db.n -= 1;

        Some(())
    }

In the code of delete, similar to put, it traverses pre_node, access through subscript to obtain the forward node of each layer

let pre = db.prev_node.borrow()[i];

In addition, due to the pre of the next node of the previous node_ Node is the current node, so the code to delete the node is removed from the

// Next node of forward node
             let pre_next = db.next_node(pre, i);
             let next_next = db.next_node(pre_next, i);
             db.set_next_node(pre, i, next_next);

Change to

// Use current node directly
            let next_next = db.next_node(node, i);
            db.set_next_node(pre, i, next_next);

other

Other methods are relatively simple. You can directly look at the source code

Node pointer implementation

Node definition

node,RcNode As follows:

type RcNode = Rc<RefCell<node>>;
// Each node
struct node {
    offset: usize,             // Corresponding kV_ Start position in data
    key_len: usize,            // Length of key
    value_len: usize,          // Length of value
    next: Vec<Option<RcNode>>, // The current node has a height layer, and the i-th element represents the next node of the i-th layer
}

Since the nodes in the linked list need to be referenced by other nodes and share ownership, Rc should be used. In addition, RefCell is used to provide internal variability because next needs to be changed during operation. Type rcnode = Rc < RefCell < node > >;

In the next attribute of node, the next node of each layer may be empty, so Option is used;

The node implements the following methods to obtain the actual key and value data. Here, because the return is & [U8], the declaration cycle ` a is used;

// According to node_data, find the position in kv_data to get the key
    fn get_key_data<'a>(&self, node: &node, kv_data: &'a [u8]) -> &'a [u8] {
        &kv_data[node.offset..node.offset + node.key_len]
    }

    // According to node_data, find the position in kV_ The offset and length in data to get value
    fn get_value_data<'a>(&self, node: &node, kv_data: &'a [u8]) -> &'a [u8] {
        &kv_data[node.offset + node.key_len..node.offset + node.key_len + node.value_len]
    }

db_skip

github db_skip

db_skip is defined as follows:

struct db_skip<T: Comparer> {
    cmp: T, // Comparator for comparing key s
    kv_data: Vec<u8>, // Store the data of the actual key and value. The offset starts from 1, and the offset of 0 indicates the head node
    head: RcNode, // Head,
    // The number of layers of the current skiplist
    max_height: usize,
    // Number of nodes in skiplist
    n: usize,
    // Total data size
    kv_size: usize,
}

Compared with agena version, pre is missing here_ Node, an additional head member is used to save the first node of the skiplist.

Encapsulates the data to read key or value

// According to node_data, find the position in kv_data to get the key
    fn get_key_data(&self, node: &node) -> &[u8] {
        &self.kv_data[node.offset..node.offset + node.key_len]
    }

    // According to node_data, find the position in kV_ The offset and length in data to get value
    fn get_value_data(&self, node: &node) -> &[u8] {
        &self.kv_data[node.offset + node.key_len..node.offset + node.key_len + node.value_len]
    }

Query nodes with a specific key greater than or equal to

github find_great_or_equal

pub fn find_great_or_equal(
        &self,
        key: &internalKey,
        pre_node: &mut Option<Vec<RcNode>>, // Note that option < & mut VEC < rcnode > > cannot be used
    ) -> (RcNode, bool) {
        let mut node = Rc::clone(&self.head); // Start from scratch node
        let mut next_node = Rc::clone(&node);
        let mut i = self.max_height - 1;

        loop {
            // Here, setting cmp to Ordering::Less in advance is a very clever way to automatically include the case that the next node is empty (regarded as infinity)
            let mut cmp = Ordering::Less;
            if let Some(ref next) = node.borrow().next[i] {
                // Next node exists
                cmp = self
                    .cmp
                    .compare(key.data(), self.get_key_data(&node.borrow()));

                next_node = Rc::clone(&next);
            }

            // Greater than the next node, continue to jump right along the current layer
            if cmp == Ordering::Greater {
                node = Rc::clone(&next_node);
                continue;
            }

            // Go here and explain: node is less than or equal to the next node or the next node is empty

            // If the forward node is not saved, it is just an ordinary search. If a match is found, it will be returned directly
            if (pre_node.is_none()) && cmp == Ordering::Equal {
                return (next_node, true);
            }

            // If you save the forward node node
            if let Some(ref mut pre) = pre_node {
                pre.push(Rc::clone(&node));
            }

            if i == 0 {
                return (next_node, cmp == Ordering::Equal);
            }

            i -= 1;
        }
    }

First put pre_node is passed into find as an input parameter_ great_ or_ equal , pre_node: & mut Option < Vec < rcnode > > because pre_node stores the forward nodes of each layer and is variable, so & mut, in addition, due to pre_node is optional, so use Option. Note that it can't be written as pre here_ Node: Option < & mut Vec < rcnode > >, which will cause pre error when extracting Vec from Option_ Node's move causes compilation failure.

Next, use node Borrow() obtains the constant borrowing of the current node, and then obtains the next node of layer I through next[i]. Because it is an Option type, use if let Some(ref next) to obtain the reference of the next node when the next node exists.

// If the next node stores
            if let Some(ref next) = node.borrow().next[i] {
                // The next node exists and is compared with the next node
                cmp = self.cmp.compare(key.data(), self.get_key_data(&next.borrow()));
                next_node = Rc::clone(&next);
            }

In addition, note that since the declaration cycle of next is only within if let {}, you need to pass next_ node = Rc::clone(&next); Write it down for the next iteration.

According to the comparison results, if it is greater than the next node, it jumps to the next node along the current layer

// Greater than the next node, continue to jump right along the current layer
            if cmp == Ordering::Greater {
                node = Rc::clone(&next_node);
                continue;
            }

Next, in the same way, give priority to judgment if the forward node is not saved and a matching node is found:

// Go here and explain: node is less than or equal to the next node or the next node is empty

            // If the forward node is not saved, it is just an ordinary search. If a match is found, it will be returned directly
            if (pre_node.is_none()) && cmp == Ordering::Equal {
                return (next_node, true);
            }

If the forward node is saved, use if let Some(ref mut pre)=pre_node from pre_node obtains the variable borrowing of Vec, and then puts the shared borrowing of node into:

// If you save the forward node node
            if let Some(ref mut pre) = pre_node {
                pre.push(Rc::clone(&node));
            }

Finally, if there is a bottom layer, it will return. If there is no bottom layer, it will skip the next layer:

// If you get to the last level, go back
            if i == 0 {
                return (next_node, cmp == Ordering::Equal);
            }

            i -= 1;

other

Other parts are relatively simple. Just look at the source code directly.

DBSkip

pub struct DBSkip<T: Comparer> {
    db: sync::RwLock<db_skip<T>>,
}

put

key already exists The processing logic of is as follows, which is similar to that of the arena version. The difference is that when updating the current node node, the border is used_ Mut() gets the variable borrowing of the current node to modify:

// If the key already exists, directly overwrite the previous value
            // Data is appended, so the current kV_ The length of data is the length of the node in Skv_ New offset on data
            let offset = db.kv_data.len();
            // Append key and value data
            db.kv_data.append(&mut key.data().to_vec());
            let mut v_len = 0;
            if let Some(value) = value {
                v_len = value.len();
                db.kv_data.append(&mut value.to_vec());
            }
            // Update offset of node
            node.borrow_mut().offset = offset;

            // The length of value may also change
            // Previous length
            let v_old = node.borrow().value_len;
            // Update to new length
            node.borrow_mut().value_len = v_len;
            // Total update data size
            db.kv_size += v_len - v_old;
            return Ok(());

For a new key, first allocate the floor height and process the pre according to the floor height_ node

let mut pre_node = pre_node.unwrap();
        let h = db.rand_heigth();
        // Processing head nodes
        if h > db.max_height {
            // Supplement the higher part
            for i in db.max_height..h{
                pre_node.push(Rc::clone(&db.head));
            }
            db.max_height = h;
            println!("height {}", h);
        }

Save new node data in kV_ Start offset in data, and then give kv_data append key and value data.

// The new node is in kV_ Start offset in data
        let kv_offset = db.kv_data.len();
        // Append key and value data
        db.kv_data.append(&mut key.data().to_vec());
        let mut v_len = 0;
        if let Some(value) = value {
            v_len = value.len();
            db.kv_data.append(&mut value.to_vec());
        }

Vec's append method pub FN append (& mut self, other: & mut self) corresponding description Moves all the elements of other into Self, leaving other empty

Create a new node

let node = Rc::new(RefCell::new(node::new(
            kv_offset,
            key.data().len(),
            v_len,
            h,
        )));

Insert the linked list of each layer that the query passes through:

// Execute insert
        for (i, pre) in pre_node.iter().enumerate() {
            // New node - > next = pre - > next
            if let Some(ref pre_next) = pre.borrow().next[i] {
                node.borrow_mut().next[i] = Some(Rc::clone(pre_next));
            }

            // Pre - > next = new node
            pre.borrow_mut().next[i] = Some(Rc::clone(&node));
        }

Update statistics

db.kv_size+=key.data().len()+v_len;
       db.n+=1;

       Ok(())

delete

If you understand the put method, the delete method is much simpler

pub fn delete(&mut self, key: &internalKey) -> Option<()> {
        let mut db = self.db.write().unwrap();
        let mut pre_node = Some(vec![]);
        let (node, exist) = db.find_great_or_equal(key, &mut pre_node);
        if !exist{
            return None;
        }
        let pre_node = pre_node.unwrap();
        // Execute delete
        for (i, pre) in pre_node.iter().enumerate() {
            // The forward node executes the next node of the current node pre - > next = node - > next
            if let Some(ref node_next) = node.borrow().next[i] { // node_next: the current node is the next hop node of layer i
                pre.borrow_mut().next[i] = Some(Rc::clone(node_next));
            }
        }

        db.kv_size-=node.borrow().key_len+node.borrow().value_len;
        db.n -=1;
        Some(())
    }

reference material
Jump table https://www.bookstack.cn/read...
Jump linked list https://www.cnblogs.com/s-lis...
Level RS full project address https://github.com/kingeaster...

Keywords: Database Rust

Added by mrvijayakumar on Tue, 11 Jan 2022 09:06:55 +0200