preface
The memdb module in leveldb uses skiplist as a kv memory storage, and the related code implementation is very beautiful. The following is described above:
- Compare the implementation of query, insert and delete in c + + and golang
- Analyze the places in the golang version that can be optimized, and then optimize the t rust version
Then in this article
- How to reference the version of goleveldb and rewrite memdb (arena version) with trust
- Use t rust to rewrite a non arena version of memdb, which is the classic implementation of linked list structure
arena implementation
reference resources goleveldb DB Considering concurrency security, DB and DB structures are defined in the Rust implementation
- db contains all members, non thread safe, and provides query related methods;
- db adds Mutex encapsulation to db, which is thread safe and provides core insertion, deletion and more query functions;
The details are as follows:
db
https://github.com/kingeaster...
struct db<T: Comparer> { cmp: T, // Comparator for comparing key s // rnd:rand // Store the data of the actual key and value kv_data: Vec<u8>, // Used to record key and value in kV_ The index in data. The format of each node is as follows, where level represents the number of layers of the current node, followed by level numbers, respectively indicating that the next node of each layer in the level layers of the current node is on the node_ Location in data // kvOffset1|len(key1)|len(value1)|level|next_node_1|...|next_node_i|...|next_node_h| node_data: Vec<usize>, // The first 16 bytes correspond to the head node // In the query process, the nodes experienced in the search process (actually the position of the node in node_data) are recorded, which are mainly used for insertion and deletion prev_node: RefCell<[usize; MAX_HEIGHT]>, // The number of layers of the current skiplist max_height: usize, // Number of nodes in skiplist n: usize, // Total data size kv_size: usize, }
This db definition is very similar to that of goleveldb, without much complexity.
However, it should be noted that the prev in db here_ Node member variable is used to record the forward nodes of each layer during query or deletion. It is a common array in golveldb prevNode In our Rust definition, there is an array RefCell < [use; max_height] >, which is encapsulated by RefCell. The reason is that there are two usage scenarios for db search method. One is for pure search query, so the current db is read-only and uses constant borrowing & self. If it is used for insertion or deletion and needs to insert data into RefCell, db becomes variable, Variable borrowing & mut self is required. In order to keep db borrowing semantics unchanged, RefCell is used to provide internal variable features. So why do you want to keep db borrowing unchanged and use variable borrowing directly regardless of pure query or modified query?
Because constant borrowing in Rust can be shared, but variable borrowing cannot be shared. If only variable borrowing & mut self is used directly, the usage scenario of pure query operation will be limited. Even if an operation is only query, db should be declared as mut.
Another method is prev_node is not used as a member variable of db, but as an additional function during query. For details, please refer to Practice of node version
Encapsulates the reading and setting of the next node
In order to improve the readability and maintainability of the code, the next node in the level layer will be obtained_ The operation of the position in data and the operation of setting the next node are encapsulated:
// Calculate the node node in the level layer and the next node in the node_data to improve the readability of the code fn next_node(&self, node: usize, i: usize) -> usize { // +NNEXT means at node_ In data, start from the node position, skip the four bytes kvOffset1|len(key1)|len(value1)|level| and move i positions later to reach next_node_i got it self.node_data[node + NNEXT + i] } fn set_next_node(&mut self, node: usize, i: usize, next: usize) { self.node_data[node + NNEXT + i] = next; }
Encapsulates the data to read key or value
In order to provide the feasibility and maintainability of the code, the operations of obtaining the real data of key and obtaining the real number of value are encapsulated
//According to node_data, find the position in kv_data to get the key
fn get_key_data(&self, node: usize) -> &[u8] { let offset = self.node_data[node]; &self.kv_data[offset..offset + self.node_data[node + NKEY]] }
// According to node_data, find the position in kV_ The offset and length in data to get value fn get_value_data(&self, node: usize) -> &[u8] { let key_offset = self.node_data[node] + self.node_data[node + NKEY]; &self.kv_data[key_offset..key_offset + self.node_data[node + NVAL]] }
Query nodes with a specific key greater than or equal to
You can see that after encapsulation, find_ great_ or_ The implementation of equal fits more closely with the algorithm description of skiplist.
// save_pre flag whether to record traversed nodes during search pub fn find_great_or_equal(&self, key: &internalKey, save_pre: bool) -> (usize, bool) { let mut node = 0; // Search from top to bottom let mut i = self.max_height - 1; // println!("max_height {}", i); loop { // The next node is on node_ Location in data let next = self.next_node(node, i); let mut cmp = Ordering::Greater; // There is no end in the current linked list if next > 0 { // key comparison with the next node cmp = self.cmp.compare(self.get_key_data(next), key.data()); } // Greater than the next node, continue to jump right along the current layer if cmp == Ordering::Less { node = next; } else { // Less than or equal to the next node or the next node is empty // if save_pre { // //For the search for insertion or deletion, even if the same one is encountered, continue to the next level of comparison, and cannot return immediately // //Therefore, the judgment here should precede cmp == Ordering::Equal // self.prev_node.borrow_mut()[i] = node; // } else if cmp == Ordering::Equal { // // find_great_or_equal and find_ One difference of less is that next is returned here // return (next, true); // } // Change to the following way, which is more readable if (!save_pre) && cmp == Ordering::Equal { return (next, true); } if save_pre {// If you need to keep the forward node, record it to pre_ In node self.prev_node.borrow_mut()[i] = node; } if i == 0 { return (next, cmp == Ordering::Equal); } i -= 1; } } }
In the above implementation, the processing of the current node less than or equal to the next node is reconstructed compared with the writing method of golang. The reference golang is written as follows:
// Less than or equal to the next node or the next node is empty if save_pre { // For the search for insertion or deletion, even if the same one is encountered, continue to the next level of comparison, and cannot return immediately // Therefore, the judgment here should precede cmp == Ordering::Equal self.prev_node.borrow_mut()[i] = node; } else if cmp == Ordering::Equal { // find_great_or_equal and find_ One difference of less is that next is returned here return (next, true); }
The main meaning of the code here is: if it is only a pure query operation, you can directly return the matching one; However, if the query is for insertion or deletion, even if a matching node is found, it must jump to the next layer until the lowest layer can be returned. After understanding the meaning of the code, we rewrite it
//Change to the following way, which is more readable
if (!save_pre) && cmp == Ordering::Equal { return (next, true); } if save_pre { self.prev_node.borrow_mut()[i] = node; }
other
find_lessthan, find_last The Rust implementation of these two method s is consistent with goleveldb. I won't talk more. You can directly click to see the source code.
Db
github Db
pub struct Db<T: Comparer> { db: sync::RwLock<db<T>>, }
Db is mainly used to provide search methods. It is non thread safe. Db performs insertion and deletion. It is thread safe.
Insert put
First, obtain the write lock. In Rust, Sync:: rwlock < T > obtains the write lock by calling the write() method:
let mut db = self.db.write().unwrap();
Search key for the location of the insert
let (node, exist) = db.find_great_or_equal(key, true);
If the key already exists, you can reuse the previous node_data
let (node, exist) = db.find_great_or_equal(key, true); if exist { // TODO Optimization: if the length of the new value is less than or equal to the old value, the direct value is directly overwritten without reallocation // If the key already exists, directly overwrite the previous value // Data is appended, so the current kV_ The length of data is the length of the node in Skv_ New offset on data let offset = db.kv_data.len(); // Append key and value data db.kv_data.append(&mut key.data().to_vec()); let mut v_len = 0; if let Some(value) = value { v_len = value.len(); db.kv_data.append(&mut value.to_vec()); } // Update offset of node db.node_data[node] = offset; // The length of value may also change // Previous length let v_old = db.node_data[node + NVAL]; // Update to new length db.node_data[node + NVAL] = v_len; // Total update data size db.kv_size += v_len - v_old; return Ok(()); }
For newly inserted points, the assigned floor height is calculated first
let h = db.rand_heigth();
Similarly, if the allocated layer height is higher than the maximum layer height of the current skiplist, it should be given to pre_node compensates for the missing forward node, that is, the head node is added.
// Processing head nodes if h > db.max_height { for i in db.max_height..h + 1 { db.prev_node.borrow_mut()[i] = 0; } db.max_height = h; println!("height {}", h); }
Get new node data in kV_ Start offset in data, and then append key and value data to kV_ Behind the data
// The new node is in kV_ Start offset in data let kv_offset = db.kv_data.len(); // Append key and value data db.kv_data.append(&mut key.data().to_vec()); let mut v_len = 0; if let Some(value) = value { v_len = value.len(); db.kv_data.append(&mut value.to_vec()); }
Record current node_ The length of data, that is, the new node is on the node_ The starting offset in data, and then in node_ Add a new node to data,
// Create a new node. Because it is an append method, the current node_ The length of data is the length of the new node on the node_ Location of data let node = db.node_data.len(); // Add new node db.node_data.push(kv_offset); // At kV_ Offset in data db.node_data.push(key.data().len()); // Length of key db.node_data.push(v_len);// Length of value db.node_data.push(h); // Floor height of current node
Then perform linked list insertion according to the writing method of goleveldb, which is written in goleveldb
// Traverse the forward node of each layer for i, n := range p.prevNode[:h] { m := n + nNext + i // n node is the next node in layer i p.nodeData = append(p.nodeData, p.nodeData[m]) // The next node of layer n of the current node points to m p.nodeData[m] = node // The next node of n node in layer i points to the current node node }
Traversal prev_ Insert the forward node saved in node, that is, point the current node to the next node of the forward node, and then point the next node of the forward node to the current node:
// Traverse the forward node of each layer, iter() will only return Item, and the subscript can be returned at the same time by encapsulating enumerate for (i,n )in db.prev_node.borrow()[0..h].iter().enumerate(){ let next = db.next_node(*n, i); db.node_data.push(next); db.set_next_node(*n, i, node); }
Compiler error:
error[E0502]: cannot borrow `db` as mutable because it is also borrowed as immutable --> src/memdb/memdb.rs:343:13 | 341 | for (i,n )in db.prev_node.borrow()[0..h].iter().enumerate(){ | --------------------- | | | immutable borrow occurs here | a temporary with access to the immutable borrow is created here ... 342 | let next = db.next_node(*n, i); 343 | db.node_data.push(next); | ^^ mutable borrow occurs here 344 | 345 | }
In the scope of the for loop, there are both immutable borrowing and variable borrowing of db, so you can only separate them; In addition, in order to improve the readability of the code, give node in advance_ The data extension h length is used to store h next nodes:
db.node_data.resize(node + NNEXT + h, 0);
The code for executing node insertion is rewritten as
// Traverse each layer for i in 0..h { let n = db.prev_node.borrow()[i];// Forward node let next = db.next_node(n, i); // The forward node is the next node in layer i db.set_next_node(node, i, next); // The next node of layer i of the current node points to next db.set_next_node(n, i, node); // The forward node points to the current node at the next node of layer i }
Update statistics
// Update data size and number db.kv_size += key.data().len() + v_len; db.n += 1;
delete
// delete pub fn delete(&mut self, key: &internalKey) -> Option<()> { let mut db = self.db.write().unwrap(); let (node, exist) = db.find_great_or_equal(key, true); if !exist { return None; } // How many layers does the current node have let h = db.node_data[node + NHEIGHT]; // Start deletion. Let the previous node refer to the next node of the next node of the previous node. Pre - > next = pre - > next - > next for i in 0..h { let pre = db.prev_node.borrow()[i]; // let pre_next = db.next_node(pre, i); // if pre_next != node { // print!("{}:{}", pre_next, node); // } // let next_next = db.next_node(pre_next, i); // db.set_next_node(pre, i, next_next); // Because the next node of the previous node is pre_node is the current node, so the above code can be optimized as let next_next = db.next_node(node, i); db.set_next_node(pre, i, next_next); } db.kv_size -= db.node_data[node + NKEY] + db.node_data[node + NVAL]; db.n -= 1; Some(()) }
In the code of delete, similar to put, it traverses pre_node, access through subscript to obtain the forward node of each layer
let pre = db.prev_node.borrow()[i];
In addition, due to the pre of the next node of the previous node_ Node is the current node, so the code to delete the node is removed from the
// Next node of forward node let pre_next = db.next_node(pre, i); let next_next = db.next_node(pre_next, i); db.set_next_node(pre, i, next_next);
Change to
// Use current node directly let next_next = db.next_node(node, i); db.set_next_node(pre, i, next_next);
other
Other methods are relatively simple. You can directly look at the source code
Node pointer implementation
Node definition
node,RcNode As follows:
type RcNode = Rc<RefCell<node>>; // Each node struct node { offset: usize, // Corresponding kV_ Start position in data key_len: usize, // Length of key value_len: usize, // Length of value next: Vec<Option<RcNode>>, // The current node has a height layer, and the i-th element represents the next node of the i-th layer }
Since the nodes in the linked list need to be referenced by other nodes and share ownership, Rc should be used. In addition, RefCell is used to provide internal variability because next needs to be changed during operation. Type rcnode = Rc < RefCell < node > >;
In the next attribute of node, the next node of each layer may be empty, so Option is used;
The node implements the following methods to obtain the actual key and value data. Here, because the return is & [U8], the declaration cycle ` a is used;
// According to node_data, find the position in kv_data to get the key fn get_key_data<'a>(&self, node: &node, kv_data: &'a [u8]) -> &'a [u8] { &kv_data[node.offset..node.offset + node.key_len] } // According to node_data, find the position in kV_ The offset and length in data to get value fn get_value_data<'a>(&self, node: &node, kv_data: &'a [u8]) -> &'a [u8] { &kv_data[node.offset + node.key_len..node.offset + node.key_len + node.value_len] }
db_skip
db_skip is defined as follows:
struct db_skip<T: Comparer> { cmp: T, // Comparator for comparing key s kv_data: Vec<u8>, // Store the data of the actual key and value. The offset starts from 1, and the offset of 0 indicates the head node head: RcNode, // Head, // The number of layers of the current skiplist max_height: usize, // Number of nodes in skiplist n: usize, // Total data size kv_size: usize, }
Compared with agena version, pre is missing here_ Node, an additional head member is used to save the first node of the skiplist.
Encapsulates the data to read key or value
// According to node_data, find the position in kv_data to get the key fn get_key_data(&self, node: &node) -> &[u8] { &self.kv_data[node.offset..node.offset + node.key_len] } // According to node_data, find the position in kV_ The offset and length in data to get value fn get_value_data(&self, node: &node) -> &[u8] { &self.kv_data[node.offset + node.key_len..node.offset + node.key_len + node.value_len] }
Query nodes with a specific key greater than or equal to
pub fn find_great_or_equal( &self, key: &internalKey, pre_node: &mut Option<Vec<RcNode>>, // Note that option < & mut VEC < rcnode > > cannot be used ) -> (RcNode, bool) { let mut node = Rc::clone(&self.head); // Start from scratch node let mut next_node = Rc::clone(&node); let mut i = self.max_height - 1; loop { // Here, setting cmp to Ordering::Less in advance is a very clever way to automatically include the case that the next node is empty (regarded as infinity) let mut cmp = Ordering::Less; if let Some(ref next) = node.borrow().next[i] { // Next node exists cmp = self .cmp .compare(key.data(), self.get_key_data(&node.borrow())); next_node = Rc::clone(&next); } // Greater than the next node, continue to jump right along the current layer if cmp == Ordering::Greater { node = Rc::clone(&next_node); continue; } // Go here and explain: node is less than or equal to the next node or the next node is empty // If the forward node is not saved, it is just an ordinary search. If a match is found, it will be returned directly if (pre_node.is_none()) && cmp == Ordering::Equal { return (next_node, true); } // If you save the forward node node if let Some(ref mut pre) = pre_node { pre.push(Rc::clone(&node)); } if i == 0 { return (next_node, cmp == Ordering::Equal); } i -= 1; } }
First put pre_node is passed into find as an input parameter_ great_ or_ equal , pre_node: & mut Option < Vec < rcnode > > because pre_node stores the forward nodes of each layer and is variable, so & mut, in addition, due to pre_node is optional, so use Option. Note that it can't be written as pre here_ Node: Option < & mut Vec < rcnode > >, which will cause pre error when extracting Vec from Option_ Node's move causes compilation failure.
Next, use node Borrow() obtains the constant borrowing of the current node, and then obtains the next node of layer I through next[i]. Because it is an Option type, use if let Some(ref next) to obtain the reference of the next node when the next node exists.
// If the next node stores if let Some(ref next) = node.borrow().next[i] { // The next node exists and is compared with the next node cmp = self.cmp.compare(key.data(), self.get_key_data(&next.borrow())); next_node = Rc::clone(&next); }
In addition, note that since the declaration cycle of next is only within if let {}, you need to pass next_ node = Rc::clone(&next); Write it down for the next iteration.
According to the comparison results, if it is greater than the next node, it jumps to the next node along the current layer
// Greater than the next node, continue to jump right along the current layer if cmp == Ordering::Greater { node = Rc::clone(&next_node); continue; }
Next, in the same way, give priority to judgment if the forward node is not saved and a matching node is found:
// Go here and explain: node is less than or equal to the next node or the next node is empty // If the forward node is not saved, it is just an ordinary search. If a match is found, it will be returned directly if (pre_node.is_none()) && cmp == Ordering::Equal { return (next_node, true); }
If the forward node is saved, use if let Some(ref mut pre)=pre_node from pre_node obtains the variable borrowing of Vec, and then puts the shared borrowing of node into:
// If you save the forward node node if let Some(ref mut pre) = pre_node { pre.push(Rc::clone(&node)); }
Finally, if there is a bottom layer, it will return. If there is no bottom layer, it will skip the next layer:
// If you get to the last level, go back if i == 0 { return (next_node, cmp == Ordering::Equal); } i -= 1;
other
Other parts are relatively simple. Just look at the source code directly.
DBSkip
pub struct DBSkip<T: Comparer> { db: sync::RwLock<db_skip<T>>, }
put
key already exists The processing logic of is as follows, which is similar to that of the arena version. The difference is that when updating the current node node, the border is used_ Mut() gets the variable borrowing of the current node to modify:
// If the key already exists, directly overwrite the previous value // Data is appended, so the current kV_ The length of data is the length of the node in Skv_ New offset on data let offset = db.kv_data.len(); // Append key and value data db.kv_data.append(&mut key.data().to_vec()); let mut v_len = 0; if let Some(value) = value { v_len = value.len(); db.kv_data.append(&mut value.to_vec()); } // Update offset of node node.borrow_mut().offset = offset; // The length of value may also change // Previous length let v_old = node.borrow().value_len; // Update to new length node.borrow_mut().value_len = v_len; // Total update data size db.kv_size += v_len - v_old; return Ok(());
For a new key, first allocate the floor height and process the pre according to the floor height_ node
let mut pre_node = pre_node.unwrap(); let h = db.rand_heigth(); // Processing head nodes if h > db.max_height { // Supplement the higher part for i in db.max_height..h{ pre_node.push(Rc::clone(&db.head)); } db.max_height = h; println!("height {}", h); }
Save new node data in kV_ Start offset in data, and then give kv_data append key and value data.
// The new node is in kV_ Start offset in data let kv_offset = db.kv_data.len(); // Append key and value data db.kv_data.append(&mut key.data().to_vec()); let mut v_len = 0; if let Some(value) = value { v_len = value.len(); db.kv_data.append(&mut value.to_vec()); }
Vec's append method pub FN append (& mut self, other: & mut self) corresponding description Moves all the elements of other into Self, leaving other empty
Create a new node
let node = Rc::new(RefCell::new(node::new( kv_offset, key.data().len(), v_len, h, )));
Insert the linked list of each layer that the query passes through:
// Execute insert for (i, pre) in pre_node.iter().enumerate() { // New node - > next = pre - > next if let Some(ref pre_next) = pre.borrow().next[i] { node.borrow_mut().next[i] = Some(Rc::clone(pre_next)); } // Pre - > next = new node pre.borrow_mut().next[i] = Some(Rc::clone(&node)); }
Update statistics
db.kv_size+=key.data().len()+v_len; db.n+=1; Ok(())
delete
If you understand the put method, the delete method is much simpler
pub fn delete(&mut self, key: &internalKey) -> Option<()> { let mut db = self.db.write().unwrap(); let mut pre_node = Some(vec![]); let (node, exist) = db.find_great_or_equal(key, &mut pre_node); if !exist{ return None; } let pre_node = pre_node.unwrap(); // Execute delete for (i, pre) in pre_node.iter().enumerate() { // The forward node executes the next node of the current node pre - > next = node - > next if let Some(ref node_next) = node.borrow().next[i] { // node_next: the current node is the next hop node of layer i pre.borrow_mut().next[i] = Some(Rc::clone(node_next)); } } db.kv_size-=node.borrow().key_len+node.borrow().value_len; db.n -=1; Some(()) }
reference material
Jump table https://www.bookstack.cn/read...
Jump linked list https://www.cnblogs.com/s-lis...
Level RS full project address https://github.com/kingeaster...