Author: Wang Dongyang
preface
Sorting and binary search are very common algorithms in computer science leveldb memdb source code analysis (Part 2): Rust implementation You can see that many methods use binary search. This paper will compare some characteristics and differences in sorting and search methods between Go and Rust. For Go, we mainly use digital slice [] int as an example, and for Rust, we mainly use Vec as an example.
sort
In the Go language, we can use sort directly for [] int Ints are sorted as follows:
func sort_simple() { a := []int{1, 2, 6, 7, 8, 3, 4} sort.Ints(a) fmt.Println(a) }
In Rust, Vec can be sorted using the sort method as follows:
fn sort_simple() { let mut a = vec![1, 2, 6, 7, 8, 3, 4]; a.sort(); println!("{:?}", a); }
If you want to customize the collation, we can use sort. In Go Slice, which implements reverse sorting or sorting based on specific rules such as absolute value, as follows:
func sort_reverse() { a := []int{1, 2, 6, 7, 8, 3, 4} sort.Slice(a, func(i, j int) bool { return a[i] > a[j] }) fmt.Println(a) } func sort_abs() { a := []int{1, -2, 6, 7, -8, 3, 4} sort.Slice(a, func(i, j int) bool { return abs(a[i]) > abs(a[j]) }) fmt.Println(a) } func abs(i int) int { if i > 0 { return i } return -i }
In Rust, you can use sort_by custom sort
fn sort_reverse() { let mut a = vec![1, 2, 6, 7, 8, 3, 4]; a.sort_by(|a, b| b.cmp(a)); println!("{:?}", a); } fn sort_abs() { let mut a: Vec<i32> = vec![1, -2, 6, 7, -8, 3, 4]; a.sort_by(|a, b| a.abs().cmp(&b.abs())); println!("{:?}", a); }
In addition, in Rust, you can also use sort if you want to sort the key according to a specific rule_ by_ key
fn sort_by_key_abs() { let mut a: Vec<i32> = vec![1, -2, 6, 7, -8, 3, 4]; a.sort_by_key(|k| k.abs()); println!("{:?}", a); }
In addition, Rust considers sort_ by_ The closure function passed in by the key method may take a long time to process the key, so sort is also provided_ by_ cached_ Key. The key will not be calculated repeatedly to optimize the sorting performance.
The corresponding is also provided in Rust_ The unstable version has better performance than the stable sorting. Those interested can see the corresponding api file by themselves.
In Go, sort the custom types, except using sort Slice can also use the following troublesome way to implement sorting related interfaces:
type SortBy []Type func (a SortBy) Len() int { return len(a) } func (a SortBy) Swap(i, j int) { a[i], a[j] = a[j], a[i] } func (a SortBy) Less(i, j int) bool { return a[i] < a[j] }
search
Go
In Go, you can use SearchInts, SearchFloat64s and SearchStrings to sort different types of slices. This paper mainly focuses on SearchInts As an example:
func SearchInts(a []int, x int) int
The corresponding official introduction is as follows
SearchInts searches for x in a sorted slice of ints and returns the index as specified by Search. The return value is the index to insert x if x is not present (it could be len(a)). The slice must be sorted in ascending order.
Here are a few very important points that many people easily make mistakes or never notice:
- If the element cannot be found, the smallest subscript in the element greater than x in the sorted ints is returned.
- If x is larger than all elements in sorted ints, the returned subscript is the length of the sorted ints;
- If there are multiple identical x elements in the sorted ints, the one with the smallest subscript is returned
Next, we will introduce these three points through specific examples:
package main import ( "fmt" "sort" ) func main() { a := []int{1, 2, 3, 4, 6, 7, 8} x := 2 i := sort.SearchInts(a, x)// Find x fmt.Printf("found %d at index %d in %v\n", x, i, a) x = 5 i = sort.SearchInts(a, x) // x not found fmt.Printf("%d not found, can be inserted at index %d in %v\n", x, i, a) x = 9 i = sort.SearchInts(a, x) // Cannot find x and x is larger than all elements fmt.Printf("%d not found, can be inserted at index %d in %v\n", x, i, a) }
For example, in the above example, for ordered slices a:=[]int{1, 2, 3, 4, 6, 7, 8},
- Execute sort Searchints (a, 2). Since 2 exists in a, the corresponding index 1 in the following table is returned;
- Execute sort Searchints (a, 5). Since 5 does not exist, it returns the subscript index 4 of 6, which is the first number greater than 5 in a;
- Execute sort Searchints (a, 9). Since 9 is larger than all elements in a, it returns the length of A.
From the above example, we know that if you want to judge whether an element exists, execute sort Searchints needs to judge again after searching.
For slices containing duplicate elements, see the following:
a = []int{0,2,2,2,2,2,5}
If you perform a search, sort Searchints (a, 2) returns the subscript of the element with the smallest subscript among the elements equal to 2 in A. why does this behavior occur? Let's see
In the corresponding source code, the bottom layer of SearchInts actually calls the Search method. In fact, SearchFloat64s and SearchStrings also call Search
Let's first look at the source code of Search, which is a very classic way to write binary Search
func Search(n int, f func(int) bool) int { // Define f(-1) == false and f(n) == true. // Invariant: f(i-1) == false, f(j) == true. i, j := 0, n for i < j { h := int(uint(i+j) >> 1) // avoid overflow when computing h // i ≤ h < j if !f(h) { i = h + 1 // preserves f(i-1) == false } else { j = h // preserves f(j) == true } } // i == j, f(i-1) == false, and f(j) (= f(i)) == true => answer is i. return i }
When SearchInts calls Search, the function passed is return a [i] > = X
func SearchInts(a []int, x int) int { return Search(len(a), func(i int) bool { return a[i] >= x }) }
Combine the above sort Searchints source code, we can understand why sort Searchints will behave like that because sort Searchints actually queries the slice for the subscript of the smallest element greater than or equal to a specific value x.
So if there are multiple identical elements in the slice, sort Searchints returns the index of the leftmost element. What if we want to return the index of the rightmost element? In fact, it's very simple. We just need to pass sort Searchints queries the subscript of x+1, then subtracts this subscript and returns it.
Rust
Binary is a common method for binary search in Rust_ Search, which returns the Result structure.
- If found, Result::Ok is returned, including the index of the found element. If multiple elements match, the index of the returned element is uncertain; This is different from the one in Go;
- If the Result::Err is not found, the first index subscript greater than the search element in vec is returned. If the query element is larger than all elements, the value in it is equal to the length of vec. This behavior is consistent with that in go;
Let's start with the following example:
let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]; assert_eq!(s.binary_search(&13), Ok(9)); assert_eq!(s.binary_search(&4), Err(7)); assert_eq!(s.binary_search(&100), Err(13));
If the ordered vec contains repeated elements, for example, there are multiple 1 in s above, the returned subscript is uncertain, and may be any subscript from 1 to 4.
let r = s.binary_search(&1); assert!(match r { Ok(1..=4) => true, _ => false, });
To understand binary_search behavior, let's take a look at binary_ The source code of search calls binary internally_ search_ By, the source code of the two methods is as follows:
#[stable(feature = "rust1", since = "1.0.0")] pub fn binary_search(&self, x: &T) -> Result<usize, usize> where T: Ord, { self.binary_search_by(|p| p.cmp(x)) }
binary_ search_ The official documents of by are described as follows:
Binary searches this sorted slice with a comparator function.
The comparator function should implement an order consistent with the sort order of the underlying slice, returning an order code that indicates whether its argument is Less, Equal or Greater the desired target.
If the value is found then Result::Ok is returned, containing the index of the matching element. If there are multiple matches, then any one of the matches could be returned. The index is chosen deterministically, but is subject to change in future versions of Rust. If the value is not found then Result::Err is returned, containing the index where a matching element could be inserted while maintaining sorted order.
- Sort based on the passed in comparison function;
- The comparison function returns the Ordering type F: fnmut (&'a T) - > Ordering;
- If found, Result::Ok is returned, including the subscript of the found element. If there are multiple matches, any one may be returned. This can be seen in the following source code;
#[stable(feature = "rust1", since = "1.0.0")] #[inline] pub fn binary_search_by<'a, F>(&'a self, mut f: F) -> Result<usize, usize> where F: FnMut(&'a T) -> Ordering, { let mut size = self.len(); let mut left = 0; let mut right = size; while left < right { let mid = left + size / 2; // SAFETY: the call is made safe by the following invariants: // - `mid >= 0` // - `mid < size`: `mid` is limited by `[left; right)` bound. let cmp = f(unsafe { self.get_unchecked(mid) }); // The reason why we use if/else control flow rather than match // is because match reorders comparison operations, which is perf sensitive. // This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra. if cmp == Less { left = mid + 1; } else if cmp == Greater { right = mid; } else { // SAFETY: same as the `get_unchecked` above unsafe { crate::intrinsics::assume(mid < self.len()) }; return Ok(mid); } size = right - left; } Err(left) }
If you want to implement the behavior of SearchInts in golang in t rust, you can use partition_point, the official introduction is as follows:
Returns the index of the partition point according to the given predicate (the index of the first element of the second partition).
The slice is assumed to be partitioned according to the given predicate. This means that all elements for which the predicate returns true are at the start of the slice and all elements for which the predicate returns false are at the end. For example, [7, 15, 3, 5, 4, 12, 6] is a partitioned under the predicate x % 2 != 0 (all odd numbers are at the start, all even at the end).
If this slice is not partitioned, the returned result is unspecified and meaningless, as this method performs a kind of binary search.
In short, partition_point is used to divide the input slice into two parts. The elements before the return index meet condition P, and the elements after the index and index do not meet condition P.
Examples are as follows:
let v = [1, 2, 3, 3, 5, 6, 7]; let i = v.partition_point(|&x| x < 5); assert_eq!(i, 4); assert!(v[..i].iter().all(|&x| x < 5)); assert!(v[i..].iter().all(|&x| !(x < 5)));
The above divides v into two parts based on the condition x < 5. The elements before i are less than 5, and the elements after i are greater than 5, partition_ The bottom layer of point actually calls binary_search_by, as follows:
#[stable(feature = "partition_point", since = "1.52.0")] pub fn partition_point<P>(&self, mut pred: P) -> usize where P: FnMut(&T) -> bool, { self.binary_search_by(|x| if pred(x) { Less } else { Greater }).unwrap_or_else(|i| i) }
summary
This paper introduces the usage of sorting and binary search related APIs in Go language and Rust language, and compares some different behavior characteristics of the two languages in binary search to avoid stepping on the pit in development. It is summarized as follows:
- Slices in Go can be sorted using the Sort method, using Sort Slice() sorts user-defined behaviors;
- Sort can be used in Rust_ by ,sort_by_key for user-defined sorting, and sort with higher performance_ by_ cached_ key, _ The unstable version can be selected;
- Use sort. In Go SearchInts,sort.SearchFloat64s,sort.SearchStrings performs binary search in the corresponding slice, and internally calls sort Search query the index greater than or equal to a specific value;
- Binary in Rust_ Search performs binary search. In fact, binary is called internally_ search_ Binary search performed by;
- Go calls sort When searching with search, you will continue to search when you encounter elements that meet the conditions; And Rust's binary_ search_ When searching by, it will return immediately if it meets the conditions. The implementation of these two search algorithms is different, resulting in different results when querying data containing duplicate elements.