How to store and query the following key-value pairs?
1: Watermelon
2: Apples
3: Oranges
Simple storage using an array arr[N]:
| Watermelon | Apples | Oranges |
When storing, use the key as the index: the storage slot is located directly by offset, and the value is assigned there:
arr[1] = "Watermelon";
arr[2] = "Apples";
arr[3] = "Oranges";
When querying, use the key as the index to fetch the data directly:
return arr[index];
This is a very efficient way to query: the index locates the element directly, with no one-by-one comparison.
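As a minimal runnable sketch of the idea (the array size of 4 and the unused slot 0 are choices made for this example):

#include <stdio.h>

int main(void) {
    const char *arr[4] = {0};   /* slot 0 is simply left unused here */

    /* PUT: the key itself is the array index */
    arr[1] = "Watermelon";
    arr[2] = "Apples";
    arr[3] = "Oranges";

    /* GET: one offset computation, no comparisons */
    printf("%s\n", arr[2]);     /* prints "Apples" */
    return 0;
}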
If the content to store is a bit more complicated, for example, recording which fruit each child likes, the method above no longer works:
Xiao Ming: watermelon
Xiao Hong: apple
Monkey: orange
The key is no longer a simple integer and cannot be used directly as an array index.
Therefore, you need to convert the key into a simple index through some method:
PUT:
  index = Hash("Xiao Ming");
  arr[index] = "watermelon";
GET:
  index = Hash("Xiao Ming");
  return arr[index];
The above is the essence of a HashMap.
The implementation of Hash() will not be studied in depth for now. Some common related techniques, worth a deeper look if you are interested, are separate chaining, open addressing, linear probing, and quadratic probing (these concern how colliding keys are handled rather than the hash function itself).
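Purely for illustration, here is a toy string hash in the spirit of djb2; it is one possible choice, not the function any particular hash map actually uses. The later sketches in this article assume a Hash() of this shape.

/* Toy string hash (djb2-style): h = h*33 + c for each byte. */
unsigned long Hash(const char *key) {
    unsigned long h = 5381;
    while (*key != '\0')
        h = h * 33 + (unsigned char)*key++;
    return h;
}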
A hash function generally produces a very large number, but the array capacity is limited. The usual handling is as follows:
PUT:
  index = Hash("Xiao Ming");
  index = index % ARR_MAX_LEN;
  arr[index] = "watermelon";
GET:
  index = Hash("Xiao Ming");
  index = index % ARR_MAX_LEN;
  return arr[index];
This raises several problems:
1. Different keys may end up with the same index. For example, Xiao Ming hashes to 1 and Xiao Hong hashes to 11; with an array capacity of 10, both map to index 1. This phenomenon is called a "hash collision";
2. The modulo operation is relatively slow, and if queries are very frequent this cost adds up;
3. The right array capacity is hard to know in advance: too large wastes memory, while too small causes so many collisions that queries become inefficient.
Solution to 1: there are many ways to resolve hash collisions. The common one is separate chaining: entries that share an index are hung off the corresponding array bucket as a linked list, as sketched below.
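A minimal chaining sketch, assuming the toy Hash() above; entry_t, BUCKETS, and the function names are made up for this example:

#include <stdlib.h>
#include <string.h>

#define BUCKETS 16

typedef struct entry_s {
    char *key;
    char *val;
    struct entry_s *next;   /* entries sharing an index are chained here */
} entry_t;

entry_t *buckets[BUCKETS];  /* zero-initialized: all buckets start empty */

void chain_put(char *key, char *val) {
    unsigned long index = Hash(key) & (BUCKETS - 1);
    entry_t *e = malloc(sizeof(entry_t));
    e->key = key;
    e->val = val;
    e->next = buckets[index];       /* prepend to this bucket's list */
    buckets[index] = e;
}

char *chain_get(char *key) {
    unsigned long index = Hash(key) & (BUCKETS - 1);
    for (entry_t *e = buckets[index]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)   /* compare keys, not just the index */
            return e->val;
    return NULL;
}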
Solution to 2: the modulo operation is usually replaced by a bit operation (this works because the capacity is a power of 2, as explained at the end):
PUT:
  index = Hash("Xiao Ming");
  index = index & (ARR_MAX_LEN - 1);  // replace modulo with a bit operation
  arr[index] = "watermelon";
GET:
  index = Hash("Xiao Ming");
  index = index & (ARR_MAX_LEN - 1);  // replace modulo with a bit operation
  return arr[index];
Solution to 3: the array capacity is adjusted dynamically. The initial value is 16; when the amount of data reaches a threshold (half the capacity in the sketch below), the array is automatically expanded, usually to twice the capacity, and all data is rehashed.
#include <stdbool.h>
#include <stddef.h>

typedef struct hashmap_s {
    unsigned long capability;   /* bucket count, kept as a power of 2 */
    unsigned long used;         /* number of stored values */
    void **arr;                 /* bucket array */
} hashmap_t;

hashmap_t hashmap;

bool hash_put(char *key, char *val) {
    if (hashmap.capability > (hashmap.used << 1)) {
        unsigned long index = Hash(key);
        index = index & (hashmap.capability - 1);
        hashmap.arr[index] = val;
        hashmap.used++;         /* track the load so the threshold can trigger */
        return true;
    }
    // rehash:
    // malloc(hashmap.capability << 1) buckets and update capability, used, arr;
    // rehash all the original data into the space pointed to by the new arr.
    return false;               /* rehash elided in this sketch */
}

char *hash_get(char *key) {
    unsigned long index = Hash(key);
    index = index & (hashmap.capability - 1);
    return hashmap.arr[index];  /* caller must check whether this slot actually holds data */
}
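A usage sketch under the assumptions above; the initialization is hypothetical, since the original sketch does not show it (needs <stdio.h> and <stdlib.h>):

int main(void) {
    /* hypothetical setup: the original sketch does not show initialization */
    hashmap.capability = 16;
    hashmap.used = 0;
    hashmap.arr = calloc(hashmap.capability, sizeof(void *));

    hash_put("Xiao Ming", "watermelon");
    hash_put("Xiao Hong", "apple");
    printf("%s\n", hash_get("Xiao Ming"));   /* prints "watermelon" */
    return 0;
}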
Rehashing everything in one go makes queries very slow while it runs. Redis uses a divide-and-conquer strategy: when expansion is needed, a second table, table2, is created. Queries still go to table1 first; when an entry is found there, it is rehashed and moved to table2. After enough queries, all the data in table1 has migrated to table2; table1 is then freed and table2 takes its place. A simplified sketch follows.
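A simplified sketch of that two-table migration, reusing hashmap_t and Hash() from above. This is illustrative only: the real Redis dict.c migrates whole buckets incrementally on each operation rather than moving single keys on lookup.

typedef struct dict_s {
    hashmap_t *table1;   /* old table, gradually drained */
    hashmap_t *table2;   /* new table, twice the capability */
} dict_t;

char *dict_get(dict_t *d, char *key) {
    unsigned long i1 = Hash(key) & (d->table1->capability - 1);
    unsigned long i2 = Hash(key) & (d->table2->capability - 1);
    void *val = d->table1->arr[i1];
    if (val != NULL) {
        /* found in the old table: migrate it to table2 as a side effect */
        d->table2->arr[i2] = val;
        d->table1->arr[i1] = NULL;
        d->table1->used--;
        d->table2->used++;
        return val;
    }
    /* already migrated (or never present): look in table2 */
    return d->table2->arr[i2];
    /* once table1->used reaches 0, free table1 and table2 replaces it */
}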
At the same time, rehashing increases memory usage (a larger table is allocated). If Redis happens to be running a memory-hungry operation (such as a backup), this can lead to a memory shortage. Therefore, if such an operation is in progress, rehashing is postponed as long as possible and is forced only when the table is severely overloaded (used > 4 * capacity).
Why is the initial capacity 16, and why is it always expanded to a power of 2?
Because when the capability is a power of 2, (capability - 1) is all 1s in binary:
16-1=15
Binary of 15: 1111
index & 1111b keeps every low-order bit of the hash, so index & (capability - 1) equals index % capability and every bucket can be reached. If the mask had any 0 bits (as it would for a non-power-of-2 capacity), some buckets could never be selected, and collisions would increase.
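A quick check of the mask/modulo equivalence (the hash values are arbitrary examples):

#include <assert.h>
#include <stdio.h>

int main(void) {
    unsigned long capability = 16;   /* power of 2: capability - 1 is binary 1111 */
    unsigned long hashes[] = {1, 11, 12345, 98765432};
    for (size_t i = 0; i < sizeof(hashes) / sizeof(hashes[0]); i++)
        assert((hashes[i] & (capability - 1)) == (hashes[i] % capability));
    printf("mask and modulo agree\n");
    return 0;
}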