Hash table, hash bucket data structure and analyzing the use of hash bucket in HashMap source code

1, Hash table

1. Hash table concept

You can get the elements to be searched directly from the table at one time without any comparison. If a storage structure is constructed to establish a one-to-one mapping relationship between the storage location of an element and its key code through a function (hashFunc), the element can be found quickly through this function. Its storage space is called hash table. The time complexity of addition, deletion, query and modification is O(1).

When adding to this structure:

1. Insert element
According to the key of the element to be inserted, this function calculates the storage location of the element and stores it according to this location.
2. Search element
Carry out the same calculation for the key of the element, take the obtained function value as the storage location of the element, and compare the elements according to this location in the structure. If the key codes are equal, the search is successful.

This method is called hash (hash) method. The conversion function used in hash method is called hash (hash) function, and the structure constructed is called hash table (or Hash list).

For example: data set {1, 7, 6, 4, 5, 9};

The hash function is set to: hash (key) = key% capacity; Capacity is the total size of the underlying space of the storage element.


Using this method to search does not need to compare multiple key codes, so the search speed is relatively fast and the time complexity is O(1). Question: according to the above hash method, if there are already elements in subscript 4, what problems will occur when inserting element 44 into the set?

2. The concept of conflict

Conflict means that different keyword keys get the same position through the same hash function. This phenomenon is called hash conflict or hash collision.

Data elements with different keys and the same hash address are called "synonyms".

3. Conflict avoidance and resolution

First of all, we need to make it clear that the capacity of the underlying array of our hash table is often less than the actual number of keywords to be stored, which leads to a problem. The occurrence of conflict is inevitable, but what we can do is to reduce the conflict rate as much as possible.

There are two ways to avoid conflicts. One is to set a reasonable hash function, and the other is to adjust the load factor. The greater the load factor, the greater the conflict rate.

3.1 ways to avoid conflict 1 - Design of hash function

Design principle of hash function:
1. The definition field of hash function must include all keys to be stored. If the hash table allows m addresses, its value field must be between 0 and m-1.
2. The address calculated by hash function can be evenly distributed in the whole space.
3. Hash functions should be simple.

Common hash function settings:

  1. Direct customization method – (common)
    Take a linear function of the keyword as the Hash address: Hash (Key) = A*Key + B advantages: simple and uniform disadvantages: you need to know the distribution of keywords in advance. Usage scenario: it is suitable for finding small and continuous situations.
  2. Division and remainder method – (common)
    Let the number of addresses allowed in the hash table be m, take a prime number p not greater than m but closest to or equal to m as the divisor, and convert the key code into a hash address according to the hash function: hash (key) = key% p (p < = m)
  3. Square middle method – (understand)
    Assuming that the keyword is 1234, its square is 1522756, and the middle three bits 227 are extracted as the hash address; For another example, if the keyword is 4321, its square is 18671041. It is more suitable to extract the middle three bits 671 (or 710) as the square median of the hash address: the distribution of keywords is not known, and the number of bits is not very large.
  4. Folding method – (understand)
    The folding method is to divide the keyword from left to right into several parts with equal digits (the digits of the last part can be shorter), then overlay and sum these parts, and take the last few digits as the hash address according to the length of the hash table.
    The folding method is suitable for the situation that there is no need to know the distribution of keywords in advance and there are many keywords.
  5. Random number method – (understand)
    Select a random function and take the random function value of the keyword as its hash address, that is, H(key) = random(key), where random is a random number function. This method is usually used when the keyword length is different.
  6. Mathematical analysis – (understanding)
    There are n d digits, and each digit may have r different symbols. The frequency of these r different symbols may not be the same, and they may appear in a certain place
    Some bits are evenly distributed, and each symbol has equal opportunities. In some bits, it is unevenly distributed, and only some symbols often appear. According to the size of the hash table, several bits in which various symbols are evenly distributed can be selected as the hash address. For example:

    Suppose we want to store the employee registration form of a company. If we use the mobile phone number as the keyword, it is very likely that the first seven digits are the same. Then we can select the last four digits as the hash address. If such extraction is prone to conflict, we can also reverse the extracted numbers (for example, 1234 is changed to 4321), shift the right ring (for example, 1234 is changed to 4123), shift the left ring The first two numbers and the last two numbers are superimposed (for example, 1234 is changed to 12 + 34 = 46).

Digital analysis method is usually suitable for dealing with the situation of large number of keywords, if the distribution of keywords is known in advance and several bits of keywords are evenly distributed.

Note: the more sophisticated the hash function is designed, the lower the possibility of hash conflict, but hash conflict cannot be avoided.

3.2 way to avoid conflict 2 - adjustment of load factor


Rough demonstration of the relationship between load factor and conflict rate:

Therefore, when the conflict rate reaches an intolerable level, we need to reduce the conflict rate in disguise by reducing the load factor.

Given that the number of existing keywords in the hash table is immutable, all we can adjust is the size of the array in the hash table.

Summary: calculation formula of load factor: the number of data currently stored in the table / the size of the table. Therefore, if you want to reduce the load factor, you can only increase the length of the table.

3.3 conflict resolution method 1 - closed hash

Closed hash: also known as open addressing method. In case of hash conflict, if the hash table is not full, it means that there must be an empty position in the hash table, then the key can be stored in the "next" empty position in the conflict position. There are two ways to find the next empty closed hash:

  1. Linear detection
    For example, in the above scenario, you need to insert element 44. First calculate the hash address through the hash function with the subscript of 4. Therefore, 44 should be inserted in this position theoretically, but the element with the value of 4 has been placed in this position, that is, hash conflict occurs.

Linear detection: start from the conflicting position and detect backward in turn until the next empty position is found.

Insert operation:
1. Obtain the position of the element to be inserted in the hash table through the hash function.
2. If there is no element in this position, insert the new element directly. If there is a hash conflict in this position, use linear detection to find the next empty position and insert the new element.

When using closed hash to deal with hash conflicts, the existing elements in the hash table cannot be deleted physically. If the elements are deleted directly, the search of other elements will be affected. For example, if element 4 is deleted directly, 44 searching may be affected. Therefore, linear detection uses the marked pseudo deletion method to delete an element.

Disadvantages: it will pile up conflicting elements.

  1. Secondary detection
    Secondary detection in order to avoid the disadvantage of linear detection, the method of finding the next empty position is as follows:

    Or minus the square of i. i is the number of conflicts, i starts from 1.

For example:
When the hash table is as shown in the figure below, 44 should be inserted, but there are already elements in the 4 subscript. This is the first conflict, so calculate Hi=(4+1)%10=5.

At this time, insert 24 again, but the 4 subscript already has elements, which is the first conflict. Find another subscript. At this time, there are elements in the 5 subscript, which is the second conflict. If there is no element in the 6 subscript, then Hi=(4+4)%10=8, then 24 is put into the 8 subscript.

The research shows that when the length of the table is a prime number and the table loading factor A does not exceed 0.5, the new table entry can be inserted, and any position will not be explored twice. Therefore, as long as there are half empty positions in the table, there will be no problem of full table. When searching, you can not consider the situation that the table is full, but when inserting, you must ensure that the loading factor a of the table does not exceed 0.5. If it exceeds, you must consider increasing the capacity.

Therefore: the biggest defect of closed hash is the low space utilization, which is also the defect of hash.

3.4 conflict resolution method 2 - hashing and hashing (important)

The open hash method, also known as the chain address method (open chain method), first uses the hash function to calculate the hash address of the key code set. The key codes with the same address belong to the same subset. Each subset is called a bucket. The elements in each bucket are linked through a single linked list, and the head node of each linked list is stored in the hash table.

In fact, the hash bucket can be regarded as transforming the search problem of large sets into the search problem of small sets. If the conflict is serious, it means that the search performance of small sets is actually poor. At this time, we can continue to transform the so-called small set search problem, for example:

  1. Behind each bucket is another hash table
  2. Behind each bucket is a search tree (red black tree)

Illustration of chain address method: when there is a hash bucket under a subscript, the value of the subscript corresponding to the hash table is the address of the chain header node. If there is no hash bucket, it is null. Each node is composed of key, value and next.

Detailed operation:
Because the hash bucket is used to solve the conflict, it is necessary to create a linked list array composed of nodes under each subscript, Node[] array;, And usedSize is used to record the number of nodes in the array in order to find the load factor. Each hash bucket is composed of nodes, so the node class can be defined into the hash bucket class.

put and resize methods: first calculate the corresponding subscript according to the designed hash function, find the corresponding subscript and then traverse the linked list of the subscript, * * if there are nodes with the same key value, update the value in the key value and exit** If not, create a hash bucket, set the cur node to start from the first node, and insert a new node with header interpolation. Of course, after insertion, calculate whether the load factor exceeds 0.75. If it exceeds 0.75, expand the capacity** After capacity expansion, the length of the array will become longer. Therefore, if the calculation result of each hash bucket is different from the original according to the new hash function, it is necessary to re hash. (resize method) * * after all hash buckets are re hashed, the reference of the new array will be assigned to the reference of the old array.

get method: first find the corresponding subscript according to the hash function, and then find out whether there is a hash bucket with the same value from the subscript. If so, return the value of the hash bucket, otherwise return - 1.

Overall Code:

public class HashBucket {

    static class Node {
        public int key;
        public int val;
        public Node next;

        public Node(int key, int val) {
            this.key = key;
            this.val = val;
        }
    }

    private Node[] array;
    public int usedSize;

    public HashBucket() {
        this.array = new Node[10];
        this.usedSize = 0;
    }


    public void put(int key,int val) {
        //1. Determine subscript
        int index = key % this.array.length;
        //2. Traverse the linked list of this subscript
        Node cur = array[index];
        while (cur != null) {
            //Update val
            if(cur.key == key) {
                cur.val = val;
                return;
            }
            cur = cur.next;
        }
        //3. cur == null the linked list of the current array subscript does not need a key
        Node node = new Node(key,val);
        node.next = array[index];
        array[index] = node;
        this.usedSize++;
        //4. Judge whether the current load factor is exceeded
        if(loadFactor() >= 0.75) {
            //Capacity expansion
            resize();
        }
    }

    public int get(int key) {
        //In what way it is stored, it is retrieved in what way
        int index = key % this.array.length;
        Node cur = array[index];
        while (cur != null) {
            if(cur.key == key) {
                return cur.val;
            }
            cur = cur.next;
        }
        return -1;//
    }

    public double loadFactor() {
        return this.usedSize*1.0 / this.array.length;
    }

    public void resize() {
        //Create a new 2x array yourself
        Node[] newArray = new Node[2*this.array.length];
        //Traverse the original hash bucket
        //Outermost loop control array subscript
        for (int i = 0; i < this.array.length; i++) {
            Node cur = array[i];
            Node curNext = null;
            while (cur != null) {
                //Record cur next
                curNext = cur.next;
                //Subscript in new array
                int index = cur.key % newArray.length;
                //Perform head insertion
                cur.next = newArray[index];
                newArray[index] = cur;
                cur = curNext;
            }
        }
        this.array = newArray;
    }


    public static void main(String[] args) {
        HashBucket hashBucket = new HashBucket();
        hashBucket.put(1,1);
        hashBucket.put(4,4);
        hashBucket.put(14,14);
        hashBucket.put(24,24);
        hashBucket.put(34,34);
        hashBucket.put(44,44);
        hashBucket.put(54,54);
        hashBucket.put(64,64);

        System.out.println(hashBucket.get(54));
        System.out.println("ffafas");
    }
}

4. Hash table lookup success and lookup failure

Search success and search failure are obtained after filling the data into the hash table.

4.1 finding successful solutions

A hash table is shown in the figure below:

Suppose you want to find 4 now, according to the hash function hash (key) = key% array Length, the 4 subscript is found, because the data under the 4 subscript is 4, the search is successful once, and the calculation method of the number of successful searches of 1 and 9 data is also the same.
If you want to find 24 at this time, first find the subscript of 4 according to the hash function, but the data of the subscript of 4 is 4 instead of 24. Calculate and find it once. At this time, move back one unit and find that 44 is not the 24 we are looking for, then it is the second time. At this time, move back one unit and find that it is 24, which is the third time.

Therefore, the average search length of successful search in the figure above: (1 + 1 + 2 + 3 + 1) / 5 = 8 / 5.

If the value of the 9 subscript is not the data to be found, it will be returned to the 0 subscript to start the search, and the search will be counted once after the 9 subscript fails to be found.

4.2 finding failed method

A hash table is shown in the figure below:

If data 1 is found for the first time, hash (key) = key% array The subscript of length is 1. Assuming that the data in subscript 1 is not the data 1 we are looking for, if the search fails once, move back one unit to find it. At this time, if the value of subscript 2 is empty, it indicates that the search has failed twice.

If the data 44 is found at this time and the subscript is 4 according to the hash function, but the data under the subscript 4 is not 44 at this time, it is not guaranteed to search on the premise of successful search. Therefore, move back one unit to find it. At this time, it is found that the data of subscript 5 is 44, and the number of search failures starts to be calculated from this time. Let's still assume that the current 5 subscript fails to find. If the search fails once, move back one unit to find it. Similarly, if the value of subscript 6 is not 44 and the search fails, move back one unit to find it. After reaching subscript 7, if the data of the subscript is found to be empty, the search must fail. This time, the search is also counted as one time. Therefore, the length of search failure adds up to three times.

Note: the operation of finding failure is to find the next empty place after finding the value to be found.

2, Using generics to realize the bottom implementation of open hash and open hash in the source code

1. Implement open hash with generics

If a hash table is implemented with a generic type, its key and val are reference types and cannot be directly compared with the greater than or less than sign as above.

Step: first define a class Person and create a constructor with one parameter. Then change the HashBuck class into a generic class, directly add < K, V > after the class name, and the value of each subscript in the hash table is a linked list, which is composed of nodes. Therefore, set public node < K, V > [] array = (node < K, V > []) new node [10];.

Because our key is a reference type, we can't directly use the hash function to find the subscript, so how to convert a reference into a value to find the subscript?

At this time, our own class (key) should implement the equals and hashcode methods at the same time. The hashcode method is to convert a reference type into a value, so as to find the corresponding subscript to be stored. The equals method determines whether two references are the same.

For example:
When = = is used to compare, the address is compared, which is false.

class Student {
    public int id;

    public Student(int id) {
        this.id = id;
    }
}
public class Test {
    public static void main(String[] args) {
        Student student1 = new Student(12);
        Student student2 = new Student(12);
        System.out.println(student1==student2);
    }
}
//Print result: false

However, after rewriting the equals method, the following example is equivalent to judging whether they are the same person, because their IDs are the same.

class Student {
    public int id;

    public Student(int id) {
        this.id = id;
    }
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Student student = (Student) o;
        return id == student.id;
    }

    @Override
    public int hashCode() {
        return Objects.hash(id);
    }
}
public class Test {
    public static void main(String[] args) {
        Student student1 = new Student(12);
        Student student2 = new Student(12);
        System.out.println(student1.equals(student2));
    }
}
//Print result: true

Interview questions:

  1. What is the difference between hashcode and equals?
    A: in HashMap, hashcode is used to locate the subscript position where the current key value should be stored. equals is used to traverse the linked list and compare which key is the same after hashcode locates a subscript.
  2. Why should hashcode and equals appear at the same time?
    A: in HashMap, the stored subscript must be located through hashcode first, but if you want to obtain a key, you need to use the equals method. Therefore, hashcode and equals are used together in HashMap. Therefore, if you want to store your own custom data types in HashMap in the future, you must re hashcode and equals methods in the classes of custom data types.
  3. When the equals of two data are the same, are hashcodes the same? If the hashcodes of the two data are the same, must the equals be the same? If the hashcodes of two data are different, are equals necessarily the same?
    A: when the equals of two data are the same, the hashcode must be the same. If the hashcodes of two data are the same, equals is not necessarily the same. If the hashcodes of two data are different, the equals must be different.

Overall Code:

class Person {
    public String id;

    public Person(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Person person = (Person) o;
        return Objects.equals(id, person.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id);
    }
}
class HashBuck<K,V> {
    static class Node<K,V> {
        public K key;
        public V val;
        public Node<K,V> next;

        public Node(K key, V val) {
            this.key = key;
            this.val = val;
        }
    }

    public Node<K,V>[] array = (Node<K,V>[])new Node[10];
    public int usedSize ;

    public void put(K key,V val) {
        int hash = key.hashCode();
        int index = hash%array.length;
        Node<K,V> cur = array[index];
        while(cur!=null) {
            if(cur.val.equals(val)) {
                return;
            }
            cur=cur.next;
        }
        Node<K,V> node = new Node<>(key,val);
        node.next=array[index];
        array[index]=node;
        usedSize++;
        if(loadFactor()>=0.75) {
            resize();
        }
    }
    public void resize() {
        Node<K,V>[] newArray = (Node<K,V>[])new Node[2*array.length];
        for (int i = 0; i < array.length; i++) {
            Node<K,V> cur = array[i];
            while(cur!=null) {
                int hash = cur.key.hashCode();
                int index = hash% newArray.length;
                Node<K,V> curNext = cur.next;
                cur.next=newArray[index];
                newArray[index]=cur;
            }
        }
        array=newArray;
    }

    public double loadFactor() {
        return usedSize*1.0/array.length;
    }

    public V get(K key) {
        int hash = key.hashCode();
        int index = hash%array.length;
        Node<K,V> cur = array[index];
        while(cur!=null) {
            if(cur.key.equals(key)) {
                return cur.val;
            }
            cur=cur.next;
        }
        return null;
    }
}
public class Test {
    public static void main(String[] args) {
        Person person1 = new Person("123");
        Person person2 = new Person("123");
        HashBuck<Person,String> hashBuck = new HashBuck<>();
        hashBuck.put(person1,"zjr");
        System.out.println(hashBuck.get(person2));
    }
}

2. The underlying implementation of open hash on the source code

The basic principle of hash bucket has been mentioned in Section 3.4, so how does it implement open hash in Java?
Note: the bottom layers of HashMap and HashSet in Java are hashed.

When storing:
1. Find out whether there is the same key in the linked list of the current location.
2. When the array exceeds 64 and a linked list exceeds 8, the linked list becomes a red black tree for processing, and its search efficiency is faster.
3,JDK1. After 8, when there are elements inserted in the array, the linked list implements the tail insertion method.

Click HashMap in Java to learn its source code:
1,

A node class is defined in the HashMap class, and the map is implemented Entry < K, V >, indicating that there are nodes in HashMap.

2. The default initial capacity is 16.
3. The maximum length of the array is 2 ^ 30

4. The default load factor is 0.75f

5. The premise of becoming a red black tree is that the size of the array exceeds 64

6. The array name in HashMap is table, and the initial length is 0.

7. Calling the parameterless construction method is to assign a value to the load factor.

2.1 HashMap bottom interview questions

If you want to know the interview questions, you must understand the following figure and the source code of the reference part.

First put:

Second put:

How to adjust in case of conflict:

Corresponding to the first question:

1. If new HashMap(19), how big is the bucket array? Answer: 2 ^ 5 = 32.
2. When will HashMap open the bucket array to occupy memory? A: the first time.
3. When will hashmap be expanded? Answer: when the load factor is 0.75f.
4. What happens when two objects have the same hashcode? Answer: hash conflict or collision.
5. If the hashcode of two keys is the same, how do you get the value object? A: start traversing the linked list at the array position of hashcode.
6. Do you understand the problem of resizing HashMap? Answer: re hash.

Keywords: data structure

Added by rxero on Wed, 26 Jan 2022 22:15:50 +0200