Analysis of Java collection source code -- HashMap

1, Member variables in HashMap

Because HashMap has many member variables, they are listed here first so the rest of the discussion is clearer.

// Default initial capacity of the table (must be a power of two)
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // 16
// Maximum capacity
static final int MAXIMUM_CAPACITY = 1 << 30;
// Default load factor (used to compute the resize threshold)
static final float DEFAULT_LOAD_FACTOR = 0.75f;
// Treeify threshold of a bucket: when a bucket's linked list grows longer than this value
// during insertion, the list is converted into a red-black tree
static final int TREEIFY_THRESHOLD = 8;
// Untreeify threshold of a bucket: when a resize() recomputes the storage positions and the
// number of nodes remaining in a tree bucket drops below this value, the red-black tree is
// converted back into a linked list
static final int UNTREEIFY_THRESHOLD = 6;
// Minimum table capacity for treeification: buckets are only converted into red-black trees
// when the table capacity is at least this value; otherwise, a bucket with too many elements
// simply triggers a resize instead of being treeified.
// To avoid conflicts between resizing and treeification, this value must be at least 4 * TREEIFY_THRESHOLD
static final int MIN_TREEIFY_CAPACITY = 64;
// The bucket array (not serialized)
transient Node<K,V>[] table;
// Cached view of all key-value pairs (not serialized)
transient Set<Map.Entry<K,V>> entrySet;
// Number of key-value pairs in the map (not serialized)
transient int size;
// Number of structural modifications to the map (not serialized)
transient int modCount;
// Next size value at which to resize (capacity * load factor)
int threshold;
// Load factor of the hash table
final float loadFactor;

Let's briefly explain the meaning of these parameters, and then analyze their functions.

The HashMap class has the following main member variables:

  • transient int size;
    • Records the number of key-value pairs currently stored in the map.
  • final float loadFactor;
    • Load factor, which measures how full the HashMap is allowed to get. The default value is 0.75f (static final float DEFAULT_LOAD_FACTOR = 0.75f;).
  • int threshold;
    • Critical value. When the actual number of key-value pairs exceeds the threshold, the HashMap resizes. threshold = capacity * loadFactor.
  • In addition to the member variables above, there is another closely related concept in HashMap: capacity.
    • The capacity of the table. If not specified, the default capacity is 16 (static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;).

After reading this, you may still be a little confused. What is the relationship between size and capacity? Why define both variables? And what do loadFactor and threshold actually do?

2, HashMap expansion (resizing)

The difference between size and capacity in HashMap is actually simple to explain. Think of a HashMap as a "bucket": capacity is the number of elements the bucket can currently hold, and size is how many elements have already been put into it. Let's look at the following code:

    Map<String, String> map = new HashMap<String, String>();
    map.put("name", "ssl");

    // capacity() is a package-private method of HashMap, so we call it via reflection
    Class<?> mapType = map.getClass();
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

    Field size = mapType.getDeclaredField("size");
    size.setAccessible(true);
    System.out.println("size : " + size.get(map));

We create a new HashMap, put one element into it, and then print capacity and size via reflection. The output is: capacity : 16, size : 1

By default, the capacity of a HashMap is 16. The main advantage of choosing a power of two such as 16 is that a bitwise AND can replace the modulo operation when computing bucket indexes, which makes hashing more efficient.

Why did I say that capacity is the number of elements the bucket can hold "currently"? Because HashMap has a resizing mechanism. When a HashMap is first initialized, its capacity is 16 by default; when the resize condition is met, the capacity grows from 16 to 32.

We know that one of HashMap's overloaded constructors accepts an initialCapacity, so let's set it and see the result.

    Map<String, String> map = new HashMap<String, String>(1);

    Class<?> mapType = map.getClass();
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

    Map<String, String> map = new HashMap<String, String>(7);

    Class<?> mapType = map.getClass();
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));


    Map<String, String> map = new HashMap<String, String>(9);

    Class<?> mapType = map.getClass();
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

Executing the above three snippets outputs: capacity : 1, capacity : 8, capacity : 16.

That is, by default, the capacity of a HashMap is 16. However, if the user specifies a capacity through the constructor, HashMap selects the smallest power of two that is greater than or equal to that number as the capacity (1 -> 1, 7 -> 8, 9 -> 16).
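
Under the hood, this rounding is done by a helper along the lines of JDK 8's tableSizeFor, which smears the highest set bit downward and then adds one. Here is a simplified sketch based on the JDK 8 source:

// Rounds a requested capacity up to the next power of two (sketch of JDK 8's tableSizeFor)
static int tableSizeFor(int cap) {
    int n = cap - 1;      // subtract 1 first so an exact power of two is not doubled
    n |= n >>> 1;         // copy the highest set bit into the bit below it
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;        // now every bit below the highest set bit is 1
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

With this, tableSizeFor(1) = 1, tableSizeFor(7) = 8 and tableSizeFor(9) = 16, matching the output above.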

Here is a small suggestion: when creating a HashMap, specify its initial size whenever possible, especially when you know how many elements the map will store (this is also recommended by the Alibaba Java Development Manual).

As mentioned earlier, HashMap has a resizing mechanism: when the resize condition is met, the capacity grows from 16 to 32, then 64, 128, and so on.

So, what does this expansion condition mean?

The condition is simple: when the number of elements (size) in the HashMap exceeds the threshold, the map is resized automatically.

In HashMap, threshold = loadFactor * capacity.

loadFactor is the load factor, which indicates how full the HashMap is allowed to get. The default value is 0.75f. One advantage of 0.75 is that it is exactly 3/4, and the capacity is always a power of two, so their product is always an integer.

For a default HashMap, a resize is triggered once its size exceeds 12 (16 * 0.75).

    Map<String, String> map = new HashMap<>();
    map.put("hollis1", "hollischuang");
    map.put("hollis2", "hollischuang");
    map.put("hollis3", "hollischuang");
    map.put("hollis4", "hollischuang");
    map.put("hollis5", "hollischuang");
    map.put("hollis6", "hollischuang");
    map.put("hollis7", "hollischuang");
    map.put("hollis8", "hollischuang");
    map.put("hollis9", "hollischuang");
    map.put("hollis10", "hollischuang");
    map.put("hollis11", "hollischuang");
    map.put("hollis12", "hollischuang");
    Class<?> mapType = map.getClass();

    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

    Field size = mapType.getDeclaredField("size");
    size.setAccessible(true);
    System.out.println("size : " + size.get(map));

    Field threshold = mapType.getDeclaredField("threshold");
    threshold.setAccessible(true);
    System.out.println("threshold : " + threshold.get(map));

    Field loadFactor = mapType.getDeclaredField("loadFactor");
    loadFactor.setAccessible(true);
    System.out.println("loadFactor : " + loadFactor.get(map));

    map.put("hollis13", "hollischuang");
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

    Field size = mapType.getDeclaredField("size");
    size.setAccessible(true);
    System.out.println("size : " + size.get(map));

    Field threshold = mapType.getDeclaredField("threshold");
    threshold.setAccessible(true);
    System.out.println("threshold : " + threshold.get(map));

    Field loadFactor = mapType.getDeclaredField("loadFactor");
    loadFactor.setAccessible(true);
    System.out.println("loadFactor : " + loadFactor.get(map));

When the number of elements in the HashMap reaches 13, the capacity will expand from 16 to 32.

HashMap also provides a constructor that takes two parameters, initialCapacity and loadFactor, to initialize both the capacity and the load factor. However, modifying loadFactor is generally not recommended.

3, The hash function in HashMap

Hash:

A hash algorithm transforms an input of arbitrary length into an output of fixed length. This transformation is a compressing mapping: the space of hash values is usually much smaller than the space of inputs, and different inputs may hash to the same output, so it is impossible to uniquely recover the input from its hash value. In short, a hash function compresses a message of any length into a fixed-length message digest.

All hash functions share one basic property: if two inputs produce different hash values under the same hash function, the inputs must be different. However, if two inputs produce the same hash value, the inputs are not necessarily the same.

When two different inputs produce the same hash value under the same hash function, this is called a collision.
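
For example, it is a well-known property of Java's String.hashCode that the distinct strings "Aa" and "BB" collide:

    // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112
    // 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
    System.out.println("Aa".hashCode()); // prints 2112
    System.out.println("BB".hashCode()); // prints 2112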

Common hash functions are as follows:

Direct addressing method: directly use the key k, or k plus a constant (k + c), as the hash address.

Digit analysis method: pick the digits of the key that are most evenly distributed and use them as the hash address.

Division (remainder) method: divide the key k by a number p that is not greater than the table length m, and use the remainder as the hash address.

Folding method: split the key into several parts of equal length (the last part may be shorter), add the parts together, discard any carry out of the highest digit, and use the result as the hash address.

Mid-square method: if the digits of the key are unevenly distributed, square the key first and then take the required number of middle digits as the hash address.

Pseudo-random number method: use a pseudo-random function of the key as the hash address.

Collisions were described above. Important measures of a hash function's quality are its collision probability and how collisions are resolved; no hash function can avoid collisions completely. Common collision resolution methods are:

  • Open addressing
    • When a collision occurs, probe for the next empty slot in the hash table. As long as the table is large enough, an empty slot can always be found and the record stored there.
  • Separate chaining (the chain address method)
    • Each slot of the hash table acts as the head of a linked list, and all elements whose hash address is i form one list. When a collision occurs, the new key is appended to the list headed at that slot. This is the approach HashMap uses.
  • Rehashing (double hashing)
    • When a collision occurs, compute another address with a different hash function, and repeat until no collision occurs.
  • Public overflow area
    • Split the structure into a primary table and an overflow table, and put colliding elements into the overflow table.

Data structure of HashMap:

HashMap's underlying structure is an array in which each slot is the head of a linked list. Every element stored in this structure carries a pointer that links it to the next element in its list. Elements are assigned to different lists according to a characteristic of the element itself (its hash), and, conversely, that same characteristic is used to find the right list and then the right element within it. The method that computes the array index from an element's characteristic is the hash algorithm, i.e. the hash() function that is the protagonist of this article (together with the indexFor() function).

Source code analysis:

First, within the same JDK version, the hash methods of HashMap, Hashtable and ConcurrentHashMap are implemented differently, and there are also differences between JDK versions (Java 7 vs. Java 8). I'll try to cover them; I believe that after reading this article, you will fully understand the hash method.

Before looking at the code, let's do a simple analysis. As we know, the job of the hash method is to locate the position of a key-value pair in the array of linked lists based on its Key. That is, the input of the hash method should be a Key of type Object, and the output should be an array index of type int. If you were asked to design this method, what would you do?

It's actually simple: call the key's hashCode() method, which returns an integer, and take that integer modulo the capacity of the HashMap or Hashtable. That really is the basic principle. In Java 7's HashMap, however, the concrete implementation is split into two methods, int hash(Object k) and int indexFor(int h, int length), and for efficiency and other reasons it is a bit more involved.
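
A naive sketch of that basic principle (illustrative only, not the real JDK code; Math.floorMod is used to keep the index non-negative):

// Map a key to a bucket index in the simplest possible way
static int naiveIndexFor(Object key, int capacity) {
    int h = key.hashCode();              // may be any int, including negative values
    return Math.floorMod(h, capacity);   // always in [0, capacity)
}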

hash: this method mainly converts an Object into an integer.

indexFor: this method mainly converts the integer generated by hash into the subscript in the linked list array.

HashMap in Java 7

final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        // alternative hashing for String keys when hashSeed is enabled
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();
    // spread the higher bits downward to reduce collisions when only the low bits are used
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    // length is always a power of two, so this is equivalent to h % length
    return h & (length-1);
}

As mentioned earlier, the indexFor method converts the integer produced by hash into an index into the array of linked lists. So what does return h & (length-1); mean? It is simply taking the modulus: because length is a power of two, h & (length - 1) equals h % length. The bitwise AND (&) is used instead of the modulo operator (%) mainly for efficiency; a bitwise AND operates directly on the binary representation and is much cheaper than a division-based modulo.

Besides performance, another advantage of using the bitwise AND instead of modulo is that it neatly handles negative numbers. hashCode() returns an int, whose range is -2^31 to 2^31 - 1, i.e. [-2147483648, 2147483647], and taking the remainder of a negative number is awkward (in Java, % can return a negative result). With the bitwise AND this problem disappears: whether the hash value is positive or negative, length - 1 is always a positive number, so its sign bit (the highest bit, where "0" means "+" and "1" means "-") is 0. After ANDing the two numbers, the sign bit of the result must also be 0, so the result is always a non-negative index.
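
A small, illustrative check of both points (the method name is ours, not the JDK's):

// For a power-of-two length, h & (length - 1) matches the non-negative remainder,
// even when h is negative.
static void indexDemo(int h, int length) {
    int byAnd = h & (length - 1);
    int byMod = Math.floorMod(h, length);  // floorMod returns the non-negative remainder
    System.out.println(h + " -> " + byAnd + " (floorMod: " + byMod + ")");
}

// indexDemo(21, 16)  prints  21 -> 5 (floorMod: 5)
// indexDemo(-7, 16)  prints  -7 -> 9 (floorMod: 9)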

HashMap in Java 8

Before Java 8, HashMap and the other map classes based on it resolved collisions with separate chaining, using a singly linked list to store elements that map to the same index. In the worst case this degrades the performance of HashMap's get method from O(1) to O(n). To address this degradation under frequent collisions, Java 8 stores colliding elements in a balanced (red-black) tree instead of a linked list once a bucket grows large enough, which improves the worst case from O(n) to O(log n). I will cover the Java 8 optimizations in more depth in a later article.

If a malicious program knows which hash algorithm we use, then with a pure linked-list implementation it can send a large number of requests crafted to collide and then keep accessing those keys, keeping the HashMap busy with linear searches until it is effectively paralyzed; this is a denial-of-service (DoS) attack.

The principle of the hash function in Java 8 is basically the same as in Java 7, but the mixing step is simplified: only one XOR with a 16-bit right shift is performed, instead of the four shift-and-XOR mixes in Java 7. The principle remains the same.

static final int hash(Object key) {
    int h;
    // XOR the high 16 bits into the low 16 bits so that the high bits also influence the bucket index
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
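
In Java 8 there is no separate indexFor method; the index is computed inline in putVal/getNode as (n - 1) & hash, where n is the table length. The following sketch (the helper name spread is ours) shows why folding in the high bits matters for small tables:

// Hash codes 0x10000 and 0x20000 differ only above bit 15, so in a 16-slot table
// the raw low bits would put both keys into bucket 0; mixing in the high bits separates them.
static int spread(int h) {          // the same mixing step as Java 8's HashMap.hash
    return h ^ (h >>> 16);
}

// index = (n - 1) & hash, with n = 16:
// (16 - 1) & 0x10000         == 0    (16 - 1) & 0x20000         == 0    -> collision
// (16 - 1) & spread(0x10000) == 1    (16 - 1) & spread(0x20000) == 2    -> different buckets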

 

4, Why is the default initial capacity of HashMap 16

As a data structure, HashMap must perform a hash operation when putting an element, in order to compute the location where that element will be stored.

In essence, this hash operation takes the hashCode of the element's Key and reduces it modulo the Map's capacity. To make this reduction efficient, the JDK engineers replaced the modulo operation with a bitwise AND, which requires the Map's capacity to be a power of two.

For the default capacity, a value that is too large or too small would both be inappropriate, so 16 was chosen as a reasonable empirical value.
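
To see why the capacity must be a power of two for the AND trick to work, consider a hypothetical non-power-of-two length (illustrative snippet, not JDK code):

    // With length 16, (length - 1) = 0b1111 keeps the low four bits, so buckets 0..15 are all reachable.
    // With length 10, (length - 1) = 0b1001 masks out two bits, so only buckets 0, 1, 8 and 9
    // could ever be used, and the result no longer equals h % 10.
    for (int h = 0; h < 20; h++) {
        System.out.println(h + " -> " + (h & (10 - 1)) + "  (h % 10 = " + (h % 10) + ")");
    }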

To guarantee that the Map's capacity is always a power of two, HashMap enforces this in two places.

First, if the user specifies an initial capacity, HashMap rounds it up to the smallest power of two that is not less than it and uses that as the initial capacity.

Second, resizing always doubles the capacity: 4 becomes 8, 8 becomes 16, and so on.

5, Why is the default load factor of HashMap 0.75

HashMap is a key-value structure. To speed up queries and insertions, its underlying data structure is an array of linked lists.

Computing an element's position requires a hash algorithm, and HashMap resolves collisions with separate chaining. This approach has two extremes.

If the collision probability in a HashMap is high, it degenerates into a linked list (not literally, but its operations behave like those on a plain linked list). As we know, the biggest drawback of a linked list is slow lookup: elements must be traversed one by one from the head.

Therefore, to avoid a large number of hash collisions, the HashMap needs to be resized at an appropriate time.

The resize condition is that the number of elements reaches the critical value, which HashMap computes as:

critical value (threshold) = load factor (loadFactor) * capacity (capacity)

The load factor represents the maximum fullness the array is allowed to reach. This value should be neither too large nor too small.

If loadFactor is too large, for example 1, the probability of hash collisions becomes high, which greatly reduces query speed.

If loadFactor is too small, for example 0.5, the map resizes frequently and a lot of space is wasted.

Therefore, this value should lie between 0.5 and 1. A commonly cited mathematical argument puts the ideal value near log 2 (roughly 0.693).

In addition, to keep resizing efficient, the capacity of HashMap has a fixed requirement: it must be a power of two.

Therefore, with a loadFactor of 3/4, its product with the capacity is always an integer (e.g. 16 * 0.75 = 12, 32 * 0.75 = 24).

So, in general, modifying loadFactor is not recommended unless there is a special reason.

For example, if I know for certain that my Map will hold only 5 key-value pairs and will never change, I may consider specifying loadFactor.
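
A hedged illustration of that special case (the numbers are ours, chosen to match the example): with an initial capacity of 8 and a load factor of 1.0f, the threshold is 8 * 1.0 = 8, so a map holding only 5 entries never resizes.

    // Capacity 8 (a power of two) and loadFactor 1.0f give threshold = 8,
    // so 5 entries will never trigger a resize.
    Map<String, String> fixed = new HashMap<>(8, 1.0f);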

 

6, Why it is recommended to set the initial capacity of HashMap, and what value is appropriate

If we do not set an initial capacity, the HashMap may be resized many times as the number of elements grows. HashMap's resizing mechanism rebuilds the hash table on every resize, which significantly affects performance.

So, given the suggestion to specify an initial size when creating the collection, what value is appropriate when we create a HashMap?

Some people will naturally think: I'll just set it to the number of elements I plan to insert. For example, if I'm going to insert seven elements, I'll write new HashMap(7).

However, this is not ideal; for one thing, the capacity of a Map created this way is not 7.

When we use HashMap(int initialCapacity), HashMap does not directly use the initialCapacity we pass in as the capacity.

Instead, the JDK computes a "reasonable" value for us by default: the smallest power of two that is not less than the value we passed in.

In other words, when we create a HashMap with new HashMap(7), the JDK gives us a Map with capacity 8; with new HashMap(9), we get a Map with capacity 16.

This value looks reasonable, but in fact it is not, because the capacity HashMap computes from the user's input does not take loadFactor into account; it simply and mechanically rounds the number up to a power of two.

In other words, if we pass in 7, the JDK sets the capacity to 8; but as soon as the number of elements exceeds 8 * 0.75 = 6 (that is, when the seventh element is put), the HashMap resizes, which is obviously not what we want.

So, what value is more reasonable?

Here we can borrow from the implementation of the putAll method in JDK 8, whose approach is also used in Guava (version 21.0).

The calculation method of this value is:

return (int) ((float) expectedSize / 0.75F + 1.0F);

Therefore, when we clearly know the number of elements a HashMap will hold, setting the initial capacity to expectedSize / 0.75F + 1.0F is a relatively good choice for performance, at the cost of some extra memory.
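
A minimal helper along these lines (the name mirrors Guava's Maps.newHashMapWithExpectedSize, but this sketch is ours, not the Guava source, and it ignores edge cases such as very large sizes):

// Creates a HashMap large enough to hold expectedSize entries without resizing,
// using the expectedSize / 0.75 + 1 rule described above.
static <K, V> HashMap<K, V> newHashMapWithExpectedSize(int expectedSize) {
    int initialCapacity = (int) ((float) expectedSize / 0.75F + 1.0F);
    return new HashMap<>(initialCapacity);
}

// Usage: Map<String, String> map = newHashMapWithExpectedSize(7);  // capacity 16, threshold 12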

That said, this trades memory for performance, and the memory impact should be considered in real use; in most cases, though, memory is a relatively abundant resource.

Then again, does whether we set an initial capacity, and what value we choose, really have such a big impact? Not necessarily!

But isn't large-scale performance optimization precisely an accumulation of small optimizations like this one?

And if nothing else, writing Maps.newHashMapWithExpectedSize(7) in your future code may impress your colleagues and boss.

 

 
