Java Security Learning Notes -- a simple analysis of the source code of HashMap class using deserialization chain

preface

The HashSet and Hashtable used by both CC6 and CC7 chains of ysoserial deserialization vulnerability exploitation tool are hash table based data storage structures. When it comes to hash table, the HashMap class is the most used hash table storage structure in Java. At the same time, HashMap class is also the underlying implementation basis of HashSet class in CC6 chain, The addition, deletion, modification and query methods of HashSet are based on a HashMap class instantiated at the beginning of the constructor. The usage and structure of Hashtable class and HashMap class in CC7 chain are also very similar. It is meaningful to simply analyze the source code of HashMap.

testing environment

jdk1.8

Node internal static class

HashMap class implements the structure of key value data storage through such an internal class. Node class implements the Entry interface of Map class. Every element added is actually instantiating a node class. The key and vlue attributes in node class are assigned through the constructor. At the same time, a hash function stores the hash value of the element, This value is generated every time a function is added.

static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

        Node(int hash, K key, V value, Node<K,V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }

        public final K getKey()        { return key; }
        public final V getValue()      { return value; }
        public final String toString() { return key + "=" + value; }

        public final int hashCode() {
            return Objects.hashCode(key) ^ Objects.hashCode(value);
        }

        public final V setValue(V newValue) {
            V oldValue = value;
            value = newValue;
            return oldValue;
        }

        public final boolean equals(Object o) {
            if (o == this)
                return true;
            if (o instanceof Map.Entry) {
                Map.Entry<?,?> e = (Map.Entry<?,?>)o;
                if (Objects.equals(key, e.getKey()) &&
                    Objects.equals(value, e.getValue()))
                    return true;
            }
            return false;
        }
    }

hash() method

Hash method is called perturbation function. Its function is to make the hash result more uniform. The calculation method of perturbation function is to XOR the value obtained by shifting hashcode and unsigned hashcode to the right by 16 bits.

   static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

As for why the perturbation function is calculated in this way, let's briefly say that the perturbation function is designed to reduce hash conflict and make each element more evenly distributed. The representation range of int integer in java is - 2 ^ 31-2 ^ 31-1, that is - 2147483648-2147483647, while the initial capacity of HashMap is 16, and each expansion is twice the original capacity, In a short time, it will not exceed the 16th power of 2. In this way, the first 16 bits of int rarely participate in the calculation of hash value, which will make the distribution not very uniform. Suppose that hashCode is a 32-bit int integer and the last 16 bits are 1

1010 1011 0101 0101 1111 1111 1111 1111

The algorithm for calculating the subscript value of elements in hashMap is hash & (length-1). The capacity of hashMap is always n times of 2. The binary form of length-1 is fixed. The first bit is 0 and the following bits are 1. Suppose that lengt is 16,

16-1 binary = 01111. In this way, the hashCode directly and 01111 is 01111 unchanged. All 32-bit int integers and the last 16 bits are 1 have the same hashCode after calculating the hash value, the distribution is uneven, and the hash conflict probability is greater.

hashCode() method

When calling the perturbation function, you need to call the hashcode method of the Object represented by the key value. If the hashcode method is not rewritten in the class, you will call the hashcode method of the Object class. This method has only one method declaration in the source code of the Object class, but it will actually return an int integer, which is calculated according to the position of the class in memory.

Look at the hashCode method of a String class:

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

The cyclic char array takes out the characters of the string. The first cycle h is the ascii code of the first character, and the second cycle h is 31 * the ascii code of the first character + the ascii code of the second character. Then repeat the steps of the second cycle, and finally return h as a hash value.

putValue method

public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

The commonly used put method is to wrap the putValue method. Look directly at the putValue method

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

The first if judgment is that the hashMap has been initialized. After initialization, the default length of hashMap is 16 (1 shifts 4 bits to the left). Why 16? This is also an interesting place.

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

The second if judgment section focuses on if ((P = tab [i = (n - 1) & hash]) = = null). The function of this line is to judge whether the subscript position is in hash conflict. If it is in conflict, HashMap uses chain address method + red black tree to solve the conflict. How can this subscript be obtained? The hash value and the length value of HashMap are used for bit and operation, and the hash value is generally much larger than the length of HashMap. Here, it is actually taking remainder. Why is it taking remainder? According to the principle of bit operation, an integer moving n bits to the right is divided by 2 N times, and the removed n bits are the remainder, for example: 19% 8 = 3

8 is the third time of 2, the binary of 19 is 10011, the right shift of 3 bits is 10011, 011 is removed, and 011 is converted into hexadecimal is 3, so we can directly calculate the remainder result from the removed bits. Binary bits 1000 of 8, binary bits 0111 of 7, 7 & 19 = 3. It is found that the first bit of binary of N times of 2 is always 1, and the rest is 0, As long as we subtract one, it will become that the first bit is zero, and all the remaining bits are 1. In this way, we can get the remainder from the digital sum of the modulo. That's why the default capacity of HashMap is 16, and the capacity of HashMap is expanded n times of 2 every time.

Then why not use modulo symbols directly? This is also used in some algorithm problems, that is, for computers, bit operations are faster than addition, subtraction, multiplication and division.

This if statement judges whether there is an element in the subscript position. If there is no element, it will add an element to the subscript position. If there is an element, it will enter the code in else. First, it will judge whether the key value is the same. If it is the same, it will modify the value value. If it is different, it is equivalent to a hash conflict. At this time, the linked list will be used to store the elements with the same subscript, Moreover, in JAVA8, when the length of the linked list is greater than 8, the linked list will be converted into a red black tree to store elements. The specific conversion is not analyzed, which is related to the algorithm.

summary

After a brief analysis of the methods often contacted in the source code of HashMap, and after analyzing the putVal function, we can also deduce the general implementation method for the implementation of get method. We also have a further understanding of HashMap, HashSet and hashtable. If there are other needs to be analyzed in depth later.

Keywords: Java security Cyber Security Information Security

Added by jek1134 on Tue, 25 Jan 2022 19:30:32 +0200