[Special Topic on algorithm technology] how to implement consistent hash ing in Java

History of consistency hash

[Consistent Hashing algorithm] was proposed in the paper Consistent hashing and random trees as early as 1997. At present, it is more and more widely used in cache system;

Purpose of consistency hash

The consistent Hash algorithm is a commonly used algorithm in distributed systems. The consistent Hash algorithm solves the problem of poor scalability of the ordinary remainder Hash algorithm, and can ensure that as many requests as possible hit the server to which they were originally routed in the case of online and offline servers.

Problem background

In business development, we often persist the data into the database. If we need to read these data, in addition to reading directly from the database, in order to reduce the access pressure of the database and improve the access speed, we introduce more cache to access the data.

Distributed cache

Distributed cache, which stores data of different objects on different machines. In order to realize the load balancing of these cache machines, there are generally two Hash algorithms to evenly allocate data node storage: ordinary Hash algorithm

General Hash algorithm

Defects of Hash mold taking method

In a Redis cluster, if we pass a piece of data through the Hash, and then take the modulus according to the number of cluster nodes to get which node should be placed, the disadvantage of this approach is that after capacity expansion (adding a node), a large number of caches fail.

Case analysis of ordinary Hash

For example, if you have N cache servers (hereinafter referred to as cache), how can you map an object to N caches? You may use the following general method to calculate the hash value of the object, and then map it evenly to N caches;

hash(object)%N 

If everything is running normally, consider the following two situations;

  • When a cache server m goes down (this must be considered in practical application), all objects mapped to cache m will become invalid. What to do? You need to remove cache m from the cache. At this time, the cache is N-1, and the mapping formula becomes hash(object)%(N-1);

  • Due to heavy access, you need to add a cache. At this time, the cache is N+1, and the mapping formula becomes hash(object)%(N+1);

  • This means that suddenly almost all caches fail. For the server, this is a disaster, and the flood of access will directly rush to the background server; (causing cache avalanche mechanism)

Consistent Hash algorithm

Consistent hash algorithm is a method to solve this kind of problem. It can ensure that when the machine increases or decreases, the impact on the hit probability of cache access is minimized. Let's talk about the specific process of the consistency hash algorithm in detail.

  • The consistency hash algorithm is implemented by a data structure called consistency hash ring. The starting point of this ring is 0 and the end point is 2 ^ 32 - 1. The starting point is connected with the end point. The integers in the middle of the ring are distributed counterclockwise, so the integer distribution range of this ring is [0,2 ^ 32-1]

  • The whole hash value space is organized into a virtual ring. The IP address or host name of the node is used as the keyword for hash calculation, and the result is used as the position of the node in the ring. After the data passes through the hash, find the nearest node clockwise for storage, as shown in the hash position of data, which should be stored in node2.

  • Compared with Hash modulus, the advantage of consistency Hash algorithm is that less cache data is affected after capacity expansion. If n nodes are expanded to n+1, the number of caches affected is 0~1/n, that is, the cache of one node is invalidated at most.

  • Its disadvantage is that the cache is unevenly distributed on each node. After all, the hash value is random, and the position of the node in the ring is also random.

Improved consistent Hash algorithm

Consistency Hash algorithm + virtual node

In order to solve the problem of uneven data distribution, we introduce the concept of virtual node. We calculate multiple hashes for each service node, and place one service node at each calculation result location, which is called virtual node. The data located to the virtual node is stored on the real node corresponding to the virtual node, so that the data distribution is relatively uniform. The more virtual nodes, the more uniform the distribution.

After introducing "virtual node", the mapping relationship is transformed from {object - > node} to {object - > virtual node}. Mapping relationship when querying the cache where the object is located

Generally, the number of virtual nodes is more than 32, and the number of dubbo is 160.

Handle the increase or decrease of the machine

For online business, it is common to increase or decrease the deployment of one machine.

For example, add the deployment of machine c4 and add machine c4 between machines c3 and c2 in the hash ring. At this time, only the objects between machines c3 and c4 need to be reassigned to a new machine. For our example, only object o4 is reassigned to c4, and other objects are still on the original machine.

Implementation principle of consistent Hash algorithm

In business development, we often persist data into the database. If you need to read these data, in addition to reading directly from the database, in order to reduce the access pressure of the database and improve the access speed, we introduce more cache to access the data. The process of reading data is generally as follows:

Implementation of Hash algorithm in Java code

A TreeMap is used as a ring, key is the subscript of the virtual node, and value is the hash of the real node. Personally, you can add a map < T, set < integer > > to maintain the relationship between real nodes and virtual nodes.

/**
 * Consistent Hash algorithm
 * Algorithm details: http://blog.csdn.net/sparkliang/article/details/5279393
 * Algorithm implementation: https://weblogs.java.net/blog/2007/11/27/consistent-hashing
 * @author xiaoleilu
 *
 * @param <T>   Node type
 */
public class ConsistentHash<T> implements Serializable{
    private static final long serialVersionUID = 1L;
    
    /** Hash The calculation object is used to customize the hash algorithm */
    Hash32<Object> hashFunc;
    /** Number of replicated nodes */
    private final int numberOfReplicas;
    /** Consistent Hash ring */
    private final SortedMap<Integer, T> circle = new TreeMap<>();
    
    /**
     * Construct, using Java's default Hash algorithm
     * @param numberOfReplicas The number of replicated nodes. Increasing the number of replicated nodes of each node is conducive to load balancing
     * @param nodes Node object
     */
    public ConsistentHash(int numberOfReplicas, Collection<T> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        this.hashFunc = key -> {
            //Fnv1ash algorithm is used by default
            return HashUtil.fnvHash(key.toString());
        };
        //Initialize node
        for (T node : nodes) {
            add(node);
        }
    }

    /**
     * structure
     * @param hashFunc hash Algorithm object
     * @param numberOfReplicas The number of replicated nodes. Increasing the number of replicated nodes of each node is conducive to load balancing
     * @param nodes Node object
     */
    public ConsistentHash(Hash32<Object> hashFunc, int numberOfReplicas, Collection<T> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        this.hashFunc = hashFunc;
        //Initialize node
        for (T node : nodes) {
            add(node);
        }
    }

    /**
     * Add node < br >
     * Each additional node will increase the number of replication nodes in the closed loop < br >
     * For example, if the number of replicated nodes is 2, two virtual nodes will be added every time this method is called, and the two nodes point to the same Node
     * Since the hash algorithm will call the toString method of node, it is de duplicated according to toString
     * @param node Node object
     */
    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunc.hash32(node.toString() + i), node);
        }
    }

    /**
     * Remove the corresponding virtual node while removing the node
     * @param node Node object
     */
    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunc.hash32(node.toString() + i));
        }
    }

    /**
     * Gets the nearest clockwise node
     * @param key Fetch Hash for a given key and obtain the actual node corresponding to the nearest virtual node in the clockwise direction
     * @return Node object
     */
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        int hash = hashFunc.hash32(key);
        if (false == circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);   //Returns a partial view of this map with a key greater than or equal to hash
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        //Just hit
        return circle.get(hash);
    }
}

reference material

Added by khefner on Tue, 16 Nov 2021 10:02:40 +0200