Single-instance execution of scheduled tasks in a cluster (distributed lock + lock renewal)

1. Business requirements

During development we ran into the following requirement: several Spring scheduled tasks run in a cluster environment and, every minute, summarize and generate some log data. We want only one server in the cluster to execute each run, and we want the execution framework to be high-performance and stable. (The specific business involves internal company matters, so I won't describe it in detail here.)

2. Design considerations

Question 1: Single-instance control in a cluster is clearly a resource-contention problem. Distributed locks are the usual way to control resource contention in a cluster, typically backed by middleware such as Redis or ZooKeeper. This implementation uses Redis.

Question 2: A distributed lock has to be reliable, so failure cases need thought. Suppose a machine in the cluster acquires the lock and starts the task, then suddenly loses power during execution and never releases the lock. Will the other machines never be able to grab the task again?

Question 3: The execution framework should be loosely coupled, so that it stays decoupled from other developers' code. The first thing that comes to mind is a custom annotation plus Spring AOP, that is, AspectJ-style aspects.

Question 4: The distributed lock scheme has to use timestamps for locking. If the clocks of the servers in the cluster differ slightly, how do we handle that?
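Question 2 can be made concrete with a small simulation. The sketch below is not a Redis client: ExpiringLockStore is a hypothetical in-memory stand-in that mimics the semantics of Redis "SET key value NX PX ttl", with the current time passed in explicitly so the behaviour is easy to follow. It shows why a lock with an expiry cannot be held forever by a crashed machine.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for Redis "SET key value NX PX ttl"; not a real Redis client. */
class ExpiringLockStore {
    private static final class Entry {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Sets the key only if it is absent or expired; returns true when the caller now holds the lock. */
    synchronized boolean tryLock(String key, String owner, long ttlMillis, long nowMillis) {
        Entry current = entries.get(key);
        if (current != null && current.expiresAtMillis > nowMillis) {
            return false; // still held and not yet expired
        }
        entries.put(key, new Entry(owner, nowMillis + ttlMillis));
        return true;
    }

    /** Releases the lock only if the caller still owns it and it has not expired. */
    synchronized boolean unlock(String key, String owner, long nowMillis) {
        Entry current = entries.get(key);
        if (current == null || current.expiresAtMillis <= nowMillis || !current.owner.equals(owner)) {
            return false;
        }
        entries.remove(key);
        return true;
    }
}
```

If machine A takes the lock at time 0 with a 5-second TTL and then crashes, machine B is refused at 1 second but succeeds once the 5 seconds have elapsed. The cost is that a healthy but slow task could also lose its lock, which is what lock renewal is meant to address.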

3. Implementation

3.1 Spring already provides the @Scheduled annotation, which runs a scheduled task on every machine according to a cron expression. On top of it we define a new annotation, @ScheduledOne, and intercept it with AOP. The code is as follows:

Annotation definition:

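import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface ScheduledOne {
    String name() default ""; // Name of the distributed lock
}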

AOP aspect:

    @Around("@annotation(org.springframework.scheduling.annotation.Scheduled) &&" +
            "@annotation(com.xxx.scheduled.annotation.ScheduledOne)")
    public Object scheduledOneAround(ProceedingJoinPoint joinPoint) throws Throwable {
        // The lock name is generated from the annotated method
        String lockName = getLockName(joinPoint);
        RedisLockBean lockBean = new RedisLockBean(lockName);
        // Layer 1 lock: keyed by className + methodName. It prevents the next run from starting while the current run of the same task has not finished yet.
        if (redisLockService.tryLock(lockBean)) {
            try {
                // Layer 2 lock: after obtaining the layer 1 lock, check whether a machine with a faster clock has already executed this run. The default timeout is 2 hours, so the clocks of the fastest and slowest machines in the cluster must not differ by more than 2 hours.
                boolean acquireLock = isExec(joinPoint);
                if (acquireLock) {
                    // Setting the layer 2 key succeeded, so this run has not been executed yet: execute it. Note that the layer 2 key is not released after execution; it is left to expire so that slower machines cannot execute the same run again.
                    Object result;
                    result = joinPoint.proceed();
                    return result;
                }
            } finally {
                // Only the layer 1 lock is released here
                redisLockService.unlock(lockBean);
            }
        }
        return null;
    }
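The getLockName() helper is not shown in the post. A minimal sketch of what it might do, assuming an explicit name() on the annotation takes precedence and the fallback is className + methodName as the comments describe; lockNameFor, LockNameSketch and LogJobs are hypothetical names, and the annotation is repeated only so the sketch compiles on its own:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class LockNameSketch {
    // Same shape as the ScheduledOne annotation defined earlier, repeated so the sketch is self-contained.
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface ScheduledOne {
        String name() default "";
    }

    // Hypothetical helper: an explicit name() wins, otherwise className + "." + methodName.
    static String lockNameFor(Method method) {
        ScheduledOne annotation = method.getAnnotation(ScheduledOne.class);
        if (annotation != null && !annotation.name().isEmpty()) {
            return annotation.name();
        }
        return method.getDeclaringClass().getName() + "." + method.getName();
    }

    // Example task class used only to demonstrate the helper.
    static class LogJobs {
        @ScheduledOne
        public void summarize() { }

        @ScheduledOne(name = "daily-report")
        public void report() { }
    }
}
```

Inside an aspect, the Method would typically be obtained with ((MethodSignature) joinPoint.getSignature()).getMethod() from org.aspectj.lang.reflect.MethodSignature.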

tryLock() locks in Redis on className + methodName, which ensures that only one machine can hold the scheduled task at any given moment. Back to question 4: suppose the task runs every minute, with the cron expression "0 * * * * ?", and one machine's clock is slow. For example, the first machine grabs the task at 14:10:00 on 2021-08-20 and releases the lock when it finishes, while the second machine's clock is 5 seconds slow. When the first machine's clock reads 14:10:05, the second machine's clock reaches 14:10:00 and its trigger fires. No other machine is competing for the lock at that point, so the second machine acquires it and executes the task again, which is clearly wrong. To solve this, the isExec() method in the code above acts as a second lock. The code is as follows:

    // Returns true if this machine is the first to claim the current run,
    // i.e. no machine with a faster clock has already executed it
    private boolean isExec(ProceedingJoinPoint joinPoint) {
        // SET key value NX PX <millis>: succeeds only if the key does not exist yet
        String lock = jedisConnectionFactory.jedisExecAndClose(jedis -> jedis.set(getLockKey(joinPoint), getLockValue(), "NX", "PX", timeUnit.toMillis(2)));
        return "OK".equalsIgnoreCase(lock);
    }

isExec() checks whether a machine with a faster clock has already executed the run. Its key expires after 2 hours by default, which means the clocks of the machines in the cluster may differ by up to 2 hours and each run will still be executed only once. That leaves question 2 unresolved: what if the machine that grabbed the task goes down?
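The post does not show getLockKey() or getLockValue(). One way the second-layer key could work, sketched here as an assumption rather than the author's code, is to combine the lock name with the scheduled fire time read from the local clock and truncated to the cron unit (a minute in this example). Every machine fires when its own clock reads 14:10:00, so all of them compute the same key for the same run even though they fire at different real times. RunKeySketch and runKey are hypothetical names.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class RunKeySketch {
    // Hypothetical key layout: <lockName>:<epoch second of the scheduled minute>.
    static String runKey(String lockName, Instant localNow) {
        // Truncating removes the few milliseconds of jitter between the trigger firing and this call.
        Instant scheduledMinute = localNow.truncatedTo(ChronoUnit.MINUTES);
        return lockName + ":" + scheduledMinute.getEpochSecond();
    }
}
```

With this layout, a machine whose local clock reads 14:10:00.012 and another whose local clock reads 14:10:00.950 produce the same key, so only the first SET NX succeeds. The key's expiry then bounds how much clock skew the scheme tolerates.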

Our solution is that the lock taken by tryLock() has a default timeout of 5 seconds. Once the lock is acquired, it is wrapped in a bean, the LockBean, and registered with a Redis lock management class on the local machine. This class maintains a thread that continuously scans the locks held by the local machine. When a lock is about to expire (for example, when only 5 seconds remain before expiration; this threshold can be customized), the thread automatically renews it for a further period, whose length is stored in the LockBean. When the local task ends normally, its LockBean is removed from the management class. If the machine goes down, the renewal thread stops with it, and the lock in Redis is released automatically once its timeout elapses.
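The renewal logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the author's management class: LockRenewalSketch, HeldLock and their fields are hypothetical, and the sketch only tracks expiry locally where a real implementation would also extend the TTL in Redis (for example with PEXPIRE).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class LockRenewalSketch {
    /** Local bookkeeping for one held lock (hypothetical fields, loosely modelled on the LockBean). */
    static final class HeldLock {
        final String key;
        final long leaseMillis;          // how far each renewal extends the lock
        final long renewThresholdMillis; // renew when no more than this remains
        volatile long expiresAtMillis;

        HeldLock(String key, long leaseMillis, long renewThresholdMillis, long nowMillis) {
            this.key = key;
            this.leaseMillis = leaseMillis;
            this.renewThresholdMillis = renewThresholdMillis;
            this.expiresAtMillis = nowMillis + leaseMillis;
        }
    }

    private final Map<String, HeldLock> held = new ConcurrentHashMap<>();

    void register(HeldLock lock) { held.put(lock.key, lock); }

    void remove(String key) { held.remove(key); }

    /** One scan pass over the locally held locks; returns how many were renewed. */
    int scanOnce(long nowMillis) {
        int renewed = 0;
        for (HeldLock lock : held.values()) {
            long remaining = lock.expiresAtMillis - nowMillis;
            // Already expired locks are skipped: another machine may own the key by now.
            if (remaining > 0 && remaining <= lock.renewThresholdMillis) {
                // A real implementation would extend the TTL in Redis here as well.
                lock.expiresAtMillis = nowMillis + lock.leaseMillis;
                renewed++;
            }
        }
        return renewed;
    }

    /** Runs scanOnce on a daemon thread, so the renewal stops when the process dies. */
    ScheduledExecutorService start(long scanPeriodMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "lock-renewal");
            thread.setDaemon(true);
            return thread;
        });
        executor.scheduleWithFixedDelay(
                () -> scanOnce(System.currentTimeMillis()),
                scanPeriodMillis, scanPeriodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

The scan period has to be shorter than the renewal threshold, otherwise a lock can expire between two scans. Calling remove() when the task finishes corresponds to removing the LockBean from the management class.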

From the above we can draw a few precautions:

1. The default timeout of the lock should not be set too long; otherwise, after a crash, the other machines have to wait that long before the lock is released.

2. With this scheme, @Scheduled tasks should be driven only by a cron expression, which fires at an exact wall-clock time unit. fixedRate cannot be used to control the execution frequency, because the web service starts at a different time on each machine.
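Precaution 2 comes down to simple arithmetic. The sketch below uses hypothetical helper names and assumes a one-minute period and a fixedRate task with no initial delay. It compares the two trigger styles for two machines that started at different moments.

```java
class FireTimeSketch {
    static final long MINUTE = 60_000L;

    // Every-minute cron: the next fire time is the next whole minute on the wall clock,
    // regardless of when the application started.
    static long nextCronFire(long nowMillis) {
        return (nowMillis / MINUTE + 1) * MINUTE;
    }

    // fixedRate of one minute: fire times are counted from the application's own startup time.
    static long nthFixedRateFire(long startupMillis, int n) {
        return startupMillis + n * MINUTE;
    }
}
```

A machine started 10 seconds past the minute and one started 47 seconds past it both get the same next cron fire time, while their fixedRate fire times stay 37 seconds apart forever, so the machines would never compete for the same run.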

If you have a better solution, let's discuss it! (Other than extracting the scheduled tasks into a separate project and deploying it on its own, since that splits them off from the web business.)

Keywords: Java Redis Spring

Added by TheWart on Wed, 15 Dec 2021 15:46:52 +0200