Who moved my Redis? Frequent AOF rewriting

A containerized Redis service keeps triggering AOF rewrites

Today, let's look at a production Redis case.

01 Background

A production Redis instance keeps performing AOF rewrites. Entering the Redis data directory, we see the following:

We can see:

1. The Redis data directory contains many temp-XXX.aof files.

2. They started appearing at around 12:23.

3. Each temp AOF file grows to about 8 to 10 GB, and then Redis switches to a new one.

In general, multiple temporary AOF files mean that the current AOF rewrite did not complete, and Redis then re-initiated the rewrite.
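
A quick way to see whether a rewrite is running and how the last one ended is Redis's own persistence stats (a minimal sketch; the port 22392 is taken from the pod labels shown later and may differ in your environment):

# Is a rewrite in progress / scheduled, and did the last one succeed?
redis-cli -p 22392 info persistence | grep -E 'aof_rewrite_in_progress|aof_rewrite_scheduled|aof_last_bgrewrite_status'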

02 Troubleshooting process

1. Server level

First, look at the server-level monitoring metrics for the host running this Redis instance.

Around 12:23, CPU utilization, load, and disk util were all normal.

The only exception is memory utilization, which keeps fluctuating; but with utilization below 70%, a host-level OOM seems unlikely.
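
These host-level metrics usually come from the monitoring system, but they can also be spot-checked directly on the server (a minimal sketch of common commands; iostat requires the sysstat package):

# Load average, memory usage and per-device disk utilization
uptime
free -g
iostat -x 1 3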

2. Read/write volume and memory usage of the Redis instance

Looking at the business read/write volume during the problem period:

Read requests run at about 1,500 QPS and write requests at about 5,000 QPS.

That volume is fairly small, so the problem is unlikely to be driven by read/write traffic.

Let's look at the current memory usage:

About 17 GB.
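
Both figures can be cross-checked from the instance itself (a minimal sketch assuming local redis-cli access; a per-command read/write split would normally come from command-level monitoring):

# Overall throughput and memory usage as reported by Redis
redis-cli -p 22392 info stats  | grep instantaneous_ops_per_sec
redis-cli -p 22392 info memory | grep -E 'used_memory_human|used_memory_peak_human'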

3. Redis's own memory monitoring

Redis's memory-usage monitoring shows that used memory hovers around 17 GB, followed by a long run of sawtooth spikes: memory climbs, drops back down, and the cycle keeps repeating.

4. View Redis server logs

The goal is to see what Redis was doing during this period. Opening the log, the relevant excerpt is:

24:S 28 Dec 12:23:46.598 * Starting automatic rewriting of AOF on 400% growth
24:S 28 Dec 12:23:47.027 * Background append only file rewriting started by pid 281
24:S 28 Dec 12:26:42.848 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:26:42.955 * Starting automatic rewriting of AOF on 401% growth
24:S 28 Dec 12:26:43.361 * Background append only file rewriting started by pid 282
24:S 28 Dec 12:29:13.027 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:29:13.133 * Starting automatic rewriting of AOF on 401% growth
24:S 28 Dec 12:29:13.583 * Background append only file rewriting started by pid 283
24:S 28 Dec 12:32:03.068 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:32:03.175 * Starting automatic rewriting of AOF on 402% growth
24:S 28 Dec 12:32:03.630 * Background append only file rewriting started by pid 284
24:S 28 Dec 12:35:01.723 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:35:01.828 * Starting automatic rewriting of AOF on 403% growth
24:S 28 Dec 12:35:02.256 * Background append only file rewriting started by pid 285
24:S 28 Dec 12:38:11.077 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:38:11.185 * Starting automatic rewriting of AOF on 403% growth
24:S 28 Dec 12:38:11.636 * Background append only file rewriting started by pid 286
24:S 28 Dec 12:41:01.469 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:41:01.576 * Starting automatic rewriting of AOF on 404% growth
24:S 28 Dec 12:41:02.036 * Background append only file rewriting started by pid 287
24:S 28 Dec 12:43:51.044 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:43:51.152 * Starting automatic rewriting of AOF on 404% growth
24:S 28 Dec 12:43:51.584 * Background append only file rewriting started by pid 288
24:S 28 Dec 12:47:04.495 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:47:04.599 * Starting automatic rewriting of AOF on 405% growth
24:S 28 Dec 12:47:05.072 * Background append only file rewriting started by pid 289
24:S 28 Dec 12:50:18.393 # Background AOF rewrite terminated by signal 9
24:S 28 Dec 12:50:18.499 * Starting automatic rewriting of AOF on 405% growth
24:S 28 Dec 12:50:19.004 * Background append only file rewriting started by pid 290
24:S 28 Dec 12:53:11.886 # Background AOF rewrite terminated by signal 9

From the log it is not hard to see that Redis keeps hitting the AOF rewrite growth threshold of 400%:

redis > config get auto*
1) "auto-aof-rewrite-percentage"
2) "400"

Each rewrite starts, but the rewrite child process is then terminated by signal 9; in other words, the child is killed with kill -9.
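
For context, the automatic rewrite fires when the AOF has grown by auto-aof-rewrite-percentage relative to its size after the previous rewrite, provided it is also larger than auto-aof-rewrite-min-size. Both sides of that condition can be inspected directly (a minimal sketch):

# The thresholds...
redis-cli -p 22392 config get auto-aof-rewrite-percentage
redis-cli -p 22392 config get auto-aof-rewrite-min-size
# ...and the sizes Redis compares against them
redis-cli -p 22392 info persistence | grep -E 'aof_base_size|aof_current_size'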

We can confirm this conjecture in the system log /var/log/messages:

Dec 28 12:23:45  kernel: memory+swap: usage 20740096kB, limit 9007199254740988kB, failcnt 0
Dec 28 12:23:45  kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec 28 12:23:45  kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod3ae01d1b_0d39_11eb_adc7_1c98ec1b7874.slice/docker-9c9cd155c58b6f28b9d5771a230ea711d48fdb5fb50c99caa14a1afb474d6b54.scope: cache:28KB rss:20740068KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:20739924KB inactive_file:12KB active_file:8KB unevictable:0KB
Dec 28 12:23:45  kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Dec 28 12:23:45  kernel: [22253]     0 22253     2932      154      10        0           999 startredis.sh
Dec 28 12:23:45  kernel: [22302]     0 22302    27565      290      58        0           999 sudo
Dec 28 12:23:45  kernel: [22313]  1000 22313  4672591  4659513    9119        0           999 redis-server
Dec 28 12:23:45  kernel: [26944]  1000 26944  4719696  4688593    9214        0           999 redis-server
Dec 28 12:23:45  kernel: Memory cgroup out of memory: Kill process 26944 (redis-server) score 1905 or sacrifice child
Dec 28 12:23:45  kernel: Killed process 26944 (redis-server) total-vm:18878784kB, anon-rss:18753972kB, file-rss:400kB, shmem-rss:0kB

This confirms that the Redis child process was killed because of OOM.

5. Analysis

So where exactly is the memory limit that gets the child process killed? Is it maxmemory?

Normally, when the memory used by Redis reaches maxmemory, the eviction policy is triggered; the policy is controlled by maxmemory-policy. The policy configured on this instance is:

redis > config get maxmemory-policy
1) "maxmemory-policy"
2) "allkeys-lru"

This policy evicts keys across the whole keyspace using an LRU algorithm; logically, it should not cause the Redis child process to die.
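
For completeness, maxmemory itself can be checked in the same way (its value is not the culprit here, since eviction does not kill child processes):

redis > config get maxmemory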

In fact, the answer is in the system log:

Memory cgroup out of memory

Because this Redis instance runs as a pod in a K8s cluster, and a pod is itself a container, cgroup technology is used inside the pod to limit memory.
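
From inside the container, the cgroup memory limit and current usage can be read directly (a minimal sketch; the paths assume cgroup v1, which matches the "Memory cgroup" messages above):

# Memory limit and usage of the container's cgroup (cgroup v1)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes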

Querying the corresponding limit value through the K8s API gives the following:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    resources/cpu-limit: "1.2076"
    resources/cpu-request: "0.6038"
    resources/egress-bandwidth-limit: "209715200"
    resources/egress-bandwidth-request: "1048576"
    resources/memory-limit: "21237858304"
    resources/memory-request: "10618929152"
    scheduler/policy: normal
    scheduler/service: redis
  creationTimestamp: "2020-10-13T09:48:21Z"
  labels:
    port: "22392"
    service_type: redis
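
For reference, a pod spec like the one above can be pulled with something along these lines (pod name and namespace are placeholders):

kubectl get pod <redis-pod-name> -n <namespace> -o yaml | grep -E 'memory-limit|memory-request'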

How large is 21237858304 bytes exactly? Let's use MySQL as a calculator:

select 21237858304/1024/1024/1024;
+----------------------------+
| 21237858304/1024/1024/1024 |
+----------------------------+
|            19.779296875000 |
+----------------------------+
1 row in set (0.00 sec)

That's roughly 19.78 GB.
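
Incidentally, this matches the kernel log exactly: the memory+swap usage of 20740096 kB reported there is precisely the pod's memory limit in kilobytes.

20740096 kB * 1024 = 21237858304 bytes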

Now everything falls into place. Looking back at the memory usage graph:

Redis originally used about 17 GB of memory, which then kept climbing, dropping, and climbing again. The root cause is that an AOF rewrite needs extra memory: the rewrite runs in a forked child process, and as the parent keeps serving writes, copy-on-write duplicates more and more pages. During the rewrite, the total memory usage of the pod reaches the cgroup limit, the child process is killed, memory drops back, and the whole cycle repeats, producing the sawtooth pattern and the pile of temp-XXX.aof files.

03 Summary & divergence

When Redis-related problems occur, troubleshooting can proceed along the following steps:

1. Server level

First, take an overall look from a relatively high vantage point, observing the common server metrics: CPU, load, memory, disk, network bandwidth, TCP connections, and so on.

2. Redis instance level

Once the server looks fine, check the running state of the Redis instance itself:

Read/write QPS, memory usage, maxmemory, AOF rewrite settings, and so on.

3. Redis log analysis

Use the Redis log to see what the instance was doing around the time of the failure.

4. System logs and monitoring as supporting evidence

By this point there is usually a tentative conclusion; turn it into a hypothesis and look for other evidence to support it.

5. Source code analysis

If the problem still cannot be explained, it may be necessary to dig into the source code.

Divergence:

1. This article mentions cgroup technology. If you are not familiar with it, see the earlier article:

docker learning notes of cloud native technology (5)

2. If the memory used by Redis reaches its maxmemory setting and an AOF rewrite happens at the same time, what does Redis do? If you are interested, test it yourself.
