Just getting the time, and the machine goes down

1. Background

Linux time management involves concepts such as clocksource, clockevent, timer, tick, and timekeeper, which together form a complete timekeeping subsystem. Of course, code has bugs. This article starts from one such bug to deepen our understanding of the theory through practice: we simply get the time, and the machine crashes.

2. Fault phenomenon

The OPPO cloud kernel team received a connectivity alarm and found that a machine had been reset:

PID: 0      TASK: ffff8d2b3775b0c0  CPU: 1   COMMAND: "swapper/1"
 #0 [ffff8d597f6489f0] machine_kexec at ffffffffa5a63b34
 #1 [ffff8d597f648a50] __crash_kexec at ffffffffa5b1e242
 #2 [ffff8d597f648b20] panic at ffffffffa615d85b
 #3 [ffff8d597f648ba0] nmi_panic at ffffffffa5a9859f
 #4 [ffff8d597f648bb0] watchdog_overflow_callback at ffffffffa5b4a881
 #5 [ffff8d597f648bc8] __perf_event_overflow at ffffffffa5ba26b7
 #6 [ffff8d597f648c00] perf_event_overflow at ffffffffa5babd24
 #7 [ffff8d597f648c10] intel_pmu_handle_irq at ffffffffa5a0a850
 #8 [ffff8d597f648e38] perf_event_nmi_handler at ffffffffa616d031
 #9 [ffff8d597f648e58] nmi_handle at ffffffffa616e91c
#10 [ffff8d597f648eb0] do_nmi at ffffffffa616ebf8
#11 [ffff8d597f648ef0] end_repeat_nmi at ffffffffa616dd89
    [exception RIP: __getnstimeofday64+144]
    RIP: ffffffffa5b03940  RSP: ffff8d597f643c78  RFLAGS: 00000212
    RAX: 15b5c8320b8602cd  RBX: ffff8d597f643cb0  RCX: 000000005f89ee29
    RDX: 00000000ee4479fe  RSI: 0000012b5478f3b2  RDI: 0009709c7629b240
    RBP: ffff8d597f643c90   R8: 00000000007001de   R9: ffff8d596d5c0000
    R10: 000000000000007a  R11: 000000000000000e  R12: ffffffffa662ea80
    R13: 000000003ccbcfb6  R14: ffff8d895de08000  R15: 0000000000000081
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8d597f643c78] __getnstimeofday64 at ffffffffa5b03940
#13 [ffff8d597f643c98] getnstimeofday64 at ffffffffa5b0398e
#14 [ffff8d597f643ca8] ktime_get_real at ffffffffa5b03a45
#15 [ffff8d597f643cd0] netif_receive_skb_internal at ffffffffa603b936
#16 [ffff8d597f643d00] napi_gro_receive at ffffffffa603c588
#17 [ffff8d597f643d28] mlx5e_handle_rx_cqe_mpwrq at ffffffffc052ef1d [mlx5_core]
#18 [ffff8d597f643db8] mlx5e_poll_rx_cq at ffffffffc052f4b8 [mlx5_core]
#19 [ffff8d597f643e08] mlx5e_napi_poll at ffffffffc05304c6 [mlx5_core]
#20 [ffff8d597f643e78] net_rx_action at ffffffffa603bf1f
#21 [ffff8d597f643ef8] __do_softirq at ffffffffa5aa2155
#22 [ffff8d597f643f68] call_softirq at ffffffffa617a32c
#23 [ffff8d597f643f80] do_softirq at ffffffffa5a2e675

From the stack, process 0 (swapper/1) triggered the crash while getting the time during softirq packet-receive processing. We have analyzed hard lockups many times before: they mean interrupts were disabled for too long. The specific place where interrupts stay disabled here can be seen in the __do_softirq/call_softirq path in the stack above.

3. Fault analysis

1) Theoretical knowledge

While processing network-packet softirqs, the kernel timestamps the packets; in other words, for machines in the OPPO cloud, the call stack above is a hot, mature path. Problems on mature paths are rare, so this one is worth sharing. When timekeeping is initialized, the best clock source cannot be selected yet, because it most likely has not been initialized itself; the policy is therefore to start with a clock source that is guaranteed to be ready at timekeeping init, such as the jiffies-based clock source.
Generally speaking, the timekeeping module updates the various system clock values when a tick arrives, but a ktime_get call is likely to occur between two ticks. At that moment the per-tick precision of the stored values is not enough. To obtain higher precision, the nanosecond value is computed by timekeeping_get_ns, the protagonist of this article. This function obtains the current nanosecond value of the real time clock by adding the real time clock value at the last tick (xtime_nsec) to the delta between now and that last tick. After the system is up, real time clock + wall_to_monotonic is the system uptime, and real time clock + wall_to_monotonic + sleep time is the system boot time.
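
To make the relation concrete, here is a minimal user-space sketch of the idea (not the kernel code; all names and numbers are illustrative, with mult/shift borrowed from the crash dump shown later):

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: the timekeeper records the wall-clock nanoseconds and the
 * clocksource cycle count as of the last tick; a high-resolution read adds the
 * cycle delta since that tick, converted to ns via mult/shift. */
int main(void)
{
  uint64_t xtime_nsec_last_tick = 123456789;  /* ns at the last tick (made-up value) */
  uint64_t cycle_last = 1000000;              /* clocksource cycles at the last tick */
  uint64_t cycle_now  = 1002290;              /* clocksource cycles read right now */
  uint64_t mult = 7340510, shift = 24;        /* conversion factors, as in the dump below */

  uint64_t delta_ns = ((cycle_now - cycle_last) * mult) >> shift;
  printf("high-resolution now = last-tick ns + delta = %llu ns\n",
         (unsigned long long)(xtime_nsec_last_tick + delta_ns));
  return 0;
}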

2) Hands-on analysis

Looking at __getnstimeofday64 from the call stack, at first glance it contains only one loop, the one that reads the timekeeper_seq sequence lock. The code is as follows:

int __getnstimeofday64(struct timespec64 *ts)
{
  struct timekeeper *tk = &timekeeper;
  unsigned long seq;
  s64 nsecs = 0;

  do {
    seq = read_seqcount_begin(&timekeeper_seq);

    ts->tv_sec = tk->xtime_sec;//caq: second value assignment
    nsecs = timekeeping_get_ns(&tk->tkr_mono);

  } while (read_seqcount_retry(&timekeeper_seq, seq));

  ts->tv_nsec = 0;
  timespec64_add_ns(ts, nsecs);//caq: there is a loop in here

  /*
   * Do not bail out early, in case there were callers still using
   * the value, even in the face of the WARN_ON.
   */
  if (unlikely(timekeeping_suspended))
    return -EAGAIN;
  return 0;
}
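
For completeness, the getnstimeofday64 wrapper at frame #13 of the stack is just a thin wrapper around this function; sketched below from the upstream source (the RHEL variant may differ slightly):

void getnstimeofday64(struct timespec64 *ts)
{
  WARN_ON(__getnstimeofday64(ts));//only warns if timekeeping is suspended; the time is filled in either way
}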

However, look at the disassembly:

0xffffffffa5b0393b <__getnstimeofday64+139>:    xor    %edx,%edx-----------clears u32 ret = 0;
0xffffffffa5b0393d <__getnstimeofday64+141>:    nopl   (%rax)
0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax----0x3b9aca00 is 1000000000 ns, i.e. 1 s; this is the loop body, and in the stack RAX is 15b5c8320b8602cd
0xffffffffa5b03946 <__getnstimeofday64+150>:    add    $0x1,%edx-----------ret++; edx is the low 32 bits of rdx, rdx is 00000000ee4479fe, so edx is 0xee4479fe, i.e. 3997465086
0xffffffffa5b03949 <__getnstimeofday64+153>:    cmp    $0x3b9ac9ff,%rax----is there less than 1 s left?
0xffffffffa5b0394f <__getnstimeofday64+159>:    ja     0xffffffffa5b03940 <__getnstimeofday64+144>
 /include/linux/time.h: 215----corresponds to timespec64_add_ns
0xffffffffa5b03951 <__getnstimeofday64+161>:    add    %rcx,%rdx-----------seconds computed from delta + the previously saved seconds = the new seconds value
0xffffffffa5b03954 <__getnstimeofday64+164>:    mov    %rax,0x8(%rbx)------the remaining ns, assigned to a->tv_nsec = ns;
0xffffffffa5b03958 <__getnstimeofday64+168>:    mov    %rdx,(%rbx)---------the new seconds value including delta, assigned to a->tv_sec
0xffffffffa5b0395b <__getnstimeofday64+171>:    cmpl   $0x1,0xc55702(%rip)        # 0xffffffffa6759064----if (timekeeping_suspended)
 /kernel/time/timekeeping.c: 512
0xffffffffa5b03962 <__getnstimeofday64+178>:    pop    %rbx
0xffffffffa5b03963 <__getnstimeofday64+179>:    pop    %r12
0xffffffffa5b03965 <__getnstimeofday64+181>:    pop    %r13

From the stack, the instruction we are looping on is __getnstimeofday64+144:

0xffffffffa5b03940 <__getnstimeofday64+144>:    sub    $0x3b9aca00,%rax----0x3b9aca00 is 1000000000 ns, i.e. 1 s; this is the loop body, and in the stack RAX is 15b5c8320b8602cd

So we are actually looping inside the timespec64_add_ns function:

static __always_inline void timespec64_add_ns(struct timespec64 *a, u64 ns)
{
  a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
  a->tv_nsec = ns;
}
__iter_div_u64_rem expands as follows:

static __always_inline u32
__iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
{
  u32 ret = 0;

  while (dividend >= divisor) {//this is the loop we are stuck in
    /* The following asm() prevents the compiler from
       optimising this loop into a modulo operation.  */
    asm("" : "+rm"(dividend));

    dividend -= divisor;
    ret++;
  }

  *remainder = dividend;

  return ret;
}

Our divisor is NSEC_PER_SEC, i.e. 10 to the 9th power, whose hexadecimal value is 0x3b9aca00. Since we are still looping, the dividend is in rax. Note its value:

 RAX: 15b5c8320b8602cd

crash> p 0x15b5c8320b8602cd/0x3b9aca00
$7 = 1564376562

By that calculation, the loop would have to iterate 1564376562 times to finish.
With such a huge value, who knows when the loop would ever complete.
So where does this value come from? It is computed from the difference between two successive reads of the clocksource cycle counter.

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
  u64 delta;

  delta = timekeeping_get_delta(tkr);//caq: difference between last read and this read
  return timekeeping_delta_to_ns(tkr, delta);//caq: convert difference to ns
}

delta comes from:

static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
{
  u64 cycle_now, delta;
  struct clocksource *clock;

  /* read clocksource: */
  clock = tkr->clock;
  cycle_now = tkr->clock->read(clock);//read the current cycle value

  /* calculate the delta since the last update_wall_time */
  delta = clocksource_delta(cycle_now, clock->cycle_last, clock->mask);//Calculate difference

  return delta;
}

So obtaining delta means reading the current cycle value of the clocksource and then computing the difference via clocksource_delta. Based on the code above, we first need to find out which clocksource is currently in use:
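
For reference, clocksource_delta is essentially a masked subtraction; sketched below from the non-debug upstream version with types simplified to u64 (the CONFIG_DEBUG_TIMEKEEPING variant additionally clamps an apparently negative delta to 0):

static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
  return (now - last) & mask;//with a full 64-bit mask, as the tsc clocksource uses, a wrap or stale cycle_last shows up as a huge positive delta
}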

crash> timekeeper
timekeeper = $1 = {
  tkr_mono = {------------------------------timekeeping_get_ns(&tk->tkr_mono)
    clock = 0xffffffffa662ea80, ------------this is the clocksource; it currently points to clocksource_tsc
    cycle_last = 16728674596502256, 
    mult = 7340510, 
    shift = 24, 
    xtime_nsec = 2657092090049088, ---------this value is not yet ns; it still needs >> tkr->shift
    base = {
      tv64 = 2788453640047242
    }
  }, 
  tkr_raw = {
    clock = 0xffffffffa662ea80, 
    cycle_last = 16728674596502256, 
    mult = 8007931, 
    shift = 24, 
    xtime_nsec = 0, 
    base = {
      tv64 = 2788490058099290
    }
  }, 
  xtime_sec = 1602874921, ------------------current seconds value

The timekeeper selects the clocksource with the highest rating (precision) to work with:

crash> dis -l 0xffffffffa662ea80
0xffffffffa662ea80 <clocksource_tsc>:   addb   $0xa5,-0x5d(%rcx)--------------i.e. this address is clocksource_tsc; tsc is just one of the clocksources


crash> clocksource_tsc
clocksource_tsc = $2 = {
  read = 0xffffffffa5a34180, -----------read_tsc
  cycle_last = 16728674596502256, ------the cycle value recorded when the wall time was last updated
  mask = 18446744073709551615, 
  mult = 8007931, 
  shift = 24, ----------------------note the shift amount
  max_idle_ns = 204347035648, 
  maxadj = 880872, 
  archdata = {
    vclock_mode = 1
  }, 
  name = 0xffffffffa647c1cd "tsc", ---name
  list = {
    next = 0xffffffffa6633ff8, 
    prev = 0xffffffffa665c9b0
  }, 
  rating = 300, --------------priority (rating)
  enable = 0x0, 
  disable = 0x0, 
  flags = 35, ---the CLOCK_SOURCE_UNSTABLE flag is not set
  suspend = 0x0, 
  resume = 0x0, 
  owner = 0x0
}
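
The read callback 0xffffffffa5a34180 is read_tsc, which ultimately reads the CPU's TSC. For intuition only, here is a user-space sketch using the compiler's __rdtsc() intrinsic; it merely shows what kind of counter the tsc clocksource is based on and is not the kernel's read_tsc implementation:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
  uint64_t t1 = __rdtsc();  /* raw TSC cycle counter, the same counter the tsc clocksource reads */
  uint64_t t2 = __rdtsc();
  printf("tsc delta between two reads: %llu cycles\n",
         (unsigned long long)(t2 - t1));
  return 0;
}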

The conversion of the difference into nanoseconds is analyzed below:

static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
            u64 delta)
{
  s64 nsec;//note: this is a signed value

  nsec = delta * tkr->mult + tkr->xtime_nsec;
  nsec >>= tkr->shift;//Convert to ns

  /* If arch requires, add in get_arch_timeoffset() */
  return nsec + arch_gettimeoffset();
}

The return value of timekeeping_delta_to_ns is clearly far too large. There are two possibilities:
one is that delta is so large that delta * tkr->mult overflows the s64, which is a bug; the other is that the delta between the two clocksource reads really is that large, which would mean update_wall_time was not called in time to refresh cycle_last of the current clocksource.
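
To get a sense of the first possibility, here is a rough back-of-the-envelope sketch using the mult and shift values from the dump above (mult = 7340510, shift = 24, which corresponds to roughly a 2.29 GHz TSC). It only estimates when delta * tkr->mult would overflow an s64; it does not say which of the two possibilities actually happened here:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  const uint64_t mult = 7340510, shift = 24;       /* values from the crash dump above */
  /* smallest cycle delta for which delta * mult no longer fits in an s64 */
  uint64_t overflow_delta = (uint64_t)INT64_MAX / mult + 1;
  /* ns per tsc cycle = mult / 2^shift, about 0.44 ns here */
  double ns_per_cycle = (double)mult / (double)(1ULL << shift);

  printf("delta * mult overflows s64 after ~%llu cycles, i.e. ~%.0f s without update_wall_time\n",
         (unsigned long long)overflow_delta, overflow_delta * ns_per_cycle / 1e9);
  return 0;
}

On this machine that works out to roughly nine minutes without a wall-time update.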

4. Fault reproduction

This s64 overflow bug has been fixed in the community.

-static inline s64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
                                          cycle_t delta)
 {
-       s64 nsec;
+       u64 nsec;

In addition, Red Hat's changelog shows the same fix was backported from upstream, but I think the risk remains, because update_wall_time is sometimes not called in a timely manner. Even changing s64 to u64 does not fully solve the overflow problem: it is obvious from timekeeping_delta_to_ns that the cycle delta cannot use all 64 bits of the u64 once it is multiplied by mult. I believe someone in the community will eventually hit this problem again.

5. Fault avoidance or resolution

Possible mitigations:
Add an alarm for soft lockups and intervene in time, since a soft lockup can cause update_wall_time not to be called in a timely manner.

About the author

An Qing, Senior Backend Engineer

Currently mainly responsible for Linux kernel, container, virtual machine and other virtualization work.

For more content, please scan the QR code to follow the [OPPO Technology] official account.
