A production incident caused by poolboy's max_overflow

Problem

This was a production issue. At a fairly low QPS (about 2000 database accesses per second), with a pool of 100 worker processes and max_overflow set to 100, the throughput of one service node suddenly dropped to about 1500 database accesses per second. Request latency jumped from a few ms to several hundred ms, then gradually recovered.
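
For context, the pool was set up along these lines. This is only a sketch of a typical poolboy + mongodb setup with the sizes mentioned above; mongo_pool, mc_worker and the worker arguments are placeholders, not the original configuration:

%% Sketch of a poolboy pool with the sizes described above; all names and
%% worker arguments are illustrative.
PoolArgs = [{name, {local, mongo_pool}},
            {worker_module, mc_worker},
            {size, 100},           %% 100 permanent worker connections
            {max_overflow, 100}],  %% plus up to 100 on-demand workers
WorkerArgs = [{host, "127.0.0.1"}, {port, 27017}],
{ok, _Pool} = poolboy:start_link(PoolArgs, WorkerArgs),

%% Every database access checks a worker out, uses it, and checks it back in.
poolboy:transaction(mongo_pool, fun(Worker) ->
    gen_server:call(Worker, find_one)   %% placeholder for the real query
end).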

Reason

Tracing gradually narrowed the problem down to the checkout of the mongodb poolboy worker pool:

Checkout

handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
    #state{supervisor = Sup,
           workers = Workers,
           monitors = Monitors,
           overflow = Overflow,
           max_overflow = MaxOverflow} = State,
    case Workers of
        [Pid | Left] ->
            MRef = erlang:monitor(process, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{workers = Left}};
        [] when MaxOverflow > 0, Overflow < MaxOverflow ->
            %% No idle worker, but overflow capacity remains: create a
            %% brand-new worker synchronously, inside the pool process
            {Pid, MRef} = new_worker(Sup, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{overflow = Overflow + 1}};
        [] when Block =:= false ->
            {reply, full, State};
        [] ->
            MRef = erlang:monitor(process, FromPid),
            Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
            {noreply, State#state{waiting = Waiting}}
    end;

As the code shows, when max_overflow is not 0, a momentary overload makes the pool create new workers on demand. Each new worker has to establish its connection to mongodb, which takes 1-2 ms, and that cost is paid synchronously inside the pool's master process, blocking it.
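
For reference, worker creation in poolboy looks roughly like this (paraphrased from the poolboy source; details may differ between versions). The important point is that supervisor:start_child/2 is synchronous, so the pool's gen_server sits blocked until the mongodb worker's init, i.e. the connection setup, has finished:

%% Roughly what poolboy does when it grows the pool (paraphrased; exact code
%% may vary by version).
new_worker(Sup) ->
    %% Synchronous call: the pool process waits for the worker's init/1,
    %% which for a mongodb worker includes connecting to the database.
    {ok, Pid} = supervisor:start_child(Sup, []),
    true = link(Pid),
    Pid.

new_worker(Sup, FromPid) ->
    Pid = new_worker(Sup),
    Ref = erlang:monitor(process, FromPid),
    {Pid, Ref}.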

Checkin

When an overflow worker is checked back in, it is destroyed. Under sustained load this means connections are being created and destroyed all the time, and because both happen inside the pool's master process, every request ends up queued behind that work. The master process blocks, and QPS collapses in an avalanche:

handle_checkin(Pid, State) ->
    #state{supervisor = Sup,
           waiting = Waiting,
           monitors = Monitors,
           overflow = Overflow,
           strategy = Strategy} = State,
    case queue:out(Waiting) of
        {{value, {From, CRef, MRef}}, Left} ->
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            gen_server:reply(From, Pid),
            State#state{waiting = Left};
        {empty, Empty} when Overflow > 0 ->
            %% Nobody is waiting and the pool is in overflow: destroy the
            %% returned worker instead of keeping it
            ok = dismiss_worker(Sup, Pid),
            State#state{waiting = Empty, overflow = Overflow - 1};
        {empty, Empty} ->
            Workers = case Strategy of
                lifo -> [Pid | State#state.workers];
                fifo -> State#state.workers ++ [Pid]
            end,
            State#state{workers = Workers, waiting = Empty, overflow = 0}
    end.
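
The dismiss_worker/2 call above is roughly the following (again paraphrased from poolboy; it may differ slightly across versions). Terminating the child is another synchronous supervisor call made from inside the pool process:

%% Roughly how poolboy tears an overflow worker down (paraphrased; exact code
%% may vary by version).
dismiss_worker(Sup, Pid) ->
    true = unlink(Pid),
    %% Synchronous: the pool process waits for the worker to terminate,
    %% i.e. for the mongodb connection to be torn down.
    supervisor:terminate_child(Sup, Pid).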

Conclusion

Do not use poolboy's max_overflow if creating or destroying a child process costs anything noticeable: that cost is paid in the poolboy master process, and the resulting constant creation/destruction of workers easily blocks it and snowballs into an avalanche.
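
Concretely, the safer setup is to provision the pool for its peak up front and keep it at a fixed size. A minimal sketch, with illustrative names and sizes:

%% Fixed-size pool: checkout never creates or destroys a connection inside
%% the pool process. mongo_pool/mc_worker and the size are illustrative.
PoolArgs = [{name, {local, mongo_pool}},
            {worker_module, mc_worker},
            {size, 200},          %% provision the expected peak up front
            {max_overflow, 0}].   %% never grow the pool on demand

If a caller cannot get a worker, poolboy:checkout(Pool, false) or a checkout timeout lets it fail fast instead of piling up in the pool's queue.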

Every time I track down a bug, the cause looks obvious in retrospect, but tracing it takes a lot of effort. It is inconvenient to include the monitoring data in this blog, so I have had to omit much of the inference process. I hope the conclusion is still useful to you.
