Problem
This problem came from production. Under a fairly low load (about 2000 database accesses per second), with a pool of 100 worker processes and a max_overflow of 100, the performance of one service node suddenly dropped, and it could process only about 1500 database accesses per second. As a result, request latency rose from a few ms to several hundred ms, and then gradually recovered.
Reason
Narrowing the scope step by step led to the checkout path of the poolboy process pool used for MongoDB:
Checkout
handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
    #state{supervisor = Sup,
           workers = Workers,
           monitors = Monitors,
           overflow = Overflow,
           max_overflow = MaxOverflow} = State,
    case Workers of
        [Pid | Left] ->
            MRef = erlang:monitor(process, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{workers = Left}};
        [] when MaxOverflow > 0, Overflow < MaxOverflow ->
            {Pid, MRef} = new_worker(Sup, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{overflow = Overflow + 1}};
        [] when Block =:= false ->
            {reply, full, State};
        [] ->
            MRef = erlang:monitor(process, FromPid),
            Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
            {noreply, State#state{waiting = Waiting}}
    end;
It can be seen that when max_overflow is not 0 and the pool is momentarily overloaded, a new worker is created. Each new worker opens a connection to MongoDB, which takes 1-2 ms. Because new_worker is called inside handle_call, that creation cost blocks the pool's master process.
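To get a feel for how tight this is, here is a rough back-of-the-envelope bound. It assumes the pool's master process handles one message at a time (as any gen_server does) and that every overflow checkout pays the full 1-2 ms creation cost. The symbols t_create and N_max are introduced here for illustration only.

```latex
N_{\max} \le \frac{1\,\text{s}}{t_{\text{create}}}
\qquad\Rightarrow\qquad
\begin{cases}
t_{\text{create}} = 1\,\text{ms}: & N_{\max} \le 1000 \text{ overflow checkouts/s} \\
t_{\text{create}} = 2\,\text{ms}: & N_{\max} \le 500 \text{ overflow checkouts/s}
\end{cases}
```

Both values are well below the roughly 2000 accesses per second of incoming load, which is consistent with a backlog forming whenever a large share of checkouts takes the overflow path. The estimate ignores the cost of destroying workers, so the real ceiling is probably lower.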
Checkin
When a worker is checked in while the pool is in overflow and no caller is waiting, the worker is destroyed. Connections are therefore created and destroyed continuously. Every checkout and checkin goes through the pool's master process, so while it is busy creating and destroying connections, all other requests queue up behind it and qps collapses in an avalanche.
handle_checkin(Pid, State) ->
    #state{supervisor = Sup,
           waiting = Waiting,
           monitors = Monitors,
           overflow = Overflow,
           strategy = Strategy} = State,
    case queue:out(Waiting) of
        {{value, {From, CRef, MRef}}, Left} ->
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            gen_server:reply(From, Pid),
            State#state{waiting = Left};
        {empty, Empty} when Overflow > 0 ->
            ok = dismiss_worker(Sup, Pid),
            State#state{waiting = Empty, overflow = Overflow - 1};
        {empty, Empty} ->
            Workers = case Strategy of
                lifo -> [Pid | State#state.workers];
                fifo -> State#state.workers ++ [Pid]
            end,
            State#state{workers = Workers, waiting = Empty, overflow = 0}
    end.
Conclusion
Do not use poolboy's max_overflow. If creating or destroying a worker process has a non-trivial cost, it can easily block the poolboy master process, and frequent creation and destruction of workers leads to an avalanche.
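As a minimal sketch of that advice, the pool below is started with max_overflow set to 0, so the master process never creates or destroys workers while serving checkout and checkin. The names mongo_pool_sup, mongo_pool and my_mongo_worker are hypothetical, and the fixed size of 200 (the old 100 workers plus the 100 overflow slots) is only an assumption about how one might keep the same capacity. The worker module is assumed to implement the poolboy_worker behaviour.

```erlang
-module(mongo_pool_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    PoolArgs = [{name, {local, mongo_pool}},        % hypothetical pool name
                {worker_module, my_mongo_worker},   % hypothetical worker module
                {size, 200},                        % assumption: fixed size replaces 100 + 100 overflow
                {max_overflow, 0}],                 % never start extra workers on checkout
    WorkerArgs = [],                                % connection options would go here
    PoolSpec = poolboy:child_spec(mongo_pool, PoolArgs, WorkerArgs),
    {ok, {{one_for_one, 10, 10}, [PoolSpec]}}.
```

With this configuration all connections are opened once when the pool starts. When every worker is busy, a blocking checkout (or poolboy:transaction/2) waits in the queue instead of triggering a new connection.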
Every time I track down a bug, the cause looks obvious in hindsight, but tracing it takes a lot of effort. It is not convenient to publish the monitoring data on my blog, so I have had to leave out much of the reasoning. I hope this conclusion helps you.