Troubleshooting a StarRocks BE node crash: std::bad_alloc

Problem analysis

Five StarRocks BE nodes went down within a few minutes. The BE's be.out log showed the following:

tcmalloc: large alloc 1811947520 bytes == 0x77f9f0000 @  0x384f94f 0x39ce2dc 0x399646a
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
*** Aborted at 1641348199 (unix time) try "date -d @1641348199" if you are using GNU date ***
PC: @     0x7fa8c7db4387 __GI_raise
*** SIGABRT (@0x2ab9) received by PID 10937 (TID 0x7fa7f0658700) from PID 10937; stack trace: ***
    @          0x2da5562 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa8c99cc630 (unknown)
    @     0x7fa8c7db4387 __GI_raise
    @     0x7fa8c7db5a78 __GI_abort
    @          0x12e91ff _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x391d6f6 __cxxabiv1::__terminate()
    @          0x391d761 std::terminate()
    @          0x391d8b5 __cxa_throw
    @          0x12e80de _ZN12_GLOBAL__N_110handle_oomEPFPvS0_ES0_bb.cold
    @          0x39ce27e tcmalloc::allocate_full_cpp_throw_oom()
    @          0x399646a std::__cxx11::basic_string<>::_M_mutate()
    @          0x3996e90 std::__cxx11::basic_string<>::_M_replace_aux()
    @          0x1c5c4fd apache::thrift::protocol::TBinaryProtocolT<>::readStringBody<>()
    @          0x1c5c6ac apache::thrift::protocol::TVirtualProtocol<>::readMessageBegin_virt()
    @          0x1e3d3c9 apache::thrift::TDispatchProcessor::process()
    @          0x2d91062 apache::thrift::server::TConnectedClient::run()
    @          0x2d88d13 apache::thrift::server::TThreadedServer::TConnectedClientRunner::run()
    @          0x2d8ab10 apache::thrift::concurrency::Thread::threadMain()
    @          0x2d7c500 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvSt10shared_ptrIN6apache6thrift11concurrency6ThreadEEES8_EEEEE6_M_runEv
    @          0x3998d40 execute_native_thread_routine
    @     0x7fa8c99c4ea5 start_thread
    @     0x7fa8c7e7c9fd __clone
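The key phrase in the log is std::bad_alloc: a memory allocation failed (here a single request of 1811947520 bytes, about 1.7 GiB, made while Thrift was reading a string), and the uncaught exception aborted the process.
The nodes evidently ran out of memory, and the failures cascaded from one node to the next. In a larger cluster, not every node would necessarily go down.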
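
The two numbers in the log that matter most for triage are the abort timestamp and the size of the failed allocation. The following is a minimal shell sketch for decoding them. It assumes GNU date (as the log's own hint does), and the function names crash_time_utc and bytes_to_mib are illustrative only.

```shell
# Convert the epoch seconds from the "*** Aborted at ..." line to UTC.
# Assumes GNU date, which accepts "-d @<seconds>".
crash_time_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# Convert the byte count from the "tcmalloc: large alloc" line to whole MiB.
bytes_to_mib() {
    echo "$(( $1 / 1024 / 1024 ))"
}

crash_time_utc 1641348199    # prints 2022-01-05 02:03:19 UTC
bytes_to_mib 1811947520      # prints 1728
```

Knowing the exact crash time makes it easier to line up the abort with import jobs or queries that were running on the cluster at that moment.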

The BE is written in C++. For reference, here is an explanation of the error:

A bad_alloc thrown by operator new signals a serious resource problem: memory cannot be allocated, so objects cannot be constructed. The program cannot continue with its original logic, and there may not even be enough memory left to clean up.
In this case, letting the process terminate is the right behavior.

Solution ideas

Increase memory

The most direct fix is to add memory. Memory usage grows with data volume, and without headroom a node may not be able to absorb a sudden spike in imported data.

Optimize import configuration

The current version of StarRocks (1.19) has the following configuration items:

mem_limit=80%    # Share of the machine's total memory that the BE may use. No need to set it if the BE is deployed on its own; set it explicitly if the BE shares the machine with other memory-hungry services
load_process_max_memory_limit_bytes=107374182400    # Maximum memory used by all import threads on a single node: 100 GB
load_process_max_memory_limit_percent=80    # Maximum share of memory used by all import threads on a single node: 80%

You can limit memory usage by setting these options.
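
As a rough illustration of how these settings interact, the sketch below assumes the effective import limit is the smaller of load_process_max_memory_limit_bytes and load_process_max_memory_limit_percent applied to the BE's mem_limit. That rule is an assumption about how the BE combines the options, not something confirmed here, so check it against the documentation for your version. The function name effective_load_limit and the 128 GiB machine size are hypothetical.

```shell
# effective_load_limit TOTAL_BYTES MEM_LIMIT_PCT LOAD_LIMIT_BYTES LOAD_LIMIT_PCT
# ASSUMPTION: limit = min(LOAD_LIMIT_BYTES, TOTAL * MEM_LIMIT_PCT% * LOAD_LIMIT_PCT%).
# The body runs in a subshell, so its variables do not leak into the caller.
effective_load_limit() (
    total_bytes=$1
    mem_limit_pct=$2
    load_limit_bytes=$3
    load_limit_pct=$4

    # Memory the BE process may use in total.
    be_limit=$(( total_bytes * mem_limit_pct / 100 ))
    # Share of that budget available to import threads.
    load_by_pct=$(( be_limit * load_limit_pct / 100 ))

    # The stricter of the two caps wins.
    if [ "$load_by_pct" -lt "$load_limit_bytes" ]; then
        echo "$load_by_pct"
    else
        echo "$load_limit_bytes"
    fi
)

# Hypothetical 128 GiB machine with the values shown above.
effective_load_limit "$(( 128 * 1024 * 1024 * 1024 ))" 80 107374182400 80
# prints 87960930221 (about 81.9 GiB)
```

Under this assumption, the 100 GB byte cap only takes effect on machines large enough that the percentage-based budget exceeds it; on smaller machines the percentage is the binding limit.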

Other memory optimization parameters are listed in the StarRocks BE configuration documentation.

Set memory allocation parameters

It is recommended to set /proc/sys/vm/overcommit_memory to 1. Check the current value with cat /proc/sys/vm/overcommit_memory, then set it:

echo 1 | sudo tee /proc/sys/vm/overcommit_memory
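
The echo above only changes the running kernel; the value reverts on reboot. The following sketch reads the current mode and prints its meaning as defined by the Linux kernel. The helper name describe_overcommit is illustrative, and the commented line shows one common way to persist the setting, assuming the system reads /etc/sysctl.conf at boot.

```shell
# Map a vm.overcommit_memory value to its meaning in the Linux kernel.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (kernel default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "unknown" ;;
    esac
}

# Report the mode currently in effect (Linux only; skipped elsewhere).
if [ -r /proc/sys/vm/overcommit_memory ]; then
    describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)"
fi

# To persist the setting across reboots (assumes /etc/sysctl.conf is used):
#   echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf
#   sudo sysctl -p
```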

Table optimization

Memory table: StarRocks supports caching all of a table's data in memory to speed up queries. Memory tables are suited to dimension tables with a small number of rows.

In practice, however, memory table optimization is still immature, so it is best not to use memory tables for now.

Upgrade StarRocks

The new version of StarRocks (2.0) improves memory management, which can also mitigate the problem to some extent:

  • Memory management optimization

    • Refactored the memory statistics/control framework to count memory usage accurately and completely resolve OOM problems
    • Optimized metadata memory usage
    • Fixed execution threads stalling for a long time after a large block of memory is released
    • Added a graceful process exit mechanism to support memory leak checks #1093


Keywords: Database

Added by BlueSkyIS on Thu, 06 Jan 2022 09:28:52 +0200