Presto memory pool distribution
To prevent nodes in the cluster from running out of memory (OOM), Presto runs a periodic thread on the coordinator that collects the current set of cluster nodes and the cluster's overall memory usage. Presto's query memory is divided into two pools: RESERVED_POOL and GENERAL_POOL.
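As a rough sketch of this architecture (hypothetical names and cadence, not Presto's actual ClusterMemoryManager code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the coordinator-side loop: poll every worker's memory
// pool stats on a fixed cadence, then decide whether any query must be killed.
final class ClusterMemoryMonitorSketch
{
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    void start()
    {
        executor.scheduleWithFixedDelay(this::refresh, 1, 1, TimeUnit.SECONDS);
    }

    private void refresh()
    {
        // 1. fetch the memory pool info of each active worker node
        // 2. count how many nodes are blocked in each pool
        // 3. if the cluster is out of memory, apply the configured kill policy
    }
}
```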
Judging whether a node is blocked (out of memory):
If RESERVED_POOL is in use (meaning the query with the largest memory footprint has been promoted into this pool), the cluster is judged to be out of memory when both of the following hold:

- RESERVED_POOL is already occupied by a query
- GENERAL_POOL has blocked nodes
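In code form the check amounts to something like the following (a simplified sketch; the method names only approximate Presto's ClusterMemoryManager):

```java
// Simplified sketch: the cluster is out of memory only when the largest query
// has already been promoted into RESERVED_POOL *and* GENERAL_POOL still has
// nodes with no free memory.
boolean isClusterOutOfMemory(ClusterMemoryPool reservedPool, ClusterMemoryPool generalPool)
{
    return reservedPool.getAssignedQueries() > 0 && generalPool.getBlockedNodes() > 0;
}
```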
Because RESERVED_POOL tends to waste memory, our cluster is configured not to use it and relies on GENERAL_POOL alone, so we only need to look at how GENERAL_POOL decides whether a node is blocked:
```java
if (poolInfo.getFreeBytes() + poolInfo.getReservedRevocableBytes() <= 0) {
    blockedNodes++;
}
```
getReservedRevocableBytes returns the revocable memory, i.e. memory that could be reclaimed by spilling to disk. Our cluster does not enable spill-to-disk (`experimental.spill-enabled`), because Presto serves ad-hoc scenarios where speed matters; if spilling is really needed, Spark is a better choice. Moreover, when we tested the spill implementation in earlier Presto versions it had stability issues, and we have few scenarios that require it.

So the check simply asks whether GENERAL_POOL has any memory left: if free plus revocable memory is less than or equal to 0, the node is considered blocked.
Kill strategy:
Traverse all running queries. If a query carries the `resource_overcommit` session flag (set via `SET SESSION resource_overcommit = true`) and the cluster is out of memory, queries with this flag are killed first:
```java
if (resourceOvercommit && outOfMemory) {
    // If a query has requested resource overcommit, only kill it if the cluster has run out of memory
    DataSize memory = succinctBytes(getQueryMemoryReservation(query));
    query.fail(new PrestoException(CLUSTER_OUT_OF_MEMORY,
            format("The cluster is out of memory and %s=true, so this query was killed. It was using %s of memory", RESOURCE_OVERCOMMIT, memory)));
    queryKilled = true;
}
```
If the query does not have the RESOURCE_OVERCOMMIT flag, check whether its memory usage exceeds what the cluster allows:
```java
if (!resourceOvercommit) {
    long userMemoryLimit = min(maxQueryMemory.toBytes(), getQueryMaxMemory(query.getSession()).toBytes());
    if (userMemoryReservation > userMemoryLimit) {
        query.fail(exceededGlobalUserLimit(succinctBytes(userMemoryLimit)));
        queryKilled = true;
    }

    // enforce global total memory limit if system pool is disabled
    long totalMemoryLimit = min(maxQueryTotalMemory.toBytes(), getQueryMaxTotalMemory(query.getSession()).toBytes());
    if (!isLegacySystemPoolEnabled && totalMemoryReservation > totalMemoryLimit) {
        query.fail(exceededGlobalTotalLimit(succinctBytes(totalMemoryLimit)));
        queryKilled = true;
    }
}
```
From the above, for this check to take effect, `query.max-memory` must be set to a reasonable value. For example, suppose we have five workers and each worker allows a query at most 10GB (beyond 10GB the worker kills the query itself); the cluster can then give a single query at most 5 * 10GB = 50GB. If `query.max-memory` (maxQueryMemory) is set to 100GB, the branch above can never be taken and this protection is effectively disabled.
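As an illustrative sketch for the 5-worker example above (the values are examples, not tuning advice), the relevant properties might look like this:

```properties
# Worker-side per-query limits; a worker kills any query that exceeds them.
query.max-memory-per-node=10GB
query.max-total-memory-per-node=10GB

# Coordinator-side cluster-wide per-query limits. They must not exceed
# (number of workers) * (per-node limit) = 50GB here, otherwise the global
# check above can never fire.
query.max-memory=50GB
query.max-total-memory=50GB
```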
In addition, if some nodes in the cluster are found to be out of memory but no query has been killed after 5s, one of the following two strategies is used to pick a Presto query to kill; `query.low-memory-killer.policy` selects which:
```java
public static class LowMemoryKillerPolicy
{
    public static final String NONE = "none";
    public static final String TOTAL_RESERVATION = "total-reservation";
    public static final String TOTAL_RESERVATION_ON_BLOCKED_NODES = "total-reservation-on-blocked-nodes";
}
```
- `total-reservation` kills the query using the most memory in the entire cluster.
- `total-reservation-on-blocked-nodes` kills the query using the most memory on the nodes that are out of memory (blocked).
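A minimal sketch of the `total-reservation` idea (hypothetical types and names; Presto's real implementations are its LowMemoryKiller classes):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified view of a running query: its id and how much
// memory it currently has reserved across the cluster.
record QueryMemoryInfo(String queryId, long totalReservationBytes) {}

final class TotalReservationKillerSketch
{
    // total-reservation: pick the query holding the most memory as the victim.
    static Optional<String> chooseQueryToKill(List<QueryMemoryInfo> runningQueries)
    {
        return runningQueries.stream()
                .max(Comparator.comparingLong(QueryMemoryInfo::totalReservationBytes))
                .map(QueryMemoryInfo::queryId);
    }
}
```

The on-blocked-nodes variant applies the same max-reservation selection, but only over queries running on the blocked nodes. The policy is chosen on the coordinator, e.g. `query.low-memory-killer.policy=total-reservation-on-blocked-nodes`.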
If we use RESOURCE_GROUPs to limit cluster concurrency to 10 and limit each query to at most 10GB on a single node, total demand stays under 10 * 10GB = 100GB, and as long as that is within the memory the cluster actually has, this configuration will certainly not cause OOM (memory leaks aside). However, it also leaves resources underutilized. Production clusters are therefore usually configured so that concurrency * per-query single-node memory is much larger than the maximum memory the cluster allows, in the expectation that the LowMemoryKillerPolicy will kill the query occupying the most memory when needed. In practice, workers can still OOM; when that happens, the general approach is:
1. Use dmesg to check the system log and determine whether the OOM was triggered by the operating system (the kernel OOM killer) or by the JVM. If the operating system triggered it, the cause may be off-heap memory usage or a Presto memory leak; more often, though, you should check whether other services run on the Presto node, since a sudden memory spike in another service can exhaust system memory.
2. Check whether the configuration is correct: is the single-node maximum memory limited, is the cluster maximum memory limited and smaller than the single-node maximum * the number of cluster nodes, and is a LowMemoryKillerPolicy configured?
3. If the above configurations are all correct, two issues need to be considered:
1) Why the LowMemoryKillerPolicy was not triggered. The logic that judges a node as blocked (out of memory) can be made stricter, since the check is periodic rather than real-time.
2) Whether the node has a memory leak.
From the above, Presto's memory management is fairly coarse: when memory runs out, it kills queries, whereas some other engines are more refined and block a query until memory becomes available. But since Presto targets ad-hoc scenarios, this simpler approach is effective.
Translated from: http://armsword.com/2020/02/18/presto-memory-kill-policy/