background
- Read the F**king Source Code! --By Lu Xun
- A picture is worth a thousand words. --By Golgi
Notes:
- Kernel version: 4.14
- ARM64 processor, Cortex-A53, dual core
- Tools used: Source Insight 3.5, Visio
1. overview
This article analyzes the watermark mechanism.
Simply put, when the zoned page frame allocator allocates pages, the number of free pages is compared with the zone's watermark to decide whether the allocation may proceed.
The watermark is also used to decide when the kswapd kernel thread sleeps and wakes up, so that memory can be reclaimed and compacted.
Recall the struct zone structure mentioned earlier:
struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long watermark[NR_WMARK];

    unsigned long nr_reserved_highatomic;
    ....
};

enum zone_watermarks {
    WMARK_MIN,
    WMARK_LOW,
    WMARK_HIGH,
    NR_WMARK
};

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

As you can see, there are three watermarks, and they should be accessed through the dedicated macros.
WMARK_MIN
The lowest point, where memory is critically short. If the calculated number of available pages is lower than this value, the page allocation cannot proceed normally.
WMARK_LOW
By default, the value is about 125% of WMARK_MIN. When free pages drop below it, kswapd is woken up. The gap can be changed by modifying watermark_scale_factor.
WMARK_HIGH
By default, the value is about 150% of WMARK_MIN. When free pages rise above it, kswapd goes back to sleep. The gap can be changed by modifying watermark_scale_factor.
Here's the picture:
The details will be further analyzed below.
1. watermark initialization
First, let's look at the functions called during initialization:
nr_free_buffer_pages: counts the available pages in ZONE_DMA and ZONE_NORMAL, computed as managed_pages - high_pages;
setup_per_zone_wmarks: calculates the watermark values according to min_free_kbytes. A picture makes this clear and easy to understand:
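Since the figure is not reproduced here, the calculation can be sketched in plain C. This is a simplified user-space model, assuming the 4.14 logic of __setup_per_zone_wmarks for zones without highmem; the names wmark_zone and setup_wmarks_model are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the watermark fields of struct zone. */
struct wmark_zone {
    unsigned long managed_pages;
    unsigned long wmark_min;
    unsigned long wmark_low;
    unsigned long wmark_high;
};

/*
 * Model of the watermark calculation (assumption: no highmem zones).
 * min_free_kbytes and watermark_scale_factor mirror the sysctls of the
 * same name; page_shift is 12 for 4 KB pages.
 */
static void setup_wmarks_model(struct wmark_zone *zones, int nr_zones,
                               unsigned long min_free_kbytes,
                               unsigned long watermark_scale_factor,
                               unsigned int page_shift)
{
    /* Convert the global minimum from KB to pages. */
    uint64_t pages_min = min_free_kbytes >> (page_shift - 10);
    uint64_t lowmem_pages = 0;
    int i;

    for (i = 0; i < nr_zones; i++)
        lowmem_pages += zones[i].managed_pages;
    if (lowmem_pages == 0)
        return;

    for (i = 0; i < nr_zones; i++) {
        struct wmark_zone *z = &zones[i];
        /* Each zone gets a share of pages_min proportional to its size. */
        uint64_t min = pages_min * z->managed_pages / lowmem_pages;
        /* The gap between marks is the larger of min/4 and the scaled size. */
        uint64_t step = min >> 2;
        uint64_t by_scale = (uint64_t)z->managed_pages *
                            watermark_scale_factor / 10000;

        if (by_scale > step)
            step = by_scale;

        z->wmark_min = (unsigned long)min;
        z->wmark_low = (unsigned long)(min + step);
        z->wmark_high = (unsigned long)(min + 2 * step);
    }
}
```

For a single zone of 262144 pages with min_free_kbytes = 4096, the model gives a min mark of 1024 pages. With a scale factor small enough that min/4 dominates, low and high land at exactly 125% and 150% of min; a larger watermark_scale_factor widens both gaps.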
refresh_zone_stat_thresholds:
Let's review struct pglist_data and struct zone.
typedef struct pglist_data {
    ...
    struct per_cpu_nodestat __percpu *per_cpu_nodestats;
    ...
} pg_data_t;

struct per_cpu_nodestat {
    s8 stat_threshold;
    s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
};

struct zone {
    ...
    struct per_cpu_pageset __percpu *pageset;
    ...
};

struct per_cpu_pageset {
    struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
    s8 expire;
    u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
#endif
#ifdef CONFIG_SMP
    s8 stat_threshold;
    s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};
From the data structures, we can see that both Node and Zone have a per-CPU structure to store statistics, and refresh_zone_stat_thresholds works on these two structures. It updates the stat_threshold field in both; the specific calculation method is not shown here. In addition, it calculates percpu_drift_mark, which is needed in the watermark judgment. The purpose of a threshold is to decide when a certain behavior is triggered, such as memory compaction.
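For readers who want a rough idea of the numbers involved, here is a minimal sketch of how the zone threshold and the drift mark could be derived. It assumes the 4.14 behavior of calculate_normal_threshold and refresh_zone_stat_thresholds; the helper names ending in _model are hypothetical, and returning 0 for "mark not set" is a simplification of this sketch.

```c
/* 1-based position of the highest set bit, 0 for x == 0 (like the kernel's fls()). */
static int fls_model(unsigned long x)
{
    int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/*
 * Model of the per-zone stat threshold: it grows with the number of
 * online CPUs and with the zone size (counted in 128 MB units), and is
 * capped at 125 so that it fits in an s8.
 */
static int normal_threshold_model(unsigned long managed_pages,
                                  unsigned int online_cpus,
                                  unsigned int page_shift)
{
    unsigned long mem = managed_pages >> (27 - page_shift);
    int threshold = 2 * fls_model(online_cpus) * (1 + fls_model(mem));

    return threshold > 125 ? 125 : threshold;
}

/*
 * Model of percpu_drift_mark: it is only set when the worst-case drift
 * of all per-CPU counters exceeds the gap between the low and min marks.
 * Returns 0 when the mark would not be set (assumption of this sketch).
 */
static unsigned long drift_mark_model(unsigned long wmark_min,
                                      unsigned long wmark_low,
                                      unsigned long wmark_high,
                                      unsigned int online_cpus,
                                      int threshold)
{
    unsigned long max_drift = (unsigned long)online_cpus * threshold;
    unsigned long tolerate_drift = wmark_low - wmark_min;

    return max_drift > tolerate_drift ? wmark_high + max_drift : 0;
}
```

On the dual-core board used in this article, a 1 GB zone with 4 KB pages would get a threshold of 20 under this model, so the worst-case drift is only 40 pages and the drift mark stays unset as long as the low and min marks are at least 40 pages apart.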
setup_per_zone_lowmem_reserve:
Sets the lowmem reserve size of each zone. The implementation logic in the code is shown in the figure below.
calculate_totalreserve_pages:
Calculates the reserved pages of each zone and the total reserved pages of the system; the high watermark is counted as part of the reserved pages. As shown in the picture:
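As a substitute for the two figures, the logic of setup_per_zone_lowmem_reserve and calculate_totalreserve_pages can be sketched as follows. This is a user-space model for a single node with two zones, assuming the 4.14 logic; reserve_zone, the _model functions, and the ratio array passed in by the caller are hypothetical names (the kernel reads the ratios from sysctl_lowmem_reserve_ratio).

```c
#define MODEL_NR_ZONES 2   /* e.g. ZONE_DMA and ZONE_NORMAL */

/* Hypothetical, simplified stand-in for the relevant fields of struct zone. */
struct reserve_zone {
    unsigned long managed_pages;
    unsigned long high_wmark;
    unsigned long lowmem_reserve[MODEL_NR_ZONES];
};

/*
 * For every zone j, each lower zone idx reserves a fraction of the pages
 * managed by the zones above it, up to and including zone j. That amount
 * protects zone idx from allocations that could have used zone j.
 */
static void setup_lowmem_reserve_model(struct reserve_zone *zones,
                                       const unsigned long *ratio)
{
    int j, idx;

    for (j = 0; j < MODEL_NR_ZONES; j++) {
        unsigned long managed = zones[j].managed_pages;

        zones[j].lowmem_reserve[j] = 0;
        for (idx = j - 1; idx >= 0; idx--) {
            unsigned long r = ratio[idx] < 1 ? 1 : ratio[idx];

            zones[idx].lowmem_reserve[j] = managed / r;
            managed += zones[idx].managed_pages;
        }
    }
}

/*
 * Per zone: the largest lowmem_reserve entry plus the high watermark,
 * capped at the zone size. The sum over all zones is the total reserve.
 */
static unsigned long total_reserve_model(const struct reserve_zone *zones)
{
    unsigned long total = 0;
    int i, j;

    for (i = 0; i < MODEL_NR_ZONES; i++) {
        unsigned long max = 0;

        for (j = i; j < MODEL_NR_ZONES; j++)
            if (zones[i].lowmem_reserve[j] > max)
                max = zones[i].lowmem_reserve[j];

        max += zones[i].high_wmark;
        if (max > zones[i].managed_pages)
            max = zones[i].managed_pages;
        total += max;
    }
    return total;
}
```

With a lower zone of 100000 pages, an upper zone of 200000 pages, and a ratio of 256 for the lower zone, the lower zone keeps 200000 / 256 = 781 pages out of reach of allocations that could have been served from the upper zone.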
2. watermark judgment
As usual, let's first look at the function call graph:
__zone_watermark_ok:
The key function of the watermark judgment. As the call relationships in the figure show, all paths end up in it. Let's use a picture to illustrate the overall logic:
In the figure above, the left side determines whether there are enough free pages. The right side queries free_area[] directly to see whether the allocation can actually be satisfied.
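The two halves of that logic can be sketched in a few lines of C. This is a deliberately reduced model, assuming the 4.14 behavior of __zone_watermark_ok: the alloc_flags adjustments (ALLOC_HIGH, ALLOC_HARDER, CMA) and the per-migratetype free lists are left out, and wm_zone and watermark_ok_model are hypothetical names.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11

/* Hypothetical, simplified stand-in for the relevant fields of struct zone. */
struct wm_zone {
    unsigned long nr_free[MODEL_MAX_ORDER];  /* free blocks per order */
    unsigned long lowmem_reserve;            /* entry for the requested classzone */
    unsigned long nr_reserved_highatomic;
};

static bool watermark_ok_model(const struct wm_zone *z, unsigned int order,
                               unsigned long mark, long free_pages)
{
    long min = (long)mark;
    unsigned int o;

    /* Left side: are enough pages free once this request is accounted for? */
    free_pages -= (1L << order) - 1;   /* may go negative, which is fine */
    free_pages -= (long)z->nr_reserved_highatomic;
    if (free_pages <= min + (long)z->lowmem_reserve)
        return false;

    /* An order-0 request needs nothing more than the page count. */
    if (order == 0)
        return true;

    /* Right side: is there a free block of at least the requested order? */
    for (o = order; o < MODEL_MAX_ORDER; o++)
        if (z->nr_free[o])
            return true;

    return false;
}
```

The second half matters for high-order requests: a zone can be well above its watermark in total free pages and still fail the check because every free block is smaller than the requested order.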
zone_watermark_ok:
Directly calls __zone_watermark_ok. There is no other logic.
zone_watermark_fast:
As the name suggests, this is a fast check. The fast path applies when order = 0: if the condition is met, it returns true directly. Otherwise, it calls __zone_watermark_ok.
Here is the code, which makes this clear:
static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
        unsigned long mark, int classzone_idx, unsigned int alloc_flags)
{
    long free_pages = zone_page_state(z, NR_FREE_PAGES);
    long cma_pages = 0;

#ifdef CONFIG_CMA
    /* If allocation can't use CMA areas don't use free CMA pages */
    if (!(alloc_flags & ALLOC_CMA))
        cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

    /*
     * Fast check for order-0 only. If this fails then the reserves
     * need to be calculated. There is a corner case where the check
     * passes but only the high-order atomic reserve are free. If
     * the caller is !atomic then it'll uselessly search the free
     * list. That corner case is then slower but it is harmless.
     */
    if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
        return true;

    return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
                    free_pages);
}
zone_watermark_ok_safe:
In zone_watermark_ok_safe, the main addition is a call to zone_page_state_snapshot to calculate free_pages. This calculation is more accurate than reading zone_page_state(z, NR_FREE_PAGES) directly.
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
        unsigned long mark, int classzone_idx)
{
    long free_pages = zone_page_state(z, NR_FREE_PAGES);

    if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
        free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

    return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
                    free_pages);
}

percpu_drift_mark is set in the refresh_zone_stat_thresholds function, which was discussed above.
Each zone maintains three fields for page statistics, as follows:
struct zone {
    ...
    struct per_cpu_pageset __percpu *pageset;
    ...
    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;
    ...
    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
};
In memory management, the kernel reads the number of free pages and compares it with the watermark values. To read the correct number of free pages, it must read both vm_stat[] and the per-CPU counters in __percpu *pageset. Doing this on every read would reduce efficiency, so percpu_drift_mark is set: only when the free page count is lower than this value is the more accurate calculation triggered, which preserves performance.
When a per-CPU counter in __percpu *pageset is updated and its value exceeds stat_threshold, it is folded into vm_stat[], as shown below:
zone_page_state_snapshot is called in zone_watermark_ok_safe; its difference from zone_page_state is shown in the following figure:
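In place of the two figures, here is a small user-space model of one counter. It is a sketch under the assumption that it mirrors the 4.14 update path and the two read helpers; counter_model and the _model functions are hypothetical names, and the CPU is passed in explicitly instead of being the current CPU.

```c
#define MODEL_NR_CPUS 2

/* Hypothetical model of one statistics item, e.g. NR_FREE_PAGES. */
struct counter_model {
    long global;                       /* stands in for vm_stat[item] */
    signed char diff[MODEL_NR_CPUS];   /* stands in for each CPU's vm_stat_diff[item] */
    signed char stat_threshold;
};

/* Update path: accumulate locally, fold into the global counter past the threshold. */
static void mod_state_model(struct counter_model *c, int cpu, long delta)
{
    long x = delta + c->diff[cpu];
    long t = c->stat_threshold;

    if (x > t || x < -t) {
        c->global += x;
        x = 0;
    }
    c->diff[cpu] = (signed char)x;
}

/* Cheap read: the global counter only, so pending per-CPU deltas are missed. */
static long page_state_model(const struct counter_model *c)
{
    long x = c->global;

    return x < 0 ? 0 : x;
}

/* Accurate read: the global counter plus every CPU's pending delta. */
static long page_state_snapshot_model(const struct counter_model *c)
{
    long x = c->global;
    int cpu;

    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        x += c->diff[cpu];

    return x < 0 ? 0 : x;
}
```

With a threshold of 20 and a global value of 1000, taking 15 pages on CPU 0 and 10 pages on CPU 1 leaves the cheap read at 1000 while the snapshot reports 975. That gap is exactly the drift that percpu_drift_mark guards against.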
That concludes the analysis of watermark. Finish!