Zones in the Linux kernel

struct zone

As the three Linux memory models show, the kernel divides physical memory into zones according to how the memory can be used. Zones play a central role in physical memory management, and each one is described by a struct zone. In version 5.8.10 the structure is defined as follows:

struct zone {
	/* Read-mostly fields */

	/* zone watermarks, access with *_wmark_pages(zone) macros */
	unsigned long _watermark[NR_WMARK];
	unsigned long watermark_boost;

	unsigned long nr_reserved_highatomic;

	/*
	 * We don't know if the memory that we're going to allocate will be
	 * freeable or/and it will be released eventually, so to avoid totally
	 * wasting several GB of ram we must reserve some of the lower zone
	 * memory (otherwise we risk to run OOM on the lower zones despite
	 * there being tons of freeable ram on the higher zones).  This array is
	 * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
	 * changes.
	 */
	long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
	int node;
#endif
	struct pglist_data	*zone_pgdat;
	struct per_cpu_pageset __percpu *pageset;

#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long		*pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

	/*
	 * spanned_pages is the total pages spanned by the zone, including
	 * holes, which is calculated as:
	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
	 *
	 * present_pages is physical pages existing within the zone, which
	 * is calculated as:
	 *	present_pages = spanned_pages - absent_pages(pages in holes);
	 *
	 * managed_pages is present pages managed by the buddy system, which
	 * is calculated as (reserved_pages includes pages allocated by the
	 * bootmem allocator):
	 *	managed_pages = present_pages - reserved_pages;
	 *
	 * So present_pages may be used by memory hotplug or memory power
	 * management logic to figure out unmanaged pages by checking
	 * (present_pages - managed_pages). And managed_pages should be used
	 * by page allocator and vm scanner to calculate all kinds of watermarks
	 * and thresholds.
	 *
	 * Locking rules:
	 *
	 * zone_start_pfn and spanned_pages are protected by span_seqlock.
	 * It is a seqlock because it has to be read outside of zone->lock,
	 * and it is done in the main allocator path.  But, it is written
	 * quite infrequently.
	 *
	 * The span_seq lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 *
	 * Write access to present_pages at runtime should be protected by
	 * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
	 * present_pages should get_online_mems() to get a stable value.
	 */
	atomic_long_t		managed_pages;
	unsigned long		spanned_pages;
	unsigned long		present_pages;

	const char		*name;

#ifdef CONFIG_MEMORY_ISOLATION
	/*
	 * Number of isolated pageblock. It is used to solve incorrect
	 * freepage counting problem due to racy retrieving migratetype
	 * of pageblock. Protected by zone->lock.
	 */
	unsigned long		nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif

	int initialized;

	/* Write-intensive fields used from the page allocator */
	ZONE_PADDING(_pad1_)

	/* free areas of different sizes */
	struct free_area	free_area[MAX_ORDER];

	/* zone flags, see below */
	unsigned long		flags;

	/* Primarily protects free_area */
	spinlock_t		lock;

	/* Write-intensive fields used by compaction and vmstats. */
	ZONE_PADDING(_pad2_)

	/*
	 * When free pages are below this point, additional steps are taken
	 * when reading the number of free pages to avoid per-cpu counter
	 * drift allowing watermarks to be breached
	 */
	unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	unsigned long		compact_cached_free_pfn;
	/* pfn where async and sync compaction migration scanner should start */
	unsigned long		compact_cached_migrate_pfn[2];
	unsigned long		compact_init_migrate_pfn;
	unsigned long		compact_init_free_pfn;
#endif

#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 */
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	int			compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* Set to true when the PG_migrate_skip bits should be cleared */
	bool			compact_blockskip_flush;
#endif

	bool			contiguous;

	ZONE_PADDING(_pad3_)
	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

The structure members hold the information needed to manage the physical memory in the zone. The main members are as follows:

  • unsigned long _watermark[NR_WMARK]: the watermarks of the zone. The kernel reacts differently as free memory falls below each watermark, for example by reclaiming memory to free up enough space and, in the worst case, by triggering the OOM killer
  • long lowmem_reserve[MAX_NR_ZONES]: pages reserved in this zone so that allocations which could have been served from higher zones cannot exhaust it
  • struct pglist_data *zone_pgdat: the pglist_data node that the zone belongs to
  • unsigned long zone_start_pfn: the first physical page frame number (pfn) of the zone
  • atomic_long_t managed_pages: the number of pages in the zone managed by the buddy allocator, equal to present_pages - reserved_pages
  • unsigned long spanned_pages: the total number of pages spanned by the zone, including holes, equal to zone_end_pfn - zone_start_pfn
  • unsigned long present_pages: the number of pages that physically exist in the zone, equal to spanned_pages - absent_pages (pages in holes)
  • const char *name: the zone name
  • struct free_area free_area[MAX_ORDER]: the free lists of the buddy allocator, one per order
  • unsigned long flags: the zone flags
  • atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]: per-zone memory statistics, one counter per usage state
  • atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS]: NUMA allocation statistics for the zone (hits, misses and so on)
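The relationship between the three page counters can be checked with a small user-space model. This is only a sketch: struct toy_zone and the toy_* helpers are made up for illustration and are not kernel types, and the hole and reserved page counts are passed in by hand instead of being discovered from the memory map.

```c
#include <assert.h>

/* Toy stand-in for the page counters of struct zone; not the kernel type. */
struct toy_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;	/* whole pfn range, holes included */
	unsigned long present_pages;	/* spanned_pages minus pages in holes */
	unsigned long managed_pages;	/* present_pages minus reserved pages */
};

/* Build a zone from its pfn range plus the hole and reserved page counts. */
static struct toy_zone toy_zone_init(unsigned long start_pfn,
				     unsigned long end_pfn,
				     unsigned long absent_pages,
				     unsigned long reserved_pages)
{
	struct toy_zone z;

	assert(end_pfn >= start_pfn);
	z.zone_start_pfn = start_pfn;
	z.spanned_pages = end_pfn - start_pfn;

	assert(absent_pages <= z.spanned_pages);
	z.present_pages = z.spanned_pages - absent_pages;

	assert(reserved_pages <= z.present_pages);
	z.managed_pages = z.present_pages - reserved_pages;
	return z;
}

/* One past the last pfn covered by the zone. */
static unsigned long toy_zone_end_pfn(const struct toy_zone *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}

/* Pages that exist but were never handed to the buddy allocator. */
static unsigned long toy_zone_unmanaged(const struct toy_zone *z)
{
	return z->present_pages - z->managed_pages;
}
```

For example, toy_zone_init(4096, 1048576, 1000, 500) describes a zone spanning 1044480 pages, of which 1043480 are present and 1042980 are managed.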

 zone_type

The kernel classifies zones into types according to their use. The types are defined by enum zone_type in include/linux/mmzone.h:

  • ZONE_DMA: exists mainly for compatibility with ISA devices, whose DMA can only reach memory addresses below 16 MB, so this range has to be set aside and managed separately.
  • ZONE_DMA32: serves devices limited to 32-bit DMA addresses on 64-bit systems. The 16 MB of ZONE_DMA is too small, and many devices can address a full 32 bits, so 64-bit systems add ZONE_DMA32, which covers the physical memory below 4 GB. The origin of this zone is described in detail below.
  • ZONE_NORMAL: the ordinary memory area. Most allocations are served from this zone.
  • ZONE_HIGHMEM: only exists on 32-bit systems. There the kernel can directly map only 896 MB of physical memory, so to support machines with more memory, everything above 896 MB is placed in high memory to make up for the shortage of address space. High memory has no permanent one-to-one mapping; it is mapped when it is used. A 64-bit system has enough address space and does not need ZONE_HIGHMEM.
  • ZONE_MOVABLE: the movable or reclaimable area. It is generally called a pseudo zone because the memory it manages is taken from ZONE_NORMAL or ZONE_HIGHMEM. It exists mainly to limit memory fragmentation and to support memory hotplug. The kernel groups the movable memory of ZONE_NORMAL or ZONE_HIGHMEM into ZONE_MOVABLE so that it is easy to find and migrate.
  • ZONE_DEVICE: device memory, hot-pluggable.
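As a rough illustration of the address limits in the list above, the following sketch maps a physical address to a zone type. It assumes an x86-64 style layout with the 16 MB and 4 GB limits only, ignores ZONE_MOVABLE and ZONE_DEVICE, and uses made-up toy_* names; the real boundaries depend on the kernel configuration and the machine's memory map.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified zone types for this sketch only; not the kernel's enum zone_type. */
enum toy_zone_type { TOY_ZONE_DMA, TOY_ZONE_DMA32, TOY_ZONE_NORMAL };

#define TOY_DMA_LIMIT	(16ULL << 20)	/* 16 MB: reach of 24-bit ISA DMA */
#define TOY_DMA32_LIMIT	(4ULL << 30)	/* 4 GB: reach of 32-bit DMA */

/* Pick the zone type whose address range contains paddr. */
static enum toy_zone_type toy_zone_for_paddr(uint64_t paddr)
{
	if (paddr < TOY_DMA_LIMIT)
		return TOY_ZONE_DMA;
	if (paddr < TOY_DMA32_LIMIT)
		return TOY_ZONE_DMA32;
	return TOY_ZONE_NORMAL;
}

/* Human-readable name for a toy zone type. */
static const char *toy_zone_name(enum toy_zone_type type)
{
	static const char *const names[] = { "DMA", "DMA32", "Normal" };

	assert(type <= TOY_ZONE_NORMAL);
	return names[type];
}
```

With these limits, an address one byte below 16 MB falls in the DMA zone, the 16 MB boundary itself falls in DMA32, and anything at or above 4 GB falls in Normal.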

ZONE_DMA32 history

ZONE_DMA32 has a long history, and knowing where it came from helps in understanding the overall design of physical memory management. An LWN article describes the origin of this zone:

Linux systems typically divide main memory into three zones. Most memory fits into the "normal" zone, ZONE_NORMAL. At the low end, however, there are 16MB of memory which are partitioned into the DMA zone ZONE_DMA; this memory is then reserved for situations where it is specifically needed. The most common user of DMA memory is older peripherals which can only address 24 bits of memory. Finally, on the high end, ZONE_HIGHMEM contains all memory which cannot be directly addressed by the kernel.

Not all systems implement all of these zones. Some newer architectures do not support ancient peripherals and leave out ZONE_DMA. In general, 64-bit systems have no addressing problems and do not need ZONE_HIGHMEM. The ia64 architecture settled on a different implementation of ZONE_DMA, defining it to cover all memory addressed below 4GB.

As it turns out, there are uses for a 4GB zone. Quite a few devices have trouble accessing memory which cannot be addressed with 32 bits. Drivers for such devices have been forced to use ZONE_DMA, the I/O memory management unit (on systems which have one), or bounce buffers. None of those solutions is ideal: ZONE_DMA is a small and scarce resource, IOMMU space can also be scarce, and bounce buffers are slow. All of these problems could be avoided if DMA memory could be reliably allocated below the 4GB boundary.

Andi Kleen has decided that the time has come for the x86-64 architecture to support a 32-bit DMA zone. So his patch adds a new zone (ZONE_DMA32) and an associated GFP flag (GFP_DMA32) for allocations. According to Andi, the reason which prevented the addition of this zone in the first place (the fact that the virtual memory subsystem had a very hard time balancing memory between zones) has gone away. Meanwhile, the lack of this zone is causing real problems.

Early devices could address at most 16 MB (24 bits) for DMA, so the kernel had to reserve that range for them. On a 64-bit system, however, confining all DMA memory to the first 16 MB leaves far too little, even though many devices can address a full 32 bits (4 GB), and the alternatives (the IOMMU or bounce buffers) have costs of their own. To solve this problem, Andi Kleen introduced ZONE_DMA32, a new zone that covers the physical memory below 4 GB. It matches the addressing range of 32-bit DMA devices and relieves the pressure on ZONE_DMA.

Origin of ZONE_MOVABLE

The Linux kernel divides memory into zones, and allocation and freeing within each zone are handled by the buddy allocator. The buddy allocator has one major weakness: after the system has been allocating and freeing memory for a long time, physical memory becomes heavily fragmented. A request for a large physically contiguous block can then fail even though enough free memory exists in total. This problem troubled the kernel community for a long time, until Mel Gorman proposed an elegant fragmentation avoidance scheme that the community accepted:

Mel Gorman's fragmentation avoidance patches have been discussed here a few times in the past. The core idea behind Mel's work is to identify pages which can be easily moved or reclaimed and group them together. Movable pages include those allocated to user space; moving them is just a matter of changing the relevant page table entries. Reclaimable pages include kernel caches which can be released should the need arise. Grouping these pages together makes it easy for the kernel to free large blocks of memory, which is useful for enabling high-order allocations or for vacating regions of memory entirely.

In Mel Gorman's scheme, physical pages are classified into types such as movable, reclaimable and unmovable, and pages of the movable and reclaimable types are grouped into a new zone, ZONE_MOVABLE (which is why ZONE_MOVABLE is generally called a pseudo zone). When the kernel needs a large contiguous block and free memory is insufficient, it migrates movable pages elsewhere or reclaims reclaimable ones until a large enough contiguous range is free. The contents of the migrated pages are copied to newly allocated physical pages, so the whole process is invisible to their users.

ZONE_MOVABLE zone plays two important roles:

  • It can effectively prevent memory fragmentation
  • It supports memory hotplug, especially in virtualization scenarios: physical memory that a guest does not need can be removed and given to others, and added back when it is needed again. In some scenarios, unused memory banks could also be powered down to save power. Linus Torvalds, however, was skeptical of both uses, as LWN reported:

 In particular, Linus is opposed to the idea. The biggest potential use for hot-unplugging is for virtualization; it allows a hypervisor to move memory resources between guests as their needs change. Linus points out that most virtualization mechanisms already have mechanisms which allow the addition and removal of individual pages from guests; there is, he says, no need for any other support for memory changes.

Another use for this technique is allowing systems to conserve power by turning off banks of memory when they are not needed. Clearly, one must be able to move all useful data out of a memory bank before powering it down. Linus is even more dismissive of this idea:

The whole DRAM power story is a bedtime story for gullible children. Don't fall for it. It's not realistic. The hardware support for it DOES NOT EXIST today, and probably won't for several years. And the real fix is elsewhere anyway...

zone size allocation

The amount of physical memory managed by each zone type is determined during system initialization. The following describes the process on the x86 platform.

zone_sizes_init()

zone_sizes_init() is the entry point for sizing the zones. It is located in arch/x86/mm/init.c:

void __init zone_sizes_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES];

	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));

#ifdef CONFIG_ZONE_DMA
	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA_PFN, max_low_pfn);
#endif
#ifdef CONFIG_ZONE_DMA32
	max_zone_pfns[ZONE_DMA32]	= min(MAX_DMA32_PFN, max_low_pfn);
#endif
	max_zone_pfns[ZONE_NORMAL]	= max_low_pfn;
#ifdef CONFIG_HIGHMEM
	max_zone_pfns[ZONE_HIGHMEM]	= max_pfn;
#endif

	free_area_init(max_zone_pfns);
}

  • The maximum pfn of ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL is stored in the max_zone_pfns array. ZONE_DMA cannot extend beyond 16 MB, ZONE_DMA32 cannot extend beyond 4 GB, and max_low_pfn is the maximum pfn of ZONE_NORMAL.
  • free_area_init(): initializes the zones. Its argument is the array of upper pfn boundaries of each zone (note that ZONE_MOVABLE has no size at this point).
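To make the limits concrete, here is a user-space sketch of the same calculation. It assumes 4 KB pages and a 64-bit configuration without CONFIG_HIGHMEM; the toy_* names and constants are illustrative stand-ins for MAX_DMA_PFN, MAX_DMA32_PFN and the kernel's min() macro, not the kernel definitions themselves.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT		12	/* assumption: 4 KB pages */
#define TOY_MAX_DMA_PFN		((16UL << 20) >> TOY_PAGE_SHIFT)
#define TOY_MAX_DMA32_PFN	((unsigned long)((4ULL << 30) >> TOY_PAGE_SHIFT))

enum { TOY_ZONE_DMA, TOY_ZONE_DMA32, TOY_ZONE_NORMAL, TOY_NR_ZONES };

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Fill max_zone_pfns[] with the upper pfn boundary of every zone. */
static void toy_zone_sizes(unsigned long max_low_pfn,
			   unsigned long max_zone_pfns[TOY_NR_ZONES])
{
	assert(max_zone_pfns);

	/* A small machine may end before the 16 MB or 4 GB limit. */
	max_zone_pfns[TOY_ZONE_DMA] = toy_min(TOY_MAX_DMA_PFN, max_low_pfn);
	max_zone_pfns[TOY_ZONE_DMA32] = toy_min(TOY_MAX_DMA32_PFN, max_low_pfn);
	max_zone_pfns[TOY_ZONE_NORMAL] = max_low_pfn;
}
```

On a machine whose memory ends at 8 GB (max_low_pfn = 2097152) the boundaries come out as 4096, 1048576 and 2097152. On a 2 GB machine (max_low_pfn = 524288) ZONE_DMA32 and ZONE_NORMAL share the same upper boundary, which leaves ZONE_NORMAL empty.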

free_area_init()

This function initializes each zone based on the max_zone_pfn array:

void __init free_area_init(unsigned long *max_zone_pfn)
{
	unsigned long start_pfn, end_pfn;
	int i, nid, zone;
	bool descending;

	/* Record where the zone boundaries are */
	memset(arch_zone_lowest_possible_pfn, 0,
				sizeof(arch_zone_lowest_possible_pfn));
	memset(arch_zone_highest_possible_pfn, 0,
				sizeof(arch_zone_highest_possible_pfn));

	start_pfn = find_min_pfn_with_active_regions();
	descending = arch_has_descending_max_zone_pfns();

	for (i = 0; i < MAX_NR_ZONES; i++) {
		if (descending)
			zone = MAX_NR_ZONES - i - 1;
		else
			zone = i;

		if (zone == ZONE_MOVABLE)
			continue;

		end_pfn = max(max_zone_pfn[zone], start_pfn);
		arch_zone_lowest_possible_pfn[zone] = start_pfn;
		arch_zone_highest_possible_pfn[zone] = end_pfn;

		start_pfn = end_pfn;
	}

	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
	find_zone_movable_pfns_for_nodes();

	/* Print out the zone ranges */
	pr_info("Zone ranges:\n");
	for (i = 0; i < MAX_NR_ZONES; i++) {
		if (i == ZONE_MOVABLE)
			continue;
		pr_info("  %-8s ", zone_names[i]);
		if (arch_zone_lowest_possible_pfn[i] ==
				arch_zone_highest_possible_pfn[i])
			pr_cont("empty\n");
		else
			pr_cont("[mem %#018Lx-%#018Lx]\n",
				(u64)arch_zone_lowest_possible_pfn[i]
					<< PAGE_SHIFT,
				((u64)arch_zone_highest_possible_pfn[i]
					<< PAGE_SHIFT) - 1);
	}

	/* Print out the PFNs ZONE_MOVABLE begins at in each node */
	pr_info("Movable zone start for each node\n");
	for (i = 0; i < MAX_NUMNODES; i++) {
		if (zone_movable_pfn[i])
			pr_info("  Node %d: %#018Lx\n", i,
			       (u64)zone_movable_pfn[i] << PAGE_SHIFT);
	}

	/*
	 * Print out the early node map, and initialize the
	 * subsection-map relative to active online memory ranges to
	 * enable future "sub-section" extensions of the memory map.
	 */
	pr_info("Early memory node ranges\n");
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
		pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
			(u64)start_pfn << PAGE_SHIFT,
			((u64)end_pfn << PAGE_SHIFT) - 1);
		subsection_map_init(start_pfn, end_pfn - start_pfn);
	}

	/* Initialise every node */
	mminit_verify_pageflags_layout();
	setup_nr_node_ids();
	init_unavailable_mem();
	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);
		free_area_init_node(nid);

		/* Any memory on that node */
		if (pgdat->node_present_pages)
			node_set_state(nid, N_MEMORY);
		check_for_memory(pgdat, nid);
	}
}

The main process is as follows:

  • find_min_pfn_with_active_regions(): returns the base of the first region in memblock, which becomes the starting pfn of the first zone.
  • arch_has_descending_max_zone_pfns(): architecture-specific. It tells whether the zones are laid out in ascending or descending order of physical address.
  • From the max_zone_pfn array and the actual start_pfn, the preliminary zone layout is computed: arch_zone_lowest_possible_pfn holds the starting pfn of each zone and arch_zone_highest_possible_pfn holds its end pfn.
  • find_zone_movable_pfns_for_nodes(): based on the configuration and the actual physical memory, determines for each node the pfn at which ZONE_MOVABLE begins (carved out of ZONE_NORMAL, or ZONE_HIGHMEM on 32-bit systems) and saves it in the zone_movable_pfn array, which is later used to populate ZONE_MOVABLE.
  • Prints arch_zone_lowest_possible_pfn and arch_zone_highest_possible_pfn for each zone. ZONE_MOVABLE is skipped at this point.
  • Prints the zone_movable_pfn array, that is, where ZONE_MOVABLE starts on each node.
  • Prints the physical memory ranges of all memblock regions so that they can be checked in the boot log.
  • mminit_verify_pageflags_layout(): validates the page flags layout.
  • setup_nr_node_ids(): on a NUMA system, computes the number of possible node ids.
  • init_unavailable_mem(): initializes the struct pages of physical ranges that are not covered by memblock.memory.
  • free_area_init_node(): initializes the physical memory information of each node.
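The boundary loop at the start of free_area_init() can be modelled in a few lines. The sketch below assumes ascending zone order and only three zones, skips ZONE_MOVABLE entirely, and uses hypothetical toy_* names and plain arrays in place of arch_zone_lowest_possible_pfn and arch_zone_highest_possible_pfn.

```c
#include <assert.h>

#define TOY_NR_ZONES 3	/* assumption: DMA, DMA32, NORMAL in ascending order */

static unsigned long toy_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Turn per-zone upper limits into [lowest, highest) pfn ranges.
 * Each zone starts where the previous one ended; a zone whose limit is
 * not above the current start ends up empty (lowest == highest).
 */
static void toy_zone_ranges(const unsigned long max_zone_pfn[TOY_NR_ZONES],
			    unsigned long start_pfn,
			    unsigned long lowest[TOY_NR_ZONES],
			    unsigned long highest[TOY_NR_ZONES])
{
	assert(max_zone_pfn && lowest && highest);

	for (int zone = 0; zone < TOY_NR_ZONES; zone++) {
		unsigned long end_pfn = toy_max(max_zone_pfn[zone], start_pfn);

		lowest[zone] = start_pfn;
		highest[zone] = end_pfn;
		start_pfn = end_pfn;
	}
}

/* An empty zone is what the kernel reports as "empty" in the boot log. */
static int toy_zone_is_empty(unsigned long lowest, unsigned long highest)
{
	return lowest == highest;
}
```

Taking the maximum with start_pfn is what keeps the ranges from going backwards when a zone's limit is not above the current start.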

find_zone_movable_pfns_for_nodes()

This function mainly obtains the movable part from the existing zone according to the configuration. The code logic is a little complicated:

static void __init find_zone_movable_pfns_for_nodes(void)
{
	int i, nid;
	unsigned long usable_startpfn;
	unsigned long kernelcore_node, kernelcore_remaining;
	/* save the state before borrow the nodemask */
	nodemask_t saved_node_state = node_states[N_MEMORY];
	unsigned long totalpages = early_calculate_totalpages();
	int usable_nodes = nodes_weight(node_states[N_MEMORY]);
	struct memblock_region *r;

	/* Need to find movable_zone earlier when movable_node is specified. */
	find_usable_zone_for_movable();

	/*
	 * If movable_node is specified, ignore kernelcore and movablecore
	 * options.
	 */
	if (movable_node_is_enabled()) {
		for_each_memblock(memory, r) {
			if (!memblock_is_hotpluggable(r))
				continue;

			nid = memblock_get_region_node(r);

			usable_startpfn = PFN_DOWN(r->base);
			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
				min(usable_startpfn, zone_movable_pfn[nid]) :
				usable_startpfn;
		}

		goto out2;
	}

	/*
	 * If kernelcore=mirror is specified, ignore movablecore option
	 */
	if (mirrored_kernelcore) {
		bool mem_below_4gb_not_mirrored = false;

		for_each_memblock(memory, r) {
			if (memblock_is_mirror(r))
				continue;

			nid = memblock_get_region_node(r);

			usable_startpfn = memblock_region_memory_base_pfn(r);

			if (usable_startpfn < 0x100000) {
				mem_below_4gb_not_mirrored = true;
				continue;
			}

			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
				min(usable_startpfn, zone_movable_pfn[nid]) :
				usable_startpfn;
		}

		if (mem_below_4gb_not_mirrored)
			pr_warn("This configuration results in unmirrored kernel memory.\n");

		goto out2;
	}

	/*
	 * If kernelcore=nn% or movablecore=nn% was specified, calculate the
	 * amount of necessary memory.
	 */
	if (required_kernelcore_percent)
		required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
				       10000UL;
	if (required_movablecore_percent)
		required_movablecore = (totalpages * 100 * required_movablecore_percent) /
					10000UL;

	/*
	 * If movablecore= was specified, calculate what size of
	 * kernelcore that corresponds so that memory usable for
	 * any allocation type is evenly spread. If both kernelcore
	 * and movablecore are specified, then the value of kernelcore
	 * will be used for required_kernelcore if it's greater than
	 * what movablecore would have allowed.
	 */
	if (required_movablecore) {
		unsigned long corepages;

		/*
		 * Round-up so that ZONE_MOVABLE is at least as large as what
		 * was requested by the user
		 */
		required_movablecore =
			roundup(required_movablecore, MAX_ORDER_NR_PAGES);
		required_movablecore = min(totalpages, required_movablecore);
		corepages = totalpages - required_movablecore;

		required_kernelcore = max(required_kernelcore, corepages);
	}

	/*
	 * If kernelcore was not specified or kernelcore size is larger
	 * than totalpages, there is no ZONE_MOVABLE.
	 */
	if (!required_kernelcore || required_kernelcore >= totalpages)
		goto out;

	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
	usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

restart:
	/* Spread kernelcore memory as evenly as possible throughout nodes */
	kernelcore_node = required_kernelcore / usable_nodes;
	for_each_node_state(nid, N_MEMORY) {
		unsigned long start_pfn, end_pfn;

		/*
		 * Recalculate kernelcore_node if the division per node
		 * now exceeds what is necessary to satisfy the requested
		 * amount of memory for the kernel
		 */
		if (required_kernelcore < kernelcore_node)
			kernelcore_node = required_kernelcore / usable_nodes;

		/*
		 * As the map is walked, we track how much memory is usable
		 * by the kernel using kernelcore_remaining. When it is
		 * 0, the rest of the node is usable by ZONE_MOVABLE
		 */
		kernelcore_remaining = kernelcore_node;

		/* Go through each range of PFNs within this node */
		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
			unsigned long size_pages;

			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
			if (start_pfn >= end_pfn)
				continue;

			/* Account for what is only usable for kernelcore */
			if (start_pfn < usable_startpfn) {
				unsigned long kernel_pages;
				kernel_pages = min(end_pfn, usable_startpfn)
								- start_pfn;

				kernelcore_remaining -= min(kernel_pages,
							kernelcore_remaining);
				required_kernelcore -= min(kernel_pages,
							required_kernelcore);

				/* Continue if range is now fully accounted */
				if (end_pfn <= usable_startpfn) {

					/*
					 * Push zone_movable_pfn to the end so
					 * that if we have to rebalance
					 * kernelcore across nodes, we will
					 * not double account here
					 */
					zone_movable_pfn[nid] = end_pfn;
					continue;
				}
				start_pfn = usable_startpfn;
			}

			/*
			 * The usable PFN range for ZONE_MOVABLE is from
			 * start_pfn->end_pfn. Calculate size_pages as the
			 * number of pages used as kernelcore
			 */
			size_pages = end_pfn - start_pfn;
			if (size_pages > kernelcore_remaining)
				size_pages = kernelcore_remaining;
			zone_movable_pfn[nid] = start_pfn + size_pages;

			/*
			 * Some kernelcore has been met, update counts and
			 * break if the kernelcore for this node has been
			 * satisfied
			 */
			required_kernelcore -= min(required_kernelcore,
								size_pages);
			kernelcore_remaining -= size_pages;
			if (!kernelcore_remaining)
				break;
		}
	}

	/*
	 * If there is still required_kernelcore, we do another pass with one
	 * less node in the count. This will push zone_movable_pfn[nid] further
	 * along on the nodes that still have memory until kernelcore is
	 * satisfied
	 */
	usable_nodes--;
	if (usable_nodes && required_kernelcore > usable_nodes)
		goto restart;

out2:
	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		zone_movable_pfn[nid] =
			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);

out:
	/* restore the node_state */
	node_states[N_MEMORY] = saved_node_state;
}

The process of this function is as follows:

  • early_calculate_totalpages(): computes the total number of physical pages from memblock.
  • usable_nodes = nodes_weight(node_states[N_MEMORY]): the number of nodes in the system that have memory (relevant on NUMA).
  • find_usable_zone_for_movable(): finds the index of the zone that ZONE_MOVABLE is carved from. This is ZONE_NORMAL on a 64-bit system, while a 32-bit system prefers ZONE_HIGHMEM.
  • movable_node_is_enabled(): checks whether the movable_node boot parameter was given. With it, whole nodes whose memory is hotpluggable are treated as movable. In that case the function records where the hotpluggable regions start as the beginning of ZONE_MOVABLE and ignores kernelcore and movablecore. Otherwise it continues.
  • If kernelcore=mirror is given, non-mirrored memory at or above pfn 0x100000 (4 GB) is treated as movable and movablecore is ignored. kernelcore is the command-line parameter that tells the kernel how much memory must stay unmovable.
  • If kernelcore is given as a percentage, required_kernelcore (the unmovable amount) is computed from totalpages.
  • If movablecore is given as a percentage, required_movablecore (the movable amount) is computed from totalpages in the same way.
  • If movablecore is set, it is rounded up to MAX_ORDER_NR_PAGES and capped at totalpages, and the remainder (corepages) is the kernelcore it implies. If kernelcore is also set, the larger of the two values is used, so kernelcore plus movablecore can never exceed the total.
  • kernelcore_node = required_kernelcore / usable_nodes: the unmovable memory is spread evenly over the nodes.
  • for_each_node_state(nid, N_MEMORY): walks every node that has memory and splits it into a movable and an unmovable part.
  • for_each_mem_pfn_range: walks the pfn ranges of the node and subtracts the pages assigned to kernelcore from required_kernelcore and kernelcore_remaining.
  • A range that lies entirely below usable_startpfn (end_pfn <= usable_startpfn) can only be kernelcore, so zone_movable_pfn[nid] is pushed to its end. Otherwise zone_movable_pfn[nid] is set just past the pages taken for kernelcore; once the node's share is satisfied, everything above that pfn belongs to ZONE_MOVABLE.
  • Before each node is processed, the function checks required_kernelcore < kernelcore_node. If less kernelcore remains than the per-node share, kernelcore_node is recalculated. If some kernelcore is still unassigned after a full pass, the pass is repeated with one node fewer.
  • After all nodes have been traversed, zone_movable_pfn holds, for each node, the pfn at which movable memory begins, aligned up to MAX_ORDER_NR_PAGES.
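The arithmetic that turns the kernelcore and movablecore settings into a single required_kernelcore value can be tried in isolation. The following is a user-space sketch of that part only, not of the per-node walk. It assumes MAX_ORDER_NR_PAGES is 1024 (MAX_ORDER of 11), and toy_required_kernelcore() and its parameters are hypothetical names that stand in for the kernel's global variables.

```c
#include <assert.h>

#define TOY_MAX_ORDER_NR_PAGES 1024UL	/* assumption: 1 << (MAX_ORDER - 1) with MAX_ORDER 11 */

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long toy_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long toy_roundup(unsigned long x, unsigned long align)
{
	return ((x + align - 1) / align) * align;
}

/*
 * Return the number of pages that must stay unmovable, or 0 when the
 * settings mean that no ZONE_MOVABLE is created at all.
 * kernelcore and movablecore are page counts; a non-zero percentage
 * overrides the matching page count.
 */
static unsigned long toy_required_kernelcore(unsigned long totalpages,
					     unsigned long kernelcore,
					     unsigned long movablecore,
					     unsigned int kernelcore_percent,
					     unsigned int movablecore_percent)
{
	assert(kernelcore_percent <= 100 && movablecore_percent <= 100);

	if (kernelcore_percent)
		kernelcore = (totalpages * 100 * kernelcore_percent) / 10000UL;
	if (movablecore_percent)
		movablecore = (totalpages * 100 * movablecore_percent) / 10000UL;

	if (movablecore) {
		unsigned long corepages;

		/* ZONE_MOVABLE must be at least as large as requested. */
		movablecore = toy_roundup(movablecore, TOY_MAX_ORDER_NR_PAGES);
		movablecore = toy_min(totalpages, movablecore);
		corepages = totalpages - movablecore;

		/* An explicit kernelcore wins if it asks for more. */
		kernelcore = toy_max(kernelcore, corepages);
	}

	if (!kernelcore || kernelcore >= totalpages)
		return 0;
	return kernelcore;
}
```

With 1000000 total pages, movablecore=25% is rounded up from 250000 to 250880 pages, which leaves 749120 pages of kernelcore. If kernelcore is additionally set to 800000 pages, the larger value, 800000, is used.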

Related cmdline startup parameters

The cmdline is the set of boot parameters passed to the kernel at startup, so that the kernel can be configured as needed. The relevant parameters are documented in Documentation/admin-guide/kernel-parameters.txt.

movable_node

Makes all the physical memory of hotpluggable nodes movable, so that it can be hot-removed:

[KNL] Boot-time switch to make hotplugable memory NUMA nodes to be movable. This means that the memory of such nodes will be usable only for movable allocations which rules out almost all kernel allocations. Use with caution!

 kernelcore

Configures the amount of memory kept for non-movable allocations. The value can be given as an absolute size, as a percentage (nn%), or as "mirror".

Format: nn[KMGTPE] | nn% | "mirror"

This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested amount is spread evenly throughout all nodes in the system as ZONE_NORMAL. The remaining memory is used for movable memory in its own zone, ZONE_MOVABLE. In the event, a node is too small to have both ZONE_NORMAL and ZONE_MOVABLE, kernelcore memory will take priority and other nodes will have a larger ZONE_MOVABLE.

ZONE_MOVABLE is used for the allocation of pages that may be reclaimed or moved by the page migration subsystem. Note that allocations like PTEs-from-HighMem still use the HighMem zone if it exists, and the Normal zone if it does not.

It is possible to specify the exact amount of memory in the form of "nn[KMGTPE]", a percentage of total system memory in the form of "nn%", or "mirror". If "mirror" option is specified, mirrored (reliable) memory is used for non-movable allocations and remaining memory is used for Movable pages. "nn[KMGTPE]", "nn%", and "mirror" are exclusive, so you cannot specify multiple forms.
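The three mutually exclusive forms of the value can be told apart with a few lines of string handling. The sketch below is a user-space illustration only and is not the kernel's own parser: toy_parse_core() and its types are invented for this example, overflow of very large values is not checked, and the suffixes are treated as powers of 1024.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum toy_core_kind {
	TOY_CORE_INVALID,
	TOY_CORE_BYTES,		/* nn[KMGTPE] */
	TOY_CORE_PERCENT,	/* nn% */
	TOY_CORE_MIRROR		/* "mirror" */
};

struct toy_core_arg {
	enum toy_core_kind kind;
	unsigned long long value;	/* bytes or percent; unused for mirror */
};

/* Classify one kernelcore= value into exactly one of the three forms. */
static struct toy_core_arg toy_parse_core(const char *s)
{
	struct toy_core_arg out = { TOY_CORE_INVALID, 0 };
	const char *units = "KMGTPE";
	char *end;
	unsigned long long v;

	assert(s);
	if (strcmp(s, "mirror") == 0) {
		out.kind = TOY_CORE_MIRROR;
		return out;
	}

	v = strtoull(s, &end, 0);
	if (end == s)
		return out;	/* no leading number */

	if (end[0] == '%' && end[1] == '\0') {
		out.kind = TOY_CORE_PERCENT;
		out.value = v;
		return out;
	}

	if (*end != '\0') {
		/* K is 2^10, M is 2^20, and so on up to E at 2^60. */
		const char *u = strchr(units, toupper((unsigned char)*end));

		if (!u)
			return out;
		v <<= 10 * (unsigned int)(u - units + 1);
		end++;
	}
	if (*end != '\0')
		return out;	/* trailing junk after the suffix */

	out.kind = TOY_CORE_BYTES;
	out.value = v;
	return out;
}
```

Because the function returns a single kind per value, a string can never be interpreted as two forms at once, which matches the rule that the forms cannot be combined.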

 movablecore

Configures the amount of memory available for movable allocations, as an absolute size or as a percentage.

Format: nn[KMGTPE] | nn%

This parameter is the complement to kernelcore=, it specifies the amount of memory used for migratable allocations. If both kernelcore and movablecore is specified, then kernelcore will be at *least* the specified value but may be more. If movablecore on its own is specified, the administrator must be careful that the amount of memory usable for all allocations is not too small.

reference material

https://lwn.net/Articles/152462/

https://lwn.net/Articles/224829/

https://lwn.net/Articles/843326/

Keywords: Linux kernel

Added by volka on Tue, 18 Jan 2022 15:35:23 +0200