Linux 0.11: analysis of the execve system call process


This article is based on Linux 0.11. The source code can be downloaded from oldlinux.org.

execve function introduction

execve is the function used to run a user program (a.out) or a shell script, and it is one of the most commonly used system calls in Linux programming. Running a user program from the Linux command line ultimately comes down to an execve system call.

execve essence

In execve.c, execve is defined as _syscall3(int,execve,const char *,file,char **,argv,char **,envp), where _syscall3() is a macro. It expands as follows:

int execve(const char * file,char ** argv,char ** envp)
{
long __res;
__asm__ volatile ("int $0x80"
	: "=a" (__res)
	: "0" (__NR_execve),"b" ((long)(file)),"c" ((long)(argv)),"d" ((long)(envp)));
if (__res>=0)
	return (int) __res;
errno=-__res;
return -1;
}

You can see that execve is essentially the system call interrupt int 0x80 (a software interrupt). The system call number __NR_execve is passed in eax, and the arguments file, argv and envp are passed in the ebx, ecx and edx registers respectively.
Note: __NR_execve is defined in unistd.h with the value 11. It is the index into sys_call_table, used to find the corresponding system call function sys_execve.
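
The lookup by system call number can be sketched in user space. This is a minimal simulation rather than kernel code: demo_int80, demo_table and demo_sys_execve are hypothetical names, and the table size of 12 is an assumption chosen only so that index 11 is the last valid entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel names; this is not the real kernel code. */
#define NR_EXECVE_DEMO   11   /* same value as __NR_execve in unistd.h */
#define NR_SYSCALLS_DEMO 12   /* assumed table size for this demo */

typedef long (*syscall_fn)(long ebx, long ecx, long edx);

/* Demo handler: only proves that the three "registers" arrived in order. */
static long demo_sys_execve(long file, long argv, long envp)
{
    return file + 10 * argv + 100 * envp;
}

/* Table indexed by system call number; unused slots stay NULL. */
static syscall_fn demo_table[NR_SYSCALLS_DEMO] = {
    [NR_EXECVE_DEMO] = demo_sys_execve,
};

/* Mimics "int $0x80": eax selects the handler, ebx/ecx/edx carry arguments. */
static long demo_int80(unsigned long eax, long ebx, long ecx, long edx)
{
    if (eax > NR_SYSCALLS_DEMO - 1)     /* out of range: bad system call */
        return -1;
    assert(eax < NR_SYSCALLS_DEMO);     /* invariant after the range check */
    if (demo_table[eax] == NULL)        /* no handler registered in the demo */
        return -1;
    return demo_table[eax](ebx, ecx, edx);
}
```

Calling demo_int80(11, file, argv, envp) reaches demo_sys_execve, while any number above 11 is rejected before the table is touched.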

execve system call process

After int 0x80 executes, the CPU jumps to _system_call, which begins as follows:

	cmpl $nr_system_calls-1,%eax #Compare the system call number with the largest valid number
	ja bad_sys_call #If it is out of range, jump to bad_sys_call (invalid system call number)
	push %ds #Save the user's ds on the stack
	push %es #Save the user's es on the stack
	push %fs #Save the user's fs on the stack
	pushl %edx
	pushl %ecx
	pushl %ebx		# Push edx (envp), ecx (argv) and ebx (file) as the arguments of the C function
	movl $0x10,%edx		# ds and es point to the kernel data segment
	mov %dx,%ds
	mov %dx,%es
	movl $0x17,%edx		# fs points to the user data segment (the bridge between the kernel and user space)
	mov %dx,%fs
	call _sys_call_table(,%eax,4) #Call entry number __NR_execve of sys_call_table, namely sys_execve

After checking that the system call number is valid, _system_call saves the user segment registers and the arguments on the stack. Using eax (here __NR_execve) as the index, it calls sys_execve through the system call table sys_call_table. sys_execve is also an assembly routine:

_sys_execve:
	lea EIP(%esp),%eax #Compute the address of the stack slot that holds the system call's return address
	pushl %eax #Push that address as the first argument of do_execve
	call _do_execve #Call the do_execve function
	addl $4,%esp #Discard the pushed address
	ret

The key operation here is pushing the address of the stack slot that holds the system call's return address (see the PTR pointer in the figure below). Note that this is an address within the kernel stack, not the return address itself (the address of the instruction after int 0x80). The kernel stack at this point looks as follows:

Yellow part: the values the CPU pushes automatically when the system call traps into kernel mode. Because SS, ESP and CS are replaced by the kernel stack segment, kernel stack pointer and kernel code segment, the CPU saves the user SS, ESP, EFLAGS, CS and the interrupt return address EIP (the system call return address).
Blue part: the data pushed by the code from _system_call onwards. The last five values pushed are the arguments that do_execve will receive.

With the stack layout in mind, we move on to the C function do_execve, shown below (this article skips the shell script handling and looks only at how a binary executable is run):

int do_execve(unsigned long * eip,long tmp,char * filename,
	char ** argv, char ** envp)//eip: address of the saved return address on the kernel stack; tmp: return address of the call to sys_execve (unused)
{
	struct m_inode * inode;
	struct buffer_head * bh;
	struct exec ex;
	unsigned long page[MAX_ARG_PAGES];//Physical page addresses; up to 32 physical pages can hold arguments and environment variables
	int i,argc,envc;
	int e_uid, e_gid;
	int retval;
	int sh_bang = 0;//Used for shell scripts; ignored in this article
	unsigned long p=PAGE_SIZE*MAX_ARG_PAGES-4;//Offset of the last 4 bytes of the 32-page argument area

	if ((0xffff & eip[1]) != 0x000f)//eip points to the saved EIP in the yellow area of the stack diagram, so eip[1] is the saved user code segment CS
		panic("execve called from supervisor mode");//If CS is not the user code segment, the call came from kernel mode, which is not allowed
	for (i=0 ; i<MAX_ARG_PAGES ; i++)	//Clear the array of 32 physical page addresses
		page[i]=0;
	if (!(inode=namei(filename))) //Get the inode of the executable file
		return -ENOENT;//Return an error if it cannot be found
	argc = count(argv);//Count the arguments
	envc = count(envp);//Count the environment variables

	if (!S_ISREG(inode->i_mode)) {	//The executable must be a regular file, otherwise fail
		retval = -EACCES;
		goto exec_error2;
	}
	i = inode->i_mode;
	e_uid = (i & S_ISUID) ? inode->i_uid : current->euid;//If S_ISUID is set, the effective uid becomes the file owner's uid; otherwise keep the current process's euid
	e_gid = (i & S_ISGID) ? inode->i_gid : current->egid;//If S_ISGID is set, the effective gid becomes the file's gid; otherwise keep the current process's egid
	if (current->euid == inode->i_uid)//The process's euid matches the file owner: use the owner permission bits
		i >>= 6;
	else if (current->egid == inode->i_gid)//The process's egid matches the file group: use the group permission bits
		i >>= 3;
	if (!(i & 1) &&
	    !((inode->i_mode & 0111) && suser())) {//Fail unless the selected execute bit is set, or the caller is the superuser and some execute bit is set
		retval = -ENOEXEC;//No permission to execute: return ENOEXEC
		goto exec_error2;//Jump to error handling
	}
	if (!(bh = bread(inode->i_dev,inode->i_zone[0]))) {//Read the first block of the file: inode->i_zone[0] is the block number and inode->i_dev selects the device
		retval = -EACCES;//If the block cannot be read, return EACCES
		goto exec_error2;//Jump to error handling
	}
	ex = *((struct exec *) bh->b_data);	//Copy the exec header
	if ((bh->b_data[0] == '#') && (bh->b_data[1] == '!') && (!sh_bang)) {//A file starting with #! may be a shell script
		......//Shell script handling, not covered in this article
	}
	brelse(bh);//Release the buffer: the exec header has been copied, and it is the only valid data in the first block
	if (N_MAGIC(ex) != ZMAGIC || ex.a_trsize || ex.a_drsize ||
		ex.a_text+ex.a_data+ex.a_bss>0x3000000 ||
		inode->i_size < ex.a_text+ex.a_data+ex.a_syms+N_TXTOFF(ex)) {//The magic number must be ZMAGIC, a_trsize and a_drsize must be zero, code+data+bss must not exceed 48MB, and i_size must not be less than code + data + symbol table + header size
		retval = -ENOEXEC;//Otherwise return ENOEXEC
		goto exec_error2;
	}
	if (N_TXTOFF(ex) != BLOCK_SIZE) {//The exec header must occupy exactly BLOCK_SIZE bytes
		printk("%s: N_TXTOFF != BLOCK_SIZE. See a.out.h.", filename);
		retval = -ENOEXEC;
		goto exec_error2;
	}
	if (!sh_bang) {//Not a shell script
		p = copy_strings(envc,envp,page,p,0);//Copy the environment variables into the argument pages (allocating physical pages as needed)
		p = copy_strings(argc,argv,page,p,0);//Copy the arguments into the argument pages (allocating physical pages as needed)
		if (!p) {
			retval = -ENOMEM;
			goto exec_error2;
		}
	}
	if (current->executable)//If the current process already has an executable
		iput(current->executable);//release the inode of the old executable
	current->executable = inode;//Record the inode of the new executable
	for (i=0 ; i<32 ; i++)//Clear all signal handlers
		current->sigaction[i].sa_handler = NULL;
	for (i=0 ; i<NR_OPEN ; i++)//Close the files marked close-on-exec
		if ((current->close_on_exec>>i)&1)
			sys_close(i);
	current->close_on_exec = 0;
	free_page_tables(get_base(current->ldt[1]),get_limit(0x0f));//Release the page tables of the old code segment
	free_page_tables(get_base(current->ldt[2]),get_limit(0x17));//Release the page tables of the old data segment
	if (last_task_used_math == current)
		last_task_used_math = NULL;
	current->used_math = 0;
	p += change_ldt(ex.a_text,page)-MAX_ARG_PAGES*PAGE_SIZE;//Update the LDT and convert p from an offset in the 128KB argument area to an address in the 64MB data segment
	p = (unsigned long) create_tables((char *)p,argc,envc);//Build the argument and environment pointer tables
	current->brk = ex.a_bss +
		(current->end_data = ex.a_data +
		(current->end_code = ex.a_text));//Record the end of the code, the end of the data and the end of the bss
	current->start_stack = p & 0xfffff000;//Record the page that contains the stack pointer
	current->euid = e_uid;
	current->egid = e_gid;
	i = ex.a_text+ex.a_data;
	while (i&0xfff)//If the end of the data is not 4KB aligned, zero the bytes from the end of the data to the end of that page
		put_fs_byte(0,(char *) (i++));//Note: the page tables were freed above, so a write here would trigger a page fault; a_text+a_data is normally 4KB aligned, so this loop is rarely entered
	eip[0] = ex.a_entry;		//Replace the saved EIP (yellow part of the stack diagram) with the entry address of the user program
	eip[3] = p;			//Replace the saved user ESP (yellow part of the stack diagram) with p
	return 0;//Success
exec_error2:
	iput(inode);//Error path: release the inode
exec_error1:
	for (i=0 ; i<MAX_ARG_PAGES ; i++)
		free_page(page[i]);//Free any argument pages that were allocated
	return(retval);
}
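
The permission test in the middle of do_execve is compact, so here is a minimal user-space sketch of the same logic. demo_may_exec and its parameter names are hypothetical; the is_super flag stands in for the kernel's suser() check.

```c
#include <assert.h>

/* Hypothetical helper mirroring the execute-permission test in do_execve.
 * mode is the file's i_mode; returns 1 if execution is allowed, else 0. */
static int demo_may_exec(unsigned mode, int file_uid, int file_gid,
                         int euid, int egid, int is_super)
{
    unsigned bits = mode;

    assert(is_super == 0 || is_super == 1);
    if (euid == file_uid)
        bits >>= 6;          /* owner rwx bits are now the lowest three */
    else if (egid == file_gid)
        bits >>= 3;          /* group rwx bits are now the lowest three */
    /* otherwise the "other" rwx bits are already the lowest three */

    if (bits & 1)            /* execute bit of the selected class */
        return 1;
    /* the superuser may run the file if any class has an execute bit */
    return ((mode & 0111) != 0) && is_super;
}
```

Only one permission class is ever consulted: an owner whose own execute bit is clear is refused even if "other" users may execute the file, unless the caller is the superuser.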

If the code above is unclear, read on. The key steps are analysed one by one below:

  1. First, filename names an executable file. Before it can run, its data (a.out) must be read from the disk. To locate it on the disk we need its inode, so
    inode=namei(filename) looks up the inode of filename (walking the path from the root directory inode or the current directory inode).

  2. With the inode of the executable in hand, the block device read function bh = bread(inode->i_dev, inode->i_zone[0]) reads the file's first block, where i_dev is the device number (which block device to read from) and i_zone[0] is the logical block number (which block of that device to read). After the first block (1KB) has been read, ex = *((struct exec *) bh->b_data) copies out the exec header. The exec header is the only valid data in the first block and records basic information about the executable. The structure struct exec is as follows:

    struct exec {
      unsigned long a_magic;	/* Use macros N_MAGIC, etc for access */
      unsigned a_text;		/* length of text, in bytes */
      unsigned a_data;		/* length of data, in bytes */
      unsigned a_bss;		/* length of uninitialized data area for file, in bytes */
      unsigned a_syms;		/* length of symbol table data in file, in bytes */
      unsigned a_entry;		/* start address */
      unsigned a_trsize;		/* length of relocation info for text, in bytes */
      unsigned a_drsize;		/* length of relocation info for data, in bytes */
    };

    As the structure shows, the exec header records the binary's code length, data length, bss length, symbol table length, entry address, and the sizes of the code and data relocation information (not used here).
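
The header checks that do_execve applies can be collected into one small function. This is a sketch under stated assumptions: struct demo_exec uses fixed 32-bit fields, N_MAGIC is taken to be the plain a_magic field, ZMAGIC is taken as octal 0413 and N_TXTOFF as 1024 for ZMAGIC files. All demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Demo copy of the a.out header with fixed 32-bit fields. */
struct demo_exec {
    uint32_t a_magic, a_text, a_data, a_bss;
    uint32_t a_syms, a_entry, a_trsize, a_drsize;
};

#define DEMO_ZMAGIC    0413u       /* assumed value of ZMAGIC */
#define DEMO_TXTOFF    1024u       /* assumed N_TXTOFF: one 1KB block */
#define DEMO_MAX_IMAGE 0x3000000u  /* 48MB limit used by do_execve */

/* Returns 1 if the header passes the same tests do_execve applies. */
static int demo_header_ok(const struct demo_exec *ex, uint32_t file_size)
{
    assert(ex != 0);
    if (ex->a_magic != DEMO_ZMAGIC)
        return 0;                               /* not demand-paged a.out */
    if (ex->a_trsize || ex->a_drsize)
        return 0;                               /* relocation not supported */
    if (ex->a_text + ex->a_data + ex->a_bss > DEMO_MAX_IMAGE)
        return 0;                               /* image larger than 48MB */
    if (file_size < ex->a_text + ex->a_data + ex->a_syms + DEMO_TXTOFF)
        return 0;                               /* file is truncated */
    return 1;
}
```

A header with 4KB of text and 4KB of data therefore needs a file of at least 1024 + 8192 bytes to be accepted.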

  3. With the basic data of the executable available, the next step is to prepare the running environment. The LDT (local descriptor table) must be modified, otherwise the segment limits may prevent the new code and data from being accessed, and all page table mappings of the current process must be removed first. free_page_tables(get_base(current->ldt[1]),get_limit(0x0f)) and free_page_tables(get_base(current->ldt[2]),get_limit(0x17)) remove the page table mappings of the code segment and the data segment respectively. change_ldt(ex.a_text, page) then modifies the LDT: the code segment and the data segment share the same base address, the code segment limit is set to the page-aligned length of ex.a_text, and the data segment limit is set to 0x4000000 so that all data of the process can be accessed. Because the address space of a process is 64MB, this limit is exactly 64MB. The code is as follows:

    static unsigned long change_ldt(unsigned long text_size,unsigned long * page)
    {
    	unsigned long code_limit,data_limit,code_base,data_base;
    	int i;

    	code_limit = text_size+PAGE_SIZE -1;//Round the code length up to whole pages
    	code_limit &= 0xFFFFF000;//Page alignment
    	data_limit = 0x4000000;//64MB
    	code_base = get_base(current->ldt[1]);//Get the base of the code segment
    	data_base = code_base;//The data segment uses the same base as the code segment
    	set_base(current->ldt[1],code_base);//Set the code segment base
    	set_limit(current->ldt[1],code_limit);//Set the code segment limit
    	set_base(current->ldt[2],data_base);//Set the data segment base
    	set_limit(current->ldt[2],data_limit);//Set the data segment limit
    /* make sure fs points to the NEW data segment */
    	__asm__("pushl $0x17\n\tpop %%fs"::);//Load fs with 0x17, the user data segment selector
    	data_base += data_limit;//Point to the end of the data segment
    	for (i=MAX_ARG_PAGES-1 ; i>=0 ; i--) {//Map the argument pages: enter each physical page that holds arguments into the page table
    		data_base -= PAGE_SIZE;
    		if (page[i])
    			put_page(page[i],data_base);
    	}
    	return data_limit;
    }
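
The code_limit arithmetic rounds the text size up to a whole number of 4KB pages. A minimal sketch of that rounding, with the hypothetical name demo_code_limit:

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL

/* Round a byte count up to a whole number of 4KB pages, the same
 * arithmetic change_ldt uses to compute code_limit. */
static unsigned long demo_code_limit(unsigned long text_size)
{
    unsigned long limit = text_size + DEMO_PAGE_SIZE - 1;

    limit &= ~(DEMO_PAGE_SIZE - 1);   /* same mask as 0xFFFFF000 on 32 bits */
    assert(limit % DEMO_PAGE_SIZE == 0);
    return limit;
}
```

A text size that is already page aligned is returned unchanged, and one extra byte costs a full additional page.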

  4. With the environment prepared, the next question is how the argv arguments and the envp environment variables are passed to the user program. The system reserves 32 pages (4KB*32) for arguments and environment variables, and p is the index into this space (it works like a stack pointer and grows downward). p initially points to the last 4 bytes of the 32-page space (unsigned long p=PAGE_SIZE*MAX_ARG_PAGES-4). copy_strings(envc,envp,page,p,0) and copy_strings(argc,argv,page,p,0) push the environment variables and then the arguments into the 32-page space from top to bottom, like a stack. A physical page is allocated the first time it is needed. The code is as follows:

    static unsigned long copy_strings(int argc,char ** argv,unsigned long *page,
    		unsigned long p, int from_kmem)
    {
    	char *tmp, *pag;
    	int len, offset = 0;
    	unsigned long old_fs, new_fs;

    	if (!p)
    		return 0;	/* bullet-proofing */
    	new_fs = get_ds();//Save the kernel data segment
    	old_fs = get_fs();//Save the user data segment; unused here because from_kmem == 0 (all data comes from user space)
    	if (from_kmem==2)//from_kmem is 0, skipped
    		set_fs(new_fs);
    	while (argc-- > 0) {//argc is the number of strings
    		if (from_kmem == 1)//from_kmem is 0, skipped
    			set_fs(new_fs);
    		if (!(tmp = (char *)get_fs_long(((unsigned long *)argv)+argc)))//Fetch the pointer to string number argc; a NULL pointer means argc is wrong, so panic
    			panic("argc is wrong");
    		if (from_kmem == 1)//from_kmem is 0, skipped
    			set_fs(old_fs);
    		len=0;		/* remember zero-padding */
    		do {
    			len++;
    		} while (get_fs_byte(tmp++));//Compute the string length, including the terminating '\0'
    		if (p-len < 0) {	 //Check whether the 32-page space (32*4KB=128KB) is too small for the new string
    			set_fs(old_fs);
    			return 0;
    		}
    		while (len) {//Copy the string byte by byte, from its end to its start
    			--p; --tmp; --len;//p moves down to a free byte, tmp moves to the byte to copy, len is the remaining length
    			if (--offset < 0) {//The offset within the page has dropped below 0: a new page is entered
    				offset = p % PAGE_SIZE;//Recompute the offset within the page
    				if (from_kmem==2)//from_kmem is 0, skipped
    					set_fs(old_fs);
    				if (!(pag = (char *) page[p/PAGE_SIZE]) &&
    				    !(pag = (char *) page[p/PAGE_SIZE] =
    				      (unsigned long *) get_free_page())) //Allocate the page if it does not exist yet
    					return 0;
    				if (from_kmem==2)//from_kmem is 0, skipped
    					set_fs(new_fs);
    			}
    			*(pag + offset) = get_fs_byte(tmp);//Copy one byte from user space into the physical page
    		}
    	}
    	if (from_kmem==2)//from_kmem is 0, skipped
    		set_fs(old_fs);
    	return p;
    }

    From here on, arguments and environment variables are referred to collectively as parameters. As the code shows, each parameter is copied from user space into the 32-page parameter space at the position indicated by p (from top to bottom). A parameter is simply a string, and the '\0' at the end of each string is copied as well, so it separates one parameter from the next. After copying, the logical layout of the parameter pages is shown in the following figure (assuming two arguments and two environment variables):
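
Since the figure is not reproduced here, the resulting layout can be illustrated with a small user-space simulation. It is a simplification rather than the kernel routine: one flat 128KB buffer (demo_arena) replaces the 32 lazily allocated pages, strings come from ordinary pointers instead of the fs segment, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL
#define DEMO_ARG_PAGES 32UL
#define DEMO_ARENA     (DEMO_PAGE_SIZE * DEMO_ARG_PAGES)   /* 128KB */

/* One flat buffer stands in for the 32 lazily allocated pages. */
static char demo_arena[DEMO_ARENA];

/* Push n strings below offset p, last string first, each with its '\0'.
 * Returns the new p, or 0 if the strings do not fit. */
static unsigned long demo_copy_strings(int n, const char *const *strs,
                                       unsigned long p)
{
    assert(p <= DEMO_ARENA);
    while (n-- > 0) {
        unsigned long len = strlen(strs[n]) + 1;   /* include the '\0' */

        if (len > p)
            return 0;                              /* out of space */
        p -= len;
        memcpy(demo_arena + p, strs[n], len);
    }
    return p;
}
```

Copying the environment strings first and the argument strings second leaves them in memory, from p upward, in the order argv[0], argv[1], envp[0], envp[1], which is the order the pointer tables are later built in.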

  5. Map the physical pages occupied by the parameters into the address space of the process. This page table mapping is done in the second half of change_ldt (the first half was explained above and is omitted here). Starting from the last of the 32 reserved parameter pages, each page that actually holds parameters is mapped with put_page() at the top of the 64MB space (the logical address space of each process in Linux 0.11 is 64MB). The code is as follows:

     static unsigned long change_ldt(unsigned long text_size,unsigned long * page)
     {
     	unsigned long code_limit,data_limit,code_base,data_base;
     	int i;
     	.....//Other code omitted
     	data_base += data_limit;//Point to the end of the data segment
     	for (i=MAX_ARG_PAGES-1 ; i>=0 ; i--) {//Map the parameter pages: only the pages that were actually used are mapped
     		data_base -= PAGE_SIZE;//Move down to the next logical page
     		if (page[i])//The physical page exists (it holds arguments or environment variables)
     			put_page(page[i],data_base);//Map the physical page at this logical page
     	}
     	return data_limit;
     }
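
The addresses involved can be worked out directly. The sketch below uses logical addresses inside the 64MB data segment (the kernel loop works with linear addresses that also include the segment base); demo_arg_page_addr and demo_rebase_p are hypothetical names.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE  4096UL
#define DEMO_ARG_PAGES  32UL
#define DEMO_DATA_LIMIT 0x4000000UL   /* 64MB, the value change_ldt returns */

/* Logical address at which parameter page i (0..31) is mapped. */
static unsigned long demo_arg_page_addr(unsigned long i)
{
    assert(i < DEMO_ARG_PAGES);
    return DEMO_DATA_LIMIT - (DEMO_ARG_PAGES - i) * DEMO_PAGE_SIZE;
}

/* Convert p from an offset inside the 128KB parameter area to a logical
 * address, as "p += change_ldt(...) - MAX_ARG_PAGES*PAGE_SIZE" does. */
static unsigned long demo_rebase_p(unsigned long p)
{
    assert(p < DEMO_ARG_PAGES * DEMO_PAGE_SIZE);
    return p + DEMO_DATA_LIMIT - DEMO_ARG_PAGES * DEMO_PAGE_SIZE;
}
```

Page 31 lands at 0x3FFF000, just below the 64MB limit, and the initial p of 0x1FFFC becomes 0x3FFFFFC, which lies inside that page.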

  6. Build the argument and environment pointer tables. So far the parameters have only been copied and mapped. In that form they are inconvenient to use, and nothing marks the boundary between arguments and environment variables (a string read at p could be either). create_tables((char *)p,argc,envc) therefore builds the pointer tables:

    static unsigned long * create_tables(char * p,int argc,int envc)
    {
    	unsigned long *argv,*envp;
    	unsigned long * sp;

    	sp = (unsigned long *) (0xfffffffc & (unsigned long) p);//Align to 4 bytes
    	sp -= envc+1;//Reserve envc + 1 pointer slots for the environment table
    	envp = sp;//Start of the environment pointer table
    	sp -= argc+1;//Reserve argc + 1 pointer slots for the argument table
    	argv = sp;//Start of the argument pointer table
    	put_fs_long((unsigned long)envp,--sp);//Store the address of the environment pointer table
    	put_fs_long((unsigned long)argv,--sp);//Store the address of the argument pointer table
    	put_fs_long((unsigned long)argc,--sp);//Store the number of arguments
    	while (argc-->0) {
    		put_fs_long((unsigned long) p,argv++);//Store the address of each argument string in the argument table
    		while (get_fs_byte(p++)) /* nothing */ ;//Advance to the next string (strings are separated by '\0')
    	}
    	put_fs_long(0,argv);//The table must end with NULL
    	while (envc-->0) {//Do the same for the environment variables
    		put_fs_long((unsigned long) p,envp++);
    		while (get_fs_byte(p++)) /* nothing */ ;
    	}
    	put_fs_long(0,envp);//The table must end with NULL
    	return sp;
    }

    After the pointer tables are built, the 32-page parameter space looks as follows (assuming only 2 arguments and 2 environment variables):

As the figure above shows, two pointer arrays (pointer tables) are formed. Written in C, they are char *arg0_ptr_ptr[] = {arg0_ptr, arg1_ptr, NULL} and char *env0_ptr_ptr[] = {env0_ptr, env1_ptr, NULL}. The argc and argv arguments received by main() in the executable a.out are then the int argc and char **arg0_ptr_ptr in the figure.
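
The table construction can be tried out in user space. This sketch assumes the strings already lie back to back starting at p, inside a pointer-aligned buffer with free space below them; it uses native pointers instead of put_fs_long, so the slot size is sizeof(char *) rather than 4 bytes. demo_create_tables is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build NULL-terminated argv/envp tables below the strings starting at p.
 * Returns the final "stack pointer": sp[0] = argc, sp[1] = argv, sp[2] = envp. */
static char **demo_create_tables(char *p, int argc, int envc)
{
    char **sp = (char **)((uintptr_t)p & ~(uintptr_t)(sizeof(char *) - 1));
    char **argv, **envp;

    assert(argc >= 0 && envc >= 0);
    sp -= envc + 1;                 /* room for envp[] plus its NULL */
    envp = sp;
    sp -= argc + 1;                 /* room for argv[] plus its NULL */
    argv = sp;
    *--sp = (char *)envp;
    *--sp = (char *)argv;
    *--sp = (char *)(intptr_t)argc;
    while (argc-- > 0) {
        *argv++ = p;
        p += strlen(p) + 1;         /* skip to the next string */
    }
    *argv = NULL;
    while (envc-- > 0) {
        *envp++ = p;
        p += strlen(p) + 1;
    }
    *envp = NULL;
    return sp;
}
```

With two arguments and two environment variables the routine consumes nine pointer slots below the strings: three for each table and three for argc, argv and envp.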

  7. Modify the return address and the user stack. The return address of the system call is changed to the entry address of the executable (0 in Linux 0.11) by the two statements eip[0] = ex.a_entry; and eip[3] = p;. That is all it takes: the return address of the system call (interrupt) becomes the entry address of the user program, and the user stack pointer becomes p. The stack at this point is shown in the figure below (eip is the PTR in the figure, so it is easy to work out which stack entries eip[0] and eip[3] modify):

    As the figure shows, the modified values are in the red boxes; compare them with the unmodified stack diagram above. When the system call returns, the CPU instruction pointer is ex.a_entry and the stack pointer is p.
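
The effect of the two assignments can be modelled with a five-word array standing in for the frame the CPU pushed. This is a sketch: demo_patch_frame and the DEMO_ index names are hypothetical, and where the kernel calls panic() the sketch simply returns -1.

```c
#include <assert.h>

/* Indices relative to the pointer that sys_execve passes as "eip".
 * These five words are what the CPU pushed on entry through int 0x80. */
enum { DEMO_EIP = 0, DEMO_CS = 1, DEMO_EFLAGS = 2, DEMO_ESP = 3, DEMO_SS = 4 };

/* Rewrite the saved return frame so that the return from the system call
 * lands in the new program. Returns 0 on success, -1 if the saved CS is
 * not the user code selector 0x0f. */
static int demo_patch_frame(unsigned long *eip, unsigned long entry,
                            unsigned long user_sp)
{
    assert(eip != 0);
    if ((eip[DEMO_CS] & 0xffff) != 0x000f)
        return -1;               /* called from kernel mode */
    eip[DEMO_EIP] = entry;       /* execution resumes at the entry point */
    eip[DEMO_ESP] = user_sp;     /* with the new user stack pointer */
    return 0;
}
```

Only the saved EIP and ESP change; CS, EFLAGS and SS are left as the CPU saved them.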

  8. At this point the execve system call is essentially complete. A natural question arises: the page tables of the current process map only the parameter pages, and the code has been neither copied nor mapped, so will the return from the system call (interrupt) not cause an exception? It will. When execution reaches an instruction or data item of a.out that has not been copied into physical memory, a page fault occurs (the present bit of the page table entry is 0, meaning the page is unmapped). The page fault handler copies the relevant part of a.out into a newly allocated physical page and then fills in the page table to map the logical address to that physical page. After the exception has been handled, execution returns to the faulting address and the instruction is executed again. So when a user program runs (including the games you play), its code is not necessarily all in memory. Each part is copied into memory only when it is first accessed, one 4KB page at a time.
    Reference article: Linux 0.11 system exceptions: the page fault exception
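
Loading on first access can be imitated with a toy model. This is only an illustration of the idea, not the kernel's page fault handler: an in-memory image stands in for the a.out file, an array of flags stands in for the present bits, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE   4096u
#define DEMO_NPAGES 4u

static const unsigned char *demo_image;              /* stands in for a.out */
static unsigned char demo_ram[DEMO_NPAGES][DEMO_PAGE];
static int demo_present[DEMO_NPAGES];                /* "present" bit per page */
static unsigned demo_faults;                         /* page faults so far */

/* "exec": remember the image but map nothing yet. */
static void demo_exec(const unsigned char *image)
{
    demo_image = image;
    memset(demo_present, 0, sizeof demo_present);
    demo_faults = 0;
}

/* Read one byte; load the containing page on first touch. */
static unsigned char demo_read(unsigned addr)
{
    unsigned page = addr / DEMO_PAGE;

    assert(page < DEMO_NPAGES && demo_image != 0);
    if (!demo_present[page]) {                       /* simulated page fault */
        memcpy(demo_ram[page], demo_image + page * DEMO_PAGE, DEMO_PAGE);
        demo_present[page] = 1;
        demo_faults++;
    }
    return demo_ram[page][addr % DEMO_PAGE];
}
```

Two reads within the same page cost one fault, and a page that is never touched is never copied.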

  9. Finally, look at how the 64MB logical address space of the current process is laid out, as shown in the following figure:

    The figure shows clearly that the arguments and environment variables sit at the top of the 64MB space, with p immediately below them. In linear address space the code segment starts at nr*0x04000000 (nr is the index of the task structure). The program itself still addresses its code from 0, because in protected mode the CPU adds the segment base to every address (the segment base is stored in the LDT and is nr*0x04000000 here). This is why applications are compiled to start at address 0, and why ex.a_entry is also 0. The blue area in the figure does not exist yet; it appears only once the application runs. The stack and the heap grow toward each other, but the space is large enough that they do not meet.
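
The translation from a logical address to a linear address is a single addition. A minimal sketch with the hypothetical name demo_linear:

```c
#include <assert.h>

#define DEMO_TASK_SPAN 0x4000000UL   /* 64MB of linear space per task */

/* Linear address = segment base (nr * 64MB) + logical address. */
static unsigned long demo_linear(unsigned nr, unsigned long logical)
{
    assert(logical < DEMO_TASK_SPAN);   /* must stay within the segment limit */
    return nr * DEMO_TASK_SPAN + logical;
}
```

For task 3, the initial stack address 0x3FFFFFC inside the segment corresponds to linear address 0xFFFFFFC, just below the start of task 4's region.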


In summary, the execve system call lays out the address space of the current process afresh: it sets up a suitable space for the executable file and places the parameters at the top of the process's 64MB space, so that the application is ready to run.

Added by Lahloob on Tue, 08 Feb 2022 10:49:25 +0200