Introduction to assembly language 4: connecting C and assembly language

 

Review

Last time we covered registers and memory access in assembly. Let's recap first:

  • Registers are tiny, very fast storage locations that hold the data the CPU is about to use or the results it has just produced
  • A program has far more data than the registers can hold, and adding more registers is expensive
  • Memory solves this problem: it is slower than registers, but it is relatively cheap and has a large capacity

Interlude: the relationship between C and assembly language

Some questions are probably still open, so let's address them now. When programming in C we never think about registers, so it can be hard to accept them when they suddenly appear in assembly language. To help with that, let's connect what we know about C to what we are learning about assembly.

First, let's look at a C language program:

int x, y, z;

int main() {
    x = 2;
    y = 3;
    z = x + y;
    return z;
}

Since this assembly tutorial has only just begun, I'll keep the C program as simple as possible, so that the equivalent assembly only needs knowledge we have already covered.

Save it as test01.c, then compile and run it:

(Note that gcc takes the -m32 option here, because we need to build a 32-bit (x86) executable.)

$ gcc -m32 test01.c -o test01
$ ./test01 ; echo $?
5

Here our program returns the value 5, which the shell shows as its exit status.

Now, what would we have to do to implement almost the same thing in assembly?

First, there are three global variables:

int x, y, z;

We need these first. (Global variables are used here because the assembly needed for local variables has not been introduced yet. We'll handle globals for now and come back to local variables later.)

In C, you can think of each variable as occupying a certain amount of memory. Here x, y and z each occupy one int, which is 4 bytes of storage with gcc on x86.

Last time we covered how to access memory in assembly, and we also know how to reserve space in the data section. We'll reuse that approach here:

global main

main:
    mov eax, 0
    ret

section .data

x    dd    0    ; dd reserves 4 bytes (a dword), the size of a C int
y    dd    0    ; (dw would reserve only 2 bytes)
z    dd    0

This program is equivalent to the following C code:

int x, y, z;

int main() {
    return 0;
}

That is, there are now three global variables, but the assembly program does nothing yet except return 0.

To a large extent, this C code is equivalent to the assembly code above. A C compiler could even translate the C code directly into that assembly and hand the rest of the work to nasm, which assembles it into an object file that is then linked into the final executable. Some compilers really do work this way, although the assembly they emit is often in a different syntax from nasm's. The principle is the same.

In other words, a simple C compiler only needs to translate C code into assembly code and leave the rest to the assembler in order to produce the final executable.

Let's leave that aside for now and complete the program until it is equivalent to the first C program. Next, let's focus on these statements:

x = 2;
y = 3;

That is, we put the numbers 2 and 3 into the memory belonging to x and y respectively. That's simple enough:

mov eax, 2
mov [x], eax
mov eax, 3
mov [y], eax

That is, first put 2 into the register eax, then store the contents of eax in the memory belonging to x. y is handled the same way.

OK, the next addition statement:

z = x + y;

This can be written as:

mov eax, [x]
mov ebx, [y]
add eax, ebx
mov [z], eax

This code should be easy to follow. Briefly, the idea is:

  • Load the contents of the memory belonging to x and y into eax and ebx respectively
  • Perform the addition eax = eax + ebx, which leaves the sum in eax
  • Store the contents of eax in the memory belonging to z

Finally, we have another thing to deal with, that is, the return statement:

return z;

This is easy too. By convention, the value in eax is the function's return value:

mov eax, [z]
ret

With that the whole program is finished, and we have written the complete assembly equivalent of the C code. The final code is as follows:

global main

main:
    mov eax, 2
    mov [x], eax
    mov eax, 3
    mov [y], eax
    mov eax, [x]
    mov ebx, [y]
    add eax, ebx
    mov [z], eax
    mov eax, [z]
    ret


section .data
x       dd      0       ; dd reserves 4 bytes (a dword), the size of a C int
y       dd      0
z       dd      0

Save it as test02.asm, then assemble, link and run it to see the result:

$ nasm -f elf test02.asm -o test02.o 
$ gcc -m32 test02.o -o test02
$ ./test02 ; echo $?
5

Done. The result is completely consistent with the previous C code.

Uncovering the true face of a C program

Do you think we're done now that we've imagined some equivalent assembly code? Next, let's use real tools to find out what actually happens.

First, some preparation. We have the following two files:

test01.c  test02.asm

One is the complete C code from above and the other is the complete assembly code. Build both into executables with the commands below. Afterwards the directory looks like this:

$ gcc -m32 test01.c -o test01
$ nasm -f elf test02.asm -o test02.o
$ gcc -m32 -fno-lto test02.o -o test02
$ ls
test01  test01.c  test02  test02.asm  test02.o

(Note: use exactly the build commands shown here.)

test01 is built from the C code and test02 is built from the assembly code.

Bringing out gdb

OK, next let's welcome our mighty general, gdb, onto the stage.

Let's see what our compiled C program looks like after disassembly:

$ gdb ./test01

Then enter the following commands to view the disassembly:

(gdb) set disassembly-flavor intel
(gdb) disas main
Dump of assembler code for function main:
   0x080483ed <+0>:     push   ebp
   0x080483ee <+1>:     mov    ebp,esp
   0x080483f0 <+3>:     mov    DWORD PTR ds:0x804a024,0x2
   0x080483fa <+13>:    mov    DWORD PTR ds:0x804a028,0x3
   0x08048404 <+23>:    mov    edx,DWORD PTR ds:0x804a024
   0x0804840a <+29>:    mov    eax,ds:0x804a028
   0x0804840f <+34>:    add    eax,edx
   0x08048411 <+36>:    mov    ds:0x804a020,eax
   0x08048416 <+41>:    mov    eax,ds:0x804a020
   0x0804841b <+46>:    pop    ebp
   0x0804841c <+47>:    ret
End of assembler dump.
(gdb) quit
$

OK, no rush. We quit gdb for now. Next, let's look at the disassembly of our assembly program:

$ gdb ./test02
(gdb) set disassembly-flavor intel
(gdb) disas main
Dump of assembler code for function main:
   0x080483f0 <+0>:     mov    eax,0x2
   0x080483f5 <+5>:     mov    ds:0x804a01c,eax
   0x080483fa <+10>:    mov    eax,0x3
   0x080483ff <+15>:    mov    ds:0x804a01e,eax
   0x08048404 <+20>:    mov    eax,ds:0x804a01c
   0x08048409 <+25>:    mov    ebx,DWORD PTR ds:0x804a01e
   0x0804840f <+31>:    add    eax,ebx
   0x08048411 <+33>:    mov    ds:0x804a020,eax
   0x08048416 <+38>:    mov    eax,ds:0x804a020
   0x0804841b <+43>:    ret
   0x0804841c <+44>:    xchg   ax,ax
   0x0804841e <+46>:    xchg   ax,ax
End of assembler dump.
(gdb) quit

Now we've seen both disassemblies. First, let's check whether the disassembly of test02 matches the assembly code we wrote:

   0x080483f0 <+0>:     mov    eax,0x2
   0x080483f5 <+5>:     mov    ds:0x804a01c,eax
   0x080483fa <+10>:    mov    eax,0x3
   0x080483ff <+15>:    mov    ds:0x804a01e,eax
   0x08048404 <+20>:    mov    eax,ds:0x804a01c
   0x08048409 <+25>:    mov    ebx,DWORD PTR ds:0x804a01e
   0x0804840f <+31>:    add    eax,ebx
   0x08048411 <+33>:    mov    ds:0x804a020,eax
   0x08048416 <+38>:    mov    eax,ds:0x804a020
   0x0804841b <+43>:    ret

Compared directly with the assembly we wrote, the formatting differs and some addresses and labels are hard to recognize, but we only need to recognize the instructions, not understand every detail. This is the assembly code from before:

    mov eax, 2
    mov [x], eax
    mov eax, 3
    mov [y], eax
    mov eax, [x]
    mov ebx, [y]
    add eax, ebx
    mov [z], eax
    mov eax, [z]
    ret

Count the instructions and the two listings match. Look at each instruction closely and they are essentially the same. The labels x, y and z have disappeared and been replaced by raw memory addresses (the exact values depend on your build). We won't dig into that for now.

Now let's look at the disassembly of the C program:

   0x080483ed <+0>:     push   ebp
   0x080483ee <+1>:     mov    ebp,esp
   0x080483f0 <+3>:     mov    DWORD PTR ds:0x804a024,0x2
   0x080483fa <+13>:    mov    DWORD PTR ds:0x804a028,0x3
   0x08048404 <+23>:    mov    edx,DWORD PTR ds:0x804a024
   0x0804840a <+29>:    mov    eax,ds:0x804a028
   0x0804840f <+34>:    add    eax,edx
   0x08048411 <+36>:    mov    ds:0x804a020,eax
   0x08048416 <+41>:    mov    eax,ds:0x804a020
   0x0804841b <+46>:    pop    ebp
   0x0804841c <+47>:    ret

Here, set aside the following instructions and remove them. (They do serve a purpose, but in this example they can be dropped for now. What they do will be explained later.)

push ebp
mov ebp, esp
....
pop ebp

So the disassembly of the C program becomes:

   0x080483f0 <+3>:     mov    DWORD PTR ds:0x804a024,0x2
   0x080483fa <+13>:    mov    DWORD PTR ds:0x804a028,0x3
   0x08048404 <+23>:    mov    edx,DWORD PTR ds:0x804a024
   0x0804840a <+29>:    mov    eax,ds:0x804a028
   0x0804840f <+34>:    add    eax,edx
   0x08048411 <+36>:    mov    ds:0x804a020,eax
   0x08048416 <+41>:    mov    eax,ds:0x804a020
   0x0804841c <+47>:    ret

Does it still look unclear? Let's follow the numbers 2 and 3 and the add instruction, replace the raw addresses with the labels x, y and z that we know, and look again:

   0x080483f0 <+3>:     mov    [x],0x2
   0x080483fa <+13>:    mov    [y],0x3
   0x08048404 <+23>:    mov    edx,[x]
   0x0804840a <+29>:    mov    eax,[y]
   0x0804840f <+34>:    add    eax,edx
   0x08048411 <+36>:    mov    [z],eax
   0x08048416 <+41>:    mov    eax,[z]
   0x0804841c <+47>:    ret

Compare this with the assembly code we wrote earlier. Isn't it pretty close? There are only two differences: 1. Different registers are used, in a different order, which does no harm. 2. In two places where we used two instructions, the compiled C code uses a single instruction.

Here we find that

mov eax, 2
mov [x], eax

can be reduced to a single instruction:

mov [x], 2

OK, following the hint from the C compiler, our assembly program can be simplified as follows:

global main

main:
    mov [x], 0x2
    mov [y], 0x3
    mov eax, [x]
    mov ebx, [y]
    add eax, ebx
    mov [z], eax
    mov eax, [z]
    ret


section .data
x       dd      0       ; dd reserves 4 bytes (a dword), the size of a C int
y       dd      0
z       dd      0

However, nasm rejects this version with an "operation size not specified" error: neither an immediate value nor a label tells the assembler how many bytes to store. We have to modify the first two instructions slightly:

    mov dword [x], 0x2
    mov dword [y], 0x3

Now it assembles without problems. Through this investigation we have written an assembly program equivalent to the code compiled from the C program:

global main

main:
    mov dword [x], 0x2
    mov dword [y], 0x3
    mov eax, [x]
    mov ebx, [y]
    add eax, ebx
    mov [z], eax
    mov eax, [z]
    ret

section .data
x       dd      0       ; dd reserves 4 bytes (a dword), the size of a C int
y       dd      0
z       dd      0

Summary

So far, with the help of nasm, gcc and gdb, we have implemented a simple C program in assembly language.

The key points of this section are:

  • During compilation a C program is, logically, transformed into an equivalent assembly program
  • That assembly program is turned into machine instructions by the compiler's built-in (or an external) assembler (there is also a link stage on the way to the executable, which will be covered later)
  • We can see the assembly form of a C program by disassembling it with gdb

In fact, the point of learning assembly language is not to write programs in it, but to use it to understand what high-level languages do at the lowest level, such as what a C assignment statement or a C addition expression turns into. Through assembly we can see what the program and the CPU are actually doing, and understand what a computer program looks like from the CPU's point of view.

Going forward, a very good way to learn assembly is to combine several sources of material. While practicing and understanding assembly, we also see more clearly what the various constructs in C really mean, which deepens our understanding of C.

Closing remarks

This section involves a lot of code and hands-on steps. It's best to work through it all patiently. If one day isn't enough, take two; it's worth it.

Keywords: Assembly Language

Added by Angus on Fri, 28 Jan 2022 09:51:46 +0200