[exploration of JVM principle] analysis of calling and execution process of bytecode instruction set (syntax analysis)

Article introduction

This article explains how Java code is compiled into bytecode and executed on the Java virtual machine. Understanding this process is important because it helps you understand what happens to your program at run time.
This understanding not only gives you a logical grasp of language features, it also lets you weigh their trade-offs and side effects when discussing specific design choices.

In bytecode, the number before each instruction (or opcode) indicates the position of this byte.

For example, an instruction such as 1: iconst_1 is only one byte long and has no operands, so the position of the next bytecode is 2.
For another example, such an instruction 1: bipush 5 will occupy two bytes, opcode bipush will occupy one byte, and operand 5 will occupy one byte.
Then, the position of the next bytecode is 3, because the byte occupied by the operand is at position 2.
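The offset arithmetic above can be sketched in a few lines of Java. This is only an illustration: the class name OffsetSketch and the small length table are assumptions made for the example, and only the three instructions mentioned in this article are covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class OffsetSketch {
    // Total length in bytes (opcode plus operands) of the instructions used here.
    static final Map<String, Integer> LENGTH = Map.of(
            "iconst_1", 1,   // opcode only
            "bipush", 2,     // opcode + one operand byte
            "istore_0", 1);  // opcode only

    /** Returns the starting offset of each instruction, beginning at firstOffset. */
    static int[] offsets(int firstOffset, List<String> mnemonics) {
        int[] result = new int[mnemonics.size()];
        int position = firstOffset;
        for (int i = 0; i < mnemonics.size(); i++) {
            result[i] = position;
            position += LENGTH.get(mnemonics.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [0, 2]
        System.out.println(Arrays.toString(offsets(0, List.of("bipush", "istore_0"))));
    }
}
```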

The Java virtual machine is a stack-based architecture. When a method is invoked, including the initial main method, a stack frame is created on the stack; the frame holds the local variables of that method.

variable

local variable

The local variable array contains all variables used during method execution, including a reference variable this, all method parameters and variables defined in the method body.

For class methods (static methods), the method parameters start at slot 0.
For instance methods, slot 0 holds this, so the parameters start at slot 1.

Local variable type

boolean
byte
char
long
short
int
float
double
reference
returnAddress
All types except long and double occupy a slot in the local variable array. Long and double need two consecutive slots because they are 64 bit types.
When a new variable is created, the operand stack is used to hold its value. The value is then stored in the corresponding position of the local variable array.
If the variable is not a primitive type, its slot stores a reference. This reference points to an object stored on the heap.
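The slot rules described above (slot 0 for this in instance methods, two slots for long and double) can be sketched as follows. The class SlotSketch and its assign method are hypothetical names used only for this illustration; a real compiler may also reuse slots once a variable goes out of scope, which this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SlotSketch {
    /**
     * Assigns a slot index to each variable in declaration order.
     * vars maps a variable name to its type name ("int", "long", "double", ...).
     */
    static Map<String, Integer> assign(boolean isStatic, LinkedHashMap<String, String> vars) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int next = 0;
        if (!isStatic) {
            slots.put("this", next++);  // instance methods keep this in slot 0
        }
        for (Map.Entry<String, String> e : vars.entrySet()) {
            slots.put(e.getKey(), next);
            boolean wide = e.getValue().equals("long") || e.getValue().equals("double");
            next += wide ? 2 : 1;       // 64-bit types take two consecutive slots
        }
        return slots;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> vars = new LinkedHashMap<>();
        vars.put("a", "int");
        vars.put("b", "long");
        vars.put("c", "int");
        // prints {this=0, a=1, b=2, c=4}
        System.out.println(assign(false, vars));
    }
}
```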

for example

int i = 5;

Compiled as bytecode

0: bipush 5    (two bytes)
2: istore_0

bipush

Pushes a byte to the operand stack as an integer. In this example, 5 is pushed to the operand stack.

istore_0

One of a group of opcodes of the form istore_n, each of which stores an integer in the local variable array.

n is the position in the local variable array, and can only be 0, 1, 2 or 3. Another opcode, istore, is used when the position is greater than 3; it takes an operand that gives the position in the local variable array, and is described in detail later.

[Figure: the above code being executed in memory]

Each method in the class file also contains a local variable table. If this code is included in a method, the local variable table for that method in the class file contains the following entry:

LocalVariableTable:
    Start  Length  Slot  Name   Signature
      0      1      1     i         I
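The effect of the bipush and istore_0 pair shown earlier can be traced with a toy model of a stack frame. This is a sketch under simplifying assumptions: the class StoreSketch is invented for the illustration, and the frame is reduced to an integer operand stack plus a one-element local variable array.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StoreSketch {
    /** Executes "bipush 5; istore_0" on a toy frame and returns the local variable array. */
    static int[] run() {
        Deque<Integer> operandStack = new ArrayDeque<>();
        int[] locals = new int[1];

        operandStack.push(5);            // 0: bipush 5  -> operand stack: [5]
        locals[0] = operandStack.pop();  // 2: istore_0  -> operand stack: [], locals[0] = 5

        return locals;
    }

    public static void main(String[] args) {
        System.out.println(run()[0]);    // prints 5
    }
}
```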

Member variable (class variable)

A member variable (field) is stored on the heap as part of a class instance (or object). Information about the field is added to the field_info array in the class file, as follows:

ClassFile {
    u4          magic;
    u2          minor_version;
    u2          major_version;
    u2          constant_pool_count;
    cp_info     constant_pool[constant_pool_count - 1];
    u2          access_flags;
    u2          this_class;
    u2          super_class;
    u2          interfaces_count;
    u2          interfaces[interfaces_count];
    u2          fields_count;
    field_info      fields[fields_count];
    u2          methods_count;
    method_info     methods[methods_count];
    u2          attributes_count;
    attribute_info  attributes[attributes_count];
}

In addition, if this variable is initialized, the bytecode for initialization will be added to the instance constructor.

When the following code is compiled:

public class SimpleClass{
    public int simpleField = 100;
}

the javap output contains an additional section showing the field that was added to the field_info array:

public int simpleField;
Signature: I
flags: ACC_PUBLIC

The bytecode for initialization is added to the constructor as follows:

public SimpleClass();
  Signature: ()V
  flags: ACC_PUBLIC
  Code:
    stack=2, locals=1, args_size=1
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: aload_0
       5: bipush        100
       7: putfield      #2                  // Field simpleField:I
      10: return

aload_0

Push an object reference in the local variable array slot to the top of the operand stack.

Although the source code above declares no constructor, the compiler creates a default constructor, and the field initialization is placed there.

The first local variable (slot 0) therefore holds this.
The aload_0 opcode pushes the this reference onto the operand stack.
aload_0 is one of a group of opcodes of the form aload_n, each of which pushes an object reference onto the operand stack.
1. n is the position of the object reference in the local variable array, and can only be 0, 1, 2 or 3.
2. Similar opcodes exist for loading values that are not object references: iload_n, lload_n, fload_n and dload_n, where i stands for int, l for long, f for float and d for double.
3. Local variables with an index greater than 3 are loaded with iload, lload, fload, dload and aload. These opcodes take a single operand that specifies the index of the local variable to load.
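The choice between the short forms and the operand forms listed above can be sketched as a small helper. LoadOpcodeSketch and loadInstruction are hypothetical names for this illustration; the sketch only formats a mnemonic and ignores the wide prefix that is needed for indexes above 255.

```java
final class LoadOpcodeSketch {
    /**
     * Returns the load instruction for a local variable.
     * kind is one of 'i', 'l', 'f', 'd' or 'a'; index is the slot in the local variable array.
     */
    static String loadInstruction(char kind, int index) {
        if ("ilfda".indexOf(kind) < 0 || index < 0) {
            throw new IllegalArgumentException("unsupported kind or index");
        }
        // Slots 0..3 have dedicated one-byte opcodes; higher slots need an operand.
        return index <= 3
                ? kind + "load_" + index
                : kind + "load " + index;
    }

    public static void main(String[] args) {
        System.out.println(loadInstruction('a', 0));  // prints aload_0
        System.out.println(loadInstruction('i', 5));  // prints iload 5
    }
}
```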

invokespecial

The invokespecial instruction is used to call instance initialization methods (constructors), private methods and methods of the parent class of the current class.

It is one of a group of opcodes that invoke methods:

invokedynamic (MethodHandle, Lambda)
invokeinterface (interface method)
invokespecial (constructor, parent method, private method)
invokestatic (static method)
invokevirtual (instance method)

The invokespecial instruction is used in this code to call the constructor of the parent class.
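To relate the invoke opcodes listed above to source code, the following sketch annotates ordinary Java calls with the instruction that javac typically emits for them. The class and method names are made up for the example, and the comments describe typical compiler output rather than a guarantee; running javap -c on the compiled class is the way to confirm it for a given compiler version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class InvokeSketch {
    static class Base {
        String name() { return "base"; }
    }

    static class Derived extends Base {
        @Override
        String name() { return "derived+" + super.name(); }  // super call: invokespecial
    }

    static int run() {
        List<String> list = new ArrayList<>();   // constructor call: invokespecial <init>
        list.add(new Derived().name());          // name(): invokevirtual, List.add: invokeinterface
        int max = Math.max(list.size(), 3);      // Math.max: invokestatic
        Supplier<Integer> supplier = () -> max;  // lambda creation: invokedynamic
        return supplier.get() + list.get(0).length();
    }

    public static void main(String[] args) {
        System.out.println(run());               // prints 15
    }
}
```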

bipush

Pushes a byte to the operand stack as an integer. In this example, 100 is pushed to the operand stack.

putfield

putfield is followed by an operand, #2, which is a reference to a field in the runtime constant pool (cp_info). In this example the field is simpleField. The instruction assigns the value to the field, and both the value and the object that contains the field are popped off the operand stack.

The preceding aload_0 instruction pushed the object that contains the field, and the preceding bipush instruction pushed 100, onto the operand stack. putfield then pops both of them from the top of the operand stack. The end result is that the simpleField field of this object is set to 100.

The above code is executed in memory as follows:

[Figure: java_class_variable_creation_byte_code]
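At the source level, the constructor bytecode above corresponds roughly to writing the constructor by hand. The class name SimpleClassExplicit is an assumption for this sketch, and the offsets in the comments are taken from the listing above.

```java
final class SimpleClassExplicit {
    public int simpleField;

    public SimpleClassExplicit() {
        super();                 // 0: aload_0, 1: invokespecial Object.<init>
        this.simpleField = 100;  // 4: aload_0, 5: bipush 100, 7: putfield simpleField
    }                            // 10: return

    public static void main(String[] args) {
        System.out.println(new SimpleClassExplicit().simpleField);  // prints 100
    }
}
```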

The putfield opcode has a single operand pointing to the second location in the constant pool.

The JVM maintains a per-type constant pool, a run-time data structure that is similar to a symbol table but contains more data.

Bytecode in Java needs data. This data is usually too large to be stored directly in the bytecode, so it is placed in the constant pool and the bytecode holds a reference to it. When a class file is created, part of it is the constant pool, as shown below:

Constant pool:
   #1 = Methodref          #4.#16         //  java/lang/Object."<init>":()V
   #2 = Fieldref           #3.#17         //  SimpleClass.simpleField:I
   #3 = Class              #13            //  SimpleClass
   #4 = Class              #19            //  java/lang/Object
   #5 = Utf8               simpleField
   #6 = Utf8               I
   #7 = Utf8               <init>
   #8 = Utf8               ()V
   #9 = Utf8               Code
  #10 = Utf8               LineNumberTable
  #11 = Utf8               LocalVariableTable
  #12 = Utf8               this
  #13 = Utf8               SimpleClass
  #14 = Utf8               SourceFile
  #15 = Utf8               SimpleClass.java
  #16 = NameAndType        #7:#8          //  "<init>":()V
  #17 = NameAndType        #5:#6          //  simpleField:I
  #18 = Utf8               LSimpleClass;
  #19 = Utf8               java/lang/Object
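The way the operand #2 of putfield resolves through the constant pool above can be sketched with a small lookup table. The class ConstantPoolSketch and its string-array encoding of entries are assumptions for this illustration and do not mirror the binary layout of cp_info; only the entries needed to resolve #2 are included.

```java
import java.util.HashMap;
import java.util.Map;

final class ConstantPoolSketch {
    // Each entry is {tag, value or referenced indexes}, copied from the listing above.
    static final Map<Integer, String[]> POOL = new HashMap<>();
    static {
        POOL.put(2, new String[] {"Fieldref", "3", "17"});
        POOL.put(3, new String[] {"Class", "13"});
        POOL.put(5, new String[] {"Utf8", "simpleField"});
        POOL.put(6, new String[] {"Utf8", "I"});
        POOL.put(13, new String[] {"Utf8", "SimpleClass"});
        POOL.put(17, new String[] {"NameAndType", "5", "6"});
    }

    /** Follows index references until only Utf8 text remains. */
    static String resolve(int index) {
        String[] entry = POOL.get(index);
        switch (entry[0]) {
            case "Utf8":
                return entry[1];
            case "Class":
                return resolve(Integer.parseInt(entry[1]));
            case "NameAndType":
                return resolve(Integer.parseInt(entry[1])) + ":" + resolve(Integer.parseInt(entry[2]));
            case "Fieldref":
                return resolve(Integer.parseInt(entry[1])) + "." + resolve(Integer.parseInt(entry[2]));
            default:
                throw new IllegalArgumentException("unknown tag " + entry[0]);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(2));  // prints SimpleClass.simpleField:I
    }
}
```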

Constant (class constant)

A variable declared final is called a constant, and it is marked with ACC_FINAL in the class file.

For example:

public class SimpleClass {
    public final int simpleField = 100;
    public  int simpleField2 = 100;
}

The ACC_FINAL flag is added to the field description:

public final int simpleField = 100;
Signature: I
flags: ACC_PUBLIC, ACC_FINAL
ConstantValue: int 100

However, the initialization operation in the constructor is not affected:

4: aload_0
5: bipush        100
7: putfield      #2                  // Field simpleField2:I

Static variable

Variables declared static, which we call static class variables, are marked with ACC_STATIC in the class file, as follows:

public static int simpleField;
Signature: I
flags: ACC_PUBLIC, ACC_STATIC

The instance constructor contains no bytecode that initializes the static variable. Static variables are instead initialized in the class constructor (the static initializer, <clinit>), which uses the putstatic opcode instead of putfield.

static {};
  Signature: ()V
  flags: ACC_STATIC
  Code:
    stack=1, locals=0, args_size=0
       0: bipush         100
       2: putstatic      #2                  // Field simpleField:I
       5: return
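The access flags shown in the javap listings for constants and static variables can also be observed at run time through reflection. The sketch below uses the standard java.lang.reflect.Modifier class; the class FlagsSketch, the nested Sample class and its field names are invented for the example.

```java
import java.lang.reflect.Modifier;

final class FlagsSketch {
    static class Sample {
        public int plain = 100;            // ACC_PUBLIC
        public final int constant = 100;   // ACC_PUBLIC, ACC_FINAL
        public static int shared = 100;    // ACC_PUBLIC, ACC_STATIC
    }

    /** Returns the modifiers of the named field of Sample as text. */
    static String flags(String fieldName) {
        try {
            return Modifier.toString(Sample.class.getDeclaredField(fieldName).getModifiers());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(flags("plain"));     // prints public
        System.out.println(flags("constant"));  // prints public final
        System.out.println(flags("shared"));    // prints public static
    }
}
```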

Conditional statement

Conditional flow control, such as if-else and switch statements, works at the bytecode level by using an instruction that compares two values and branches to another bytecode location.

for and while loops are implemented in a similar way. The difference is that they usually also contain a goto instruction that makes the bytecode loop.
do-while loops do not require a goto instruction because their conditional branch is at the end of the bytecode. For more details about loops, see the loop section.

Some opcodes can compare two integers or two references and then branch in a single instruction. Comparisons between other types, such as double, long or float, take two steps.

First, the comparison is performed and 1, 0 or -1 is pushed onto the top of the operand stack. Next, a branch is taken based on whether the value on the operand stack is greater than, less than or equal to 0.

First, let's take the if else statement as an example. Other different types of instructions for branch jumping will be included in the following explanation.

if-else

The following code shows a simple if else statement to compare the size of two integers.

public int greaterThen(int intOne, int intTwo) {
    if (intOne > intTwo) {
        return 0;
    } else {
        return 1;
    }
}

This method is compiled into the following bytecode:

0: iload_1
1: iload_2
2: if_icmple        7
5: iconst_0
6: ireturn
7: iconst_1
8: ireturn

First, iload_1 and iload_2 push the two parameters onto the operand stack.
Then, if_icmple compares the two values at the top of the operand stack.
If intOne is less than or equal to intTwo, execution branches to the bytecode at offset 7.

Note that the test in the bytecode is the exact opposite of the test in the Java code: in the bytecode, a successful test causes the else block to be executed, whereas in the Java code, a successful test causes the if block to be executed.

In other words, if_icmple tests whether the if condition is false and, if so, skips the if block. The body of the if block is the bytecode at offsets 5 and 6, and the body of the else block is the bytecode at offsets 7 and 8.

[Figure: java_if_else_byte_code]
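The inverted test can be made visible by writing the method in the same shape as its bytecode. This is only a sketch: the class IfElseSketch and the method bytecodeShape are hypothetical names, and the offsets in the comments refer to the listing above.

```java
final class IfElseSketch {
    // The method as written in the article.
    static int greaterThen(int intOne, int intTwo) {
        if (intOne > intTwo) {
            return 0;
        } else {
            return 1;
        }
    }

    // The same logic laid out the way the bytecode is: test the opposite condition and branch.
    static int bytecodeShape(int intOne, int intTwo) {
        if (intOne <= intTwo) {  // 2: if_icmple 7 (taken when the source test fails)
            return 1;            // 7: iconst_1, 8: ireturn (else block)
        }
        return 0;                // 5: iconst_0, 6: ireturn (if block, reached by falling through)
    }

    public static void main(String[] args) {
        System.out.println(greaterThen(2, 1) == bytecodeShape(2, 1));  // prints true
    }
}
```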

The following code example shows a slightly more complex example, which requires a two-step comparison:

public int greaterThen(float floatOne, float floatTwo) {
    int result;
    if (floatOne > floatTwo) {
        result = 1;
    } else {
        result = 2;
    }
    return result;
}

This method generates the following bytecode:

0: fload_1
 1: fload_2
 2: fcmpl
 3: ifle          11
 6: iconst_1
 7: istore_3
 8: goto          13
11: iconst_2
12: istore_3
13: iload_3
14: ireturn

In this example, fload_1 and fload_2 first push the two parameters onto the top of the operand stack. This example differs from the previous one in that it requires a two-step comparison. fcmpl first compares floatOne and floatTwo and pushes the result onto the top of the operand stack, as follows:

floatOne > floatTwo -> 1

floatOne = floatTwo -> 0

floatOne < floatTwo -> -1

floatOne or floatTwo = NaN -> -1

Next, if the result of fcmpl is <= 0, ifle jumps to the bytecode at offset 11.

Another difference from the previous example is that this method has only a single return statement at the end, and there is a goto instruction at the end of the if block to prevent the else block from being executed.
The goto branches to the iload_3 instruction at offset 13, which pushes the result stored in slot 3 of the local variable array onto the top of the operand stack so that it can be returned by the return instruction.

[Figure: java_if_else_byte_code_extra_goto]

As well as the opcodes for numerical comparison, there are opcodes for reference equality comparison, such as ==, for comparison with null, such as == null and != null, and for testing the type of an object, such as instanceof.

if_icmpeq, if_icmpne, if_icmplt, if_icmple, if_icmpgt, if_icmpge: this group of opcodes compares the two integers at the top of the operand stack and, if the comparison succeeds, jumps to a new bytecode location. The suffixes mean:

eq - equal to
ne - not equal to
lt - less than
le - less than or equal to
gt - greater than
ge - greater than or equal to

if_acmpeq and if_acmpne: these two opcodes test whether two references are equal (eq) or not equal (ne), and then jump to the bytecode location specified by the operand.
ifnonnull and ifnull: these opcodes test whether a reference is non-null or null, and then jump to the bytecode location specified by the operand.
lcmp: this opcode compares the two long values at the top of the operand stack and then pushes an int value onto the operand stack, as shown below:

value1 > value2 -> push 1
value1 = value2 -> push 0
value1 < value2 -> push -1

fcmpl, fcmpg, dcmpl and dcmpg: this group of opcodes compares two float or double values and then pushes an int value onto the operand stack, as shown below:

value1 > value2 -> push 1
value1 = value2 -> push 0
value1 < value2 -> push -1

The difference between the opcodes ending in l and those ending in g is how they handle NaN.

When either value is NaN (Not a Number), fcmpg and dcmpg push the int value 1 onto the operand stack, while fcmpl and dcmpl push -1. Choosing the right variant ensures that a test involving NaN does not succeed.
- For example, for x > y (where x and y are both doubles), the dcmpl instruction pushes -1 onto the operand stack if either x or y is NaN.
- The next opcode is then an ifle instruction, which branches if the value at the top of the stack is less than or equal to 0. As a result, if either x or y is NaN, ifle jumps past the if block so that the code inside it is not executed.
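The NaN behaviour of the l and g variants can be modelled directly in Java. The sketch below is an illustration with invented names (FloatCompareSketch and its methods); it also assumes that a compiler pairs x > y with fcmpl followed by ifle, and x < y with fcmpg followed by ifge, which is what javac typically does.

```java
final class FloatCompareSketch {
    /** Models fcmpl: pushes -1 when either value is NaN. */
    static int fcmpl(float value1, float value2) {
        if (value1 > value2) return 1;
        if (value1 == value2) return 0;
        return -1;  // value1 < value2, or at least one value is NaN
    }

    /** Models fcmpg: pushes 1 when either value is NaN. */
    static int fcmpg(float value1, float value2) {
        if (value1 < value2) return -1;
        if (value1 == value2) return 0;
        return 1;   // value1 > value2, or at least one value is NaN
    }

    /** x > y as "fcmpl; ifle skip": the if block runs only when the result is > 0. */
    static boolean greaterThan(float x, float y) {
        return !(fcmpl(x, y) <= 0);
    }

    /** x < y as "fcmpg; ifge skip": the if block runs only when the result is < 0. */
    static boolean lessThan(float x, float y) {
        return !(fcmpg(x, y) >= 0);
    }

    public static void main(String[] args) {
        System.out.println(greaterThan(Float.NaN, 1f));  // prints false
        System.out.println(lessThan(Float.NaN, 1f));     // prints false
    }
}
```

With either variant the NaN case ends up on the side of the branch that skips the if block, which matches the Java rule that every ordered comparison with NaN is false.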
instanceof: if the object at the top of the operand stack is an instance of the specified class, this opcode pushes the int value 1 onto the operand stack. The operand of this opcode specifies the class by giving an index into the constant pool. If the object is null or is not an instance of the specified class, the int value 0 is pushed onto the operand stack.

ifeq, ifne, iflt, ifle, ifgt, ifge: all of these opcodes compare the value at the top of the operand stack with 0, and then jump to the bytecode location specified by the operand if the comparison succeeds.

These instructions are typically used for more complex conditional logic that cannot be done with a single instruction, for example testing the result of a method call.

switch

The type of a Java switch expression can be char, byte, short, int, Character, Byte, Short, Integer, String or an enum type.

To support switch statements, the Java virtual machine uses two special instructions, tableswitch and lookupswitch, both of which operate on integer values. Using only integer values is not a problem, because char, byte, short and enum types can all be promoted internally to int.

The support for String added in Java 7 is also implemented through integers. tableswitch is typically faster, but usually takes up more memory.

tableswitch works by listing all possible case values between the minimum and maximum case values. The minimum and maximum values are also provided, so if the switch variable is outside this range the JVM can jump to the default block immediately. Values for which the Java code provides no case statement are also listed, but point to the default block, so that every value between the minimum and the maximum is present.

For example, take the following switch statement:

public int simpleSwitch(int intOne) {
    switch (intOne) {
        case 0:
            return 3;
        case 1:
            return 2;
        case 4:
            return 1;
        default:
            return -1;
    }
}

This code generates the following bytecode:

0: iload_1
1: tableswitch   {
         default: 42
             min: 0
             max: 4
               0: 36
               1: 38
               2: 42
               3: 42
               4: 40
    }
36: iconst_3
37: ireturn
38: iconst_2
39: ireturn
40: iconst_1
41: ireturn
42: iconst_m1
43: ireturn

The tableswitch instruction has entries for 0, 1 and 4, matching the case statements in the Java code, and each entry points to the bytecode of the corresponding code block. It also has entries for 2 and 3, which are not case statements in the Java code; both point to the default code block. When the instruction is executed, the value at the top of the operand stack is checked to see whether it lies between the minimum and maximum values. If it does not, execution jumps to the default branch, which is at offset 42 in the example above. So that the tableswitch instruction can always find it, the default branch offset is always the first value in the instruction (after any required alignment padding). If the value does lie between the minimum and the maximum, it is used as an index into the tableswitch table to find the bytecode to branch to.

For example, if the value is 1, execution jumps to the bytecode at offset 38. The following figure shows how this bytecode is executed:

[Figure: java_switch_tableswitch_byte_code]
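The range check followed by an indexed jump can be sketched in a few lines. TableSwitchSketch is a hypothetical class for this illustration; the table contents are the branch targets from the tableswitch listing above, and the method returns the bytecode offset that execution would jump to.

```java
final class TableSwitchSketch {
    static final int MIN = 0;
    static final int MAX = 4;
    static final int DEFAULT_TARGET = 42;
    // One entry per value from MIN to MAX; 2 and 3 point at the default block.
    static final int[] TARGETS = {36, 38, 42, 42, 40};

    /** Returns the bytecode offset that tableswitch branches to for the given value. */
    static int branchTarget(int value) {
        if (value < MIN || value > MAX) {
            return DEFAULT_TARGET;     // out of range: go straight to default
        }
        return TARGETS[value - MIN];   // in range: a single array lookup
    }

    public static void main(String[] args) {
        System.out.println(branchTarget(1));  // prints 38
        System.out.println(branchTarget(7));  // prints 42
    }
}
```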

If the values in the case statements are far apart (that is, sparse), this approach is not desirable because it would take up too much memory. When the cases of a switch are sparse, lookupswitch is used instead of tableswitch. lookupswitch lists the branch target for each case statement, but does not list all possible values.

When lookupswitch is executed, the value at the top of the operand stack is compared with each value in the lookupswitch to determine the correct branch address. With lookupswitch, the JVM has to search the list of matches for the correct one, which takes more time. With tableswitch, the JVM can locate the correct entry directly.
When a switch statement is compiled, the compiler must make a trade-off between memory and performance to decide which instruction to use. For the following code, the compiler will use lookupswitch:

public int simpleSwitch(int intOne) {
    switch (intOne) {
        case 10:
            return 1;
        case 20:
            return 2;
        case 30:
            return 3;
        default:
            return -1;
    }
}

The bytecode generated by this code is as follows:

0: iload_1
1: lookupswitch  {
         default: 42
           count: 3
              10: 36
              20: 38
              30: 40
    }
36: iconst_1
37: ireturn
38: iconst_2
39: ireturn
40: iconst_3
41: ireturn
42: iconst_m1
43: ireturn

To allow a search algorithm that is more efficient than a linear search, lookupswitch provides the number of match values and keeps the match values sorted. The following figure shows how the above code is executed:

[Figure: java_switch_lookupswitch_byte_code]
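Because the match values are sorted, one possible lookup strategy is a binary search, sketched below with the keys and branch targets from the lookupswitch listing above. LookupSwitchSketch is an invented name, and the use of binary search is an assumption about how an implementation might exploit the sorting, not a statement about any particular JVM.

```java
import java.util.Arrays;

final class LookupSwitchSketch {
    static final int[] KEYS = {10, 20, 30};     // sorted match values
    static final int[] TARGETS = {36, 38, 40};  // branch target for each match value
    static final int DEFAULT_TARGET = 42;

    /** Returns the bytecode offset that lookupswitch branches to for the given value. */
    static int branchTarget(int value) {
        int index = Arrays.binarySearch(KEYS, value);  // negative when the key is absent
        return index >= 0 ? TARGETS[index] : DEFAULT_TARGET;
    }

    public static void main(String[] args) {
        System.out.println(branchTarget(20));  // prints 38
        System.out.println(branchTarget(25));  // prints 42
    }
}
```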

String switch

In Java 7, the switch statement gained support for the String type. Although the existing opcodes that implement switch statements only support int, no new opcodes were added. A switch on a String is instead done in two parts. First, the hash code of the value at the top of the operand stack is compared with the hash code of each case label. This step can be done with either lookupswitch or tableswitch (depending on how sparse the hash codes are).

This results in a branch to bytecode that calls String.equals() to perform an exact match. A tableswitch instruction then uses the result of String.equals() to jump to the code for the correct case statement.

public int simpleSwitch(String stringOne) {
    switch (stringOne) {
        case "a":
            return 0;
        case "b":
            return 2;
        case "c":
            return 3;
        default:
            return 4;
    }
}

This string switch statement will generate the following bytecode:

0: aload_1
 1: astore_2
 2: iconst_m1
 3: istore_3
 4: aload_2
 5: invokevirtual #2                  // Method java/lang/String.hashCode:()I
 8: tableswitch   {
         default: 75
             min: 97
             max: 99
              97: 36
              98: 50
              99: 64
       }
36: aload_2
37: ldc           #3                  // String a
39: invokevirtual #4                  // Method java/lang/String.equals:(Ljava/lang/Object;)Z
42: ifeq          75
45: iconst_0
46: istore_3
47: goto          75
50: aload_2
51: ldc           #5                  // String b
53: invokevirtual #4                  // Method java/lang/String.equals:(Ljava/lang/Object;)Z
56: ifeq          75
59: iconst_1
60: istore_3
61: goto          75
64: aload_2
65: ldc           #6                  // String c
67: invokevirtual #4                  // Method java/lang/String.equals:(Ljava/lang/Object;)Z
70: ifeq          75
73: iconst_2
74: istore_3
75: iload_3
76: tableswitch   {
         default: 110
             min: 0
             max: 2
               0: 104
               1: 106
               2: 108
       }
104: iconst_0
105: ireturn
106: iconst_2
107: ireturn
108: iconst_3
109: ireturn
110: iconst_4
111: ireturn

The class that contains this bytecode also contains the following constant pool entries, which the bytecode references. To learn more about constant pools, see the runtime constant pool section of this article on JVM internals.

Constant pool:
  #2 = Methodref          #25.#26        //  java/lang/String.hashCode:()I
  #3 = String             #27            //  a
  #4 = Methodref          #25.#28        //  java/lang/String.equals:(Ljava/lang/Object;)Z
  #5 = String             #29            //  b
  #6 = String             #30            //  c

 #25 = Class              #33            //  java/lang/String
 #26 = NameAndType        #34:#35        //  hashCode:()I
 #27 = Utf8               a
 #28 = NameAndType        #36:#37        //  equals:(Ljava/lang/Object;)Z
 #29 = Utf8               b
 #30 = Utf8               c

 #33 = Utf8               java/lang/String
 #34 = Utf8               hashCode
 #35 = Utf8               ()I
 #36 = Utf8               equals
 #37 = Utf8               (Ljava/lang/Object;)Z

Note the amount of bytecode needed to execute this switch, including two tableswitch instructions and several invokevirtual instructions that call String.equals(). For more details about invokevirtual, see the method invocation section of the next article.

[Figure: execution of this bytecode for the input "b"]

If different cases have the same hash code, for example the strings "FB" and "Ea", which both hash to 2236, the collision is handled by slightly altering the flow of the equals calls, as shown below. Note that the bytecode at offset 34, ifeq 42, branches to another call to String.equals() rather than to the second switch instruction, as it did in the previous example, which had no hash collisions.

public int simpleSwitch(String stringOne) {
    switch (stringOne) {
        case "FB":
            return 0;
        case "Ea":
            return 2;
        default:
            return 4;
    }
}

The bytecode generated by the above code is as follows:

0: aload_1
 1: astore_2
 2: iconst_m1
 3: istore_3
 4: aload_2
 5: invokevirtual #2                  // Method java/lang/String.hashCode:()I
 8: lookupswitch  {
         default: 53
           count: 1
            2236: 28
    }
28: aload_2
29: ldc           #3                  // String Ea
31: invokevirtual #4                  // Method java/lang/String.equals:(Ljava/lang/Object;)Z
34: ifeq          42
37: iconst_1
38: istore_3
39: goto          53
42: aload_2
43: ldc           #5                  // String FB
45: invokevirtual #4                  // Method java/lang/String.equals:(Ljava/lang/Object;)Z
48: ifeq          53
51: iconst_0
52: istore_3
53: iload_3
54: lookupswitch  {
         default: 84
           count: 2
               0: 80
               1: 82
    }
80: iconst_0
81: ireturn
82: iconst_2
83: ireturn
84: iconst_4
85: ireturn
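The two-part structure of the bytecode above can be written back as ordinary Java. This is a hand-written approximation of what the compiler generates, under the assumption that the local variable in slot 3 acts as a case index; the class name StringSwitchSketch and the variable name index are invented for the sketch.

```java
final class StringSwitchSketch {
    static int simpleSwitch(String stringOne) {
        int index = -1;                            // 2: iconst_m1, 3: istore_3
        switch (stringOne.hashCode()) {            // 8: first lookupswitch, on the hash code
            case 2236:                             // hash code shared by "Ea" and "FB"
                if (stringOne.equals("Ea")) {      // 31: equals, 34: ifeq 42
                    index = 1;
                } else if (stringOne.equals("FB")) {  // 45: equals, 48: ifeq 53
                    index = 0;
                }
                break;
            default:
                break;
        }
        switch (index) {                           // 54: second lookupswitch, on the case index
            case 0:
                return 0;                          // case "FB"
            case 1:
                return 2;                          // case "Ea"
            default:
                return 4;
        }
    }

    public static void main(String[] args) {
        System.out.println("FB".hashCode() + " " + "Ea".hashCode());  // prints 2236 2236
        System.out.println(simpleSwitch("Ea"));                       // prints 2
    }
}
```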

loop

Conditional flow control, such as if-else and switch statements, works by using an instruction that compares two values and then branches to the corresponding bytecode. For more details about conditional statements, see the conditional statement section.
Loops, including for loops and while loops, are implemented in a similar way, except that they usually also use a goto instruction to make the bytecode loop. do-while loops do not require a goto instruction because their conditional branch is at the end of the bytecode.
Some bytecodes can compare two integers or two references and then branch in a single instruction. Comparisons between other types, such as double, long or float, require two steps. First, the comparison is performed and 1, 0 or -1 is pushed onto the top of the operand stack. Next, a branch is taken based on whether the value at the top of the operand stack is greater than, less than or equal to 0. For more details about the branch instructions, see above.

while Loop

A while loop consists of a conditional branch instruction, such as if_icmpge or if_icmplt (as described above), and a goto instruction. The conditional instruction branches to the instruction immediately after the loop, and so terminates the loop, when the loop condition is no longer true. The last instruction in the loop is a goto that jumps back to the beginning of the loop code, so the loop repeats until the conditional branch is taken, as shown below:

public void whileLoop() {
    int i = 0;
    while (i < 2) {
        i++;
    }
}

Compiled into:

0: iconst_0
 1: istore_1
 2: iload_1
 3: iconst_2
 4: if_icmpge       13
 7: iinc            1, 1
10: goto            2
13: return

The if_icmpge instruction tests whether the local variable at position 1 is greater than or equal to 2. If it is, the instruction jumps to the bytecode at offset 13, which ends the loop. The goto instruction keeps the bytecode looping until the if_icmpge condition eventually holds, at which point execution branches straight to the return instruction. iinc is one of the few instructions that updates a local variable directly, without loading or storing values on the operand stack. In this example, iinc adds 1 to the local variable in slot 1.
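The control flow of the while loop bytecode can be followed step by step with a toy interpreter that keeps a program counter, an operand stack and a local variable array. This is a sketch with invented names (WhileLoopTrace, run) that handles only the eight instructions in the listing above; it returns the final value of the local variable in slot 1.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class WhileLoopTrace {
    /** Interprets the while loop bytecode and returns the final value of local variable 1. */
    static int run() {
        int[] locals = new int[2];                 // slot 0: this, slot 1: i
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            switch (pc) {
                case 0:  stack.push(0);           pc = 1;  break;  // iconst_0
                case 1:  locals[1] = stack.pop(); pc = 2;  break;  // istore_1
                case 2:  stack.push(locals[1]);   pc = 3;  break;  // iload_1
                case 3:  stack.push(2);           pc = 4;  break;  // iconst_2
                case 4: {                                          // if_icmpge 13
                    int value2 = stack.pop();
                    int value1 = stack.pop();
                    pc = (value1 >= value2) ? 13 : 7;
                    break;
                }
                case 7:  locals[1] += 1;          pc = 10; break;  // iinc 1, 1 (no stack traffic)
                case 10: pc = 2;                           break;  // goto 2
                case 13: return locals[1];                         // return
                default: throw new IllegalStateException("unexpected offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run());  // prints 2
    }
}
```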

for loop

The for loop and the while loop use exactly the same pattern at the bytecode level. This is not surprising, because every while loop can be rewritten as an equivalent for loop. The simple while loop above can be rewritten as a for loop that generates exactly the same bytecode, as shown below:

public void forLoop() {
    for(int i = 0; i < 2; i++) {
    }
}

do-while Loop

The do-while loop is very similar to the for loop and the while loop, except that it needs no goto instruction, because the conditional branch is the last instruction and is used to jump back to the beginning of the loop.

public void doWhileLoop() {
    int i = 0;
    do {
        i++;
    } while (i < 2);
}

The generated bytecode is as follows:

0: iconst_0
 1: istore_1
 2: iinc     1, 1
 5: iload_1
 6: iconst_2
 7: if_icmplt   2
10: return

Keywords: Java jvm Programmer

Added by pbeerman on Sun, 23 Jan 2022 09:26:13 +0200
