Compiler stage-1
To simplify compilation and improve portability, an intermediate code is designed to run in a virtual machine. VM is a fictitious machine, but it can be ported and run on real machines of various architectures.
The VM language covered in this chapter includes four commands: arithmetic, memory access, program flow, subroutine call commands.
7.1Background
7.1.1The Virtual Machine Paradigm
The operation of a high-level language relies on the compiler to compile it into machine code supported by the target platform, usually requiring a separate compiler to compile each pair of high-level language-machine code.
This dependency can be resolved by introducing intermediate code running on virtual machine. Compilation is simplified into two almost non-interacting phases:
- parse the High-level language and translate its commands into intermediate processing steps
- translate the intermediate steps into targe machine language
The advantage of this is that the first phase of compilation relies only on the high-level language, while the second phase relies only on the machine code of the target platform, greatly improving portability and the convenience of modular compilation. JRE and. NET frameworks all adopt this idea.
7.1.2Stack Machine Model
All operands and results in the VM are stored in the stack, and all calculations, reads, including subprocess call operations can be accomplished through stack operations.
- push
- pop
Specific role:
- Handle all calculations and logical operations
- Convenient subprocess invocation and memory allocation
7.2VM Specification Part-1
7.2.1General
The virtual machine is stack-based and function-based. There are four commands:
- Arithmetic commands - Perform operations and logical operations in a stack structure
- Memory access commands - Transfer data between stack structure and virtual memory nodes
- Program flow commands - facilitate conditional and unconditional branching
- Function call commands - Calls a function and returns a value
The vm program covered in this course contains one or more. vm files, each containing one or more function s, corresponding to program, class, method concepts in the high-level language.
7.2.2 Arithmetic and Logical Commands
The VM language designed for this course includes nine stack-oriented operations or logical commands:
- Seven binary commands: pop two operands, perform an operation, and push the result onto the stack;
- Two unary commands: pop an operand, and push results are stacked after logical processing;
# Note that commands (`gt', `lt', `eq`) return `Boolean` type results: # true = -1 (0xFFFF, or 1111111111111111) # false = 0 (0x0000, or 0000000000000000)
7.2.3 Memory Access Commands
The VM language has eight virtual memory segments, with the following instructions:
- push segment index // Push the value of segment[index] onto the stack
- pop segment index // Pop the top stack value and store it in segment[index]
In addition to the virtual memory segments above, we also need to maintain the data structure stack (where push and pop operations move data stores) and heap (RAM).
7.2.4 Program Flow and Function Calling Command
Program Flow Commands:
- label symbol,Label declaration
- goto symbol,Unconditional branching
- if-goto symbol,Conditional branching
Function Calling Commands:
- function functionName nLocals
- call functionName nArgs
- return
- PS. functionName is a symbol, while nLocals and nArgs are non-negative integers.
7.3Implementation
Implementing a designed VM language on the target platform requires mapping the VM data structure to the target hardware platform and compiling all VM commands into the form of instructions supported by the bit hardware platform.
7.3.1 Standard VM Mapping on the Hack Platform, Part 1
The following discussion maps VM s to the Hack hardware platform.
VM to Hack Translate
One. A vm program may contain more than one. vm file, the program will be compiled into one. asm assembly code.
RAM Usage
There is a total of 32K 16-bit RAM space on the Hack platform, of which the first 16K is generic RAM and the last 16K is used as I/O device mapping.
The above RAM addresses 0~15 can be used as R0~R15 by any assembler.
SP, LCL, ARG, THIS, THAT can also be used to refer to RAM[0~4]. This is to improve the readability of VM programs.
Memory Segments Mapping
- Of the eight memory segments discussed earlier, local,'argument', this, and that all have direct mappings in RAM (by storing the physical address of the memory segment in a dedicated register that has been created - LCL, ARG, THIS, THAT).
That is, all operations to get the ith bit data of a memory segment can be implemented in the assembly language by manipulating RAM[base + i], where base is the value stored in the registers specified by the memory segment at this time. - Second, for pointer, temp memory segments, these two memory segments are allocated to the specified space. The pointer is assigned to RAM[3~4] (also THIS, THAT), and temp is assigned to RAM[5~12]. Getting pointer i and temp i can be done by pointing to 3 + i and 5 + i, respectively.
- In addition, constan is the only virtual memory segment, and the VM implements the <constant i>instruction by providing the constant I directly.
- Finally, the static memory segment is discussed. As mentioned in the Assembly Language Rules in the previous chapter, the assembler automatically assigns a RAM address to each new symbol it encounters, starting with RAM16. This method can be used in VMs to allocate memory for static memory segments: static variable number j in fVM code is represented as the symbol f.j in assembly code.
Assembly Language Symbol
8.2VM Specification Part-2
8.2.1 Program Flow Commands
Program control instructions used by VM:
- label label, where label is any string that does not begin with a number, identifies the location of the code where only this declaration allows a program to jump from another location to that location of the code.
- goto label, unconditional jump.
- if-goto label condition jump. This operation is performed when the top element of the stack is pop ped out, jumps if it is not zero, and does nothing else. The jump location needs to be identified.
8.2.2 Function Calling Commands
Subprocesses in high-level languages are compiled as a function in the vm. Functions are called globally by the name of the string that follows (we want the method bar in the class Foo in the advanced language to be compiled as the function Foo.bar). It is worth noting that a complete user program in a high-level language usually consists of several code files containing a class, each of which is compiled into one at the first level of compilation. vm file, the second layer is also the last layer of Jack to VM compiled, these several. The vm file is compiled into one. asm file. Considering that different classes in the advanced language may have the same method, compile to. After the asm file, to prevent scope conflicts, the method bar in the class Foo is compiled into the function Foo. Bar.
The instructions in the VM are as follows:
- function f n, declares function f, has n local variables
- call f m, calling function f, with m parameters on the stack.
8.3.3 The Function Calling Protocol
Function call events can be viewed from two perspectives: the function call process and the called function.
- Caller perspective:
- The caller stacks the appropriate number of arguments before calling the function;
- Activate the function with the call command;
- After the called function executes and returns a value, the previously stacked parameters are gone, and the function's return value is replaced by the function's return value.
- After the called function executes and returns a value, the caller's memory segment argument, local, static, this, that, pointer is exactly the same as before the function was called, temp is undefined.
- Called Function Perspective:
- The function starts execution, and its parameter list argument is initialized to the value passed to its argument.
- The preset list of local local variables is initialized to 0;
- The static memory segment of the function is set to the static memory segment owned by the VM file to which it belongs.
- The operational stack is empty;
- this, that, pointer, temp memory segment undefined;
- A result is stacked before the function returns a value.
8.2.4 Initialization
A VM program consists of several VM functions (compiled in a high-level language). When the VM program runs, Sys needs to be executed first. Init function, which calls the Main function that the user needs to execute.
8.3 Implementation
This section mainly describes how to improve the Vm Translator built in Chapter 7. 8.3.1 describes how to maintain a critical stack structure and map it to a hardware platform. A specific example is given in 8.3.2.
8.3.1 Standard VM Mapping on the Hack Platform, Part Ⅱ
The Global Stack
Resource control of VM language is implemented through stack structure. Whenever a function is called, a new block is added to the global stack. The memory block contains
- argument, a list of parameters that have passed values;
- pointers, the state used by the VM for the function caller;
- local variables, local variable, initialized to 0;
- working stack, empty stack.
The following diagram shows the basic function stack structure:
It is worth noting that ARG, LCL, and SP in the called function are not visible to the caller and can only be used by the function.
In addition, according to previously set memory allocation rules, the global stack structure should start with RAM[256], so the first step the VM should perform is to set SP=256. Later, whenever pop, push, add, etc. are encountered, the value of SP needs to be updated. When you encounter call, function, return, and other instructions, you need to perform the stack operation shown in the following figure:
Function Calling Protocol Implementation
Image above
Assembly Language Symbols
Clearly, the VM implementation of program flow and function calling requires the creation and use of special symbol s. Summarize as follows:
Bootstrap Code
Execute compiled on Hack platform. asm file, first need:
- Mapping global stack after RAM[256]
- The first function to execute should be Sys.init
Since the previous hardware platform will start executing the code at the RAM[0] location after power-on, called bootstrap code, do the following if needed:
SP == 256 call Sys.init
Sys.init calls the Main function given by the user and enters an infinite loop.
8.3.2 Example
Implementation In Python
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """[two tier compile]: Compile procedure: High-level language --> intermediate-code --> target machine code Usage: > python script.py [option] [para] Recommend os: Unix-like """ __author__ = 'eric' import getopt import os.path import sys # ---------------------------------------------------------- # Description: # Translate .vm file into .asm file def main(): input_path = recv_opt_arg(sys.argv) print('input_path:\t' + input_path) if os.path.isdir(input_path): # Compile all VM code in a folder as an assembly file with the same name as the output file and folder print('[Log]:\tis dir') input_path = os.path.abspath(input_path) + '/' # get absolute path and make sure the last char is '/' output_path = input_path + input_path.split('/')[-2] + '.asm' # Output file and folder have the same name file_path_list = [] # a list that contain every to be processed file's full path files = os.listdir(input_path) for each_file in files: # Processing within a given folder. vm file if each_file[-3:] == '.vm': file_path_list.append(input_path + each_file) # Full path = File path + File name (with suffix) processing_dir_file(file_path_list, output_path) elif os.path.isfile(input_path): # Compile the given file as an assembly file, the output file has the same name as the compiled file print('[Log]:\tis file') if input_path[-3:] == '.vm': output_path = input_path[:-3] + '.asm' file_path_list = [os.path.abspath(input_path)] # get absolute input path processing_dir_file(file_path_list, output_path) else: print('[Log]:\tError_invalid_file_type') exit(0) else: # path is neither directory nor file print('[Log]:\tError_invalid_input_path_type') exit(0) return # ---------------------------------------------------------- # Description: # processing each file in file_path_list, and write asm_list into output_path # Input: # file_path_list, output_path def processing_dir_file(file_path_list, output_path): print('[Log]:\t---* processing file, list:', file_path_list) print('[Log]:\t---* output_path:', output_path) asm_list = [] for each_input_path in file_path_list: # each_file_path is the full path of each file to be processed if 'Sys.vm' in each_input_path: print('[Log]:\tAdd Bootstrap Code') asm_list = ['//Bootstrap Code'] + VMTranslator.c_init() + asm_list with open(each_input_path, 'r') as f: # import original jack_file into a list vm_list = f.readlines() file_name = each_input_path.split('/')[-1][0:-3] vm_translator = VMTranslator(vm_list, file_name) asm_list += vm_translator.get_asm_list() write_asm_file(output_path, asm_list) # ---------------------------------------------------------- # Class Description: # Instantiate: VMTranslator(vm_list, file_name) class VMTranslator(object): def __init__(self, vm_list, file_name): self.file_name = file_name print('[Log]:\t---* instantiate VMTranslator, file_name: ' + self.file_name) self.vm_list = vm_list # print('[Log]:\tvm_list, line:', len(self.vm_list), '\n', self.vm_list) self.no_comment_list = [] self.asm_list = ['//' + self.file_name] self.label_flag = 0 # Functions with the same name may exist in multiple vm files, label is identified by a file name preceded by the function name # ---* main process *--- self.format_file() # Remove comment and empty line # print('[Log]:\tno_comment_list, line:', len(self.no_comment_list), '\n', self.no_comment_list) self.parse_command() def get_asm_list(self): return self.asm_list # ---------------------------------------------------------- # Description: # Remove comment and empty line def format_file(self): for line in self.vm_list: if '//' in line: # remove comment line = line[0:line.index('//')] line = line.strip() # remove backspace on both side if len(line) != 0: # ignore empty line self.no_comment_list.append(line) # ---------------------------------------------------------- # parse vm file and translate it into assembly file # Input: # f_lines a list that contain every line of original .vm file def parse_command(self): for line in self.no_comment_list: command_list = line.split(' ') # split command with backspace command_type = self.which_command(command_list) if command_type == 'arithmetic': self.asm_list += self.c_arithmetic(command_list) elif command_type == 'push': self.asm_list += self.c_push(command_list) elif command_type == 'pop': self.asm_list += self.c_pop(command_list) elif command_type == 'label': self.asm_list += self.c_label(command_list) elif command_type == 'goto': self.asm_list += self.c_goto(command_list) elif command_type == 'if-goto': self.asm_list += self.c_if(command_list) elif command_type == 'function': self.asm_list += self.c_function(command_list) elif command_type == 'call': self.asm_list += self.c_call(command_list) elif command_type == 'return': self.asm_list += self.c_return() else: # invalid command type print('[Log]:\tError_invalid_command_type') self.asm_list += ['Error_invalid_command_type'] # ---------------------------------------------------------- # return corresponding number of command type @staticmethod def which_command(command_list): arithmetic_command = ['add', 'sub', 'neg', 'eq', 'gt', 'lt', 'and', 'or', 'not'] if command_list[0] in arithmetic_command: return 'arithmetic' else: # 'arithmetic', 'push', 'pop', 'label', 'goto', 'if-goto', 'function', 'call', 'return' return command_list[0] # ---------------------------------------------------------- # parse arithmetic command def c_arithmetic(self, command_list): command = command_list[0] if command == 'add': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'M=D+M'] elif command == 'sub': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'M=M-D'] elif command == 'neg': re_c_arithmetic = ['@SP', 'A=M-1', 'M=-M'] elif command == 'eq': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'D=M-D', 'M=-1', '@eqTrue' + str(self.label_flag), 'D;JEQ', '@SP', 'A=M-1', 'M=0', '(eqTrue' + str(self.label_flag) + ')'] self.label_flag += 1 elif command == 'gt': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'D=M-D', 'M=-1', '@gtTrue' + str(self.label_flag), 'D;JGT', '@SP', 'A=M-1', 'M=0', '(gtTrue' + str(self.label_flag) + ')'] self.label_flag += 1 elif command == 'lt': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'D=M-D', 'M=-1', '@ltTrue' + str(self.label_flag), 'D;JLT', '@SP', 'A=M-1', 'M=0', '(ltTrue' + str(self.label_flag) + ')'] self.label_flag += 1 elif command == 'and': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'M=D&M'] elif command == 'or': re_c_arithmetic = ['@SP', 'AM=M-1', 'D=M', '@SP', 'A=M-1', 'M=D|M'] elif command == 'not': re_c_arithmetic = ['@SP', 'A=M-1', 'M=!M'] else: re_c_arithmetic = [] print('[Log]:\tError_unknown_arithmetic_command') exit(0) return re_c_arithmetic # ---------------------------------------------------------- # parse push command def c_push(self, command_list): segment = command_list[1] index = command_list[2] if segment == 'constant': re_c_push = ['@' + str(index), 'D=A', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'static': re_c_push = ['@' + self.file_name + '.' + str(index), 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'this': re_c_push = ['@THIS', 'D=M', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'that': re_c_push = ['@THAT', 'D=M', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'local': re_c_push = ['@LCL', 'D=M', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'argument': re_c_push = ['@ARG', 'D=M', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'temp': re_c_push = ['@5', 'D=A', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] elif segment == 'pointer': re_c_push = ['@3', 'D=A', '@' + str(index), 'A=D+A', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] else: re_c_push = [] print('unknown arithmetic command!') return re_c_push # ---------------------------------------------------------- # parse pop command def c_pop(self, command_list): segment = command_list[1] index = command_list[2] if segment == 'static': re_c_pop = ['@SP', 'AM=M-1', 'D=M', '@' + self.file_name + '.' + str(index), 'M=D'] elif segment == 'this': re_c_pop = ['@THIS', 'D=M', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] elif segment == 'that': re_c_pop = ['@THAT', 'D=M', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] elif segment == 'local': re_c_pop = ['@LCL', 'D=M', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] elif segment == 'argument': re_c_pop = ['@ARG', 'D=M', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] elif segment == 'temp': re_c_pop = ['@5', 'D=A', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] elif segment == 'pointer': re_c_pop = ['@3', 'D=A', '@' + str(index), 'D=D+A', '@R13', 'M=D', '@SP', 'AM=M-1', 'D=M', '@R13', 'A=M', 'M=D'] else: re_c_pop = [] print('unknown arithmetic command!') return re_c_pop # ---------------------------------------------------------- # parse label command def c_label(self, command_list): new_label = self.file_name + '$' + command_list[1] # mark different labels return ['(' + new_label + ')'] # ---------------------------------------------------------- # parse goto command def c_goto(self, command_list): new_label = self.file_name + '$' + command_list[1] return ['@' + new_label, '0;JMP'] # ---------------------------------------------------------- # parse if command def c_if(self, command_list): new_label = self.file_name + '$' + command_list[1] return ['@SP', 'AM=M-1', 'D=M', '@' + new_label, 'D;JNE'] # ---------------------------------------------------------- # parse function command def c_function(self, command_list): res_push = self.c_push(['push', 'constant', '0']) res = ['(' + command_list[1] + ')'] loop_times = int(command_list[2]) while loop_times: res = res + res_push loop_times = loop_times - 1 return res # ---------------------------------------------------------- # parse return command # ---------------------------------------------------------- @staticmethod def c_return(): res = ['@LCL', 'D=M', '@R13', 'M=D', '@5', 'A=D-A', 'D=M', '@R14', 'M=D', '@SP', 'AM=M-1', 'D=M', '@ARG', 'A=M', 'M=D'] res += ['@ARG', 'D=M+1', '@SP', 'M=D'] res += ['@R13', 'AM=M-1', 'D=M', '@THAT', 'M=D'] res += ['@R13', 'AM=M-1', 'D=M', '@THIS', 'M=D'] res += ['@R13', 'AM=M-1', 'D=M', '@ARG', 'M=D'] res += ['@R13', 'AM=M-1', 'D=M', '@LCL', 'M=D', '@R14', 'A=M', '0;JMP'] return res # ---------------------------------------------------------- # parse call command @staticmethod def c_call(command_list): global CALL_FLAG label = command_list[1] + '.returnAddr.' + str(CALL_FLAG) # Call_here Flag is set up to prevent multiple calls to the same function in the same vm file, resulting in multiple identical label s # But call_maintained separately for each file Flag, a new VMTranslator object causes the variable to reset to zero when processing another file # So that you can't handle bug s where multiple files call functions with the same name multiple times # Optional solutions: # 1. call_flag's maintenance is adjusted globally, that is, to maintain the same CALL_for all files FLAG (yes) # 2. Manual exclusion by if statement (no) CALL_FLAG += 1 res = ['@' + label, 'D=A', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] res += ['@LCL', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] res += ['@ARG', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] res += ['@THIS', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] res += ['@THAT', 'D=M', '@SP', 'A=M', 'M=D', '@SP', 'M=M+1'] res += ['@' + command_list[2], 'D=A', '@5', 'D=D+A', '@SP', 'D=M-D', '@ARG', 'M=D', '@SP', 'D=M', '@LCL', 'M=D', '@' + command_list[1], '0;JMP', '(' + label + ')'] return res # ---------------------------------------------------------- # Description: # return the Bootstrap Code @staticmethod def c_init(): res_init = ['@256', 'D=A', '@SP', 'M=D'] res_call = VMTranslator.c_call(['call', 'Sys.init', '0']) return res_init + res_call # Setting the register SP to 256 + calls the function Sys.init # ---------------------------------------------------------- # Description: # write assembly code to .asm file # Input: # out_file_path, asm_list def write_asm_file(out_file_path, asm_list): with open(out_file_path, 'w') as f: for line in asm_list: f.write(line + '\n') # ---------------------------------------------------------- # Description: # receive command line input. return input_path or print usage # Output: # input_path def recv_opt_arg(argv): # print('sys.argv=| ', argv, ' |') try: opts, args = getopt.gnu_getopt(argv[1:], 'i:h?', ['input_path=', 'help']) # 'opts' is a list of tuple ('option', 'value'), each option match one value # 'args' is a list contains extra arguments # print('opts=| ', opts, ' |') # print('args=| ', args, ' |') except getopt.GetoptError as e: print(e) print_usage(argv[0]) sys.exit() input_path = os.getcwd() # default input path for opt, value in opts: # ('option', 'value'), tuple if opt in ['-h', '-?', '--help']: # print help information print_usage(argv[0]) exit(0) elif opt == '-i': # input_path input_path = value return input_path # ---------------------------------------------------------- # Description: # print usage information of this script def print_usage(cmd): print(('*********************************************************\n' + ' --* This massage gave you some detailed information! *--\n' + 'Usage: {0} [OPTION]... [PATH]...\n' + '- OPTION:\n' + ' {0} -i | --input_path\tinput path\n' + ' {0} -h | -? | --help\tprint help info and exit script\n' + '- PATH:\n' + ' Provides name of the file you want to precess or directory that contain those files\n' + ' --* *-- \n' + '*********************************************************\n').format(cmd)) if __name__ == '__main__': CALL_FLAG = 0 # There may be multiple calls to the same function in a file. Calling label at the end of the code requires flag identification main()