preface
In the last article, I shared how the Go compiler parses source files into tokens. This article mainly shares how the syntax parsing stage performs syntax parsing according to different tokens. In this article, you can learn the following:
- Go parsing overview
- Go syntax parsing detailed process
- The example shows the complete process of syntax parsing
💡 Tips: the article will involve grammar and production. If you don't know much about it, please take a look at the previous Lexical analysis & the basis of grammatical analysis This article
The article is relatively long, but there are many codes, mainly for easy understanding. I believe you will gain something after reading it
Go parsing overview
In order to facilitate the later understanding, I provide a source file here. You can bring in the following contents according to the contents of the source file:
package main import ( "fmt" "go/token" ) type aType string const A = 666 var B = 888 func main() { fmt.Println("Test Parser") token.NewFileSet() }
entrance
Introduced in the last article lexical analysis The compiler entry of Go is described in detail. The parser is initialized at the compiler entry, and the lexical analyzer is initialized during the initialization of the parser. Therefore, the lexical analyzer is embedded in the parser. We can find it in Src / CMD / compile / internal / syntax / parser The structure of the parser in Go is as follows:
type parser struct { file *PosBase //Record the information of the open file (such as file name, row and column information) errh ErrorHandler //Error callback mode Mode //Parsing mode pragh PragmaHandler scanner //Lexical analyzer base *PosBase // current position base first error // first error encountered errcnt int // number of errors encountered pragma Pragma // pragmas fnest int // function nesting level (for error handling) xnest int // Expression nesting level (for compact ambiguity resolution) indent []byte // tracing support }
Because in Last The article has shared the location of the entry file and what it does in detail (src/cmd/compile/internal/syntax/syntax.go → Parse(...)), After the initialization of the parser and lexical analyzer is completed, we will see that it calls the fileOrNil() method of the parser, which is the core method of parsing. Next, we will introduce what this core method does
Go syntax parsing structure
This part will involve the parsing of each declaration type and the structure of each declaration type, as well as the relationship between the nodes of the syntax tree generated by the syntax parser. Some places are really difficult to understand. You can read the text roughly first, and then combined with the grammar analysis diagram below, it may be easier to understand
Grammar rules in Go parsing
I'll introduce this part as a whole first, and then go to the details
Enter the fileOrNil() method, and you will see such a line of comments in the comments
SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .
It is the grammar rule of Go parsing source files, which is in the second article of the series Fundamentals of lexical and grammatical analysis It is mentioned in part. After initializing the parser and lexical analyzer, the Go compiler will call this method. This method will continuously obtain tokens through the next() method provided by the lexical word splitter. The syntax analysis will be carried out according to the above grammar rules
You may not know the production formulas of PackageClause, ImportDecl and TopLevelDecl. You can find these three production formulas directly in this file (for production formulas, see Fundamentals of lexical and grammatical analysis (some are introduced)
SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } . PackageClause = "package" PackageName . PackageName = identifier . ImportDecl = "import" ( ImportSpec | "(" { ImportSpec ";" } ")" ) . ImportSpec = [ "." | PackageName ] ImportPath . ImportPath = string_lit . TopLevelDecl = Declaration | FunctionDecl | MethodDecl . Declaration = ConstDecl | TypeDecl | VarDecl . ConstDecl = "const" ( ConstSpec | "(" { ConstSpec ";" } ")" ) . ConstSpec = IdentifierList [ [ Type ] "=" ExpressionList ] . TypeDecl = "type" ( TypeSpec | "(" { TypeSpec ";" } ")" ) . TypeSpec = AliasDecl | TypeDef . AliasDecl = identifier "=" Type . TypeDef = identifier Type . VarDecl = "var" ( VarSpec | "(" { VarSpec ";" } ")" ) . VarSpec = IdentifierList ( Type [ "=" ExpressionList ] | "=" ExpressionList ) .
fileOrNil() simply means parsing the source File according to the grammar SourceFile, and finally returns a syntax tree (File structure, which will be described below). As we know, the compiler will parse each File into a syntax tree
The following is a brief introduction to the meaning of several grammars in Go grammar analysis (if you read the previous one) Fundamentals of lexical and grammatical analysis This article should be easy to understand the meaning of Go (these Grammars)
SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } . We can see SourceFile By PackageClause,ImportDecl,TopLevelDecl These three non terminal symbols form. It means: Each source file is mainly composed of package Declaration, import declaration, and top-level declaration (where ImportDecl and TopLevelDecl Is optional and optional, which is the meaning of square brackets). In other words, every source file should conform to this grammar rule The first is the package declaration: PackageClause PackageClause = "package" PackageName . PackageName = identifier . PackageClause Is a terminator package And a non terminator PackageName Constitute, and PackageName Consisting of an identifier Therefore, when scanning the source file, the first thing you should get is package of Token,Then an identifier Token. Finished parsing package After the declaration, the import declaration should be followed Then comes the import declaration: ImportDecl ImportDecl = "import" ( ImportSpec | "(" { ImportSpec ";" } ")" ) . ImportSpec = [ "." | PackageName ] ImportPath . ImportPath = string_lit .
After understanding the meaning of the above grammar, let's look at how fileNil() parses the grammar according to the above grammar. But before that, you need to know what the node structure of the syntax tree generated by the fileOrNil() method is
Structure of each node of syntax tree
We know that the fileOrNil() method will generate a syntax tree according to the grammar rules of SourceFile, and the structure of each syntax tree is such a structure: (src/cmd/compile/internal/syntax/nodes.go → File)
type File struct { Pragma Pragma PkgName *Name DeclList []Decl Lines uint node }
It mainly contains the package name (PkgName) of the source file and all declarations (DeclList) in the source file. It should be noted that it will also parse import into DeclList as a declaration
fileOrNil() will parse all declarations (such as var, type, const) in the source file into DeclList according to the structure of each declaration (each declaration defines corresponding structures to store declaration information, and these structures are the child nodes of the syntax tree). In the grammar above, we can see the constant, type and variable declaration grammar, as well as the grammar of functions and methods, which can be found in Src / CMD / compile / internal / syntax / parser Found in go
FunctionDecl = "func" FunctionName ( Function | Signature ) . FunctionName = identifier . Function = Signature FunctionBody . MethodDecl = "func" Receiver MethodName ( Function | Signature ) . Receiver = Parameters .
The root node structure of this syntax tree is the File structure, and its child node structure is:
ImportDecl struct { Group *Group // nil means not part of a group Pragma Pragma LocalPkgName *Name // including "."; nil means no rename present Path *BasicLit decl } ConstDecl struct { Group *Group // nil means not part of a group Pragma Pragma NameList []*Name Type Expr // nil means no type Values Expr // nil means no values decl } TypeDecl struct { Group *Group // nil means not part of a group Pragma Pragma Name *Name Alias bool Type Expr decl } VarDecl struct { Group *Group // nil means not part of a group Pragma Pragma NameList []*Name Type Expr // nil means no type Values Expr // nil means no values decl } FuncDecl struct { Pragma Pragma Recv *Field // nil means regular function Name *Name Type *FuncType Body *BlockStmt // nil means no body (forward declaration) decl }
The structure of these nodes is defined in Src / CMD / compile / internal / syntax / nodes Go. That is, in the process of parsing, if it is parsed to meet the ImportDecl grammar rules, it will create nodes with corresponding structures to save relevant information; In case of meeting the var type declaration grammar, create the structure related to the var type declaration (VarDecl) to save the declaration information
If the function is parsed, it is a little more complex. From the structure of the function node, you can see that it includes the receiver, function name, type and function body. The most complex part is the function body, which is a BlockStmt structure:
BlockStmt struct { List []Stmt //Stmt is an interface Rbrace Pos stmt }
BlockStmt is composed of a series of declarations and expressions. You can find it in Src / CMD / compile / internal / syntax / nodes You can see many expressions and declared structures in go (these structures are also the node structures under the function declaration)
// ---------------------------------------------------------------------------- // Statements ...... SendStmt struct { Chan, Value Expr // Chan <- Value simpleStmt } DeclStmt struct { DeclList []Decl stmt } AssignStmt struct { Op Operator // 0 means no operation Lhs, Rhs Expr // Rhs == ImplicitOne means Lhs++ (Op == Add) or Lhs-- (Op == Sub) simpleStmt } ...... ReturnStmt struct { Results Expr // nil means no explicit return values stmt } IfStmt struct { Init SimpleStmt Cond Expr Then *BlockStmt Else Stmt // either nil, *IfStmt, or *BlockStmt stmt } ForStmt struct { Init SimpleStmt // incl. *RangeClause Cond Expr Post SimpleStmt Body *BlockStmt stmt } ...... // ---------------------------------------------------------------------------- // Expressions ...... // [Len]Elem ArrayType struct { // TODO(gri) consider using Name{"..."} instead of nil (permits attaching of comments) Len Expr // nil means Len is ... Elem Expr expr } // []Elem SliceType struct { Elem Expr expr } // ...Elem DotsType struct { Elem Expr expr } // struct { FieldList[0] TagList[0]; FieldList[1] TagList[1]; ... } StructType struct { FieldList []*Field TagList []*BasicLit // i >= len(TagList) || TagList[i] == nil means no tag for field i expr } ...... // X[Index] IndexExpr struct { X Expr Index Expr expr } // X[Index[0] : Index[1] : Index[2]] SliceExpr struct { X Expr Index [3]Expr // Full indicates whether this is a simple or full slice expression. // In a valid AST, this is equivalent to Index[2] != nil. // TODO(mdempsky): This is only needed to report the "3-index // slice of string" error when Index[2] is missing. Full bool expr } // X.(Type) AssertExpr struct { X Expr Type Expr expr } ......
Does it look familiar? There are if, for and return. Stmt in BlockStmt is an interface type, which means that various expression types or declaration type structures above can implement stmt interfaces
After knowing the relationship between the nodes in the syntax tree and the possible structures of these nodes, let's see how fileOrNil parses the nodes of the syntax tree step by step
Source code implementation of fileOrNil()
In this part, I will not go deep into each function to see how it is parsed, but introduce how it parses the token in the source file step by step from the general outline, because this part is mainly understood from the whole first. How to parse import, var, const and func will be described in detail in the next section
You can see that the code implementation of fileOrNil() mainly includes the following parts:
// SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } . func (p *parser) fileOrNil() *File { ...... //1. Create a File structure f := new(File) f.pos = p.pos() // 2. First parse the package definition at the beginning of the file // PackageClause if !p.got(_Package) { //Check whether the package is defined in the first line p.syntaxError("package statement must be first") return nil } // 3. After parsing the package, parse the import declaration (each import is a declaration statement in the view of the parser) // { ImportDecl ";" } for p.got(_Import) { f.DeclList = p.appendGroup(f.DeclList, p.importDecl) p.want(_Semi) } // 4. switch according to the obtained token, select the corresponding branch to parse the corresponding type of statement // { TopLevelDecl ";" } for p.tok != _EOF { switch p.tok { case _Const: p.next() // Get the next token f.DeclList = p.appendGroup(f.DeclList, p.constDecl) ...... } // p.tok == _EOF p.clearPragma() f.Lines = p.line return f }
There are two important methods in fileOrNil(), which are the key to understanding what fileOrNil() is doing:
- got
- appendGroup
The first is the got function, whose parameter is a token, which is used to judge whether the token obtained from the lexical analyzer is the token passed in from the parameter
Then there is the appendGroup function, which has two parameters. The first is DeclList (described earlier when introducing the members of the File structure, which is used to store all declarations in the source File and is a slice type); The second parameter is a function, which is the analysis function of each type of declaration statement (for example, if I currently resolve to import declaration statement, I will pass the method of resolving import as the second parameter to appendGroup)
When parsing the import declaration statement, the following is the code:
for p.got(_Import) { f.DeclList = p.appendGroup(f.DeclList, p.importDecl) p.want(_Semi) }
The function of appendGroup is to find out the definition of batch, such as the following
//Batch declaration of import import ( "fmt" "io" "strconv" "strings" ) //Batch declaration of var var ( x int y int )
For the declaration statement structure with batch declaration, it will have a Group field to indicate that these variables belong to the same Group, such as the declaration structure of import and the declaration structure of var
ImportDecl struct { Group *Group // nil means not part of a group Pragma Pragma LocalPkgName *Name // including "."; nil means no rename present Path *BasicLit decl } VarDecl struct { Group *Group // nil means not part of a group Pragma Pragma NameList []*Name Type Expr // nil means no type Values Expr // nil means no values decl }
In the appendGroup method, the parsing method of the corresponding declared type will be called. Like fileOrNil(), it will be parsed according to the grammar of the corresponding type declaration. For example, the method importDecl() that parses the import declaration
for p.got(_Import) { f.DeclList = p.appendGroup(f.DeclList, p.importDecl) p.want(_Semi) } ...... // ImportSpec = [ "." | PackageName ] ImportPath . // ImportPath = string_lit . func (p *parser) importDecl(group *Group) Decl { if trace { defer p.trace("importDecl")() } d := new(ImportDecl) d.pos = p.pos() d.Group = group d.Pragma = p.takePragma() switch p.tok { case _Name: d.LocalPkgName = p.name() case _Dot: d.LocalPkgName = p.newName(".") p.next() } d.Path = p.oliteral() if d.Path == nil { p.syntaxError("missing import path") p.advance(_Semi, _Rparen) return nil } return d }
We can see that it also creates the structure of the corresponding declaration (here is the import declaration), records the declaration information, and parses the subsequent contents according to the grammar of the declaration
Others include declaration statements such as const, type, var and func, which are matched and parsed by switch. They also have corresponding parsing methods (in fact, the code implemented according to their own grammar rules). I don't list them here. You can use Src / CMD / compile / internal / syntax / parser View in go
As we mentioned earlier, the parser will eventually use a different structure to build each node of the syntax tree. The root node is Src / CMD / compile / internal / syntax / nodes go → File. Its structure has been introduced earlier, mainly including package names and all declaration types. These different types of declarations are the child nodes of the syntax tree
It may be a little difficult to understand through the text description above. The following shows what the whole syntax parsing process is like through a diagram
Graphical parsing process
Take the source code provided at the beginning of the article as an example to show the process of syntax parsing. stay Go lexical analysis It is mentioned in this article that the compilation entry of go is from Src / CMD / compile / internal / GC / noder Go → parseFiles
Here, I believe you have an overall understanding of the grammar analysis part of Go. However, the above does not show how various declaration statements are parsed down, which requires an in-depth look at how the parsing method of each declaration statement is implemented
Go syntax parsing detailed process
For the parsing of various declarations and expressions in Go syntax analysis, you can find them in Src / CMD / compile / internal / syntax / parser Find the corresponding method in Go
Variable declaration & import declaration resolution
The variable declaration is used to call the relevant resolution methods in Src / CMD / compile / internal / syntax / parser Go → fileOrNil() method
Import declaration resolution
In Src / CMD / compile / internal / syntax / parser You can see the following code in go → fileOrNil()
for p.got(_Import) { f.DeclList = p.appendGroup(f.DeclList, p.importDecl) p.want(_Semi) }
In this appendGroup, the importDecl() method will eventually be called (it can be found from the above code that the parsing method of the import declaration will not be executed until the token of the import is matched)
First of all, import can be used in the following ways:
import "a" //By default, a is the package name import aAlias "x/a" // Alias the x/a package aAlias import . "c" // Import the public symbols of dependent packages directly into the namespace of the current file import _ "d" // Only importing dependent packages triggers the initialization of their packages, but does not import any symbols into the current file namespace
For space reasons, I won't stick out the source code of the importDecl() method. I sort out what it does here:
- Create a structure of ImportDeal (which will eventually be append ed to File.DeclList)
- Initialize some data in the structure, such as the location information of the parsed token, group, etc
- Then match the next token to see if it is_ Token type (identifier) of Name, or_ Dot(.)
- If the obtained token is_ Name, the package name is obtained. If yes, the package name is obtained_ Dot(.), Create a new name
- Then there is the matching package path, which is mainly implemented by the oliteral() method
- Returns the ImportDeal structure
What is worth mentioning here is the oliteral() method, which will get the next token to see if it is a token of basic face value type, that is_ Literal
💡 Tips: the basic face value has only integer, floating point number, complex number, Rune and string types
If it is a base face value type, it will create a structure node of the base face value type Basiclit, and initialize some information about it
BasicLit struct { Value string //value Kind LitKind //Base face value of that type, range (IntLit, FloatLit, imagelit, RuneLit, StringLit) Bad bool // true means the literal Value has syntax errors expr }
It is not difficult to see that when the basic face value type is parsed, it is already indecomposable, which is the terminator in grammar. In the syntax tree, these terminators are leaf nodes
Relevant methods are provided in the go standard library to test syntax parsing. Here, I test the results of syntax parsing during import through the interface provided by go/parser:
💡 Tips: same as lexical analysis (you can have a look Go lexical analyzer In the standard library testing part of this article, the implementation of syntax analysis in the standard library is different from that in the go compiler. It is mainly the design of structure (for example, if the structure of the node changes, you can see it yourself. It is relatively simple. If you understand the implementation of syntax analysis in the compiler, you can also understand it in the standard library), the implementation idea is the same
package main import ( "fmt" "go/parser" "go/token" ) const src = ` package test import "a" import aAlias "x/a" import . "c" import _ "d" ` func main() { fileSet := token.NewFileSet() f, err := parser.ParseFile(fileSet, "", src, parser.ImportsOnly) //parser.ImportsOnly mode, which means that only package declarations and imports are resolved if err != nil { panic(err.Error()) } for _, s := range f.Imports{ fmt.Printf("import: name = %v, path = %#v\n", s.Name, s.Path) } }
Print results:
import: name = <nil>, path = &ast.BasicLit{ValuePos:22, Kind:9, Value:"\"a\""} import: name = aAlias, path = &ast.BasicLit{ValuePos:40, Kind:9, Value:"\"x/a\""} import: name = ., path = &ast.BasicLit{ValuePos:55, Kind:9, Value:"\"c\""} import: name = _, path = &ast.BasicLit{ValuePos:68, Kind:9, Value:"\"d\""}
You can test various types of declaration parsing or expression parsing in the standard library. I won't show the tests one by one below (the test method is the same, but the fields you need to print need to be changed)
type declaration resolution
When the parser gets_ When type is a token, it will call the type parsing method to parse. Similarly, in fileOrNil(), you can see the following code:
...... case _Type: p.next() f.DeclList = p.appendGroup(f.DeclList, p.typeDecl) ......
It will call the typeDecl() method in appendGroup, which is based on the grammar declared by type type, which is introduced in the front. We know that type has the following usage:
type a string type b = string
Let's look at what has been done in this method:
- Create a TypeDecl structure (which will eventually be append ed to File.DeclList)
- Initialize some data in the structure, such as the location information of the parsed token, group, etc
- Whether the next token is_ Assign. If yes, get the next token
- Verify the type of the next token, such as chan, struct, map or func (implemented in the typeOrNil() method, it is actually a pile of switch case s)
- Returns the TypeDecl structure
When getting the type of the token on the far right, you need to continue parsing according to its type. If it is a chan type, it will create a chan type structure and initialize the information of this chan type (in the process of parsing, each type of structure created is a node)
You can also use go/parser → ParseFile to test the parsing of type
const type declaration resolution
When the parser gets_ When const this token, it will call const's parsing method to parse it. Similarly, in fileOrNil(), you can see the following code:
...... case _Const: p.next() f.DeclList = p.appendGroup(f.DeclList, p.constDecl) ......
const has the following usage:
const A = 666 const B float64 = 6.66 const ( _ token = iota _EOF _Name _Literal )
Then you can look at the concrete implementation of constDecl method (in fact, it is parsed according to the syntax of const declaration). It will not be repeated here. It is similar to the type resolution above. The corresponding structure is created first, and then some information of the type is recorded. The difference is that it has a list of names, because const can declare multiple names at the same time. The parsing method is parser go→nameList()
var type declaration resolution
When the parser gets_ When var is a token, it will call the var parsing method to parse. Similarly, in fileOrNil(), you can see the following code:
...... case _Var: p.next() f.DeclList = p.appendGroup(f.DeclList, p.varDecl) ......
Like the two declarations above, the corresponding parsing method will be called. The difference of var is that its declaration may involve expressions, so the parsing method of var involves the parsing of expressions. I will analyze the parsing of expressions in detail later
Function declaration parsing implementation
💡 Note: a figure will be added to show the process of function analysis
Finally, the function declaration is parsed. As mentioned earlier, the structure of the function declaration node is as follows:
FuncDecl struct { Pragma Pragma Recv *Field // recipient Name *Name //Function name Type *FuncType //Function type Body *BlockStmt // Function body decl }
In fileOrNil(), you can see the following code:
case _Func: p.next() if d := p.funcDeclOrNil(); d != nil { f.DeclList = append(f.DeclList, d) }
The core method of parsing function is funcDeclOrNil. Because the function parsing is a little more complex, I'll stick out its implementation here and explain what each line of code is doing through comments
// Grammar of function // FunctionDecl = "func" FunctionName ( Function | Signature ) . // FunctionName = identifier . // Function = Signature FunctionBody . // Grammar of method // MethodDecl = "func" Receiver MethodName ( Function | Signature ) . // Receiver = Parameters . // Recipient of method func (p *parser) funcDeclOrNil() *FuncDecl { if trace { defer p.trace("funcDecl")() } f := new(FuncDecl) //Create structure (node) of function declaration type f.pos = p.pos() f.Pragma = p.takePragma() if p.tok == _Lparen { //If the left parenthesis is matched (it indicates the method) rcvr := p.paramList() //Get recipient list switch len(rcvr) { case 0: p.error("method has no receiver") default: p.error("method has multiple receivers") fallthrough case 1: f.Recv = rcvr[0] } } if p.tok != _Name { //Judge whether the next token is an identifier (i.e. function name) p.syntaxError("expecting name or (") p.advance(_Lbrace, _Semi) return nil } f.Name = p.name() f.Type = p.funcType() //Get type (continue to understand its internal implementation below) if p.tok == _Lbrace { // If it matches the left bracket, the function body is parsed f.Body = p.funcBody() //Parse the function body (continue to understand its internal implementation below) } return f }
Two important implementations of function parsing: funcType(), funcBody(). What did they do internally?
/* FuncType struct { ParamList []*Field ResultList []*Field expr } */ func (p *parser) funcType() *FuncType { if trace { defer p.trace("funcType")() } typ := new(FuncType) //Create a function type structure (the main members are parameter list and return value list) typ.pos = p.pos() typ.ParamList = p.paramList() //Get the parameter list (it returns a Field structure whose members are parameter names and types) typ.ResultList = p.funcResult() //Get the list of return values (it also returns a Field structure) return typ }
func (p *parser) funcBody() *BlockStmt { p.fnest++ // Record the call level of the function errcnt := p.errcnt // Record the current number of errors body := p.blockStmt("") // Parse the statements in the function body (see the specific implementation below) p.fnest-- // Don't check branches if there were syntax errors in the function // as it may lead to spurious errors (e.g., see test/switch2.go) or // possibly crashes due to incomplete syntax trees. if p.mode&CheckBranches != 0 && errcnt == p.errcnt { checkBranches(body, p.errh) } return body } func (p *parser) blockStmt(context string) *BlockStmt { if trace { defer p.trace("blockStmt")() } s := new(BlockStmt) //Create the structure of the function body s.pos = p.pos() // people coming from C may forget that braces are mandatory in Go if !p.got(_Lbrace) { p.syntaxError("expecting { after " + context) p.advance(_Name, _Rbrace) s.Rbrace = p.pos() // in case we found "}" if p.got(_Rbrace) { return s } } s.List = p.stmtList() //Start parsing the declaration and expression in the function body (the implementation of the edge here is to judge which declaration or statement is based on the obtained token, and it is also implemented through switch case to parse the corresponding grammar according to the matching type) s.Rbrace = p.pos() p.want(_Rbrace) return s }
The above function analysis process is shown in the figure for easy understanding:
You can see how to parse some functions in the function body, such as assignment, for, go, defer, select, etc
summary
This paper mainly shares the process of syntax parsing as a whole, and briefly shows the internal implementation of type, const and func declaration parsing. In fact, there are expression parsing, including some other syntax parsing. There are many contents, which can't be introduced one by one. Interested partners can study them by themselves
Thanks for reading. The next topic is: abstract syntax tree construction