Recursive descent method for OO expression analysis

The content of my seminar (2021) was revised from last year's seminar and supplemented by the new content of my seminar (2021).

In this year's first unit training guide, the course group has officially given a tutorial on recursive descent method, but the tutorial is short and does not give code. This post can be used as a supplement to the official instruction of the course group for students' reference.

The following is the text.

Why recursive descent

The expressions involved in the OO operation in this unit are derived from the syntax rules defined by a series of formal expressions described by EBNF. Each of the syntax rules is called a production, and the symbol on the left side of the production is called a non Terminator (e.g. expression, item, etc.). In contrast, the symbol that does not appear on the left side of the production in the syntax rules is called a Terminator (e.g. x, +, 1, etc.).

Recursive descent It is a top-down syntax analysis method in the compilation principle. Its approach is to write a recursively callable analysis subroutine (method) for each non Terminator (i.e. each symbol on the left side of the production in the syntax rules), which is used to deal with the nested syntax hierarchy through the mutual calls between the methods.

With the syntax rules of formal expression, we can use the recursive descent method to write the parsing program, and the structure of the parsing program is basically consistent with the syntax rules defined in the formal expression, which makes the code structure of the expression parsing part very clear, And it is very easy to expand (when the grammar rules are added or modified in subsequent jobs, you only need to write the corresponding parsing method for the added grammar rules, find the parsing method of the grammar rules to be modified and modify its internal parsing logic)

Then do expression evaluation

In the data structure course of freshman year, we have learned and practiced the programming topic expression evaluation for many times. At that time, the double stack method was widely used to maintain an operand stack and an operator stack. This method has some limitations. When the syntax rules of the expression become more and more complex (such as the introduction of power, user-defined function, unary operator, etc.), the complexity of the parser also increases, and the readability decreases due to the increase of judgment. (this method is still feasible in OO jobs, but its maintainability may deteriorate with the iteration of the second and third jobs)

Here, let's review the expression evaluation with recursive descent method and introduce it through relatively simple syntax.

Consider evaluating the following versions of the expression:

  • Only integers are involved
  • Support +, -, *, / operations (the mathematical meaning of the priority is the same as that of the operator)
  • Parentheses and nested parentheses are allowed
  • Integer and bracket can be preceded by 1 symbol

(the division method is the same as the integer division method in C language to ensure that there is no division by zero in the operation process)

Problem solving steps

  1. generative grammar
  • Analyze the grammatical components of expressions and their relationships
  • Give the formal definition described by BNF or EBNF
    (in this OO operation, the formal definition has been given in the instruction, so this step is not necessary)
  1. Write a parsing function for each non terminal of the syntax rule
  • A function only parses a non terminal (or a production) for which it is responsible
  • When the next level sub component is encountered, the parsing function of the corresponding component is called
  • Defines the return value of the function
    • This problem requires evaluation, that is, after parsing, each non terminator corresponds to a value, so the return value of each function is an int
    • If you need to build an expression tree, each parsing function returns a pointer to the expression tree node (object variables in Java)
  • Judgment of syntax error (Wrong Format)
    • Parsing cannot continue (the symbol encountered at the current location does not meet the syntax rules)
    • Statement parsing should end but not end

Description syntax rules

The syntax rules of the above expression are described by EBNF (the definition of non terminator is represented by - > in the following production formula):

Expr    -> Term     | Expr ('+'|'-') Term
Term    -> Factor   | Term ('*'|'/') Factor
Factor  -> ['+'|'-'] ( Number | '(' Expr ')' )
Number  -> Digit { Digit }
Digit   -> '0'|'1'|'2'|'3'|...|'9'

among

  • {}: duplicate, 0 or more allowed
  • []: optional, 0 or 1 allowed
  • (): grouping, used to control the priority of production
  • |: or, select one of multiple selections
  • The character (string) wrapped in single quotation marks is the terminator

The formal expression grammar rules describing the same language may not be unique (if the title is not given), for example:

  • In addition to the left recursive form, the production of expression Expr and Term can also be rewritten into circular form
    • Expr -> Term { ('+'|'-') Term }
  • Factor and its left symbol can be split
    • Factor -> Sign UnsignedFactor
    • Sign -> '+' | '-'
    • UnsignedFactor -> Number | '(' Expr ')'
  • The unsigned integer Number can also be used directly as an entire terminator

Eliminate Left Recursion

With the syntax rules of formal expression, can you start writing recursive descent parser immediately? Not at this time. In the above example, there is left recursion in the production of expression Expr and Term. Left recursion refers to the form of production such as a - > A and B, such as Expr - > Expr '+' Term.

If there is left recursion in the syntax rules, the recursive descent method cannot be directly used for analysis, because such a parser will produce infinite recursion (recursive descent analysis is usually greedy for matching, and if for the above left recursive Expr, the parser cannot determine the right boundary of the Expr on the right of the production, or when the recursion will stop).

In order to use recursive descent method for parsing, we need to rewrite the syntax rules: rewrite the production with left recursion into right recursion or circular form (described by {} symbol), so as to eliminate left recursion.

For example, there are two ways to rewrite the production of Expr:

  • Expr -> Term | Term ('+' | '-') Expr
  • Expr -> Term { ('+' | '-') Term }

Both rewriting methods are correct. In contrast, it is more recommended to use the form of loop. If the form of right recursion is adopted, a large number of stack pressing behaviors (recursive function calls) will occur in the corresponding recursive descent parser. If the expression is long, there is a risk of stack overflow. (for example, x+x+x+x+...+x, where the number of X is as many as possible on the premise that the length of the expression does not exceed the upper limit)

Here is a small extension. Why do the grammar rules given in the guide use left recursion to describe expressions and items?

According to personal understanding, the level of the syntax tree derived from the left recursive syntax rules is consistent with the real operation order. In the mathematical sense, +, * and other binary operators are combined from left to right. The operators located on the left are operated first, and the corresponding syntax level corresponds to the left recursive form.

Write parser

Here is the code of C language to realize the evaluation of the above expression:

#include <stdio.h>
#include <stdlib.h>

/*
  Expr    -> Term     | Expr ('+'|'-') Term
  Term    -> Factor   | Term ('*'|'/') Factor
  Factor  -> ['+'|'-'] ( Number | '(' Expr ')' )
  Number  -> unsigned number
*/

// These functions call each other and need to be declared in advance in C language
// Expression evaluation, each parsing subroutine returns a value
int Expr();
int Term();
int Factor();
int Number();


#define MAX_LEN 100 / / maximum length of expression

char buffer[MAX_LEN + 5]; // Input buffer
int pos; // Resolving location subscript

#define chr (buffer[pos]) / / retrieve the current character
#define nxt (pos + +) / / move 1 character backward

#define WF do{puts("WRONG FORMAT!"); exit(1);}while(0)

// Expr -> Term | Expr ('+'|'-') Term
int Expr() {
    int ans = Term();
    while (chr == '+' || chr == '-') {
        if (chr == '+') {
            nxt;
            ans += Term();
        } else {
            nxt;
            ans -= Term();
        }
    }
    return ans;
}

// Term -> Factor | Term ('*'|'/') Factor
int Term() {
    int ans = Factor();
    while (chr == '*' || chr == '/') {
        if (chr == '*') {
            nxt;
            ans *= Factor();
        } else {
            nxt;
            ans /= Factor();
        }
    }
    return ans;
}

// Factor -> ['+'|'-'] ( Number | '(' Expr ')' )
int Factor() {
    int sign = 1; // positive
    // ['+' | '-']
    if (chr == '+') nxt;
    else if (chr == '-') {
        nxt;
        sign = -1; // negative
    }
    // Number | '(' Expr ')'
    // There is an "or" relationship in the syntax rule = > read 1 symbol forward to judge whether the next parsing direction is a single Number or a bracket '(' Expr ')'
    if (chr == '(') {
        nxt;
        int inner = Expr();
        if (chr == ')') nxt; // expect ')' here
        else WF;
        return sign * inner;
    } else if (chr >= '0' && chr <= '9') {
        return sign * Number();
    } else WF; // either '(' or Number
}

int Number() {
    int ans = 0;
    while (chr >= '0' && chr <= '9') {
        ans = (ans << 3) + (ans << 1) + (chr - '0');
        nxt;
    }
    return ans;
}

int main() {
    while (scanf("%s", buffer) != EOF) {
        pos = 0;
        int ans = Expr();
        if (chr != 0) WF; // not reach end
        printf("%d\n", ans);
    }
    return 0;
}

The above codes can be Evaluation of accoding-303 expression Evaluation of.

The recursive descent method is implemented in Java, and the program structure is similar to the above C language code.

Complexity analysis

Time complexity: \ (O(n) \)

  • The parsed string is traversed from left to right. It is linear time complexity without backtracking
  • The number of branches of "or" relationship in grammar rules affects the constant. The more branches, the greater the constant.

Space complexity: \ (O(n) \)

  • It corresponds to the depth of the stack in the recursive process and the height of the syntax tree
  • Changing right recursion into loop can optimize space complexity

OO expression job

After understanding the idea of recursive descent through the relatively simple expression evaluation, we can return to the expression operation of OO.

Grammar rules - Preparation

Since the formal expression of the expression syntax rules is given in the homework, you don't need to understand the syntax of the expression too much, just follow the rules.

Before starting recursive descent, we need to check whether the production of left recursion exists in the syntax rules. If so (such as expressions and items), we need to rewrite it into circular form.

In addition, some rewriting that does not affect the correctness of grammar can also be carried out according to personal habits. For example, the "blank item" defined in the job can be regarded as an entire symbol (Terminator), so the "blank character" is no longer needed. (for white space characters, the parse method does not need to return any content. Its purpose is to skip these white space characters and set the return value type to void.)

Writing of parser

If you need to parse the input string to get the expression tree, the recursive descent parser is equivalent to the factory in the factory mode. Its input is a string, and the constructed product is the expression tree object.

The parsed result of this method is a non parsed result of this method.

  • The syntax tree node or the value of the calculation result shall be determined according to the actual situation.
  • For blank items, skip directly and do not need to return a value

For the production of grammatical components with "or" relationship, the direction of the next analysis needs to be determined during the analysis (that is, select one branch from several branches). In order to avoid inefficient backtracking, we need to adopt the way of "pre reading", Read the next character in advance to judge the direction of the next step (in the above C language code, pre reading is required when parsing Factor). Because the Iterator in Java can only move forward one way with the next method (it can't even stop and read the current value in place like * it in C + +), it is not recommended to use the Iterator in Java when writing a recursive descent parser. In the simplest way, maintain an integer subscript that points to the current resolved location.

Here is a code template for Java recursive descent to establish an expression tree:

/* Expression tree node class definition */

// Abstract node base class
public abstract class TreeNode {
    // ......
}

// Leaf node (number)
public class NumberNode extends TreeNode {
    // ......
    public NumberNode(int number) { /* ... */ }
}

// Addition node 
public class AddNode extends TreeNode {
    // ......
    public AddNode(TreeNode left, TreeNode right) { /* ... */ }
}

// The definitions of subtraction SubNode, multiplication MulNode and division DivNode are omitted

/* recursive descent parser  */
public class ExprBuilder {
    private final String expr; // Expression to be parsed
    private int pos; // Resolved subscript

    public ExprBuilder(String expr) {
        this.expr = expr;
        this.pos = 0;
    }

    // parseXXXX represents the method of parsing XXXX grammatical components
    // Only Expr is the top-level symbol of syntax, so this method is public
    // Throwing an exception indicates that the expression to be parsed is malformed
    public TreeNode parseExpr() throws Exception {
        // Use expr Charat (POS) takes out the character of the current position
        // ......
    }

    private TreeNode parseTerm() throws Exception { /* ... */ }

    private TreeNode parseFactor() throws Exception { /* ... */ }

    private TreeNode parseNumber() throws Exception { /* ... */ }
}

It can be seen that the parsing program written by recursive descent method is quite concise, clear architecture, good readability and highly easy to expand.

Advantages of recursive descent method

  1. Correctness

    The parsing function (method) of each non terminator strictly follows the definition of production. Just reading the code and simple test can ensure the correctness of the parser.

  2. Readability and scalability

    For the iterative development of the three jobs, the non terminators (such as expressions and items) with the same derivation rules as those in the last job do not need to be modified; You only need to modify the parsing of non terminators whose syntax has changed or added.

    From then on, bid farewell to the long and difficult to read "hell" grand rule. (if you need lexical analysis or matching function names / numbers / variables, you only need a few very simple regular expressions, which will never be more difficult than the regular expressions in the Pre operation)

  3. Format check

    When the next character read out in the parsing process cannot meet any direction of the current production, resulting in the inability of syntax analysis to continue, an exception is thrown indicating WRONG FORMAT. The exceptions thrown by the inner method will be thrown to the top level level level by level, and finally thrown out.

Finally, I wish the students can successfully complete the OO trip of unit 1!

appendix

Two reference codes of programming problems realized by recursive descent method are attached here.

Ccf-csp December 2019 question 3 chemical equation

/* CCF CSP 201912-3 Chemistry Equation */

/* 
Recursive descend
BNF: 
<equation> ::= <expr> "=" <expr> // 2H2+O2=2H2O
<expr> ::= <coef> <formula> | <coef> <formula> "+" <expr> // 2H2+O2, 2H2O
<coef> ::= <digits> | ""    // 
<digits> ::= <digit> | <digits> <digit>
<digit> ::= "0" | "1" | ... | "9"
<formula> ::= <term> <coef> | <term> <coef> <formula>   // H2O CO2 Ca(OH)2 Ba3(PO4)2
<term> ::= <element> | "(" <formula> ")" // H, Ca, (OH), (PO4)
<element> ::= <uppercase> | <uppercase> <lowercase> // H, Ca, O, P, C
<uppercase> ::= "A" | "B" | ... | "Z"
<lowercase> ::= "a" | "b" | ... | "z"

11
H2+O2=H2O
2H2+O2=2H2O
H2+Cl2=2NaCl
H2+Cl2=2HCl
CH4+2O2=CO2+2H2O
CaCl2+2AgNO3=Ca(NO3)2+2AgCl
3Ba(OH)2+2H3PO4=6H2O+Ba3(PO4)2
3Ba(OH)2+2H3PO4=Ba3(PO4)2+6H2O
4Zn+10HNO3=4Zn(NO3)2+NH4NO3+3H2O
4Au+8NaCN+2H2O+O2=4Na(Au(CN)2)+4NaOH
Cu+As=Cs+Au
*/

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <map>
#include <cassert>
using namespace std;
typedef long long ll;

#define MAXN 1000
char input[MAXN + 5];
int offset;

#define RST (offset = 0)
#define CHR (input[offset])
#define NXT (offset++)

struct ElementSet {
    map<string, int> content;
    ElementSet() {}
    ElementSet(string element, int coef) {content[element] = coef;}
    void merge(const ElementSet &s) {
        map<string, int>::const_iterator it1 = s.content.begin(), ie1 = s.content.end();
        while (it1 != ie1) {
            if (this->content.count(it1->first)) {
                this->content[it1->first] += it1->second;
            }
            else {
                this->content[it1->first] = it1->second;
            }
            it1++;
        }
    }
    void times(const int coef) {
        map<string, int>::iterator it1 = this->content.begin(), ie1 = this->content.end();
        while (it1 != ie1) {
            this->content[it1->first] *= coef;
            it1++;
        }
    }
    bool equals(const ElementSet &b) {
        map<string, int>::const_iterator it1 = this->content.begin(), ie1 = this->content.end();
        map<string, int>::const_iterator it2 = b.content.begin(), ie2 = b.content.end();
        while (it1 != ie1 && it2 != ie2) {
            if (it1->first != it2->first) return false;
            if (it1->second != it2->second) return false;
            it1++; it2++;
        }
        return (it1 == ie1) && (it2 == ie2);
    }
};

bool Equation();
void Expr(ElementSet &es); 
int Coef();
// Digits
// Digit
void Formula(ElementSet &es);
void Term(ElementSet &es);
string Element();
// uppercase
// lowercase

bool Equation() {
    ElementSet left, right;
    Expr(left);
    assert(CHR == '=');
    NXT;
    Expr(right);
    return left.equals(right);
}

void Expr(ElementSet &es) {
    int coef = Coef();
    ElementSet formula;
    Formula(formula);
    formula.times(coef);
    es.merge(formula);
    while (CHR == '+') {
        NXT;
        formula.content.clear();
        coef = Coef();
        Formula(formula);
        formula.times(coef);
        es.merge(formula);
    }
}

int Coef() {
    if (CHR >= '0' && CHR <= '9') {
        int ret = 0;
        while (CHR >= '0' && CHR <= '9') {
            ret = (ret << 3) + (ret << 1) + (CHR - '0');
            NXT;
        }
        return ret;
    }
    else return 1;
}

void Formula(ElementSet &es) {
    ElementSet term;
    Term(term);
    int coef = Coef();
    term.times(coef);
    es.merge(term);
    while (CHR && CHR != '+' && CHR != ')' && CHR != '=') { // !!!
        term.content.clear();
        Term(term);
        coef = Coef();
        term.times(coef);
        es.merge(term);
    }
}

void Term(ElementSet &es) {
    if (CHR >= 'A' && CHR <= 'Z') {
        string ele = Element();
        es.merge(ElementSet(ele, 1));
    }
    else if (CHR == '(') {
        NXT; // '('
        ElementSet formula;
        Formula(formula);
        NXT; // ')'
        es.merge(formula);
    }
    else assert(0);
}

string Element() {
    assert(CHR >= 'A' && CHR <= 'Z');
    char ele[3] = {CHR, 0, 0};
    NXT;
    if (CHR >= 'a' && CHR <= 'z') {ele[1] = CHR; NXT;}
    return string(ele);
}


int main()
{
    // freopen("test.in", "r", stdin);
    int n; scanf("%d", &n);
    while (n--) {
        scanf("%s", input);
        RST;
        printf("%c\n", Equation() ? 'Y' : 'N');
    }
    return 0;
}

accoding-4395 seeking truth in expressions

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/********** Formula ***********
 * <EXPR> ::= <TERM> {"+"|"-" <TERM>}
 * <TERM> ::= <FACTOR> {"*"|"/" <FACTOR>}
 * <FACTOR> ::= (<NAME> [<FUNC> | <METHOD> {<METHOD>}]) | ("(" <EXPR> ")" {<METHOD>})
 * <FUNC> ::= "(" <EXPR> {"," <EXPR>} ")"
 * <METHOD> ::= "." <NAME> <FUNC>
 * <NAME> ::= [IDENTIFIER]
 *******************************/ 

#define WF puts("WRONG FORMAT");

#define MAX_EXPR_LENGTH 125
char buf[MAX_EXPR_LENGTH];
char *p_str = buf;

static int count = 0;
#define SAVE (++count)

typedef struct Val {
    unsigned char type; // 0: variable, 1: temporary result
    union {
        char var;
        int res;
    } val;
} Val;
void disp(const Val *v) {
    if (v->type == 0) {
        putchar(v->val.var);
    } else {
        printf("%d", v->val.res);
    }
}
Val ValVar(char var) {
    Val r;
    r.type = 0;
    r.val.var = var;
    return r;
}
Val ValRes(int res) {
    Val r;
    r.type = 1;
    r.val.res = res;
    return r;
}


Val Expr();
Val Term();
Val Factor();
Val Func(char name, const Val* self);
Val Method(const Val *self);
Val Name();

int main()
{
    gets(buf);
    Expr();
    return 0;
}


Val Expr(){
    if (!(*p_str)) WF;

    Val term = Term();
    Val res = term;
    while ((*p_str) && ((*p_str) == '+' || (*p_str) == '-')) {
        char op = *(p_str++);
        term = Term();
        printf("%c ", op);
        disp(&res); putchar(' ');
        disp(&term);
        res = ValRes(SAVE);
        putchar('\n');
    }
    
    return res;
}

Val Term(){
    if (!(*p_str)) WF;

    Val factor = Factor();
    Val res = factor;
    while ((*p_str) && ((*p_str) == '*' || (*p_str) == '/')) {
        char op = *(p_str++);
        factor = Factor();
        printf("%c ", op);
        disp(&res); putchar(' ');
        disp(&factor);
        res = ValRes(SAVE);
        putchar('\n');
    }
    
    return res;
}

Val Factor(){
    if (!(*p_str)) WF;
    Val ret;
    if ((*p_str) == '(') {
        p_str++;
        Val expr = Expr();
        if (!(*p_str) || (*p_str) != ')') WF;
        p_str++; // ')'
        ret = expr;
    } else {
        Val name = Name();
        if (*p_str && (*p_str) == '(') {ret = Func(name.val.var, NULL);}
        else ret = name;
    }

    while ((*p_str) && (*p_str) == '.') {
        ret = Method(&ret);
    }
    return ret;
}

Val Func(char name, const Val* self){
    // self has already calculated
    Val *args = malloc(121 * sizeof(Val)); 
    int argc = 0;
    if (!(*p_str) || (*p_str) != '(') WF;
    p_str++; // '('
    
    if (self != NULL) {
        args[argc++] = *self;
    }
    Val expr = Expr();
    args[argc++] = expr;
    while (*p_str && (*p_str) == ',') {
        p_str++;
        expr = Expr();
        args[argc++] = expr;
    }
    if (!(*p_str) || (*p_str) != ')') WF;
    p_str++;
    printf("%c", name);
    for (int i = 0; i < argc; i++) {
        putchar(' ');
        disp(&args[i]);
    }
    putchar('\n');
    free(args);
    return ValRes(SAVE);
}

Val Method(const Val *self){
    if (!(*p_str) || (*p_str) != '.') WF;
    p_str++;
    Val name = Name();
    return Func(name.val.var, self);
}
Val Name(){
    if (!(*p_str) || (*p_str) < 'a' || (*p_str) > 'z') WF;
    char name = *(p_str++);
    return ValVar(name);
}

Added by seanmayhew on Sat, 05 Mar 2022 20:59:22 +0200