Target
Part 3 of this series. The basic AST traversal was completed in the previous article.
Before going deeper into extracting table names and column names from SQL, we need to solve two remaining practical problems: semicolons and letter case.
Semicolon problem
The symptom of the semicolon problem is that the automatically generated HiveParser.java can only parse a single statement. It reports an error on SQL text containing multiple statements, and even on a single statement with one extra semicolon at the end. For example:
SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1 ;
and this kind of input:
SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1 ; SELECT c6 FROM tb3 ;
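For reference, a minimal repro sketch in Python, assuming the jnius setup from the previous article (the classpath and the compiled grammar.hive110 classes; getNumberOfSyntaxErrors() is inherited from the ANTLR 3 runtime):

import jnius_config
jnius_config.set_classpath('./', './grammar/hive110/antlr-3.4-complete.jar')
import jnius

StringStream = jnius.autoclass('org.antlr.runtime.ANTLRStringStream')
Lexer = jnius.autoclass('grammar.hive110.HiveLexer')
Parser = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream = jnius.autoclass('org.antlr.runtime.CommonTokenStream')

sql = "SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1 ;"
parser = Parser(TokenStream(Lexer(StringStream(sql))))
parser.statement()  # unmodified entry rule: one statement, then EOF
print(parser.getNumberOfSyntaxErrors())  # > 0: the trailing ';' is not EOF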
Reason
The cause is easy to find. It is in HiveParser.g, specifically in the rule corresponding to the statement() method.
// starting rule
statement
    : explainStatement EOF
    | execStatement EOF
    ;
The notation of this cryptic grammar file is very similar to regular expressions. Translated into plain language: HiveParser accepts a statement as input, and a statement consists of either an explainStatement followed by EOF, or an execStatement followed by EOF. explainStatement and execStatement are single statements of specific types, so the moment the parser sees a semicolon where it expects EOF, parsing fails.
This does not match the everyday experience of using Hive: the most common way to submit Hive statements is through the Hive client, which certainly handles multiple statements. There are two obvious ways to solve the semicolon problem:
- Pull in a larger portion of the Hive source code
- Modify the syntax rules in HiveParser.g
But how to choose? The author searched the Hive source code up and down, then combined that with how Hive is actually used to infer the following:
- The syntax rules in HiveParser.g are invoked only by code buried deep in the Hive execution engine, and by the time they are invoked the input has already been trimmed and constrained to a single statement
- The most common way to submit Hive statements is through the Hive client, which in turn goes through the Thrift or JDBC protocol
- A script that the Hive client can process may contain conf settings and runtime parameter substitution, so the statements would have to be preprocessed before executing the Hive SQL anyway
With these inferences it becomes clear that lineage processing does not need the full processing capability of the Hive client, while the workload of carving code out of the Hive source is hard to bound. So the choice is option 2: modify the syntax rules in HiveParser.g.
If you are still interested in option 1, a good starting point is the ParseDriver.java class in the same source directory.
Rule revision
Right now the rule is statement, and EOF comes after a single statement. Anyone who writes regular expressions will naturally wonder: what if the statement rule could match multiple times, separated by semicolons? The working result is as follows:
// starting rule
statements
    : statement (SEMICOLON statement)* SEMICOLON* EOF
    ;

statement
    : explainStatement
    | execStatement
    ;
The EOF is removed from the original statement rule, so its two alternatives become plain explainStatement and execStatement. Then a new rule with the plural name statements is added on top. SEMICOLON comes from its definition in HiveLexer.g and denotes the semicolon token. In a .g file, parentheses () group their contents so they are matched as a unit, and the * behaves just as in regular expressions: the part in front of it may appear zero or more times. The combined effect is that one piece of text may contain multiple statements, separated by single semicolons, and the semicolon after the last statement may be omitted.
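A hedged sketch of what the new rule should accept (it reuses the setup from the sketch above and only runs after regenerating and recompiling the code as shown in the next section):

for sql in [
    "SELECT c6 FROM tb3",                         # last semicolon omitted
    "SELECT c6 FROM tb3 ;",                       # trailing semicolon kept
    "SELECT c6 FROM tb3 ; SELECT c6 FROM tb3 ;",  # multiple statements
]:
    parser = Parser(TokenStream(Lexer(StringStream(sql))))
    parser.statements()  # the new plural entry rule
    print(parser.getNumberOfSyntaxErrors())  # expect 0 for each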
Verify output
The test SQL is as follows:
SELECT DISTINCT a1.c1 AS c2, a1.c3 AS c4, '' c5 FROM db2.tb2 a1 ; SELECT c6 FROM tb3
After the modification, the Java code has to be regenerated and recompiled into classes:
java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g
javac -cp antlr-3.4-complete.jar HiveLexer.java HiveParser*.java ParseError.java
Continue with the AST traversal script written in the previous article. The script is fairly long and is needed again for the next problem, so instead of pasting it repeatedly, here is just its output:
None=0
TOK_QUERY=777
TOK_FROM=681
TOK_TABREF=864
TOK_TABNAME=863
db2=26
tb2=26
a1=26
TOK_INSERT=707
TOK_DESTINATION=660
TOK_TAB=835
TOK_TABNAME=863
db1=26
tb1=26
TOK_SELECTDI=792
TOK_SELEXPR=793
.=17
TOK_TABLE_OR_COL=860
a1=26
c1=26
c2=26
TOK_SELEXPR=793
.=17
TOK_TABLE_OR_COL=860
a1=26
c3=26
c4=26
TOK_SELEXPR=793
''=302
c5=26
;=299
TOK_QUERY=777
TOK_FROM=681
TOK_TABREF=864
TOK_TABNAME=863
tb3=26
TOK_INSERT=707
TOK_DESTINATION=660
TOK_DIR=661
TOK_TMP_FILE=873
TOK_SELECT=791
TOK_SELEXPR=793
TOK_TABLE_OR_COL=860
c6=26
<EOF>=-1
As you can see, no errors are reported, and the output tree contains two TOK_QUERY nodes, corresponding to the two SELECT statements.
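If you would rather check this programmatically than by eyeballing the output, here is a small sketch. It assumes the treeroot object from the traversal script; getText(), getChildCount() and getChild() are the standard ANTLR 3 tree methods:

def count_tok_query(node):
    # the root is a nil node whose getText() returns None, so compare safely
    n = 1 if node.getText() == 'TOK_QUERY' else 0
    for i in range(node.getChildCount()):
        n += count_tok_query(node.getChild(i))
    return n

print(count_tok_query(treeroot))  # expect 2: one per SELECT statement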
Case problem
The symptom of the case problem is that in the automatically generated HiveParser.java, keywords in the input SQL text may only be uppercase. A lowercase keyword is recognized as an identifier instead, and parsing then fails because the input no longer conforms to the syntax rules.
As with the semicolon problem, there are two options:
- Pull in a larger portion of the Hive source code
- Modify the syntax rules in HiveParser.g
This time the choice works out differently than for the semicolon problem. To begin with, ANTLR's own documentation describes two completely different ways to handle the case of the input:
- Preprocess the input, normalizing everything to uppercase (or lowercase)
- Define the keywords case-insensitively; if every keyword needs to be case-insensitive, every keyword rule has to be rewritten
The first point is easy to understand; the second may be a little obscure, so take the keyword SELECT as an example.
In HiveLexer.g, the SELECT keyword is defined as follows:
KW_SELECT : 'SELECT';
To make it case-insensitive, one possible way to write it is:
KW_SELECT : ('s'|'S')('e'|'E')('l'|'L')('e'|'E')('c'|'C')('t'|'T');
This change works, but it is rather clumsy. And the obvious fact is that Hive itself does not handle case this way: if you modify HiveLexer.g on a large scale, it is hard to guarantee that the newly generated code behaves consistently with Hive proper. So the suitable approach is to preprocess the input and normalize everything to uppercase.
Normalization
There are two ways to do the normalization:
- Pull in more of the Hive source code and do the normalization in Java
- Write the normalization in Python
In keeping with the principle of staying close to Hive's own behavior and changing as little as possible, we chose to pull in more of the Hive source code. Of course, it also helps that the normalization code in the Hive source is very short and easy to lift out.
ParseDriver.java, mentioned in the previous section, defines an inner class, ANTLRNoCaseStringStream, which wraps the input character stream and normalizes characters to uppercase. Extract this part of the code and handle it the same way as ParseError.java earlier. The code of ANTLRNoCaseStringStream.java is as follows:
package grammar.hive110;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class ANTLRNoCaseStringStream extends ANTLRStringStream {

    public ANTLRNoCaseStringStream(String input) {
        super(input);
    }

    // Only the lookahead LA() is uppercased, so the lexer matches keywords
    // case-insensitively while token text keeps the original case.
    @Override
    public int LA(int i) {
        int returnChar = super.LA(i);
        if (returnChar == CharStream.EOF) {
            return returnChar;
        } else if (returnChar == 0) {
            return returnChar;
        }
        return Character.toUpperCase((char) returnChar);
    }
}
Code revision
Because normalization code has been added, everything needs to be recompiled, and the tree-generating code also changes slightly.
Add the new input file at compile time:
java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g
javac -cp antlr-3.4-complete.jar HiveLexer.java HiveParser*.java ParseError.java ANTLRNoCaseStringStream.java
The tree is still generated with the earlier code, but because the syntax rules were modified, the name of the parse entry method has to change accordingly, from statement() to statements(), and the input stream is now wrapped in ANTLRNoCaseStringStream. The revised Python code is as follows:
import jnius_config
jnius_config.set_classpath('./', './grammar/hive110/antlr-3.4-complete.jar')
import jnius

# the input stream is now the uppercasing ANTLRNoCaseStringStream
StringStream = jnius.autoclass('grammar.hive110.ANTLRNoCaseStringStream')
Lexer = jnius.autoclass('grammar.hive110.HiveLexer')
Parser = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream = jnius.autoclass('org.antlr.runtime.CommonTokenStream')

sql_string = (
    "SELECT DISTINCT a1.c1 AS c2,\n"
    "       a1.c3 AS c4,\n"
    "       '' c5\n"
    "  FROM db2.tb2 AS a1 ;\n"
)

sqlstream = StringStream(sql_string)
inst = Lexer(sqlstream)
ts = TokenStream(inst)
parser = Parser(ts)
ret = parser.statements()  # the new plural entry rule
treeroot = ret.getTree()
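As a final hedged check, continuing with the objects defined in the script above (toStringTree() is the standard ANTLR 3 tree printer): a lowercase statement should now parse cleanly, and since only LA() is uppercased, identifiers in the tree keep their original case:

low_sql = "select c6 from tb3 ;"
low_parser = Parser(TokenStream(Lexer(StringStream(low_sql))))
low_ret = low_parser.statements()
print(low_parser.getNumberOfSyntaxErrors())  # expect 0
print(low_ret.getTree().toStringTree())      # keywords matched; tb3 stays lowercase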