Series: using python+antlr to analyze hive sql to obtain data lineage


Part 3 of this series finished the basic AST traversal.

Before going deeper into extracting table names and column names from SQL, we need to solve the two practical problems left over from Part 3: semicolons and letter case.

Semicolon problem

The semicolon problem shows up like this: the automatically generated code can only parse a single statement. It reports an error on SQL text containing multiple statements, and even on a single statement with one extra semicolon at the end. For example:

SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5  FROM db2.tb2 a1 ;

And this kind of input:

SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5  FROM db2.tb2 a1 ;
SELECT c6  FROM tb3 ;


The problem is easy to locate. It's in HiveParser.g, specifically in the rule corresponding to the statement() method:

// starting rule
statement
	: explainStatement EOF
	| execStatement EOF
	;

The notation of this arcane script is very similar to regular expressions. Translated into plain language, it means HiveParser accepts input consisting of a statement, and a statement is either an explainStatement followed by EOF, or an execStatement followed by EOF. explainStatement and execStatement are single statements of specific types.
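Since the notation reads like a regular expression, the original rule's behavior can be imitated with an ordinary Python regex; here the made-up token stmt stands in for a whole explainStatement/execStatement, and $ plays the role of EOF:

```python
import re

# Analogy for the original rule: exactly one statement, then EOF.
single = re.compile(r"stmt$")

print(bool(single.match("stmt")))       # a single statement is accepted
print(bool(single.match("stmt;")))      # even a trailing semicolon fails
print(bool(single.match("stmt;stmt")))  # two statements fail as well
```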

This does not match the experience of actually using Hive: the most common way to submit Hive statements is through the hive client, which certainly handles multiple statements. There are two obvious ways to solve the semicolon problem:

  1. Pull in a larger portion of the Hive source code
  2. Modify the grammar rules in HiveParser.g

But how to choose? The author searched the Hive source code from top to bottom and combined it with how Hive is actually used to infer the following:

  • The grammar rules in HiveParser.g are only invoked by code deep inside the Hive execution engine, and the input at the call site has already been trimmed and constrained to contain only a single statement
  • The most common way to submit Hive statements is through the hive client, which in turn is translated to, or invoked over, the thrift or jdbc protocols
  • Scripts that the hive client can process may contain conf settings and runtime parameter substitution, so statements need to be preprocessed before the hive sql is executed

With the above inferences it becomes clear: lineage processing does not need the full processing capabilities of the hive client, and the effort of carving code out of the Hive source is not very predictable. So option 2 it is: modify the grammar rules in HiveParser.g.

If you are still interested in option 1, you can start by studying the classes that sit in the same directory in the source code.

Rule revision

The current rule is called statement, and EOF comes right after one statement. For anyone used to writing regular expressions, the natural thought is: what if the statement rule could be matched multiple times, separated by semicolons? The successful result is as follows:

// starting rule
statements
    : statement (SEMICOLON statement)* SEMICOLON* EOF
    ;

statement
    : explainStatement | execStatement
    ;

Remove the EOF from the original statement rule, and merge its two alternatives onto one line.

Then add a rule with a plural name, statements. SEMICOLON comes from its definition in HiveLexer.g and denotes the semicolon token. In a .g file, parentheses () group their contents so they are treated as a whole, and * works the same as in regular expressions: the preceding part may appear zero or more times. The combined effect is that a piece of text may contain multiple statements separated by single semicolons, and the semicolon after the last statement may be omitted.
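By the same regex analogy (again with the made-up token stmt standing in for a statement), the new statements rule accepts exactly what the prose above describes:

```python
import re

# Analogy for: statement (SEMICOLON statement)* SEMICOLON* EOF
multi = re.compile(r"stmt(;stmt)*;*$")

print(bool(multi.match("stmt")))        # one statement, no semicolon
print(bool(multi.match("stmt;")))       # trailing semicolon is allowed
print(bool(multi.match("stmt;stmt;")))  # multiple statements work
print(bool(multi.match("stmt;;stmt")))  # but only single semicolons may separate them
```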

Verify output

The test SQL is as follows:

SELECT DISTINCT a1.c1 AS c2,
 a1.c3 AS c4,
 '' c5
 FROM db2.tb2 a1 ;
SELECT c6 FROM tb3 ;

After the modification, regenerate the Java code and compile it into classes:

java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g

javac -cp antlr-3.4-complete.jar HiveParser*.java

Continue with the AST traversal script from the previous article. The script is fairly long and needs another change later, so it is not pasted again; the output is copied below:


As the output shows, no errors are reported. The tree contains two TOK_QUERY nodes, corresponding to the two SELECT statements.

Case problem

The case problem shows up as follows: with the automatically generated code, keywords in the input SQL text must be uppercase; a lowercase keyword is recognized as an identifier instead, and parsing then fails because the input no longer conforms to the grammar rules.

Similar to the previous semicolon problem, there are two options

  1. Pull in a larger portion of the Hive source code
  2. Modify the grammar rules in HiveLexer.g

This time the choice differs from the semicolon problem. First of all, ANTLR's own documentation describes two completely different ways to handle the case of input keywords:

  1. Preprocess the input, normalizing all content to uppercase (or lowercase)
  2. Define keywords case-insensitively; if every keyword needs case-insensitive handling, every keyword rule must be redefined

The first point is easy to understand; the second may be a little obscure, so let's use the keyword select as an example.

In HiveLexer.g, the select keyword is defined as follows:

KW_SELECT : 'SELECT';
If you want case-insensitive handling, one possible way to write it is like this:

KW_SELECT : ('s'|'S')('e'|'E')('l'|'L')('e'|'E')('c'|'C')('t'|'T');
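Hand-writing such alternations for every keyword would be tedious, which is part of why this option loses out. Just to illustrate, a small Python helper (my own sketch, not part of Hive or ANTLR) could generate the case-insensitive form mechanically:

```python
def nocase_rule(keyword):
    """Build an ANTLR3-style case-insensitive alternation for one keyword."""
    return "".join(f"('{c.lower()}'|'{c.upper()}')" for c in keyword)

print(f"KW_SELECT : {nocase_rule('select')};")
# KW_SELECT : ('s'|'S')('e'|'E')('l'|'L')('e'|'E')('c'|'C')('t'|'T');
```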

This change works, but it is a bit rigid. More to the point, Hive clearly does not handle case this way, and large-scale direct edits to HiveLexer.g risk making the generated code behave differently from Hive itself, a thankless outcome. So the right approach is to preprocess the input and normalize all content to uppercase.


There are two ways to do the normalization:

  1. Pull in more of the Hive source code and implement the normalization in Java
  2. Write Python code to do the normalization
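A caveat if one were to pick option 2: a blanket str.upper() would also uppercase string literals and silently change query semantics. A literal-aware sketch (my own simplification; it ignores escaped quotes and comments) would be needed instead:

```python
import re

def normalize_keywords(sql):
    # Uppercase everything except single-quoted string literals.
    parts = re.split(r"('(?:[^']|'')*')", sql)
    return "".join(p if p.startswith("'") else p.upper() for p in parts)

print(normalize_keywords("select 'Ab' c5 from tb1"))
# SELECT 'Ab' C5 FROM TB1  -- keywords normalized, the literal kept intact
```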

In keeping with the goal of staying aligned with Hive itself and making as few changes as possible, we chose to pull in more of the Hive source code.

Of course, it is also because the normalization code in the Hive source is very short and easy to extract. The Hive source mentioned in the previous section defines an inner class, ANTLRNoCaseStringStream, which wraps the input character stream and normalizes characters to uppercase. Extract this part of the code and place it alongside the previously handled grammar files. The code is as follows:

package grammar.hive110;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class ANTLRNoCaseStringStream extends ANTLRStringStream {

  public ANTLRNoCaseStringStream(String input) {
    super(input);
  }

  public int LA(int i) {
    int returnChar = super.LA(i);
    if (returnChar == CharStream.EOF) {
      return returnChar;
    } else if (returnChar == 0) {
      return returnChar;
    }
    return Character.toUpperCase((char) returnChar);
  }
}
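A point worth noting (based on how ANTLR 3's ANTLRStringStream works, as far as I can tell): only the lookahead the lexer sees is uppercased, while token text is later cut from the untouched underlying data, so string literals keep their original case. A minimal Python sketch of the same idea, with hypothetical names:

```python
class NoCaseStream:
    """Sketch of the ANTLRNoCaseStringStream idea: lookahead is uppercased,
    while the stored data stays untouched, so literals keep their case."""
    EOF = -1  # stands in for CharStream.EOF

    def __init__(self, data):
        self.data = data
        self.p = 0  # current position, as in ANTLRStringStream

    def LA(self, i):
        idx = self.p + i - 1
        if idx >= len(self.data):
            return self.EOF
        return self.data[idx].upper()  # keyword matching sees uppercase

    def substring(self, start, stop):
        return self.data[start:stop + 1]  # token text: original case preserved

stream = NoCaseStream("select 'Ab'")
print(stream.LA(1))             # S
print(stream.substring(7, 10))  # 'Ab'
```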

Code revision

Because the normalization code has been added, everything needs to be recompiled, and the tree-generation code also changes slightly.

Add the new input file at compile time:

java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g

javac -cp antlr-3.4-complete.jar ANTLRNoCaseStringStream.java HiveParser*.java

The tree-generation code from before is reused, but because the grammar rules were modified, the name of the parse entry method has to change as well. The revised Python code is as follows:

import jnius_config
# The classpath must be set before importing jnius; paths as configured in the earlier articles.
jnius_config.set_classpath('.', 'antlr-3.4-complete.jar')
import jnius

StringStream = jnius.autoclass('grammar.hive110.ANTLRNoCaseStringStream')
Lexer  = jnius.autoclass('grammar.hive110.HiveLexer')
Parser  = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream  = jnius.autoclass('org.antlr.runtime.CommonTokenStream')

sql_string = (
    "SELECT DISTINCT a1.c1 AS c2,\n"
    " a1.c3 AS c4,\n"
    " '' c5\n"
    " FROM db2.tb2 AS a1 ;\n"
)
sqlstream = StringStream(sql_string)
inst = Lexer(sqlstream)
ts = TokenStream(inst)
parser = Parser(ts)
ret  = parser.statements()
treeroot = ret.getTree()


Keywords: hive Java SQL Python

Added by mortal991 on Thu, 16 Jan 2020 08:58:24 +0200