Series: using Python + ANTLR to analyze Hive SQL and obtain data lineage

Goal

In Part 3 of this series, the basic AST traversal was completed.

Before going deeper into extracting table and column names from SQL, we need to solve the two practical problems left over from Part 3: semicolons and letter case.

Semicolon problem

The semicolon problem shows up in the automatically generated HiveParser.java: it can only parse a single statement. It reports an error on SQL text containing multiple statements, or even on a single statement with one extra semicolon at the end. For example

SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5  FROM db2.tb2 a1 ;

And also on text like this

SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5  FROM db2.tb2 a1 ;
SELECT c6  FROM tb3 ;

Reason

The problem is easy to find. It lies in HiveParser.g, specifically in the rule corresponding to the statement() method.

// starting rule
statement
	: explainStatement EOF
	| execStatement EOF
	;

The notation of this arcane-looking grammar is very similar to regular expressions. Roughly translated into plain language: HiveParser accepts input matching the statement rule, and a statement consists of either an explainStatement followed by EOF, or an execStatement followed by EOF. explainStatement and execStatement are single statements of specific types.

This does not match the experience of actually using Hive. The most common way to submit Hive statements is through the Hive client, which certainly can handle multiple statements. There are two obvious ways to solve the semicolon problem

  1. Expand the scope of using hive source code
  2. Modify the syntax rules in HiveParser.g

But how to choose? The author searched through the Hive source code and, combining that with how Hive is actually used, made the following inferences.

  • The grammar rules in HiveParser.g are only invoked by code deep inside Hive's execution engine, and by the time they are invoked the input has already been clipped and constrained to a single statement
  • The most common way to submit Hive statements is through the Hive client, which itself is translated to, or invoked over, the Thrift or JDBC protocol
  • Scripts that the Hive client can process may contain conf settings and runtime parameter substitution, so the statements need to be preprocessed before the Hive SQL is executed

Given the above inferences, it is clear that lineage analysis does not need the full processing capability of the Hive client, and the workload of carving out Hive source code is not very controllable. So option 2 is chosen: modify the grammar rules in HiveParser.g.

If you are still interested in option 1, you can start by studying the ParseDriver.java class in the same source directory.

Rule revision

The current rule is called statement, with EOF ending it after a single statement. Anyone used to writing regular expressions will naturally wonder: what if the statement rule could be matched multiple times, separated by semicolons? The working result is as follows

// starting rule
statements
    : statement (SEMICOLON statement )* SEMICOLON* EOF
    ;
statement
    : explainStatement | execStatement
    ;

Remove the EOF from the original statement rule, and collapse its two alternatives onto one line.

Then add a new rule with the plural name statements. SEMICOLON comes from its definition in HiveLexer.g and denotes the semicolon token. In a .g file, parentheses () group their contents to be matched as a whole, and * works as in regular expressions, meaning the preceding part may appear zero or more times. The combined effect is that a piece of text may contain multiple statements, separated by single semicolons, and the semicolon after the last statement may be omitted.
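
As a rough analogy only (regular expressions, not the real parser), the old and new rules behave like the patterns below, where a whole single statement is crudely approximated as any run of non-semicolon characters and fullmatch plays the role of EOF:

```python
import re

# Crude analogy: a "statement" is approximated as a run of non-semicolon
# characters. The comments name the grammar rules being imitated.
old_rule = re.compile(r"[^;]+")             # statement EOF
new_rule = re.compile(r"[^;]+(;[^;]+)*;*")  # statement (';' statement)* ';'* EOF

print(old_rule.fullmatch("SELECT c6 FROM tb3") is not None)     # single statement: accepted
print(old_rule.fullmatch("SELECT c6 FROM tb3 ;") is not None)   # trailing ';': rejected
print(new_rule.fullmatch("SELECT c6 FROM tb3 ;") is not None)   # new rule: accepted
print(new_rule.fullmatch("SELECT 1 ; SELECT 2 ;") is not None)  # multiple statements: accepted
```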

Verify output

The test SQL statements are as follows

SELECT DISTINCT a1.c1 AS c2,
 a1.c3 AS c4,
 '' c5
 FROM db2.tb2 a1 ;
SELECT c6 FROM tb3

After the modification, the Java code must be regenerated and compiled into class files

java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g

javac -cp antlr-3.4-complete.jar HiveLexer.java HiveParser*.java ParseError.java

Continue to use the AST traversal script written in the previous article. The script is fairly long and will need another change later, so it is not pasted again; the output is as follows

None=0
  TOK_QUERY=777
    TOK_FROM=681
      TOK_TABREF=864
        TOK_TABNAME=863
          db2=26
          tb2=26
        a1=26
    TOK_INSERT=707
      TOK_DESTINATION=660
        TOK_TAB=835
          TOK_TABNAME=863
            db1=26
            tb1=26
      TOK_SELECTDI=792
        TOK_SELEXPR=793
          .=17
            TOK_TABLE_OR_COL=860
              a1=26
            c1=26
          c2=26
        TOK_SELEXPR=793
          .=17
            TOK_TABLE_OR_COL=860
              a1=26
            c3=26
          c4=26
        TOK_SELEXPR=793
          ''=302
          c5=26
  ;=299
  TOK_QUERY=777
    TOK_FROM=681
      TOK_TABREF=864
        TOK_TABNAME=863
          tb3=26
    TOK_INSERT=707
      TOK_DESTINATION=660
        TOK_DIR=661
          TOK_TMP_FILE=873
      TOK_SELECT=791
        TOK_SELEXPR=793
          TOK_TABLE_OR_COL=860
            c6=26
  <EOF>=-1

As you can see, no errors are reported. The output tree contains two TOK_QUERY nodes, corresponding to the two SELECT statements.

Case problem

The case problem shows up in the automatically generated HiveParser.java as follows: keywords in the input SQL text must be uppercase. A lowercase keyword is recognized as an identifier, and parsing then fails because the input no longer conforms to the grammar rules.

Similar to the previous semicolon problem, there are two options

  1. Expand the scope of using hive source code
  2. Modify the syntax rules in HiveParser.g

This time the choice differs from the semicolon problem. First, ANTLR's own documentation describes two completely different ways to handle the case of the input.

  1. Preprocess the input, normalizing all content to upper case (or lower case)
  2. Define the keywords case-insensitively; if all keywords need case-insensitive handling, every keyword rule must be redefined

The first option is easy to understand; the second may be a little obscure, so let's use the keyword SELECT as an example.

In HiveLexer.g, the SELECT keyword is defined as follows

KW_SELECT : 'SELECT';

To make it case-insensitive, one possible way to write it is

KW_SELECT : ('s'|'S')('e'|'E')('l'|'L')('e'|'E')('c'|'C')('t'|'T');
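
For readers who think in regular expressions: this alternation spelling emulates what a regex engine expresses with a single case-insensitivity flag (ANTLR 3 lexer rules have no such flag). In Python, for instance:

```python
import re

# The ('s'|'S')('e'|'E')... spelling plays the same role as a
# case-insensitivity flag in a regex engine.
kw_select = re.compile(r"SELECT", re.IGNORECASE)

print(kw_select.fullmatch("select") is not None)  # True
print(kw_select.fullmatch("SeLeCt") is not None)  # True
```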

This change works, but it is rather clumsy. And it is plainly not how Hive handles the issue: if you modify HiveLexer.g on a large scale, it would be a thankless effort should the generated code end up behaving inconsistently with Hive itself. So the suitable approach is to preprocess the input and normalize all content to uppercase.

Normalization

There are two ways to normalize

  1. Reuse more of the Hive source code and do the normalization in Java
  2. Write Python code to do the normalization

In keeping with the goal of staying close to Hive's own behavior while making as few changes as possible, we chose to reuse more of the Hive source code.

Of course, this is also because the normalization code in the Hive source is very short and easy to extract.
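
As a side note (an illustration of mine, not from Hive): the reason a one-line Python str.upper() over the whole SQL text would be risky is that it also uppercases string literals, changing the query's data values:

```python
# Naively uppercasing the whole SQL text also rewrites the contents of
# string literals, which changes what the query actually selects.
sql = "select 'Hello World' c5 from tb3"
print(sql.upper())  # SELECT 'HELLO WORLD' C5 FROM TB3  <- literal altered
```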

The ParseDriver.java mentioned in the previous section defines an inner class, ANTLRNoCaseStringStream, which wraps the input character stream and normalizes characters to uppercase. Extract this part of the code and handle it like the earlier ParseError.java. The code of ANTLRNoCaseStringStream.java is as follows.

package grammar.hive110;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class ANTLRNoCaseStringStream extends ANTLRStringStream {

  public ANTLRNoCaseStringStream(String input) {
    super(input);
  }

  @Override
  public int LA(int i) {
    int returnChar = super.LA(i);
    if (returnChar == CharStream.EOF) {
      return returnChar;
    } else if (returnChar == 0) {
      return returnChar;
    }
    return Character.toUpperCase((char) returnChar);
  }
}
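
The subtle benefit of this stream-level approach: only the lookahead characters that the lexer matches against are uppercased, while the underlying buffer, from which token text such as identifiers and string literals is extracted, keeps its original case. The following Python sketch mimics the idea; the class and method names mirror the ANTLR API, but this is an illustration, not the real stream:

```python
class NoCaseStringStream:
    """Python sketch of the ANTLRNoCaseStringStream idea: LA(i) returns an
    uppercased lookahead character for keyword matching, while the
    underlying buffer keeps the original case for token text."""
    EOF = -1

    def __init__(self, data: str):
        self.data = data
        self.p = 0  # current stream position

    def LA(self, i: int) -> int:
        idx = self.p + i - 1
        if idx >= len(self.data):
            return self.EOF
        return ord(self.data[idx].upper())

    def substring(self, start: int, stop: int) -> str:
        # token text is cut from the original, case-preserving buffer
        return self.data[start:stop + 1]

s = NoCaseStringStream("select 'Hello' from tb3")
print(chr(s.LA(1)))       # S      -- the lexer "sees" uppercase
print(s.substring(0, 5))  # select -- original case preserved
```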

Code revision

Because the normalization code was added, everything needs to be recompiled, and the tree-generating code also changes slightly

Add the new input file to the compile command

java -jar antlr-3.4-complete.jar HiveLexer.g HiveParser.g

javac -cp antlr-3.4-complete.jar HiveLexer.java HiveParser*.java ParseError.java ANTLRNoCaseStringStream.java

The tree-dumping code from before is reused, but because the grammar rule was modified, the name of the parse entry method has to change as well (from statement() to statements()).
The revised Python code is as follows

import jnius_config
jnius_config.set_classpath('./','./grammar/hive110/antlr-3.4-complete.jar')
import jnius

StringStream = jnius.autoclass('grammar.hive110.ANTLRNoCaseStringStream')
Lexer  = jnius.autoclass('grammar.hive110.HiveLexer')
Parser  = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream  = jnius.autoclass('org.antlr.runtime.CommonTokenStream')

sql_string = (
    "SELECT DISTINCT a1.c1 AS c2,\n"
    " a1.c3 AS c4,\n"
    " '' c5\n"
    " FROM db2.tb2 AS a1 ;\n"
    )

sqlstream = StringStream(sql_string)
inst = Lexer(sqlstream)
ts = TokenStream(inst)
parser = Parser(ts)
ret  = parser.statements()
treeroot = ret.getTree()
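
For reference, the dump shown earlier can be produced by a short recursive walk using the standard ANTLR CommonTree accessors getText(), getType(), getChildCount() and getChild(i), all reachable through jnius. A minimal sketch (the function name dump_tree is mine; the previous article's script is longer):

```python
def dump_tree(node, depth=0, lines=None):
    # Render each node as "<text>=<type>", indented two spaces per
    # nesting level -- the same shape as the output listed above.
    if lines is None:
        lines = []
    lines.append("  " * depth + "%s=%s" % (node.getText(), node.getType()))
    for i in range(node.getChildCount()):
        dump_tree(node.getChild(i), depth + 1, lines)
    return lines

# usage with the jnius objects from the script above:
# print("\n".join(dump_tree(treeroot)))
```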


Keywords: hive Java SQL Python

Added by mortal991 on Thu, 16 Jan 2020 08:58:24 +0200