One trick of SQL day: how to use HQL to extract string elements from fixed positions [explain Hive string position lookup function]

catalogue

0 problem description

1 problem solving

2 Summary

0 problem description

SQL extracts string elements from a fixed location. You have a string that contains a continuous piece of log data. You want to parse the string and extract some information from it. However, the information you need does not exist in the fixed position of the string. Therefore, you must use some characters near the target information to locate and extract the required content. For example, consider the following string.

xxxxxabc[867]xxx[-]xxxx[5309]xxxxx
xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx
call:[F_GET_ROWS( )]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx
film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx

You want to extract the values in square brackets and return the result set shown below.

FIRST_VAL       SECOND_VAL      LAST_VAL
--------------- --------------- ---------------
867            -                5309
11271978       4                Joe
F_GET_ROWS( )  ROSEWOOD...SIR   44400002
non_marked     unit             withabanana?

1 problem solving

Main idea: analyze the string and grasp the main characteristics.

Although we do not know the exact position of the characters we need, we are sure that they are contained in the square brackets "[]" and know that there are three sets of such values. If we can find the position of square brackets and use the built-in function SUBSTR to extract the required value from the string, this problem can be solved. Therefore, the key to this problem is to find the position of characteristic square brackets.

(1) Data preparation

create table log
as
select 'xxxxxabc[867]xxx[-]xxxx[5309]xxxxx' msg
 union all
select 'xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx' msg
 union all
select 'call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx' msg
 union all
select 'film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx' msg

(2) Data analysis

There are several functions to get string position in Hive:

  • instr(String str, String substr) function
return str For the first time in substr Index of

instr Function returns a string str Neutron string substr The position of the first occurrence, where the position of the first character is 1,If str Not included substr

Return 0.

eg:

0: jdbc:hive2://10.9.4.117:10000> select instr("abcde",'b');
+------+--+
| _c0  |
+------+--+
| 2    |
+------+--+

However, it should be noted that in hive, this function can only obtain the location of the first occurrence of the string, that is, it can only pass two parameters, and cannot obtain the location of the second and third occurrence. In oracle database, this function can obtain the location of the second and third occurrence

  • Collection lookup function: find_in_set

Syntax: find_in_set(string str, string strList) 
Return value: int
Description: returns the first occurrence of STR in strlist. Strlist is a comma separated string. If the str character is not found, 0 is returned

hive> select find_in_set('de','ef,ab,de') ;
3
hive> select find_in_set('at','ef,ab,de') ;
0

Similarly, this function can only get the position where the string first appears, and the string is a collection of strings separated by commas.

  • String lookup function: locate

Syntax: locate(string substr, string str, int pos)
Return value: int
Description: returns the string substr. In str, find the position of the first occurrence after pos

Examples are as follows:

hive> select locate('a','abcda',2) ;
OK
5
Time taken: 0.14 seconds, Fetched: 1 row(s)
hive> select locate('a','abcdabaaaa',2) ;
OK
5
Time taken: 0.182 seconds, Fetched: 1 row(s)
hive> select locate('a','abcdabaaaa',3) ;
OK
5
Time taken: 0.167 seconds, Fetched: 1 row(s)

By comparing the string position finding function in Hive above, it is obvious that the locate() function is appropriate, but it needs to be converted appropriately.

The details are as follows:

select msg
      ,locate('[',msg,1) as s_1
      ,locate(']',msg,1) as e_1
      ,locate('[',msg,locate('[',msg,1)+1) s_2
      ,locate(']',msg,locate(']',msg,1)+1) e_2
      ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3
      ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3
from log

Note the boundary problem: it is generally closed on the left and open on the right. Therefore, when calculating the second and third indexes, we add 1 during nested calculation. The calculation results are as follows:

xxxxxabc[867]xxx[-]xxxx[5309]xxxxx	9	13	17	19	24	29
xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx	11	20	28	30	34	38
call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx	6	19	23	38	42	51
film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx	6	17	21	26	36	49
Time taken: 0.677 seconds, Fetched: 4 row(s)

For the above results, we use the substr() function to cut the string according to the obtained index position to obtain the corresponding value.

select substr(msg,s_1+1,e_1-s_1-1) as fst_val
      ,substr(msg,s_2+1,e_2-s_2-1) as snd_val
      ,substr(msg,s_3+1,e_3-s_3-1) as thd_val
from(
    select msg
          ,locate('[',msg,1) as s_1
          ,locate(']',msg,1) as e_1
          ,locate('[',msg,locate('[',msg,1)+1) s_2
          ,locate(']',msg,locate(']',msg,1)+1) e_2
          ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3
          ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3
    from log
) t

It should be noted here that when extracting the value, the actual position is the square bracket "[" index value + 1. Intercept the index value at the "]" position, subtract the index value at the "[" position, and then subtract 1. The results are as follows:
 

867	        -	        5309
11271978	4	        Joe
F_GET_ROWS()	ROSEWOOD...SIR	44400002
non_marked	unit	        withabanana?

Code optimization and adjustment are as follows:

select substr(msg,
        locate('[',msg,1)+1,
        locate(']',msg,1)-locate('[',msg,1)-1) as fst_val
      ,substr(msg,
        locate('[',msg,
        locate('[',msg,1)+1)+1,
        locate(']',msg,locate(']',msg,1)+1)-
        locate('[',msg,locate('[',msg,1)+1)-1) as snd_val
      ,substr(msg,
        locate('[',msg,
        locate('[',msg,locate('[',msg,1)+1)+1)+1,
        locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1)-
        locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1)-1) as thd_val
from log

2 Summary

This paper analyzes the problem of extracting string elements from fixed positions through HQL language. Its solution mainly adopts locate() function and substr() function. This paper analyzes the string position function in Hive in detail. Through this article, you can gain the following:

  • (1) Methods and skills of how to extract string elements from fixed positions through HQL
  • (2) Master the use method of string position acquisition function in Hive and use it in actual business

Keywords: Big Data hive SQL

Added by colbyg on Wed, 22 Dec 2021 20:23:21 +0200