catalogue
0 problem description
SQL extracts string elements from a fixed location. You have a string that contains a continuous piece of log data. You want to parse the string and extract some information from it. However, the information you need does not exist in the fixed position of the string. Therefore, you must use some characters near the target information to locate and extract the required content. For example, consider the following string.
xxxxxabc[867]xxx[-]xxxx[5309]xxxxx xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx call:[F_GET_ROWS( )]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx
You want to extract the values in square brackets and return the result set shown below.
FIRST_VAL SECOND_VAL LAST_VAL --------------- --------------- --------------- 867 - 5309 11271978 4 Joe F_GET_ROWS( ) ROSEWOOD...SIR 44400002 non_marked unit withabanana?
1 problem solving
Main idea: analyze the string and grasp the main characteristics.
Although we do not know the exact position of the characters we need, we are sure that they are contained in the square brackets "[]" and know that there are three sets of such values. If we can find the position of square brackets and use the built-in function SUBSTR to extract the required value from the string, this problem can be solved. Therefore, the key to this problem is to find the position of characteristic square brackets.
(1) Data preparation
create table log as select 'xxxxxabc[867]xxx[-]xxxx[5309]xxxxx' msg union all select 'xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx' msg union all select 'call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx' msg union all select 'film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx' msg
(2) Data analysis
There are several functions to get string position in Hive:
- instr(String str, String substr) function
return str For the first time in substr Index of instr Function returns a string str Neutron string substr The position of the first occurrence, where the position of the first character is 1,If str Not included substr Return 0.
eg:
0: jdbc:hive2://10.9.4.117:10000> select instr("abcde",'b'); +------+--+ | _c0 | +------+--+ | 2 | +------+--+
However, it should be noted that in hive, this function can only obtain the location of the first occurrence of the string, that is, it can only pass two parameters, and cannot obtain the location of the second and third occurrence. In oracle database, this function can obtain the location of the second and third occurrence
- Collection lookup function: find_in_set
Syntax: find_in_set(string str, string strList)
Return value: int
Description: returns the first occurrence of STR in strlist. Strlist is a comma separated string. If the str character is not found, 0 is returned
hive> select find_in_set('de','ef,ab,de') ; 3 hive> select find_in_set('at','ef,ab,de') ; 0
Similarly, this function can only get the position where the string first appears, and the string is a collection of strings separated by commas.
- String lookup function: locate
Syntax: locate(string substr, string str, int pos)
Return value: int
Description: returns the string substr. In str, find the position of the first occurrence after pos
Examples are as follows:
hive> select locate('a','abcda',2) ; OK 5 Time taken: 0.14 seconds, Fetched: 1 row(s) hive> select locate('a','abcdabaaaa',2) ; OK 5 Time taken: 0.182 seconds, Fetched: 1 row(s) hive> select locate('a','abcdabaaaa',3) ; OK 5 Time taken: 0.167 seconds, Fetched: 1 row(s)
By comparing the string position finding function in Hive above, it is obvious that the locate() function is appropriate, but it needs to be converted appropriately.
The details are as follows:
select msg ,locate('[',msg,1) as s_1 ,locate(']',msg,1) as e_1 ,locate('[',msg,locate('[',msg,1)+1) s_2 ,locate(']',msg,locate(']',msg,1)+1) e_2 ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3 ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3 from log
Note the boundary problem: it is generally closed on the left and open on the right. Therefore, when calculating the second and third indexes, we add 1 during nested calculation. The calculation results are as follows:
xxxxxabc[867]xxx[-]xxxx[5309]xxxxx 9 13 17 19 24 29 xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx 11 20 28 30 34 38 call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx 6 19 23 38 42 51 film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx 6 17 21 26 36 49 Time taken: 0.677 seconds, Fetched: 4 row(s)
For the above results, we use the substr() function to cut the string according to the obtained index position to obtain the corresponding value.
select substr(msg,s_1+1,e_1-s_1-1) as fst_val ,substr(msg,s_2+1,e_2-s_2-1) as snd_val ,substr(msg,s_3+1,e_3-s_3-1) as thd_val from( select msg ,locate('[',msg,1) as s_1 ,locate(']',msg,1) as e_1 ,locate('[',msg,locate('[',msg,1)+1) s_2 ,locate(']',msg,locate(']',msg,1)+1) e_2 ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3 ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3 from log ) t
It should be noted here that when extracting the value, the actual position is the square bracket "[" index value + 1. Intercept the index value at the "]" position, subtract the index value at the "[" position, and then subtract 1. The results are as follows:
867 - 5309 11271978 4 Joe F_GET_ROWS() ROSEWOOD...SIR 44400002 non_marked unit withabanana?
Code optimization and adjustment are as follows:
select substr(msg, locate('[',msg,1)+1, locate(']',msg,1)-locate('[',msg,1)-1) as fst_val ,substr(msg, locate('[',msg, locate('[',msg,1)+1)+1, locate(']',msg,locate(']',msg,1)+1)- locate('[',msg,locate('[',msg,1)+1)-1) as snd_val ,substr(msg, locate('[',msg, locate('[',msg,locate('[',msg,1)+1)+1)+1, locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1)- locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1)-1) as thd_val from log
2 Summary
This paper analyzes the problem of extracting string elements from fixed positions through HQL language. Its solution mainly adopts locate() function and substr() function. This paper analyzes the string position function in Hive in detail. Through this article, you can gain the following:
- (1) Methods and skills of how to extract string elements from fixed positions through HQL
- (2) Master the use method of string position acquisition function in Hive and use it in actual business