impala series: Service API--select

Basic grammar

[WITH name AS (select_expression) [, ...] ]
SELECT
  [ALL | DISTINCT]
  [STRAIGHT_JOIN]
  expression [, expression ...]
FROM table_reference [, table_reference ...]
[[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
  JOIN table_reference
  [ON join_equality_clauses | USING (col1[, col2 ...]] ...
WHERE conditions
GROUP BY { column | expression [, ...] }
HAVING conditions
ORDER BY { column | expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
LIMIT expression [OFFSET expression]
[UNION [ALL] select_statement] ...]

table_reference := { table_name | (subquery) }
  [ TABLESAMPLE SYSTEM(percentage) [REPEATABLE(seed)] ]

joins

Using the join keyword
You can use the ON keyword to link or the USING keyword to connect

SELECT t1.c1, t2.c2 FROM t1 JOIN t2
  ON t1.id = t2.id and t1.type_flag = t2.type_flag
  WHERE t1.c1 > 100;

SELECT t1.c1, t2.c2 FROM t1 JOIN t2
  USING (id, type_flag)
  WHERE t1.c1 > 100;

Use "," separated form
The readability of this method is not very high, and it is easy to delete the connection conditions by mistake when changing the SQL conditions

SELECT t1.c1, t2.c2 FROM t1, t2
  WHERE
  t1.id = t2.id AND t1.type_flag = t2.type_flag
  AND t1.c1 > 100;

Self connection

SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;

External connection and internal connection
The INNER of internal connection can be omitted. When using internal connection, the data of two tables must exist at the same time. If either party has no data, it will be filtered out.

SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;

There are LEFT, RIGHT and FULL connections. If there are no these keywords, the default is LEFT connection, that is, if the table on the RIGHT has no corresponding data, it is null.

SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;

Non equivalent connection
All the previous ON, USING and WHERE are compared equally in the left and right tables. This kind of query is called equivalent connection. impala still supports non equivalent connection, but this kind of query should be used carefully to avoid generating huge result set.

SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;

Semi-joins
SEMI JOIN is rarely used. For the matched rows in the right table, the left half join only returns the data in the left table. No matter how many rows are matched in the right table, only one row in the left table is returned. The right half is just the opposite.

SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;

Anti-joins
ANTI JOIN returns unmatched data. For example, LEFT ANTI JOIN returns unmatched data in the left table and the right table.

order by

This statement will arrange the data in order at each node, and then send the data to the coordinator node for final sorting. If you want the first N items, you can directly add the limit keyword, which will also reduce the amount of resources used for sorting.
grammar

ORDER BY col_ref [, col_ref ...] [ASC | DESC] [NULLS FIRST | NULLS LAST]

col_ref ::= column_name | integer_literal

You can use order by 1 to sort according to the first column and specify the corresponding columns in the form of numbers, but you can't use non numeric columns, such as string, which may lead to successful query results, but the results are not what you want; In addition, select * cannot be expressed in this way.

When there is no limit keyword in the query, impala will query a huge result set, so when there is no memory, it will use the disk space to sort.

Sorting of complex types
If complex types are used directly, the query will fail.

CREATE TABLE games (id BIGINT, score ARRAY <BIGINT>) STORED AS PARQUET;
...use LOAD DATA to load externally created Parquet files into the table...
SELECT id FROM games ORDER BY score DESC;
ERROR: AnalysisException: ORDER BY expression 'score' with complex type 'ARRAY<BIGINT>' is not supported.

Data must be join ed together before sorting can be used.

SELECT id, item FROM games, games.score
  WHERE item > 1000000
ORDER BY id, item desc;

The item field is each data in the array. For more complex types, see the following example.

CREATE TABLE games2 (id BIGINT, play array < struct <game_name: string, score: BIGINT, high_score: boolean> >) STORED AS PARQUET
...use LOAD DATA to load externally created Parquet files into the table...
SELECT id, item.game_name, item.score FROM games2, games2.play
  WHERE item.score > 1000000
ORDER BY id, item.score DESC;

CREATE TABLE games3 (id BIGINT, play ARRAY < MAP <STRING, BIGINT> >) STORED AS PARQUET;
...use LOAD DATA to load externally created Parquet files into the table...
SELECT id, info.key AS k, info.value AS v from games3, games3.play AS plays, games3.play.item AS info
  WHERE info.KEY = 'score' AND info.VALUE > 1000000
ORDER BY id, info.value desc;

If the data volume of impala sorting, aggregation and other operations is too large, the data will be written to the disk. By default, it will be written to the / TMP / impala scratch directory. You can also use ‑ scratch when starting the impalad process_ Dirs = "path_to_directory" parameter.

NULLS FIRST or NULLS LAST can be used to control the position of null values in sorting.

[localhost:21000] > create table numbers (x int);
[localhost:21000] > insert into numbers values (1), (null), (2), (null), (3);
[localhost:21000] > select x from numbers order by x nulls first;
+------+
| x    |
+------+
| NULL |
| NULL |
| 1    |
| 2    |
| 3    |
+------+
[localhost:21000] > select x from numbers order by x desc nulls first;
+------+
| x    |
+------+
| NULL |
| NULL |
| 3    |
| 2    |
| 1    |
+------+
[localhost:21000] > select x from numbers order by x nulls last;
+------+
| x    |
+------+
| 1    |
| 2    |
| 3    |
| NULL |
| NULL |
+------+
[localhost:21000] > select x from numbers order by x desc nulls last;
+------+
| x    |
+------+
| 3    |
| 2    |
| 1    |
| NULL |
| NULL |
+------+

group by

group by can be used in conjunction with aggregate functions. You can also use the having keyword to filter the value of aggregate functions.
Query the five products with the lowest sales volume.

select
  ss_item_sk as Item,
  count(ss_item_sk) as Times_Purchased,
  sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
  group by ss_item_sk
  having count(ss_item_sk)>= 100
  order by sum(ss_quantity)
  limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 13943 | 105             | 4087                     |
| 2992  | 101             | 4176                     |
| 4773  | 107             | 4204                     |
| 14350 | 103             | 4260                     |
| 11956 | 102             | 4275                     |
+-------+-----------------+--------------------------+

limit

SELECT x FROM t1 LIMIT 1e6;                        -- Top 1 million
SELECT x FROM t1 LIMIT length('hello world');      -- Top 11
SELECT x FROM t1 LIMIT 2+2;                        -- Top 4
SELECT x FROM t1 LIMIT cast(truncate(9.9) AS INT); -- Top 9

Match with offset keyword to achieve paging effect.

[localhost:21000] > select x from numbers order by x limit 5 offset 0;
+----+
| x  |
+----+
| 1  |
| 2  |
| 3  |
| 4  |
| 5  |
+----+
[localhost:21000] > select x from numbers order by x limit 5 offset 5;
+----+
| x  |
+----+
| 6  |
| 7  |
| 8  |
| 9  |
| 10 |
+----+

union

grammar

query_1 UNION [DISTINCT | ALL] query_2

Union all will return all the data spliced by the two tables. Union will de duplicate the data. Therefore, when it can be determined that there is no duplicate data in the two tables, it is recommended to use union all, which will use more resources.
for example

[localhost:21000] > select x from few_ints order by x;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 4 row(s) in 0.41s
[localhost:21000] > select x from few_ints union all select x from few_ints;
+---+
| x |
+---+
| 1 |
| 1 |
| 2 |
| 3 |
| 1 |
| 1 |
| 2 |
| 3 |
+---+
Returned 8 row(s) in 0.42s
[localhost:21000] > select * from (select x from few_ints union all select x from few_ints) as t1 order by x;
+---+
| x |
+---+
| 1 |
| 1 |
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Returned 8 row(s) in 0.53s
[localhost:21000] > select x from few_ints union all select 10;
+----+
| x  |
+----+
| 10 |
| 1  |
| 1  |
| 2  |
| 3  |
+----+
Returned 5 row(s) in 0.38s

[localhost:21000] > select x from few_ints union select x+1 from few_ints;
+---+
| x |
+---+
| 3 |
| 4 |
| 1 |
| 2 |
+---+
Returned 4 row(s) in 0.51s
[localhost:21000] > select x from few_ints union select 10;
+----+
| x  |
+----+
| 2  |
| 10 |
| 1  |
| 3  |
+----+
Returned 4 row(s) in 0.49s
[localhost:21000] > select * from (select x from few_ints union select x from few_ints) as t1 order by x;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.53s

subqueries

Subqueries make SQL statements very powerful, and many very complex filters can be used.

-- Directly judge whether the conditions are correct
SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
-- use EXISTS Keyword to determine whether it is true
SELECT count(x) FROM t1 WHERE EXISTS(SELECT 1 FROM t2 WHERE t1.x = t2.y * 10);

-- Used in from After and join after
SELECT avg(t1.x), max(t2.y) FROM
  (SELECT id, cast(a AS DECIMAL(10,5)) AS x FROM raw_data WHERE a BETWEEN 0 AND 100) AS t1
  JOIN
  (SELECT id, length(s) AS y FROM raw_data WHERE s LIKE 'A%') AS t2;
  USING (id);

tablesample

grammar

TABLESAMPLE SYSTEM(percentage) [REPEATABLE(seed)]

This keyword is used to sample data in the database. You can specify a number to finally represent the approximate percentage of sampling. For example, if the percentage is specified as 50, 50% of the data will be extracted from the data.

with

with is used to simplify sentence query, similar to a temporary view, so that you can express it more concisely in actual sentences.

with t1 as (select 1), t2 as (select 2) insert into tab select * from t1 union all select * from t2;

with t1 as (select 1) (with t2 as (select 2) select * from t2) union all select * from t1;

DISTINCT

distinct means to return different data and de duplicate the result set.

-- De duplication of a single field
SELECT DISTINCT c_birth_country FROM customer;

-- De duplicate multiple fields
SELECT DISTINCT c_salutation, c_last_name FROM customer;

-- Get the number of non duplicate birthplaces
SELECT COUNT(DISTINCT c_birth_country) FROM customer;

-- Get the number of non duplicate fields
SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;

Keywords: Big Data impala

Added by CrashRoX on Sat, 19 Feb 2022 00:44:52 +0200