Preface
Impala's core development language is SQL. SQL statements can be submitted to Impala through the shell command line, through JDBC, and by other means. For more complex analysis, you can write UDFs in C++ or Java.

Impala's SQL syntax is highly compatible with Apache Hive's HQL: Impala supports the data types that Hive supports, as well as some of Hive's built-in functions.
Note:

- Like Hive, Impala focuses on queries, so it is best not to use it for modification operations such as UPDATE and DELETE. For deleting data, you can use DROP TABLE and ALTER TABLE ... DROP PARTITION, and you can try INSERT OVERWRITE as a substitute for UPDATE;
- The usual way to use Impala is to store data files in the HDFS file system and to query and manage those files through Impala's table definitions;
- Impala is mostly the same as Hive; for example, Impala also supports internal and external tables and partitions, so Hive can be used as a reference.
1.Impala-Shell
Impala shell commands are divided into external commands and internal commands.
Impala shell external commands are options that can be executed without entering the impala-shell interactive command line. impala-shell can be started with many options, which are set at startup to modify the command execution environment.

`impala-shell -h` displays the help manual.
Several common options are as follows:
option | meaning |
---|---|
impala-shell -r | Refreshes Impala metadata after the connection is established; equivalent to issuing an INVALIDATE METADATA statement (use when metadata has changed) |
impala-shell -f file_path | Executes the SQL queries in the specified file |
impala-shell -i hostname | Specifies the host on which the impalad daemon runs; the default port is 21000, and you can connect to any host in the cluster that runs impalad |
impala-shell -o filename | Saves the execution results to a file |
Use the `impala-shell -r` command as follows:
```
[root@node03 ~]$ impala-shell -r
Starting Impala Shell without Kerberos authentication
Connected to node03:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
Invalidating Metadata
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.

(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)

The '-B' command line flag turns off pretty-printing for query results. Use this
flag to remove formatting from results you want to save for later, or to benchmark
Impala.
***********************************************************************************
Query: invalidate metadata
Fetched 0 row(s) in 6.11s
[node03:21000] > exit;
Goodbye root
```
You can see that after executing `impala-shell -r`, the metadata is refreshed as the Impala command line starts.
Next, to try `impala-shell -f`, create a SQL file with `vim impala_data/test.sql`, with the content:
```sql
show databases;
```
Then proceed as follows:
```
[root@node03 ~]$ impala-shell -f impala_data/test.sql
Starting Impala Shell without Kerberos authentication
Connected to node03:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.02s
```
You can see that all databases are listed.

Among them, _impala_builtins is Impala's own system database.

You can inspect it as follows:
```
[node03:21000] > show functions;
Query: show functions
Fetched 0 row(s) in 0.02s
[node03:21000] > use _impala_builtins;
Query: use _impala_builtins
[node03:21000] > show functions;
Query: show functions
+--------------+-------------------------------------------------+-------------+---------------+
| return type  | signature                                       | binary type | is persistent |
+--------------+-------------------------------------------------+-------------+---------------+
| BIGINT       | abs(BIGINT)                                     | BUILTIN     | true          |
| DECIMAL(*,*) | abs(DECIMAL(*,*))                               | BUILTIN     | true          |
| DOUBLE       | abs(DOUBLE)                                     | BUILTIN     | true          |
...
| INT          | zeroifnull(INT)                                 | BUILTIN     | true          |
| SMALLINT     | zeroifnull(SMALLINT)                            | BUILTIN     | true          |
| TINYINT      | zeroifnull(TINYINT)                             | BUILTIN     | true          |
+--------------+-------------------------------------------------+-------------+---------------+
Fetched 592 row(s) in 0.16s
```
As you can see, all of Impala's built-in functions can be viewed in the system database _impala_builtins, but not in other databases.
Then use `impala-shell -i`, as follows:
```
[root@node03 ~]$ impala-shell -i node02
Starting Impala Shell without Kerberos authentication
Connected to node02:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.

(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)

You can change the Impala daemon that you're connected to by using the CONNECT
command. To see how Impala will plan to run your query without actually executing
it, use the EXPLAIN command. You can change the level of detail in the EXPLAIN
output by setting the EXPLAIN_LEVEL query option.
***********************************************************************************
[node02:21000] > exit;
Goodbye root
```
You can see that the shell connected to the impalad server on another node; if this option is not specified, the impalad on the local machine is used by default.
Internal commands are statements that can be executed after entering the impala-shell interactive command line.
Common internal commands are as follows:
command | meaning |
---|---|
help | View common commands and help documents |
connect hostname | Connects to the impalad on the specified machine for execution |
refresh dbname.tablename | Incremental refresh; refreshes the metadata of one table, mainly used when the data of an existing table has changed in Hive |
invalidate metadata | Full refresh; costly in performance, mainly used after a new database or table has been created in Hive |
quit/exit | Exits the impala-shell |
explain | Shows the execution plan of a SQL statement; you can set explain_level (values 0 to 3) to control the amount of output, where level 3 is the highest and prints the most complete information |
profile | Issued after executing a SQL statement; prints the detailed execution steps, mainly used for examining query execution and cluster tuning |
Use the following:
```
[node03:21000] > help;

Documented commands (type help <topic>):
========================================
compute  exit     history  quit    shell  unset  version
connect  explain  insert   select  show   use    with
describe help     profile  set     tip    values

Undocumented commands:
======================
alter  create  desc  drop  load  summary

[node03:21000] > help compute;
Executes a COMPUTE STATS query.
Impala shell cannot get child query handle so it cannot
query live progress for COMPUTE STATS query. Disable live
progress/summary callback for COMPUTE STATS query.
[node03:21000] > help alter;
*** No help on alter
[node03:21000] > refresh mydb.t1;
Query: refresh mydb.t1
Fetched 0 row(s) in 0.98s
[node03:21000] > connect node01;
Connected to node01:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
[node01:21000] > select * from t1;
Query: select * from t1
+--------------------+-----------+-----+--------+
| id                 | name      | age | gender |
+--------------------+-----------+-----+--------+
| 392456197008193000 | Zhang San | 20  | 0      |
| 267456198006210000 | Li Si     | 25  | 1      |
| 892456199007203000 | Wang Wu   | 24  | 1      |
| 492456198712198000 | Zhao Liu  | 26  | 2      |
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
+--------------------+-----------+-----+--------+
Fetched 6 row(s) in 6.99s
[node01:21000] > explain select * from t1;
Query: explain select * from t1
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.t1                                                                          |
|                                                                                     |
| 01:EXCHANGE [UNPARTITIONED]                                                         |
| |                                                                                   |
| 00:SCAN HDFS [default.t1]                                                           |
|    partitions=1/1 files=1 size=186B                                                 |
+------------------------------------------------------------------------------------+
Fetched 8 row(s) in 0.06s
[node01:21000] > set explain_level = 3;
EXPLAIN_LEVEL set to 3
[node01:21000] > explain select * from t1;
Query: explain select * from t1
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.t1                                                                          |
|                                                                                     |
| F01:PLAN FRAGMENT [UNPARTITIONED]                                                   |
|   01:EXCHANGE [UNPARTITIONED]                                                       |
|      hosts=1 per-host-mem=unavailable                                               |
|      tuple-ids=0 row-size=38B cardinality=unavailable                               |
|                                                                                     |
| F00:PLAN FRAGMENT [RANDOM]                                                          |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                        |
|   00:SCAN HDFS [default.t1, RANDOM]                                                 |
|      partitions=1/1 files=1 size=186B                                               |
|      table stats: unavailable                                                       |
|      column stats: unavailable                                                      |
|      hosts=1 per-host-mem=32.00MB                                                   |
|      tuple-ids=0 row-size=38B cardinality=unavailable                               |
+------------------------------------------------------------------------------------+
Fetched 17 row(s) in 0.03s
[node01:21000] > select * from t1;
Query: select * from t1
+--------------------+-----------+-----+--------+
| id                 | name      | age | gender |
+--------------------+-----------+-----+--------+
| 392456197008193000 | Zhang San | 20  | 0      |
| 267456198006210000 | Li Si     | 25  | 1      |
| 892456199007203000 | Wang Wu   | 24  | 1      |
| 492456198712198000 | Zhao Liu  | 26  | 2      |
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
+--------------------+-----------+-----+--------+
Fetched 6 row(s) in 0.43s
[node01:21000] > profile;
Query Runtime Profile:
Query (id=944cd4a45137b96a:b8ab935b8d60adab):
  Summary:
    Session ID: e44a9c110e12d2cb:24ff32547ca3fc99
    Session Type: BEESWAX
    Start Time: 2021-10-11 11:20:45.844713000
    End Time: 2021-10-11 11:20:46.276056000
    Query Type: QUERY
    Query State: FINISHED
    Query Status: OK
    Impala Version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
    User: root
    Connected User: root
    Delegated User:
    Network Address: ::ffff:192.168.31.157:59392
    Default Db: default
    Sql Statement: select * from t1
    Coordinator: node01:22000
    Query Options (non default): EXPLAIN_LEVEL=3
    Plan:
----------------
Estimated Per-Host Requirements: Memory=32.00MB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
default.t1

01:EXCHANGE [UNPARTITIONED]
|  hosts=1 per-host-mem=unavailable
|  tuple-ids=0 row-size=38B cardinality=unavailable
|
00:SCAN HDFS [default.t1, RANDOM]
   partitions=1/1 files=1 size=186B
   table stats: unavailable
   column stats: unavailable
   hosts=1 per-host-mem=32.00MB
   tuple-ids=0 row-size=38B cardinality=unavailable
----------------
    Estimated Per-Host Mem: 33554432
    Estimated Per-Host VCores: 1
    Tables Missing Stats: default.t1
    Request Pool: default-pool
    ExecSummary:
Operator      #Hosts  Avg Time   Max Time   #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-------------------------------------------------------------------------------------------------------
01:EXCHANGE        1   8.275us    8.275us       6          -1         0        -1.00 B  UNPARTITIONED
00:SCAN HDFS       1  158.588ms  158.588ms      6          -1  61.00 KB       32.00 MB  default.t1
    Planner Timeline: 6.764ms
       - Analysis finished: 2.300ms (2.300ms)
       - Equivalence classes computed: 2.466ms (166.581us)
       - Single node plan created: 3.901ms (1.434ms)
       - Runtime filters computed: 3.917ms (16.218us)
       - Distributed plan created: 4.972ms (1.054ms)
       - Planning finished: 6.764ms (1.792ms)
    Query Timeline: 432.815ms
       - Start execution: 44.969us (44.969us)
       - Planning finished: 14.847ms (14.803ms)
       - Ready to start 1 remote fragments: 15.241ms (393.172us)
       - All 1 remote fragments started: 20.207ms (4.966ms)
       - Child queries finished: 20.823ms (615.552us)
       - Rows available: 344.381ms (323.558ms)
       - First row fetched: 428.666ms (84.284ms)
       - Unregister query: 431.338ms (2.672ms)
  ImpalaServer:
     - ClientFetchWaitTimer: 86.786ms
     - RowMaterializationTimer: 40.424us
  Execution Profile 944cd4a45137b96a:b8ab935b8d60adab:(Total: 329.230ms, non-child: 0.000ns, % non-child: 0.00%)
    Number of filters: 0
    Filter routing table:
 ID  Src. Node  Tgt. Node  Targets  Type  Partition filter
-----------------------------------------------------------
    Fragment start latencies: Count: 1, 25th %-ile: 3ms, 50th %-ile: 3ms, 75th %-ile: 3ms, 90th %-ile: 3ms, 95th %-ile: 3ms, 99.9th %-ile: 3ms
    Per Node Peak Memory Usage: node03:22000(71.66 KB) node01:22000(0)
     - FiltersReceived: 0 (0)
     - FinalizationTimer: 0.000ns
    Coordinator Fragment F01:(Total: 323.661ms, non-child: 107.505us, % non-child: 0.03%)
      MemoryUsage(500.000ms): 8.00 KB
       - AverageThreadTokens: 0.00
       - BloomFilterBytes: 0
       - PeakMemoryUsage: 12.05 KB (12336)
       - PerHostPeakMemUsage: 0
       - PrepareTime: 29.640us
       - RowsProduced: 6 (6)
       - TotalCpuTime: 90.045ms
       - TotalNetworkReceiveTime: 323.548ms
       - TotalNetworkSendTime: 0.000ns
       - TotalStorageWaitTime: 0.000ns
      BlockMgr:
         - BlockWritesOutstanding: 0 (0)
         - BlocksCreated: 0 (0)
         - BlocksRecycled: 0 (0)
         - BufferedPins: 0 (0)
         - BytesWritten: 0
         - MaxBlockSize: 8.00 MB (8388608)
         - MemoryLimit: 1.14 GB (1221100928)
         - PeakMemoryUsage: 0
         - TotalBufferWaitTime: 0.000ns
         - TotalEncryptionTime: 0.000ns
         - TotalIntegrityCheckTime: 0.000ns
         - TotalReadBlockTime: 0.000ns
      EXCHANGE_NODE (id=1):(Total: 323.554ms, non-child: 8.275us, % non-child: 0.00%)
        BytesReceived(500.000ms): 0
         - BytesReceived: 264.00 B (264)
         - ConvertRowBatchTime: 1.697us
         - DeserializeRowBatchTimer: 7.709us
         - FirstBatchArrivalWaitTime: 323.545ms
         - PeakMemoryUsage: 0
         - RowsReturned: 6 (6)
         - RowsReturnedRate: 18.00 /sec
         - SendersBlockedTimer: 0.000ns
         - SendersBlockedTotalTimer(*): 0.000ns
    Averaged Fragment F00:(Total: 161.640ms, non-child: 0.000ns, % non-child: 0.00%)
      split sizes: min: 186.00 B, max: 186.00 B, avg: 186.00 B, stddev: 0
      completion times: min:326.012ms  max:326.012ms  mean: 326.012ms  stddev:0.000ns
      execution rates: min:570.00 B/sec  max:570.00 B/sec  mean:570.00 B/sec  stddev:0.53 B/sec
      num instances: 1
       - AverageThreadTokens: 1.00
       - BloomFilterBytes: 0
       - PeakMemoryUsage: 71.66 KB (73384)
       - PerHostPeakMemUsage: 71.66 KB (73384)
       - PrepareTime: 132.318ms
       - RowsProduced: 6 (6)
       - TotalCpuTime: 325.383ms
       - TotalNetworkReceiveTime: 0.000ns
       - TotalNetworkSendTime: 684.107us
       - TotalStorageWaitTime: 24.116ms
      BlockMgr:
         - BlockWritesOutstanding: 0 (0)
         - BlocksCreated: 0 (0)
         - BlocksRecycled: 0 (0)
         - BufferedPins: 0 (0)
         - BytesWritten: 0
         - MaxBlockSize: 8.00 MB (8388608)
         - MemoryLimit: 1.14 GB (1221100928)
         - PeakMemoryUsage: 0
         - TotalBufferWaitTime: 0.000ns
         - TotalEncryptionTime: 0.000ns
         - TotalIntegrityCheckTime: 0.000ns
         - TotalReadBlockTime: 0.000ns
      CodeGen:(Total: 290.903ms, non-child: 290.903ms, % non-child: 100.00%)
         - CodegenTime: 4.578ms
         - CompileTime: 13.251ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 1.87 MB (1959112)
         - OptimizationTime: 149.882ms
         - PrepareTime: 123.054ms
      DataStreamSender (dst_id=1):(Total: 165.873us, non-child: 165.873us, % non-child: 100.00%)
         - BytesSent: 264.00 B (264)
         - NetworkThroughput(*): 3.65 MB/sec
         - OverallThroughput: 1.52 MB/sec
         - PeakMemoryUsage: 2.66 KB (2728)
         - RowsReturned: 6 (6)
         - SerializeBatchTime: 40.210us
         - TransmitDataRPCTime: 68.991us
         - UncompressedRowBatchSize: 460.00 B (460)
      HDFS_SCAN_NODE (id=0):(Total: 158.588ms, non-child: 158.588ms, % non-child: 100.00%)
         - AverageHdfsReadThreadConcurrency: 0.00
         - AverageScannerThreadConcurrency: 0.00
         - BytesRead: 186.00 B (186)
         - BytesReadDataNodeCache: 0
         - BytesReadLocal: 186.00 B (186)
         - BytesReadRemoteUnexpected: 0
         - BytesReadShortCircuit: 186.00 B (186)
         - DecompressionTime: 0.000ns
         - MaxCompressedTextFileLength: 0
         - NumDisksAccessed: 1 (1)
         - NumScannerThreadsStarted: 1 (1)
         - PeakMemoryUsage: 61.00 KB (62464)
         - PerReadThreadRawHdfsThroughput: 3.53 MB/sec
         - RemoteScanRanges: 0 (0)
         - RowsRead: 6 (6)
         - RowsReturned: 6 (6)
         - RowsReturnedRate: 37.00 /sec
         - ScanRangesComplete: 1 (1)
         - ScannerThreadsInvoluntaryContextSwitches: 0 (0)
         - ScannerThreadsTotalWallClockTime: 24.753ms
           - DelimiterParseTime: 51.707us
           - MaterializeTupleTime(*): 903.000ns
           - ScannerThreadsSysTime: 351.000us
           - ScannerThreadsUserTime: 0.000ns
         - ScannerThreadsVoluntaryContextSwitches: 3 (3)
         - TotalRawHdfsReadTime(*): 50.213us
         - TotalReadThroughput: 0.00 /sec
    Fragment F00:
      Instance 944cd4a45137b96a:b8ab935b8d60adad (host=node03:22000):(Total: 161.640ms, non-child: 0.000ns, % non-child: 0.00%)
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/186.00 B
        MemoryUsage(500.000ms): 0
        ThreadUsage(500.000ms): 1
         - AverageThreadTokens: 1.00
         - BloomFilterBytes: 0
         - PeakMemoryUsage: 71.66 KB (73384)
         - PerHostPeakMemUsage: 71.66 KB (73384)
         - PrepareTime: 132.318ms
         - RowsProduced: 6 (6)
         - TotalCpuTime: 325.383ms
         - TotalNetworkReceiveTime: 0.000ns
         - TotalNetworkSendTime: 684.107us
         - TotalStorageWaitTime: 24.116ms
        BlockMgr:
           - BlockWritesOutstanding: 0 (0)
           - BlocksCreated: 0 (0)
           - BlocksRecycled: 0 (0)
           - BufferedPins: 0 (0)
           - BytesWritten: 0
           - MaxBlockSize: 8.00 MB (8388608)
           - MemoryLimit: 1.14 GB (1221100928)
           - PeakMemoryUsage: 0
           - TotalBufferWaitTime: 0.000ns
           - TotalEncryptionTime: 0.000ns
           - TotalIntegrityCheckTime: 0.000ns
           - TotalReadBlockTime: 0.000ns
        CodeGen:(Total: 290.903ms, non-child: 290.903ms, % non-child: 100.00%)
           - CodegenTime: 4.578ms
           - CompileTime: 13.251ms
           - LoadTime: 0.000ns
           - ModuleBitcodeSize: 1.87 MB (1959112)
           - OptimizationTime: 149.882ms
           - PrepareTime: 123.054ms
        DataStreamSender (dst_id=1):(Total: 165.873us, non-child: 165.873us, % non-child: 100.00%)
           - BytesSent: 264.00 B (264)
           - NetworkThroughput(*): 3.65 MB/sec
           - OverallThroughput: 1.52 MB/sec
           - PeakMemoryUsage: 2.66 KB (2728)
           - RowsReturned: 6 (6)
           - SerializeBatchTime: 40.210us
           - TransmitDataRPCTime: 68.991us
           - UncompressedRowBatchSize: 460.00 B (460)
        HDFS_SCAN_NODE (id=0):(Total: 158.588ms, non-child: 158.588ms, % non-child: 100.00%)
          ExecOption: Expr Evaluation Codegen Disabled, Codegen enabled: 1 out of 1
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/186.00 B
          Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% 2:0% 3:0% 4:0% 5:0%
          File Formats: TEXT/NONE:1
          BytesRead(500.000ms): 0
           - AverageHdfsReadThreadConcurrency: 0.00
           - AverageScannerThreadConcurrency: 0.00
           - BytesRead: 186.00 B (186)
           - BytesReadDataNodeCache: 0
           - BytesReadLocal: 186.00 B (186)
           - BytesReadRemoteUnexpected: 0
           - BytesReadShortCircuit: 186.00 B (186)
           - DecompressionTime: 0.000ns
           - MaxCompressedTextFileLength: 0
           - NumDisksAccessed: 1 (1)
           - NumScannerThreadsStarted: 1 (1)
           - PeakMemoryUsage: 61.00 KB (62464)
           - PerReadThreadRawHdfsThroughput: 3.53 MB/sec
           - RemoteScanRanges: 0 (0)
           - RowsRead: 6 (6)
           - RowsReturned: 6 (6)
           - RowsReturnedRate: 37.00 /sec
           - ScanRangesComplete: 1 (1)
           - ScannerThreadsInvoluntaryContextSwitches: 0 (0)
           - ScannerThreadsTotalWallClockTime: 24.753ms
             - DelimiterParseTime: 51.707us
             - MaterializeTupleTime(*): 903.000ns
             - ScannerThreadsSysTime: 351.000us
             - ScannerThreadsUserTime: 0.000ns
           - ScannerThreadsVoluntaryContextSwitches: 3 (3)
           - TotalRawHdfsReadTime(*): 50.213us
           - TotalReadThroughput: 0.00 /sec
```
Among them, the help command lists documented commands (whose usage can be viewed with help <topic>) and undocumented commands (whose usage cannot be viewed with help);

The refresh and invalidate metadata commands serve the same purpose of updating metadata, but the former refreshes a specified table (incremental update) while the latter refreshes everything (full update), so their performance characteristics differ;
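As a hedged illustration of the division of labor between the two (mydb.t1 appears earlier in this article; new_table is a made-up name):

```sql
-- After new data files land in an existing table via Hive:
-- an incremental, per-table refresh is enough.
refresh mydb.t1;

-- After a brand-new database or table is created in Hive:
-- Impala does not know it exists yet, so a full refresh is needed.
invalidate metadata;

-- INVALIDATE METADATA also accepts a table name, which limits
-- the expensive operation to a single (here hypothetical) table.
invalidate metadata mydb.new_table;
```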
When viewing query plans with explain, SCAN HDFS marks the table scan, partitions shows which partitions are read, and the number and size of the files are listed, which can be verified against HDFS:
```
[root@node03 ~]$ hdfs dfs -ls /user/impala/t1
Found 1 items
-rw-r--r--   3 root supergroup        186 2021-10-10 21:44 /user/impala/t1/user.csv
```
As you can see, the numbers match.

Also, when explain_level is set to 3, the explain output contains more information.

Comparing explain and profile:

explain: displays the execution plan of a task without actually executing it;

profile: after the task has been executed, lets you observe how Impala ran it at a lower and more detailed level, and tune accordingly.
2.Impala SQL syntax
(1) Database operation
The CREATE DATABASE statement is used to create a new database in Impala:

```sql
CREATE DATABASE IF NOT EXISTS database_name;
```

Here, IF NOT EXISTS is an optional clause: when it is used, a database with the given name is created only if no database with the same name already exists.
As follows:
```
[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.01s
[node03:21000] > create database if not exists cltest;
Query: create database if not exists cltest
Fetched 0 row(s) in 0.79s
[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| cltest           |                                              |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 8 row(s) in 0.00s
```
As you can see, the database is created.
After the database is created, a new folder named after the database is created by default under Hive's data warehouse path in HDFS, as follows:
```
[root@node03 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxrwxrwx   - impala    supergroup          0 2021-10-11 13:47 /user/hive/warehouse/cltest.db
drwxrwxrwx   - root      supergroup          0 2021-10-05 20:16 /user/hive/warehouse/homework.db
drwxrwxrwx   - root      supergroup          0 2021-10-10 22:13 /user/hive/warehouse/mydb.db
drwxrwxrwx   - anonymous supergroup          0 2021-09-28 20:30 /user/hive/warehouse/mydb2.db
drwxrwxrwx   - root      supergroup          0 2021-10-03 13:06 /user/hive/warehouse/sale.db
drwxrwxrwx   - root      supergroup          0 2021-09-28 21:00 /user/hive/warehouse/t2
drwxrwxrwx   - root      supergroup          0 2021-10-08 23:44 /user/hive/warehouse/tuning.db
```
Impala's DROP DATABASE statement is used to delete a database from Impala. It is recommended to delete all tables from a database before deleting it;

if the CASCADE keyword is used, a database that still contains tables and data can be deleted directly.
As follows:
```
[node03:21000] > drop database cltest cascade;
Query: drop database cltest cascade
[node03:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| homework         |                                              |
| mydb             |                                              |
| mydb2            |                                              |
| sale             |                                              |
| tuning           |                                              |
+------------------+----------------------------------------------+
Fetched 7 row(s) in 0.01s
```
(2) Table operation
The CREATE TABLE statement is used to create a new table in the required database in Impala. You need to specify the table name and define its columns and the data type of each column.
Impala supports data types similar to Hive;
Impala's support for complex data types depends on the storage format: complex types are not supported in the Text storage format and should use the Parquet format.
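For example (a hedged sketch; the table and column names are invented), a table with complex-type columns has to be stored as Parquet:

```sql
-- Hypothetical table: complex types (ARRAY, MAP, STRUCT) require Parquet.
create table if not exists mydb.contacts (
    id     BIGINT,
    phones ARRAY<STRING>,                     -- a collection of values
    props  MAP<STRING, STRING>,               -- key/value pairs
    addr   STRUCT<city: STRING, zip: STRING>  -- a nested record
)
STORED AS PARQUET;
```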
The general syntax of CREATE TABLE is as follows:
```sql
create table IF NOT EXISTS database_name.table_name (
    column1 data_type,
    column2 data_type,
    column3 data_type,
    ...,
    columnN data_type
);
```
Examples are as follows:
```
[node03:21000] > create table if not exists mydb.student(
              > name string,
              > age int,
              > concat int
              > );
Query: create table if not exists mydb.student( name string, age int, concat int )
Fetched 0 row(s) in 0.31s
[node03:21000] > use mydb;
Query: use mydb
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| temp2           |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s
```
The default data storage path for a new table is the same as in Hive; you can also specify a specific path with the LOCATION clause when creating the table.
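For instance (a hedged sketch; the column layout mirrors the user.csv file shown earlier, but the table name is invented), an external table can point at an existing HDFS directory:

```sql
-- Hypothetical example: keep the table's data at an explicit HDFS path.
create external table if not exists mydb.user_ext (
    id     BIGINT,
    name   STRING,
    age    INT,
    gender INT
)
row format delimited fields terminated by ','
location '/user/impala/t1';
```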
Impala's INSERT statement has two clauses, into and overwrite:
into is used to insert new record data;
Overwrite is used to overwrite existing records.
The syntax of the insert into statement is as follows:
```sql
-- Form 1
insert into table_name (column1, column2, column3, ..., columnN)
values (value1, value2, value3, ..., valueN);

-- Form 2
insert into table_name values (value1, value2, ..., valueN);
```
Here, column1, column2, ..., columnN are the names of the columns into which data is inserted;

you can also supply values without specifying column names, but then you need to ensure that the order of the values matches the order of the columns in the table.
Examples are as follows:
```
[node03:21000] > create table employee(
              > Id INT,
              > name STRING,
              > age INT,
              > address STRING,
              > salary BIGINT
              > );
Query: create table employee( Id INT, name STRING, age INT, address STRING, salary BIGINT )
Fetched 0 row(s) in 0.40s
[node03:21000] > insert into employee VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000);
Query: insert into employee VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000)
Inserted 1 row(s) in 2.46s
[node03:21000] > insert into employee values (2, 'Khilan', 25, 'Delhi', 15000);
Query: insert into employee values (2, 'Khilan', 25, 'Delhi', 15000 )
Inserted 1 row(s) in 0.32s
[node03:21000] > Insert into employee values (3, 'kaushik', 23, 'Kota', 30000);
Query: insert into employee values (3, 'kaushik', 23, 'Kota', 30000 )
Inserted 1 row(s) in 0.32s
[node03:21000] > Insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000);
Query: insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000 )
Inserted 1 row(s) in 0.21s
[node03:21000] > Insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000);
Query: insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000 )
Inserted 1 row(s) in 0.46s
[node03:21000] > Insert into employee values (6, 'Komal', 22, 'MP', 32000);
Query: insert into employee values (6, 'Komal', 22, 'MP', 32000)
Inserted 1 row(s) in 0.33s
```
overwrite overwrites all records in the table. The overwritten records will be permanently deleted from the table.
Examples are as follows:
```sql
insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000);
```
The Impala SELECT statement is used to query data from the database; it returns the data in the form of a table.
As follows:
```
[node03:21000] > select * from t3;
Query: select * from t3
+--------------------+-----------+-----+--------+
| id                 | name      | age | gender |
+--------------------+-----------+-----+--------+
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
+--------------------+-----------+-----+--------+
Fetched 3 row(s) in 2.85s
```
The describe/desc statement in Impala is used to provide a description of the table; the result contains information about the table, such as column names and their data types.
Use the following:
```
[node03:21000] > desc employee;
Query: describe employee
+---------+--------+---------+
| name    | type   | comment |
+---------+--------+---------+
| id      | int    |         |
| name    | string |         |
| age     | int    |         |
| address | string |         |
| salary  | bigint |         |
+---------+--------+---------+
Fetched 5 row(s) in 0.01s
[node03:21000] > desc formatted employee;
Query: describe formatted employee
+------------------------------+------------------------------------------------------------+------------+
| name                         | type                                                       | comment    |
+------------------------------+------------------------------------------------------------+------------+
| # col_name                   | data_type                                                  | comment    |
|                              | NULL                                                       | NULL       |
| id                           | int                                                        | NULL       |
| name                         | string                                                     | NULL       |
| age                          | int                                                        | NULL       |
| address                      | string                                                     | NULL       |
| salary                       | bigint                                                     | NULL       |
|                              | NULL                                                       | NULL       |
| # Detailed Table Information | NULL                                                       | NULL       |
| Database:                    | mydb                                                       | NULL       |
| Owner:                       | root                                                       | NULL       |
| CreateTime:                  | Mon Oct 11 13:58:02 CST 2021                               | NULL       |
| LastAccessTime:              | UNKNOWN                                                    | NULL       |
| Protect Mode:                | None                                                       | NULL       |
| Retention:                   | 0                                                          | NULL       |
| Location:                    | hdfs://node01:9000/user/hive/warehouse/mydb.db/employee    | NULL       |
| Table Type:                  | MANAGED_TABLE                                              | NULL       |
| Table Parameters:            | NULL                                                       | NULL       |
|                              | transient_lastDdlTime                                      | 1633931882 |
|                              | NULL                                                       | NULL       |
| # Storage Information        | NULL                                                       | NULL       |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL       |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                   | NULL       |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL       |
| Compressed:                  | No                                                         | NULL       |
| Num Buckets:                 | 0                                                          | NULL       |
| Bucket Columns:              | []                                                         | NULL       |
| Sort Columns:                | []                                                         | NULL       |
+------------------------------+------------------------------------------------------------+------------+
Fetched 28 row(s) in 0.03s
```
The ALTER TABLE statement in Impala is used to change a given table: you can add, delete, or modify columns in an existing table, or rename the table itself.

Note that although tables can be modified in Impala, doing so is not recommended; treat Impala as a dedicated query tool and perform update operations in Hive.
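A few hedged examples of what ALTER TABLE can express (staff, department, hired and salary_amount are invented names):

```sql
-- Rename a table.
alter table mydb.employee rename to mydb.staff;

-- Add new columns to an existing table.
alter table mydb.staff add columns (department STRING, hired TIMESTAMP);

-- Change a column's name and/or type.
alter table mydb.staff change salary salary_amount BIGINT;

-- Remove a column.
alter table mydb.staff drop column department;
```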
Impala's DROP TABLE statement is used to delete an existing table in Impala: it removes the complete table definition from the database, and for internal tables it also deletes the underlying HDFS files.
The syntax is as follows:
```sql
drop table database_name.table_name;
```
Note: use this command with care, because once a table is deleted, all the information available in it is lost forever.
Impala's Truncate Table statement is used to delete all records from an existing table and preserve the table structure.
The syntax format is as follows:
```sql
truncate table_name;
```
Examples of drop table and truncate are as follows:
```
[node03:21000] > drop table temp2;
Query: drop table temp2
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s
[node03:21000] > select * from t3;
Query: select * from t3
+--------------------+-----------+-----+--------+
| id                 | name      | age | gender |
+--------------------+-----------+-----+--------+
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
| 392456197008193000 | Zhang San | 20  | 0      |
+--------------------+-----------+-----+--------+
Fetched 3 row(s) in 2.85s
[node03:21000] > truncate t3;
Query: truncate t3
Fetched 0 row(s) in 0.22s
[node03:21000] > select * from t3;
Query: select * from t3
Fetched 0 row(s) in 0.15s
```
A view is a named statement of the Impala query language stored in the database; in effect it is a predefined SQL query over one or more tables, presented as a table.

A view can contain all rows of a table or only selected rows.
The syntax for creating, modifying, and deleting views is as follows:
```sql
-- Create a view
create view if not exists view_name as select statement;

-- Modify a view
alter view database_name.view_name as select statement;

-- Delete a view
drop view database_name.view_name;
```
Use the following:
```
[node03:21000] > create view if not exists emp_view as select * from employee where salary > 20000;
Query: create view if not exists emp_view as select * from employee where salary > 20000
Fetched 0 row(s) in 0.14s
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| emp_view        |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 27 row(s) in 0.02s
[node03:21000] > select * from emp_view;
Query: select * from emp_view
+----+----------+-----+---------+--------+
| id | name     | age | address | salary |
+----+----------+-----+---------+--------+
| 3  | kaushik  | 23  | Kota    | 30000  |
| 4  | Chaitali | 25  | Mumbai  | 35000  |
| 5  | Hardik   | 27  | Bhopal  | 40000  |
| 6  | Komal    | 22  | MP      | 32000  |
+----+----------+-----+---------+--------+
Fetched 4 row(s) in 4.14s
[node03:21000] > drop view emp_view;
Query: drop view emp_view
[node03:21000] > show tables;
Query: show tables
+-----------------+
| name            |
+-----------------+
| course          |
| emp             |
| employee        |
| goodtbl         |
| line2row        |
| row2line        |
| stu             |
| student         |
| studscore       |
| t1              |
| t2              |
| t3              |
| tab1            |
| taba            |
| tabb            |
| tabc            |
| tabd            |
| tabe            |
| u1              |
| u2              |
| uaction_orc     |
| uaction_parquet |
| uaction_text    |
| ulogin          |
| userpv          |
| zxz_data        |
+-----------------+
Fetched 26 row(s) in 0.01s
```
The Impala ORDER BY clause is used to sort data in ascending or descending order by one or more columns.
The syntax is as follows:
```sql
select * from table_name ORDER BY col_name [ASC|DESC] [NULLS FIRST|NULLS LAST];
```
By default, query results are sorted in ascending order; the keywords ASC (the default) and DESC sort the data in the table in ascending or descending order respectively.

If NULLS FIRST is used, all rows with null values are placed first;

if NULLS LAST is used, rows containing null values are placed last.
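A hedged illustration of the null-placement options (the employee table above happens to contain no NULL salaries, so take this as a syntax sketch):

```sql
-- Put rows whose salary is NULL at the top of the result...
select * from employee order by salary asc nulls first;

-- ...or at the bottom.
select * from employee order by salary desc nulls last;
```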
The Impala GROUP BY clause is used in conjunction with a SELECT statement to arrange the same data into groups.
The having clause is used to filter the data obtained after grouping and aggregation;

it is easily confused with where filtering. The difference is as follows:

where filters the original data, which already exists in the table;

having filters the query's result data after grouping.
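To make the contrast concrete (a sketch against the employee table from above):

```sql
-- WHERE filters raw rows before grouping;
-- HAVING filters the aggregated groups afterwards.
select age, max(salary)
from employee
where age > 22               -- applied to each original row
group by age
having max(salary) > 20000;  -- applied to each group's aggregate
```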
The limit clause in Impala restricts the number of rows in the result set to the required number; that is, the query's result set does not contain more records than the specified limit.

Generally speaking, the rows in the result set of a select query start from 0; with the offset clause, you can decide where the output should start.
order by, group by, having and limit ... offset are used as follows:
```
[node03:21000] > select * from employee order by salary desc;
Query: select * from employee order by salary desc
+----+----------+-----+-----------+--------+
| id | name     | age | address   | salary |
+----+----------+-----+-----------+--------+
| 5  | Hardik   | 27  | Bhopal    | 40000  |
| 4  | Chaitali | 25  | Mumbai    | 35000  |
| 6  | Komal    | 22  | MP        | 32000  |
| 3  | kaushik  | 23  | Kota      | 30000  |
| 1  | Ramesh   | 32  | Ahmedabad | 20000  |
| 2  | Khilan   | 25  | Delhi     | 15000  |
+----+----------+-----+-----------+--------+
Fetched 6 row(s) in 0.87s
[node03:21000] > select name, sum(salary) from employee group by name;
Query: select name, sum(salary) from employee group by name
+----------+-------------+
| name     | sum(salary) |
+----------+-------------+
| Ramesh   | 20000       |
| Hardik   | 40000       |
| Komal    | 32000       |
| Chaitali | 35000       |
| kaushik  | 30000       |
| Khilan   | 15000       |
+----------+-------------+
Fetched 6 row(s) in 1.28s
[node03:21000] > select age, max(salary) from employee group by age having max(salary) > 20000;
Query: select age, max(salary) from employee group by age having max(salary) > 20000
+-----+-------------+
| age | max(salary) |
+-----+-------------+
| 23  | 30000       |
| 25  | 35000       |
| 27  | 40000       |
| 22  | 32000       |
+-----+-------------+
Fetched 4 row(s) in 0.67s
[node03:21000] > select * from employee order by salarylimit 2 offset 0;
Query: select * from employee order by salarylimit 2 offset 0
ERROR: AnalysisException: Syntax error in line 1:
select * from employee order by salarylimit 2 offset 0
                                            ^
Encountered: INTEGER LITERAL
Expected: AND, AS, ASC, BETWEEN, CROSS, DESC, DIV, ELSE, END, FOLLOWING, FROM, FULL, GROUP, HAVING, ILIKE, IN, INNER, IREGEXP, IS, JOIN, LEFT, LIKE, LIMIT, LOCATION, NOT, NULLS, OFFSET, ON, OR, ORDER, PRECEDING, RANGE, REGEXP, RIGHT, RLIKE, ROWS, STRAIGHT_JOIN, THEN, UNION, USING, WHEN, WHERE, COMMA, IDENTIFIER

CAUSED BY: Exception: Syntax error
[node03:21000] > select * from employee order by salary limit 2 offset 0;
Query: select * from employee order by salary limit 2 offset 0
+----+--------+-----+-----------+--------+
| id | name   | age | address   | salary |
+----+--------+-----+-----------+--------+
| 2  | Khilan | 25  | Delhi     | 15000  |
| 1  | Ramesh | 32  | Ahmedabad | 20000  |
+----+--------+-----+-----------+--------+
Fetched 2 row(s) in 6.16s
[node03:21000] > select * from employee order by salary limit 2 offset 2;
Query: select * from employee order by salary limit 2 offset 2
+----+---------+-----+---------+--------+
| id | name    | age | address | salary |
+----+---------+-----+---------+--------+
| 3  | kaushik | 23  | Kota    | 30000  |
| 6  | Komal   | 22  | MP      | 32000  |
+----+---------+-----+---------+--------+
Fetched 2 row(s) in 2.53s
[node03:21000] > select * from employee order by salary limit 2 offset 4;
Query: select * from employee order by salary limit 2 offset 4
+----+----------+-----+---------+--------+
| id | name     | age | address | salary |
+----+----------+-----+---------+--------+
| 4  | Chaitali | 25  | Mumbai  | 35000  |
| 5  | Hardik   | 27  | Bhopal  | 40000  |
+----+----------+-----+---------+--------+
Fetched 2 row(s) in 2.45s
[node03:21000] > select * from employee order by salary limit 2 offset 6;
Query: select * from employee order by salary limit 2 offset 6
Fetched 0 row(s) in 1.47s
```
You can see that the limit ... offset ... statement achieves a paging effect;

note that using the offset keyword requires the result data to be sorted.
3. Import data and query Impala via JDBC
Like Hive, Impala has many ways to import data:
mode | meaning |
---|---|
insert into values | Similar to inserting data in an RDBMS: the data is specified by hand in the statement |
insert into select | The data inserted into the table comes from the result of the following select query (see the sketch after this table) |
create table as select | The table's fields (number and types) and its data all come from the following select query |
load data | Imports data from an HDFS file. This method is not recommended in Impala; instead, first load the data into a Hive table with load data, then insert it into the Impala table using one of the methods above |
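A hedged sketch of the last three methods (employee is the table created earlier; employee_backup, high_earners and the staging path are invented):

```sql
-- insert into select: copy rows returned by a query
-- (assumes employee_backup already exists with a matching schema).
insert into employee_backup
select * from employee where salary > 20000;

-- create table as select (CTAS): schema and data both come from the query.
create table high_earners as
select id, name, salary from employee where salary > 30000;

-- load data: move prepared files from an HDFS directory into a table
-- (in practice this is usually done in Hive instead, as noted above).
load data inpath '/user/impala/staging/employee' into table employee;
```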
In actual work, because Impala queries are relatively fast, Impala may be chosen for database queries, and such queries can be issued from Java code via JDBC.
Create a Maven project ImpalaJDBC with the following dependencies:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bigdata.impala</groupId>
    <artifactId>ImpalaJDBC</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.9.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-common</artifactId>
            <version>2.3.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-metastore</artifactId>
            <version>2.3.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-service</artifactId>
            <version>2.3.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.3.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.7</version>
        </dependency>
    </dependencies>
</project>
```
Wait for all the dependencies to download and load.
Create the package com.bigdata.impala, and create the class ImpalaTest in it, as follows:
```java
package com.bigdata.impala;

import java.sql.*;

/**
 * @author Corley
 * @date 2021/10/11 15:41
 * @description ImpalaJDBC-com.bigdata.impala
 */
public class ImpalaTest {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Define the driver and the connection URL used to connect to Impala
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String driverUrl = "jdbc:hive2://node02:21050/mydb;auth=noSasl";

        // SQL statement of the query
        String querySql = "select * from employee;";

        // Load the driver
        Class.forName(driver);
        // Get a connection through DriverManager
        final Connection connection = DriverManager.getConnection(driverUrl);
        final PreparedStatement ps = connection.prepareStatement(querySql);

        // Execute the query
        final ResultSet resultSet = ps.executeQuery();
        // Parse the returned results:
        // get the number of columns in each row
        final int columnCount = resultSet.getMetaData().getColumnCount();
        // Traverse the result set
        while (resultSet.next()) {
            for (int i = 1; i <= columnCount; i++) {
                final String string = resultSet.getString(i);
                System.out.print(string + "\t");
            }
            System.out.println();
        }

        // Close resources
        ps.close();
        connection.close();
    }
}
```
Output:
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/LENOVO/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/LENOVO/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
1    Ramesh    32    Ahmedabad    20000
3    kaushik   23    Kota         30000
4    Chaitali  25    Mumbai       35000
5    Hardik    27    Bhopal       40000
6    Komal     22    MP           32000
2    Khilan    25    Delhi        15000
```
As you can see, the query results are obtained.
Summary

Impala is also an interactive query engine that enables efficient queries. When choosing between Impala and Hive, decide according to actual needs.