Scenario:
The project calls for performance optimization: for the same data, which is more efficient to query, Hive or ES? To find out, we need to synchronize all the data of one ES index to HDFS, query the HDFS data through Hive, and compare the efficiency of the two.
Step 1: Preliminary preparation and Hive download
Since Hive queries data stored in HDFS, you need to build the Hadoop environment first. The Hadoop setup tutorial I followed:
Hadoop cluster building tutorial (detailed) - fanxin_i's blog on CSDN
To build Hive backed by a local database, you also need to set up MySQL. The MySQL setup tutorial I followed:
Linux MySQL installation tutorial - Sweet Potato - cnblogs (Blog Garden)
Then we download and set up Hive. Before downloading, confirm that the Hive version matches your Hadoop version. The Hadoop version I downloaded is 2.7.7, and the corresponding Hive version is 2.3.9. The Hive download address is:
http://mirror.bit.edu.cn/apache/hive/
For convenience, I renamed the extracted folder to hive to simplify the subsequent steps.
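For reference, the whole step can look like this (the tarball name assumes the standard Apache naming scheme; check the mirror's directory listing for the exact file):

# download, extract, and rename; tarball name is assumed
wget http://mirror.bit.edu.cn/apache/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -zxvf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9-bin hive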
Step 2: Hive setup
Since I deploy under Linux, I need to set the environment variables in advance. Edit /etc/profile and append:
export HIVE_HOME=/path/to/hive
export PATH=$PATH:$HIVE_HOME/bin
Then execute the command: source /etc/profile
Some people set these variables in ~/.bashrc instead. The difference is that /etc/profile applies system-wide while ~/.bashrc applies per user, so choose according to your situation.
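A quick sanity check that the variables took effect:

echo $HIVE_HOME
which hive    # should resolve to $HIVE_HOME/bin/hive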
Enter the conf folder under the hive directory and strip the .template suffix from the configuration files. Note that hive-default.xml must not only lose its suffix but also be renamed to hive-site.xml, as shown below.
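One way to do the renaming, assuming the standard Hive 2.3.x conf layout (hive-log4j2.properties is also worth renaming now, since Step 3 reads it):

cd hive/conf
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-default.xml.template hive-site.xml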
Edit hive-site.xml. Since it started life as the template file, it contains a great many configuration items; I only used the following:
<property>
  <name>hive.exec.scratchdir</name>
  <value>a path on HDFS</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>a path on HDFS</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://database-address:3306/database-name?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- for MySQL 8 use com.mysql.cj.jdbc.Driver instead -->
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <!-- database user name -->
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <!-- database password -->
  <value>123</value>
</property>
If you keep other configuration items, you need to replace every ${system:java.io.tmpdir} in them with a real directory, otherwise an error will be reported.
Modify hive-env.sh and add the following:
export JAVA_HOME=/path/to/jdk
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/path/to/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib
Because Hive needs to connect to the database, the last step is to download the mysql-connector-java driver jar and place it under Hive's lib directory. Note that the major version of the driver must match the database: my MySQL version is 8, so the driver must also be version 8.
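For example (8.0.28 is just an illustrative version, and the Maven Central URL is my assumption; any 8.x connector matching your server works):

# download the connector and drop it into Hive's lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar
cp mysql-connector-java-8.0.28.jar $HIVE_HOME/lib/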
Step 3: Running Hive and troubleshooting
Enter the bin directory under the hive folder and run ./hive. If an error occurs, first open the hive-log4j2.properties file in the conf directory and check the property.hive.log.dir item to find where the logs are written, then inspect the detailed log at that path.
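Concretely, the startup-and-troubleshoot loop looks like this (hive.log is the default log file name; your property.hive.log.dir value may point elsewhere):

cd hive/bin
./hive
# if startup fails, find the configured log directory...
grep 'property.hive.log.dir' ../conf/hive-log4j2.properties
# ...then read the detailed log there, e.g.:
# tail -n 100 /tmp/root/hive.log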
Step 4: Create an external table to query the data in ES
Once Hive is running, download the elasticsearch-hadoop-2.2.0.jar package and register it in the Hive session:
hive> add jar file:///home/hadoop/xuguokun/elasticsearch-hadoop-2.2.0.jar;
Then create an external table:
create external table ext_table(
  `id` bigint,
  `title` string,
  ...
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
  'es.nodes' = 'es address',
  'es.port' = 'es port',
  'es.index.auto.create' = 'false',
  'es.resource' = 'es index',
  'es.nodes.wan.only' = 'true',
  'es.net.http.auth.user' = 'user name',
  'es.net.http.auth.pass' = 'password',
  'es.index.read.missing.as.empty' = 'false'
);
At this point, you can query the ES data indirectly by selecting from the external table.
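For example, a couple of smoke tests against the table defined above:

hive> select count(*) from ext_table;
hive> select * from ext_table limit 10;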
Create an internal table:
create table in_table(
  -- columns consistent with the external table
);
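A hypothetical sketch, reusing the id and title columns from the external table above:

create table in_table(
  `id` bigint,
  `title` string
  -- plus the remaining columns, matching ext_table
);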
Finally, import the external table's data into the internal table and wait for the job to finish:
insert overwrite table in_table select * from ext_table;
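To verify the import, the row counts on both sides should match; the ES _count API returns the index total (the address, port, index name, and credentials below are placeholders):

hive> select count(*) from in_table;

curl -u 'username:password' 'http://es-address:es-port/es-index/_count'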