Hive environment setup + reading es data into internal tables

Scenario:

         The project needs performance optimization, and we want to know which is more efficient for querying the same data: hive or es. Therefore, we synchronize all the data of one es index to hdfs and query that data through hive, so we can compare the efficiency of the two.

Step 1: preliminary preparation + hive Download

        Since hive queries data stored in hdfs, the hadoop environment must be built first. The hadoop setup tutorial I followed is:

        Hadoop cluster building tutorial (detailed), from fanxin_i's blog on CSDN

        Hive also needs a metastore database; if you connect to a local database, you need to set up mysql as well. The mysql setup tutorial I followed is:

        Installing mysql on linux, from Sweet Potato's blog on Cnblogs (Blog Garden)

        Then we download and set up hive. Before downloading, make sure the hadoop and hive versions are compatible. The hadoop version I downloaded is 2.7.7, and the matching hive version is 2.3.9. The hive download address is:

        http://mirror.bit.edu.cn/apache/hive/

        For convenience, I rename the extracted folder to hive to keep subsequent commands short.
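        To make this concrete, here is a minimal sketch of the download, extract, and rename steps, assuming the apache-hive-2.3.9-bin tarball sits under the hive-2.3.9 directory of the mirror above (check the mirror for the exact file name):

# download and unpack hive 2.3.9, then shorten the folder name
wget http://mirror.bit.edu.cn/apache/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -zxvf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9-bin hive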

Step 2: hive setup

        Since I deploy under linux, I need to set the environment variables in advance in /etc/profile by adding the following lines:

export HIVE_HOME=<path to the hive folder>
export PATH=$PATH:$HIVE_HOME/bin

        Then execute the command: source /etc/profile

        Some people set these variables in ~/.bashrc instead; /etc/profile is system-wide while ~/.bashrc is per-user, so choose according to your actual situation.

        Enter the conf folder under the hive directory and strip the .template suffix from the files there; hive-default.xml.template additionally needs to be renamed to hive-site.xml, as sketched below.
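        A sketch of those renames, assuming the stock file names shipped in conf (check your own conf directory for the exact list):

cd hive/conf
# drop the .template suffix from the ordinary config files
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j2.properties
# hive-default.xml.template is additionally renamed to hive-site.xml
cp hive-default.xml.template hive-site.xml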

        Edit hive-site.xml. Since it was copied from the template, it contains a huge number of configuration items; I only modify the following ones:

       <property>
            <name>hive.exec.scratchdir</name>
            <value>path on hdfs</value>
       </property>
       <property>
            <name>hive.querylog.location</name>
            <value>path on hdfs</value>
       </property>
       <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <!-- & must be written as &amp; inside an xml value -->
            <value>jdbc:mysql://database address:3306/database name?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
       </property>
       <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <!-- with mysql 8 and the version 8 driver, use com.mysql.cj.jdbc.Driver instead -->
            <value>com.mysql.jdbc.Driver</value>
       </property>
       <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>  <!-- database user name -->
       </property>
       <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>123</value>  <!-- database password -->
       </property>

        If you keep other configuration items from the template, you must replace ${system:java.io.tmpdir} with a real directory, otherwise an error will be reported.
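        One way to do that replacement in bulk is with sed; a sketch assuming /tmp/hive as the real local directory:

# replace every ${system:java.io.tmpdir} placeholder with a concrete path
sed -i 's|\${system:java.io.tmpdir}|/tmp/hive|g' hive-site.xml
mkdir -p /tmp/hive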

        Modify hive-env.sh and add the following lines:

export JAVA_HOME=<jdk path>
export HADOOP_HOME=<hadoop path>
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=<hive path>
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib

        Because hive needs to connect to the database, the last step is to download the mysql connector java driver jar and put it under hive's lib directory. Note that the driver version must match the database version: my mysql version is 8, so I download the version 8 driver.
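        That step is just a copy into hive's lib directory; a sketch assuming the jar is named mysql-connector-java-8.0.26.jar (the exact name depends on the version you download):

# put the mysql driver on hive's classpath
cp mysql-connector-java-8.0.26.jar $HIVE_HOME/lib/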

Step 3: hive operation and troubleshooting

         Enter the bin directory under the hive folder and run ./hive. If there is an error, first open the hive-log4j2.properties file under the conf directory and check the property.hive.log.dir item to find the log output path, then go to that path to read the detailed log.
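         One startup error that is common with hive 2.x is an uninitialized metastore. If the log complains about missing metastore tables, initializing the schema against mysql with the schematool utility in hive's bin directory usually fixes it; a hedged sketch:

cd hive/bin
./hive                                  # first attempt
# if startup fails with metastore errors, initialize the schema and retry
./schematool -dbType mysql -initSchema
./hive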

Step 4: query the data in es and create an external table

        After hive is running, download the jar package elasticsearch-hadoop-2.2.0.jar, then register it in the hive session:

        hive> add jar file:///home/hadoop/xuguokun/elasticsearch-hadoop-2.2.0.jar;

        Then create an external table:

create external table ext_table(
    `id` bigint,
    `title` string,
    ...
)
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.nodes'='es address',
    'es.port'='es port',
    'es.index.auto.create'='false',
    'es.resource'='es index/type',
    'es.nodes.wan.only'='true',
    'es.net.http.auth.user'='user name',
    'es.net.http.auth.pass'='password',
    'es.index.read.missing.as.empty'='false'
);

        At this point, you can indirectly query the es data by running select queries against the external table.
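        For example, a couple of smoke-test queries against the external table (the column names follow the illustrative DDL above):

hive> select count(*) from ext_table;
hive> select id, title from ext_table limit 10;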

        Create internal table:

create table in_table(
    -- same columns as the external table
);
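        As a concrete sketch, assuming only the two illustrative columns shown in the external table above, the internal table could look like this:

create table in_table(
    `id` bigint,
    `title` string
);

        With no storage clause hive uses its default text format; if the goal is a fair query-speed comparison, a columnar format such as ORC (stored as orc) is worth considering.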

        Finally, import the external table data into the internal table and wait for the MapReduce job to complete:

insert overwrite table in_table select * from ext_table;
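        Once the job finishes, it is worth checking that the row count of the internal table matches the count from the external table before timing any comparison queries:

hive> select count(*) from in_table;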
