Big data: platform building (Hadoop + Spark + Zeppelin)

Zeppelin is an open-source Apache project (originally an Apache incubation project): a web-based notebook that supports interactive data analysis. Through pluggable interpreters, users can run interactive queries in a specific language or against a data-processing backend, and quickly visualize the results.
Zeppelin: queries and analyzes data and generates reports.
Spark: provides Zeppelin with a backend compute engine. It is a computing model based on RDDs (Resilient Distributed Datasets): to process massive data in a distributed manner, a large dataset is first split into partitions that are computed separately, and the partial results are then merged.
 

1, Environmental preparation

1. Uninstall OpenJDK and install the Oracle JDK
Since CentOS installs OpenJDK by default, which does not meet our requirements, we first uninstall it and then install the Oracle version of the JDK.
1.1. Query the installed Java packages: rpm -qa|grep java
1.2. Uninstall OpenJDK: yum -y remove java-xxx-openjdk-xxx (substituting the package names found in the previous step)
JDK download address of Oracle version: https://download.oracle.com/otn/java/jdk/8u311-b11/4d5417147a92418ea8b615e228bb6935/jdk-8u311-linux-x64.rpm

  Install the JDK with rpm

rpm -ivh jdk-8u311-linux-x64.rpm

View the installed version
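
java -version
# should report java version "1.8.0_311" after the rpm installation above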

  Add to the environment variables

vim /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64
export JAVA_BIN=/usr/java/jdk1.8.0_311-amd64/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
export PATH=$PATH:$JAVA_HOME/bin
# Save and exit, execute the following command to make the environment variable take effect immediately
source /etc/profile

2. Change hostname

# On 192.168.0.61:
hostnamectl set-hostname master
# On 192.168.0.64:
hostnamectl set-hostname slave01
# On 192.168.0.65:
hostnamectl set-hostname slave02

# Configure hosts(IP and hostname mapping)
vim /etc/hosts
192.168.0.61 master
192.168.0.64 slave01
192.168.0.65 slave02

3. Synchronize the time
ntpdate cn.pool.ntp.org
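
If the ntpdate command is not present, it can usually be installed first (assuming a stock CentOS yum repository):

yum -y install ntpdate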
4. Turn off the firewall and SELinux

vi /etc/selinux/config
# Change SELINUX=enforcing to SELINUX=disabled
# A reboot is required for this change to take effect

systemctl stop firewalld
systemctl disable firewalld
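
To verify (note that getenforce only reports Disabled after the reboot):

getenforce
systemctl is-enabled firewalld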

5. Scala installation
Download the scala3-3.1.0 package

tar -zxf scala3-3.1.0.tar.gz -C /usr/local/
mv /usr/local/scala3-3.1.0 /usr/local/scala3
vim /etc/profile
export PATH=$PATH:$JAVA_HOME/bin:/usr/local/scala3/bin
# Save and exit, execute the following command to make the environment variable take effect immediately
source /etc/profile

Check whether the installation succeeded
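
scala -version
# should report the Scala code runner version 3.1.0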

 

6. Create hadoop user

useradd hadoop
passwd hadoop

Switch to the hadoop user
su hadoop

7. Configure passwordless SSH login

ssh-keygen -t rsa
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub master
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub slave01
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub slave02

The above steps should be performed on each of the three virtual machines.
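
To confirm that passwordless login works, a remote command such as the following should run without prompting for a password:

ssh slave01 hostname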

2, Hadoop installation and deployment

  Download Hadoop

Index of /hadoop/common/hadoop-3.3.1
https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Unzip the installation package

tar -zxf hadoop-3.3.1.tar.gz -C /data/java-service/
cd /data/java-service/
mv hadoop-3.3.1/ hadoop
# Create dfs-related directories (these paths must match hdfs-site.xml below)
mkdir -p hadoop/dfs/name
mkdir -p hadoop/dfs/data
mkdir -p hadoop/dfs/namesecondary

Configure environment variables

vim /etc/profile
# hadoop
export HADOOP_HOME=/data/java-service/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Save and exit, execute the following command to make the environment variable take effect immediately
source /etc/profile
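
Verify that the hadoop command is now on the PATH:

hadoop version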

Enter the Hadoop configuration directory to start parameter configuration:

cd /data/java-service/hadoop/etc/hadoop/

1. Modify the configuration file core-site.xml

cp core-site.xml core-site.xml.cp

For detailed parameters, please refer to: https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/core-default.xml
The following contents are added inside <configuration></configuration>:
vim core-site.xml

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
        <description>NameNode URI.</description>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
        <description>Size of read/write buffer used in SequenceFiles.</description>
    </property>

2. Modify the configuration file hdfs-site.xml

cp hdfs-site.xml hdfs-site.xml.cp
For detailed parameters, please refer to: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
The following contents are added inside <configuration></configuration>:
vim hdfs-site.xml

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:50090</value>
    <description>The secondary namenode http server address and port.</description>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/java-service/hadoop/dfs/name</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/java-service/hadoop/dfs/data</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///data/java-service/hadoop/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
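    <description>Default block replication; 2 matches the two DataNodes in this cluster.</description>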
</property>

3. Modify the configuration file mapred-site.xml

cp mapred-site.xml mapred-site.xml.cp
For detailed parameters, please refer to: https://hadoop.apache.org/docs/r3.3.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
The following contents are added inside <configuration></configuration>:

 vim mapred-site.xml

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
</property>
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
    <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
    <description>MapReduce JobHistory Server Web UI host:port</description>
</property>
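
Note: on Hadoop 3.x, MapReduce jobs submitted to YARN can fail with "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" if the containers do not know where MapReduce is installed. If that happens, a commonly used addition to mapred-site.xml (using the install path from this guide) is:

<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/data/java-service/hadoop</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/data/java-service/hadoop</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/data/java-service/hadoop</value>
</property>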

4. Modify the configuration file yarn-site.xml

cp yarn-site.xml yarn-site.xml.cp
Default configuration link: http://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
The following contents are added inside <configuration></configuration>:

 vim yarn-site.xml

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
    <description>The hostname of the RM.</description>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Shuffle service that needs to be set for Map Reduce applications.</description>
</property>

5. vim hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64

6. vim workers

slave01
slave02

7. Change log write permission

mkdir -p /data/java-service/hadoop/logs
chown hadoop:hadoop -R /data/java-service/hadoop/logs/

8. Copy the hadoop directory to the two slave nodes

scp -r /data/java-service/hadoop root@slave01:/data/java-service/
scp -r /data/java-service/hadoop root@slave02:/data/java-service/
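
Because the copy runs as root, the files on the slaves end up owned by root; as on master, the hadoop user needs ownership there too:

ssh root@slave01 chown -R hadoop:hadoop /data/java-service/hadoop
ssh root@slave02 chown -R hadoop:hadoop /data/java-service/hadoop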

The two slave nodes also need the environment variables configured in /etc/profile.
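
With configuration in place, format the NameNode once and start the cluster from master as the hadoop user (standard Hadoop commands; the errors below were hit at this step):

hdfs namenode -format
start-dfs.sh
start-yarn.sh
# jps should show NameNode, SecondaryNameNode and ResourceManager on master,
# and DataNode and NodeManager on the slaves
jps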
 

Error messages

1. ERROR: Cannot set priority of namenode process

Fix: check the NameNode log under the logs directory for the underlying cause; in a setup like this it is typically a permission problem on the Hadoop directories (see the chown steps above).

2. Does not contain a valid host:port authority

Fix: the investigation showed that the host's hostname was invalid, so it was changed to contain no illegal characters such as '.' or '_'.

3. Directory /data/java-service/hadoop/dfs/name is in an inconsistent state

Fix: permission issue; run chown -R hadoop:hadoop /data/java-service/hadoop/dfs. The same error can also appear if the NameNode has not yet been formatted (see the start-up commands above).

References used for error reporting and debugging:

Ubuntu 18.04 configuring Hadoop 3.1.1, some problem records (ygd11, CSDN blog)
Error when starting Hadoop: ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain (CSDN blog)
Hadoop error: Does not contain a valid host:port authority (Damo Chuiyang, cnblogs)
