Memo: deploying a Hadoop 2.7.3 distributed cluster on a single machine with Docker 17.03.1

[TOC]

Statement

All articles are my own technical notes; please indicate the source when reproducing them: https://segmentfault.com/u/yzwall

0 Docker and Hadoop version information

  • PC: ubuntu 16.04.1 LTS

  • Docker version: 17.03.1-ce OS/Arch:linux/amd64

  • Hadoop version: hadoop-2.7.3

1 Building a Hadoop image in Docker

1.1 Create a Docker container

Create a container based on the ubuntu image; by default this pulls the latest streamlined Ubuntu image.
sudo docker run -ti --name container ubuntu

1.2 Modify /etc/apt/sources.list

Modify the default source file /etc/apt/sources.list and replace the official sources with a domestic mirror.
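For example, a minimal sketch that switches to the Aliyun mirror (the mirror URL is an assumption, not from the original memo; any domestic mirror works the same way):

# Back up the original list, then point the sources at the mirror
cp /etc/apt/sources.list /etc/apt/sources.list.bak
sed -i 's|http://archive.ubuntu.com|http://mirrors.aliyun.com|g' /etc/apt/sources.list
sed -i 's|http://security.ubuntu.com|http://mirrors.aliyun.com|g' /etc/apt/sources.list
apt-get update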

1.3 Installation of Java 8

# To keep the image small, the Docker Ubuntu image omits many built-in packages; they are obtained via `apt-get update` and `apt-get install`.
apt-get update
apt-get install software-properties-common python-software-properties # provides add-apt-repository
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer
java -version

1.4 Installation of hadoop-2.7.3 in docker

1.4.1 Download hadoop-2.7.3

# Create multilevel directories
mkdir -p /software/apache/hadoop
cd /software/apache/hadoop
# Download and unzip hadoop
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop-2.7.3.tar.gz
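A quick check (not in the original memo) that the archive unpacked correctly:

ls hadoop-2.7.3
# expected top-level entries include: bin  etc  include  lib  libexec  sbin  share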

1.4.2 Configure environment variables

Modify the ~/.bashrc file. Add the following configuration information at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/software/apache/hadoop/hadoop-2.7.3
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Run source ~/.bashrc to make the environment variable configuration take effect.
Note: once ~/.bashrc is configured as above, hadoop-env.sh does not need to be reconfigured.
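As a quick check (assuming the steps above), the hadoop command should now be on the PATH:

source ~/.bashrc
hadoop version   # should report Hadoop 2.7.3 if HADOOP_HOME and PATH are set correctly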

1.5 Configure hadoop

Configuring Hadoop mainly involves four files under $HADOOP_CONFIG_HOME: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

Create namenode, datanode, and tmp directories under $HADOOP_HOME

cd $HADOOP_HOME
mkdir tmp
mkdir namenode
mkdir datanode

1.5.1 Configure core-site.xml

  • The configuration item hadoop.tmp.dir points to the tmp directory created above

  • The configuration item fs.default.name points to the master node and is configured as hdfs://master:9000

<configuration>
    <property>
        <!-- hadoop temp dir  -->
        <name>hadoop.tmp.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <!-- Size of read/write buffer used in SequenceFiles. -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
        <description>The name of the default file system.</description>
    </property>
</configuration>

1.5.2 Configure hdfs-site.xml

  • dfs.replication specifies the number of block replicas; the cluster has one namenode and three datanodes, and the replication factor is set to 3;

  • dfs.namenode.name.dir and dfs.datanode.data.dir are set to the namenode and datanode directory paths created earlier

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <final>true</final>
        <description>Default block replication.</description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/namenode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/datanode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

1.5.3 Configure mapred-site.xml

Create mapred-site.xml from the template under $HADOOP_CONFIG_HOME using the cp command:

cd $HADOOP_CONFIG_HOME
cp mapred-site.xml.template mapred-site.xml

Configure mapred-site.xml. Note that the old configuration item mapred.job.tracker is no longer needed:

In Hadoop 2.x, users no longer need to configure mapred.job.tracker, because the JobTracker no longer exists and its functionality is implemented by the MRAppMaster component; therefore the running framework must be specified as yarn via mapreduce.framework.name.

—— Hadoop Technology Insider: Deep Analysis of YARN Architecture Design and Implementation Principles

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
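Note that start-all.sh does not start the JobHistory Server, so the two jobhistory addresses above only respond once it is launched separately; a sketch (optional, not part of the original memo; run on the master after the cluster is up):

mr-jobhistory-daemon.sh start historyserver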

1.5.4 Configure yarn-site.xml

<configuration>
    <property>  
        <name>yarn.nodemanager.aux-services</name>  
        <value>mapreduce_shuffle</value>  
    </property>  
    <property>                                                                  
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>  
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.address</name>  
        <value>master:8032</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.scheduler.address</name>  
        <value>master:8030</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.resource-tracker.address</name>  
        <value>master:8031</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.admin.address</name>  
        <value>master:8033</value>  
    </property>  
    <property>  
        <name>yarn.resourcemanager.webapp.address</name>  
        <value>master:8088</value>  
    </property>  
</configuration>

1.5.5 Install vim, ifconfig and ping

Install vim and the packages that provide the ifconfig and ping commands:

apt-get update
apt-get install vim
apt-get install net-tools       # for ifconfig 
apt-get install inetutils-ping  # for ping

1.5.6 Build the Hadoop base image

Assuming the current container is named container, save it as the base image ubuntu:hadoop; subsequent Hadoop cluster containers are created and started from this image without repeating the configuration.
sudo docker commit -m "hadoop installed" container ubuntu:hadoop
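A quick check (not from the original memo) that the image was created:

sudo docker images | grep hadoop   # the ubuntu:hadoop image should be listed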

2. Hadoop Distributed Cluster Construction

2.1 Create the cluster containers from the Hadoop base image

The master container and the slave1-slave3 containers are created from the base image ubuntu:hadoop; their host names are set to match the container names.
Create master: docker run -ti -h master --name master ubuntu:hadoop /bin/bash
Create slave1: docker run -ti -h slave1 --name slave1 ubuntu:hadoop /bin/bash
Create slave2: docker run -ti -h slave2 --name slave2 ubuntu:hadoop /bin/bash
Create slave3: docker run -ti -h slave3 --name slave3 ubuntu:hadoop /bin/bash

2.2 Configure each container hosts file

Add the following entries to /etc/hosts in each container. Each container's IP address can be checked with ifconfig inside the container (or from the host, as sketched after the list):

172.17.0.2 master
172.17.0.3 slave1
172.17.0.4 slave2
172.17.0.5 slave3
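Alternatively, the IPs can be read from the host without entering each container; a sketch using docker inspect (assuming the default bridge network):

# Run on the host, not inside a container
for c in master slave1 slave2 slave3; do
  echo -n "$c: "
  sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$c"
done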

Note: after a container restarts, the hosts entries may be lost. For now I know of no better workaround than avoiding frequent container restarts; otherwise the hosts file has to be reconfigured by hand.

Reference: http://dockone.io/question/400

1. /etc/hosts, /etc/resolv.conf and /etc/hostname inside the container do not exist in the image but under /var/lib/docker/containers/<container_id> on the host. When the container starts, these files are mounted into it, so modifications made inside the container are not stored in the container's top layer but are written directly into the three host files.
2. Why are the modifications gone after a restart? Every time Docker starts a container, it rebuilds a fresh /etc/hosts. The reason is that the container's IP address may change on restart, making the old entries in the hosts file stale, so rebuilding the file is reasonable; otherwise dirty data would accumulate.

2.3 Cluster Node SSH Configuration

2.3.1 All nodes: install ssh

apt-get update
apt-get install ssh
apt-get install openssh-server

2.3.2 All nodes: generate a passphrase-less key

# Generate a key pair with an empty passphrase; the generated keys are under ~/.ssh
ssh-keygen -t rsa -P ""

2.3.3 master node: generate the authorized_keys file

Write the generated public key into authorized_keys

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

2.3.4 All nodes: modify the sshd_config file

Modify the sshd_config file so that root can log in remotely via ssh on each node.

vim /etc/ssh/sshd_config
# Modify PermitRootLogin prohibit-password to PermitRootLogin yes
# Restart ssh service
service ssh restart

2.3.5 master node: copy authorized_keys to the slave nodes via scp

Copy authorized_keys from the master node to ~/.ssh on each slave node, overwriting the file of the same name; this keeps authorized_keys identical on all nodes so that any node can be reached via ssh.

cd ~/.ssh
scp authorized_keys root@slave1:~/.ssh/
scp authorized_keys root@slave2:~/.ssh/
scp authorized_keys root@slave3:~/.ssh/

2.3.6 slave nodes: set the permissions of authorized_keys to keep it valid

chmod 600 ~/.ssh/authorized_keys

Notes

  • Check whether the ssh service is running: ps -e | grep ssh

  • Open ssh service: service ssh start

  • Restart ssh service: service ssh restart

After completing steps 2.3.1-2.3.6, the containers can reach each other via ssh.
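As a quick sanity check (not in the original memo), each of the following, run on master, should print the slave's hostname without asking for a password (only the one-time host key confirmation may appear):

ssh root@slave1 hostname
ssh root@slave2 hostname
ssh root@slave3 hostname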

2.4 master Node Configuration

On the master node, modify the slaves file to list the slave nodes:

cd $HADOOP_CONFIG_HOME/
vim slaves

Overwrite its contents with the following:

slave1
slave2
slave3

2.5 Start hadoop cluster

On the master node:

  • Execute hdfs namenode -format; output like the following indicates that the namenode was formatted successfully:

common.Storage: Storage directory /software/apache/hadoop/hadoop-2.7.3/namenode has been successfully formatted.
  • Start the cluster by executing start-all.sh:

root@master:/# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
The authenticity of host 'master (172.17.0.2)' can't be established.
ECDSA key fingerprint is SHA256:OewrSOYpvfDE6ixf6Gw9U7I9URT2zDCCtDJ6tjuZz/4.
Are you sure you want to continue connecting (yes/no)? yes
master: Warning: Permanently added 'master,172.17.0.2' (ECDSA) to the list of known hosts.
master: starting namenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-namenode-master.out
slave3: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave3.out
slave2: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave2.out
slave1: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave1.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-resourcemanager-master.out
slave3: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave3.out
slave1: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave1.out
slave2: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave2.out

Execute jps on the master and the slave nodes respectively (an additional HDFS check is sketched after the listings):

  • master:

root@master:/# jps
2065 Jps
1446 NameNode
1801 ResourceManager
1641 SecondaryNameNode
  • slave1:

1107 NodeManager
1220 Jps
1000 DataNode
  • slave2:

241 DataNode
475 Jps
348 NodeManager
  • slave3:

500 Jps
388 NodeManager
281 DataNode
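As an additional check (not part of the original log), running the following on the master should report three live datanodes:

hdfs dfsadmin -report   # the summary should show three live datanodes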

3. Execute wordcount

Create the input directory /hadoopinput in HDFS and put the input file LICENSE.txt into it:

root@master:/# hdfs dfs -mkdir -p /hadoopinput
root@master:/# hdfs dfs -put LICENSE.txt /hadoopinput

Enter $HADOOP_HOME/share/hadoop/mapreduce, submit the wordcount job to the cluster, and save the results in the /hadoopoutput directory of HDFS:

root@master:/# cd $HADOOP_HOME/share/hadoop/mapreduce
root@master:/software/apache/hadoop/hadoop-2.7.3/share/hadoop/mapreduce# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /hadoopinput /hadoopoutput
17/05/26 01:21:34 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
17/05/26 01:21:35 INFO input.FileInputFormat: Total input paths to process : 1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: number of splits:1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1495722519742_0001
17/05/26 01:21:36 INFO impl.YarnClientImpl: Submitted application application_1495722519742_0001
17/05/26 01:21:36 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1495722519742_0001/
17/05/26 01:21:36 INFO mapreduce.Job: Running job: job_1495722519742_0001
17/05/26 01:21:43 INFO mapreduce.Job: Job job_1495722519742_0001 running in uber mode : false
17/05/26 01:21:43 INFO mapreduce.Job:  map 0% reduce 0%
17/05/26 01:21:48 INFO mapreduce.Job:  map 100% reduce 0%
17/05/26 01:21:54 INFO mapreduce.Job:  map 100% reduce 100%
17/05/26 01:21:55 INFO mapreduce.Job: Job job_1495722519742_0001 completed successfully
17/05/26 01:21:55 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=29366
        FILE: Number of bytes written=295977
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=84961
        HDFS: Number of bytes written=22002
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2922
        Total time spent by all reduces in occupied slots (ms)=3148
        Total time spent by all map tasks (ms)=2922
        Total time spent by all reduce tasks (ms)=3148
        Total vcore-milliseconds taken by all map tasks=2922
        Total vcore-milliseconds taken by all reduce tasks=3148
        Total megabyte-milliseconds taken by all map tasks=2992128
        Total megabyte-milliseconds taken by all reduce tasks=3223552
    Map-Reduce Framework
        Map input records=1562
        Map output records=12371
        Map output bytes=132735
        Map output materialized bytes=29366
        Input split bytes=107
        Combine input records=12371
        Combine output records=1906
        Reduce input groups=1906
        Reduce shuffle bytes=29366
        Reduce input records=1906
        Reduce output records=1906
        Spilled Records=3812
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=78
        CPU time spent (ms)=1620
        Physical memory (bytes) snapshot=451264512
        Virtual memory (bytes) snapshot=3915927552
        Total committed heap usage (bytes)=348127232
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=84854
    File Output Format Counters 
        Bytes Written=22002

The results are stored in /hadoopoutput/part-r-00000. View them:

root@master:/# hdfs dfs -ls /hadoopoutput
Found 2 items
-rw-r--r--   3 root supergroup          0 2017-05-26 01:21 /hadoopoutput/_SUCCESS
-rw-r--r--   3 root supergroup      22002 2017-05-26 01:21 /hadoopoutput/part-r-00000

root@master:/# hdfs dfs -cat /hadoopoutput/part-r-00000
""AS    2
"AS    16
"COPYRIGHTS    1
"Contribution"    2
"Contributor"    2
"Derivative    1
"Legal    1
"License"    1
"License");    1
"Licensed    1
"Licensor"    1
...
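If the job needs to be re-run, note that MapReduce will not overwrite an existing output directory; remove it first:

hdfs dfs -rm -r /hadoopoutput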

So far, the single-machine deployment of a Hadoop 2.7.3 distributed cluster on Docker 17.03.1 has succeeded!

Reference resources

[1] http://tashan10.com/yong-dockerda-jian-hadoopwei-fen-bu-shi-ji-qun/
[2] http://blog.csdn.net/xiaoxiangzi222/article/details/52757168
