Memo: Deploying a Hadoop 2.7.3 Distributed Cluster on a Single Machine with Docker 17.03.1
[TOC]
Statement
All articles here are my own technical notes. Please indicate the source when reproducing them: https://segmentfault.com/u/yzwall
0 Docker and Hadoop version description
PC: Ubuntu 16.04.1 LTS
Docker version: 17.03.1-ce, OS/Arch: linux/amd64
Hadoop version: hadoop-2.7.3
1 Building the Hadoop image in Docker
1.1 Create docker container
Create a container based on the ubuntu image; by default Docker pulls the latest slimmed-down ubuntu image.
sudo docker run -ti --name container ubuntu /bin/bash
1.2 Modify /etc/apt/sources.list
Modify the default source file /etc/apt/sources.list and replace the official sources with a domestic (Chinese) mirror.
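As a minimal sketch (assuming the Aliyun mirror; any domestic mirror works), the official sources can be swapped in place with sed after backing up the original file:

# Back up the original source list, then point it at a domestic mirror (Aliyun assumed here)
cp /etc/apt/sources.list /etc/apt/sources.list.bak
sed -i 's|http://archive.ubuntu.com|http://mirrors.aliyun.com|g' /etc/apt/sources.list
apt-get update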
1.3 Installation of Java 8
# To keep the image small, the Docker ubuntu image omits many built-in components; refresh the package lists first
apt-get update
# software-properties-common provides the add-apt-repository command
apt-get install software-properties-common python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer
java -version
1.4 Installation of hadoop-2.7.3 in docker
1.4.1 Download the hadoop-2.7.3 release
# Create the multi-level directory
mkdir -p /software/apache/hadoop
cd /software/apache/hadoop
# Download and unpack hadoop
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop-2.7.3.tar.gz
1.4.2 Configure environment variables
Modify the ~/.bashrc file and add the following configuration at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/software/apache/hadoop/hadoop-2.7.3
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Run source ~/.bashrc to make the environment variable configuration take effect.
Note: once ~/.bashrc is configured as above, hadoop-env.sh does not need to be modified separately.
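A quick way to confirm the variables are picked up (a simple check of my own, not part of the original steps) is to reload the shell configuration and ask Hadoop for its version:

# Reload the shell configuration and verify that the hadoop binary is on PATH
source ~/.bashrc
echo $JAVA_HOME
hadoop version   # should print "Hadoop 2.7.3" and related build info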
1.5 Configure hadoop
Configuring Hadoop mainly involves four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Create namenode, datanode, and tmp directories under $HADOOP_HOME
cd $HADOOP_HOME
mkdir tmp
mkdir namenode
mkdir datanode
1.5.1 Configure core-site.xml
The configuration item hadoop.tmp.dir points to the tmp directory created above.
The configuration item fs.default.name points to the master node and is configured as hdfs://master:9000 (in Hadoop 2.x, fs.defaultFS is the preferred name for this property).
<configuration>
    <property>
        <!-- hadoop temp dir -->
        <name>hadoop.tmp.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <!-- Size of read/write buffer used in SequenceFiles. -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
        <description>The name of the default file system.</description>
    </property>
</configuration>
1.5.2 Configure hdfs-site.xml
dfs.replication sets the block replication factor; this cluster has one namenode and three datanodes, and the replication factor is set to 3.
dfs.namenode.name.dir and dfs.datanode.data.dir are configured with the paths of the NameNode and DataNode directories created earlier.
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <final>true</final>
        <description>Default block replication.</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/software/apache/hadoop/hadoop-2.7.3/datanode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
1.5.3 Configure mapred-site.xml
Create mapred-site.xml from its template with the cp command under $HADOOP_CONFIG_HOME (i.e. $HADOOP_HOME/etc/hadoop, where the template file lives):
cd $HADOOP_CONFIG_HOME
cp mapred-site.xml.template mapred-site.xml
Configure mapred-site.xml. In earlier Hadoop versions the configuration item mapred.job.tracker pointed to the master node, but this is no longer needed:
In Hadoop 2.x, users do not need to configure mapred.job.tracker, because the JobTracker no longer exists; its functionality is implemented by the MRAppMaster component, so the runtime framework is instead specified as yarn via mapreduce.framework.name.
—— Hadoop Technology Insider: Deep Analysis of YARN Architecture Design and Implementation Principles
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
1.5.4 Configure yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
1.5.5 Install vim, ifconfig, and ping
Install vim and the packages that provide the ifconfig and ping commands:
apt-get update
apt-get install vim
apt-get install net-tools       # for ifconfig
apt-get install inetutils-ping  # for ping
1.5.6 Build the Hadoop base image
Assuming the current container is named container, save it as the base image ubuntu:hadoop. Subsequent Hadoop cluster containers are created and started from this image, so the configuration does not have to be repeated.
sudo docker commit -m "hadoop installed" container ubuntu:hadoop
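To double-check that the image was saved (a quick sanity check of my own, not in the original steps), list the local images on the host:

# The new base image should show up with repository "ubuntu" and tag "hadoop"
sudo docker images | grep hadoop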
2. Hadoop Distributed Cluster Construction
2.1 Create the container cluster from the Hadoop base image
The master container and the slave1-slave3 containers are created from the base image ubuntu:hadoop, with host names identical to the container names.
Create master: docker run -ti -h master --name master ubuntu:hadoop /bin/bash
Create slave1: docker run -ti -h slave1 --name slave1 ubuntu:hadoop /bin/bash
Create slave2: docker run -ti -h slave2 --name slave2 ubuntu:hadoop /bin/bash
Create slave3: docker run -ti -h slave3 --name slave3 ubuntu:hadoop /bin/bash
2.2 Configure each container hosts file
Add the following entries to /etc/hosts in each container. The IP address of each container can be checked with ifconfig:
172.17.0.2 master
172.17.0.3 slave1
172.17.0.4 slave2
172.17.0.5 slave3
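Alternatively (a convenience of my own, not part of the original steps), the IP of each container can be read from the host with docker inspect instead of running ifconfig inside every container:

# Print the bridge-network IP of each cluster container from the host
for c in master slave1 slave2 slave3; do
  echo -n "$c "
  sudo docker inspect -f '{{ .NetworkSettings.IPAddress }}' "$c"
done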
Note: after a container restarts, its hosts entries may become invalid. For now, the only workaround I know is to avoid restarting the containers frequently; otherwise the hosts file must be reconfigured manually.
Reference: http://dockone.io/question/400
1. /etc/hosts, /etc/resolv.conf, and /etc/hostname inside the container do not exist in the image but under /var/lib/docker/containers/<container_id>. When the container starts, these files are mounted into it. Therefore, if they are modified inside the container, the changes are not stored in the container's top layer but written directly into those three physical files.
2. Why are the modifications gone after a restart? Every time Docker starts a container, it rebuilds a fresh /etc/hosts file. The reason is that the container's IP address may change on restart, which would make the old entries in the hosts file invalid, so rebuilding the file avoids stale data.
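As a workaround sketch (my own addition, assuming the containers are restarted in the same order so they receive the same bridge IPs), the lost entries can be re-appended with a small script run inside each container after a restart; the script name refresh_hosts.sh is hypothetical:

#!/bin/bash
# refresh_hosts.sh - re-append the cluster entries to /etc/hosts after a container restart
# (assumes the containers received the same IPs as before; verify with ifconfig first)
cat >> /etc/hosts <<EOF
172.17.0.2 master
172.17.0.3 slave1
172.17.0.4 slave2
172.17.0.5 slave3
EOF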
2.3 Cluster Node SSH Configuration
2.3.1 All nodes: install ssh
apt-get update
apt-get install ssh
apt-get install openssh-server
2.3.2 All nodes: generate an RSA key pair
# Generate a key pair with an empty passphrase; the generated keys are stored under ~/.ssh
ssh-keygen -t rsa -P ""
2.3.3 master node: generate the authorized_keys file
Write the generated public key into authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
2.3.4 All nodes: modify the sshd_config file
Modify the sshd_config file so that ssh can log in remotely as root on the other nodes.
vim /etc/ssh/sshd_config
# Modify PermitRootLogin prohibit-password to PermitRootLogin yes
# Restart the ssh service
service ssh restart
2.3.5 master node: transfer authorized_keys to the slave nodes via scp
Copy authorized_keys from the master node to ~/.ssh on every slave node, overwriting the file of the same name, so that the keys on all nodes are consistent and the nodes can reach each other via ssh.
cd ~/.ssh
scp authorized_keys root@slave1:~/.ssh/
scp authorized_keys root@slave2:~/.ssh/
scp authorized_keys root@slave3:~/.ssh/
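As a side note (my own alternative, assuming ssh-copy-id is available in the image and password login to the slaves is still possible), the master's public key can also be installed on each slave with ssh-copy-id, which appends the key instead of overwriting the whole file:

# Install the master's public key into each slave's ~/.ssh/authorized_keys
ssh-copy-id root@slave1
ssh-copy-id root@slave2
ssh-copy-id root@slave3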
2.3.6 slave nodes: fix the permissions of authorized_keys so the key is accepted
chmod 600 ~/.ssh/authorized_keys
Note:
Check whether the ssh service is running: ps -e | grep ssh
Start the ssh service: service ssh start
Restart the ssh service: service ssh restart
After the steps above (2.3.1-2.3.6), the containers can reach each other via ssh.
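A quick passwordless-login check from the master node (a verification of my own, not in the original steps):

# Each command should print the slave's hostname without prompting for a password
ssh root@slave1 hostname
ssh root@slave2 hostname
ssh root@slave3 hostname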
2.4 master Node Configuration
On the master node, modify the slaves file to configure the slave nodes:
cd $HADOOP_CONFIG_HOME/
vim slaves
Overwrite its contents with the following:
slave1
slave2
slave3
2.5 Start hadoop cluster
On the master node, execute hdfs namenode -format. A message similar to the following indicates that the namenode was formatted successfully:
common.Storage: Storage directory /software/apache/hadoop/hadoop-2.7.3/namenode has been successfully formatted.
Start the cluster by executing start-all.sh:
root@master:/# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
The authenticity of host 'master (172.17.0.2)' can't be established.
ECDSA key fingerprint is SHA256:OewrSOYpvfDE6ixf6Gw9U7I9URT2zDCCtDJ6tjuZz/4.
Are you sure you want to continue connecting (yes/no)? yes
master: Warning: Permanently added 'master,172.17.0.2' (ECDSA) to the list of known hosts.
master: starting namenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-namenode-master.out
slave3: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave3.out
slave2: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave2.out
slave1: starting datanode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-datanode-slave1.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /software/apache/hadoop/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-resourcemanager-master.out
slave3: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave3.out
slave1: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave1.out
slave2: starting nodemanager, logging to /software/apache/hadoop/hadoop-2.7.3/logs/yarn-root-nodemanager-slave2.out
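The cluster state can also be verified directly on the master node (my own sanity check, not part of the original steps): hdfs dfsadmin -report should show three live datanodes, and by default the NameNode and ResourceManager web UIs are reachable on ports 50070 and 8088 respectively.

# Prints HDFS capacity and the datanodes registered with the namenode;
# with a healthy cluster it should report three live datanodes
hdfs dfsadmin -report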
Execute jps on the master and slave nodes respectively.
master:
root@master:/# jps
2065 Jps
1446 NameNode
1801 ResourceManager
1641 SecondaryNameNode
slave1:
1107 NodeManager
1220 Jps
1000 DataNode
slave2:
241 DataNode
475 Jps
348 NodeManager
slave3:
500 Jps
388 NodeManager
281 DataNode
3. Execute wordcount
Create the input directory /hadoopinput in HDFS and put the input file LICENSE.txt into it:
root@master:/# hdfs dfs -mkdir -p /hadoopinput
root@master:/# hdfs dfs -put LICENSE.txt /hadoopinput
Enter $HADOOP_HOME/share/hadoop/mapreduce, submit the wordcount job to the cluster, and save the results in the /hadoopoutput directory of HDFS:
root@master:/# cd $HADOOP_HOME/share/hadoop/mapreduce
root@master:/software/apache/hadoop/hadoop-2.7.3/share/hadoop/mapreduce# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /hadoopinput /hadoopoutput
17/05/26 01:21:34 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
17/05/26 01:21:35 INFO input.FileInputFormat: Total input paths to process : 1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: number of splits:1
17/05/26 01:21:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1495722519742_0001
17/05/26 01:21:36 INFO impl.YarnClientImpl: Submitted application application_1495722519742_0001
17/05/26 01:21:36 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1495722519742_0001/
17/05/26 01:21:36 INFO mapreduce.Job: Running job: job_1495722519742_0001
17/05/26 01:21:43 INFO mapreduce.Job: Job job_1495722519742_0001 running in uber mode : false
17/05/26 01:21:43 INFO mapreduce.Job:  map 0% reduce 0%
17/05/26 01:21:48 INFO mapreduce.Job:  map 100% reduce 0%
17/05/26 01:21:54 INFO mapreduce.Job:  map 100% reduce 100%
17/05/26 01:21:55 INFO mapreduce.Job: Job job_1495722519742_0001 completed successfully
17/05/26 01:21:55 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=29366
        FILE: Number of bytes written=295977
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=84961
        HDFS: Number of bytes written=22002
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2922
        Total time spent by all reduces in occupied slots (ms)=3148
        Total time spent by all map tasks (ms)=2922
        Total time spent by all reduce tasks (ms)=3148
        Total vcore-milliseconds taken by all map tasks=2922
        Total vcore-milliseconds taken by all reduce tasks=3148
        Total megabyte-milliseconds taken by all map tasks=2992128
        Total megabyte-milliseconds taken by all reduce tasks=3223552
    Map-Reduce Framework
        Map input records=1562
        Map output records=12371
        Map output bytes=132735
        Map output materialized bytes=29366
        Input split bytes=107
        Combine input records=12371
        Combine output records=1906
        Reduce input groups=1906
        Reduce shuffle bytes=29366
        Reduce input records=1906
        Reduce output records=1906
        Spilled Records=3812
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=78
        CPU time spent (ms)=1620
        Physical memory (bytes) snapshot=451264512
        Virtual memory (bytes) snapshot=3915927552
        Total committed heap usage (bytes)=348127232
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=84854
    File Output Format Counters
        Bytes Written=22002
The calculation results are stored in /hadoopoutput/part-r-00000. View the results:
root@master:/# hdfs dfs -ls /hadoopoutput
Found 2 items
-rw-r--r--   3 root supergroup          0 2017-05-26 01:21 /hadoopoutput/_SUCCESS
-rw-r--r--   3 root supergroup      22002 2017-05-26 01:21 /hadoopoutput/part-r-00000

root@master:/# hdfs dfs -cat /hadoopoutput/part-r-00000
""AS            2
"AS             16
"COPYRIGHTS     1
"Contribution"  2
"Contributor"   2
"Derivative     1
"Legal          1
"License"       1
"License");     1
"Licensed       1
"Licensor"      1
...
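One small note from my side: MapReduce refuses to write into an existing output directory, so if the wordcount job is re-run, the old output has to be removed first, for example:

# Remove the previous output directory before re-submitting the job
hdfs dfs -rm -r /hadoopoutput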
At this point, the Hadoop 2.7.3 cluster has been deployed successfully on a single machine with Docker 17.03.1!
Reference resources
[1] http://tashan10.com/yong-dockerda-jian-hadoopwei-fen-bu-shi-ji-qun/
[2] http://blog.csdn.net/xiaoxiangzi222/article/details/52757168