[Docker x Hadoop] Use Docker to build a Hadoop cluster (from scratch)

0. Background

The online tutorials I found clone multiple virtual machines to simulate a cluster.
On a real server, however, that approach turned out not to work.
That is when Docker came to mind. I had not really put Docker to work since learning it, so this was a good opportunity.

The idea is simple: create multiple CentOS containers in Docker,
and treat each container as a server to simulate the cluster environment.

All right, do it!
If you are not familiar with Docker, you can refer to this earlier post: [Track of Java] Introduction to Docker (with pit-stepping experience)

This post uses Hadoop 3.1.3; the steps follow the Silicon Valley video tutorial

1. Create the first container

1)	First, pull the CentOS image
	docker pull centos

2)	Create a CentOS container
	docker run --privileged=true --name=hadoop001 -dit centos /sbin/init
	
	-dit runs the container in the background with an interactive TTY; with only -it, the container stops as soon as you exit the session
	--privileged=true together with /sbin/init starts the container in privileged mode, which is needed later for the service command
	Without these, you will hit the PID 1 error when you try to use service later
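
To work inside the container for the next steps, a minimal way in (assuming the container name hadoop001 chosen above):

docker ps                              # check that the container is running
docker exec -it hadoop001 /bin/bash    # open a shell inside it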

2. Configure the Java and Hadoop environments for the container

Copy the JDK and Hadoop distributions from the host into the container

docker cp <jdk path on host> <container id>:<path in container>
docker cp <hadoop path on host> <container id>:<path in container>

Create the file mydev.sh under /etc/profile.d/
vim mydev.sh and fill it in as follows

# Java
export JAVA_HOME=<JDK path in container>
export PATH=$PATH:$JAVA_HOME/bin

# Hadoop
export HADOOP_HOME=<Hadoop path in container>
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# Constant
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

The first two blocks set up the Java and Hadoop environments; the last block defines variables that are needed when starting the cluster as root, otherwise startup may fail
Finally, run source /etc/profile to make the configuration take effect
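
Putting it together, a minimal sketch (the JDK path is an assumed example; the Hadoop path matches the one used later in this post):

# on the host: copy the archives into the container (example paths)
docker cp /opt/jdk1.8.0_212 hadoop001:/home/java/jdk1.8.0_212
docker cp /opt/hadoop-3.1.3 hadoop001:/home/hadoop/hadoop-3.1.3
# then, inside the container, after writing /etc/profile.d/mydev.sh:
source /etc/profile
java -version
hadoop version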

3. Configure the container to allow SSH connections from outside

We need to connect the containers to each other to form a cluster, but by default a container cannot be reached over SSH by other containers, so some configuration is needed

1)	First, set the root password (enter the command below, press Enter, then type the new password)
	passwd root
	It will most likely complain that the command does not exist; install it first, then set the password
	yum install passwd

2)	Then install SSH; the following three steps install both the server and the client
	yum install openssh
	yum install openssh-server
	yum install openssh-clients

3)	Then check the configuration file and make sure the following two options are set as shown (they should be by default)
	vim /etc/ssh/sshd_config
	  PermitRootLogin yes
	  PasswordAuthentication yes
	These two options allow external hosts to connect to this server (container) over SSH

4)	If they are not, modify them and restart sshd for the change to take effect
	service sshd restart
	Again, if the command does not exist, install it first (a consolidated sketch of all of the above follows this list)
	yum install initscripts
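
For reference, the whole SSH setup condensed into one sketch (run inside the container; package names as used above):

yum install -y passwd openssh openssh-server openssh-clients initscripts net-tools
passwd root                    # set the root password
vim /etc/ssh/sshd_config       # make sure PermitRootLogin yes and PasswordAuthentication yes
service sshd restart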

With the above done, the container can be reached over SSH from outside. Let's try it:

1)	First, inside the container, use ifconfig to find the IP address
	ifconfig
	If the command is missing, install it [doge]
	yum install net-tools

2)	Then, on the host server, try to connect to that IP over SSH
	ssh <IP>
	It will ask for the password we set above

ohhh, and we are in: this time we entered the container over SSH rather than through docker

4. Clone multiple containers from the container

First, to clone multiple containers, package the existing container into an image and then create new containers from that image. The steps are as follows:

1)	Create an image from the container (5a8 is the container's short ID)
	docker commit -a='IceClean' -m='Hadoop cluster example through docker by IceClean' 5a8 hadoop-docker

2)	Clone multiple containers from this image to simulate the cluster (here, a three-server cluster)
	docker run --privileged=true --name=hadoop002 -dit hadoop-docker /sbin/init
	docker run --privileged=true --name=hadoop003 -dit hadoop-docker /sbin/init

After these containers are created, configure hostname-to-IP mappings for them
so that we never have to remember each container's IP address.

However, because of how Docker's bridge mode works, IP addresses are handed out in order every time a container restarts, so a container's IP is not fixed and can change, which breaks any mapping we set up. The next step is therefore to pin a fixed IP to each container (pipework is used for this)

1)	Install pipework (clone it and copy the executable into a bin directory)
	git clone https://github.com/jpetazzo/pipework
	cp ~/pipework/pipework /usr/local/bin/

2)	Install bridge-utils
	yum -y install bridge-utils

3)	Create the bridge network (the IP address here is up to you)
	brctl addbr br0
	ip link set dev br0 up
	ip addr add 172.16.10.1/24 dev br0

4)	Pin a fixed IP to each container (see the consolidated sketch after this list)
	pipework br0 <container name> <IP/mask>, for example:
	pipework br0 hadoop001 172.16.10.10/24
	pipework br0 hadoop002 172.16.10.11/24
	pipework br0 hadoop003 172.16.10.12/24

5)	Test that these IPs are reachable
	ping 172.16.10.10

6)	If you slip up along the way (like me --) and want to delete the bridge or an IP, you can do the following:
	Delete the bridge:	ip link set dev br0 down
				brctl delbr br0
	Delete an IP:	ip link set dev <name> down
				where <name> is the veth1plxxx interface shown by ifconfig
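
The bridge and pipework steps condensed into one sketch, run on the host (addresses as chosen above; note that if a container is restarted, the pipework assignment may need to be re-run, so it is worth re-checking with ping):

brctl addbr br0 && ip link set dev br0 up
ip addr add 172.16.10.1/24 dev br0
i=10
for c in hadoop001 hadoop002 hadoop003; do
    pipework br0 $c 172.16.10.$i/24    # hadoop001 -> .10, hadoop002 -> .11, hadoop003 -> .12
    i=$((i + 1))
done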

With that, each container has a fixed IP, and the name mapping can now be configured inside the containers

1)	In each container, edit the hosts file (a sketch for pushing these entries to every container follows this list)
	vim /etc/hosts
		For example, in my case, add:
		172.16.10.10 hadoop001
		172.16.10.11 hadoop002
		172.16.10.12 hadoop003

2)	Once configured, containers can be reached conveniently by name, for example:
	ssh hadoop003
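
A minimal sketch for appending the mapping to every container's /etc/hosts from the host (assumes sshd is running in each container; you will be prompted for each root password until section 5 is done):

for ip in 172.16.10.10 172.16.10.11 172.16.10.12; do
    ssh root@$ip "printf '172.16.10.10 hadoop001\n172.16.10.11 hadoop002\n172.16.10.12 hadoop003\n' >> /etc/hosts"
done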

5. Configure passwordless SSH login for each container

Although the containers can already connect to each other, typing a password every time is extremely inconvenient
Moreover, when the cluster is started later, startup may fail without passwordless login

So we configure passwordless login between the containers in the cluster, so that they can connect to each other smoothly
First, let's go over how passwordless login works (using hadoop001 and hadoop002 as examples)

On hadoop001, ssh-keygen generates a key pair
——a public key and a private key; the private key stays secret, the public key can be handed out
If hadoop001's public key is then copied to hadoop002
——using the copy command below, the public key is appended to authorized_keys on hadoop002
From then on, when hadoop001 connects to hadoop002, the key pair is used for authentication and no password is needed

So passwordless login is achieved by copying a server's public key to the other server, which then lets it log in there without a password. If two servers both want passwordless login, they naturally have to exchange public keys. Let's get to it!

1)	First, generate the key pair on hadoop001 (just press Enter all the way; nothing needs to be typed)
	ssh-keygen -t rsa
	Here id_rsa is the private key and id_rsa.pub is the public key

2)	Copy hadoop001's public key to hadoop002
	ssh-copy-id hadoop002

Once the command reports success, we can connect with ssh hadoop002 without a password

Note: the public key also needs to be copied to the server itself (e.g. hadoop001 copies its own public key to hadoop001); otherwise passwordless login to itself will not work

Then do the same for the other containers: each container copies its public key to itself and to the other two servers, so that all three can log in to each other without passwords. Over~
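
As a shorthand, the whole exchange can be done with a small loop (a sketch; run it on each of the three containers):

ssh-keygen -t rsa              # press Enter through all prompts
for h in hadoop001 hadoop002 hadoop003; do
    ssh-copy-id $h             # type that container's root password once
done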

6. Write the cluster configuration

All the preparation is done and every container is configured
Now the cluster itself can be set up, starting with the cluster configuration:

The planned deployment of the cluster:

        hadoop001            hadoop002                      hadoop003
HDFS    NameNode, DataNode   DataNode                       SecondaryNameNode, DataNode
YARN    NodeManager          ResourceManager, NodeManager   NodeManager

hadoop001 serves as the master node and runs the NameNode
hadoop002 takes on resource management and runs the ResourceManager
hadoop003 serves as the backup node and runs the SecondaryNameNode

Then, following this plan, modify the Hadoop configuration files
First, the configuration files and their locations (~ denotes the Hadoop installation directory):
Default configuration files: ~/share/doc/hadoop
User-defined configuration files: ~/etc/hadoop

The four files under etc that we need to customize are:
core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml

① Configure hadoop001

First, go to the master node hadoop001 and configure core-site.xml
Below is my configuration; two things need to be set:
① make hadoop001 the master node (NameNode), with the recommended port 8020
② change the default data storage location to the data folder under the Hadoop directory (created automatically if it does not exist)

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:8020</value>
    </property>

    <!-- Specify the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoop-3.1.3/data</value>
    </property>
</configuration>

Then configure hdfs-site.xml

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop001:9870</value>
    </property>

    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop003:9868</value>
    </property>

    <!-- Enable WebHDFS so files can be accessed from the web UI -->
    <!-- This one was added later after stepping in a pit; see the notes at the end -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

Then there is yarn-site.xml

<configuration>
    <!-- Specify the shuffle service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop002</value>
    </property>

    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Finally, configure mapred-site.xml

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

② Synchronize the four configuration files to the other two servers (hadoop002, hadoop003)

Executing the line below remotely copies the four configuration files to another container
Since passwordless login was set up earlier, the copy runs without a password prompt, and files move easily between sibling containers (do the same for hadoop003)

scp core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml root@hadoop002:/home/hadoop/hadoop-3.1.3/etc/hadoop
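
Or, as a small sketch, push them to both containers in one go (path as configured above):

cd /home/hadoop/hadoop-3.1.3/etc/hadoop
for h in hadoop002 hadoop003; do
    scp core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml root@$h:/home/hadoop/hadoop-3.1.3/etc/hadoop/
done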

③ Configure workers

Still on hadoop001, edit the file ~/etc/hadoop/workers (under the Hadoop directory)
Remove its contents (delete localhost) and replace them with our three containers:
Note: no trailing spaces at the end of lines and no blank lines

hadoop001
hadoop002
hadoop003

Then synchronize the file to the other two containers

scp workers root@hadoop002:/home/hadoop/hadoop-3.1.3/etc/hadoop/
scp workers root@hadoop003:/home/hadoop/hadoop-3.1.3/etc/hadoop/

7. Start the cluster

The NameNode must be formatted before the first startup (not needed afterwards; re-formatting will wipe the data). Run on hadoop001:

hdfs namenode -format

After formatting, the data and logs directories are generated in the locations specified in the configuration files

———— Next comes the official startup

① Start DFS first:
Operate on hadoop001 (where the NameNode is configured)

Go into sbin and run the ./start-dfs.sh command
Then run jps to check whether the startup succeeded; if the expected daemons appear, it worked

If the configuration above is correct, it should start smoothly
Possible errors include:
1. Passwordless login was not configured, so startup fails with permission problems
2. The public key was only copied to the other containers but not to the container itself, which also causes permission failures
So passwordless login is very, very important here!!!
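
For reference, the start sequence on hadoop001 as a minimal sketch (paths assume the HADOOP_HOME set in mydev.sh earlier):

cd $HADOOP_HOME/sbin
./start-dfs.sh
jps    # the HDFS daemons should now be running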

② Then start YARN:
Operate on hadoop002 (where the ResourceManager is configured)

Again, run ./start-yarn.sh from sbin
In this step, if the server is not very powerful, it can hang for a long time (like mine -- stuck for two hours)

From 9 p.m. to 11 p.m., and I did not give up 2333
A word of warning, though: do not force-restart the server while it is stuck. I force-restarted at around 3 p.m. earlier, which broke the cluster so badly it would not run; in the end I had to wipe the data and rebuild. Be careful!

The error at the time was this:
ssh: connect to host master port 22: No route to host

I searched for a solution for a long time; the usual suggestions were firewall settings, no fixed IP, no network, and so on
But none of those applied to me and it still would not work; in the end I could only restart (if anyone hits this problem later and finds a solution, please share it)

Once startup completes, run jps to check that every container runs the daemons it was assigned in the plan
Mine looked right:
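
Based on the deployment plan above, jps on each container should show roughly the following (a reference sketch; jps also lists its own Jps process):

jps
# expected per the plan:
# hadoop001: NameNode, DataNode, NodeManager
# hadoop002: DataNode, ResourceManager, NodeManager
# hadoop003: SecondaryNameNode, DataNode, NodeManager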

③ Test cluster

Finally, access the cluster from the public network
Accessing a container's IP directly probably will not work (perhaps related to nginx being installed on my server, I am not sure)
Using nginx as a proxy, however, solves this neatly. Here are the steps:

First, open ports 9870, 9868 and 8088 in the security group and firewall,
corresponding to the NameNode, SecondaryNameNode and ResourceManager web UIs

Then let nginx proxy these three ports and forward them to the corresponding Hadoop containers, as follows:

server {
    listen          9870;
    server_name     www.xxx.xxx;
    location / {
        proxy_pass http://hadoop001:9870;
    }
}

server {
    listen          9868;
    server_name     www.xxx.xxx;
    location / {
        proxy_pass http://hadoop003:9868;
    }
}   

server {
    listen          8088;
    server_name     www.xxx.xxx;
    location / {
        proxy_pass http://hadoop002:8088;
    }
}

# This block was also discovered later; it fills a pit by proxying the DataNode web port as well (see the notes below)
server {
    listen          9864;
    server_name     www.xxx.xxx;
    location / {
        proxy_pass http://hadoop001:9864;
    }
}

Finally, from the public network, the Hadoop cluster can be accessed directly via domain name + port!! If the pages for ports 9870, 9868 and 8088 all load, it worked

For the pit-filling later, you also need to modify the hosts file on your own machine (details in the operation record below)

Location: C:\Windows\System32\drivers\etc\hosts
Add the following (the domain name can point to the server or to a virtual machine):
<domain name> hadoop001
<domain name> hadoop002
<domain name> hadoop003


Now for some simple functional tests: upload files to the cluster (run on hadoop001)
Each step can be checked in the Browse Directory window of the 9870 web UI:

1)	Create a folder (note that '/' here refers to the cluster's root path; it will keep coming up)
	hadoop fs -mkdir /firstDir

2)	Upload a file (any file will do; this uploads README.txt into the firstDir folder under the cluster root)
	hadoop fs -put README.txt /firstDir

3)	Click the file to preview and download it, and the test is complete (if problems remain, see the black box below)

4)	Run the wordcount example and save the result in an output folder (a command sketch follows)
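
The command itself is missing from the original record; with the examples jar that ships with Hadoop 3.1.3, it would presumably look like this (a sketch assuming the install path used above):

hadoop jar /home/hadoop/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /firstDir /output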

The part below is the raw operation record. Everything above has already been amended accordingly, so you can skip this black box and read on

---	Well, I then found that I could not preview a file's contents. A search online showed that a setting was missing
		I had clearly followed the video step by step; surely this was not going to take another two hours v...v
		Fortunately, it only needs one extra setting and a NameNode restart. The steps are below (of course, it has already been added in the main text above, so nothing more to do here)

		Add the following configuration to hdfs-site.xml, distribute it to each container, and restart
		<property>
			<name>dfs.webhdfs.enabled</name>
			<value>true</value>
		</property>

---	Why? Why can it still not be accessed? At this point I finally opened the browser console and found this:
		http://hadoop001:9864/webhdfs/v1/firstDir/test.txt?op=OPEN&namenoderpcaddress=hadoop001:8020&offset=0

		Oh, I see
		As mentioned back in the nginx part, a container name cannot be used directly for access from outside
		But this request is issued automatically, so how do we make it still go through our own server's nginx proxy?
		The answer is to modify this machine's hosts file and change the mapping between those names and addresses

		Location: C:\Windows\System32\drivers\etc\hosts
		Add here:
		<domain name> hadoop001
		<domain name> hadoop002
		<domain name> hadoop003

		The domain name is the server's domain name (of course, a virtual machine's works too)
		These lines map the three container names to our domain name, so the requests end up at our server's nginx proxy
		And with this in place, the three pages listed above can also be opened directly by container name in the future
		(I suddenly realised how much networking knowledge I picked up while writing this blog 2333)

		Finally, of course, configure nginx (this has also been added above~)
		server {
		    listen          9864;
		    server_name     www.xxx.xxx;
		    location / {
		        proxy_pass http://hadoop001:9864;
		    }
		}

With that, the test is complete!

Each knife is inserted in the right position (IceClean)

Keywords: Docker Hadoop
