2021-10-21 VirtualBox-Based Hadoop Cluster Installation and Configuration Tutorial

This tutorial follows the Hadoop distributed cluster setup process described in http://dblab.xmu.edu.cn/blog/2775-2/

Prerequisites

  • A pseudo-distributed Hadoop system has already been configured on a virtual machine
  • One virtual machine acts as Master (the NameNode), and three virtual machines data1, data2 and data3 (all running Ubuntu) act as DataNodes

Network Configuration

  • Adapter 1 (Network Card 1) is configured as a NAT network so that the virtual machine can access the external network normally

  • Adapter 2 (Network Card 2) is configured as host-only so that the data virtual machines can communicate with the Master virtual machine

    If no interface name is shown for the host-only adapter, you can create one via Manage > Host Network Manager > Create in the VirtualBox menu bar

  • Configure the host name of each machine (Master, data1, data2, data3)
    sudo vim /etc/hostname

    Run ping data1 -c 3 to test network connectivity; the hostnames must resolve to the host-only addresses (see the sketch below)
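  • For the hostnames Master, data1, data2 and data3 to resolve, the /etc/hosts file on every machine needs an entry mapping each node's host-only IP address to its host name. A minimal sketch, assuming the host-only network uses the default 192.168.56.0/24 range (the addresses below are assumptions; check the real ones with ip addr on each node):

    sudo vim /etc/hosts
    # append one line per node (example IPs, replace with your own)
    192.168.56.101 Master
    192.168.56.102 data1
    192.168.56.103 data2
    192.168.56.104 data3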

SSH Passwordless Login to the Nodes

  • The Master node must be able to log in to each Slave node via SSH without a password. First, generate the Master node's key pair. If a key has been generated before, delete it and generate it again, because the host name was changed earlier. The specific commands are as follows:

    cd ~/.ssh              # If not, execute ssh localhost once
    rm ./id_rsa*           # Delete previously generated public keys (if they already exist)
    ssh-keygen -t rsa       # After executing the command, when prompted, press Enter all the time
    
  • To allow the Master node to SSH into itself without a password, execute the following command on the Master node:
    cat ./id_rsa.pub >> ./authorized_keys

  • Next, transfer the public key from the Master node to the data1 node (adjust the target folder if yours differs):
    scp ~/.ssh/id_rsa.pub hadoop@data1:/home/hadoop/

  • On the data1 node, add the SSH public key to the authorized keys:

    mkdir ~/.ssh       # If the folder does not exist, create it first. If it does exist, ignore this command
    cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
    rm ~/id_rsa.pub    # Can be deleted after use
    
  • Configure the other nodes (data2, data3) in the same way, then verify the login as sketched below
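  • A quick check, assuming the public key has been added on every node as above: the Master node should now reach each DataNode over SSH without a password prompt

    ssh data1     # should log in without asking for a password
    exit          # return to the Master node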

Configure PATH variable

  • The previous pseudo-distributed installation already described how to configure the PATH variable. You can configure it the same way here, so that commands such as hadoop and hdfs can be used directly from any directory. If the PATH variable has not been configured yet, it needs to be configured on the Master node. First execute the command "vim ~/.bashrc", that is, open the "~/.bashrc" file with the vim editor, and then add the following line at the top of the file:
    export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
    After saving, execute the command "source ~/.bashrc" for the configuration to take effect.
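  • A quick way to confirm the PATH change took effect is that the Hadoop commands now resolve from any directory, for example:

    which hadoop      # expected: /usr/local/hadoop/bin/hadoop
    hadoop version    # prints the installed Hadoop version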

Configure Cluster/Distributed Environment

  • When configuring cluster/distributed mode, you need to modify the configuration files in the /usr/local/hadoop/etc/hadoop directory. Only the settings necessary for normal startup are covered here: workers, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml, five files in total. See the official documentation for further settings.

  • workers
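
    The workers file lists the hostnames of all DataNodes, one per line (remove the default localhost entry). A sketch, assuming the DataNodes are named data1, data2 and data3 as above:

    data1
    data2
    data3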

  • core-site.xml

    <configuration>
    		<property>
    				<name>fs.defaultFS</name>
    				<value>hdfs://Master:9000</value>
    		</property>
    		<property>
    				<name>hadoop.tmp.dir</name>
    				<value>file:/usr/local/hadoop/tmp</value>
    				<description>Abase for other temporary directories.</description>
    		</property>
    </configuration>
    
  • hdfs-site.xml

    <configuration>
    		<property>
    				<name>dfs.namenode.secondary.http-address</name>
    				<value>Master:50090</value>
    		</property>
    		<property>
    				<name>dfs.replication</name>
    				<value>1</value>
    		</property>
    		<property>
    				<name>dfs.namenode.name.dir</name>
    				<value>file:/usr/local/hadoop/tmp/dfs/name</value>
    		</property>
    		<property>
    				<name>dfs.datanode.data.dir</name>
    				<value>file:/usr/local/hadoop/tmp/dfs/data</value>
    		</property>
    </configuration>
    
  • mapred-site.xml

    <configuration>
    		<property>
    				<name>mapreduce.framework.name</name>
    				<value>yarn</value>
    		</property>
    		<property>
    				<name>mapreduce.jobhistory.address</name>
    				<value>Master:10020</value>
    		</property>
    		<property>
    				<name>mapreduce.jobhistory.webapp.address</name>
    				<value>Master:19888</value>
    		</property>
    		<property>
    				<name>yarn.app.mapreduce.am.env</name>
    				<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    		</property>
    		<property>
    				<name>mapreduce.map.env</name>
    				<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    		</property>
    		<property>
    				<name>mapreduce.reduce.env</name>
    				<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    		</property> 
    </configuration>
    
  • yarn-site.xml

    <configuration>
    		<property>
    				<name>yarn.resourcemanager.hostname</name>
    				<value>Master</value>
    		</property>
    		<property>
    				<name>yarn.nodemanager.aux-services</name>
    				<value>mapreduce_shuffle</value>
    		</property>
    </configuration>
    
  • When all five files are configured, the /usr/local/hadoop folder on the Master node needs to be copied to each data node (a loop for copying the archive to all three data nodes is sketched after these steps).
    On the master node:

    cd /usr/local
    sudo rm -r ./hadoop/tmp     # Delete Hadoop Temporary Files
    sudo rm -r ./hadoop/logs/*   # Delete Log File
    tar -zcf ~/hadoop.master.tar.gz ./hadoop   # Compress before copying
    cd ~
    scp ./hadoop.master.tar.gz data1:/home/hadoop
    

    On the data1 node

    sudo rm -r /usr/local/hadoop    # Delete old (if present)
    sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local
    sudo chown -R hadoop /usr/local/hadoop
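
  • The same steps need to be repeated on data2 and data3. A sketch for copying the archive to all three data nodes from the Master node (assuming the hadoop user exists on every node):

    for node in data1 data2 data3; do
        scp ~/hadoop.master.tar.gz hadoop@${node}:/home/hadoop/
    done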
    

Start-up

  • The first time the cluster is started, format the NameNode on the Master node (this only needs to be done once):

    hdfs namenode -format
    
  • Start the cluster on the Master node:

    start-dfs.sh
    start-yarn.sh
    mr-jobhistory-daemon.sh start historyserver
    
  • Run jps on the Master node to check the running daemons

  • Run jps on data1 (and the other data nodes) as well; the expected processes are sketched below
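
  • If startup succeeded, jps on the Master node should roughly show NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer, while each data node should show DataNode and NodeManager. The number of registered DataNodes can also be checked from the Master node:

    hdfs dfsadmin -report    # the "Live datanodes" count should match the number of data nodes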

Shutdown

  • On the Master virtual machine:
    stop-yarn.sh
    stop-dfs.sh
    mr-jobhistory-daemon.sh stop historyserver
    
