Hadoop 2.5.2 HDFS setup: local mode, pseudo-distributed cluster, distributed cluster, and a high-availability environment

1. Prepare the environment (JDK and Hadoop)

  # Extract Hadoop into the installation directory
  $ tar -zxf hadoop-2.5.2.tar.gz -C /opt/app/
  # Uninstall the Java packages bundled with Linux and install JDK 1.8 (Hive requires JDK 1.7 or above)
  $ rpm -qa | grep java
  $ rpm -e --nodeps <each java package listed by the previous command>
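
After installing the JDK, it helps to export JAVA_HOME system-wide and verify the version. A minimal sketch, assuming the JDK was unpacked to /opt/modules/jdk1.8.0_121 (the path used later in yarn-env.sh); adjust it to your actual install location:
  # Append JAVA_HOME and PATH to /etc/profile (assumed JDK location)
  $ echo 'export JAVA_HOME=/opt/modules/jdk1.8.0_121' | sudo tee -a /etc/profile
  $ echo 'export PATH=$PATH:$JAVA_HOME/bin' | sudo tee -a /etc/profile
  $ source /etc/profile
  $ java -version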

2. Environment Configuration

Configure the environment in etc/hadoop/hadoop-env.sh under the Hadoop installation directory.

You can edit the file remotely with Notepad++'s NppFTP plugin, or directly on the server:
  $ vi /opt/app/hadoop-2.5.2/etc/hadoop/hadoop-env.sh

  • Configure the JDK environment
  # set to the root of your Java installation
  export JAVA_HOME=/usr/java/latest

  • Configure the Hadoop installation directory (this setting is optional)
  # Assuming your installation directory is /usr/local/hadoop
  export HADOOP_PREFIX=/usr/local/hadoop

3. Running HDFS in three modes (local, pseudo-distributed, distributed)

3.1 Local (Standalone) Mode

Local mode runs Hadoop as a single Java process, which makes it handy for debugging; the example below runs the grep statistics program from the examples jar.
To rerun the statistics on new data, simply add more files to the input folder and run the job again.
  # Create an input folder and copy the unmodified xml files into it
  $ mkdir input
  $ cp etc/hadoop/*.xml input
  # Run an example jar shipped with Hadoop
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
  # View the output
  $ cat output/*
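
The same examples jar contains other small programs; as an optional sanity check, here is the wordcount example (the wc-output directory name is just an illustration, and the job fails if it already exists):
  # Count word occurrences in the files under input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount input wc-output
  $ cat wc-output/*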

3.2 Pseudo-Distributed Mode

etc/hadoop/core-site.xml:
  <configuration>
    <!-- Point the default file system at HDFS -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- Hadoop temporary directory; keeping it inside the install tree makes the system easier to manage.
         If bin/hdfs namenode -format fails later, delete the dfs files under this directory first. -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/app/hadoop-2.5.2/data/tmp</value>
    </property>
  </configuration>
etc/hadoop/hdfs-site.xml:
  <configuration>
    <!-- Number of replicas kept for each HDFS block (1 on a single node) -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>
Passwordless (key-based) ssh login needs to be configured so the hosts can reach each other without password prompts:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  # Test that passwordless login works
  $ ssh localhost
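If ssh localhost still asks for a password, the usual culprit is file permissions; a quick fix that works on most distributions:
  # sshd ignores keys whose files are group- or world-writable
  $ chmod 700 ~/.ssh
  $ chmod 600 ~/.ssh/authorized_keys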
Run Cluster
- Format the file system
  # If you have formatted before, delete the files under hadoop.tmp.dir first to avoid errors
  $ bin/hdfs namenode -format
- Start HDFS
  # Start all daemons at once (relies on the passwordless ssh configured above)
  $ sbin/start-dfs.sh
  # Alternatively, start the daemons one by one
  $ sbin/hadoop-daemon.sh start namenode
  $ sbin/hadoop-daemon.sh start datanode
  $ sbin/hadoop-daemon.sh start secondarynamenode   # optional in a pseudo-distributed cluster
- After HDFS starts, Hadoop automatically writes its log files under the logs directory.
- We can also view NameNode information on its web page: http://localhost:50070/
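A quick way to confirm that the daemons are actually running is jps:
  $ jps
  # Expected processes in a pseudo-distributed setup: NameNode, DataNode, SecondaryNameNode (plus Jps itself)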
Note: If jps shows the daemons running on the Linux server but you cannot open port 50070 from the client,
check whether the firewalls on both the host system and the Linux virtual machine are turned off.
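For example, on a CentOS 6 style system (an assumption; the exact commands depend on your distribution) the firewall can be checked and disabled with:
  $ sudo service iptables status
  $ sudo service iptables stop
  # Keep it disabled across reboots
  $ sudo chkconfig iptables off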
Run MapReduce Task
  $ bin/hdfs dfs -mkdir -p /user/huangxc/
  # Upload the local configuration files to the HDFS file system
  $ bin/hdfs dfs -put etc/hadoop /user/huangxc/input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
  # Download the output directory from HDFS to the local file system
  $ bin/hdfs dfs -get output output
  # View the results, locally or directly in HDFS
  $ cat output/*
  $ bin/hdfs dfs -cat output/*
  # Shut down HDFS
  $ sbin/stop-dfs.sh
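If a step above does not behave as expected, it helps to list what is actually in HDFS (path as created above):
  # Recursively list the user directory in HDFS
  $ bin/hdfs dfs -ls -R /user/huangxc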

4. YARN on a Single Node

- Running MapReduce on YARN lets the jobs execute in a distributed fashion instead of locally.
a. mapred-env.sh: configure ${JAVA_HOME} (optional)
b. Rename etc/hadoop/mapred-site.xml.template to etc/hadoop/mapred-site.xml (remove the .template suffix), then add:
  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>
c. etc/hadoop/yarn-env.sh (configuration optional):
  # some Java parameters
  export JAVA_HOME=/opt/modules/jdk1.8.0_121

d. etc/hadoop/yarn-site.xml:
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>hadoop.com</value>
    </property>
  </configuration>
Run YARN
  # Start YARN
  $ sbin/start-yarn.sh
  # Run the MapReduce job
  # The output directory must not already exist, so delete the previous output directory and files first
  $ bin/hdfs dfs -rm -R /user/huangxc/output
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
  # View the output
  $ bin/hdfs dfs -cat /user/huangxc/output/*
  # Stop YARN
  $ sbin/stop-yarn.sh
Like the NameNode, YARN's ResourceManager has a web interface, by default at http://hadoop.com:8088/
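Besides the web UI, running applications can be checked from the command line as well:
  # List applications known to the ResourceManager, with their states
  $ bin/yarn application -list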
----------------------------------------------------------------------------------------------------------------------------
5. Setting Up a Distributed Cluster
- Plan the hosts and services for the distributed cluster
1. The cluster build starts from the single-node configuration above.
2. The cluster uses three virtual machines with hostnames hadoop.com, hadoop02.com, and hadoop03.com, all with IP addresses on the same network segment.
3. Make sure the firewall on every virtual machine is turned off, otherwise connections between the hosts will fail later.
Distributed cluster planning

System      hadoop.com         hadoop02.com       hadoop03.com        Configuration files
HDFS        NameNode                                                  hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves
            DataNode           DataNode           DataNode
                                                  SecondaryNameNode
YARN                           ResourceManager                        yarn-env.sh, yarn-site.xml, slaves
            NodeManager        NodeManager        NodeManager
MapReduce   JobHistoryServer                                          mapred-env.sh, mapred-site.xml

5.1 HDFS System Configuration

  1. hadoop-env.sh: the ${JAVA_HOME} JDK path (configured above)
  2. etc/hadoop/core-site.xml: the HDFS address fs.defaultFS and the temporary directory hadoop.tmp.dir (configured above); for the cluster, fs.defaultFS should point at the NameNode host, e.g. hdfs://hadoop.com:9000 instead of localhost
  3. hdfs-site.xml: the default replication factor is 3, so delete the single-node dfs.replication setting or change its value to 3
  4. hdfs-site.xml: configure the SecondaryNameNode address
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>hadoop03.com:50090</value>
    </property>
  5. etc/hadoop/slaves: list every DataNode host
  hadoop.com
  hadoop02.com
  hadoop03.com

5.2 YARN System Configuration

  - yarn-env.sh: the JDK path (same as above)
  - yarn-site.xml:
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager address: it runs on hadoop02.com per the planning table -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>hadoop02.com</value>
    </property>

The slaves file is shared with the HDFS configuration above.

5.3 MapReduce System Configuration
  1. mapred-env.sh: the JDK path (optional)
  2. mapred-site.xml (mapreduce.framework.name was already configured on the single node); add the JobHistoryServer addresses:
    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>hadoop.com:10020</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>hadoop.com:19888</value>
    </property>
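With these addresses in place, the JobHistoryServer itself is started separately on hadoop.com (per the planning table):
  $ sbin/mr-jobhistory-daemon.sh start historyserver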
5.4 Distributing the Hadoop directory from hadoop.com to the other hosts

- Configure key-based login between the hosts so you do not have to enter passwords on every access:
  $ ssh-copy-id hadoop02.com
  $ ssh-copy-id hadoop03.com
- Distribute the Hadoop directory to the other hosts
  $ scp -r hadoop-2.5.2/ huangxc@hadoop02.com:/opt/app/
  $ scp -r hadoop-2.5.2/ huangxc@hadoop03.com:/opt/app/
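A quick check that the copy landed and that the passwordless login works end to end (user and paths as above):
  $ ssh huangxc@hadoop02.com 'ls /opt/app/hadoop-2.5.2/bin'
  $ ssh huangxc@hadoop03.com 'ls /opt/app/hadoop-2.5.2/bin'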
6. Running the Cluster
6.1 Prepare the file system and start HDFS
  $ bin/hdfs namenode -format
  $ sbin/start-dfs.sh
  # Start YARN on hadoop02.com: since the ResourceManager runs there, starting it locally avoids errors
  $ sbin/start-yarn.sh
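
After both scripts have run, jps on each host should roughly match the planning table above:
  $ jps
  # hadoop.com:   NameNode, DataNode, NodeManager (plus JobHistoryServer if started)
  # hadoop02.com: DataNode, ResourceManager, NodeManager
  # hadoop03.com: DataNode, SecondaryNameNode, NodeManager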

7. HDFS High Availability Using the Quorum Journal Manager (QJM)

Before building a high-availability HDFS, Apache's coordination framework ZooKeeper™ must be set up. Install ZooKeeper on each node and configure conf/zoo.cfg as shown below; an odd number of servers works best.
Then create a myid file in the dataDir directory on every server and write that server's number into it (see the commands after zoo.cfg).

Running Replicated ZooKeeper


  tickTime=2000
  dataDir=/var/lib/zookeeper
  clientPort=2181
  initLimit=5
  syncLimit=2
  server.1=hadoop.com:2888:3888
  server.2=hadoop02.com:2888:3888
  server.3=hadoop03.com:2888:3888
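The myid files mentioned above simply hold each server's number; a sketch assuming the dataDir from zoo.cfg:
  # On hadoop.com (server.1)
  $ echo 1 > /var/lib/zookeeper/myid
  # On hadoop02.com (server.2)
  $ echo 2 > /var/lib/zookeeper/myid
  # On hadoop03.com (server.3)
  $ echo 3 > /var/lib/zookeeper/myid
  # Then start ZooKeeper on every node from the ZooKeeper installation directory
  $ bin/zkServer.sh start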
High-availability HDFS planning

System      hadoop.com        hadoop02.com      hadoop03.com      Configuration files
HDFS        NameNode          NameNode                            hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves
            DataNode          DataNode          DataNode
            JournalNode       JournalNode       JournalNode
ZooKeeper   zookeeper         zookeeper         zookeeper         zoo.cfg

etc/hadoop/hdfs-site.xml:

  <property>
    <name>dfs.nameservices</name>
    <value>ns1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.ns1</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn1</name>
    <value>hadoop.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn2</name>
    <value>hadoop02.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1.nn1</name>
    <value>hadoop.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1.nn2</name>
    <value>hadoop02.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop.com:8485;hadoop02.com:8485;hadoop03.com:8485/ns1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.ns1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/exampleuser/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/path/to/journal/node/local/data</value>
  </property>
etc/hadoop/core-site.xml:
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns1</value>
  </property>

QJM HA startup
1. On each JournalNode host, start the journalnode service:
  $ sbin/hadoop-daemon.sh start journalnode
2. On nn1, format HDFS and start the NameNode:
  $ bin/hdfs namenode -format
  $ sbin/hadoop-daemon.sh start namenode
3. On nn2, synchronize the metadata from nn1:
  $ bin/hdfs namenode -bootstrapStandby
4. Start the NameNode on nn2:
  $ sbin/hadoop-daemon.sh start namenode
5. Switch nn1 to Active:
  $ bin/hdfs haadmin -transitionToActive nn1
6. Start the DataNodes (from nn1):
  $ sbin/hadoop-daemons.sh start datanode
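
To confirm which NameNode ended up active after the switch, query each one with the HA admin tool:
  # Should report "active" for nn1 and "standby" for nn2
  $ bin/hdfs haadmin -getServiceState nn1
  $ bin/hdfs haadmin -getServiceState nn2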
