1. Prepare the environment (JDK and Hadoop)
$ tar -zxf hadoop-2.5.2.tar.gz -C /opt/app/
// Uninstall the Java that ships with Linux and install JDK 1.8 (Hive only supports JDK 1.7 or above)
$ rpm -qa | grep java
$ rpm -e --nodeps <each java package listed by the command above>
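For example, a minimal sketch of the reinstall, where the OpenJDK package name and the JDK 1.8.0_121 RPM file name are only illustrative and will differ on your system:
# remove every preinstalled Java package reported by rpm -qa | grep java
$ rpm -e --nodeps java-1.7.0-openjdk-1.7.0.45-2.4.3.3.el6.x86_64
# install the JDK 1.8 RPM and verify the new version
$ rpm -ivh jdk-8u121-linux-x64.rpm
$ java -version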
2. Environment Configuration
Configure the environment in etc/hadoop/hadoop-env.sh under the Hadoop installation directory.
You can edit it remotely, for example with Notepad++'s NppFTP plugin.
$ vi /opt/app/hadoop-2.5.2/etc/hadoop/hadoop-env.sh
Configuring the JDK environment
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
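After setting these variables, a quick sanity check is to run the Hadoop script itself; it should print version and usage information rather than a JAVA_HOME error:
$ bin/hadoop version
# running the script with no arguments prints the usage documentation
$ bin/hadoop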
3. Starting HDFS in its three modes (local, pseudo-distributed, distributed)
3.1 Local (Standalone) Mode
In this mode Hadoop runs as a single Java process, which is very useful for debugging and for demonstrating the example statistics programs.
You can add more data at any time by dropping additional files into the input folder and re-running the job.
# Create an input folder to back up unmodified xml files
$ mkdir input
$ cp etc/hadoop/*.xml input
# Execute an example jar package under hadoop
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
# View output structure
$ cat output/*
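With the unmodified configuration files the grep job usually finds only a single match, so the output looks roughly like this:
$ cat output/*
1       dfsadmin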
3.2 Pseudo-Distributed Mode
etc/hadoop/core-site.xml:
<configuration>
// Set the HDFS file system address
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
// Configure the Hadoop temporary directory so the system data is easy to manage
// If bin/hdfs namenode -format reports an error later, delete the dfs files under this directory and format again
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/app/hadoop-2.5.2/data/tmp</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
// Configure the number of block replicas
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Passwordless SSH login must be configured so the hosts can access each other without a password prompt
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
// Test that passwordless login works
$ ssh localhost
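If ssh localhost still asks for a password, the permissions on the key files are the usual culprit; tightening them generally fixes it:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys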
Run Cluster
Format the file system
// If you format more than once, delete the files under hadoop.tmp.dir first to prevent errors
$ bin/hdfs namenode -format
Start HDFS
// start-dfs.sh relies on the passwordless SSH configured above
$ sbin/start-dfs.sh
// Alternatively, start the daemons one by one:
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/hadoop-daemon.sh start secondarynamenode (optional in a pseudo-distributed cluster)
After HDFS starts, Hadoop automatically creates a logs directory under the installation directory.
Note: if jps shows the daemons running but the NameNode web page on port 50070 cannot be opened,
check whether the firewalls on the host system and the Linux virtual machine are turned off.
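A quick way to verify is to run jps and then, assuming a CentOS 6 style virtual machine, disable iptables (the PIDs shown are only examples):
$ jps
3072 NameNode
3189 DataNode
3347 SecondaryNameNode
# stop the firewall now and keep it off across reboots
$ sudo service iptables stop
$ sudo chkconfig iptables off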
Run MapReduce Task
$ bin/hdfs dfs -mkdir -p /user/huangxc/
// Upload local files to the HDFS file system
$ bin/hdfs dfs -put etc/hadoop /user/huangxc/input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
// Download the output files from HDFS to the local file system
$ bin/hdfs dfs -get output output
// View results
$ cat output/*
$ bin/hdfs dfs -cat output/*
// Shut down HDFS
$ sbin/stop-dfs.sh
4. YARN on a Single Node
Run MapReduce jobs on YARN so they execute in a distributed fashion.
a. mapred-env.sh: configure ${JAVA_HOME} (optional, no change required)
b. Rename etc/hadoop/mapred-site.xml.template to etc/hadoop/mapred-site.xml and add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
c. etc/hadoop/yarn-env.sh (optional):
# some Java parameters
export JAVA_HOME=/opt/modules/jdk1.8.0_121
d. etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop.com</value>
</property>
</configuration>
Run YARN
# Run yarn
$ sbin/start-yarn.sh
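After start-yarn.sh, jps should additionally show the two YARN daemons next to the HDFS ones (PIDs are examples):
$ jps
4121 ResourceManager
4230 NodeManager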
# Run the MapReduce program
# The output directory must not already exist, so delete the previous output directory and files first
$ bin/hdfs dfs -rm -R /user/huangxc/output
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
# View the output files
$ bin/hdfs dfs -cat /user/huangxc/output/*
# Stop yarn
$ sbin/stop-yarn.sh
Like the NameNode, YARN's ResourceManager has a web interface:
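With the default ports and the hostname configured above, the two web interfaces should be reachable at:
# NameNode web UI
http://hadoop.com:50070/
# ResourceManager web UI
http://hadoop.com:8088/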
----------------------------------------------------------------------------------------------------------------------------
5. Setting up a distributed cluster
Plan the cluster and the services that run on each node
1. The cluster build starts from the single-node configuration above
2. The cluster uses three virtual machines with hostnames hadoop.com, hadoop02.com, and hadoop03.com, with IP addresses on the same network segment
3. Make sure the firewall on every virtual machine is turned off, otherwise many connections will fail with errors later
Distributed cluster planning:

| System    | Configuration files                                 | hadoop.com       | hadoop02.com    | hadoop03.com      |
| HDFS      | hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves | NameNode         |                 |                   |
|           |                                                     | DataNode         | DataNode        | DataNode          |
|           |                                                     |                  |                 | SecondaryNameNode |
| YARN      | yarn-env.sh, yarn-site.xml, slaves                  |                  | ResourceManager |                   |
|           |                                                     | NodeManager      | NodeManager     | NodeManager       |
| MapReduce | mapred-env.sh, mapred-site.xml                      | JobHistoryServer |                 |                   |
5.1 HDFS System Configuration
- hadoop-env.sh: configure ${JAVA_HOME} with the JDK path (configured above)
- etc/hadoop/core-site.xml: configure the HDFS address fs.defaultFS and the temporary file path hadoop.tmp.dir (configured above)
- etc/hadoop/hdfs-site.xml: the default number of replicas is 3 (delete the single-node dfs.replication setting, or change its value to 3), and configure the SecondaryNameNode address:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop03.com:50090</value>
</property>
- etc/hadoop/slaves:
hadoop.com
hadoop02.com
hadoop03.com
5.2 YARN System Configuration
- yarn-env.sh: configure the JDK path
- yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
// ResourceManager address
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop02.com</value>
</property>
- slaves: shared with HDFS, the same file as configured above
5.3 MapReduce System Configuration
- mapred-env.sh: configure the JDK path
- mapred-site.xml (mapreduce.framework.name was already configured in the single-node setup); add the JobHistoryServer addresses:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop.com:19888</value>
</property>
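The JobHistoryServer is not started by start-dfs.sh or start-yarn.sh; per the planning table it runs on hadoop.com, where it can be started and stopped with:
$ sbin/mr-jobhistory-daemon.sh start historyserver
$ sbin/mr-jobhistory-daemon.sh stop historyserver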
5.4 Distributing Hadoop from hadoop.com to the other hosts
- Configure passwordless SSH login between the hosts so you do not have to enter a password on every access:
ssh-copy-id hadoop02.com
ssh-copy-id hadoop03.com
- Distribute files to different hosts
$ scp -r hadoop-2.5.2/ huangxc@hadoop02.com:/opt/app/
$ scp -r hadoop-2.5.2/ huangxc@hadoop03.com:/opt/app/
6. Running Clusters
6.1 Start HDFS (format the file system first)
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
// Since the ResourceManager runs on hadoop02.com, start YARN on hadoop02.com to avoid errors
$ sbin/start-yarn.sh
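If everything came up cleanly, jps on each host should roughly match the planning table above:
# hadoop.com:   NameNode, DataNode, NodeManager, JobHistoryServer (if started)
# hadoop02.com: DataNode, ResourceManager, NodeManager
# hadoop03.com: DataNode, SecondaryNameNode, NodeManager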
7. HDFS High Availability Using the Quorum Journal Manager (QJM)
Apache's coordination service ZooKeeper must be set up before a highly available HDFS can be built.
Install ZooKeeper and configure conf/zoo.cfg as follows (an odd number of servers is best).
On each server, create a myid file in the dataDir directory containing that server's number (see the sketch after the zoo.cfg example).
Running a replicated ZooKeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=hadoop.com:2888:3888
server.2=hadoop02.com:2888:3888
server.3=hadoop03.com:2888:3888
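A minimal sketch of the myid files and the ZooKeeper startup, assuming ZooKeeper is run from its own installation directory on every host:
# on hadoop.com (server.1); write 2 on hadoop02.com and 3 on hadoop03.com
$ echo 1 > /var/lib/zookeeper/myid
# start ZooKeeper on every host, then check each node's role
$ bin/zkServer.sh start
$ bin/zkServer.sh status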
High availability HDFS system planning:

| System    | Configuration files                                 | hadoop.com  | hadoop02.com | hadoop03.com |
| HDFS      | hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves | NameNode    | NameNode     |              |
|           |                                                     | DataNode    | DataNode     | DataNode     |
|           |                                                     | JournalNode | JournalNode  | JournalNode  |
| ZooKeeper | zoo.cfg                                             | zookeeper   | zookeeper    | zookeeper    |

hdfs-site.xml:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
QJM HA Start
1. On each JournalNode host, run the following command to start the journalnode daemon:
sbin/hadoop-daemon.sh start journalnode
2. On nn1, format HDFS and start the NameNode:
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
3. Synchronize nn1's metadata to nn2:
bin/hdfs namenode -bootstrapStandby
4. Start the NameNode on nn2:
sbin/hadoop-daemon.sh start namenode
5. Switch nn1 to Active
bin/hdfs haadmin -transitionToActive nn1
6. Start the DataNodes:
sbin/hadoop-daemon.sh start datanode
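To confirm the switch worked, hdfs haadmin can report each NameNode's state; after step 5 the output should look like this:
$ bin/hdfs haadmin -getServiceState nn1
active
$ bin/hdfs haadmin -getServiceState nn2
standby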