1. Prepare the environment (JDK and Hadoop)
$ tar -zxf hadoop-2.5.2.tar.gz -C /opt/app/
// Uninstall the Java that ships with Linux and install JDK 1.8 (Hive only supports JDK 1.7 or above)
$ rpm -qa | grep java
$ rpm -e --nodeps <each java package listed by the command above>
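For example, a minimal sketch of the reinstall, where the OpenJDK package name and the JDK 1.8.0_121 RPM file name are only illustrative and will differ on your system:
# remove every preinstalled Java package reported by rpm -qa | grep java
$ rpm -e --nodeps java-1.7.0-openjdk-1.7.0.45-2.4.3.3.el6.x86_64
# install the JDK 1.8 RPM and verify the new version
$ rpm -ivh jdk-8u121-linux-x64.rpm
$ java -version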
2. Environment Configuration
Configure the environment in etc/hadoop/hadoop-env.sh under the Hadoop installation directory.
You can edit it remotely, for example with Notepad++'s NppFTP plugin.
$ vi /opt/app/hadoop-2.5.2/etc/hadoop/hadoop-env.sh
Configuring the JDK environment
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
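After setting these variables, a quick sanity check is to run the Hadoop script itself; it should print version and usage information rather than a JAVA_HOME error:
$ bin/hadoop version
# running the script with no arguments prints the usage documentation
$ bin/hadoop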
3. Starting HDFS in its three modes (local, pseudo-distributed, distributed)
3.1 Local (Standalone) Mode
In this mode Hadoop runs as a single Java process, which is very useful for debugging and for demonstrating the example statistics programs.
You can add more data at any time by dropping additional files into the input folder and re-running the job.
# Create an input folder to back up unmodified xml files
$ mkdir input
$ cp etc/hadoop/*.xml input
# Execute an example jar package under hadoop
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
# View output structure
$ cat output/*
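With the unmodified configuration files the grep job usually finds only a single match, so the output looks roughly like this:
$ cat output/*
1       dfsadmin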
3.2 Pseudo-Distributed Mode
etc/hadoop/core-site.xml:
<configuration>
// Set the HDFS file system address
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
// Configure the Hadoop temporary directory so the system data is easy to manage
// If bin/hdfs namenode -format reports an error later, delete the dfs files under this directory and format again
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/app/hadoop-2.5.2/data/tmp</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
// Configure the number of block replicas
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Passwordless SSH login must be configured so the hosts can access each other without a password prompt
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
// Test that passwordless login works
$ ssh localhost
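If ssh localhost still asks for a password, the permissions on the key files are the usual culprit; tightening them generally fixes it:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys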
Run Cluster
Format the file system
// If you format more than once, delete the files under hadoop.tmp.dir first to prevent errors
$ bin/hdfs namenode -format
Start HDFS
// start-dfs.sh relies on the passwordless SSH configured above
$ sbin/start-dfs.sh
// Alternatively, start the daemons one by one:
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/hadoop-daemon.sh start secondarynamenode (optional in a pseudo-distributed cluster)
After HDFS starts, Hadoop automatically creates a logs directory under the installation directory.
Note: if jps shows the daemons running but the NameNode web page on port 50070 cannot be opened,
check whether the firewalls on the host system and the Linux virtual machine are turned off.
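A quick way to verify is to run jps and then, assuming a CentOS 6 style virtual machine, disable iptables (the PIDs shown are only examples):
$ jps
3072 NameNode
3189 DataNode
3347 SecondaryNameNode
# stop the firewall now and keep it off across reboots
$ sudo service iptables stop
$ sudo chkconfig iptables off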
Run MapReduce Task
$ bin/hdfs dfs -mkdir -p /user/huangxc/
// Upload local files to the HDFS file system
$ bin/hdfs dfs -put etc/hadoop /user/huangxc/input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
// Download the output files from HDFS to the local file system
$ bin/hdfs dfs -get output output
// View results
$ cat output/*
$ bin/hdfs dfs -cat output/*
// Shut down HDFS
$ sbin/stop-dfs.sh
4. YARN on a Single Node
Run MapReduce jobs on YARN so they execute in a distributed fashion.
a. mapred-env.sh: configure ${JAVA_HOME} (optional, no change required)
b. Rename etc/hadoop/mapred-site.xml.template to etc/hadoop/mapred-site.xml and add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
c. etc/hadoop/yarn-env.sh (optional):
# some Java parameters
export JAVA_HOME=/opt/modules/jdk1.8.0_121
d. etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop.com</value>
</property>
</configuration>
Run YARN
# Run yarn
$ sbin/start-yarn.sh
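After start-yarn.sh, jps should additionally show the two YARN daemons next to the HDFS ones (PIDs are examples):
$ jps
4121 ResourceManager
4230 NodeManager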
# Run the MapReduce program
# The output directory must not already exist, so delete the previous output directory and files first
$ bin/hdfs dfs -rm -R /user/huangxc/output
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep /user/huangxc/input /user/huangxc/output 'dfs[a-z.]+'
# View the output files
$ bin/hdfs dfs -cat /user/huangxc/output/*
# Stop yarn
$ sbin/stop-yarn.sh
Like the NameNode, YARN's ResourceManager has a web interface:
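With the default ports and the hostname configured above, the two web interfaces should be reachable at:
# NameNode web UI
http://hadoop.com:50070/
# ResourceManager web UI
http://hadoop.com:8088/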
----------------------------------------------------------------------------------------------------------------------------
5. Setting up a distributed cluster
Plan the cluster and the services that run on each node
1. The cluster build starts from the single-node configuration above
2. The cluster uses three virtual machines with hostnames hadoop.com, hadoop02.com, and hadoop03.com, with IP addresses on the same network segment
3. Make sure the firewall on every virtual machine is turned off, otherwise many connections will fail with errors later
Distributed cluster planning:

| System    | Configuration files                                 | hadoop.com       | hadoop02.com    | hadoop03.com      |
| HDFS      | hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves | NameNode         |                 |                   |
|           |                                                     | DataNode         | DataNode        | DataNode          |
|           |                                                     |                  |                 | SecondaryNameNode |
| YARN      | yarn-env.sh, yarn-site.xml, slaves                  |                  | ResourceManager |                   |
|           |                                                     | NodeManager      | NodeManager     | NodeManager       |
| MapReduce | mapred-env.sh, mapred-site.xml                      | JobHistoryServer |                 |                   |
5.1 HDFS System Configuration
- hadoop-env.sh: configure ${JAVA_HOME} with the JDK path (configured above)
- etc/hadoop/core-site.xml: configure the HDFS address fs.defaultFS and the temporary file path hadoop.tmp.dir (configured above)
- etc/hadoop/hdfs-site.xml: the default number of replicas is 3 (delete the single-node dfs.replication setting, or change its value to 3), and configure the SecondaryNameNode address:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop03.com:50090</value>
</property>
- etc/hadoop/slaves:
hadoop.com
hadoop02.com
hadoop03.com
5.2 YARN System Configuration
- yarn-env.sh: configure the JDK path
- yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
// ResourceManager address
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop02.com</value>
</property>
- slaves: shared with HDFS, the same file as configured above
5.3 MapReduce System Configuration
- mapred-env.sh: configure the JDK path
- mapred-site.xml (mapreduce.framework.name was already configured in the single-node setup); add the JobHistoryServer addresses:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop.com:19888</value>
</property>
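The JobHistoryServer is not started by start-dfs.sh or start-yarn.sh; per the planning table it runs on hadoop.com, where it can be started and stopped with:
$ sbin/mr-jobhistory-daemon.sh start historyserver
$ sbin/mr-jobhistory-daemon.sh stop historyserver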
5.4 Distributing Hadoop from hadoop.com to the other hosts
- Configure passwordless SSH login between the hosts so you do not have to enter a password on every access:
ssh-copy-id hadoop02.com
ssh-copy-id hadoop03.com
- Distribute files to different hosts
$ scp -r hadoop-2.5.2/ huangxc@hadoop02.com:/opt/app/
$ scp -r hadoop-2.5.2/ huangxc@hadoop03.com:/opt/app/
6. Running Clusters
6.1 Start HDFS (format the file system first)
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
// Since the ResourceManager runs on hadoop02.com, start YARN on hadoop02.com to avoid errors
$ sbin/start-yarn.sh
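If everything came up cleanly, jps on each host should roughly match the planning table above:
# hadoop.com:   NameNode, DataNode, NodeManager, JobHistoryServer (if started)
# hadoop02.com: DataNode, ResourceManager, NodeManager
# hadoop03.com: DataNode, SecondaryNameNode, NodeManager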
7. HDFS High Availability Using the Quorum Journal Manager (QJM)
Apache's coordination service ZooKeeper must be set up before a highly available HDFS can be built.
Install ZooKeeper and configure conf/zoo.cfg as follows (an odd number of servers is best).
On each server, create a myid file in the dataDir directory containing that server's number (see the sketch after the zoo.cfg example).
Running a replicated ZooKeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=hadoop.com:2888:3888
server.2=hadoop02.com:2888:3888
server.3=hadoop03.com:2888:3888
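A minimal sketch of the myid files and the ZooKeeper startup, assuming ZooKeeper is run from its own installation directory on every host:
# on hadoop.com (server.1); write 2 on hadoop02.com and 3 on hadoop03.com
$ echo 1 > /var/lib/zookeeper/myid
# start ZooKeeper on every host, then check each node's role
$ bin/zkServer.sh start
$ bin/zkServer.sh status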
High availability HDFS system planning:

| System    | Configuration files                                 | hadoop.com  | hadoop02.com | hadoop03.com |
| HDFS      | hadoop-env.sh, core-site.xml, hdfs-site.xml, slaves | NameNode    | NameNode     |              |
|           |                                                     | DataNode    | DataNode     | DataNode     |
|           |                                                     | JournalNode | JournalNode  | JournalNode  |
| ZooKeeper | zoo.cfg                                             | zookeeper   | zookeeper    | zookeeper    |

hdfs-site.xml:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
QJM HA Start
1. On each JournalNode host, run the following command to start the journalnode daemon:
sbin/hadoop-daemon.sh start journalnode
2. On nn1, format HDFS and start the NameNode:
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
3. Synchronize nn1's metadata to nn2:
bin/hdfs namenode -bootstrapStandby
4. Start the NameNode on nn2:
sbin/hadoop-daemon.sh start namenode
5. Switch nn1 to Active
bin/hdfs haadmin -transitionToActive nn1
6. Start the DataNodes:
sbin/hadoop-daemon.sh start datanode
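To confirm the switch worked, hdfs haadmin can report each NameNode's state; after step 5 the output should look like this:
$ bin/hdfs haadmin -getServiceState nn1
active
$ bin/hdfs haadmin -getServiceState nn2
standby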