Hadoop Installation and Configuration in Ubuntu

Tencent Cloud, Ubuntu 16.04.1 LTS, 64-bit

Basic Linux operations

Change the root password

sudo passwd root

Log out of the current user

logout

Close the firewall

ufw disable

Uninstall iptables components

apt-get remove iptables

Install vim (for text editing)

apt-get install vim

Reconfigure the console character set and font

sudo dpkg-reconfigure console-setup

Remote connection to Linux

  • First: the Linux side must be running an SSH service
  • Second: connect with an SSH client tool

Install the SSH server

apt-get install openssh-server

Start ssh service

/etc/init.d/ssh start

Check the process to see if the specified service has been started

ps -e | grep sshd

SSH can only be used once the sshd process is running

By default, Ubuntu does not allow the root user to log in over SSH.
Open the /etc/ssh/sshd_config file with vim

vim /etc/ssh/sshd_config

Then change the PermitRootLogin setting to yes
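For the change to take effect, the SSH service needs to be restarted, for example:

/etc/init.d/ssh restart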

Configuring ftp services

Install ftp components

apt-get install vsftpd

Modify ftp user password

passwd ftp

After the FTP service is installed, the /srv/ftp directory is created automatically
cd /srv/ftp
Set this directory to full permissions

chmod 777 /srv/ftp

To make FTP work properly, the configuration file /etc/vsftpd.conf needs to be modified

vim /etc/vsftpd.conf

Set the following configuration items:
# Do not allow anonymous login (a valid user name and password are required)
anonymous_enable=NO
# Give users write permission
write_enable=YES
# Allow local users to log in
local_enable=YES
# Restrict all users to their home directory (uncomment this line)
chroot_local_user=YES
# Enable the restricted-user list
chroot_list_enable=YES
# File that defines the list (multiple accounts can be listed in it)
chroot_list_file=/etc/vsftpd.chroot_list
# Add the PAM service configuration
pam_service_name=vsftpd

Edit the list file
vim /etc/vsftpd.chroot_list
Add the ftp user on its own line, then save and exit
Modify /etc/pam.d/vsftpd
vim /etc/pam.d/vsftpd
Comment out the following line
auth   required        pam_shells.so

Start ftp service

service vsftpd start
service vsftpd restart           # restart if it is already running

Check to see if ftp service has been started

ps -e | grep vsftpd
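To verify the setup from a client machine, connect with any FTP client using the server's IP address and the ftp user and password set above, for example:

ftp <server IP address>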

Hadoop installation and configuration

JDK installation and configuration

Download the JDK with wget, or upload it via FTP

wget http://download.oracle.com/otn-pub/java/jdk/8u191-b12-demos/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64-demos.tar.gz

Extract it into the /usr/local directory

tar xzvf jdk-8u191-linux-x64-demos.tar.gz -C /usr/local

Rename the extracted directory (inside /usr/local)

mv jdk1.8.0_191/ jdk

Edit the environment file to add the configuration

vim /etc/profile

export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

# By default, changes to environment variables only take effect after the system is restarted (or after logging in again), but source can be used to apply the configuration immediately:

source /etc/profile
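The JDK setup can then be verified with:

java -version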


Installing Hadoop on Linux

tar xzvf hadoop-2.8.5.tar.gz -C /usr/local
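The archive extracts into a versioned directory; renaming it keeps the path consistent with the HADOOP_HOME used below (a sketch, assuming the directory is named hadoop-2.8.5):

mv /usr/local/hadoop-2.8.5 /usr/local/hadoop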

# Configure the corresponding environment variables, again in /etc/profile

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Let configuration take effect

source /etc/profile

Hadoop depends on the JDK, so the JDK path it should use must be defined in Hadoop's environment file (under /usr/local/hadoop/etc/hadoop/)

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Set the JDK to be used by Hadoop

export JAVA_HOME=/usr/local/jdk
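Once the environment variables are in effect, a quick sanity check (assuming the PATH settings above) is:

hadoop version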


# To test whether Hadoop is installed and usable, run one of the example programs that ship with it
For a word count, first prepare a word file
# Create an input directory under the Hadoop directory:
root@VM-0-3-ubuntu:/usr/local/hadoop# mkdir input
# Write a file
root@VM-0-3-ubuntu:/usr/local/hadoop# echo hello 6jj hello nihaoa > input/info.txt
Words are separated with spaces

Distributed hadoop configuration

Configure ssh

The IP address must not change later; if it does, Hadoop has to be reconfigured

For configuration convenience, set the host name for each computer

vim /etc/hostname

Change the localhost inside it to hadoopm

You also need to modify the host mapping: edit the /etc/hosts file and add a mapping from the IP address to the hadoopm host name.

vim /etc/hosts

172.16.0.3 hadoopm
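To confirm that the mapping works, the host name can be pinged:

ping -c 3 hadoopm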

For this to take effect, it is recommended to run reboot and restart Linux

Throughout its processing, Hadoop uses SSH for communication, so SSH is used even on a single machine, and the computer must be configured for passwordless (key-based) SSH login.
Since an SSH configuration may already exist on the computer, it is recommended to delete the ".ssh" folder in root's home directory first.

cd ~
rm -rf ~/.ssh

Generate ssh Key on the host of Hadoop:

ssh-keygen -t rsa

At this point logging in still requires a password. The public key must be stored in the authorized_keys file so that it is accepted for authentication.

cd ~
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
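If login still asks for a password, it is usually a file-permission problem; tightening the permissions on the key files normally fixes it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys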

From now on, login works without a password.

Log in to verify

ssh root@hadoopm

The login is now a remote connection; use exit to leave the current connection.

Related configuration of hadoop

All configuration files are in the "/usr/local/hadoop/etc/hadoop/" directory

Configuration: "core-site.xml"

Defines Hadoop's core information, including the temporary directory and the default access address

<property>
      <name>hadoop.tmp.dir</name>
      <value>/home/root/hadoop_tmp</value>
</property>
<property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoopm:9000</value>
</property>
  • The "hdfs://hadoopm:9000" information configured in this article describes the path of the page manager to be opened later. The default port of Hadoop version 2.X is 9000.
  • The most important thing in this configuration is "/ home/root/hadoop_tmp". If the temporary file information of the file path configuration is not configured, the "tmp" file will be generated in the Hadoop folder ("/usr/local/hadoop/tmp"). If this configuration is restarted, all information will be cleared, that is to say, the Hadoop environment will fail. In order to ensure that there is no error, a "/ho" can be established directly. Me/root/hadoop_tmp "directory mkdir ~/root/hadoop_tmp"
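Creating it in advance (using the path from the configuration above):

mkdir -p /home/root/hadoop_tmp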

Configuration: "yarn-site.xml"

This can simply be understood as configuring how YARN handles jobs (the ResourceManager and NodeManager addresses and the shuffle service)

<property>
	<name>yarn.resourcemanager.admin.address</name>
	<value>hadoopm:8033</value>
</property>
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property>
	<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
	<name>yarn.resourcemanager.resource-tracker.address</name>
	<value>hadoopm:8025</value>
</property>
<property>
	<name>yarn.resourcemanager.scheduler.address</name>
	<value>hadoopm:8030</value>
</property>
<property>
	<name>yarn.resourcemanager.address</name>
	<value>hadoopm:8050</value>
</property>
<property>
	<name>yarn.resourcemanager.webapp.address</name>
	<value>hadoopm:8088</value>
</property>
<property>
	<name>yarn.resourcemanager.webapp.https.address</name>
	<value>hadoopm:8090</value>
</property>

Configuration: "hdfs-site.xml"

You can determine the number of backups of files and the path of data folders

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/dfs/data</value>
</property>
<property>
    <name>dfs.namenode.http-address</name>
    <value>hadoopm:50070</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoopm:50090</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

 

  • "replication": the number of copies of a file, usually three copies of a file backup
  • "dfs.namenode.name.dir": Define the name node path
  • "dfs.datanode.data.dir": Define the data file node path
  • "dfs.namenode.http-address": HTTP path access for name service
  • "dfs.namenode.secondary.http-address": second name node
  • "dfs.permissions": access to permissions, to avoid inaccessibility, set false

 

Because Hadoop is a distributed environment, a cluster will need to be built later.
It is recommended to create a masters file in the /usr/local/hadoop/etc/hadoop/ directory containing the host name, which is hadoopm (the host name defined in the hosts file above). In a stand-alone environment this file can also be left out.

vim masters

hadoopm

Modify slaves and add hadoopm

vim slaves 

Since the NameNode and DataNode directories configured above all live under the Hadoop directory, it is safest to create them in advance, inside /usr/local/hadoop:

mkdir -p dfs/name dfs/data

If Hadoop runs into problems and needs to be reconfigured, remove these two folders first

Format file system

hdfs namenode -format

If formatting succeeds, the message "INFO util.ExitUtil: Exiting with status 0" appears.
If an error occurs, the message "INFO util.ExitUtil: Exiting with status 1" appears.

Hadoop can then be started with a single command:

start-all.sh

Then use the jps command provided by the JDK to look at the Java processes; if six processes are shown (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps), the configuration is successful.

jps

Then you can test whether HDFS is working properly.

To stop the services when needed, use the stop-all.sh command.
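A minimal check (these are standard HDFS shell commands; the directory name is only an example) is to create and list a directory in HDFS, and to open the NameNode web interface configured above at http://hadoopm:50070:

hdfs dfs -mkdir -p /test
hdfs dfs -ls /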
