1. HDFS (Distributed File System)
1.1 Distributed File System
When the size of a data set exceeds the storage capacity of a single computer, the data set must be partitioned and stored across multiple machines in a network. A file system that manages storage across multiple computers in a network is called a distributed file system.
1.2 Characteristics of HDFS
- Distributed
  - As data grows beyond what a single operating system can hold, it gets spread across more and more disks and machines, which are inconvenient to manage and maintain by hand. A system is therefore needed to manage files across multiple machines: the distributed file management system.
- High availability
  - Replica mechanism
- Transparency
  - Access to a file actually goes through the network, but from the point of view of programs and users it looks like accessing a local disk.
1.3 HDFS Architecture
- namenode
  - The name node
  - The management node of the file system
  - Maintains the file directory tree of the entire file system
  - Receives users' requests
- datanode
  - The data node
  - Stores blocks (a block is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x)
1.4 Design of HDFS
- Streaming data access
  - Write once, read many times
- Commodity hardware
  - Hadoop does not need to run on expensive, highly reliable machines (IBM minicomputers, etc.); ordinary commodity machines are enough.
- Low-latency data access is not the goal
  - Applications that need responses within tens of milliseconds should not store their data in HDFS.
  - Although HDFS itself cannot provide low-latency access, HBase, which is built on top of HDFS, can.
- Lots of small files are a problem
  - The namenode holds the directory information and block information of every file in memory, at roughly 150 bytes per entry; one million files, each occupying one block, therefore need on the order of 300 MB of namenode memory.
  - HDFS is not suitable for storing large numbers of small files.
- Multiple writers and arbitrary file modification are not supported
  - A file stored in HDFS can have only one writer at a time.
  - Data can only be appended at the end of a file, not modified at arbitrary offsets (see the append example after this list).
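The append-only write model is visible from the command line as well. A minimal sketch, assuming a Hadoop 2.x cluster with append enabled; the file paths are illustrative:

```
# Append a local file to the end of an existing HDFS file;
# overwriting bytes in the middle of the file is not possible.
hdfs dfs -appendToFile local-log.txt /logs/app.log
```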
1.5 Block Size Planning
- block: the data block
  - The basic storage unit for large data sets
  - A block is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x
  - Why this design? (see the calculation after this list)
    - A hard disk has a seek (addressing) time of about 10 ms.
    - The design target is for seek time to be about 1% of transfer time.
    - The sequential read rate of a hard disk is typically about 100 MB/s.
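Putting the three figures above together gives the block size:

$$
t_{\text{transfer}} = \frac{t_{\text{seek}}}{1\%} = \frac{10\,\text{ms}}{0.01} = 1\,\text{s},
\qquad
\text{block size} \approx 100\,\text{MB/s} \times 1\,\text{s} = 100\,\text{MB}
$$

Rounding 100 MB up to the nearest power of two gives the 128 MB default of Hadoop 2.x.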
1.6 Secondary Namenode
- Merges edits and fsimage (checkpointing)
- Timing of the merge, whichever comes first:
  - Every 3600 s (fs.checkpoint.period in Hadoop 1.x; dfs.namenode.checkpoint.period in Hadoop 2.x)
  - When edits reaches 64 MB (fs.checkpoint.size in Hadoop 1.x; Hadoop 2.x instead triggers on a transaction count, dfs.namenode.checkpoint.txns)
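To confirm the interval actually in effect on a given cluster, the getconf tool described later can be used (property name as in Hadoop 2.x):

```
# Print the checkpoint interval (in seconds) configured on this cluster
hdfs getconf -confKey dfs.namenode.checkpoint.period
```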
2. Operation of HDFS
2.1 Graphical Operation
2.2 Shell operation
2.3 API operations
3. Operation of HDFS (graphical interface)
3.1 HDFS start-up process
- Enter safe mode
- Load fsimage
- Load edits
- Save a checkpoint (merge fsimage and edits to generate a new fsimage)
- Exit safe mode
3.2 Access through a Browser
http://namenode:50070
4. Operation of HDFS (shell operation)
- hdfs dfs (operates on HDFS specifically; examples below)
- hadoop fs (generic; works with any file system Hadoop supports)
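A few everyday commands as a quick reference; the paths are illustrative:

```
# List the root directory
hdfs dfs -ls /
# Create a directory, including parents
hdfs dfs -mkdir -p /data/in
# Upload a local file
hdfs dfs -put test01.txt /data/in
# Print a file to stdout
hdfs dfs -cat /data/in/test01.txt
# Download a file to the current local directory
hdfs dfs -get /data/in/test01.txt .
# Delete a directory recursively
hdfs dfs -rm -r /data
```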
5. Operation of HDFS (API operation)
5.1 POM Dependencies
```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.4</version>
</dependency>
```
5.2 Reading and Writing Files with HDFS
```java
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.junit.Test;

public class HdfsTest {

    /** Write file operation */
    @Test
    public void testWriteFile() throws Exception {
        // Create the configuration object and point it at the cluster
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        // Create the file system object
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/test002.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.write("hello".getBytes());
        out.flush();
        out.close();
    }

    /** Read file operation */
    @Test
    public void testReadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/test002.txt");
        FSDataInputStream in = fs.open(path);
        IOUtils.copy(in, System.out);
    }

    /** Upload file operation */
    @Test
    public void testUploadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path fromPath = new Path("file:///f:/test01.txt");
        Path toPath = new Path("/test01.txt");
        fs.copyFromLocalFile(false, fromPath, toPath);
    }

    /** Download file operation */
    @Test
    public void testDownloadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path fromPath = new Path("/test01.txt");
        Path toPath = new Path("file:///f:/test01.txt");
        fs.copyToLocalFile(false, fromPath, toPath);
    }

    /** Other operations: file status and block locations */
    @Test
    public void testOtherFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        // BlockLocation[] blockLocations = fs.getFileBlockLocations(new Path("/test01.txt"), 0, 134217730);
        // System.out.println(blockLocations);
        FileStatus[] listStatus = fs.listStatus(new Path("/test01.txt"));
        System.out.println(listStatus);
    }
}
```
6. Advanced Operation of HDFS
Roll the edits log: hdfs dfsadmin -rollEdits
Manage safe mode: hdfs dfsadmin -safemode <enter | leave | get | wait>
Merge edits and fsimage: hdfs dfsadmin -saveNamespace (the namenode must be in safe mode first)
View an fsimage file: hdfs oiv -i <fsimage file> -o <output file> -p <processor, e.g. XML>
View an edits file: hdfs oev -i <edits file> -o <output file> -p <processor, e.g. XML>
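For instance, to checkpoint manually and dump the metadata files to XML. The file names under the namenode's current directory vary per cluster; the ones below are illustrative:

```
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
# Dump a checkpoint image and an edits segment to XML for inspection
hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML
hdfs oev -i edits_inprogress_0000000000000000043 -o edits.xml -p XML
```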
7. Quota Management in HDFS
- Directory (name) quota
  - Set a directory quota
    - hdfs dfsadmin -setQuota n dir
    - n is the limit on the number of names under the directory, and the directory itself counts against it: with n = 1 no files can be stored, with n = 2 only one file, and so on.
  - Clear a directory quota
    - hdfs dfsadmin -clrQuota dir
- Space quota
  - Set a space quota
    - hdfs dfsadmin -setSpaceQuota n dir
    - n is the size of the space, counted after replication
  - Clear a space quota
    - hdfs dfsadmin -clrSpaceQuota dir
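A usage sketch; the directory name and the limits are illustrative:

```
# Allow at most one million names and 100 GB (after replication) under /data
hdfs dfsadmin -setQuota 1000000 /data
hdfs dfsadmin -setSpaceQuota 100g /data
# Inspect quota and usage for the directory
hadoop fs -count -q /data
```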
8. Getting Configuration
hdfs getconf -confKey <key>
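For example (Hadoop 2.x key names):

```
# Print the configured block size and replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication
# List the namenodes of the cluster
hdfs getconf -namenodes
```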
9. RPC in Hadoop
- RPC (Remote Procedure Call): a remote procedure call protocol
- A protocol by which a program requests a service from a program on a remote computer over the network, without having to understand the underlying network technology.
- Design purpose:
  - Calling a remote method should be as convenient as calling a local method.
9.1 Writing the RPC Server
Define the protocol:

```java
import org.apache.hadoop.ipc.VersionedProtocol;

/**
 * Define the protocol
 */
public interface IHelloService extends VersionedProtocol {
    long versionID = 123456798L; // the version of the protocol
    String sayHello(String name); // the concrete method of the protocol
}
```
Define the server-side instance class of the RPC protocol:

```java
import java.io.IOException;
import org.apache.hadoop.ipc.ProtocolSignature;

/**
 * Instance class that implements the protocol
 */
public class HelloServiceImpl implements IHelloService {

    @Override
    public String sayHello(String name) {
        System.out.println("==================" + name + "==================");
        return "hello" + name;
    }

    @Override
    public long getProtocolVersion(String protocol, long clientVersion) throws IOException {
        return versionID;
    }

    @Override
    public ProtocolSignature getProtocolSignature(String protocol, long clientVersion,
            int clientMethodsHash) throws IOException {
        return new ProtocolSignature();
    }
}
```
Define the launcher of the RPC server:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class MyRpcServer {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        RPC.Server server = new RPC.Builder(conf)
                .setBindAddress("172.16.4.3")        // the host to bind to
                .setPort(8899)                       // the port
                .setProtocol(IHelloService.class)    // the protocol
                .setInstance(new HelloServiceImpl()) // the instance; several can be configured
                .build();
        server.start();
        System.out.println("RPC Server started successfully....");
    }
}
```
9.2 Writing the RPC Client
Define the protocol (the same interface as on the server side):

```java
import org.apache.hadoop.ipc.VersionedProtocol;

/**
 * Define the protocol
 */
public interface IHelloService extends VersionedProtocol {
    long versionID = 123456798L; // the version of the protocol
    String sayHello(String name); // the concrete method of the protocol
}
```
Define the client launcher (wrapped in a main class here so the snippet compiles on its own):

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.ProtocolProxy;
import org.apache.hadoop.ipc.RPC;

public class MyRpcClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Obtain a proxy bound to the server address and the protocol version
        ProtocolProxy<IHelloService> proxy = RPC.getProtocolProxy(IHelloService.class,
                IHelloService.versionID, new InetSocketAddress("172.16.4.3", 8899), conf);
        IHelloService helloService = proxy.getProxy();
        String ret = helloService.sayHello("xiaoming");
        System.out.println(ret);
    }
}
```
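With the server from 9.1 running, the client prints helloxiaoming, while the server console logs the name surrounded by the ================== markers.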
10. Starting the namenode and datanode Independently
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh start secondarynamenode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
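The matching stop commands:

```
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop secondarynamenode
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
```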
11. Commissioning and Decommissioning of Nodes
- Nodes can be added dynamically without stopping the whole cluster
- HDFS maintains a white list and a black list of datanodes
11.1 Node Commissioning
Operate on the namenode.

hdfs-site.xml:

```xml
<!-- White list -->
<property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop/etc/hadoop/dfs.include</value>
</property>
```
Create the white list file, one host per line:
/opt/hadoop/etc/hadoop/dfs.include
uplooking03
uplooking04
uplooking05
uplooking06
Refresh the nodes:
hdfs dfsadmin -refreshNodes
11.2 Node Decommissioning
- Remove the node from the white list
- Add it to the black list (configured via dfs.hosts.exclude; see the sketch after this list)
- Refresh the nodes (hdfs dfsadmin -refreshNodes)
- After decommissioning finishes, remove the node from the black list
- Stop the datanode process
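The black list is configured in hdfs-site.xml in the same way as the white list. A minimal sketch: the property name dfs.hosts.exclude is standard, while the file path is an assumption placing it next to dfs.include:

```xml
<!-- Black list -->
<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/hadoop/etc/hadoop/dfs.exclude</value>
</property>
```

After editing the exclude file, hdfs dfsadmin -refreshNodes makes the namenode re-read both lists.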