1. HDFS (Distributed File System)
1.1 Distributed File System
When the size of a data set exceeds the storage capacity of a single computer, the data set must be partitioned and stored across multiple machines in a network. A file system that manages storage across multiple computers in a network is called a distributed file system.
1.2 Characteristics of HDFS
- Distributed
  - As data grows beyond what a single operating system can hold, it gets spread across more and more disks and machines, which are inconvenient to manage and maintain by hand. A system is therefore needed to manage files across multiple machines: the distributed file management system.
- High availability
  - Replica mechanism
- Transparency
  - Access to a file actually goes through the network, but from the point of view of programs and users it looks like accessing a local disk.
1.3 HDFS Architecture
- namenode
  - The name node
  - The management node of the file system
  - Maintains the file directory tree of the entire file system
  - Receives users' requests
- datanode
  - The data node
  - Stores blocks (a block is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x)
1.4 Design of HDFS
- Streaming data access
  - Write once, read many times
- Commodity hardware
  - Hadoop does not need to run on expensive, highly reliable machines (IBM minicomputers, etc.); ordinary commodity machines are enough.
- Low-latency data access is not the goal
  - Applications that need responses within tens of milliseconds should not store their data in HDFS.
  - Although HDFS itself cannot provide low-latency access, HBase, which is built on top of HDFS, can.
- Lots of small files are a problem
  - The namenode holds the directory information and block information of every file in memory, at roughly 150 bytes per entry; one million files, each occupying one block, therefore need on the order of 300 MB of namenode memory.
  - HDFS is not suitable for storing large numbers of small files.
- Multiple writers and arbitrary file modification are not supported
  - A file stored in HDFS can have only one writer at a time.
  - Data can only be appended at the end of a file, not modified at arbitrary offsets (see the append example after this list).
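The append-only write model is visible from the command line as well. A minimal sketch, assuming a Hadoop 2.x cluster with append enabled; the file paths are illustrative:

```
# Append a local file to the end of an existing HDFS file;
# overwriting bytes in the middle of the file is not possible.
hdfs dfs -appendToFile local-log.txt /logs/app.log
```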
1.5 Block Size Planning
- block: the data block
  - The basic storage unit for large data sets
  - A block is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x
  - Why this design? (see the calculation after this list)
    - A hard disk has a seek (addressing) time of about 10 ms.
    - The design target is for seek time to be about 1% of transfer time.
    - The sequential read rate of a hard disk is typically about 100 MB/s.
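Putting the three figures above together gives the block size:

$$
t_{\text{transfer}} = \frac{t_{\text{seek}}}{1\%} = \frac{10\,\text{ms}}{0.01} = 1\,\text{s},
\qquad
\text{block size} \approx 100\,\text{MB/s} \times 1\,\text{s} = 100\,\text{MB}
$$

Rounding 100 MB up to the nearest power of two gives the 128 MB default of Hadoop 2.x.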
1.6 Secondary Namenode
- Merges edits and fsimage (checkpointing)
- Timing of the merge, whichever comes first:
  - Every 3600 s (fs.checkpoint.period in Hadoop 1.x; dfs.namenode.checkpoint.period in Hadoop 2.x)
  - When edits reaches 64 MB (fs.checkpoint.size in Hadoop 1.x; Hadoop 2.x instead triggers on a transaction count, dfs.namenode.checkpoint.txns)
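To confirm the interval actually in effect on a given cluster, the getconf tool described later can be used (property name as in Hadoop 2.x):

```
# Print the checkpoint interval (in seconds) configured on this cluster
hdfs getconf -confKey dfs.namenode.checkpoint.period
```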
2. Operation of HDFS
2.1 Graphical Operation
2.2 Shell operation
2.3 API operations
3. Operation of HDFS (graphical interface)
3.1 HDFS start-up process
- Enter safe mode
- Load fsimage
- Load edits
- Save a checkpoint (merge fsimage and edits to generate a new fsimage)
- Exit safe mode
3.2 Access through a Browser
http://namenode:50070
4. Operation of HDFS (shell operation)
- hdfs dfs (operates on HDFS specifically; examples below)
- hadoop fs (generic; works with any file system Hadoop supports)
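A few everyday commands as a quick reference; the paths are illustrative:

```
# List the root directory
hdfs dfs -ls /
# Create a directory, including parents
hdfs dfs -mkdir -p /data/in
# Upload a local file
hdfs dfs -put test01.txt /data/in
# Print a file to stdout
hdfs dfs -cat /data/in/test01.txt
# Download a file to the current local directory
hdfs dfs -get /data/in/test01.txt .
# Delete a directory recursively
hdfs dfs -rm -r /data
```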
5. Operation of HDFS (API operation)
5.1 POM Dependencies
```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.4</version>
</dependency>
```
5.2 Reading and Writing Files with HDFS
```java
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.junit.Test;

public class HdfsTest {

    /** Write file operation */
    @Test
    public void testWriteFile() throws Exception {
        // Create the configuration object and point it at the cluster
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        // Create the file system object
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/test002.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.write("hello".getBytes());
        out.flush();
        out.close();
    }

    /** Read file operation */
    @Test
    public void testReadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/test002.txt");
        FSDataInputStream in = fs.open(path);
        IOUtils.copy(in, System.out);
    }

    /** Upload file operation */
    @Test
    public void testUploadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path fromPath = new Path("file:///f:/test01.txt");
        Path toPath = new Path("/test01.txt");
        fs.copyFromLocalFile(false, fromPath, toPath);
    }

    /** Download file operation */
    @Test
    public void testDownloadFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        Path fromPath = new Path("/test01.txt");
        Path toPath = new Path("file:///f:/test01.txt");
        fs.copyToLocalFile(false, fromPath, toPath);
    }

    /** Other operations: file status and block locations */
    @Test
    public void testOtherFile() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://uplooking01:8020");
        FileSystem fs = FileSystem.get(conf);
        // BlockLocation[] blockLocations = fs.getFileBlockLocations(new Path("/test01.txt"), 0, 134217730);
        // System.out.println(blockLocations);
        FileStatus[] listStatus = fs.listStatus(new Path("/test01.txt"));
        System.out.println(listStatus);
    }
}
```
6. Advanced Operation of HDFS
Roll the edits log: hdfs dfsadmin -rollEdits
Manage safe mode: hdfs dfsadmin -safemode <enter | leave | get | wait>
Merge edits and fsimage: hdfs dfsadmin -saveNamespace (the namenode must be in safe mode first)
View an fsimage file: hdfs oiv -i <fsimage file> -o <output file> -p <processor, e.g. XML>
View an edits file: hdfs oev -i <edits file> -o <output file> -p <processor, e.g. XML>
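For instance, to checkpoint manually and dump the metadata files to XML. The file names under the namenode's current directory vary per cluster; the ones below are illustrative:

```
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
# Dump a checkpoint image and an edits segment to XML for inspection
hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML
hdfs oev -i edits_inprogress_0000000000000000043 -o edits.xml -p XML
```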
7. Quota Management in HDFS
- Directory (name) quota
  - Set a directory quota
    - hdfs dfsadmin -setQuota n dir
    - n is the limit on the number of names under the directory, and the directory itself counts against it: with n = 1 no files can be stored, with n = 2 only one file, and so on.
  - Clear a directory quota
    - hdfs dfsadmin -clrQuota dir
- Space quota
  - Set a space quota
    - hdfs dfsadmin -setSpaceQuota n dir
    - n is the size of the space, counted after replication
  - Clear a space quota
    - hdfs dfsadmin -clrSpaceQuota dir
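A usage sketch; the directory name and the limits are illustrative:

```
# Allow at most one million names and 100 GB (after replication) under /data
hdfs dfsadmin -setQuota 1000000 /data
hdfs dfsadmin -setSpaceQuota 100g /data
# Inspect quota and usage for the directory
hadoop fs -count -q /data
```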
8. Getting Configuration
hdfs getconf -confKey <key>
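For example (Hadoop 2.x key names):

```
# Print the configured block size and replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication
# List the namenodes of the cluster
hdfs getconf -namenodes
```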
9. RPC in Hadoop
- RPC (Remote Procedure Call): a remote procedure call protocol
- A protocol by which a program requests a service from a program on a remote computer over the network, without having to understand the underlying network technology.
- Design purpose:
  - Calling a remote method should be as convenient as calling a local method.
9.1 Writing the RPC Server
Define the protocol:

```java
import org.apache.hadoop.ipc.VersionedProtocol;

/**
 * Define the protocol
 */
public interface IHelloService extends VersionedProtocol {
    long versionID = 123456798L; // the version of the protocol
    String sayHello(String name); // the concrete method of the protocol
}
```
Define the server-side instance class of the RPC protocol:

```java
import java.io.IOException;
import org.apache.hadoop.ipc.ProtocolSignature;

/**
 * Instance class that implements the protocol
 */
public class HelloServiceImpl implements IHelloService {

    @Override
    public String sayHello(String name) {
        System.out.println("==================" + name + "==================");
        return "hello" + name;
    }

    @Override
    public long getProtocolVersion(String protocol, long clientVersion) throws IOException {
        return versionID;
    }

    @Override
    public ProtocolSignature getProtocolSignature(String protocol, long clientVersion,
            int clientMethodsHash) throws IOException {
        return new ProtocolSignature();
    }
}
```
Define the launcher of the RPC server:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class MyRpcServer {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        RPC.Server server = new RPC.Builder(conf)
                .setBindAddress("172.16.4.3")        // the host to bind to
                .setPort(8899)                       // the port
                .setProtocol(IHelloService.class)    // the protocol
                .setInstance(new HelloServiceImpl()) // the instance; several can be configured
                .build();
        server.start();
        System.out.println("RPC Server started successfully....");
    }
}
```
9.2 Writing the RPC Client
Define the protocol (the same interface as on the server side):

```java
import org.apache.hadoop.ipc.VersionedProtocol;

/**
 * Define the protocol
 */
public interface IHelloService extends VersionedProtocol {
    long versionID = 123456798L; // the version of the protocol
    String sayHello(String name); // the concrete method of the protocol
}
```
Define the client launcher (wrapped in a main class here so the snippet compiles on its own):

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.ProtocolProxy;
import org.apache.hadoop.ipc.RPC;

public class MyRpcClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Obtain a proxy bound to the server address and the protocol version
        ProtocolProxy<IHelloService> proxy = RPC.getProtocolProxy(IHelloService.class,
                IHelloService.versionID, new InetSocketAddress("172.16.4.3", 8899), conf);
        IHelloService helloService = proxy.getProxy();
        String ret = helloService.sayHello("xiaoming");
        System.out.println(ret);
    }
}
```
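With the server from 9.1 running, the client prints helloxiaoming, while the server console logs the name surrounded by the ================== markers.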
10. Starting the namenode and datanode Independently
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh start secondarynamenode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
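The matching stop commands:

```
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop secondarynamenode
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
```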
11. Commissioning and Decommissioning of Nodes
- Nodes can be added dynamically without stopping the whole cluster
- HDFS maintains a white list and a black list of datanodes
11.1 Node Commissioning
Operate on the namenode.

hdfs-site.xml:

```xml
<!-- White list -->
<property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop/etc/hadoop/dfs.include</value>
</property>
```
Create the white list file, one host per line:
/opt/hadoop/etc/hadoop/dfs.include
uplooking03
uplooking04
uplooking05
uplooking06
Refresh the nodes:
hdfs dfsadmin -refreshNodes
11.2 Node Decommissioning
- Remove the node from the white list
- Add it to the black list (configured via dfs.hosts.exclude; see the sketch after this list)
- Refresh the nodes (hdfs dfsadmin -refreshNodes)
- After decommissioning finishes, remove the node from the black list
- Stop the datanode process
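The black list is configured in hdfs-site.xml in the same way as the white list. A minimal sketch: the property name dfs.hosts.exclude is standard, while the file path is an assumption placing it next to dfs.include:

```xml
<!-- Black list -->
<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/hadoop/etc/hadoop/dfs.exclude</value>
</property>
```

After editing the exclude file, hdfs dfsadmin -refreshNodes makes the namenode re-read both lists.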