Troubleshooting HDFS for Hadoop optimization

This post covers troubleshooting Hadoop HDFS: NameNode fault handling, cluster safe mode, and disk repair. If you spot a mistake, corrections are welcome. Thanks!

NameNode fault handling

1. Scenario
The NameNode process has died and the data it stored is lost. How can the NameNode be recovered?
2. Fault simulation
(1) Kill the NameNode process with kill -9
[lqs@bdc112 current]$ kill -9 19886
(2) Delete the data stored by the NameNode (/home/lqs/module/hadoop-3.1.3/data/dfs/name)
[lqs@bdc112 hadoop-3.1.3]$ rm -rf /home/lqs/module/hadoop-3.1.3/data/dfs/name/*
3. Solution
(1) Copy the data from the SecondaryNameNode into the original NameNode data directory
[lqs@bdc112 dfs]$ scp -r lqs@bdc114:/home/lqs/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/
(2) Restart the NameNode
[lqs@bdc112 hadoop-3.1.3]$ hdfs --daemon start namenode
(3) Upload a file to the cluster to confirm that it works again
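
The solution steps above can be collected into one small script. This is only a sketch under assumptions: the host and paths are the ones used in this walkthrough, while DRY_RUN, run and recover_namenode are hypothetical names introduced here for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/bash
# Sketch of the NameNode recovery steps. Host and paths come from this walkthrough;
# DRY_RUN, run and recover_namenode are hypothetical helpers, not Hadoop tooling.
HADOOP_DIR="/home/lqs/module/hadoop-3.1.3"
SNN_HOST="lqs@bdc114"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_namenode() {
    # (1) Copy the SecondaryNameNode data into the NameNode data directory.
    #     The glob is quoted so that it is passed through to scp unexpanded.
    run scp -r "$SNN_HOST:$HADOOP_DIR/data/dfs/namesecondary/*" "$HADOOP_DIR/data/dfs/name/"
    # (2) Restart the NameNode daemon.
    run hdfs --daemon start namenode
}

recover_namenode
```

Running it with DRY_RUN=0 on the NameNode host would execute the two commands for real, so it is worth checking the printed commands first.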

Cluster safe mode & disk repair

Brief introduction

In safe mode, the file system accepts only read requests and rejects change requests such as deletion and modification.

When the cluster enters safe mode

1. The NameNode is in safe mode while it loads the image file (fsimage) and the edit log
2. The NameNode is also in safe mode while it receives DataNode registrations

Conditions for exiting safe mode

1. dfs.namenode.safemode.min.datanodes: the minimum number of available DataNodes. The default value is 0.
2. dfs.namenode.safemode.threshold-pct: the fraction of all blocks in the system that must have reached the minimum number of replicas. The default value is 0.999f, so only about one block in a thousand may be missing.
3. dfs.namenode.safemode.extension: stabilization time. The default value is 30000 milliseconds, i.e. 30 seconds.
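
To make the block threshold concrete, here is a small arithmetic sketch. It assumes the threshold is applied as "reported blocks >= total blocks x threshold, rounded down"; that rounding is an assumption made for illustration, and blocks_threshold_met is a hypothetical helper, not a Hadoop command.

```shell
#!/bin/bash
# Illustration only: checks a block count against a safe mode threshold.
# Assumption: required blocks = floor(total * threshold). Not the NameNode's own code.
# Usage: blocks_threshold_met TOTAL_BLOCKS REPORTED_BLOCKS [THRESHOLD]
blocks_threshold_met() {
    local total="$1" reported="$2" pct="${3:-0.999}"
    awk -v t="$total" -v r="$reported" -v p="$pct" \
        'BEGIN { need = int(t * p); exit ((r >= need) ? 0 : 1) }'
}

# 100 blocks in total, 99 reported: floor(100 * 0.999) = 99, so the condition holds.
if blocks_threshold_met 100 99; then
    echo "threshold met"
else
    echo "threshold not met"
fi
```

Under this assumption, 98 reported blocks out of 100 would not meet the default threshold, so a small cluster that is missing two blocks would stay in safe mode.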

Basic syntax

While the cluster is in safe mode it cannot perform important operations (write operations). Once startup has completed, the cluster leaves safe mode automatically.
bin/hdfs dfsadmin -safemode get
	Function: view the safe mode status
bin/hdfs dfsadmin -safemode enter
	Function: enter safe mode
bin/hdfs dfsadmin -safemode leave
	Function: leave safe mode
bin/hdfs dfsadmin -safemode wait
	Function: block until safe mode is off
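
In a script it can be handy to turn the output of the get command into a simple ON/OFF value. The sketch below assumes the status line has the form "Safe mode is ON" or "Safe mode is OFF", as in the transcripts in this post; safemode_state is a hypothetical helper name.

```shell
#!/bin/bash
# Reads the output of `hdfs dfsadmin -safemode get` on stdin and prints ON, OFF or UNKNOWN.
# safemode_state is a hypothetical helper; it only matches the two status lines shown in this post.
safemode_state() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *"Safe mode is ON"*)  echo "ON";  return 0 ;;
            *"Safe mode is OFF"*) echo "OFF"; return 0 ;;
        esac
    done
    echo "UNKNOWN"
    return 1
}

# Example with a captured status line instead of a live cluster:
echo "Safe mode is ON" | safemode_state
```

On a live cluster the call would be hdfs dfsadmin -safemode get | safemode_state.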

Practice 01: start the cluster and enter safe mode

1. Restart the cluster
[lqs@bdc112 subdir0]$ myhadoop.sh stop
[lqs@bdc112 subdir0]$ myhadoop.sh start
2. Immediately after the cluster starts, try to delete data on it; the request is rejected with a message that the cluster is in safe mode

Practice 02: disk repair

Scenario:
Data blocks are damaged and the cluster enters safe mode. How can this be resolved?
1. On bdc112, bdc113 and bdc114, go to the /home/lqs/module/hadoop-3.1.3/data/dfs/data/current/BP-1015489500-192.168.10.102-1611909480872/current/finalized/subdir0/subdir0 directory and delete the same two blocks on every host
[lqs@bdc112 subdir0]$ pwd
/home/lqs/module/hadoop-3.1.3/data/dfs/data/current/BP-1015489500-192.168.10.102-1611909480872/current/finalized/subdir0/subdir0
[lqs@bdc112 subdir0]$ rm -rf blk_1073741847 blk_1073741847_1023.meta
[lqs@bdc112 subdir0]$ rm -rf blk_1073741865 blk_1073741865_1042.meta
# Note: repeat the above commands on bdc113 and bdc114
2. Restart the cluster
[lqs@bdc112 subdir0]$ myhadoop.sh stop
[lqs@bdc112 subdir0]$ myhadoop.sh start
3. Observe http://bdc112:9870/dfshealth.html#tab-overview


If the overview page reports that safe mode is on, the number of reported blocks has not reached the required threshold.

4. Leave safe mode

[lqs@bdc112 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is ON
[lqs@bdc112 subdir0]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
5. Observe http://bdc112:9870/dfshealth.html#tab-overview



6. Delete the metadata of the damaged files through the web UI

7. Observe http://bdc112:9870/dfshealth.html#tab-overview again; the cluster is back to normal

Practice 03: wait for safe mode

Scenario:
Simulate waiting for the cluster to leave safe mode



1. View current mode

[lqs@bdc112 hadoop-3.1.3]$ hdfs dfsadmin -safemode get
Safe mode is OFF
2. Enter safe mode first
[lqs@bdc112 hadoop-3.1.3]$ bin/hdfs dfsadmin -safemode enter
3. Create and execute the following script
In the /home/lqs/module/hadoop-3.1.3 directory, create a script named safemode.sh
[lqs@bdc112 hadoop-3.1.3]$ vim safemode.sh
#!/bin/bash

hdfs dfsadmin -safemode wait
hdfs dfs -put /home/lqs/module/hadoop-3.1.3/README.txt /
[lqs@bdc112 hadoop-3.1.3]$ chmod 777 safemode.sh

[lqs@bdc112 hadoop-3.1.3]$ ./safemode.sh 
4. Open another window and run
[lqs@bdc112 hadoop-3.1.3]$ bin/hdfs dfsadmin -safemode leave
5. Check the previous window
Safe mode is OFF
6. The uploaded file is now on the HDFS cluster
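
The wait command in safemode.sh blocks for as long as safe mode stays on. If a script should give up after a while, a bounded polling loop is one possible alternative. The sketch below is an assumption-laden illustration: wait_until_safemode_off and fake_status are hypothetical names, and the loop simply looks for the "Safe mode is OFF" line shown above.

```shell
#!/bin/bash
# Polls a status command up to TRIES times, one second apart, until it reports
# "Safe mode is OFF". wait_until_safemode_off is a hypothetical helper.
# Usage: wait_until_safemode_off TRIES COMMAND [ARGS...]
wait_until_safemode_off() {
    local tries="$1" i=0
    shift
    while [ "$i" -lt "$tries" ]; do
        if "$@" | grep -q "Safe mode is OFF"; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Stand-in for the real status command so the sketch runs without a cluster.
fake_status() { echo "Safe mode is OFF"; }

if wait_until_safemode_off 3 fake_status; then
    echo "safe mode is off, continuing"
fi
```

On a real cluster the call would be wait_until_safemode_off 30 hdfs dfsadmin -safemode get, followed by the hdfs dfs -put command from safemode.sh.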

Keywords: Big Data Hadoop Distribution hdfs

Added by Stressed on Sat, 18 Dec 2021 15:50:32 +0200