HBase RIT problem handling

Problem:

Recently, more than 10,000 regions in our HBase clusters were found stuck in the RIT (region-in-transition) state, leaving many HBase clusters unavailable.

HBase version:

2.0.1

Locating the problem:

1. At first I assumed this was a simple timeout, so I used a script to manually change the region states in the meta table (*ING -> CLOSED), did a rolling restart of the HBase RegionServer and Master services, and finally assigned the regions in batch. The RIT regions were still there. (With no hbck tool available, all of this had to be done by hand.)

2. Manually running assign on a region timed out. A large number of assign procedures were stuck in a waiting state, and forcibly stopping them with abort_procedure had no effect either. Clearly this was not a simple timeout.

3. I looked up the procedure ID (pid) on the procedures page and searched for it in the master log to find out why the procedure was stuck. It turned out that the region's descriptor file could not be found. A look at HDFS showed that the entire table directory was gone, so the region could not be brought online. (The cause: when the business side deleted tables, it deleted only the table directories on HDFS!)
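
The log search in step 3 can be sketched as a small helper. This is only a sketch under assumptions: the function name find_procedure_log is made up for illustration, the log file path is whatever your master actually writes to, and it assumes the log tags procedures as pid=<id>.

```shell
#!/bin/bash
# Print the master-log lines (with line numbers) that mention one procedure ID.
# Assumption: procedures show up in the log as "pid=<id>".
#   $1: procedure ID taken from the procedures page
#   $2: path to the master log file
find_procedure_log() {
  pid="$1"
  logfile="$2"
  # -w requires a whole-word match, so pid=12 does not match pid=123;
  # it also skips ppid=12, i.e. lines that only name 12 as a parent.
  grep -n -w -- "pid=${pid}" "$logfile"
}
```

For example, find_procedure_log 12 /path/to/hbase-master.log prints every line about procedure 12, which is where the missing-file error showed up in this case.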

Solution:

Once the cause is known, the fix itself is simple: delete all the metadata that hbase:meta still records for the deleted tables. In practice, the hard part is identifying those tables. There are far too many to handle by hand (later counts showed nearly 2,000 tables and more than 10,000 regions), and the table names are prefix + timestamp, so nobody could supply the exact list of names.

Because of this, the only option is to collect the table names with a script. The steps are as follows:

1. Get the row keys of all RIT tables
echo "scan 'hbase:meta',{COLUMNS=>'info:state'}" | hbase shell -n | grep -E 'ING|OFFLINE|CLOSED|OPEN' > meta.txt

2. Extract namespace:table name from meta.txt
cat meta.txt | awk -F "," '{print $1}' > table_name_and_namespace.txt

3. Get the table names with a script
sh get_table_name.sh > meta_table_name_tmp.txt

--------------------------------------------------
#!/bin/bash
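#Only tables under the wz namespace are processed here (the customer tables are all under the wz namespace)

# Read the namespace:table names produced in step 2
while read -r line
do
  ns=$(echo "$line" | awk -F ":" '{print $1}')      # namespace
  table=$(echo "$line" | awk -F ":" '{print $2}')   # table name
  # Use = rather than == so the test also works when the script is run with sh
  if [ "$ns" = "wz" ]; then
    echo "$table"
  fi
done < table_name_and_namespace.txt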
--------------------------------------------------

4. Deduplicate the table names (sort -u also removes duplicates that are not adjacent)
sort -u meta_table_name_tmp.txt > meta_table_name.txt

5. Get all table names under the specified HDFS path (replace wz in /apps/hbase/data/data/wz with the actual namespace). The field numbers 8 and 7 in the awk commands depend on the actual ls output and path depth, so adjust them as needed. /opt/ceshi/hdfs_table_name.txt has permission 777.
hdfs dfs -ls /apps/hbase/data/data/wz | awk '{print $8}' | awk -F "/" '{print $7}' >  /opt/ceshi/hdfs_table_name.txt

6. Compare the two lists to find the names of the deleted tables
sh diff.sh

--------------------------------------------------
#!/bin/bash

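# Print every table that hbase:meta still references but that has no directory on HDFS
while read -r line
do
  # -x matches the whole line and -F treats the name literally, so t_1 does not match t_12
  flag=$(grep -cxF -- "$line" hdfs_table_name.txt)
  if [ "$flag" -eq 0 ]; then
    echo "$line"
  fi
done < meta_table_name.txt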
--------------------------------------------------

7. Spot-check a few of the tables to verify that the results are correct and that the tables really have been deleted on the HDFS side
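
As an alternative to the two loops, steps 2 to 6 can be collapsed into one pipeline built on sort and comm. This is a sketch rather than the script that was actually run: the function name missing_tables is hypothetical, and it assumes the scan rows in meta.txt start with a row key of the form namespace:table,start key,timestamp.encoded name.

```shell
#!/bin/bash
# List tables that hbase:meta still references but that have no directory on HDFS.
#   $1: file of hbase:meta scan rows (meta.txt from step 1)
#   $2: file of table names found on HDFS, one per line (from step 5)
#   $3: namespace to check (wz in this article)
missing_tables() {
  meta_rows="$1"
  hdfs_tables="$2"
  ns="$3"
  sorted_hdfs=$(mktemp)
  # comm needs both inputs sorted the same way, so force the C locale
  LC_ALL=C sort -u "$hdfs_tables" > "$sorted_hdfs"
  awk -F ',' '{print $1}' "$meta_rows" \
    | awk -F ':' -v ns="$ns" '{gsub(/^[ \t]+/, "", $1)} $1 == ns {print $2}' \
    | LC_ALL=C sort -u \
    | LC_ALL=C comm -23 - "$sorted_hdfs"   # keep names that appear only in hbase:meta
  rm -f "$sorted_hdfs"
}
```

Calling missing_tables meta.txt hdfs_table_name.txt wz prints the candidate table names. Because comm compares whole lines, a table named t_1 is not hidden by a surviving table named t_12.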

Once the table names are known, scan the meta table to collect every row key that belongs to them, then remove the dirty data with the deleteall command.

1. Get the row keys
echo " scan 'hbase:meta'" | hbase shell -n | grep -E 'table:state|info:state' | grep -E ' table_name1,|table_name2,|...|..| table_nameN,' > zangshuju.txt

2. Delete the dirty data
for line in `cat zangshuju.txt |awk '{print $1}'`;do echo " deleteall 'hbase:meta',\"$line\" ";done | hbase shell -n
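
Since deleteall on hbase:meta cannot be undone, it may be safer to write the generated commands to a file and review them before running anything. A minimal sketch, assuming the saved rows look like the scan output above; gen_meta_deletes and delete_meta.txt are hypothetical names:

```shell
#!/bin/bash
# Turn saved hbase:meta scan rows into deleteall commands, one per distinct row key.
#   $1: file of scan rows whose first whitespace-separated field is the row key
gen_meta_deletes() {
  awk '{print $1}' "$1" | sort -u | while IFS= read -r rowkey; do
    # Skip blank lines so no empty deleteall is generated
    [ -n "$rowkey" ] || continue
    printf 'deleteall '\''hbase:meta'\'', "%s"\n' "$rowkey"
  done
}
```

Run gen_meta_deletes zangshuju.txt > delete_meta.txt, read through delete_meta.txt, and only then feed it to the shell with cat delete_meta.txt | hbase shell -n.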

Finally, stop both HBase master services and start them again. After the restart, the RIT regions had completely disappeared.

Summary:

The root cause of this RIT problem was improper usage: tables were removed by deleting their directories on HDFS while hbase:meta still referenced them. Going forward, the table deletion logic should be changed to disable + drop (recommended). Alternatively, whenever table data is deleted from HDFS, the corresponding metadata must be deleted as well.
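
The recommended disable + drop flow could be scripted along these lines. This is a sketch only: gen_drop_commands is a hypothetical helper and wz:example_table in the usage note is a placeholder table name.

```shell
#!/bin/bash
# Emit HBase shell commands that remove tables with disable followed by drop,
# so that the metadata and the HDFS data are removed together by HBase itself.
#   $@: fully qualified table names (namespace:table)
gen_drop_commands() {
  for table in "$@"; do
    printf "disable '%s'\ndrop '%s'\n" "$table" "$table"
  done
}
```

For example, gen_drop_commands 'wz:example_table' | hbase shell -n disables and then drops that one table.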

Keywords: HBase

Added by lawnmowerman on Fri, 31 Dec 2021 09:58:11 +0200