HBase HBCK2 usage guide

HBCK2 is the Apache HBase cluster repair tool.

Comparison between HBCK2 and hbck1

HBCK2 is the successor to hbck, the repair tool that shipped with hbase-1.x (a.k.a. hbck1). Use HBCK2 in place of hbck1 when repairing hbase-2.x clusters. hbck1 should not be run against hbase-2.x installs: it can do damage. hbck1 is still bundled inside hbase-2.x, to minimize surprise, but it is deprecated and will be removed in hbase-3.x. Its write facility (-fix) has been removed. It can report on the state of an hbase-2.x cluster, but its assessments will be inaccurate because it does not understand the internal workings of hbase-2.x.

HBCK2 doesn't work like hbck1, even if the command names are similar in both versions. See the next section for the differences between these tools.

Basic concepts

HBCK2 performs a single, discrete task each time it is run. Unlike hbck1, it is not an analysis tool that inspects everything about a running cluster and then tries to fix "all problems" it finds.

HBCK2 is for fixups. For a listing of inconsistencies or blockages in a running cluster, look elsewhere: in the logs and UI of the running cluster's Master. Once you have found a problem, use the HBCK2 tool to ask the Master to fix it or to skip over the bad state. This is another important difference from hbck1: HBCK2 asks the Master to make the repair, rather than attempting the repair locally within the tool's own context. The sections below describe how this interactive fix-up process works and how HBCK2 operates.

Source download and compilation

Enter the download page

https://hbase.apache.org/downloads.html

Download HBase Operator Tools

The compilation command is as follows

mvn clean install -DskipTests

After compilation, you can see the corresponding jar file under hbase-hbck2/target.

Run HBCK2 tool

The HBCK2 jar does not bundle its dependencies; it is not a fat jar. Dependencies must be provided. Running goes most smoothly when the jar is built against the same hbase version as your deployment: set hbase.version in the parent pom.xml of hbase-operator-tools to match before building.

An interesting aspect of the runtime interaction between HBCK2 and a running cluster is that HBCK2 may be deployed ahead of your hbase, in which case your hbase may not support every API that the current HBCK2 uses. When the server side lacks the support HBCK2 needs, HBCK2 should fail gracefully. If you hit this, use an older version of HBCK2 or, if you can, upgrade your cluster.

The easiest way to provide HBCK2 with its dependencies is to launch it via the $HBASE_HOME/bin/hbase script. The bin/hbase script already mentions hbck; there is an hbck option listed in its help output. By default, running bin/hbase hbck runs the built-in hbck1 tool. To run HBCK2, point at the built HBCK2 jar with the -j option, as in:

 $  ${HBASE_HOME}/bin/hbase --config /etc/hbase-conf hbck -j ~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-xxx.jar

In the above, /etc/hbase-conf is where the deployment's configuration lives, and the HBCK2 jar is at ~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-xxx.jar. Run as above with no options or arguments, the command prints the HBCK2 help:

usage: HBCK2 [OPTIONS] COMMAND <ARGS>
Options:
 -d,--debug                                       run with debug output
 -h,--help                                        output this help message
 -p,--hbase.zookeeper.property.clientPort <arg>   port of hbase ensemble
 -q,--hbase.zookeeper.quorum <arg>                hbase ensemble
 -s,--skip                                        skip hbase version check
                                                  (PleaseHoldException)
 -v,--version                                     this hbck2 version
 -z,--zookeeper.znode.parent <arg>                parent znode of hbase
                                                  ensemble
Command:
 addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   Options:
    -d,--force_disable aborts fix for table if disable fails.
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run _hbck1_
   'OfflineMetaRepair' against an hbase-2.x cluster. Needs hbase:meta
   to be online. For each table name passed as parameter, performs diff
   between regions available in hbase:meta and region dirs on HDFS.
   Then for dirs with no hbase:meta matches, it reads the 'regioninfo'
   metadata file and re-creates given region in hbase:meta. Regions are
   re-created in 'CLOSED' state in the hbase:meta table, but not in the
   Masters' cache, and they are not assigned either. To get these
   regions online, run the HBCK2 'assigns'command printed when this
   command-run completes.
   NOTE: If using hbase releases older than 2.3.0, a rolling restart of
   HMasters is needed prior to executing the set of 'assigns' output.
   An example adding missing regions for tables 'tbl_1' in the default
   namespace, 'tbl_2' in namespace 'n1' and for all tables from
   namespace 'n2':
     $ HBCK2 addFsRegionsMissingInMeta default:tbl_1 n1:tbl_2 n2
   Returns HBCK2  an 'assigns' command with all re-inserted regions.
   SEE ALSO: reportMissingRegionsInMeta
   SEE ALSO: fixMeta

 assigns [OPTIONS] <ENCODED_REGIONNAME/INPUTFILES_FOR_REGIONNAMES>...
   Options:
    -o,--override  override ownership by another procedure
    -i,--inputFiles  take one or more encoded region names
   A 'raw' assign that can be used even during Master initialization (if
   the -skip flag is specified). Skirts Coprocessors. Pass one or more
   encoded region names. 1588230740 is the hard-coded name for the
   hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example of
   what a user-space encoded region name looks like. For example:
     $ HBCK2 assigns 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created AssignProcedure(s) or -1 if none.
   If -i or --inputFiles is specified, pass one or more input file names.
   Each file contains encoded region names, one per line. For example:
     $ HBCK2 assigns -i fileName1 fileName2
 bypass [OPTIONS] <PID>...
   Options:
    -o,--override   override if procedure is running/stuck
    -r,--recursive  bypass parent and its children. SLOW! EXPENSIVE!
    -w,--lockWait   milliseconds to wait before giving up; default=1
   Pass one (or more) procedure 'pid's to skip to procedure finish. Parent
   of bypassed procedure will also be skipped to the finish. Entities will
   be left in an inconsistent state and will require manual fixup. May
   need Master restart to clear locks still held. Bypass fails if
   procedure has children. Add 'recursive' if all you have is a parent pid
   to finish parent and children. This is SLOW, and dangerous so use
   selectively. Does not always work.

 extraRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   Options:
    -f, --fix    fix meta by removing all extra regions found.
   Reports regions present on hbase:meta, but with no related
   directories on the file system. Needs hbase:meta to be online.
   For each table name passed as parameter, performs diff
   between regions available in hbase:meta and region dirs on the given
   file system. Extra regions would get deleted from Meta
   if passed the --fix option.
   NOTE: Before deciding on use the "--fix" option, it's worth check if
   reported extra regions are overlapping with existing valid regions.
   If so, then "extraRegionsInMeta --fix" is indeed the optimal solution.
   Otherwise, "assigns" command is the simpler solution, as it recreates
   regions dirs in the filesystem, if not existing.
   An example triggering extra regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 extraRegionsInMeta default:table_1 default:table_2
   An example triggering extra regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 extraRegionsInMeta default:table_1 ns1
   Returns list of extra regions for each table passed as parameter, or
   for each table on namespaces specified as parameter.

 filesystem [OPTIONS] [<TABLENAME>...]
   Options:
    -f, --fix    sideline corrupt hfiles, bad links, and references.
   Report on corrupt hfiles, references, broken links, and integrity.
   Pass '--fix' to sideline corrupt files and links. '--fix' does NOT
   fix integrity issues; i.e. 'holes' or 'orphan' regions. Pass one or
   more tablenames to narrow checkup. Default checks all tables and
   restores 'hbase.version' if missing. Interacts with the filesystem
   only! Modified regions need to be reopened to pick-up changes.

 fixMeta
   Do a server-side fix of bad or inconsistent state in hbase:meta.
   Available in hbase 2.2.1/2.1.6 or newer versions. Master UI has
   matching, new 'HBCK Report' tab that dumps reports generated by
   most recent run of _catalogjanitor_ and a new 'HBCK Chore'. It
   is critical that hbase:meta first be made healthy before making
   any other repairs. Fixes 'holes', 'overlaps', etc., creating
   (empty) region directories in HDFS to match regions added to
   hbase:meta. Command is NOT the same as the old _hbck1_ command
   named similarily. Works against the reports generated by the last
   catalog_janitor and hbck chore runs. If nothing to fix, run is a
   noop. Otherwise, if 'HBCK Report' UI reports problems, a run of
   fixMeta will clear up hbase:meta issues. See 'HBase HBCK' UI
   for how to generate new report.
   SEE ALSO: reportMissingRegionsInMeta

 generateMissingTableDescriptorFile <TABLENAME>
   Trying to fix an orphan table by generating a missing table descriptor
   file. This command will have no effect if the table folder is missing
   or if the .tableinfo is present (we don't override existing table
   descriptors). This command will first check it the TableDescriptor is
   cached in HBase Master in which case it will recover the .tableinfo
   accordingly. If TableDescriptor is not cached in master then it will
   create a default .tableinfo file with the following items:
     - the table name
     - the column family list determined based on the file system
     - the default properties for both TableDescriptor and
       ColumnFamilyDescriptors
   If the .tableinfo file was generated using default parameters then
   make sure you check the table / column family properties later (and
   change them if needed).
   This method does not change anything in HBase, only writes the new
   .tableinfo file to the file system. Orphan tables can cause e.g.
   ServerCrashProcedures to stuck, you might need to fix these still
   after you generated the missing table info files.

 replication [OPTIONS] [<TABLENAME>...]
   Options:
    -f, --fix    fix any replication issues found.
   Looks for undeleted replication queues and deletes them if passed the
   '--fix' option. Pass a table name to check for replication barrier and
   purge if '--fix'.

 reportMissingRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run _hbck1_
   'OfflineMetaRepair' against an hbase-2.x cluster. This is a CHECK only
   method, designed for reporting purposes and doesn't perform any
   fixes, providing a view of which regions (if any) would get re-added
   to hbase:meta, grouped by respective table/namespace. To effectively
   re-add regions in meta, run addFsRegionsMissingInMeta.
   This command needs hbase:meta to be online. For each namespace/table
   passed as parameter, it performs a diff between regions available in
   hbase:meta against existing regions dirs on HDFS. Region dirs with no
   matches are printed grouped under its related table name. Tables with
   no missing regions will show a 'no missing regions' message. If no
   namespace or table is specified, it will verify all existing regions.
   It accepts a combination of multiple namespace and tables. Table names
   should include the namespace portion, even for tables in the default
   namespace, otherwise it will assume as a namespace value.
   An example triggering missing regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 reportMissingRegionsInMeta default:table_1 default:table_2
   An example triggering missing regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 reportMissingRegionsInMeta default:table_1 ns1
   Returns list of missing regions for each table passed as parameter, or
   for each table on namespaces specified as parameter.

 setRegionState <ENCODED_REGIONNAME> <STATE>
   Possible region states:
    OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT,
    FAILED_OPEN, FAILED_CLOSE, MERGING, MERGED, SPLITTING_NEW,
    MERGING_NEW, ABNORMALLY_CLOSED
   WARNING: This is a very risky option intended for use as last resort.
   Example scenarios include unassigns/assigns that can't move forward
   because region is in an inconsistent state in 'hbase:meta'. For
   example, the 'unassigns' command can only proceed if passed a region
   in one of the following states: SPLITTING|SPLIT|MERGING|OPEN|CLOSING
   Before manually setting a region state with this command, please
   certify that this region is not being handled by a running procedure,
   such as 'assign' or 'split'. You can get a view of running procedures
   in the hbase shell using the 'list_procedures' command. An example
   setting region 'de00010733901a05f5a2a3a382e27dd4' to CLOSING:
     $ HBCK2 setRegionState de00010733901a05f5a2a3a382e27dd4 CLOSING
   Returns "0" if region state changed and "1" otherwise.

 setTableState <TABLENAME> <STATE>
   Possible table states: ENABLED, DISABLED, DISABLING, ENABLING
   To read current table state, in the hbase shell run:
     hbase> get 'hbase:meta', '<TABLENAME>', 'table:state'
   A value of \x08\x00 == ENABLED, \x08\x01 == DISABLED, etc.
   Can also run a 'describe "<TABLENAME>"' at the shell prompt.
   An example making table name 'user' ENABLED:
     $ HBCK2 setTableState users ENABLED
   Returns whatever the previous table state was.

 scheduleRecoveries <SERVERNAME>...
   Schedule ServerCrashProcedure(SCP) for list of RegionServers. Format
   server name as '<HOSTNAME>,<PORT>,<STARTCODE>' (See HBase UI/logs).
   Example using RegionServer 'a.example.org,29100,1540348649479':
     $ HBCK2 scheduleRecoveries a.example.org,29100,1540348649479
   Returns the pid(s) of the created ServerCrashProcedure(s) or -1 if
   no procedure created (see master logs for why not).
   Command support added in hbase versions 2.0.3, 2.1.2, 2.2.0 or newer.

 unassigns <ENCODED_REGIONNAME>...
   Options:
    -o,--override  override ownership by another procedure
   A 'raw' unassign that can be used even during Master initialization
   (if the -skip flag is specified). Skirts Coprocessors. Pass one or
   more encoded region names. 1588230740 is the hard-coded name for the
   hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example
   of what a userspace encoded region name looks like. For example:
     $ HBCK2 unassign 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created UnassignProcedure(s) or -1 if none.

   SEE ALSO, org.apache.hbase.hbck1.OfflineMetaRepair, the offline
   hbase:meta tool. See the HBCK2 README for how to use.
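
The usage text above writes its examples as "$ HBCK2 ...". One way to get that shorthand is a small shell wrapper around the bin/hbase invocation shown earlier. The following is only a sketch: the function names hbck2 and hbck2_cmd and the variables HBCK2_JAR and HBCK2_CONF are hypothetical and not part of HBCK2; HBASE_HOME and the /etc/hbase-conf default come from the example command in this guide.

```shell
# Hypothetical helper names; adjust paths to your deployment.
# HBASE_HOME : root of the hbase install (as used earlier in this guide)
# HBCK2_JAR  : path to the built hbase-hbck2 jar (assumed variable name)
# HBCK2_CONF : deployed configuration directory (assumed variable name)

# Print the full command line without running it (a dry run).
hbck2_cmd() {
  printf '%s' "${HBASE_HOME}/bin/hbase --config ${HBCK2_CONF:-/etc/hbase-conf} hbck -j ${HBCK2_JAR}"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Run HBCK2 through bin/hbase with the given options and command.
hbck2() {
  "${HBASE_HOME}/bin/hbase" --config "${HBCK2_CONF:-/etc/hbase-conf}" hbck -j "${HBCK2_JAR}" "$@"
}
```

With HBASE_HOME and HBCK2_JAR exported, "hbck2_cmd assigns 1588230740" prints the command that "hbck2 assigns 1588230740" would run.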

Note that when you pass the hbck argument to bin/hbase, it accesses the target hbase cluster with the default client setup. This is sufficient for most HBCK2 uses. If you encounter an exception like the following:

bin/hbase --config hbase-conf  hbck
2019-08-30 05:04:54,467 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
        at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:361)
        at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3605)

... it is because the HDFS jars are not on the CLASSPATH. By default, the HDFS jars are not bundled on the CLASSPATH when hbck is run through bin/hbase. Define HADOOP_HOME in the environment so that bin/hbase can find your local hadoop install, from which it will then load the HDFS jars.
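
A quick pre-flight check can catch a missing HADOOP_HOME before hbck is launched. This is a minimal sketch, not something shipped with HBase; check_hadoop_home is a hypothetical helper name and the path in the comment is a placeholder.

```shell
# Returns 0 when HADOOP_HOME is set and points at an existing directory,
# 1 otherwise. check_hadoop_home is a hypothetical helper name.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set; bin/hbase may not find the HDFS jars" >&2
    return 1
  fi
  if [ ! -d "${HADOOP_HOME}" ]; then
    echo "HADOOP_HOME=${HADOOP_HOME} is not a directory" >&2
    return 1
  fi
  return 0
}

# Example: point at a local hadoop install (placeholder path), then check.
# export HADOOP_HOME=/opt/hadoop
# check_hadoop_home && echo "HADOOP_HOME looks usable"
```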

Introduction to HBCK2

HBCK2 is currently a simple tool that does only one thing at a time.

In hbase-2.x, the Master is the final arbiter of all state, so a general principle for most HBCK2 commands is that they ask the Master to make the repairs. This means the Master must be up before you can run those HBCK2 commands.

HBCK2 is built on the HbckService hosted by the Master. The service publishes a small set of methods for the HBCK2 tool to call. For any HBCK2 command that depends on the Master's HbckService interface, the first thing HBCK2 does is poke the cluster to check that the service is available. This check fails if the remote server does not publish the service, or if the HbckService lacks the requested method. In the latter case, upgrade your cluster if you can to gain more fix-up facility.

Finding problems

While hbck1 performed analysis and reported your cluster GOOD or BAD, HBCK2 is less presumptuous. In hbase-2.x, the operator figures out what needs fixing and then uses tooling, including HBCK2, to make the fixes. The operator may have to go several rounds, alternating between running HBCK2 and checking cluster state.

To solve cluster problems, use the following utilities and methods.

Diagnostic tools

Master Logs

The Master runs all assignments, server crash handling, cluster start and stop, and so on. In hbase-2.x, everything the Master does has been cast as a Procedure run on a state machine engine. See Procedure Framework and Assignment Manager for details on how this infrastructure works. Each Procedure has a unique Procedure ID, its pid, which is recorded on every log line it emits. Following a pid through the Master log traces the Procedure's lifecycle, from its start through each of its steps to completion. Some Procedures spawn sub-procedures, wait on their children, and then finish themselves. Each child logs its own pid plus its ppid, the pid of its parent Procedure.

Generally all runs without problem, but if some unforeseen circumstance arises, the assignment framework may sustain damage that requires operator intervention. Some such scenarios are discussed below. They show up in the Master log as a Region reported STUCK, or as a Procedure transitioning an entity (a Region or a Table) that is blocked because another Procedure holds the exclusive lock and will not let go.

A STUCK Procedure looks like this:

2018-09-12 15:29:06,558 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=va1001.example.org,22101,1536173230599, table=IntegrationTestBigLinkedList_20180626110336, region=dbdb56242f17610c46ea044f7a42895b
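
To pull the affected regions out of a Master log, a filter along the following lines may help. It is a sketch that assumes STUCK lines keep the shape of the sample above and that encoded region names are 32 lowercase hex characters, as in this guide's examples; stuck_regions is a hypothetical helper name.

```shell
# Print the encoded region name of every STUCK Region-In-Transition
# found in a Master log read from stdin, one per line, de-duplicated.
# Assumes the "region=<32 hex chars>" field seen in the sample line.
stuck_regions() {
  grep 'STUCK Region-In-Transition' \
    | sed -n 's/.*region=\([0-9a-f]\{32\}\).*/\1/p' \
    | sort -u
}
```

Feed it the Master log on stdin, for example "stuck_regions < master.log", where master.log stands for wherever your deployment writes the Master log.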

Master UI: /master-status#tables

The table in the middle of the Master UI home page lists all tables, with columns showing whether each table is ENABLED, ENABLING, DISABLING, or DISABLED, among other attributes. It also lists columns with counts of regions in the various transition states: OPEN, CLOSED, and so on. Reading this table helps you tell whether a table's regions are in a proper disposition. For example, if a table is ENABLED yet has no regions in the OPEN state, and the Master log is silent about any ongoing assigns, then something is amiss.

Master UI: 'Procedures & Locks'

This page is reached from the 'Procedures & Locks' menu item in the header of the Master UI home page. It lists all ongoing Procedures and Locks, as well as the current set of Master Procedure WALs (named pv2-0000000000000000####.log, under the MasterProcWALs directory in your hbase install). On a large cluster at startup, when furious assignment is under way, this page fills with lists of Procedures and Locks, and the count of MasterProcWALs will bloat too. If, after the cluster settles, a stuck Lock or Procedure remains, or the count of WALs only ever grows, operator intervention is needed to clear the blockage.

The list of locks and procedures can also be obtained through hbase shell:

$ echo "list_locks"| hbase shell &> /tmp/locks.txt
$ echo "list_procedures"| hbase shell &> /tmp/procedures.txt

Master UI: The 'HBCK Report'

An 'HBCK Report' page (/hbck.jsp) was added to the Master in hbase 2.3.0/2.1.6/2.2.1. It shows the output of two inspections that the Master runs on an interval. One is output by the CatalogJanitor whenever it runs: if there are overlaps or holes in hbase:meta, the CatalogJanitor half of the page lists what it found (otherwise it stays quiet). The other is a background 'HBCK chore' that compares hbase:meta against filesystem content and records any anomalies in its section of the HBCK Report.

For how to force these inspections to run, see the 'HBCK Report' page itself.

The HBase Canary Tool

The Canary tool is useful for verifying the state of assignment. It can be run against a single table or against the whole cluster.

For example, to check cluster assigns:

$ hbase canary -f false -t 6000000 &>/tmp/canary.log

-f false tells Canary to keep going across failed region fetches, and -t 6000000 sets the timeout to 6,000,000 milliseconds (100 minutes, so a little under two hours at most). When it finishes, check /tmp/canary.log and look for lines containing ERROR to find problematic region assignments.

You can probe as Canary does from the hbase shell. For example, given a Region in table testtable whose start row is d1dddd0c, do as follows:

hbase> scan 'testtable', {STARTROW => 'd1dddd0c', LIMIT => 10}

For an overview of parsing a Region name into its components, see the RegionInfo API.

Other tools

To get a list of regions that are not open on an ENABLED or ENABLING table, read the info:state column of the hbase:meta table. For example, to find the state of all regions in the table IntegrationTestBigLinkedList_20180626064758, do the following:

$ echo " scan 'hbase:meta', {ROWPREFIXFILTER => 'IntegrationTestBigLinkedList_20180626064758,', COLUMN => 'info:state'}"| hbase shell > /tmp/t.txt

... then grep the output for regions in the OPENING or CLOSING state.
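
The grep step might be sketched as below. It assumes each row in the saved scan output ends with "value=<STATE>", which is how the hbase shell usually prints a cell; verify against your own /tmp/t.txt before relying on it. regions_in_transition is a hypothetical helper name.

```shell
# Keep only scan rows whose info:state value is OPENING or CLOSING.
# Anchoring at end of line keeps OPEN and CLOSED rows out of the result.
# Assumes each row ends with "value=<STATE>" (check your shell's output).
regions_in_transition() {
  grep -E 'value=(OPENING|CLOSING)[[:space:]]*$'
}
```

For example, "regions_in_transition < /tmp/t.txt" filters the file written by the scan above.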

To move an OPENING region to OPEN so that it agrees with the table's ENABLED state, queue a new assign Procedure with the assign command in the hbase shell (watch the Master log to see the assign run). To assign many regions, use the HBCK2 tool; it can do bulk assigns.

Problem repair

General principles

When making repairs, make sure hbase:meta is consistent before you fix any other class of problem, such as filesystem deviance. Filesystem deviance and assignment problems should be addressed after hbase:meta has been put in order. If hbase:meta is off, the Master cannot do proper placements when adopting orphan filesystem data or making region assignments.

Another general principle to keep in mind: a Region cannot be assigned while it is in the CLOSING state (or, inversely, unassigned while it is in the OPENING state) without first passing through CLOSED. Regions must always move from CLOSED to OPENING, to OPEN, and then to CLOSING and CLOSED.
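
The cycle in the paragraph above can be written down as a tiny lookup, useful as a reminder when deciding which state a region should reach next. This is a deliberately simplified sketch covering only the four states named above; real Regions have more states (see the setRegionState list in the usage output), and next_region_state is a hypothetical helper name, not an HBase command.

```shell
# Simplified sketch of the CLOSED -> OPENING -> OPEN -> CLOSING -> CLOSED
# cycle described above. Other region states are reported as unknown.
next_region_state() {
  case "$1" in
    CLOSED)  echo OPENING ;;
    OPENING) echo OPEN ;;
    OPEN)    echo CLOSING ;;
    CLOSING) echo CLOSED ;;
    *)       echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```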

When repairing, repair one table at a time.

Also, if a table is DISABLED, you cannot assign its regions. In the Master log you will see the Master report that the assign was skipped because the table is DISABLED. You may still want to assign a region because it is currently stuck in the OPENING state and you want it in the CLOSED state, so that it agrees with the table's DISABLED state. In that situation you may have to temporarily set the table state to ENABLED, just so you can do the assign, and then set it back again after the unassign. HBCK2 has a facility for this (setTableState); see the HBCK2 usage output.

What follows is a mix of notes and prescriptions drawn from experience running hbase-2.x so far. The underlying problems that caused the states described below have been fixed in later HBase versions, so upgrade if you can to avoid these scenarios.

Assign / unassign

Generally, on assign, the Master will persist until it succeeds. An assign takes an exclusive lock on the Region, which precludes a concurrent assign or unassign from running. An assign against a locked Region waits until the lock is released. See the 'Procedures & Locks' section above for the current list of outstanding Locks.

Master startup cannot progress, in holding-pattern until region onlined

This should never happen. If it does, it looks like this:

2018-10-01 22:07:42,792 WARN org.apache.hadoop.hbase.master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=CLOSING, ts=1538456302300, server=ve1017.example.org,22101,1538449648131}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.

The Master cannot continue startup because there is no Procedure to assign hbase:meta (or hbase:namespace). To inject one, use the HBCK2 tool:

HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase org.apache.hbase.HBCK2 assigns -skip 1588230740

... where 1588230740 is the encoded name of the hbase:meta region. Pass the '-skip' option to stop HBCK2 from checking the version of the remote Master. If the remote Master is not up, the version check fails with a 'Master is initializing' response or a PleaseHoldException, and the assign attempt is dropped. The '-skip' option bypasses the version check so the scheduled assign can land.

The same can happen with the hbase:namespace system table. Look up the encoded region name of the hbase:namespace region and do as was done above for hbase:meta. In this latter case, the Master actually prints a helpful message like the following:

2019-07-09 22:08:38,966 WARN  [master/localhost:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1562733904278.9559cf72b8e81e1291c626a8e781a6ae. is NOT online; state={9559cf72b8e81e1291c626a8e781a6ae state=CLOSED, ts=1562735318897, server=null}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.

To schedule an assign for the hbase:namespace region noted in the log line above, do:

 $ ${HBASE_HOME}/bin/hbase --config /etc/hbase-conf hbck -j ./hbase-hbck2-1.0.0-SNAPSHOT.jar -skip assigns 9559cf72b8e81e1291c626a8e781a6ae

... passing the encoded name of the namespace region (the encoded name differs per deployment).
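
Rather than copying the encoded name by hand, it can be lifted from the "is NOT online" log line. The filter below is a sketch that assumes the "state={<encoded name> state=..." layout of the two sample log lines in this section; not_online_regions is a hypothetical helper name.

```shell
# Print the encoded region name from each "is NOT online" line of a
# Master log read from stdin, one per line, de-duplicated.
# Assumes the "state={<encoded name> state=..." field seen in the samples.
not_online_regions() {
  grep 'is NOT online' \
    | sed -n 's/.*state={\([0-9a-f]*\) state=.*/\1/p' \
    | sort -u
}
```

Run against the Master log, it prints names such as 9559cf72b8e81e1291c626a8e781a6ae (or 1588230740 for hbase:meta), ready to pass to the assigns command.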

Missing Regions in hbase:meta region/table restore/rebuild

There have been some unusual cases where table regions were removed from the hbase:meta table. Triage of such cases showed they were operator-induced: users had run the obsolete hbck1 OfflineMetaRepair tool against an hbase-2.x cluster. OfflineMetaRepair is a well-known tool for fixing hbase:meta problems on HBase 1.x. The original version is not compatible with HBase 2.x or later; it has since been adapted so that, in the extreme, it can be run via HBCK2.

In most of these cases, regions end up missing from hbase:meta at random, but hbase may still be operational. In such situations, the problem can be addressed with the Master online, using the addFsRegionsMissingInMeta command in HBCK2. This command is less disruptive to hbase than the full hbase:meta rebuild covered later, and it can be used even to recover the namespace table region.

Extra Regions in hbase:meta region/table restore/rebuild

There can also be situations where table regions have been removed from the filesystem but still have related entries in the hbase:meta table. This may happen due to problems with splitting, manual operation mistakes (such as deleting or moving region directories by hand), or even meta info data loss issues such as HBASE-21843.

Such a problem can be addressed with the Master online, using the extraRegionsInMeta --fix command in HBCK2. This command is less disruptive to hbase than the full hbase:meta rebuild covered later. It is also useful when this happens on versions that do not support the fixMeta HBCK2 option (any versions before "2.0.6", "2.1.6", "2.2.1", "2.3.0", "3.0.0").

Online hbase:meta reconstruction method

If hbase:meta corruption is not too severe, hbase may still be able to bring it online. Even if the namespace region is among the missing regions, hbase:meta can still be scanned while the Master initializes and waits for the namespace region to be assigned. To verify, run a scan of hbase:meta as shown below. If it does not time out or show any errors, hbase:meta is online:

echo "scan 'hbase:meta', {COLUMN=>'info:regioninfo'}" | hbase shell

If the scan above shows no errors, you can use HBCK2 addFsRegionsMissingInMeta. It reads the region metadata available in the region directories on the filesystem in order to recreate the regions in hbase:meta. Since it can run while hbase is only partially operational, it attempts to disable online tables affected by the reported problem and then re-adds their regions to hbase:meta. It can check specific tables or namespaces, or all tables from all namespaces. The example below adds missing regions for table 'tbl_1' in the default namespace, 'tbl_2' in namespace 'n1', and all tables in namespace 'n2':

$ HBCK2 addFsRegionsMissingInMeta default:tbl_1 n1:tbl_2 n2

Because it operates independently of the Master, once it completes successfully, additional steps are needed to actually get the re-added regions assigned:

1. addFsRegionsMissingInMeta outputs an assigns command listing all the regions it re-added. This command needs to be run later, so copy and save it.
2. For HBase versions before 2.3.0, after addFsRegionsMissingInMeta finishes successfully and you have saved its output, restart all running HBase Masters.
3. Once the Master has restarted and hbase:meta is online (check that the Web UI is accessible), run the assigns command saved in step 1.

Note: if the namespace region is among the missing regions, add the -skip flag at the beginning of the saved assigns command.

If the cluster has suffered a catastrophic loss of the hbase:meta table, a rough rebuild is possible using the following recipe. In outline: stop the cluster; run the HBCK2 OfflineMetaRepair tool, which reads the directories and metadata laid down in the filesystem and makes a best effort at reconstructing a viable hbase:meta table; restart the cluster; inject an assign to bring the system namespace table online; and finally reassign the user-space tables you want enabled (the rebuilt hbase:meta leaves every table offline with no regions assigned).

Detailed rebuild recipe

Stop the cluster.

Run the rebuild hbase:meta command from HBCK2. This replaces the original hbase:meta with a newly rebuilt one. Below is an example of running the tool. It adds the -details flag so the tool dumps information about the regions it finds in HDFS:

$ HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar ./bin/hbase org.apache.hbase.hbck1.OfflineMetaRepair -details

Start the cluster. It will not come up fully. It will be stuck because the namespace table is not online and the procedure store contains no assign procedure for this contingency. The HBase Master log will show this state. Here is an example of what it logs:

2019-07-10 18:30:51,090 WARN  [master/localhost:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1562808216225.725a0fe6c2c869d3d0a9ed82bfa80fa3. is NOT online; state={725a0fe6c2c869d3d0a9ed82bfa80fa3 state=CLOSED, ts=1562808619952, server=null}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.

You cannot use the shell to assign the namespace table region. The attempt fails with a PleaseHoldException because the Master is not yet up (it waits for the namespace table to come online before declaring itself "up"). Use the HBCK2 assigns command instead. To assign, you need the encoded name of the namespace region. It appears in the log quoted above: 725a0fe6c2c869d3d0a9ed82bfa80fa3 in this case. You must also pass the -skip option to skip the Master version check (without it, your HBCK2 invocation will also raise the PleaseHoldException mentioned above, because the Master is not yet up). Here is an example of assigning the namespace table region:

$ HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar ./bin/hbase org.apache.hbase.HBCK2 -skip assigns 725a0fe6c2c869d3d0a9ed82bfa80fa3
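Copying the encoded name out of the log by hand is error-prone, so the lookup can be scripted. The following is a minimal sketch that assumes the Master log line has exactly the shape of the WARN example quoted above (region name ending in a 32-character hexadecimal encoded name, followed by "is NOT online"). The function name `namespace_encoded_name` is made up for this illustration, and the log file location depends on your installation.

```shell
# Hypothetical helper: print the encoded name of the hbase:namespace region
# taken from the most recent "is NOT online" line.
# Reads the files given as arguments, or standard input when none are given.
namespace_encoded_name() {
  sed -n 's/.*hbase:namespace,[^ ]*\.\([0-9a-f]\{32\}\)\. is NOT online.*/\1/p' "$@" | tail -n 1
}
```

It could be used as `namespace_encoded_name /path/to/master.log`, where the path is a placeholder for your Master log file. The printed value is what gets passed to the assigns command.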

If the call fails with "connection refused", is the Master up? A Master that fails to initialize shuts itself down after a while. Just restart the cluster or the Master and rerun the assigns command above.

When the assigns command runs successfully, you will see output like the following. The "48" at the end is the pid of the scheduled assign procedure. If the returned pid is "-1", the Master startup has not progressed far enough yet; retry. Alternatively, the encoded region name may be incorrect; check it.

$  HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar ./bin/hbase org.apache.hbase.HBCK2 -skip assigns 725a0fe6c2c869d3d0a9ed82bfa80fa3
18:40:43.817 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:40:44.315 [main] INFO  org.apache.hbase.HBCK2 - hbck support check skipped
[48]
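When the assigns call is scripted, the pid check described above can be automated. The sketch below assumes the last line of output is a bracketed list such as `[48]`, as in the example; that several pids would appear comma-separated inside the same brackets is an assumption, not something shown in this guide. The function name `assign_pids_ok` is made up here.

```shell
# Hypothetical helper: read HBCK2 assigns output on standard input and
# inspect the pid list on its last line.
# Returns 0 if every pid is a non-negative number,
#         1 if any pid is -1 (Master not far enough along, or bad region name),
#         2 if the last line does not look like a pid list.
assign_pids_ok() {
  last_line=$(tail -n 1)
  # Strip brackets and spaces, then put one pid per line.
  pids=$(printf '%s\n' "$last_line" | sed 's/[][ ]//g' | tr ',' '\n')
  if [ -z "$pids" ]; then
    return 2
  fi
  for pid in $pids; do
    case "$pid" in
      -1) return 1 ;;
      *[!0-9]*) return 2 ;;
    esac
  done
  return 0
}
```

A return status of 1 corresponds to the retry case above: wait for the Master to progress, recheck the encoded region name, and run assigns again.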

Check the Master log. The Master should have come up. You will see that pid=48 completed successfully. Look for a line like the following to verify that Master startup succeeded:

master.HMaster: Master has completed initialization 132.515sec

It may take some time to appear.
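Since the line can take a while to show up, a small polling loop saves repeated manual checks. This is a sketch only: the function names are invented for this illustration, the log path is whatever your installation uses, and the retry count and delay defaults are arbitrary.

```shell
# Hypothetical helper: succeed if the log contains the initialization message.
master_initialized() {
  grep -q 'Master has completed initialization' "$1"
}

# Hypothetical helper: poll the log until the message appears or tries run out.
# Usage: wait_for_master LOG_FILE [TRIES] [DELAY_SECONDS]
wait_for_master() {
  log_file="$1"
  tries="${2:-60}"
  delay="${3:-5}"
  while [ "$tries" -gt 0 ]; do
    if master_initialized "$log_file"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_master /path/to/master.log 120 5` polls for up to ten minutes and returns a non-zero status if the message never appears.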

The rebuild of hbase:meta adds the user tables in DISABLED state and their regions in CLOSED state. Re-enable the tables through the shell to bring all table regions back online. Do it one table at a time, or see the enable_all ".*" command to enable all tables in one go.
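Enabling tables one at a time can be scripted by generating HBase shell `enable` statements and feeding them to a non-interactive shell. The sketch below assumes `./bin/hbase` as in the other examples and uses the HBase shell's `-n` (non-interactive) option; the function names are made up for this illustration.

```shell
# Assumed location of the hbase launcher; override for your deployment.
HBASE_BIN="${HBASE_BIN:-./bin/hbase}"

# Hypothetical helper: print one HBase shell "enable" statement per table.
enable_statements() {
  for table in "$@"; do
    printf "enable '%s'\n" "$table"
  done
}

# Hypothetical helper: run the statements in a non-interactive HBase shell.
enable_tables() {
  enable_statements "$@" | "$HBASE_BIN" shell -n
}
```

For example, `enable_tables tbl_1 n1:tbl_2` enables those two tables. Generating explicit statements keeps a record of exactly which tables were re-enabled, which is useful when some tables are meant to stay disabled.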

The rebuilt hbase:meta will likely be missing edits and may need subsequent repair and cleanup using the tools described earlier in this guide.

Dropped reference files, missing hbase.version file, and corrupt HFiles

HBCK2 can check for dangling references and corrupt HFiles. You can ask it to sideline bad files, which may be needed to get past a hump where regions will not come online or reads are failing. See the filesystem command in the HBCK2 command listing. Pass one or more table names (or none to check all tables). It will report bad files. Pass the --fix option to carry out the repairs.
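A cautious way to use the filesystem command is to run the report first and pass --fix only after reviewing it. The sketch below wraps both calls; the jar path and `./bin/hbase` location are taken from the examples in this guide and are assumptions about your deployment, and the function names are made up.

```shell
# Assumed locations; override HBCK2_JAR and HBASE_BIN for your deployment.
HBCK2_JAR="${HBCK2_JAR:-$HOME/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar}"
HBASE_BIN="${HBASE_BIN:-./bin/hbase}"

# Hypothetical helper: report bad files only. With no table names, all tables are checked.
filesystem_report() {
  HBASE_CLASSPATH_PREFIX="$HBCK2_JAR" "$HBASE_BIN" org.apache.hbase.HBCK2 filesystem "$@"
}

# Hypothetical helper: same check, but ask HBCK2 to repair what it finds.
filesystem_fix() {
  HBASE_CLASSPATH_PREFIX="$HBCK2_JAR" "$HBASE_BIN" org.apache.hbase.HBCK2 filesystem --fix "$@"
}
```

For example, run `filesystem_report default:tbl_1`, inspect the reported files, and then run `filesystem_fix default:tbl_1` if the findings look right.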

Procedure start-over

In extreme cases, as a last resort, the Master state can be wiped clean. This applies when the Master is distraught and every repair attempt only turns up locks that cannot be undone or procedures that will not finish, and/or the set of MasterProcWALs keeps growing without bound. Just move aside the /hbase/MasterProcWALs/ directory under your HBase installation and restart the Master process. It will come back as a blank slate.

If all regions were happily assigned or offline at the time of the wipe, the Master should pick up and continue after the restart as though nothing had happened. However, if there were regions in transition at that time, the operator has to intervene to bring the outstanding assigns and unassigns to their terminal state. Read the hbase:meta info:state column as described above to see what needs to be assigned or unassigned. Because moving aside the MasterProcWALs erases all procedure history, no entity should be locked, so you are free to assign and unassign in bulk.
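Moving the MasterProcWALs directory aside, as described above, can be done with the HDFS command line. The following is a sketch under assumptions: the HBase root directory is taken to be `/hbase` on HDFS (check `hbase.rootdir` in your configuration), the `hdfs` client is on the PATH, and the function name and the `.sidelined-` suffix are made up for this illustration. Renaming rather than deleting keeps the old procedure WALs available for later inspection.

```shell
# Assumed HDFS client and HBase root directory; override for your deployment.
HDFS_BIN="${HDFS_BIN:-hdfs}"
HBASE_ROOT="${HBASE_ROOT:-/hbase}"

# Hypothetical helper: rename MasterProcWALs so the Master starts with an
# empty procedure store. Restart the Master process afterwards.
# Usage: sideline_master_proc_wals [SUFFIX]   (SUFFIX defaults to a timestamp)
sideline_master_proc_wals() {
  suffix="${1:-$(date +%Y%m%d%H%M%S)}"
  "$HDFS_BIN" dfs -mv "$HBASE_ROOT/MasterProcWALs" "$HBASE_ROOT/MasterProcWALs.sidelined-$suffix"
}
```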

Adopting orphan data

For information on how to fix the orphan regions reported by the "HBCK Chore", see "Adopting Stray Data" in the advanced usage section of the complete bulk load tool in the reference guide.

This is an original article by "xiaozhch5", blogger at "From Big Data to Artificial Intelligence", published under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this statement.

Original link: https://lrting.top/backend/3566/

Added by srirangam007 on Fri, 28 Jan 2022 08:46:30 +0200
