HBase basic notes

HBase is a database built on top of Hadoop. It originated from Google's BigTable paper; Apache later created an open-source implementation, HBase.

HBase is a NoSQL (non-relational) database. It is suitable for storing unstructured, semi-structured, and sparse data (empty cells take up no space). HBase is column-family-oriented: at the bottom, data is stored by column.

Unlike Hive, HBase supports insert, delete, update, and query even though it sits on HDFS. It can store massive amounts of data with very strong performance, achieving millisecond-level queries over hundreds of millions of rows (a common RDBMS such as MySQL typically hits its performance ceiling around tens of millions of rows).

However, HBase does not match an RDBMS when it comes to transactions: table-level transactions are not supported.

HBase uses HDFS as its file storage system, MapReduce to process its massive data, and Zookeeper as its coordination service. This makes HBase a highly reliable, high-performance, scalable distributed storage system.

HBase has the following basic concepts:

  • Row key (RowKey). The primary key of an HBase table; each row is uniquely identified by its row key. Data can only be queried through the row key or by a full-table scan. Row keys are sorted lexicographically by default.
  • Column family. Part of the table's metadata, holding multiple columns. Column families must be declared when the table is created; they cannot be added afterwards.
  • Column. Unlike column families, the columns inside a column family do not need to be declared in advance and can be added dynamically.
  • Cell and timestamp. A row and a column together determine a cell. Each cell stores multiple versions of a value, distinguished by timestamp: changing a cell's data does not delete the previous value, it just adds a new version. Cell data is stored as raw bytes, with no notion of data types, so values must be converted manually when read back (see the sketch after this list).
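
As one illustration of that conversion, HBase ships a Bytes utility class (org.apache.hadoop.hbase.util.Bytes) for moving between Java types and byte arrays; a minimal sketch with made-up values:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesDemo {

    public static void main(String[] args) {
        // Everything stored in a cell is raw bytes; Bytes converts
        // between Java types and byte[] in both directions.
        byte[] ageBytes = Bytes.toBytes(42L);        // long -> byte[]
        long age = Bytes.toLong(ageBytes);           // byte[] -> long

        byte[] nameBytes = Bytes.toBytes("lazycat"); // String -> byte[]
        String name = Bytes.toString(nameBytes);     // byte[] -> String

        System.out.println(age + ", " + name);       // 42, lazycat
    }

}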

Installing HBase

HBase requires a JDK and Hadoop. There are three common HBase release lines; the table below shows which Hadoop versions each of them supports:

Hadoop version   HBase-0.92.x    HBase-0.94.x    HBase-0.96
0.20.205         Supported       Not supported   Not supported
0.22.x           Supported       Not supported   Not supported
0.23.x           Not supported   Supported       Not tested
1.0.x            Supported       Supported       Supported
1.1.x            Not tested      Supported       Supported
2.x              Not supported   Supported       Supported

The demonstrations below use Hadoop 2.7 and HBase 0.98.

HBase can be installed in stand-alone, pseudo-distributed, or fully distributed mode. Stand-alone mode stores data on the local filesystem; pseudo-distributed mode stores data in HDFS but still runs on a single machine; fully distributed mode runs HBase across multiple machines as a cluster.

Stand-alone mode

First, unzip HBase:

[root@hadoop1 mysql]# tar xzf hbase-0.98.17-hadoop2-bin.tar.gz -C /usr/develop/
[root@hadoop1 mysql]# cd /usr/develop/
[root@hadoop1 develop]# mv hbase-0.98.17-hadoop2/ hbase0.98

Then edit HBase's core configuration file, conf/hbase-site.xml. In stand-alone mode the default configuration is mostly fine, but one simple property needs to be added:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///usr/develop/hbase0.98/tmp</value>
    </property>
</configuration>

This property specifies where HBase stores its data; without it, data goes to /tmp, where the system will periodically delete it.

Then start the HBase daemon:

[root@hadoop1 hbase0.98]# bin/start-hbase.sh 
starting master, logging to /usr/develop/hbase0.98/bin/../logs/hbase-root-master-hadoop1.out
[root@hadoop1 hbase0.98]# jps
6000 Jps
3552 NodeManager
5908 HMaster
3445 ResourceManager
3256 SecondaryNameNode
2953 NameNode
3053 DataNode

Seeing the HMaster process means the startup succeeded. Enter http://ip:60010 in a browser to manage HBase through the web interface.

An HBase set up this way keeps its data on the local filesystem rather than HDFS, so it is generally used only for testing and development.

Pseudo-distributed installation

Pseudo-distributed HBase is closer to production: it stores its data on HDFS.

To enable the pseudo-distributed mode, modify the hbase-site.xml configuration file:

<configuration>

    <!-- Store data in HDFS -->
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop1:9000/hbase</value>
    </property>

    <!-- Keep one replica per block; in fully distributed mode this should match the HDFS cluster's replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

</configuration>

If HDFS is already running on the machine, HBase can be started just as in stand-alone mode; if not, start HDFS first.
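
For example, assuming Hadoop is installed under /usr/develop/hadoop (the path is an assumption), something like:

[root@hadoop1 ~]# /usr/develop/hadoop/sbin/start-dfs.sh
[root@hadoop1 ~]# /usr/develop/hbase0.98/bin/start-hbase.sh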

Fully distributed installation

Fully distributed mode requires three machines to form an HBase cluster. You can clone three virtual machines for this; the cloning process itself is not shown here.

Fully distributed mode must be declared in hbase-site.xml by enabling the distributed flag. The configuration is as follows:

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop1:9000/hbase</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
</property>

Note that an HBase cluster needs Zookeeper as its coordination service, so Zookeeper must be installed on all three machines and listening on port 2181.

We also need to modify conf/hbase-env.sh to disable HBase's automatic start and stop of Zookeeper. If HBASE_MANAGES_ZK is true (the default), Zookeeper starts whenever HBase starts and shuts down when HBase stops; that is clearly inappropriate if the Zookeeper ensemble also manages other clusters.

JAVA_HOME also needs to be declared in this file:

export HBASE_MANAGES_ZK=false
export JAVA_HOME=xxx

Then configure the region servers: edit the conf/regionservers file and list every HBase host in it, one per line. When HBase is started, the RegionServer on every listed node will be started automatically (in the order configured):

[root@hadoop1 hbase0.98]# echo hadoop1 > conf/regionservers 
[root@hadoop1 hbase0.98]# echo hadoop2 >> conf/regionservers 
[root@hadoop1 hbase0.98]# echo hadoop3 >> conf/regionservers 
[root@hadoop1 hbase0.98]# cat conf/regionservers 
hadoop1
hadoop2
hadoop3

Next, copy the HBase directory to the hadoop2 and hadoop3 nodes.
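
For example, with scp (assuming the same /usr/develop layout on every node):

[root@hadoop1 develop]# scp -r /usr/develop/hbase0.98 root@hadoop2:/usr/develop/
[root@hadoop1 develop]# scp -r /usr/develop/hbase0.98 root@hadoop3:/usr/develop/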

Then you can start HBase on one node:

[root@hadoop1 hbase0.98]# bin/start-hbase.sh
hadoop2: starting regionserver, logging to /usr/develop/hbase0.98/bin/../logs/hbase-root-regionserver-hadoop2.out
hadoop1: starting regionserver, logging to /usr/develop/hbase0.98/bin/../logs/hbase-root-regionserver-hadoop1.out
hadoop3: starting regionserver, logging to /usr/develop/hbase0.98/bin/../logs/hbase-root-regionserver-hadoop3.out
[root@hadoop1 hbase0.98]# jps
6161 Jps
5937 HRegionServer
2338 DataNode
4531 NodeManager
5028 QuorumPeerMain
2231 NameNode
4424 ResourceManager
2504 SecondaryNameNode
5595 HMaster

On another machine we can start a backup master, which takes over if the primary master goes down:

[root@hadoop2 hbase0.98]# bin/hbase-daemon.sh start master
starting master, logging to /usr/develop/hbase0.98/bin/../logs/hbase-root-master-hadoop2.out
[root@hadoop2 hbase0.98]# jps
4056 HMaster
3867 HRegionServer
3246 QuorumPeerMain
4094 Jps

This is HBase's high availability at work, and it needs no additional configuration.

After installation, run the hbase script under bin/ with the shell argument to enter HBase's shell interface:

[root@hadoop2 hbase0.98]# cd bin/
[root@hadoop2 bin]# ./hbase shell
2018-04-11 21:23:03,302 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.17-hadoop2, rd5f8300c082a75ce8edbbe08b66f077e7d663a4a, Fri Jan 15 22:46:43 PST 2016

hbase(main):001:0>

HBase commands

The following are common HBase shell commands:

Command                 Description

General commands
status                  Show the current HBase status
version                 Show the HBase version
whoami                  Show the current user

DDL
list                    List all tables
create                  Create a table
describe                Show information about a table

Namespace
alter_namespace         Modify a namespace
list_namespace          List namespaces
create_namespace        Create a namespace
list_namespace_tables   List all tables under a namespace
drop_namespace          Delete a namespace

DML
disable                 Disable a table
enable                  Enable a table
append                  Append data to a cell
count                   Count the rows in a table
delete                  Delete data from a cell
deleteall               Delete all cells in a given row
truncate                Disable, drop, and recreate the table; far faster than deleteall when there is a lot of data
put                     Store a value in a specified cell
scan                    View all data in a table
get                     Fetch the data of a given row, column, or cell
drop                    Delete a table (it must be disabled first)
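
As a quick illustration, the namespace commands from the table combine like this (all names here are made up):

create_namespace 'ns1'
create 'ns1:t1', 'cf1'
list_namespace_tables 'ns1'
drop_namespace 'ns1'    # only succeeds once the namespace is empty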

create command

Use create to create a table. At least one column family must be supplied; a column family can be declared either as a simple name or as a map.

The table name may include a namespace, using the "namespace:tableName" syntax.

Here are some common uses of create:

create 'namespace:table', {NAME => 'f1', VERSIONS => 5}

This creates a table named "table" in the "namespace" namespace (if the namespace is omitted, the table goes into the default namespace), with one column family named "f1". VERSIONS => 5 means up to five versions of each cell are kept: as the data is updated the version count grows from 1, and once it passes 5 the oldest data is actually deleted.

More properties can be specified inside the {}.

You can declare multiple column families:

create 'namespace:table', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

If VERSIONS is not specified, it defaults to 1. The above can be simplified to:

create 'namespace:table', 'f1', 'f2', 'f3'

put command

put inserts data into a cell of an HBase table (this is also how data is modified). You must specify the row and column of the data, and may optionally give a timestamp. For example:

put 'ns:t', 'row', 'column', 'value' [,timestamp]

row is the row key, which is unique within the whole table.

If you do not specify a timestamp, the current time will be used.

Note that column must include the column family, in the form "cf:c".

If the row or column does not exist, it is created automatically.
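
A concrete example (table, family, and values made up): store "tom" in row r1, column name of family cf1, then overwrite it, which really just adds a newer version:

put 'ns:t', 'r1', 'cf1:name', 'tom'
put 'ns:t', 'r1', 'cf1:name', 'tim'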

get command

get retrieves the contents of a cell. The row key is mandatory; the column is optional, and omitting it returns the whole row.

get 'ns:t', 'row' [, 'column']

If no column is passed, all columns are returned.

The column above is given by name; a full map can be passed instead:

get 'ns:t', 'row' [, {COLUMN => 'c'}]

You can also view multiple columns:

get 'ns:t', 'row', {COLUMN => ['c1', 'c2', ...]}

This can be simplified to:

get 'ns:t', 'row', ['c1', 'c2']

Or:

get 'ns:t', 'row', 'c1', 'c2'

When reading columns, a specific timestamp or a number of versions can also be given:

get 'ns:t', 'row', {COLUMN => 'c', TIMESTAMP => ts, VERSIONS => v}

The delete command works much like put: it removes the data in a cell.
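
A sketch mirroring the put syntax above (names made up):

delete 'ns:t', 'r1', 'cf1:name'                   # delete the newest version of the cell
delete 'ns:t', 'r1', 'cf1:name', 1523592364481    # delete the version with this timestamp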

HBase principle

Physical storage

HBase splits a table into multiple regions, each covering a contiguous range of row keys, and a region typically holds many row keys. A new table starts with a single region; as data is inserted the region grows, and once it reaches a preset size it splits into two.

The region is the smallest unit of distribution and load balancing in HBase: different machines in the cluster hold different regions. The node that hosts regions is called a RegionServer, and one RegionServer may host many regions.

Inside a region there are one or more Stores, one per column family. Each Store has one MemStore and zero or more StoreFiles.

The MemStore holds data in memory; StoreFiles are files stored in HDFS.

Read and write data

Because row keys are kept in sorted order, when new data arrives HBase first works out which region the row key belongs to, then locates the RegionServer hosting that region, and finally finds the right Store by column family.

A region's data lives partly in memory and partly in HDFS. HBase writes to memory first; but memory is volatile and limited, so when a MemStore is nearly full, HBase opens a new MemStore and a background thread flushes the old one to HDFS. Writes arriving during the flush go to the new MemStore.

In other words, every write to HBase goes to memory, and the flush to HDFS happens transparently in the background. This is why HBase's write throughput is so high.

We know HDFS does not allow data to be modified, so once data lands in HDFS it is hard for HBase to change. When a client asks HBase to modify data, HBase leaves the old data in HDFS untouched and writes the new value into the MemStore. After that MemStore is later flushed to a StoreFile, HDFS actually holds two copies of the cell's data, and queries simply return the one with the newer timestamp.

Over time this creates a problem: a long-running HBase instance can accumulate a large amount of garbage data in HDFS that is never cleaned up.

To address this, when HBase notices too many StoreFiles it compacts several of them into one (in a separate thread, so data writes are unaffected). During this merge, obsolete cell versions are dropped, which reduces the garbage data.

After many merges a StoreFile can become very large, which hurts queries, so when HBase finds a StoreFile is too big it splits it. The files produced by the split contain no garbage data.

The elegance of this design is that writes to HBase are entirely memory-based, so users are never limited by HDFS's performance bottleneck.

If an HBase cluster node loses power, the MemStore data is lost, but the StoreFile data is not.

To handle this, HBase maintains an HLog file (stored in HDFS). Before writing to the MemStore, HBase records the operation in the HLog, and only reports success after both are done. After a power failure, the HLog can be replayed to recover the data that was in the MemStore.

When a MemStore is flushed to HDFS, the sequence number of the last persisted log entry is recorded in ZK. After a crash, HBase looks up that number and recovers only from the entries after it.

When the HLog reaches a certain size, HBase deletes the log records that have already been persisted.

Each RegionServer maintains one HLog.

On reads, data found in memory is returned directly. Otherwise HBase must locate all the data for the row key in HDFS (garbage data may turn up too), read it into memory, merge it, and return the result. This may scan several StoreFiles, so it is less efficient, though still roughly on par with reading HDFS directly.

In short, HBase's write performance is memory-speed and stable. Read performance is less stable: sometimes memory-speed, sometimes HDFS-speed. Thanks to HBase's internal index optimizations, though, the gap is acceptable.

StoreFile structure

A StoreFile is the file HBase stores in HDFS, holding part of a table's data. It consists of the following six parts:

  1. Data Block: holds the table data; can be compressed
  2. Meta Block: holds user-defined key-value pairs; can be compressed
  3. File Info: the StoreFile's metadata; cannot be compressed; users can add their own entries
  4. Data Block Index: the index of the Data Blocks
  5. Meta Block Index: the index of the Meta Blocks
  6. Trailer: fixed length; records the offset of every other section. When reading, this section is read first to learn where each piece of data sits.

The Data Block section is subdivided into many sub-blocks, each storing the key-value pairs (the real data) for a certain range of row keys. On a query, the Trailer gives the locations of the Data Block Index and the Data Blocks; the index then tells which sub-block holds the row key's data, and that whole sub-block is loaded into memory for merging.

Because a query for one row key may pull same-range data blocks from several StoreFiles, they are merged in memory before the result is returned.

Ultimately, HBase stores data as key-value pairs, where the key is row key + column family + column and the value is the actual data. Keys are sorted lexicographically by row key, which is why empty cells in HBase take up no space.
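
For example, a row r1 holding values only in columns a and c of family cf1 is stored as just two key-value pairs; the absent cf1:b has no entry at all (the layout below is illustrative, not the exact on-disk format):

r1/cf1:a/ts=1523592364481  ->  "v1"
r1/cf1:c/ts=1523592364482  ->  "v2"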

Region addressing

We have seen that HBase uses the row key to find the corresponding region, and hence the RegionServer holding that row key's data. How does HBase achieve this?

Regions are laid out by row key range, so determining which region holds a given row key is not complicated. But how does HBase determine which machine that region lives on?

Under HBase's own hbase namespace there is a meta table that records the mapping between regions and RegionServers, i.e. which machine stores which region. This meta table only ever has one region.

When data needs to reach some region, the meta table is consulted to find the machine hosting that region. The meta region's own location is not fixed: if its host machine goes down, the region is reopened on another machine.

The location of the meta region is recorded in ZK.

Here is the complete flow of writing or querying data in HBase:

  • The client obtains the location of the meta table's region from ZK and computes which region the row key to insert or modify belongs to.
  • It consults the meta table to find which machine in the HBase cluster stores that region, then finds the region's Store by column family.
  • For a write, the data goes straight into that region's MemStore on that machine. For a query, the MemStore is checked first; on a miss, the StoreFiles are searched, and the results from multiple StoreFiles are merged in memory before being returned.

The first step is not always needed: the client can cache region locations, avoiding the performance cost of hitting ZK on every request.

Storage structure

Broadly speaking, there are three ways to store data:

  • Hash storage: a persisted hash table. It supports insert, delete, update, and random reads (random reads are its strength), but not sequential scans. Hash-table inserts and lookups are especially fast (O(1) versus a tree's O(log n)), but sorting and ordered traversal are very inefficient.
  • B-tree storage: a persisted B-tree or one of its variants (B+ tree, etc.). It supports row-level insert, delete, and update, somewhat slower than a hash table, but it also supports sequential traversal and sorts efficiently. Traditional RDBMSs such as MySQL generally use this structure.
  • LSM-tree storage: the log-structured merge tree, which is what HBase uses. Part of the data lives in memory and part on disk. Writes accumulate in memory and reach disk in one sequential batch, which drastically reduces random disk I/O, so LSM write performance beats the other two (memory I/O plus batched sequential disk I/O versus lots of random disk I/O), while read performance trails them (and is unstable, since a read may involve heavy merging).

In other words, LSM improves write efficiency by sacrificing read efficiency.

An LSM tree stores data in memory first and spills it to disk once memory runs short. To keep writes fast, updating a value never deletes the old copy on disk; multiple copies of the data simply coexist. Writes therefore always go to memory, which is why they are so fast.

Reads, however, may require random disk I/O on a memory miss, plus merging of different historical versions of the data, which adds overhead. In extreme cases LSM reads are an order of magnitude slower than MySQL's, while LSM writes are an order of magnitude faster.

Underneath, an LSM tree maintains several "small trees". They start in memory and are persisted to disk once they grow past a certain size. The on-disk small trees are periodically merged into one big tree, which is in turn periodically split back into several small ones.

Because memory is volatile, an LSM tree needs a reliable log file that records every operation, especially persistence operations, so that lost in-memory data can be recovered.
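
To make the idea concrete, here is a toy LSM sketch in Java: writes go into a sorted in-memory table, which is frozen into an immutable sorted segment when it fills up, and reads check memory first and then the segments from newest to oldest. This only illustrates the principle under assumed thresholds; it is not HBase's implementation (which adds the HLog, compaction, and on-disk formats):

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ToyLsm {

    private static final int FLUSH_THRESHOLD = 4;  // assumed, tiny for the demo

    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SortedMap<String, String>> segments = new ArrayList<>();

    public void put(String key, String value) {
        memtable.put(key, value);                  // memory-speed write
        if (memtable.size() >= FLUSH_THRESHOLD) {  // "flush": one sequential batch write
            segments.add(0, memtable);             // newest segment first
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);              // check memory first
        if (v != null) return v;
        for (SortedMap<String, String> seg : segments) {
            v = seg.get(key);                      // then segments, newest first, so
            if (v != null) return v;               // stale versions are shadowed
        }
        return null;
    }

    public static void main(String[] args) {
        ToyLsm lsm = new ToyLsm();
        for (int i = 0; i < 10; i++) lsm.put("k" + i, "v" + i);
        lsm.put("k1", "v1-new");                   // the old copy stays in its segment
        System.out.println(lsm.get("k1"));         // prints v1-new
    }

}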

HBase FAQs

Why is HBase so fast?

  • Row keys are kept in sorted order, and the table is split by row key into multiple regions stored on different machines, so the machine holding the data can be located quickly during a query, and the distributed layout lets queries run on several machines in parallel.
  • Thanks to the LSM tree, part of the data sits in memory, and writes are memory-based. Very efficient.

Why can HBase store massive data?

  • The bottom layer is HDFS, a distributed filesystem designed for big data from the start.
  • The bottom layer stores key-value pairs rather than a table structure; empty cells take no space, so sparse tables use space efficiently.
  • Storage is columnar, and since the values in a column usually share a data type, the bottom layer can compress column data to save space.

How reliable is HBase?

  • HA is implemented on top of ZK.
  • The bottom layer is HDFS, which is inherently highly reliable.
  • The HLog allows timely recovery even when data held in memory is lost.

HBase vs Hive

  • Hive is just a Hadoop client, used as a data warehouse. Its job is to make managing data in HDFS and writing MR jobs convenient through SQL.
  • Hive is for high-latency offline analysis, and it cannot do row-level inserts, deletes, or updates.
  • HBase is a real database, providing a server and clients; HDFS is merely HBase's storage layer, and HBase itself is distributed.
  • HBase suits real-time analysis, with efficient writes and queries, and it supports row-level inserts, deletes, and updates.

HBase vs RDBMS

  • An RDBMS is not suited to storing massive data. HBase is distributed and HDFS-based, so it is.
  • An RDBMS suits structured data: the table schema is fixed and columns cannot simply be added (it can be done, but may require rebuilding the table, which is very inefficient and limited), and storage is row-oriented. HBase suits unstructured and semi-structured data: the schema is not fixed, columns can be added freely, and storage is column-oriented.
  • An RDBMS supports very complex structured queries. HBase supports only full-table scans and queries by row key.
  • An RDBMS supports ACID transactions and suits application-level data storage. HBase does not support ACID transactions, only row-level transactions; it is less suited to application data and better suited to massive auxiliary data, such as logs and crawler results, kept for analysis.

Java API

JDBC cannot be used to operate HBase; HBase's own API is required.

We need to use the following classes to operate HBase:

  • HBaseAdmin: the administrative entry point to the whole database.
  • HBaseConfiguration: the database configuration.
  • HTable: a table.
  • HTableDescriptor: a table description.
  • Put: used to write data.
  • Get: used to query data.
  • Scanner: used to scan tables.

Here is the code that connects to HBase and creates a table:

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

import java.io.IOException;

public class HBaseTest {

    public static void main(String[] args) throws IOException {
        // Get Conf
        Configuration conf = HBaseConfiguration.create();
        // Configure zk address
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                        "hadoop2:2181,hadoop3:2181");
        // Build Admin object
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Create a description of the table to create the table
        TableName name = TableName.valueOf("testTab");
        HTableDescriptor desc = new HTableDescriptor(name);

        // Create a column family
        HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
        HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
        desc.addFamily(cf1);
        desc.addFamily(cf2);

        // Create table
        admin.createTable(desc);

        // Close connection
        admin.close();
    }

}

In HBase, you can see the created table:

hbase(main):003:0> list
TABLE
tab1
testTab
2 row(s) in 0.0120 seconds

=> ["tab1", "testTab"]

hbase(main):004:0> desc 'testTab'
Table testTab is ENABLED
testTab
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'cf2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.1670 seconds

To write data to the table, use the HTable object:

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

import java.io.IOException;

public class HBasePut {

    public static void main(String[] args) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                "hadoop2:2181,hadoop3:2181");
        // Connect to the HBase table
        HTable hTable = new HTable(conf, "testTab");

        // To create a Put, you need to pass in the row key name
        Put put = new Put("rk1".getBytes());
        // Adding data requires the column family, the column, and the value
        put.add("cf1".getBytes(), "c1".getBytes(), "value".getBytes());
        // Note that everything above is passed as byte arrays

        // Write data to table
        hTable.put(put);

        // Close connection
        hTable.close();
    }

}

You can see the results:

hbase(main):005:0> scan 'testTab'
ROW                                      COLUMN+CELL
rk1                                     column=cf1:c1, timestamp=1523592364481, value=value
1 row(s) in 0.0990 seconds

The operation of querying data is similar:

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

import java.io.IOException;

public class HBaseGet {

    public static void main(String[] args) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                "hadoop2:2181,hadoop3:2181");
        // Connect to the HBase table
        HTable hTable = new HTable(conf, "testTab");

        // To create a new Get object, you need to specify a row key
        Get get = new Get("rk1".getBytes());
        // Specify column families and columns
        get.addColumn("cf1".getBytes(), "c1".getBytes());

        // Query the data; a Result object is returned because it may contain multiple cells
        Result res = hTable.get(get);
        // Retrieving a value from the result requires the column family and the column
        byte[] val = res.getValue("cf1".getBytes(), "c1".getBytes());
        // Convert to String, output
        System.out.println("res = " + new String(val));

        // Close connection
        hTable.close();
    }

}

Output on console:

res = value
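
The Scanner listed among the API classes works along the same lines; here is a minimal sketch using the same 0.98-era API against the testTab table (this class is not part of the original walkthrough):

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

import java.io.IOException;

public class HBaseScan {

    public static void main(String[] args) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                "hadoop2:2181,hadoop3:2181");
        // Connect to the HBase table
        HTable hTable = new HTable(conf, "testTab");

        // An empty Scan walks the whole table; setStartRow/setStopRow
        // would restrict it to a row-key range
        Scan scan = new Scan();
        ResultScanner scanner = hTable.getScanner(scan);
        for (Result res : scanner) {
            byte[] val = res.getValue("cf1".getBytes(), "c1".getBytes());
            if (val != null) {
                System.out.println(new String(res.getRow()) + " = " + new String(val));
            }
        }
        scanner.close();

        // Close connection
        hTable.close();
    }

}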

Delete data:

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;

import java.io.IOException;

public class HBaseDelete {

    public static void main(String[] args) throws IOException {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                "hadoop2:2181,hadoop3:2181");
        // Connect to the HBase table
        HTable hTable = new HTable(conf, "testTab");

        // New Delete object
        Delete delete = new Delete("rk1".getBytes());
        delete.deleteColumn("cf1".getBytes(), "c1".getBytes());

        hTable.delete(delete);

        // Close connection
        hTable.close();
    }

}

Back in the HBase shell, the data is gone:

hbase(main):006:0> scan 'testTab'
ROW                                      COLUMN+CELL
0 row(s) in 0.0210 seconds

Finally, to delete the entire table, use the HBaseAdmin object:

package cn.lazycat.bdd.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

import java.io.IOException;

public class HBaseDrop {

    public static void main(String[] args) throws IOException {
        // Get Conf
        Configuration conf = HBaseConfiguration.create();
        // Configure zk address
        conf.set("hbase.zookeeper.quorum", "hadoop1:2181," +
                "hadoop2:2181,hadoop3:2181");
        // Build Admin object
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Disable table
        admin.disableTable("testTab");

        // Delete table
        admin.deleteTable("testTab");

        // Close connection
        admin.close();
    }

}

As a result, the table was deleted:

hbase(main):007:0> list
TABLE
tab1
1 row(s) in 0.0120 seconds

=> ["tab1"]
