Importing Excel data into hbase

Table design

  • Column cluster: 1-2 are recommended. If you can use one, you can't use two
  • Version design: if our project does not need to save historical VERSIONS, it is OK to directly configure VERSIONS=1 according to the default configuration. If you need to save historical change information in the project, you can set VERSIONS to > 1. But setting it to greater than 1 means taking up more space
  • Data compression: when creating a table, you can specify the data compression method (GZ, SNAPPY, LZO) for the column cluster. GZ mode has the highest compression ratio, about 13% of the space, but its compression and decompression speed is slower

Key actions to avoid hot spots

  1. Pre partition

    • When creating a table, configure some policies so that a table has multiple regions and is distributed in different hregionservers
    • HBase will automatically split. If a region is too large, HBase will automatically split into two, which is to divide horizontally according to rowkey
  2. rowkey design

    • Reverse: for example: mobile phone number and time stamp can reverse the mobile phone number

    • Salt: add random numbers in front of rowkey. After adding random numbers, the data will not be queried, because HBase has no secondary index by default

    • Hash: take the hash according to a part of the rowkey, because the hash has the same value every time. Therefore, we can use hash operation to obtain data

    • The core of these strategies is to break up the rowkey s and put them into the cluster nodes. Therefore, the data is no longer stored orderly, which will lead to the decline of the efficiency of scan

Pre partition

  • There are two strategies for pre partitioning

    • Pre partition by startKey and endKey [10, 40, 50]

  • Directly specify the quantity. startKey and endKey are automatically generated by hbase, and the algorithm of key needs to be specified

  • HBase data is stored in HDFS

    • /hbase/data / namespace / table / column cluster / StoreFiles

Table building instruction

# 1, Namespace
# 1. Create a namespace
create_namespace 'MOMO_CHAT'

# 2. View namespace
list_namespace

# 3. Delete the previous namespace
drop_namespace 'MOMO_CHAT'

# 4. View a specific namespace
describe_namespace 'MOMO_CHAT'
describe_namespace 'default'

# 5. Command momo_ Create a table named MSG under the chat namespace, which contains a column cluster named C1.
# Note: for tables with namespaces, use colons to join namespaces and table names together
create "MOMO_CHAT:MSG", "C1"

# 6. Specify how to modify the column cluster of a table and its compression method
alter "MOMO_CHAT:MSG", {NAME => "C1", COMPRESSION => "GZ"}

# 7. Delete the previously created table
disable "MOMO_CHAT:MSG"
drop "MOMO_CHAT:MSG"

# 8. You need to specify a pre partition when creating a table
create 'MOMO_CHAT:MSG', {NAME => "C1", COMPRESSION => "GZ"}, { NUMREGIONS => 6, SPLITALGO => 'HexStringSplit'}

You can see that there are already six region s.

Randomly generate a message

  1. Read data from excel file through Excel reader tool class and put it into a Map structure
    • key: field name
    • value: List, the data List corresponding to the field
  2. Create getOneMessage method, which is specially used to randomly generate an Msg entity object according to the data read by Excel
    • Call ExcelReader Randomcolumn method to randomly obtain the data of a column
  3. Note: the message uses the current system time. The format of time is year month day hour minute second
public class MoMoMsgGen {
    public static void main(String[] args) {
        // Read data from Excel file
        Map<String, List<String>> resultMap =
                ExcelReader.readXlsx("D:\\Curriculum development\\51.V8.0_NoSQL_MQ\\2.HBase\\3.code\\momo_chat_app\\data\\Test data set.xlsx", "Momo data");

        System.out.println(getOneMessage(resultMap));
    }

    /**
     * Randomly generate an Msg object based on the data read from the Excel table
     * @param resultMap Excel Read data (Map structure)
     * @return An Msg object
     */
    public static Msg getOneMessage(Map<String, List<String>> resultMap) {
        // one 	 Build Msg entity class object
        Msg msg = new Msg();

        // Set the time of the current system as the time of the message and store it in the form of month, day, hour, minute and second
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        // get SysTime
        Date now = new Date();
        msg.setMsg_time(simpleDateFormat.format(now));

        // two 	 Call randomColumn in ExcelReader to randomly generate the data of a column
        // Initialize sender_nickyname field, call randomColumn to get Nick randomly_ Name set data
        msg.setSender_nickyname(ExcelReader.randomColumn(resultMap, "sender_nickyname"));
        msg.setSender_account(ExcelReader.randomColumn(resultMap, "sender_account"));
        msg.setSender_sex(ExcelReader.randomColumn(resultMap, "sender_sex"));
        msg.setSender_ip(ExcelReader.randomColumn(resultMap, "sender_ip"));
        msg.setSender_os(ExcelReader.randomColumn(resultMap, "sender_os"));
        msg.setSender_phone_type(ExcelReader.randomColumn(resultMap, "sender_phone_type"));
        msg.setSender_network(ExcelReader.randomColumn(resultMap, "sender_network"));
        msg.setSender_gps(ExcelReader.randomColumn(resultMap, "sender_gps"));
        msg.setReceiver_nickyname(ExcelReader.randomColumn(resultMap, "receiver_nickyname"));
        msg.setReceiver_ip(ExcelReader.randomColumn(resultMap, "receiver_ip"));
        msg.setReceiver_account(ExcelReader.randomColumn(resultMap, "receiver_account"));
        msg.setReceiver_os(ExcelReader.randomColumn(resultMap, "receiver_os"));
        msg.setReceiver_phone_type(ExcelReader.randomColumn(resultMap, "receiver_phone_type"));
        msg.setReceiver_network(ExcelReader.randomColumn(resultMap, "receiver_network"));
        msg.setReceiver_gps(ExcelReader.randomColumn(resultMap, "receiver_gps"));
        msg.setReceiver_sex(ExcelReader.randomColumn(resultMap, "receiver_sex"));
        msg.setMsg_type(ExcelReader.randomColumn(resultMap, "msg_type"));
        msg.setDistance(ExcelReader.randomColumn(resultMap, "distance"));
        msg.setMessage(ExcelReader.randomColumn(resultMap, "message"));

        // three 	 Note that the time uses the current time of the system

        return msg;
    }
}

Generate rowkey

  1. ROWKEY = MD5Hash_ Sender account_ Recipient account_ Message timestamp
  2. MD5Hash.getMD5AsHex generates the MD5 value. In order to shorten the rowkey, take the first 8 bits
// Generate rowkey from Msg entity object
    public static byte[] getRowkey(Msg msg) throws ParseException {
        //
        // ROWKEY = MD5Hash_ Sender account_ Recipient account_ Message timestamp
        //
        // Use StringBuilder to underline sender account, recipient account and message timestamp () Splice together
        StringBuilder builder = new StringBuilder();
        builder.append(msg.getSender_account());
        builder.append("_");
        builder.append(msg.getReceiver_account());
        builder.append("_");
        // Gets the timestamp of the message
        String msgDateTime = msg.getMsg_time();
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date msgDate = simpleDateFormat.parse(msgDateTime);
        long timestamp = msgDate.getTime();
        builder.append(timestamp);

        // Use bytes Tobytes converts the spliced string into a byte [] array
        // Use md5hash Getmd5ashex generates MD5 value and takes its first 8 bits
        String md5AsHex = MD5Hash.getMD5AsHex(builder.toString().getBytes());
        String md5Hex8bit = md5AsHex.substring(0, 8);

        // Then, the MD5 value and the previously spliced sender account, recipient account and message timestamp are spliced with underscores and converted into Bytes array
        String rowkeyString = md5Hex8bit + "_" + builder.toString();
        System.out.println(rowkeyString);

        return Bytes.toBytes(rowkeyString);
    }

Push randomly generated data into HBase

public static void main(String[] args) throws ParseException, IOException {
        // Read data from Excel file
        Map<String, List<String>> resultMap =
                ExcelReader.readXlsx("D:\\Curriculum development\\51.V8.0_NoSQL_MQ\\2.HBase\\3.code\\momo_chat_app\\data\\Test data set.xlsx", "Momo data");


        // Generate data into HBase
        // 1. Get Hbase connection
        Configuration config = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(config);

        // 2. Obtain HBase table MOMO_CHAT:MSG
        Table table = connection.getTable(TableName.valueOf("MOMO_CHAT:MSG"));

        int i = 0;
        int MAX = 100000;

        while (i < MAX) {
            Msg msg = getOneMessage(resultMap);
            // 3. Initialize the variables (column cluster, column name) required for the operation of Hbase
            byte[] rowkey = getRowkey(msg);
            String cf = "C1";
            String colMsg_time = "msg_time";
            String colSender_nickyname = "sender_nickyname";
            String colSender_account = "sender_account";
            String colSender_sex = "sender_sex";
            String colSender_ip = "sender_ip";
            String colSender_os = "sender_os";
            String colSender_phone_type = "sender_phone_type";
            String colSender_network = "sender_network";
            String colSender_gps = "sender_gps";
            String colReceiver_nickyname = "receiver_nickyname";
            String colReceiver_ip = "receiver_ip";
            String colReceiver_account = "receiver_account";
            String colReceiver_os = "receiver_os";
            String colReceiver_phone_type = "receiver_phone_type";
            String colReceiver_network = "receiver_network";
            String colReceiver_gps = "receiver_gps";
            String colReceiver_sex = "receiver_sex";
            String colMsg_type = "msg_type";
            String colDistance = "distance";
            String colMessage = "message";

            // 4. Build put request
            Put put = new Put(rowkey);

            // 5. Add all columns of unfamiliar messages one by one
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colMsg_time), Bytes.toBytes(msg.getMsg_time()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_nickyname), Bytes.toBytes(msg.getSender_nickyname()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_account), Bytes.toBytes(msg.getSender_account()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_sex), Bytes.toBytes(msg.getSender_sex()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_ip), Bytes.toBytes(msg.getSender_ip()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_os), Bytes.toBytes(msg.getSender_os()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_phone_type), Bytes.toBytes(msg.getSender_phone_type()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_network), Bytes.toBytes(msg.getSender_network()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colSender_gps), Bytes.toBytes(msg.getSender_gps()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_nickyname), Bytes.toBytes(msg.getReceiver_nickyname()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_ip), Bytes.toBytes(msg.getReceiver_ip()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_account), Bytes.toBytes(msg.getReceiver_account()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_os), Bytes.toBytes(msg.getReceiver_os()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_phone_type), Bytes.toBytes(msg.getReceiver_phone_type()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_network), Bytes.toBytes(msg.getReceiver_network()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_gps), Bytes.toBytes(msg.getReceiver_gps()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colReceiver_sex), Bytes.toBytes(msg.getReceiver_sex()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colMsg_type), Bytes.toBytes(msg.getMsg_type()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colDistance), Bytes.toBytes(msg.getDistance()));
            put.addColumn(Bytes.toBytes(cf), Bytes.toBytes(colMessage), Bytes.toBytes(msg.getMessage()));

            // 6. Initiate put request
            table.put(put);

            // Show progress
            ++i;
            System.out.println(i + " / " + MAX);
        }
        table.close();
        connection.close();
    }

The number of data written here is 10w. You can see that the request is evenly distributed in the region.

Implement getMessage data service interface

Implemented using scan + filter

  1. Building scan objects
  2. Build 4 filter s (start date query, end date query, sender and recipient)
  3. Build a list of Msg objects
  /**
     *
     * @param date Date: September 10, 2020
     * @param sender Sender
     * @param receiver addressee
     * @return
     * @throws Exception
     */
    @Override
    public List<Msg> getMessage(String date, String sender, String receiver) throws Exception {
        // one 	 Building scan objects
        Scan scan = new Scan();

        // Build two date strings with hours, minutes and seconds
        String startDateStr = date + " 00:00:00";
        String endDateStr = date + " 23:59:59";

        // two 	 Build a range for query time, for example: 2020-10-05 00:00:00 – 2020-10-05 23:59:59
        // three 	 Build two filters for query date, greater than or equal to and less than or equal to. Use SingleColumnValueFilter to Filter a single column here.
        SingleColumnValueFilter startDateFilter = new SingleColumnValueFilter(Bytes.toBytes("C1")
                , Bytes.toBytes("msg_time")
                , CompareOperator.GREATER_OR_EQUAL
                , new BinaryComparator(Bytes.toBytes(startDateStr)));

        SingleColumnValueFilter endDateFilter = new SingleColumnValueFilter(Bytes.toBytes("C1")
                , Bytes.toBytes("msg_time")
                , CompareOperator.LESS_OR_EQUAL
                , new BinaryComparator(Bytes.toBytes(endDateStr)));

        // four 	 Build sender Filter
        SingleColumnValueFilter senderFilter = new SingleColumnValueFilter(Bytes.toBytes("C1")
                , Bytes.toBytes("sender_account")
                , CompareOperator.EQUAL
                , new BinaryComparator(Bytes.toBytes(sender)));

        // five 	 Build recipient Filter
        SingleColumnValueFilter receiverFilter = new SingleColumnValueFilter(Bytes.toBytes("C1")
                , Bytes.toBytes("receiver_account")
                , CompareOperator.EQUAL
                , new BinaryComparator(Bytes.toBytes(receiver)));

        // six 	 Use FilterList to combine all filters
        FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL
                , startDateFilter
                , endDateFilter
                , senderFilter
                , receiverFilter);

        // seven 	 Set scan object filter
        scan.setFilter(filterList);

        // eight 	 Get the HTable object and call getScanner to execute
        Table table = connection.getTable(TableName.valueOf("MOMO_CHAT:MSG"));
        ResultScanner resultScanner = table.getScanner(scan);

        // nine 	 Get the iterator, iterate each row, and iterate each cell at the same time
        Iterator<Result> iterator = resultScanner.iterator();

        // Create a list to save the queried messages
        ArrayList<Msg> msgList = new ArrayList<>();

        while(iterator.hasNext()) {
            // The data queried in each row is an Msg object
            Result result = iterator.next();
            Msg msg = new Msg();
            // Get rowkey
            String rowkey = Bytes.toString(result.getRow());
            // Cell list
            List<Cell> cellList = result.listCells();

            for (Cell cell : cellList) {
                // Determine according to the column name of the current cell cell and set the corresponding field
                String columnName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
                if(columnName.equals("msg_time")) {
                    msg.setMsg_time(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_nickyname")){
                    msg.setSender_nickyname(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_account")){
                    msg.setSender_account(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_sex")){
                    msg.setSender_sex(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_ip")){
                    msg.setSender_ip(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_os")){
                    msg.setSender_os(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_phone_type")){
                    msg.setSender_phone_type(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_network")){
                    msg.setSender_network(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("sender_gps")){
                    msg.setSender_gps(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_nickyname")){
                    msg.setReceiver_nickyname(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_ip")){
                    msg.setReceiver_ip(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_account")){
                    msg.setReceiver_account(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_os")){
                    msg.setReceiver_os(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_phone_type")){
                    msg.setReceiver_phone_type(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_network")){
                    msg.setReceiver_network(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_gps")){
                    msg.setReceiver_gps(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("receiver_sex")){
                    msg.setReceiver_sex(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("msg_type")){
                    msg.setMsg_type(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("distance")){
                    msg.setDistance(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
                if(columnName.equals("message")){
                    msg.setMessage(Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength()));
                }
            }

            msgList.add(msg);
        }

        // close resource
        resultScanner.close();
        table.close();

        return msgList;
    }

Keywords: Big Data HBase Middleware

Added by switchdoc on Sat, 29 Jan 2022 16:29:20 +0200