The rows of the RowKey of HBase are sorted by row keys in dictionary order. This design optimizes scanning and allows the storage of related rows or adjacent rows that will be read together. However, poorly designed row keys are a common cause of hotspotting. When a large amount of client traffic is directed to one or more nodes in the cluster, hotspotting occurs. These flows may represent read, write, or other operations. The traffic exceeds the capacity of a single machine carrying the region, which will lead to performance degradation and may lead to the unavailability of the region. Other regions on the same RegionServer may also be adversely affected because the host cannot provide the load requested by the service. It is very important to design a data access mode so that the cluster can be used fully and evenly.
In order to prevent hotspots during write operations, row keys should be designed so that data can be written to multiple regionservers at the same time, and avoid writing to only one RegionServer, unless those rows really need to be written to one RegionServer. The sorting problem of HBase is sorted by rowkey in dictionary order, from small to large. If the task is to obtain the latest time data, it can be reversed with timestamp or long maxvalue - timestamp
In order of hash unique length
hash Hash and pre partition can be put together, for example, 10 in advance region Reduce hot issues Can id %10 Start, but if id Regular words can be used(id+specific date)%10 only It must be unique in design, rowkey It is stored in dictionary order, Therefore, design rowkey To make full use of this sorting feature, you can store the frequently read data in one piece and the recently accessed data in one piece. You can put a timestamp in it hbase If it is sorted by dictionary order, it is necessary to consider the case of single partition writing all the time length rowkey As a binary code stream, the maximum is 64 kb It's best not to exceed 16 bytes. Too long design will occupy memstore space order HBase of rowkey The order of the installation dictionary is from small to large,Therefore, when you need the latest data, you need to flip the timestamp or self increment id
Attributes with fewer enumerable attribute values are placed in front of rowkey
Rowkey is a combination of multiple fields. The order of these fields is directly related to the efficiency of access. A general rule is: a small number of controllable fields are placed in the front of rowkey (such as eengine_type, provinc, etc.); On the contrary, put it in the back (such as vin,timestamp, etc.). The reason for this is that controllable attributes are put in front of each other, and the balance of different query requirements is stronger. On the contrary, the balance is poor.
Case 1: YCK09360-60-1638290481900-9011D6L00124 434.7 YCK09360-60-1638290482900-9011D6L00124 76.1 YCK09360-60-1638290483900-9011D6L00124 18.6 YCK09360-60-1638290484900-9011D6L00124 150.1 YCK09360-60-1638290485900-9011D6L00124 96.1 YCK09360-60-1638290586900-9011D6L00124 35.7 ENGINE_TYPE Enumerable, and the number is small, put it in the front; and vin There are a lot, so put it in the back. This design can meet the following two requirements, and the complexity is relatively small: 1) query,Within a certain period of time YCK09360-60 All the data. This demand setting scan of startrow='YCK09360-60_time stamp'，endrow='YCK09360-60_time stamp'，Just. 2) Query all 9011 in a certain period of time D6L00124 Data. Under this demand, according to scan rowkey The principle of continuity sets this demand scan of startrow='YCK09360-60_time stamp_9011D6L00124 '，endrow='YCK09360-60_time stamp_9011D6L00124 '，Just.
However, if vin is put in front, the adaptability is poor, as shown in case 2:
9011D6L00124-YCK09360-60-1638290481900 434.7 9011D6L00124-YCK09360-60-1638290482900 76.1 9011D6L00124-YCK09360-60-1638290483900 18.6 9011D6L00124-YCK09360-60-1638290484900 150.1 9011D6L00124-YCK09360-60-1638290485900 96.1 9011D6L00124-YCK09360-60-1638290586900 35.7 1)All queries in a certain period 9011 D6L00124 Data. Under this demand, set scan of startrow='9011D6L00124-YCK09360-60-Timestamp, endrow='9011D6L00124-YCK09360-60-time stamp'，Just. 2) Query a certain period of time YCK09360-60 All the data. This demand setting scan It's for taking YCK09360-60 Small all vin Number and put it in rowkey The first paragraph of,Start multiple scan To scan multiple sets of data
HBase of rowkey The order of the installation dictionary is from small to large,Therefore, when you need the latest data, you need to flip the timestamp or self increment id ,Therefore, it is best to use the timestamp field long.maxvalue-timestamp/1000 ,This ensures that the latest data is always in a newer position,Easy to read
The setting of pre partition is also necessary,Prevent data hotspot issues.,have access to hash The remainder is used to delete the key partition fields,The uniformity of zoning is ensured
So the final design scheme
Partition field-Identification field-Identification field-long.maxvalue-timestamp
Other points that I find meaningful
Reduce column family and data storage overhead,Reduce qualifiers if necessary ,Write the data directly into a single item,Split in the rest of the processing code. stay HBase In, value Always with its key Transmitted together. When a specific value is transmitted between systems, its rowkey，The column name and timestamp will also be transmitted together. If your rowkey And the list is very large, HBase storefiles The index in (which facilitates random access) will occupy HBase Allocate a lot of memory because of the specific value and its key Very big. Can increase block Size makes storefiles Increase the index at a greater interval, or modify the schema of the table to reduce it rowkey And the size of the column name. Compression also contributes to larger indexes. control rowkey Below 16 bytes and maintained at an integer multiple of 8,8-byte alignment for 64 bit systems ,Maximum 64 bits High weight in business access key Put it in the front for example URLRecords The main purpose of the table is to calculate the of the day URL Access ranking. According to business requirements, you need to access all the information on a certain day URL，therefore date It is the primary key, with higher weight. It is placed in the front, while URL Put it in the back. Construct redundant data For example, percontent Your data contains URL Records Data, URL Records The data is stored redundantly. The difference is percontent of URL Put on date Front, and URL Records Tabular URL Put on date Back. This is because URL When meeting different needs, the weights are different because URL Records The amount of data required is not large, so the redundant mechanism is used to solve this contradiction. Weigh the importance of requirements and system tolerance and choose a scheme. When there are contradictions between the two requirements, but one of them belongs to secondary requirements and is within the range of system tolerance, one scheme can be abandoned. Give priority to the party with stronger needs Rowkey Time attribute problem loop key If there is a problem with using (1) rowkey There is a time attribute in, and with the increase of time, rowkey If it will continue to increase, it will cause region The number is increasing. If used TTL To control the life cycle of data, some old data will expire, resulting in old data region The amount of data will gradually decrease or even become empty region. On the one hand region The total number is increasing. On the other hand, it is old region Is constantly becoming empty region，And empty region Will not merge automatically, resulting in too many empty region Load and memory consumption. (2) Solution in this case, a loop can be used key To solve the problem. The idea is to set according to the life cycle of data rowkey After a cycle has passed, the old expired data will continue to be used through the method of time mapping rowkey. For example, key The format is as follows: YY-MM-DD-URL. If the life cycle of the data is one year, it can be used MM-DD-URL Format. In this way, after the current year has passed, the data has aged, and the data of the next year can continue to be written to the location of the previous year, using the location of the data of the previous year rowkey. This avoids empty region Occupation of resources. according to hbase The principle of, key The cycle needs to be at least TTL Big 2* hbase.hregion.majorcompaction(The default time is 24 hours) to ensure that expired data can be stored in key It is completely cleaned before the cycle comes back. The method of creating a table according to the time period can also solve the problem of null region Problems, and cycles key Method comparison, cycle key The advantages are as follows: Simple operation, no need to create a table repeatedly, and the system will process it automatically Similarly, cycle key It has the following disadvantages: Need to use TTL To aging data, it may increase compact burden It is necessary to ensure that the query operation will not query expired data, otherwise it will affect the system performance. If the system is not under great pressure and needs to run for a long time, and the query can be controlled so that the overdue data will not be queried, it is recommended to use it TTL+loop key Otherwise, it is recommended to create a table according to the time period. adopt rowkey Designed to control concurrency Under the same business model, different rowkey The concurrency of the design system is different. Similar to the idea of building tables by day, through rowkey The principle of controlling concurrency is active region The total number is moderate, each regionserver Activation of Region Number greater than 1, less than (write operation memory)/flushsize)Better. To achieve this, you can put enumerable, limited number of attributes in rowkey To improve concurrency, time is put in the front of time; By putting large granularity time attributes (such as days, hours, etc.) in rowkey Previously, a large number of enumerable attributes (such as phone number URL Etc.) placed in the later method to control the activation region Count.
Telecom mobile phone industry case
Case 1 xx_yy_zz_time stamp xx Must hash for partition field yy and zz To identify the query field, if enumerable, the number of enumerations shall be from small to large to facilitate query Timestamp is the best query field identification If the purpose of the table is to count the data of the current day, you can put the year, month and day into the table rowkey Side by side yy before Scene question Usage scenario: Telecom case:Query someone(cell-phone number)A year[A month and a day](time)Call details. 1) Pre partition (1) Evaluate the data growth in the next six months to one year,Don't let it partition automatically(10G) (2) Determine partition key 00| 01| 02| ... 000| 001| ... 2) Design RowKey (1) Determine the area code (Hashing) 00_ 01_ 02_... cell-phone number%Number of partitions Insufficient hash (cell-phone number+specific date)%It is inconvenient to query the number of partitions by month and year (cell-phone number+years)%Number of partitions (2) Splice field (Uniqueness, length) XX_cell-phone number_time stamp XX_cell-phone number_Year month day hour minute second XX_time stamp_cell-phone number XX_Year month day hour minute second_cell-phone number (3) check 13412341234 2021-09-07 XX_cell-phone number_Year month day hour minute second startRow:05_13412341234_2021-09-07 stopRow :05_13412341234_2021-09-08 05_13412341234_2021-09-07| XX_Year month day hour minute second_cell-phone number startRow:05_2021-09-07 00:00:00_13412341234 stopRow :05_2021-09-08 00:00:00_13412341234 13412341234 2021-09 2021-11 XX_cell-phone number_Year month day hour minute second startRow:05_13412341234_2021-09 stopRow :05_13412341234_2021-09| 05_13412341234_2021-10 startRow:03_13412341234_2021-10 stopRow :03_13412341234_2021-11 startRow:04_13412341234_2021-11 stopRow :04_13412341234_2021-12