Overview
Apache Hudi helps you build and manage data lakes, with different table types and configuration options to suit a wide range of needs. Hudi adds metadata fields to each record, such as _hoodie_record_key, _hoodie_partition_path, and _hoodie_commit_time, which serve many purposes. They avoid recomputing record keys and partition paths during merges, compaction, and other table operations, and they enable record-level incremental queries (compared with other table formats that track only files). In addition, they help ensure data quality by enforcing unique-key constraints even if the key field of a table changes over its lifetime. However, for simple use cases that do not need these benefits, or where keys change very rarely, a recurring request from the community has been to use existing fields instead of adding extra meta fields.
Virtual key support
Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from the data fields. By default, the meta fields are computed once, stored as per-record metadata, and reused across various operations. Users who do not need incremental query support can enable Hudi's virtual key support and continue to use Hudi to build and manage their data lake, reducing the storage overhead caused by per-record metadata.
Related configuration
You can enable virtual keys for a given table with the following configuration: when hoodie.populate.meta.fields is set to false, Hudi uses virtual keys for that table. The default value of this configuration is true, which means all meta fields are added by default.
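For illustration, enabling virtual keys at write time from a Spark shell might look like the sketch below. This is not from the original article: the table name ("trips"), the record key and partition path field names (uuid, partitionpath), and the df/basePath variables are assumptions; only the hoodie.populate.meta.fields option is the configuration discussed above.

```scala
// Hedged sketch: writing a Hudi table with virtual keys enabled.
// "trips", uuid, partitionpath, df, and basePath are illustrative assumptions.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.populate.meta.fields", "false"). // virtual keys: meta fields are not stored
  mode(SaveMode.Overwrite).
  save(basePath)
```

With this option set to false, the written files carry only the user fields; keys are derived from the configured data fields whenever they are needed.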
Once virtual keys are enabled for a given Hudi table, they cannot be disabled, because already-stored records may not have their meta fields populated. However, if you have an existing table written by an older version of Hudi, virtual keys can still be enabled. Another constraint with respect to virtual keys is that the key generator properties of a table cannot be changed during its lifetime. In this model, users also share the responsibility of ensuring the uniqueness of keys within the table. For example, if you configure the record key to point to field_5 and later switch to field_10, Hudi cannot guarantee key uniqueness, because earlier writes may already contain duplicates with respect to field_10.
When virtual keys are used, the keys must be recomputed every time they are needed (merges, compaction, MOR snapshot reads). Therefore, we support virtual keys for all built-in key generators on copy-on-write tables. Supporting all key generators on merge-on-read tables would require reading all fields out of the base and delta log files, sacrificing core columnar query performance, which would be very expensive for users. Therefore, only simple key generators (the default key generator, where both the record key and the partition path refer to existing fields) are currently supported on merge-on-read tables.
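To make the recomputation concrete, here is a minimal, self-contained sketch of the idea behind a simple key generator: both the record key and the partition path are read directly from existing data fields, so they can be rederived on demand rather than stored. This is not Hudi's actual SimpleKeyGenerator implementation; the Map-based record and the "default" fallback partition are illustrative assumptions.

```scala
// Sketch only: NOT Hudi's actual implementation. A record is modeled as a
// Map from field name to value, standing in for a real row.
object SimpleKeyGenSketch {
  // Derive (recordKey, partitionPath) purely from existing data fields,
  // mirroring why no stored meta fields are needed for a simple key generator.
  def keyAndPartition(record: Map[String, String],
                      recordKeyField: String,
                      partitionPathField: String): (String, String) = {
    val recordKey = record.getOrElse(recordKeyField,
      throw new IllegalArgumentException(s"record key field '$recordKeyField' is missing"))
    // Fall back to a hypothetical default partition when the field is absent.
    val partitionPath = record.getOrElse(partitionPathField, "default")
    (recordKey, partitionPath)
  }
}
```

Because the derivation is a pure function of the data fields, any operation that needs the key (a merge, a compaction, a MOR snapshot read) can recompute it, at the cost of reading those fields each time.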
Key generators supported by CopyOnWrite(COW) table
SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
Key generators supported by MergeOnRead(MOR) table
SimpleKeyGenerator
Supported index types
The initial version only supports the SIMPLE and GLOBAL_SIMPLE index types; support for other indexes such as BLOOM is planned for later.
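As an illustration, a virtual-key table would pin its index type to one of the supported options. The fragment below is a sketch of a write-config/properties entry; SIMPLE is shown, and GLOBAL_SIMPLE is the other currently supported choice.

```properties
# Sketch: virtual-key tables currently require SIMPLE or GLOBAL_SIMPLE
hoodie.index.type=SIMPLE
```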
Supported operations
With the exception of incremental queries, all existing features are supported for Hudi tables with virtual keys. This means cleaning, archiving, the metadata table, clustering, and so on can all be enabled for a table with virtual keys. So, if you wish, you can use Hudi purely as a transactional table format, together with all of its table services and platform services, without any of the overhead associated with supporting incremental data processing.
Examples
As mentioned earlier, you need to set hoodie.populate.meta.fields=false to enable virtual keys. Next, let's look at the difference between tables with and without virtual keys enabled.
Here are some sample records from a regular Hudi table (virtual keys disabled):
+--------------------+--------------------------------------+--------------------------------------+---------+----------+------------------+
|_hoodie_commit_time | _hoodie_record_key                   | _hoodie_partition_path               | rider   | driver   | fare             |
+--------------------+--------------------------------------+--------------------------------------+---------+----------+------------------+
| 20210825154123     | eb7819f1-6f04-429d-8371-df77620b9527 | americas/united_states/san_francisco |rider-284|driver-284|98.3428192817987  |
| 20210825154123     | 37ea44f1-fda7-4ec4-84de-f43f5b5a4d84 | americas/united_states/san_francisco |rider-213|driver-213|19.179139106643607|
| 20210825154123     | aa601d6b-7cc5-4b82-9687-675d0081616e | americas/united_states/san_francisco |rider-213|driver-213|93.56018115236618 |
| 20210825154123     | 494bc080-881c-48be-8f8a-8f1739781816 | americas/united_states/san_francisco |rider-284|driver-284|90.9053809533154  |
| 20210825154123     | 09573277-e1c1-4cdd-9b45-57176f184d4d | americas/united_states/san_francisco |rider-284|driver-284|49.527694252432056|
| 20210825154123     | c9b055ed-cd28-4397-9704-93da8b2e601f | americas/brazil/sao_paulo            |rider-213|driver-213|43.4923811219014  |
| 20210825154123     | e707355a-b8c0-432d-a80f-723b93dc13a8 | americas/brazil/sao_paulo            |rider-284|driver-284|63.72504913279929 |
| 20210825154123     | d3c39c9e-d128-497a-bf3e-368882f45c28 | americas/brazil/sao_paulo            |rider-284|driver-284|91.99515909032544 |
| 20210825154123     | 159441b0-545b-460a-b671-7cc2d509f47b | asia/india/chennai                   |rider-284|driver-284|9.384124531808036 |
| 20210825154123     | 16031faf-ad8d-4968-90ff-16cead211d3c | asia/india/chennai                   |rider-284|driver-284|90.25710109008239 |
+--------------------+--------------------------------------+--------------------------------------+---------+----------+------------------+
Here are some sample records from a Hudi table with virtual keys enabled:
+--------------------+--------------------+------------------------+---------+----------+------------------+
|_hoodie_commit_time | _hoodie_record_key | _hoodie_partition_path | rider   | driver   | fare             |
+--------------------+--------------------+------------------------+---------+----------+------------------+
| null               | null               | null                   |rider-284|driver-284|98.3428192817987  |
| null               | null               | null                   |rider-213|driver-213|19.179139106643607|
| null               | null               | null                   |rider-213|driver-213|93.56018115236618 |
| null               | null               | null                   |rider-284|driver-284|90.9053809533154  |
| null               | null               | null                   |rider-284|driver-284|49.527694252432056|
| null               | null               | null                   |rider-213|driver-213|43.4923811219014  |
| null               | null               | null                   |rider-284|driver-284|63.72504913279929 |
| null               | null               | null                   |rider-284|driver-284|91.99515909032544 |
| null               | null               | null                   |rider-284|driver-284|9.384124531808036 |
| null               | null               | null                   |rider-284|driver-284|90.25710109008239 |
+--------------------+--------------------+------------------------+---------+----------+------------------+
As you can see, all meta fields are null in storage, while all user fields remain intact, just like in a regular table.
Incremental query
Since Hudi no longer maintains any record-level table metadata (such as the commit time of each record) once virtual keys are enabled, incremental queries are not supported. If you attempt an incremental query, the following exception is thrown:
scala> val tripsIncrementalDF = spark.read.format("hudi").
     |   option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
     |   option(BEGIN_INSTANTTIME_OPT_KEY, "20210827180901").
     |   load(basePath)
org.apache.hudi.exception.HoodieException: Incremental queries are not supported when meta fields are disabled
  at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:69)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:120)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 61 elided
Summary
I hope this blog was useful for learning about another feature of Apache Hudi.
This article was originally written by "xiaozhch5", a blogger covering big data and artificial intelligence, and is licensed under CC 4.0 BY-SA. Please include the original source link and this statement when reprinting.
Original link: https://lrting.top/backend/2026/