Using Zeppelin with Flink to query Hudi data

About Zeppelin

Zeppelin is a web-based notebook that supports data-driven, interactive data analysis and collaboration using SQL, Scala, Python, R, and more.

Zeppelin supports multiple language backends: the Apache Zeppelin interpreter framework allows any language or data-processing backend to be plugged into Zeppelin. Currently, Apache Zeppelin ships interpreters for Apache Spark, Apache Flink, Python, R, JDBC, Markdown, Shell, and many more.

Simply put, it lets you do from a web UI many things that would otherwise require logging in to the server and working in a terminal.

(For introductions to Flink and Hudi, see other articles on this blog, or search on your own.)

Now, on to today's topic.

Components and versions used in this article:

Component   Version
Hadoop      3.2.0
Hudi        0.10.0-SNAPSHOT
Zeppelin    0.10.0
Flink       1.13.1

Before doing anything below, import some data into Hudi. If you have not done so yet, you can refer to the related blog post, which walks through importing data into Hudi:

Restore hudi jobs from savepoint using FLINK SQL (flink 1.13)
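
As in that article, the Hudi Flink bundle jar is assumed to already be on the Flink classpath, since the flink interpreter needs it to resolve the hudi connector. A quick sanity check (the exact jar name is hypothetical for a snapshot build):

ls $FLINK_HOME/lib/hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar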

Download the Zeppelin installation package

# create a working directory and download the Zeppelin release
mkdir /data && cd /data
wget https://dlcdn.apache.org/zeppelin/zeppelin-0.10.0/zeppelin-0.10.0-bin-all.tgz
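
Optionally, verify the archive against the checksum published on the Apache download page before unpacking:

sha512sum zeppelin-0.10.0-bin-all.tgz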

# unpack and create a convenience symlink
tar zxvf zeppelin-0.10.0-bin-all.tgz
ln -s /data/zeppelin-0.10.0-bin-all /data/zeppelin

Modify the Zeppelin configuration

cd /data/zeppelin/conf
cp zeppelin-site.xml.template zeppelin-site.xml

In zeppelin-site.xml, set the zeppelin.server.addr configuration item to 0.0.0.0 so the server can be reached from other machines.

Zeppelin listens on port 8080 by default. If that conflicts with a port already in use locally, change it to another one; this article uses 8008, i.e. the zeppelin.server.port configuration item is set to 8008.
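
For reference, the two edited entries in zeppelin-site.xml end up looking roughly like this (a sketch; the surrounding template text may differ):

<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8008</value>
</property>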

cp zeppelin-env.sh.template zeppelin-env.sh

Then set the following variables in zeppelin-env.sh:

export JAVA_HOME=/data/jdk
export HADOOP_CONF_DIR=/data/hadoop/etc/hadoop
export FLINK_HOME=/data/flink

Adjust these paths to match your own environment.
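
A quick way to confirm the paths actually exist, run from /data/zeppelin (a minimal sketch assuming the values shown above):

source conf/zeppelin-env.sh
ls "$JAVA_HOME/bin/java" "$FLINK_HOME/bin/flink" "$HADOOP_CONF_DIR/core-site.xml"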

This article runs Flink in the interpreter's default local mode.
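
As a side note, Zeppelin's Flink interpreter exposes a flink.execution.mode property that controls this; if you later want to submit to a cluster instead, it can be changed in the interpreter settings (this article keeps the default):

flink.execution.mode  local   # alternatives include remote and yarn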

Start Zeppelin

bin/zeppelin-daemon.sh start

If the logs and run directories do not exist at this point, they are created automatically under the Zeppelin directory, as shown below:

[root@hadoop zeppelin]# bin/zeppelin-daemon.sh start
Log dir doesn't exist, create /data/zeppelin/logs
Pid dir doesn't exist, create /data/zeppelin/run
Zeppelin start                                             [  OK  ]
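
The same daemon script can also check or stop the service:

bin/zeppelin-daemon.sh status
bin/zeppelin-daemon.sh stop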

Open a browser and go to <zeppelin server ip>:8008 or <hostname>:8008 to reach the following page:

Basic Use

Click Notebook, then Create new note, fill in the note name, and select the flink interpreter, as follows:

After creating the note, you land on the following page:

As mentioned earlier, we have already imported data into Hudi by following the article

Restore hudi jobs from savepoint using FLINK SQL (flink 1.13)

so we can now query that data.

We choose the streaming SQL interpreter binding:

%flink.ssql
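
For reference, a complete Zeppelin paragraph with this binding looks like the following (a minimal sketch; %flink.bsql would run the same statement as batch SQL instead):

%flink.ssql
-- the first line of a paragraph selects the interpreter binding;
-- everything after it is ordinary Flink SQL
select 'hello' as greeting;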

First define the hudi table:

create table stu8_binlog_sink_hudi(
  id bigint not null,
  name string,
  `school` string,
  nickname string,
  age int not null,
  score decimal(4,2) not null,
  class_num int not null,
  phone bigint not null,
  email string,
  ip string,
  primary key (id) not enforced
)
partitioned by (`school`)
with (
  'connector' = 'hudi',                                         -- Hudi Flink connector
  'path' = 'hdfs://hadoop:9000/tmp/test_stu8_binlog_sink_hudi', -- table location on HDFS
  'table.type' = 'MERGE_ON_READ',                               -- MOR table type
  'write.precombine.field' = 'school'                           -- field used to pick the latest record for a duplicate key
);

Query the hudi table:

select * from stu8_binlog_sink_hudi;

The results are as follows:
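
A row count works in the same way, if you want a quick statistic over the table (a hypothetical follow-up query):

select count(*) from stu8_binlog_sink_hudi;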

Next, a query with ORDER BY:

select * from stu8_binlog_sink_hudi order by age desc limit 100;
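
Grouped aggregations also work; for example, counting rows and averaging scores per partition (a hypothetical query using the columns defined above):

select `school`, count(*) as cnt, avg(score) as avg_score
from stu8_binlog_sink_hudi
group by `school`;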

Summary

This article used Zeppelin with the Flink engine to query and aggregate the given Hudi data. Everything we previously did in the Flink SQL Client can also be done in Zeppelin, so Zeppelin can completely replace the Flink SQL Client described in the previous article.

Learn more

The Hudi walkthrough in this article is one example from the Hudi series; you can find more at:

https://lrting.top/hudi.html
