Big Data Flume: Enterprise Development Practice

1 replication and multiplexing

1.1 case requirements

Flume-1 monitors file changes. Flume-1 passes the changes to Flume-2, which stores them in HDFS. At the same time, Flume-1 passes the changes to Flume-3, which writes them to the local file system.

1.2 requirements analysis: single data source with multiple outputs (channel selector)

1.3 implementation steps

(1) Preparatory work
Create the group1 folder in the /opt/module/flume/job directory
[bigdata@hadoop102 job]$ mkdir group1
[bigdata@hadoop102 job]$ cd group1/
Create the flume3 folder in the /opt/module/data/ directory
[bigdata@hadoop102 data]$ mkdir flume3
(2) Create flume-file-flume.conf
Configure one source, two channels and two sinks to receive the log file and send it to flume-flume-hdfs and flume-flume-dir respectively.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-file-flume.conf
Add the following

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# avro on sink side is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
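
This case uses the replicating selector, which copies every event to both channels. For comparison, the fragment below is a sketch of a multiplexing selector, which routes each event to a single channel based on an event header; the header name "state" and its values are made up for illustration (in practice they would be set by an interceptor) and are not part of this case.

# Hypothetical multiplexing example: route events by the value of the "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1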

(3) Create flume-flume-hdfs.conf
Configure an avro Source to receive the output of the upstream Flume agent, and a Sink that writes to HDFS.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-flume-hdfs.conf
Add the following

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# avro on the source side is a data receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9820/flume2/%Y%m%d/%H
#Prefix of uploaded file
a2.sinks.k1.hdfs.filePrefix = flume2-
#Roll folders based on time
a2.sinks.k1.hdfs.round = true
#How many time units to create a new folder
a2.sinks.k1.hdfs.roundValue = 1
#Redefine time units
a2.sinks.k1.hdfs.roundUnit = hour
#Use local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#How many events are accumulated to flush to HDFS once
a2.sinks.k1.hdfs.batchSize = 100
#Set the file type to support compression
a2.sinks.k1.hdfs.fileType = DataStream
#How often to roll to a new file (seconds)
a2.sinks.k1.hdfs.rollInterval = 30
#Set the roll size of each file to about 128 MB
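#(for reference: 128 MiB = 128 * 1024 * 1024 = 134217728 bytes, so 134217700 rolls the file just below that)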
a2.sinks.k1.hdfs.rollSize = 134217700
#File rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume-flume-dir.conf
Configure an avro Source to receive the output of the upstream Flume agent, and a Sink that writes to a local directory.
Edit the configuration file
[bigdata@hadoop102 group1]$ vim flume-flume-dir.conf
Add the following

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Tip: the output local directory must already exist. The file_roll sink does not create the directory if it is missing.
(5) Execute the configuration files
Start the corresponding Flume agents in order: flume-flume-dir, flume-flume-hdfs, flume-file-flume.

[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

(6) Start Hadoop and Hive

[bigdata@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[bigdata@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
[bigdata@hadoop102 hive]$ bin/hive
hive (default)>

(7) Check data on HDFS
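The uploaded files can be listed from the command line, for example (/flume2 is the base path configured in flume-flume-hdfs.conf above):

[bigdata@hadoop102 flume]$ hadoop fs -ls -R /flume2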

(8) Check the data in the /opt/module/data/flume3 directory

[bigdata@hadoop102 flume3]$ ll
total 8
-rw-rw-r--. 1 bigdata bigdata 5942 May 22 00:09 1526918887550-3

2 load balancing and failover

2.1 case requirements

Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3 respectively, and FailoverSinkProcessor is used to implement failover.

2.2 requirements analysis: failover case

2.3 implementation steps

(1) Preparatory work
Create the group2 folder in the /opt/module/flume/job directory
[bigdata@hadoop102 job]$ mkdir group2
[bigdata@hadoop102 job]$ cd group2/
(2) Create flume-netcat-flume.conf
Configure one netcat source, one channel and one sink group with two sinks, sending data to flume-flume-console1 and flume-flume-console2 respectively.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-netcat-flume.conf
Add the following

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
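
This case wires the sink group to the failover processor. Since the section also covers load balancing, the fragment below is a sketch of how the same group could instead use Flume's load_balance sink processor; the sinks, channel and bindings stay unchanged, and this fragment is not part of this case.

# Alternative sketch: load-balancing sink processor instead of failover
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin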

(3) Create flume-flume-console1.conf
Configure an avro Source to receive the output of the upstream Flume agent, and a logger Sink that prints to the local console.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-flume-console1.conf
Add the following

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume-flume-console2.conf
Configure an avro Source to receive the output of the upstream Flume agent, and a logger Sink that prints to the local console.
Edit the configuration file
[bigdata@hadoop102 group2]$ vim flume-flume-console2.conf
Add the following

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

(5) Execute the configuration files
Start the corresponding Flume agents in order: flume-flume-console2, flume-flume-console1, flume-netcat-flume.

[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

(6) Use the netcat tool to send content to port 44444 of this machine
$ nc localhost 44444
(7) Observe the console output of Flume2 and Flume3. Because k2 (priority 10) outranks k1 (priority 5), the events are printed by Flume3.
(8) Kill the Flume3 process and observe the Flume2 console: the sink group fails over and Flume2 starts printing the events.
Note: use jps -ml to find the Flume processes.
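One way to do this from the shell, sketched with a placeholder for the actual process id reported by jps:

[bigdata@hadoop102 flume]$ jps -ml | grep flume
[bigdata@hadoop102 flume]$ kill -9 <pid-of-the-a3-agent>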

3 aggregation

3.1 case requirements

Flume-1 on hadoop102 monitors the file /opt/module/group.log, and Flume-2 on hadoop103 monitors the data stream on a port. Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console.

3.2 requirements analysis: multi-source aggregation case

3.3 implementation steps

(1) Preparatory work
Distribute Flume
[bigdata@hadoop102 module]$ xsync flume
Create a group3 folder in the /opt/module/flume/job directory on hadoop102, hadoop103 and hadoop104.

[bigdata@hadoop102 job]$ mkdir group3
[bigdata@hadoop103 job]$ mkdir group3
[bigdata@hadoop104 job]$ mkdir group3

(2) Create flume1-logger-flume.conf
Configure a Source to monitor the group.log file, and a Sink that sends the data to the next-level Flume.
Edit the configuration file on hadoop102
[bigdata@hadoop102 group3]$ vim flume1-logger-flume.conf
Add the following

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Create flume2-netcat-flume.conf
Configure a Source to monitor the data stream on port 44444, and a Sink that sends the data to the next-level Flume.
Edit the configuration file on hadoop103
[bigdata@hadoop103 group3]$ vim flume2-netcat-flume.conf
Add the following

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume3-flume-logger.conf
Configure a Source to receive the data streams sent by flume1 and flume2, and a Sink that prints the merged data to the console.
Edit the configuration file on hadoop104

[bigdata@hadoop104 group3]$ touch flume3-flume-logger.conf
[bigdata@hadoop104 group3]$ vim flume3-flume-logger.conf

Add the following

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

(5) Execute the configuration files
Start the corresponding Flume agents in order: flume3-flume-logger.conf, flume2-netcat-flume.conf, flume1-logger-flume.conf.

[bigdata@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[bigdata@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf
[bigdata@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf

(6) Append content to group.log in the /opt/module directory on hadoop102
[bigdata@hadoop102 module]$ echo 'hello' >> group.log
(7) Send data to port 44444 on hadoop103
[bigdata@hadoop102 flume]$ telnet hadoop103 44444
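If telnet is not installed, nc (used in the failover case above) works the same way:
[bigdata@hadoop102 flume]$ nc hadoop103 44444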
(8) Check the data on hadoop104
