Flume Learning - Including Installation

1. What is Flume: Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transferring massive amounts of log data. Flume is based on a streaming architecture and is flexible and simple.

Flume Composition Architecture


Now let's go into detail about the components in the Flume architecture.

1) Agent:

An Agent is a JVM process that sends data from a source to a destination in the form of events. It is the basic unit of Flume data transmission.

An Agent consists of three parts: Source, Channel and Sink.

2) Source:

Source is the component responsible for receiving data into the Flume Agent. Source can process log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

3) Channel:

Channel is a buffer between Source and Sink.

Therefore, Channel allows Source and Sink to operate at different rates. Channel is thread-safe and can handle several Source write operations and several Sink read operations at the same time.

Flume comes with two Channels: Memory Channel and File Channel.

Memory Channel: a queue in memory. Memory Channel is suitable when data loss is not a concern. If data loss cannot be tolerated, Memory Channel should not be used, because a program crash, machine outage, or restart will cause data loss.

File Channel writes all events to disk. Therefore, data is not lost when the program crashes or the machine goes down.
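To make the contrast concrete, here is a minimal configuration sketch of the two channel types (the agent name a1 and the checkpoint/data directories are hypothetical placeholders, not part of the installation below):

# Memory Channel: events are held in an in-memory queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# File Channel: events are persisted to disk (assumed example paths)
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /usr/local/flume/checkpoint
a1.channels.c2.dataDirs = /usr/local/flume/data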

4) Sink:

Sink constantly polls the Channel for events and removes them in batches, writing these events in batches to a storage or indexing system, or sending them to another Flume Agent.

Sink is fully transactional. Before deleting a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch of events is successfully written to the storage system or to the next Flume Agent, the Sink commits the transaction with the Channel. Once the transaction is committed, the Channel deletes the events from its internal buffer.

Sink destinations include: hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, and custom sinks.

Flume Topology

The topology of Flume is shown in the figure.

Flume Agent Connection

Single source, multiple channels and sinks
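A single source fans out to multiple channels through a channel selector. The sketch below is illustrative only (component names a1, r1, c1, c2, k1, k2 are assumptions); the replicating selector copies every event to all listed channels:

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# Replicating selector: each event is written to every channel
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2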

Flume Load Balancing
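Load balancing is configured through a sink group with a load_balance processor. A minimal sketch under assumed names (k1 and k2 would typically be avro sinks pointing at two downstream agents):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin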

Flume Agent Aggregation
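Aggregation is usually achieved by having several upstream agents send events over avro to one collector agent. A hedged sketch with hypothetical host and port values:

# Upstream agent: avro sink pointing at the collector
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4141
# Collector agent: avro source listening on the same port
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141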

Internal Principles of Flume Agent

Flume Installation

1) Official website

2) Documentation address

3) Download address

Installation Deployment

1) Upload apache-flume-1.9.0-bin.tar.gz to the /usr/local/ directory on Linux

2) Unzip apache-flume-1.9.0-bin.tar.gz to the /usr/local/ directory

$ tar -zxf apache-flume-1.9.0-bin.tar.gz -C /usr/local/

3) Rename apache-flume-1.9.0-bin to flume

$ mv apache-flume-1.9.0-bin flume

4) Rename the flume-env.sh.template file under flume/conf to flume-env.sh and configure flume-env.sh

$ mv flume-env.sh.template flume-env.sh
$ vi flume-env.sh
export JAVA_HOME=/home/hduser/software/jdk1.8.0_221
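To confirm the installation, the flume-ng script can print the version (run from the flume directory):

$ bin/flume-ng version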

Case 1) Monitor port data and print to the console

1) Requirement: Flume monitors port 44444; messages are then sent to port 44444 via the telnet tool; finally, Flume displays the monitored data on the console in real time.

2) Demand analysis:

3) Implementation steps:

Install telnet tools
Check if telnet has been installed

rpm -qa | grep telnet

If nothing is returned, telnet is not installed; continue to the next step and install it.
Install telnet and telnet-server. Note that root permission is required to install. (It seems the server needs to be installed first.)

yum install telnet-server -y
yum install telnet -y

After the telnet service is installed, it is not enabled by default. Next we need to modify the configuration file to enable the service.

Note: on CentOS 7 the telnet configuration file is /etc/xinetd.conf; before CentOS 7 it was /etc/xinetd.d/telnet
vim /etc/xinetd.d/telnet
Change disable = yes to disable = no

Start the xinetd service

systemctl start xinetd.service

Enable xinetd to start on boot

systemctl enable xinetd.service

Start telnet service

systemctl restart telnet.socket 

Enable the telnet service to start on boot

systemctl enable telnet.socket

Test whether telnet works properly

telnet localhost

Determine whether port 44444 is occupied

sudo netstat -tunlp | grep 44444
Function description: the netstat command is a very useful tool for monitoring TCP/IP networks. It can display routing tables, active network connections, and status information for each network interface.
Basic syntax: netstat [options]
Option parameters:
-t or --tcp: show the connection status of the TCP transport protocol;
-u or --udp: show the connection status of the UDP transport protocol;
-n or --numeric: use IP addresses directly instead of resolving them through a domain name server;
-l or --listening: show only the sockets the server is listening on;
-p or --programs: show the process identifier and program name that is using each socket;

Create the Flume Agent configuration file flume-telnet-logger.conf

Create a jobconf folder under the flume directory and enter it

mkdir jobconf
cd jobconf/

Create the Flume Agent configuration file flume-telnet-logger.conf under the jobconf folder

touch flume-telnet-logger.conf

Edit the flume-telnet-logger.conf file.

vim flume-telnet-logger.conf

Add the following:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the agent

Open the flume listening port first
bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume-telnet-logger.conf -Dflume.root.logger=INFO,console

Description of parameters:
--conf conf/: indicates that the configuration files are stored in the conf/ directory
--name a1: names this agent a1
--conf-file jobconf/flume-telnet-logger.conf: the configuration file Flume reads at startup is flume-telnet-logger.conf in the jobconf folder
-Dflume.root.logger=INFO,console: -D means Flume dynamically overrides the flume.root.logger property at runtime, setting the console log print level to INFO. Log levels include debug, info, warn, and error.

Send content to port 44444 of the local machine using the telnet tool
telnet localhost 44444
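If telnet is unavailable, netcat can be used as an alternative client (assuming the nc tool is installed; the message text is arbitrary):

echo "hello flume" | nc localhost 44444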


Observe the received data in the Flume console

Case 2) Reading a local file to HDFS in real time

1. To output data to HDFS, Flume must have the Hadoop-related jar packages

Copy the following jars

commons-configuration-1.6.jar,	
hadoop-auth-2.7.2.jar,
hadoop-common-2.7.2.jar,
hadoop-hdfs-2.7.2.jar,
commons-io-2.4.jar,
htrace-core-3.1.0-incubating.jar

to the /usr/local/flume/lib folder.
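These jars ship with Hadoop 2.7.2. Assuming HADOOP_HOME points at that installation, they can typically be copied as follows (the share/hadoop/... locations are the usual ones but may differ in your distribution):

cp $HADOOP_HOME/share/hadoop/common/lib/commons-configuration-1.6.jar /usr/local/flume/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-auth-2.7.2.jar /usr/local/flume/lib/
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.2.jar /usr/local/flume/lib/
cp $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-2.7.2.jar /usr/local/flume/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/commons-io-2.4.jar /usr/local/flume/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar /usr/local/flume/lib/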

2. Create flume-file-hdfs.conf file

Create the file

 touch flume-file-hdfs.conf

Note: To read a file on a Linux system, the command must follow Linux command conventions. Because the Hive log is on the Linux system, the source type chosen is exec, which is short for execute and means a Linux command is executed to read the file.

 vim flume-file-hdfs.conf

Add the following

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /usr/local/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://chun1:9000/flume/%Y%m%d/%H
#Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
#Whether to roll folders according to time
a2.sinks.k2.hdfs.round = true
#How much time before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
#Unit of the time value above
a2.sinks.k2.hdfs.roundUnit = hour
#Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
#How many Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
#Set the file type (DataStream writes an uncompressed stream)
a2.sinks.k2.hdfs.fileType = DataStream
#How often (in seconds) to roll to a new file
a2.sinks.k2.hdfs.rollInterval = 600
#Roll size of each file (about 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
#File rolling is independent of the number of Events
a2.sinks.k2.hdfs.rollCount = 0
#Minimum number of HDFS block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

3. Run the monitoring configuration

bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume-file-hdfs.conf

4. Start Hadoop and Hive, and operate Hive to generate logs

sbin/start-dfs.sh
sbin/start-yarn.sh
bin/hive
hive >
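Any simple statement run in the Hive CLI should append entries to hive.log, for example:

hive> show databases;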

5. View files on HDFS.
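For example, assuming the hdfs command is on the PATH, the output directory configured above can be listed with:

hdfs dfs -ls -R /flume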

Case 3) Reading directory files to HDFS in real time

Use Flume to monitor files in an entire directory

3) Implementation steps:
1. Create configuration file flume-dir-hdfs.conf
Create a file

 touch flume-dir-hdfs.conf

Open the file

 vim flume-dir-hdfs.conf

Add the following

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /usr/local/flume-1.9.0/jobconf/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
#Ignore all files ending with .tmp; they will not be uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
#Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
#Whether to roll folders according to time
a3.sinks.k3.hdfs.round = true
#How much time before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
#Unit of the time value above
a3.sinks.k3.hdfs.roundUnit = minute
#Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#How many Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
#Set the file type (DataStream writes an uncompressed stream)
a3.sinks.k3.hdfs.fileType = DataStream
#How often (in seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 600
#Roll size of each file (about 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
#File rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0
#Minimum number of HDFS block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

2. Start the folder monitoring command

bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume-dir-hdfs.conf

Note: When using the Spooling Directory Source
1) Do not create or continuously modify files in the monitored directory
2) Uploaded files are renamed to end with .COMPLETED
3) The monitored folder is scanned for file changes every 500 milliseconds

3. Add files to the upload folder

Create the upload folder under the /usr/local/flume-1.9.0/jobconf directory (this must match the spoolDir configured above)

mkdir upload

Add files to upload folder

touch atguigu.txt
touch atguigu.tmp
touch atguigu.log
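Because the spooling directory source expects complete files, a safer pattern is to write a file elsewhere first and then move it into the monitored directory; an illustrative example (file name and content are arbitrary):

echo "hello flume" > /tmp/flume-test.txt
mv /tmp/flume-test.txt /usr/local/flume-1.9.0/jobconf/upload/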

4. View data on HDFS
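For example, assuming the hdfs command is on the PATH:

hdfs dfs -ls -R /flume/upload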
