Big Data Tutorial (14.2) Website Data Analysis

The previous article introduced the business background of the website click-stream data analysis project; this post continues with the website analysis itself.

1. Overall Technical Process and Architecture

1.1. Data Processing Flow

This is a pure data analysis project, so its overall flow follows the data processing pipeline. The major steps are:
1) Data acquisition
First, user access behavior is captured by JS code embedded in the pages and sent to the web server, where it is written to the server's access log.
Then the click-stream logs produced on each server are aggregated into HDFS, either in real time or in batches.

Of course, in a full analysis system the data sources are not limited to click-stream data; they may also include business data from the database (user, product, and order information, etc.) and any external data useful for the analysis.

2) Data preprocessing
Preprocess the collected click-stream data with a MapReduce program: cleaning, reformatting, filtering out dirty records, and so on.

3) Data warehousing
Load the preprocessed data into the corresponding databases and tables in the Hive warehouse.

4) Data analysis
This is the core of the project: develop the ETL/analysis statements required by the statistics and produce the various results (a Hive JDBC sketch follows this list).

5) Data presentation
Visualize the results of the analysis.
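To make step 4 concrete, here is a minimal sketch of running one such statistical statement against the Hive warehouse over JDBC (HiveServer2). The host, database, table and column names below are assumptions for illustration only, not the project's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged sketch: one analysis statement run against Hive over JDBC.
// "hdp-node-01", the dw_click database and the ods_weblog_detail table are assumed names.
public class DailyPvReport {
	public static void main(String[] args) throws Exception {
		Class.forName("org.apache.hive.jdbc.HiveDriver");
		try (Connection conn = DriverManager.getConnection(
					"jdbc:hive2://hdp-node-01:10000/dw_click", "hadoop", "");
			Statement stmt = conn.createStatement();
			// Count page views per day from the (assumed) detail table
			ResultSet rs = stmt.executeQuery(
					"SELECT datestr, COUNT(1) AS pv FROM ods_weblog_detail GROUP BY datestr")) {
			while (rs.next()) {
				System.out.println(rs.getString("datestr") + "\t" + rs.getLong("pv"));
			}
		}
	}
}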

1.2. Project structure

Since this is a pure data analysis project, its overall architecture mirrors the analysis flow; there is nothing particularly complex about it.

One point worth emphasizing: the analysis in this system is not a one-off computation; it is re-run repeatedly at a fixed schedule and frequency, so the steps in the processing chain must be strung together according to their sequential dependencies. This involves managing and scheduling a large number of task units, which is why a task-scheduling module has to be added to the project (here Azkaban, introduced earlier, is used in place of Oozie).

1.3. Data presentation

The purpose of data presentation is to visualize the analysis results so that operational decision makers can access and understand the data more easily and quickly. A simple architecture built on Spring MVC + ECharts is enough; a sketch follows.
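As a minimal sketch of that presentation layer (the endpoint path and the hard-coded numbers are placeholders, not real results; a real controller would read the statistics exported from Hive into a relational store), a Spring MVC endpoint can return the data in the JSON shape an ECharts chart consumes:

import java.util.LinkedHashMap;
import java.util.Map;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hedged sketch of the Spring MVC + ECharts presentation layer.
@RestController
public class PvTrendController {

	@GetMapping("/report/pvtrend")
	public Map<String, Object> pvTrend() {
		Map<String, Object> chartData = new LinkedHashMap<>();
		// xAxis categories and series values, consumed directly by an ECharts option on the page
		chartData.put("dates", new String[] {"2013-09-18", "2013-09-19", "2013-09-20"});
		chartData.put("pv", new long[] {1000, 1200, 900});  // placeholder values
		return chartData;
	}
}

On the page, the returned arrays are simply assigned to the chart's xAxis.data and series data.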

2. Module Development-Data Acquisition

2.1. Requirements

Data acquisition requirements fall broadly into two parts.
1) Collecting the user's access behavior on the page. Concretely, this means:
1. Developing the page-tracking ("buried point") JS that captures user access behavior
2. Developing the backend that receives the JS requests and writes them to the request log
This part of the work can also be regarded as part of the "data source", and its development is usually handled by the web development team.

2) Shipping the logs from the web servers to HDFS; this is the data acquisition done by the data analysis system and is handled by the data platform team. It can be implemented with several technologies:
Shell scripts
Advantages: lightweight, simple to develop
Disadvantages: fault-tolerant handling during collection is hard to control
Custom Java collector
Advantages: fine-grained control over the collection process
Disadvantages: large development workload
Flume log collection framework
A mature open-source log collection system that is itself part of the Hadoop ecosystem and integrates and scales naturally with the other components of the Hadoop stack.

2.2. Technical Selection

In a click-stream log analysis scenario, the reliability and fault-tolerance requirements on data acquisition are usually not very strict, so the general-purpose Flume log collection framework fully meets the need.
This project uses Flume to collect the logs.

2.3. Flume Log Collection System

1. Data Source Information

The data analyzed in this project are the access logs generated by the nginx servers and stored on each nginx machine, for example:
               /var/log/httpd/access_log.2015-11-10-13-00.log
               /var/log/httpd/access_log.2015-11-10-14-00.log
               /var/log/httpd/access_log.2015-11-10-15-00.log
               /var/log/httpd/access_log.2015-11-10-16-00.log

2. Sample data content

The exact content of the data does not require much attention at the collection stage.

58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

Field breakdown (a parsing sketch follows this list):
1. Visitor IP address: 58.215.204.118
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response code: 304
8. Bytes returned: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser (user agent): Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
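A minimal parsing sketch (not the project's actual parser) showing how these ten fields can be pulled out of one combined-format log line with a regular expression:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch: extract the fields of one nginx combined-format log line.
public class AccessLogFieldDemo {

	// ip, identd, user, time, method, url, protocol, status, bytes, referrer, user agent
	private static final Pattern LOG_PATTERN = Pattern.compile(
			"^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

	public static void main(String[] args) {
		String line = "58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "
				+ "\"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1\" 304 0 "
				+ "\"http://blog.fens.me/nodejs-socketio-chat/\" "
				+ "\"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0\"";

		Matcher m = LOG_PATTERN.matcher(line);
		if (m.find()) {
			System.out.println("ip        = " + m.group(1));
			System.out.println("time      = " + m.group(4));
			System.out.println("method    = " + m.group(5));
			System.out.println("url       = " + m.group(6));
			System.out.println("status    = " + m.group(8));
			System.out.println("bytes     = " + m.group(9));
			System.out.println("referrer  = " + m.group(10));
			System.out.println("useragent = " + m.group(11));
		}
	}
}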

3. Log File Rotation Rules

The basic rule is:
The file currently being written to is access_log; when it reaches 256 MB, or every 60 minutes, it is rolled (renamed) into a historical log file such as:
access_log.2015-11-10-13-00.log

Of course, every company's web server logging policy is different; it can be defined, for example, in the web application's log4j.properties as follows:

log4j.appender.logDailyFile = org.apache.log4j.DailyRollingFileAppender 
log4j.appender.logDailyFile.layout = org.apache.log4j.PatternLayout 
log4j.appender.logDailyFile.layout.ConversionPattern = [%-5p][%-22d{yyyy/MM/dd HH:mm:ssS}][%l]%n%m%n 
log4j.appender.logDailyFile.Threshold = DEBUG 
log4j.appender.logDailyFile.ImmediateFlush = TRUE 
log4j.appender.logDailyFile.Append = TRUE 
log4j.appender.logDailyFile.File = /var/logs/access_log 
log4j.appender.logDailyFile.DatePattern = '.'yyyy-MM-dd-HH-mm'.log' 
log4j.appender.logDailyFile.Encoding = UTF-8

4. Implementation of Flume Acquisition

The Flume collection system is relatively simple to set up:
1. Deploy an agent on each web server and edit its configuration file
2. Start the agent so that the collected data is aggregated into the specified HDFS directory

Version selection: apache-flume-1.7.0
Collection rule design:
1. Source: the nginx server log directory
2. Destination: the HDFS directory /home/hadoop/weblogs/

Collection rule configuration details:

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure spooldir source1
#agent1.sources.source1.type = spooldir
#agent1.sources.source1.spoolDir = /var/logs/nginx/
#agent1.sources.source1.fileHeader = false

# Describe/configure tail -F source1
#Use exec as the data source component
agent1.sources.source1.type = exec 
#Collect newly generated log data in real time with the tail -F command
agent1.sources.source1.command = tail -F /var/logs/nginx/access_log
agent1.sources.source1.channels = channel1

#configure host for source
#Configure an interceptor plug-in
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
#Use the interceptor plug-in to get the host name of the server where the agent is located
agent1.sources.source1.interceptors.i1.hostHeader = hostname

#Configure sink component to hdfs
agent1.sinks.sink1.type = hdfs
#a1.sinks.k1.channel = c1

#agent1.sinks.sink1.hdfs.path=hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H%M%S
#Specify the HDFS path that the sink writes to; %{hostname} is filled from the interceptor header above
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M_%{hostname}
#Specify filename prefix
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000 
#Specify the number of records for each batch of sinking data
agent1.sinks.sink1.hdfs.batchSize= 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat =Text
#Roll the file when it reaches 1 GB (the value must be a plain byte count)
agent1.sinks.sink1.hdfs.rollSize = 1073741824
#Roll the file after 1,000,000 events
agent1.sinks.sink1.hdfs.rollCount = 1000000
#Roll the file every 30 seconds (hdfs.rollInterval is in seconds; use 1800 for 30 minutes)
agent1.sinks.sink1.hdfs.rollInterval = 30
#agent1.sinks.sink1.hdfs.round = true
#agent1.sinks.sink1.hdfs.roundValue = 10
#agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
#Use memory type channel
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Starting the collection
On the nginx server where Flume is deployed, start the Flume agent with the following command:
               bin/flume-ng agent --conf ./conf -f ./conf/weblog.properties.2 -n agent1

Note: the -n parameter in the startup command must match the agent name defined in the configuration file (agent1 above).

3. Module Development-Data Preprocessing

3.1 Main purposes: a. filter out "irregular" (dirty) records; b. convert and adjust the data format; c. filter and separate the base data of different topics (different column sets) according to the downstream statistical requirements.

3.2 Implementation: develop an MR program, WeblogPreProcess (abridged here; for the full version see "Log analysis of large Internet e-commerce companies with big data hadoop, web log data cleaning"):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WebLogBean and WebLogParser are project classes omitted here (see the reference above)
public class WeblogPreProcess {
	static class WeblogPreProcessMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
		Text k = new Text();
		NullWritable v = NullWritable.get();
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String line = value.toString();
			WebLogBean webLogBean = WebLogParser.parser(line);
//			WebLogBean productWebLog = WebLogParser.parser2(line);
//			WebLogBean bbsWebLog = WebLogParser.parser3(line);
//			WebLogBean cuxiaoBean = WebLogParser.parser4(line);
			if (!webLogBean.isValid())
				return;
			k.set(webLogBean.toString());
			context.write(k, v);
//			k.set(productWebLog);
//			context.write(k, v);
		}
	}
	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJarByClass(WeblogPreProcess.class);
		job.setMapperClass(WeblogPreProcessMapper.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
		
	}
}

Run the MR job to preprocess the data:

hadoop jar weblog.jar  cn.itcast.bigdata.hive.mr.WeblogPreProcess /weblog/input /weblog/preout

3.3 Click-Stream Model Data Generation

Since many statistical indicators are easier to derive from the click-stream model, the preprocessing stage also uses MR programs to generate the click-stream model data; a sessionization sketch follows.
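As a rough illustration of the sessionization idea behind the click-stream model (this is not the project's actual MR code, and the 30-minute cut-off is a common convention assumed here, not a documented project setting): a visitor's requests are sorted by time, and a new session starts whenever the gap between two consecutive requests exceeds the threshold.

import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

// Hedged sketch: assign session ids to one visitor's time-ordered requests.
public class SessionCutDemo {

	private static final DateTimeFormatter F = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
	private static final long SESSION_GAP_MINUTES = 30;  // assumed cut-off

	public static void main(String[] args) {
		List<String> timesOfOneVisitor = Arrays.asList(
				"2013-09-18 06:51:35", "2013-09-18 06:52:10",  // same session
				"2013-09-18 08:20:00");                        // gap > 30 min -> new session

		String sessionId = UUID.randomUUID().toString();
		LocalDateTime prev = null;
		int step = 1;
		for (String t : timesOfOneVisitor) {
			LocalDateTime cur = LocalDateTime.parse(t, F);
			if (prev != null && Duration.between(prev, cur).toMinutes() > SESSION_GAP_MINUTES) {
				sessionId = UUID.randomUUID().toString();  // start a new session
				step = 1;
			}
			System.out.println(sessionId + "\t" + t + "\tstep=" + step);
			step++;
			prev = cur;
		}
	}
}

In an MR implementation, the same grouping would typically be done per visitor key in the reducer, with each output line becoming one row of the pageviews table.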

3.3.1 Click Stream Model pageviews table

Generating the pageviews model data

See the project source for the code; run it with:
hadoop jar weblogpreprocess.jar  \
cn.itcast.bigdata.hive.mr.ClickStreamThree   \
/user/hive/warehouse/dw_click.db/test_ods_weblog_origin/datestr=2013-09-20/ /test-click/pageviews/

Table structure:

(See Section 6.2 for table definition and data import)

3.3.2 Click-Stream Model visit Table

Note: one "visit" = N consecutive requests within one session.
It is difficult to derive each visitor's per-visit (session-level) information directly from the raw data with HQL alone, so a MapReduce program is used to compute the visit-level data first, and HQL is then used for the multi-dimensional statistics on top of it.

An MR program extracts each visit's start/end time and entry/exit pages from the pageviews data; a simplified sketch follows.
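For intuition only (again, not the actual ClickStreamVisit implementation; the record layout here is an assumption), one visit record per session can be derived by taking that session's earliest and latest pageviews:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hedged sketch: collapse one session's pageview records into a single "visit" record.
public class VisitDemo {

	public static void main(String[] args) {
		// Pageview records of a single session: {session, time, page}
		List<String[]> oneSession = Arrays.asList(
				new String[] {"s1", "2013-09-18 06:51:35", "/nodejs-socketio-chat/"},
				new String[] {"s1", "2013-09-18 06:52:10", "/hadoop-hive-intro/"},
				new String[] {"s1", "2013-09-18 06:55:02", "/about/"});

		// Timestamps share one format, so lexicographic order equals chronological order
		String[] first = oneSession.stream().min(Comparator.comparing((String[] r) -> r[1])).get();
		String[] last = oneSession.stream().max(Comparator.comparing((String[] r) -> r[1])).get();

		// One visit: session, start time, end time, entry page, exit page, page count
		System.out.println(String.join("\t",
				first[0], first[1], last[1], first[2], last[2],
				String.valueOf(oneSession.size())));
	}
}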

See the project source for the full implementation; run it with:
hadoop jar weblogpreprocess.jar cn.itcast.bigdata.hive.mr.ClickStreamVisit /weblog/sessionout /weblog/visitout

Then load the results into the click-stream visit model table in the Hive warehouse:

load data inpath '/weblog/visitout' into table click_stream_visit partition(datestr='2013-09-18');

