[Introduction to Cloud Computing Experiment 3] MapReduce programming

Prerequisites

You need a Hadoop pseudo-distributed cluster set up first; see the tutorial "Quick Start Tutorial for Hadoop Big Data Technology and Pseudo-Distributed Clustering".

Eclipse Environment Configuration

Eclipse (local Windows system)

1. Install plug-ins:

hadoop-eclipse-plugin-2.7.3.jar

Address: https://pan.baidu.com/s/1BaAOQkZaY4RvUPuBPVlgJg Extraction code: 067u

Copy the plug-in into the dropins directory under the Eclipse installation directory.

2. Configure the MapReduce environment:

After installing the Hadoop plug-in, configure the development environment to connect to the Hadoop cluster as follows:
(1) Add the Map/Reduce perspective

(2) Add a connection to the Hadoop cluster

Click the blue elephant icon to create a new location

The connection settings are:

1) Location name: a name for the new Hadoop connection; any name will do

2) Map/Reduce(V2) Master: the IP address and port of the Hadoop cluster's ResourceManager, as configured earlier in yarn-site.xml

3) DFS Master: the IP address and port of the Hadoop cluster's NameNode, as configured earlier in core-site.xml

If you did not configure a port explicitly, keep the default value and finish the configuration.
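If you are unsure which host and port to enter for DFS Master, the sketch below shows one way to read them out of a core-site.xml file. It assumes the NameNode address is stored under the standard fs.defaultFS property; the class name CoreSiteReader and the inline sample XML are hypothetical, and the address is simply the one used later in this tutorial. The same helper could be pointed at yarn-site.xml and the yarn.resourcemanager.address property, if that property was set explicitly.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CoreSiteReader {
    /** Returns the value of the named property in a Hadoop-style XML config, or null if absent. */
    static String property(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String n = p.getElementsByTagName("name").item(0).getTextContent().trim();
                if (n.equals(name)) {
                    return p.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample content; replace it with the text of your own core-site.xml.
        String xml = "<configuration><property><name>fs.defaultFS</name>"
                + "<value>hdfs://192.168.80.128:8020</value></property></configuration>";
        URI uri = URI.create(property(xml, "fs.defaultFS"));
        System.out.println("DFS Master host: " + uri.getHost() + ", port: " + uri.getPort());
    }
}
```

The host and port printed by main are the two values to type into the DFS Master fields.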

There is a pitfall here: clicking the location you just created may show an error, but if the settings are correct, the error can be ignored.

(3) Browse directories and files on HDFS

After configuring the connection information for the Hadoop cluster, you can browse the directories and files on HDFS from the Eclipse interface, and also use the mouse to perform file operations.

Note that after performing an operation, you need to refresh the HDFS list to see the latest state of the directories and files.

3. Create a new MapReduce project

This section creates the MapReduce project MemberCount in Eclipse, in roughly the following four steps:

(1) Import the jar packages that MapReduce depends on

First copy the Hadoop package from the virtual machine to the local environment, as follows:

You need to unzip a copy of Hadoop on the local Windows system and set its path.

(2) Create a MapReduce project

(3) Fill in the project details in the MapReduce Project dialog

(4) Check whether the project has been created properly

Configure log4j.properties

Create a new log4j.properties file in the project's src directory with the following content:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

Other pitfalls and environment settings

1. The Windows user name must not contain spaces

Use the following command in cmd to view the user name:

whoami
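The same check can be made from Java, which is what Hadoop ultimately sees. The following is a minimal sketch, assuming the user name that matters is the one the JVM reports in the user.name system property; the class name UserNameCheck is hypothetical.

```java
class UserNameCheck {
    /** Returns true if the name contains any whitespace character. */
    static boolean hasWhitespace(String name) {
        return name.chars().anyMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        // user.name is the account name the JVM was started under.
        String user = System.getProperty("user.name");
        if (hasWhitespace(user)) {
            System.out.println("User name \"" + user + "\" contains whitespace; rename the account.");
        } else {
            System.out.println("User name \"" + user + "\" contains no whitespace.");
        }
    }
}
```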

To change it, right-click the Start button and select Computer Management, then make the changes shown in the figure below. A restart is required for the change to take effect, but do not restart yet, because later steps also require a restart.

2. hadoop.dll is missing

Download the hadoop.dll that matches your Hadoop version from https://github.com/4ttty/winutils

Then put it in the C:\Windows\System32 directory

3. winutils.exe is also required on Windows

Download the winutils.exe that matches your Hadoop version from https://github.com/4ttty/winutils

Then put it in the bin folder of the local Hadoop directory and set the environment variables

Create a new system variable HADOOP_HOME whose value is the local Hadoop installation directory

Then edit the Path variable and add the entry

%HADOOP_HOME%\bin

The changes take effect after a restart
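After the restart, a quick way to confirm the setup is to check from Java that HADOOP_HOME is visible and that winutils.exe sits in its bin folder. This is only a sketch of that check, not part of Hadoop itself; the class name HadoopHomeCheck and the method hasWinutils are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HadoopHomeCheck {
    /** Returns true if hadoopHome/bin/winutils.exe exists as a regular file. */
    static boolean hasWinutils(Path hadoopHome) {
        return Files.isRegularFile(hadoopHome.resolve("bin").resolve("winutils.exe"));
    }

    public static void main(String[] args) {
        // Read the system variable created in the step above.
        String home = System.getenv("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            System.out.println("HADOOP_HOME is not set (or the restart has not happened yet).");
            return;
        }
        System.out.println(hasWinutils(Paths.get(home))
                ? "winutils.exe found under " + home
                : "winutils.exe is missing from the bin folder of " + home);
    }
}
```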

4. Set dfs.permissions to false

Switch to the directory that contains hdfs-site.xml, shown below

Open the file with the command vim hdfs-site.xml and add the following property; if the property already exists, change its value to false

<property>
	<name>dfs.permissions</name>
	<value>false</value>
</property>

MapReduce example: word count program

Directory structure

Code

WordCountMapper class, which extends Hadoop's Mapper:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text,Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // key: byte offset of the line; value: one line of text, e.g. "I love Guiyang"
        // Split each line of text on spaces: [I, love, Guiyang]
        String line = value.toString(); // Convert Text to String
        String[] words = line.split(" "); // Split the string: [I, love, Guiyang]
        for (int i = 0; i < words.length; i++) {
            // Emit one (word, 1) pair per word, e.g. (I, 1)
            Text word = new Text(words[i]);
            IntWritable value2 = new IntWritable(1);
            context.write(word, value2);
        }
    }
}
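To see what the map step emits without starting Hadoop, the logic can be mimicked in plain Java. The sketch below is an illustration only: the class MapStepSketch and the method mapLine are hypothetical, and it deliberately splits on runs of whitespace ("\\s+") instead of a single space, because split(" ") yields empty strings when a line contains consecutive spaces.

```java
import java.util.ArrayList;
import java.util.List;

class MapStepSketch {
    /** Mimics the map step: one (word,1) pair per whitespace-separated token. */
    static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line would otherwise yield one empty token
                pairs.add("(" + word + ",1)");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [(I,1), (love,1), (Guiyang,1)]
        System.out.println(mapLine("I love Guiyang"));
    }
}
```

Whether to switch the real mapper to whitespace-run splitting depends on the input; for text with irregular spacing it avoids counting the empty string as a word.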

WordCountReducer class, which extends Hadoop's Reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, IntWritable,Text, IntWritable> {
    @Override
    protected void reduce(Text k3, Iterable<IntWritable> v3, Context context) throws IOException, InterruptedException {
        // Sum the values in v3 to get the total count for the word k3
        // e.g. <I,[1,1]> gives (I,2)
        int count = 0;
        for (IntWritable v: v3) {
            int value = v.get();
            count += value;
        }
        context.write(k3,new IntWritable(count));
    }
}
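Between the mapper and the reducer, Hadoop groups all values that share a key, which is why reduce receives something like <I,[1,1]>. The following plain-Java sketch imitates that grouping and the summing done by the reducer; the class ReduceStepSketch and its methods shuffle and reduce are hypothetical names, and a TreeMap stands in for Hadoop's sorted shuffle output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceStepSketch {
    /** Mimics the shuffle: collects the 1s emitted by the map step under their word. */
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keys kept in sorted order
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    /** Mimics the reduce step: sums the values collected for one key. */
    static int reduce(Iterable<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "love", "Guiyang", "I", "love", "Guizhou");
        // Print each key, its grouped values, and the reduced count.
        shuffle(words).forEach((word, ones) ->
                System.out.println(word + "\t" + ones + " -> " + reduce(ones)));
    }
}
```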

WordCountJob class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordCountJob {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the Job instance
        Job job = Job.getInstance();
        job.setJarByClass(WordCountJob.class);
        // Set the Mapper and its output types
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the Reducer and the final output types
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Path to the input file
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.80.128:8020/input/One Hundred Years of Solitude.txt"));
        // Path to the output directory
        FileOutputFormat.setOutputPath(job, new Path("d:/mr_result/wc01"));
        // Run the job and wait for it to finish
        boolean completion = job.waitForCompletion(true);
        if (completion) {
            System.out.println("The program ran successfully~");
        } else {
            System.out.println("Program failed to run~");
        }
    }
}

Execute locally

Create the input file wordcount.txt with the following content:

I love Guiyang I love Guizhou Guiyang is the capital of Guizhou

In the WordCountJob class, point the FileInputFormat.setInputPaths call at the local input file:

FileInputFormat.setInputPaths(job,new Path("d:/wordcount.txt"));

Right-click in the WordCountJob class and select [Run As] - [1 Java Application]

If the error shown in the following figure appears, the output path already exists; delete it and run the program again

A successful run looks like this:
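Deleting the old output folder by hand before every run gets tedious, so it can be scripted. The sketch below removes a local output directory recursively using only the Java standard library; the class OutputDirCleaner and the method deleteRecursively are hypothetical, and the path in main is the local output path used in WordCountJob. For an output path on HDFS, Hadoop's own FileSystem.delete(path, true) would be the counterpart; that variant is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

class OutputDirCleaner {
    /** Deletes dir and everything under it; returns false if dir did not exist. */
    static boolean deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that files and subfolders are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        // Local output path from WordCountJob; adjust it if yours differs.
        boolean removed = deleteRecursively(Paths.get("d:/mr_result/wc01"));
        System.out.println(removed ? "Old output removed." : "No old output to remove.");
    }
}
```

Run it before WordCountJob, or call deleteRecursively at the start of main in your own job class.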

The results are as follows:

Executing on a Hadoop pseudo-distributed cluster

Start the Hadoop services first with the command:

start-all.sh

Right-click the DFS Locations path and click Upload files to DFS

Upload the test file One Hundred Years of Solitude.txt to HDFS

In the WordCountJob class, point the FileInputFormat.setInputPaths call at the file on HDFS:

FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.80.128:8020/input/One Hundred Years of Solitude.txt"));

Right-click in the WordCountJob class and select [Run As] - [2 Run on Hadoop]

A successful run looks like this: