Spark Day03: Spark basic environment
02 - [understand] - outline of today's course content
This lesson mainly explains two topics: running Spark on a YARN cluster, and what an RDD is.
1. Spark on YARN: submit a Spark Application to run on a YARN cluster. This is the dominant run mode in enterprises and must be mastered.
   - How to configure it
   - How to submit an application to run
   - The two Deploy-Modes for Spark applications running on the cluster
     - yarn-client mode
     - yarn-cluster mode
2. What is an RDD?
   - RDD, Resilient Distributed Dataset: an abstract concept, equivalent to a collection such as a List, but distributed and able to store massive amounts of data
   - Introduction to the RDD data structure
     - Official definition of RDD, from the documentation and source code
     - The 5 key features of RDD (a must-ask interview question)
     - Word frequency count WordCount: see which RDDs are involved
   - RDD creation: how to encapsulate data into an RDD collection, the 2 ways to create an RDD
   - How to handle small files (interview question)
03 - [Master] - property configuration and service startup for Spark on YARN
Submitting Spark applications to run on a YARN cluster is very important; most enterprises run Spark on YARN. Documentation: http://spark.apache.org/docs/2.4.5/running-on-yarn.html
When a Spark Application runs on YARN, specify the master as yarn when submitting the application. The YARN cluster configuration (such as the ResourceManager address) must also be made known to Spark. In addition, to monitor the Spark Application, configure the related history server properties.

In an actual project you only need the configuration in 6.1.1 to 6.1.4. Since we are testing on virtual machines, also apply 6.1.5 to remove the resource check restriction.
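For reference, the configuration typically touches three files. The property names below are standard Spark/Hadoop settings, but the concrete values (host names, paths, ports) are assumptions based on this lab environment, not copied from the course material:

# spark-env.sh: tell Spark where the Hadoop/YARN configuration lives (paths are assumptions)
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop

# spark-defaults.conf: event logs and the history server used to monitor applications (addresses are assumptions)
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://node1.itcast.cn:8020/spark/eventLogs/
spark.yarn.historyServer.address   node1.itcast.cn:18080

<!-- yarn-site.xml: disable the memory resource checks on the test virtual machines (6.1.5) -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>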
04 - [Master] - Spark on YARN submission application
First submit the SparkPi example program to run on YARN. The command is as follows:
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
A screenshot of the YARN monitoring page after the run completes is shown below:

Set the resource information and submit the WordCount program to run on YARN. The command is as follows:
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--num-executors 2 \
--queue default \
--class cn.itcast.spark.submit.SparkSubmit \
hdfs://node1.itcast.cn:8020/spark/apps/spark-day02_2.11-1.0.0.jar \
/datas/wordcount.data /datas/swcy-output
After the WordCount application has run on YARN, click the application's history server link on the 8080 WEB UI page to view its run status information.

05 - [Master] - DeployMode differences between the two modes
When a Spark Application is submitted to run, its deploy mode (DeployMode) refers to where the Driver Program runs: either on the client that submits the application (client mode), or on a slave node inside the cluster (Standalone: Worker, YARN: NodeManager) (cluster mode).

- client mode
The default DeployMode is client, which means the application's Driver Program runs on the client host that submits the application (where the JVM process is started). The diagram is as follows:

Running the SparkPi program in client mode, the command is as follows:
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master spark://node1.itcast.cn:7077,node2.itcast.cn:7077 \
--deploy-mode client \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--total-executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
- cluster mode
If the application is run in cluster mode, the application's Driver Program runs on one of the cluster's slave nodes (a Worker machine).

Running the SparkPi program in cluster mode, the command is as follows:
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master spark://node1.itcast.cn:7077,node2.itcast.cn:7077 \
--deploy-mode cluster \
--supervise \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--total-executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
06 - [Master] - YARN Client mode of Spark on YARN
When the application runs on YARN, it consists of two parts:
- AppMaster, the application manager, applies for resources and schedules Job execution
- Processes running on NodeManagers, in which the Tasks run
When the Spark application runs on the cluster, it also has two parts:
- Driver Program, the application manager, applies for resources, runs Executors and schedules Job execution
- Executors, JVM processes in which Tasks are executed and data is cached
When the Spark application runs on the YARN cluster, what does the running architecture look like?
- YARN Client mode
When Spark runs on a YARN cluster with the client DeployMode, there are three kinds of processes:
- AppMaster, which requests resources and launches Executors
- Driver Program, which schedules Job execution and monitors it
- Executors, JVM processes in which Tasks are executed and data is cached
- YARN Cluster mode
When Spark runs on a YARN cluster with the cluster DeployMode, there are two kinds of processes:
- The Driver Program (AppMaster) performs both resource application and Job scheduling
- Executors, JVM processes in which Tasks are executed and data is cached
Therefore, when a Spark Application runs on YARN, the architecture differs depending on the deployment mode. Enterprise production environments are dominated by cluster mode, while client mode is used for development and testing. The difference between the two is a common interview question.
In the YARN Client mode, the Driver runs on the local machine where the task is submitted. The schematic diagram is as follows:

Run the word frequency WordCount program in YARN client mode:
/export/server/spark/bin/spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--num-executors 2 \
--queue default \
--class cn.itcast.spark.submit.SparkSubmit \
hdfs://node1.itcast.cn:8020/spark/apps/spark-day02_2.11-1.0.0.jar \
/datas/wordcount.data /datas/swcy-client

07 - [Master] - YARN Cluster mode of Spark on YARN
In YARN Cluster mode, the Driver runs inside a container on a NodeManager; in this case the Driver is merged into the AppMaster. The schematic diagram is as follows:

Taking the word frequency WordCount program as an example, submit the following command:
/export/server/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--num-executors 2 \
--queue default \
--class cn.itcast.spark.submit.SparkSubmit \
hdfs://node1.itcast.cn:8020/spark/apps/spark-day02_2.11-1.0.0.jar \
/datas/wordcount.data /datas/swcy-cluster
08 - [understand] - Spark application MAIN function code execution
When a Spark Application runs, regardless of client or cluster deploy mode, the code of the application's MAIN function is executed only after the Driver Program and Executors have been started. Take the word frequency WordCount program as an example.

In the picture above, A and B execute in the Executors because they operate on RDD data; C executes in the Driver, for example when functions with return values such as count or first are called and their results are returned to the Driver.
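Since the picture is not reproduced in these notes, the following minimal sketch (object name and input path are assumptions) illustrates the general rule: the functions passed to RDD operators run in the Executors, while the rest of the main function, including receiving the results of actions such as count or first, runs in the Driver.

// A minimal sketch, not the original course code: it only annotates where each
// piece of a WordCount main function runs (Driver vs Executors).
import org.apache.spark.{SparkConf, SparkContext}

object WhereCodeRuns {
  def main(args: Array[String]): Unit = {
    // Runs in the Driver: building SparkConf/SparkContext
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("WhereCodeRuns").setMaster("local[2]"))

    // Runs in the Driver: defining the RDD lineage (no data is processed yet)
    val inputRDD = sc.textFile("datas/wordcount.data")

    // The functions passed to flatMap/map/reduceByKey run in the Executors,
    // on each partition of the RDD data
    val resultRDD = inputRDD
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Actions: the per-partition work runs in the Executors, but the returned
    // values (count, first, ...) come back to the Driver
    val total: Long = resultRDD.count()
    println(s"number of distinct words = $total") // runs in the Driver

    sc.stop() // runs in the Driver
  }
}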

09 - [understand] - introduction of RDD concept
To handle large amounts of data, Spark internally uses a data structure called a Resilient Distributed Dataset (RDD) to store and compute on data. All computations and operations are based on the RDD data structure. In the Spark framework, data is encapsulated into a collection, an RDD; to process the data, you call functions on that RDD collection.

In other words, the core points of RDD design are:

Documentation: http://spark.apache.org/docs/2.4.5/rdd-programming-guide.html
10 - [Master] - official definition of RDD concept
RDD (Resilient Distributed Dataset), the resilient distributed dataset, is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be computed on in parallel.

The core points can be split into three aspects:

An RDD can be thought of as a distributed List or Array: an abstract data structure. In the source code, RDD is an abstract class (AbstractClass) with a generic type (Generic Type) parameter:
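In the Spark 2.4.5 source, the class declaration of org.apache.spark.rdd.RDD looks roughly like this (shown here in abridged form):

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging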

The core points of the RDD (resilient distributed dataset) are as follows:

11 - [Master] - Analysis of five characteristics of RDD concept
The RDD data structure has five internal features (excerpted from the RDD source code): the first three are mandatory; the last two are optional.

- First: a list of partitions

- Second: A function for computing each split

- Third: A list of dependencies on other RDDs

Each feature has a corresponding member in the RDD class (see the source excerpt at the end of this section):

- Fourth: Optionally, a Partitioner for key-value RDDs

- Fifth: Optionally, a list of preferred locations to compute each split on
- When computing each partition of the RDD, find the list of best (preferred) locations
- When computing, consider data locality: wherever the data is, try to place the Task there, so the data can be read and processed quickly

An RDD is a representation of a dataset: it describes not only the data itself but also where the data comes from and how to compute it. Its main attributes cover the five aspects above (keep them in mind and deepen your understanding through coding; they are frequently asked about in interviews).
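For reference, the members of the RDD abstract class that correspond to the five features look roughly like this (abridged from the Spark 2.4.5 source; the numbering comments are added here):

// Abridged from org.apache.spark.rdd.RDD (Spark 2.4.5)
abstract class RDD[T: ClassTag](...) {
  // 1. a list of partitions
  protected def getPartitions: Array[Partition]

  // 2. a function for computing each split (partition)
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // 3. a list of dependencies on other RDDs
  protected def getDependencies: Seq[Dependency[_]] = deps

  // 4. optionally, a Partitioner for key-value RDDs
  @transient val partitioner: Option[Partitioner] = None

  // 5. optionally, a list of preferred locations to compute each split on
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}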

12 - [Master] - RDD concept: RDDs in the word frequency count WordCount
Take the word frequency WordCount program as an example and look at the RDD types and dependencies across the whole Job. The WordCount program code is as follows:
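The original listing is not reproduced in these notes, so the following is a minimal sketch consistent with the description (the object name and the input path are assumptions); calling foreach on the final RDD is what triggers the Job:

// A minimal WordCount sketch: each step produces a new RDD, and the declared
// types show the lineage that appears in the Job's DAG
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = {
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[2]")
      SparkContext.getOrCreate(sparkConf)
    }

    val inputRDD: RDD[String] = sc.textFile("datas/wordcount.data")
    val wordsRDD: RDD[String] = inputRDD.flatMap(line => line.split("\\s+"))
    val tuplesRDD: RDD[(String, Int)] = wordsRDD.map(word => (word, 1))
    val countsRDD: RDD[(String, Int)] = tuplesRDD.reduceByKey(_ + _)

    // foreach is an action: it triggers the Job whose DAG is shown on the WEB UI
    countsRDD.foreach(tuple => println(tuple))

    sc.stop()
  }
}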

After running the program, view the WEB UI monitoring page. The DAG diagram for this Job (triggered by calling foreach on the RDD) is shown below:

13 - [Master] - two ways to create RDD
There are two main ways to encapsulate data into an RDD collection: parallelizing a local collection (held in the Driver Program), and referencing/loading a dataset from an external storage system (such as HDFS, Hive, HBase, Kafka, Elasticsearch, etc.).

The most commonly used method is textFile: it reads a text file from HDFS or LocalFS, specifying the file path and the number of RDD partitions.
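A minimal sketch demonstrating both creation methods (the object name, the sample data and the file path are assumptions):

// Two ways to create an RDD: parallelize a local collection, or load a
// dataset from an external storage system with textFile
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkCreateRDDTest {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = {
      val sparkConf = new SparkConf()
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[2]")
      SparkContext.getOrCreate(sparkConf)
    }

    // Way 1: parallelize a local collection held by the Driver Program
    val seq = Seq(1, 2, 3, 4, 5)
    val parallelRDD: RDD[Int] = sc.parallelize(seq, numSlices = 2)
    println(s"parallelize -> partitions = ${parallelRDD.getNumPartitions}")

    // Way 2: load a text file from an external storage system (HDFS or LocalFS),
    // specifying the path and the minimum number of partitions
    val fileRDD: RDD[String] = sc.textFile("datas/wordcount.data", minPartitions = 2)
    println(s"textFile -> count = ${fileRDD.count()}")

    sc.stop()
  }
}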

14 - [Master] - small file reading when creating RDD
In actual projects, the data being processed sometimes consists of small files (each file holds only a small amount of data, e.g. a few KB to tens of MB) while the number of files is large. If each file is read as one RDD partition, processing the data is time-consuming and performance is poor. Use the wholeTextFiles method provided by SparkContext, which is designed for reading small files.

Example: read 100 small files, each smaller than 1MB, and set the number of RDD partitions to 2.
package cn.itcast.spark.source

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Use the SparkContext#wholeTextFiles() method to read small files
 */
object _02SparkWholeTextFileTest {

  def main(args: Array[String]): Unit = {

    val sc: SparkContext = {
      // sparkConf object
      val sparkConf = new SparkConf()
        // _01SparkParallelizeTest$ -> (.stripSuffix("$")) -> _01SparkParallelizeTest
        .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
        .setMaster("local[2]")
      // sc instance object
      SparkContext.getOrCreate(sparkConf)
    }

    /*
      def wholeTextFiles(
          path: String,
          minPartitions: Int = defaultMinPartitions
      ): RDD[(String, String)]
      Key: name and path of each small file
      Value: contents of each small file
     */
    val inputRDD: RDD[(String, String)] = sc.wholeTextFiles("datas/ratings100", minPartitions = 2)
    println(s"RDD number of partitions = ${inputRDD.getNumPartitions}")
    inputRDD.take(2).foreach(tuple => println(tuple))

    // After the application finishes, close the resource
    sc.stop()
  }
}