Big Data Series: A Complete Collection of 23 Scala Fundamentals (recommended for bookmarking)

Official account: Data and intelligence, big data Club

This series of articles on big data is organized into three parts: technical ability, business foundation and analytical thinking. You will:

❖ gain confidence, handle interviews with ease, and land an internship or offer;

❖ master the basic knowledge of big data and communicate with colleagues without obstacles;

❖ build practical project skills and get started with big data work directly.

If you have any questions, feel free to visit my CSDN homepage. Comments, likes and bookmarks are my greatest support!

Big Data Engineer series column: Real interview questions, development experience and tuning strategies

Knowledge system of Big Data Engineer:

The era of big data has arrived

Over recent decades, the rapid development of the Internet has reached into every aspect of our lives, and the whole of human society has become connected by it. On the Internet we constantly produce data: product browsing records, transaction orders, video viewing history, visited web pages, search keywords, clicked advertisements, selfies and status updates shared with friends, and so on. These data are not only traces of our behavior but also the best evidence for describing who we are.

In March 2014, Ma Yun (Jack Ma) said in a speech in Beijing: "Mankind is moving from the IT era to the DT era." Seven years later, as he predicted, the era of big data has arrived.

What do big data engineers do?

In the era of big data, one key position has to be mentioned: the big data engineer. You may be curious about what big data engineers do every day.

1. Data collection: Identify the data that describes users or helps business development, define the relevant data formats, and hand them over to the business development department to collect the corresponding data.
2. ETL engineering: Clean, process and transform the collected data into a format that is convenient for subsequent analysis, and ensure data quality so that the results are reliable.
3. Data warehouse construction: Manage the data effectively, build a unified data warehouse, and establish connections between data sets so that they yield greater value together.
4. Data modeling: Based on the existing data, sort out the complex relationships between data sets and establish an appropriate data model that makes it easy to analyze the data and draw valuable conclusions.
5. Statistical analysis: Analyze the data statistically along various dimensions, establish a metric system, systematically describe the current state of the business, look for problems, and find new optimization and growth points.
6. User profiling: Based on all aspects of users' data, build a comprehensive understanding of users and a profile of each specific user, so that operations can be tailored to each individual.

Essential skills of a big data engineer

So what do you need in order to become a big data engineer and be competent at the work above? What knowledge do you need?

Classification | Sub-classification | Skill | Description
Technical ability | Programming fundamentals | Java basics | Java fundamentals required for the big data ecosystem
Technical ability | Programming fundamentals | Scala basics | Essential skill for the Spark ecosystem
Technical ability | Programming fundamentals | SQL basics | Common language for data analysts
Technical ability | Programming fundamentals | Advanced SQL | Skills required for complex analysis
Technical ability | Big data frameworks | HDFS & YARN | The cornerstone of the big data ecosystem
Technical ability | Big data frameworks | Hive basics | Common tool for big data analysis
Technical ability | Big data frameworks | Advanced Hive | Advanced equipment for big data analysts
Technical ability | Big data frameworks | Spark basics | Underlying operating principles needed for troubleshooting
Technical ability | Big data frameworks | Spark SQL | A sharp tool for complex tasks
Technical ability | Tools | Hue & Zeppelin | General exploration and analysis tools
Technical ability | Tools | Azkaban | Job management and scheduling platform
Technical ability | Tools | Tableau | Data visualization platform
Business foundation | - | Data collection | How is the data collected?
Business foundation | - | ETL engineering | How to clean, process and transform data?
Business foundation | - | Data warehouse fundamentals | How to do analysis-oriented data modeling?
Business foundation | - | Metadata center | How to do data governance well?
Analytical thinking | - | Data analysis methodology | How to analyze a specific problem?
Analytical thinking | - | Troubleshooting thinking | How to troubleshoot data problems efficiently?
Analytical thinking | - | Metric system | How to systematize the data?

There are three main reasons why Scala is so important:

1. Because of Spark

Most big data engineers first get to know Spark and then choose to learn Scala, because Spark is developed in Scala. Spark is now a killer application framework in the field of big data: wherever a big data platform is built, Spark is widely used to process and analyze data. To learn Spark well, you need to master Scala. Kafka, by the way, was also developed in Scala.

2. Seamless integration with the components of the big data ecosystem

Most components of the big data ecosystem are developed in Java. Scala is a JVM-based language that can be mixed seamlessly with Java, so it integrates well into the big data ecosystem.

3. Suitable for big data processing and machine learning

Scala's syntax is concise and expressive. Scala combines object-oriented and functional programming, which makes it powerful yet concise and well suited to processing all kinds of data. It therefore plays an important role in big data processing and machine learning.

 

This article covers the Scala fundamentals that big data analysts must master, organized as follows:

Part 1: Scala features. Explains object-oriented features, functional programming, static typing, extensibility and concurrency.

Part 2: Expressions. In Scala, everything is an expression, and understanding expressions is the prerequisite for understanding the syntax.

Part 3: Methods and functions. Explains the difference between the two and how to convert between them.

Part 4: Pattern matching. Explains several common patterns with examples.

Part 5: Scala traits. Explains the basic characteristics of traits with examples.

Part 6: Collection operations. Introduces the common collections and collection functions.

Part 7: Reading data sources. A brief introduction to how Scala reads a data source through the Source class.

Part 8: Implicit conversions and implicit parameters. Explains type conversion between Java and Scala, and introduces implicit parameters through an example.

Part 9: Regular expression matching. Explains how to write regex-related code.

Part 10: Exception handling. Introduces the differences between Scala and Java exceptions.

Part 11: Type hierarchy. Introduces Scala's type hierarchy.

Part 12: Basic numeric type conversion. Explains the problems often encountered when converting basic numeric types between Scala and Java.

 

Scala fundamentals

1, Scala features

Object-oriented features

Scala is a pure object-oriented language that thoroughly implements the idea that everything is an object. The types and behavior of objects are described by classes and traits. Scala introduces traits to improve on Java's object model, so that the functionality of classes can be extended by mixing in traits.

Functional programming

Scala is also a functional language, in which functions can be passed around as values. Scala provides a lightweight syntax for defining anonymous functions, supports higher-order functions, allows functions to be nested, and supports currying. Scala's case classes and built-in pattern matching correspond to the algebraic data types commonly used in functional programming languages.

Static typing

Scala has a highly expressive type system that ensures the safety and consistency of code through compile-time checks. Scala also has type inference, which lets developers omit redundant type annotations so that the code looks cleaner and is easier to read.

Extensibility

Scala's design is based on the observation that, in practice, domain-specific application development often requires domain-specific language extensions. Scala provides a number of language mechanisms that make it easy to add new language constructs seamlessly in the form of libraries.

2, Expressions

In Scala, everything is an expression. Scala favors expression syntax because it is very friendly to functional programming. For developers, expression syntax makes code concise and easy to read.

For example, when defining a method, we use the equals sign (=) just as when declaring a variable. To the left of the equals sign are the method name, the parameter list and the return type (which can be omitted), while to the right is an expression, often a multi-line block wrapped in braces ({}).

An expression always has a value. In Java, void declares a method without a return value; in Scala, such a method still returns a value, namely Unit. Unit has exactly one value, written (), which indicates that the method's result carries no information and can be ignored.
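The points above can be illustrated with a minimal sketch. The object name ExpressionDemo and its members are made up for illustration and are not part of the original article.

```scala
object ExpressionDemo {
  // A method is defined with "=": the body on the right is an expression,
  // and its value is the method's result.
  def abs(n: Int): Int = {
    if (n >= 0) n else -n
  }

  // if/else is itself an expression, so its value can be assigned directly.
  val label: String = if (abs(-3) > 2) "big" else "small"

  // A block in braces evaluates to its last expression.
  val sum: Int = {
    val a = 1
    val b = 2
    a + b
  }

  // A method called only for its side effect still returns a value: Unit.
  def log(msg: String): Unit = println(msg)
  val result: Unit = log("hello")
}
```

Here label and sum get their values from an if expression and a block, while result holds the single Unit value ().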

3, Methods and functions

When you first learn Scala, the concepts of method and function can seem blurred, and it may be unclear which one to use. How do you tell them apart? The key is where the definition lives: something defined with def as a member of a class (or object or trait) is a method, so a Scala method is part of a class. A function in Scala is a complete object that can be assigned to a variable. Methods and functions can be converted into each other; here we focus on how to turn a method into a function.

Converting a method into a function

As noted above, a method definition is itself an expression, so converting a method into a function is simple. There are two ways to write it. The first appends an underscore to the method name (the partially applied function syntax), which produces a new function value; this article calls it implicit conversion. The second assigns the method to a variable whose function type is written out, so the compiler converts the method because a function is expected; this article calls it explicit conversion. In both cases f1 is assumed to be a method that takes an Int and returns an Int.

1) Implicit conversion

val f2 = f1 _

2) Explicit conversion

val f2: (Int) => Int = f1
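As a runnable sketch of both forms, assume f1 is a simple method that adds one to its argument; the body of f1 and the names MethodToFunction, f3, mapped and composed are illustrative assumptions, not taken from the original.

```scala
object MethodToFunction {
  // A method: defined with def and owned by the enclosing object.
  def f1(x: Int): Int = x + 1

  // Underscore syntax: turns the method into a function value.
  val f2 = f1 _

  // Expected function type: the compiler converts the method automatically.
  val f3: Int => Int = f1

  // Function values are objects, so they can be passed to other methods
  // and combined with each other.
  val mapped: List[Int] = List(1, 2, 3).map(f2)
  val composed: Int => Int = f2.andThen(f3)
}
```

The underscore form is Scala 2 style; Scala 3 still accepts it, although there the method name alone is usually sufficient.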

4, Pattern matching

Pattern matching is a mechanism for checking whether a value matches a pattern. It is an upgraded version of Java's switch statement and can also replace a series of if/else statements. The following describes several common patterns: the constant pattern, the variable pattern and the wildcard pattern.

Constant pattern

A constant pattern matches a value against a constant.

object ConstantPattern {
  def main(args: Array[String]): Unit = {
    // The pattern matching result is returned as the function value
    def patternShow(x: Any) = x match {
      // constant patterns
      case 5 => "five"
      case true => "true"
      case "test" => "string"
      case null => "null value"
      case Nil => "empty list"
      // variable pattern
      case x => "variable"
      // wildcard pattern (never reached here, because the variable pattern above already matches everything)
      case _ => "wildcard"
    }
  }
}

Both variable patterns and wildcard patterns match any value. The difference is that after a variable pattern matches, the matched value is bound to the variable and can be referenced in the code that follows, whereas after a wildcard pattern matches, the matched value cannot be referenced. Note also that because cases are tried in order, variable patterns and wildcard patterns should be written last in the match expression.
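To make the ordering rule concrete, here is a small sketch in which the variable pattern comes last and uses the value it binds. The names MatchOrderDemo and describe are illustrative only.

```scala
object MatchOrderDemo {
  def describe(x: Any): String = x match {
    case 5      => "five"                // constant pattern
    case "test" => "string test"         // constant pattern
    case Nil    => "empty list"          // constant pattern: the empty List
    case other  => "variable: " + other  // variable pattern binds the value
  }

  val a = describe(5)        // matched by the constant 5
  val b = describe(List())   // List() equals Nil
  val c = describe(42)       // falls through to the variable pattern
}
```

If the variable pattern were moved to the top, every input would match it and the constant cases would never be tried.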

Type pattern

You can match on the type of the input value.

object TypePattern {
  def main(args: Array[String]): Unit = {
    // Type pattern
    def typePattern(t: Any) = t match {
      case t: String => "String"
      case t: Int => "Integer"
      case t: Double => "Double"
      case _ => "Other Type"
    }
  }
}

Case class pattern

The constructor pattern means that the case keyword is followed directly by the class constructor, and the parts to be matched are placed in the constructor parameters.

// Person must be a case class to be usable as a constructor pattern
case class Person(name: String, age: Int)

object CaseClassPattern {
  def main(args: Array[String]): Unit = {
    // Define a Person instance
    val p = new Person("nyz", 27)
    // Case class pattern
    def constructorPattern(p: Person) = p match {
      case Person(name, age) => "name =" + name + ", age =" + age
      case _ => "Other"
    }
  }
}

Pattern guard

To make a match more specific, you can use a pattern guard, that is, add an if condition after the pattern.

object ConstantPattern {
  def main(args: Array[String]): Unit = {
    // The pattern matching result is returned as the function value
    def patternShow(x: Any) = x match {
      // pattern guard
      case x if (x == 5) => "guard"
      // wildcard pattern
      case _ => "wildcard"
    }
  }
}

 

Option matching

In Scala, the Option type represents a value that may or may not exist. Its subclasses are Some and None: Some wraps a value, and None indicates that there is no value.

class OptionDemo {
  val map = Map(("a", 18), ("b", 81))
  // The type returned by the get method is Option[Int]
  map.get("b") match {
    case Some(x) => println(x)
    case None => println("does not exist")
  }
}
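Option matching combines naturally with the pattern guard described earlier. The following sketch reuses the same map; the names OptionMatchDemo, ages and lookup, as well as the age threshold of 60, are invented for illustration.

```scala
object OptionMatchDemo {
  val ages = Map("a" -> 18, "b" -> 81)

  def lookup(key: String): String = ages.get(key) match {
    // Pattern guard: only matches when the wrapped value passes the check.
    case Some(age) if age >= 60 => key + " is a senior aged " + age
    case Some(age)              => key + " is aged " + age
    case None                   => key + " does not exist"
  }

  // getOrElse supplies a default without writing a match at all.
  val ageOfC: Int = ages.getOrElse("c", 0)
}
```

For the simple "value or default" case, getOrElse is usually shorter than a full match expression.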

 

5, Scala traits

A Scala trait is comparable to a Java interface, but it is more powerful. Unlike a traditional Java interface, a trait can also define fields and method implementations.

A Scala class can inherit from only one parent class, but it can mix in multiple traits using the with keyword. If a class has no parent class, the first trait it mixes in uses the extends keyword, and the later traits use the with keyword.

A trait is defined in much the same way as a class, but with the keyword trait:

trait Equal {
  def isEqual(x: Any): Boolean
  def isNotEqual(x: Any): Boolean = !isEqual(x)
}

The Equal trait above consists of two methods, isEqual and isNotEqual. isEqual has no implementation, while isNotEqual has one. A subclass that inherits the trait implements the unimplemented methods.

The following is a complete example of a trait:

trait Equal {
  def isEqual(x: Any): Boolean
  def isNotEqual(x: Any): Boolean = !isEqual(x)
}

class Point(xc: Int, yc: Int) extends Equal {
  val x: Int = xc
  val y: Int = yc
  // Two points are considered equal when their x coordinates are equal
  def isEqual(obj: Any) =
    obj.isInstanceOf[Point] &&
    obj.asInstanceOf[Point].x == x
}

object Test {
  def main(args: Array[String]): Unit = {
    val p1 = new Point(2, 3)
    val p2 = new Point(2, 4)
    val p3 = new Point(3, 3)

    println(p1.isNotEqual(p2))
    println(p1.isNotEqual(p3))
    println(p1.isNotEqual(2))
  }
}

Running the code above produces the following output:

$ scalac Test.scala
$ scala -cp . Test
false
true
true
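The extends/with rule described at the start of this section can be sketched as follows. All names here (TraitMixinDemo, Greeter, Loud, Animal, Dog, Robot) are hypothetical and chosen only for illustration.

```scala
object TraitMixinDemo {
  trait Greeter {
    def name: String                         // abstract member
    def greet: String = "Hello, " + name     // concrete method in a trait
  }

  trait Loud extends Greeter {
    // Refines the inherited implementation.
    override def greet: String = super.greet.toUpperCase
  }

  class Animal

  // With a parent class: extend the class, then mix in traits using "with".
  class Dog extends Animal with Greeter { val name = "Dog" }

  // Without a parent class: the first trait uses "extends", later ones "with".
  class Robot extends Greeter with Loud { val name = "Robot" }
}
```

Dog inherits greet unchanged, while Robot picks up the overridden version from Loud.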

6, Collection operations

Common collections

The following code shows how to create the common collections:

// Define a list of integers. Its elements are stored linearly and may contain duplicates.
val x = List(1,2,3,4)
// Define a set. Its elements are not kept in any particular order, and there are no duplicates.
val x = Set(1,3,5,7)
// Define a map, a collection that maps keys to values. Each element of the map is a key-value pair.
val x = Map("one" -> 1, "two" -> 2, "three" -> 3)
// Create a tuple of two elements of different types. A tuple groups values of different types.
val x = (10, "Bigdata")
// Define an Option, a container that may or may not contain a value.
val x:Option[Int] = Some(5)

Collection functions

When working with Scala collections, operations generally fall into two types: transformations and actions. A transformation converts a collection into another collection; an action returns a value of some type.

1) Maximum and minimum

Let's begin with action functions. Finding the maximum or minimum value in a sequence is a very common requirement.

Here is a simple example.

val numbers = Seq(11, 2, 5, 1, 6, 3, 9)
numbers.max // 11
numbers.min // 1

For such a simple data set, getting the maximum and minimum takes a single call each. Now let's look at a more complex data set.

case class Book(title: String, pages: Int)
val books = Seq(
  Book("Future of Scala developers", 85),
  Book("Parallel algorithms", 240),
  Book("Object Oriented Programming", 130),
  Book("Mobile Development", 495))
// Returns Book(Mobile Development,495)
books.maxBy(book => book.pages)
// Returns Book(Future of Scala developers,85)
books.minBy(book => book.pages)

The minBy and maxBy methods handle this kind of complex data: they compare elements by the value that the given function extracts.

2) filter

filter returns a new collection containing only the elements that satisfy a condition, for example the even numbers in a sequence.

val numbers = Seq(1,2,3,4,5,6,7,8,9,10)
// Returns Seq(2,4,6,8,10)
numbers.filter(n => n % 2 == 0)

Get the books with 300 pages or more.

val books = Seq(
  Book("Future of Scala developers", 85),
  Book("Parallel algorithms", 240),
  Book("Object Oriented Programming", 130),
  Book("Mobile Development", 495))
// Returns Seq(Book("Mobile Development", 495))
books.filter(book => book.pages >= 300)

A similar method is filterNot, which does the opposite: it keeps the elements that do not satisfy the condition.
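A short sketch contrasting filter and filterNot on the same data; the object name FilterNotDemo and the use of partition are additions for illustration.

```scala
object FilterNotDemo {
  val numbers = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // filter keeps the elements for which the predicate is true.
  val evens = numbers.filter(n => n % 2 == 0)

  // filterNot keeps the elements for which the predicate is false.
  val odds = numbers.filterNot(n => n % 2 == 0)

  // partition returns both halves at once: (matching, non-matching).
  val (small, large) = numbers.partition(n => n <= 5)
}
```

When both the matching and the non-matching elements are needed, partition avoids traversing the collection twice.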

3) flatten

flatten expands a collection of collections into a single new collection. For example:

val abcd = Seq('a', 'b', 'c', 'd')
val efgj = Seq('e', 'f', 'g', 'h')
val ijkl = Seq('i', 'j', 'k', 'l')
val mnop = Seq('m', 'n', 'o', 'p')
val qrst = Seq('q', 'r', 's', 't')
val uvwx = Seq('u', 'v', 'w', 'x')
val yz   = Seq('y', 'z')

val alphabet = Seq(abcd, efgj, ijkl, mnop, qrst, uvwx, yz)
alphabet.flatten

After execution, return the following set:

List('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z')

4) Set operation functions

The set operations are difference, intersection and union.

val num1 = Seq(1, 2, 3, 4, 5, 6)
val num2 = Seq(4, 5, 6, 7, 8, 9)

// Returns List(1, 2, 3)
num1.diff(num2)
// Returns List(4, 5, 6)
num1.intersect(num2)
// Returns List(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9)
num1.union(num2)
// Removes the duplicates after merging and returns List(1, 2, 3, 4, 5, 6, 7, 8, 9)
num1.union(num2).distinct

5) map

The map function traverses the collection and applies the supplied function to each element, returning a new collection of the results.

val numbers = Seq(1,2,3,4,5,6)
// Returns List(2, 4, 6, 8, 10, 12)
numbers.map(n => n * 2)

val chars = Seq('a', 'b', 'c', 'd')
// Returns List(A, B, C, D)
chars.map(ch => ch.toUpper)

6) flatMap

flatMap combines map and flatten. See the following example.

val abcd = Seq('a', 'b', 'c', 'd')
// Returns List(A, a, B, b, C, c, D, d)
abcd.flatMap(ch => List(ch.toUpper, ch))

The result shows that map is applied first, and then flatten.
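The equivalence can be checked directly. In this sketch the names FlatMapDemo, viaFlatMap, viaMapThenFlatten and words are illustrative, and the word-splitting lines are made-up sample data.

```scala
object FlatMapDemo {
  val abcd = Seq('a', 'b', 'c', 'd')

  // One step: each element produces a list, and the lists are concatenated.
  val viaFlatMap = abcd.flatMap(ch => List(ch.toUpper, ch))

  // Two steps: map to a Seq of Lists, then flatten it.
  val viaMapThenFlatten = abcd.map(ch => List(ch.toUpper, ch)).flatten

  // A typical use in data processing: splitting lines into words.
  val words = Seq("hello big data", "hello scala").flatMap(line => line.split(" ").toSeq)
}
```

Both viaFlatMap and viaMapThenFlatten contain the same elements in the same order.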

 

7) forall & exists

forall tests the whole collection: it returns true only when every element satisfies the condition. exists returns true as soon as at least one element satisfies the condition.

val numbers = Seq(3, 7, 2, 9, 6, 5, 1, 4, 2)
// Returns true
numbers.forall(n => n < 10)
// Returns false
numbers.forall(n => n > 5)
// Returns true
numbers.exists(n => n > 5)

7, Reading data sources

Reading external data sources is a common requirement in development, for example reading and parsing an external configuration file to obtain execution parameters. This section only briefly introduces how Scala reads a data source through the Source class.

import scala.io.Source

object ReadFile {
  // Read the configuration file from the classpath
  val file = Source.fromInputStream(this.getClass.getClassLoader.getResourceAsStream("app.conf"))

  // Read the file line by line. getLines() returns an iterator over all lines of the file.
  def readLine: Unit = {
    for (line <- file.getLines()) {
      println(line)
    }
  }

  // Read content from the network
  def readNetwork: Unit = {
    val file = Source.fromURL("http://www.baidu.com")
    for (line <- file.getLines()) {
      println(line)
    }
  }

  // Read the given string - mostly used for debugging
  val source = Source.fromString("test")
}
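Because Source.fromString needs no file or network access, it is convenient for trying out parsing logic. The sketch below parses key=value configuration text; the configuration format, the object name SourceDemo and the method parseConf are assumptions made for this example, not something the original specifies.

```scala
import scala.io.Source

object SourceDemo {
  // Parses "key = value" lines, skipping blank lines and "#" comments.
  def parseConf(text: String): Map[String, String] = {
    val source = Source.fromString(text)
    try {
      source.getLines()
        .map(_.trim)
        .filter(line => line.nonEmpty && !line.startsWith("#"))
        .map(_.split("=", 2))
        // Lines without "=" do not match Array(k, v) and are dropped.
        .collect { case Array(k, v) => k.trim -> v.trim }
        .toMap
    } finally {
      // A Source should be closed once it is no longer needed.
      source.close()
    }
  }
}
```

The same getLines pipeline would work unchanged on a Source obtained from a file or an input stream.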

8, Implicit conversion

Implicit conversion is a distinctive Scala feature that Java does not have. It converts an object of one type into an object of another type. In data analysis, the most common use is converting between Java and Scala collections; after the conversion, the methods of the other type can be called. Scala provides the scala.collection.JavaConversions object: once the corresponding implicit conversion method is imported, one type can be used where the other is required. Note that JavaConversions was deprecated in Scala 2.12 and removed in Scala 2.13, so the imports below apply to older Scala versions.

For example, with the following import, scala.collection.mutable.Buffer is automatically converted to java.util.List.

import scala.collection.JavaConversions.bufferAsJavaList
scala.collection.mutable.Buffer => java.util.List

Similarly, java.util.List can be converted to scala.collection.mutable.Buffer.

import scala.collection.JavaConversions.asScalaBuffer
java.util.List => scala.collection.mutable.Buffer

All available conversions are summarized below. A two-way arrow means the types can be converted into each other; a one-way arrow means only the left side can be converted to the right side.

import scala.collection.JavaConversions._

scala.collection.Iterable <=> java.lang.Iterable
scala.collection.Iterable <=> java.util.Collection
scala.collection.Iterator <=> java.util.{ Iterator, Enumeration }
scala.collection.mutable.Buffer <=> java.util.List
scala.collection.mutable.Set <=> java.util.Set
scala.collection.mutable.Map <=> java.util.{ Map, Dictionary }
scala.collection.concurrent.Map <=> java.util.concurrent.ConcurrentMap

scala.collection.Seq         => java.util.List
scala.collection.mutable.Seq => java.util.List
scala.collection.Set         => java.util.Set
scala.collection.Map         => java.util.Map
java.util.Properties   => scala.collection.mutable.Map[String, String]
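On Scala 2.13 and later, where JavaConversions no longer exists, the same conversions are written explicitly with asScala and asJava from scala.jdk.CollectionConverters. The following is a minimal sketch assuming Scala 2.13 or newer; the object name ConvertersDemo and the sample values are illustrative.

```scala
import scala.jdk.CollectionConverters._

object ConvertersDemo {
  // A plain Java list, as a Java library might return it.
  val javaList = new java.util.ArrayList[String]()
  javaList.add("spark")
  javaList.add("kafka")

  // java.util.List => scala.collection.mutable.Buffer (a wrapper, not a copy)
  val scalaBuffer: scala.collection.mutable.Buffer[String] = javaList.asScala

  // Scala collection methods are now available.
  val upper: List[String] = scalaBuffer.map(_.toUpperCase).toList

  // scala.collection.Seq => java.util.List
  val backToJava: java.util.List[String] = upper.asJava
}
```

The explicit calls make it visible where a conversion happens, which is the main difference from the implicit JavaConversions style.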

Implicit parameters

An implicit parameter is a parameter marked with implicit in the definition of a function or method. When the function or method is called without that argument, Scala looks in scope for a value marked with implicit whose type matches, that is, an implicit value, and passes it in for the function body to use. For example:

class SayHello {
  def write(content: String) = println(content)
}
implicit val sayHello: SayHello = new SayHello

def saySomething(name: String)(implicit sayHello: SayHello): Unit = {
  sayHello.write("Hello," + name)
}

saySomething("Scala")
// Prints Hello,Scala

Note that implicit parameters are resolved by type, so two equally eligible implicit values of the same type must not be in scope at the same time; otherwise the compiler reports an "ambiguous implicit values" error.
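An implicit parameter can also be supplied explicitly, which is one way to sidestep implicit lookup for a single call. This sketch uses hypothetical names (ImplicitParamDemo, Prefix, greet, defaultPrefix) and returns strings instead of printing so that the results are easy to inspect.

```scala
object ImplicitParamDemo {
  class Prefix(val value: String)

  // The second parameter list is implicit: it is filled in from scope
  // when the caller does not provide it.
  def greet(name: String)(implicit prefix: Prefix): String =
    prefix.value + ", " + name

  // The only implicit Prefix in scope, so resolution is unambiguous.
  implicit val defaultPrefix: Prefix = new Prefix("Hello")

  val viaImplicit = greet("Scala")                     // uses defaultPrefix
  val viaExplicit = greet("Scala")(new Prefix("Hi"))   // argument passed by hand
}
```

Giving implicit values an explicit type annotation, as done for defaultPrefix, keeps the code portable, since Scala 3 requires it.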

 

9, Regular expression matching

The concepts, uses and rules of regular expressions were fully described in the previous article "Introduction to big data analysis engineers -- 1. Fundamentals of Java". Here we explain through examples how to write regex-related code in Scala:

Definition

val TEST_REGEX = "home\\*(classification|foundation|my_tv)\\*[0-9-]{0,2}([a-z_]*)".r

Usage

// path is the string to be matched; launcher_area_code and launcher_location_code are variables declared elsewhere
TEST_REGEX.findFirstMatchIn(path) match {
  case Some(p) => {
    // Get the content matched by the first parenthesized group in TEST_REGEX
    launcher_area_code = p.group(1)
    // Get the content matched by the second parenthesized group in TEST_REGEX
    launcher_location_code = p.group(2)
  }
  // No match: leave the variables unchanged
  case None =>
}
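The snippet above can be turned into a self-contained sketch that returns the two groups instead of assigning them to outside variables. The sample paths used with it are invented, and the names RegexDemo, parse and parseExact are illustrative.

```scala
object RegexDemo {
  val TEST_REGEX = "home\\*(classification|foundation|my_tv)\\*[0-9-]{0,2}([a-z_]*)".r

  // findFirstMatchIn looks for the pattern anywhere in the input
  // and returns Option[Match].
  def parse(path: String): Option[(String, String)] =
    TEST_REGEX.findFirstMatchIn(path).map(m => (m.group(1), m.group(2)))

  // A Regex can also be used as an extractor in a pattern match;
  // in that form the whole input must match the pattern.
  def parseExact(path: String): Option[(String, String)] = path match {
    case TEST_REGEX(area, location) => Some((area, location))
    case _                          => None
  }
}
```

For a hypothetical path such as "home*my_tv*1-history", parse yields the pair ("my_tv", "history"); returning an Option forces the caller to handle the no-match case.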

10, Exception handling

Anyone who has studied Java will be familiar with exceptions. An exception is an important way to interrupt execution when a program runs into a problem. The precautions for exception handling were covered in the previous article "Introduction to big data analysis engineers -- 1. Fundamentals of Java" and are not repeated here. Let's focus on how Scala and Java differ in their design of exceptions.

1. Exceptions are caught in a slightly different way

In Java, different types of exceptions are caught with multiple catch clauses, while in Scala they are caught with a single catch clause and pattern matching. As shown in the figure below:
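The figure is not reproduced in this copy. As a stand-in, the following sketch shows a single catch clause that distinguishes exception types by pattern matching; the object name CatchDemo, the method divide and the messages are illustrative assumptions.

```scala
object CatchDemo {
  def divide(a: String, b: String): String =
    try {
      (a.toInt / b.toInt).toString
    } catch {
      // One catch clause, several cases: each case is a type pattern.
      case _: NumberFormatException => "not a number"
      case _: ArithmeticException   => "division by zero"
      case e: Exception             => "unexpected: " + e
    }
}
```

As with any match expression, the cases are tried in order, so the more specific exception types should come before the general ones.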

2. Scala has no checked exceptions

In Java, checked (non-runtime) exceptions are enforced at compile time: they must either be handled with try/catch or be declared with the throws keyword so that the caller handles them. Scala instead encourages reducing the reliance on exceptions and their handling through functional constructs and strong typing. Therefore, Scala does not have checked exceptions.

When Scala code calls a Java class library, the checked exceptions declared in the Java code are treated as unchecked exceptions.

3. A throw expression has a type in Scala

In Scala's design, all expressions have a type, and the throw expression is no exception: its type is Nothing. Since Nothing is a subtype of every type, a throw expression can appear anywhere without affecting type inference.
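A brief sketch of why the Nothing type matters for inference; the names ThrowDemo, half, fail and firstOrFail are made up for this example.

```scala
object ThrowDemo {
  // The if branch has type Int and the else branch has type Nothing,
  // so the whole expression is still typed as Int.
  def half(n: Int): Int =
    if (n % 2 == 0) n / 2
    else throw new IllegalArgumentException("odd number: " + n)

  // A helper that never returns normally can declare Nothing as its result.
  def fail(msg: String): Nothing = throw new IllegalStateException(msg)

  // Because Nothing conforms to Int, fail(...) fits where an Int is expected.
  def firstOrFail(xs: List[Int]): Int = xs.headOption.getOrElse(fail("empty list"))
}
```

Calling half(3) or firstOrFail(Nil) throws at runtime, but both methods type-check as returning Int.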

11, Type hierarchy

In Scala, all values have a type, including numeric values and functions, which implements the idea that everything is an object more thoroughly than Java does. Scala therefore has its own type hierarchy, as shown in the following figure:

(Figure: the Scala type hierarchy; the original image was taken from the Internet.)

As shown in the figure, the top-level type in Scala is Any, which has two subclasses, AnyVal and AnyRef. AnyVal is the parent of all value types, which include the special type Unit. AnyRef is the parent of all reference types; all Java types and all non-value Scala types are its subclasses. There are also two special bottom types. One is Null, which is a subtype of all reference types, so its value null can be assigned to any reference-type variable. The other is Nothing, which is a subtype of every type, both reference types and value types; it has no values, but an expression of type Nothing, such as throw, can be used wherever a value of any type is expected.
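The relationships described above can be checked by the compiler with a few typed declarations. This is a sketch only; the object name TypeHierarchyDemo and the sample values are illustrative.

```scala
object TypeHierarchyDemo {
  val number: AnyVal = 42         // Int is a value type
  val unit: AnyVal = ()           // Unit is also a value type
  val text: AnyRef = "scala"      // String is a reference type

  // Any is the common supertype of value types and reference types.
  val mixed: List[Any] = List(number, unit, text)

  // Null is a subtype of every reference type, so null fits a String variable.
  val missing: String = null

  // Nothing is a subtype of every type, so an empty List[Nothing]
  // conforms to both List[Int] and List[String].
  val empty: List[Nothing] = Nil
  val ints: List[Int] = empty
  val strs: List[String] = empty
}
```

Each declaration compiles only because the subtype relationship it relies on holds.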

12, Basic numeric type conversion

Conversion of basic numeric types between Java and Scala usually happens automatically and needs no special handling, so in our experience the basic data types of Java and Scala connect seamlessly. There is one exception, however: when you call a third-party Java class library whose method receives parameters of type Object and then checks the actual value type of the incoming object, the call usually fails with an error.

The reason is simple. The third-party library is written in Java and only recognizes Java types. When the receiving parameter is of type Object, Scala does not convert the value to the Java wrapper type by default, so when the library checks the concrete value type of the object, it fails because it does not recognize the Scala type.

The solution is also simple: wrap the value in the Java type manually before passing it to the third-party method. The following code example shows how to handle Scala types passed to the DBUtils class library; only part of the code is shown:

// Because the short names of the types in Java and Scala are the same, rename them on import to avoid ambiguity
import java.lang.{Long => JLong, Double => JDouble}
// conn is the database connection, and sql is the SQL statement to be executed
queryRunner.update(conn, sql, new JLong(1L), new JDouble(2.2))
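The wrapping step can be tried without a database. In the sketch below, describeParams is a hypothetical stand-in for a Java library method that accepts Object varargs; it is not part of DBUtils. The values are boxed with valueOf, since the wrapper constructors are deprecated in recent Java versions.

```scala
import java.lang.{Long => JLong, Double => JDouble}

object BoxingDemo {
  // Hypothetical stand-in for a Java method with an Object... parameter:
  // it inspects the runtime class of each argument.
  def describeParams(params: AnyRef*): Seq[String] =
    params.map(p => p.getClass.getSimpleName)

  // Explicitly boxed values are ordinary java.lang objects.
  val described: Seq[String] =
    describeParams(JLong.valueOf(1L), JDouble.valueOf(2.2))
}
```

After explicit boxing, the receiving code sees java.lang.Long and java.lang.Double, which is what a Java library's type checks expect.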

Summary

Drawing on practical work experience, this article has walked through the Scala knowledge points that are used most often. To become a junior big data engineer, you must master them.

 

Big data club, machine learning, python & Java

 

 


Keywords: Scala Big Data

Added by Wolphie on Wed, 02 Feb 2022 03:48:28 +0200