[Spark] RDD action operators

An action operator is a method that triggers job execution. reduce — function signature: def reduce(f: (T, T) => T): T. Function description: aggregates all elements in the RDD, first aggregating the data within each partition, then aggregating across partitions. @Test def reduce(): Unit = { val rdd = sc.makeRDD(List(1,2,3,4)) ...
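A complete, runnable version of that sketch (a standalone object standing in for the article's @Test method; the local master and object name are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object ReduceDemo {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for demonstration (assumed setup)
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("reduce"))
    val rdd = sc.makeRDD(List(1, 2, 3, 4))
    // reduce is an action: it aggregates within each partition first,
    // then across partitions, and triggers job execution immediately
    val sum = rdd.reduce(_ + _)
    println(sum)  // 10
    sc.stop()
  }
}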

Added by nascarjunky on Wed, 05 Jan 2022 02:37:28 +0200

Spark SQL functions.scala source code parsing: string functions (based on Spark 3.3.0)

Preface: This article belongs to the column "1000 Problems to Solve the Big Data Technology System" and is original work by the author. Please indicate the source when quoting, and please point out any deficiencies or errors in the comment area. Thank you! Please refer to the table of contents and references for this column's 1000 problems ...

Added by davelr459 on Tue, 04 Jan 2022 13:45:58 +0200

Spark introduction: deployment, principles, and development environment construction

Introduction to Spark: Spark is a fast, general-purpose, and scalable in-memory big data analysis and computation engine. It is a general-purpose in-memory parallel computing framework developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley ...
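To sanity-check a freshly built development environment, a minimal Spark application might look like the following sketch (assumptions: a local master and a project with the spark-core dependency; the object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("HelloSpark")
    val sc = new SparkContext(conf)
    // In-memory computation: the intermediate data stays in memory
    // between the map stage and the final aggregation
    val result = sc.makeRDD(1 to 100).map(_ * 2).sum()
    println(result)  // 10100.0
    sc.stop()
  }
}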

Added by benreisner on Mon, 03 Jan 2022 22:14:19 +0200

[Review] Spark core programming --- RDD

To process data with high concurrency and high throughput across different application scenarios, the Spark computing framework encapsulates three data structures: RDD, the resilient distributed dataset; accumulator, a distributed shared write-only variable; and broadcast variable, a distributed shared read-only variable ...
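A brief sketch showing all three in action (assumed local setup; the data and names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("shared"))
    val rdd = sc.makeRDD(List(1, 2, 3, 4))

    // Accumulator: executors only write to it; the driver reads the merged result
    val sum = sc.longAccumulator("sum")
    rdd.foreach(x => sum.add(x))
    println(sum.value)  // 10

    // Broadcast variable: shipped to each executor once, read-only there
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
    rdd.map(x => lookup.value(x)).collect().foreach(println)

    sc.stop()
  }
}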

Added by faraco on Mon, 03 Jan 2022 03:37:59 +0200

Big data -- Introduction to Algorithms in Spark GraphX

1. The ConnectedComponents algorithm. ConnectedComponents, the connected-components algorithm, labels each connected component of the graph with an id, taking the id of the lowest-numbered vertex in the component as the id of that component. When the graph is as follows: //Create vertices val vertexRDD: RDD[(VertexId, (String,Int)) ...
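A runnable sketch of the idea (the vertex attributes and sample graph are illustrative; only the RDD element type matches the article's snippet):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ConnectedComponentsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("cc"))
    // Vertices: (id, (name, age)) -- attribute values are made up for the demo
    val vertexRDD: RDD[(VertexId, (String, Int))] = sc.makeRDD(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (6L, ("Dave", 42))
    ))
    // Edges: vertices 1-2-3 form one component; vertex 6 is isolated
    val edgeRDD: RDD[Edge[Int]] = sc.makeRDD(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
    val graph = Graph(vertexRDD, edgeRDD)
    // Each vertex is labelled with the smallest vertex id in its component
    graph.connectedComponents().vertices.collect().foreach(println)
    // e.g. (1,1), (2,1), (3,1), (6,6)
    sc.stop()
  }
}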

Added by nvee on Wed, 29 Dec 2021 05:09:54 +0200

Spark SQL learning notes -- DataFrame, Dataset and SQL parsing principles

Contents: 1. SparkSession, DataFrame and Dataset; 2. Spark SQL parsing (overall overview; key objects in SQL syntax parsing); 3. Spark LogicalPlan (overall overview; the LogicalPlan class hierarchy; generating the analyzed LogicalPlan). 1. SparkSession, DataFrame and Dataset. To use Spark SQL functionality, you need to create a ...
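A minimal sketch of that entry point (assumed local setup, e.g. pasted into spark-shell; the data and names are illustrative):

import org.apache.spark.sql.SparkSession

// SparkSession is the entry point for all Spark SQL functionality
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sparksql-demo")
  .getOrCreate()
import spark.implicits._  // enables .toDF / .toDS conversions

// DataFrame = Dataset[Row]; a Dataset adds a typed view over the same plan
val df = Seq(("Alice", 28), ("Bob", 27)).toDF("name", "age")
val ds = Seq("Alice", "Bob").toDS()  // Dataset[String]

df.createOrReplaceTempView("people")
// The SQL string below goes through parsing -> analyzed LogicalPlan ->
// optimized plan -> physical plan, which is what the article dissects
spark.sql("SELECT name FROM people WHERE age > 27").show()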

Added by cute_girl on Fri, 24 Dec 2021 03:44:53 +0200

Spark source code series: spark job submission process

Spark jobs can be submitted through spark-shell, spark-sql, or spark-submit, but all of them ultimately go through spark-submit. The spark-submit example introduced in the previous article:

# submit a job in Spark local mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local \
  /path/to/examples.jar \
  1000

Le ...

Added by wmac on Sun, 19 Dec 2021 13:12:36 +0200

Big data -- Spark Structured Streaming

1. Shortcomings of Spark Streaming. In 2016, Apache Spark launched the Structured Streaming project, a new stream computing engine built on Spark SQL that lets users write high-performance stream processing programs as easily as batch programs. Structured Streaming is not a simple improvement on Spark Streaming, but a new stre ...
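To make the "as easily as batch" claim concrete, here is a minimal streaming word-count sketch (assumed setup, not the article's code: run in spark-shell, with the socket test source fed by nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("structured-streaming-demo")
  .getOrCreate()
import spark.implicits._

// An unbounded table of input lines; the socket source is for testing only
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations you would write for a batch word count
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")  // emit the full updated result table on each trigger
  .format("console")
  .start()
  .awaitTermination()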

Added by Gamic on Sat, 18 Dec 2021 07:26:14 +0200

[CDH 6.3.X] Cloudera Manager 6.3.2, CDH 6.3.1 installation process

Software packages required during deployment. Link: https://pan.baidu.com/s/1UajMORVvQ_VSLOdVkJWYQQ Extraction code: e28y. Link: https://pan.baidu.com/s/1dMj8JEaRIOaXP53W2kF_mQ Extraction code: xbyo. Key steps: set the host name in FQDN format; turn off the firewall; disable IPv6; configure a local HTTP service; configure local storage; the MySQL JDBC driver ...

Added by Jacquelyn L. Ja on Sat, 18 Dec 2021 06:00:54 +0200

An introductory summary of PySpark feature engineering

PySpark feature tools. 1. Data preparation. We define some test data to verify that the functions work as intended; at the same time, for most beginners, understanding what a function's input and output look like makes it easier to understand and use the feature functions: df = spark.createDataFrame([ ('zhu', "Hi I h ...

Added by gumby51 on Tue, 14 Dec 2021 21:04:34 +0200