[Spark] RDD action operators
An action operator is a method that triggers job execution.
reduce
Function signature: def reduce(f: (T, T) => T): T
Function description: aggregates all the elements of the RDD, first aggregating the data within each partition and then aggregating the partial results across partitions.
@Test
def reduce(): Unit = {
  val rdd = sc.makeRDD(List(1,2,3,4)) ...
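The excerpt is cut off above. A self-contained sketch of how such a reduce test typically plays out; the article's test presumably reuses a shared sc from its test class, so the local SparkContext, object wrapper, and two-partition split here are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object ReduceDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduce-demo")
    val sc = new SparkContext(conf)
    // Two partitions: (1, 2) and (3, 4)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // Aggregates inside each partition first (1+2=3, 3+4=7),
    // then across partitions: 3 + 7 = 10
    println(rdd.reduce(_ + _)) // 10
    sc.stop()
  }
}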
Added by nascarjunky on Wed, 05 Jan 2022 02:37:28 +0200
Spark SQL functions.scala source code analysis: string functions (based on Spark 3.3.0)
Preface
This article belongs to the author's original column "1000 problems to solve in the big data technology system"; please indicate the source when quoting, and please point out any deficiencies or errors in the comment area. Thank you!
Please refer to the column's table of contents and references: 1000 problems ...
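For orientation, here is a small sketch of the kind of string functions defined in functions.scala that the article analyzes. The sample data and column names are invented for illustration; upper, length, and concat_ws are standard Spark SQL functions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{upper, length, concat_ws}

object StringFunctionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("string-funcs").getOrCreate()
    import spark.implicits._

    val df = Seq(("spark", "sql"), ("string", "functions")).toDF("a", "b")
    // upper, length, and concat_ws are all defined in functions.scala
    df.select(upper($"a"), length($"b"), concat_ws("-", $"a", $"b")).show()
    spark.stop()
  }
}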
Added by davelr459 on Tue, 04 Jan 2022 13:45:58 +0200
Spark introduction, deployment, principles, and development environment setup
Introduction to Spark
Spark is a fast, general-purpose, and scalable in-memory engine for big data analysis and computation.
It is a general-purpose in-memory parallel computing framework developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley ...
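As a quick way to verify that a freshly built development environment works, a minimal word-count application is a common choice; the sketch below assumes Scala, a local master, and a placeholder input path:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("word-count").getOrCreate()
    val sc = spark.sparkContext
    // Replace the path with a real text file on your machine
    val counts = sc.textFile("data/input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    spark.stop()
  }
}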
Added by benreisner on Mon, 03 Jan 2022 22:14:19 +0200
[Review] Spark core programming: RDD
To process data with high concurrency and high throughput, the Spark computing framework encapsulates three data structures for different application scenarios:
- RDD: resilient distributed dataset
- Accumulator: distributed shared write-only variable
- Broadcast variable: distributed shared read-only variable ...
A minimal example of all three is sketched below.
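The values and names in this sketch are invented for illustration:

import org.apache.spark.sql.SparkSession

object CoreStructuresDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("core-structs").getOrCreate()
    val sc = spark.sparkContext

    // RDD: resilient distributed dataset
    val rdd = sc.makeRDD(List(1, 2, 3, 4))

    // Accumulator: written by executors, read back on the driver
    val sum = sc.longAccumulator("sum")
    rdd.foreach(x => sum.add(x))
    println(sum.value) // 10

    // Broadcast variable: read-only value shared with all executors
    val factor = sc.broadcast(10)
    println(rdd.map(_ * factor.value).collect().mkString(",")) // 10,20,30,40

    spark.stop()
  }
}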
Added by faraco on Mon, 03 Jan 2022 03:37:59 +0200
Big data -- Introduction to Algorithms in Spark GraphX
1. The ConnectedComponents algorithm
ConnectedComponents, the connected-components algorithm, labels every connected component of the graph with an id, using the id of the lowest-numbered vertex in the component as the component's id.
When the graph is as follows:
// Create vertices
val vertexRDD: RDD[(VertexId, (String, Int))] = ...
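The code excerpt is truncated above. A self-contained sketch of running ConnectedComponents on a small hand-made graph; the vertex attributes and edges are invented, with two disjoint pairs giving two components:

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

object ConnectedComponentsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cc-demo").getOrCreate()
    val sc = spark.sparkContext

    // Vertices: (id, (name, age)), matching the article's (String, Int) attribute type
    val vertexRDD: RDD[(VertexId, (String, Int))] = sc.makeRDD(Seq(
      (1L, ("a", 28)), (2L, ("b", 27)), (3L, ("c", 65)), (4L, ("d", 42))
    ))
    // Edges: 1-2 and 3-4 form two separate components
    val edgeRDD: RDD[Edge[Int]] = sc.makeRDD(Seq(Edge(1L, 2L, 0), Edge(3L, 4L, 0)))

    val graph = Graph(vertexRDD, edgeRDD)
    // Each vertex is labelled with the smallest vertex id in its component
    graph.connectedComponents().vertices.collect().foreach(println)
    // e.g. (1,1), (2,1), (3,3), (4,3)
    spark.stop()
  }
}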
Added by nvee on Wed, 29 Dec 2021 05:09:54 +0200
Spark SQL learning notes -- DataFrame, Dataset and SQL parsing principles
Contents
1. SparkSession, DataFrame and Dataset
2. Spark SQL parsing
1. Overall overview
2. Key objects in SQL syntax parsing
3. Spark LogicalPlan
1. Overall overview
2. The LogicalPlan class hierarchy
3. Generation of the Analyzed LogicalPlan
1. SparkSession, DataFrame and Dataset
1. To use Spark SQL functionality, you first need to create a SparkSession ...
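The line above is cut off; a minimal sketch of the SparkSession bootstrap it presumably describes (the master, app name, and sample data are illustrative):

import org.apache.spark.sql.SparkSession

object SparkSessionDemo {
  def main(args: Array[String]): Unit = {
    // The SparkSession is the single entry point for Spark SQL
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sparksql-demo")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq((1, "a"), (2, "b")).toDS()  // Dataset[(Int, String)]
    val df = ds.toDF("id", "name")           // DataFrame = Dataset[Row]
    df.createOrReplaceTempView("t")
    spark.sql("SELECT id, name FROM t WHERE id > 1").show()
    spark.stop()
  }
}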
Added by cute_girl on Fri, 24 Dec 2021 03:44:53 +0200
Spark source code series: Spark job submission process
Spark supports submitting jobs through spark-shell, spark-sql, and spark-submit, but in the end the code is always submitted via spark-submit. The spark-submit example introduced in the previous article:
# Submit a job in Spark local mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local \
  /path/to/examples.jar \
  1000
Le ...
Added by wmac on Sun, 19 Dec 2021 13:12:36 +0200
Big data Spark Structured Streaming
1. Shortcomings of Spark Streaming
In 2016, Apache Spark launched the Structured Streaming project, a new stream computing engine based on Spark SQL that lets users write high-performance stream processing programs as easily as they write batch programs. Structured Streaming is not a simple improvement to Spark Streaming, but a new stream ...
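A minimal sketch of the batch-like streaming API the excerpt refers to, using the built-in rate source so it runs with no external dependencies (the rows-per-second setting and ten-second run time are arbitrary choices):

import org.apache.spark.sql.SparkSession

object StructuredStreamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ss-demo").getOrCreate()
    import spark.implicits._

    // The rate source emits (timestamp, value) rows; the query is written
    // like a batch transformation over an unbounded DataFrame
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
    val evens = stream.filter($"value" % 2 === 0)

    val query = evens.writeStream.format("console").outputMode("append").start()
    query.awaitTermination(10000) // run for ~10s, then exit
    spark.stop()
  }
}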
Added by Gamic on Sat, 18 Dec 2021 07:26:14 +0200
[CDH 6.3.X] Cloudera Manager 6.3.2, CDH 6.3.1 installation process
Software packages required during deployment
Link: https://pan.baidu.com/s/1UajMORVvQ_VSLOdVkJWYQQ Extraction code: e28y
Link: https://pan.baidu.com/s/1dMj8JEaRIOaXP53W2kF_mQ Extraction code: xbyo
Key points:
- Set the host name in FQDN format
- Turn off the firewall
- Turn off IPv6
- Configure a local HTTP service
- Configure local storage
- The MySQL JDBC driver ...
Added by Jacquelyn L. Ja on Sat, 18 Dec 2021 06:00:54 +0200
An introductory summary of PySpark feature engineering
PySpark Feature Tool
1. Data preparation
We define some test data to verify that the functions behave as expected; at the same time, for most beginners, understanding what a function's inputs and outputs look like makes it easier to understand and use the feature functions:
df = spark.createDataFrame([
    ('zhu', "Hi I h ...
Added by gumby51 on Tue, 14 Dec 2021 21:04:34 +0200