Analysis of key-value RDDs in Spark

1. partitionBy 1) Function signature: def partitionBy(partitioner: Partitioner): RDD[(K, V)] 2) Function description: repartitions the data according to the specified Partitioner. Spark's default Partitioner is HashPartitioner. Note: partitionBy can only be called on an RDD of key-value tuples. import org.apache.spark.{HashPartit ...
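A minimal runnable sketch of the call described above (the local-mode setup and sample pairs are illustrative, not from the article):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("partitionBy"))

    // partitionBy is only defined on RDDs of (key, value) tuples.
    val pairs = sc.makeRDD(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 2)

    // Redistribute the pairs across 3 partitions by hashing the key.
    val repartitioned = pairs.partitionBy(new HashPartitioner(3))

    println(repartitioned.getNumPartitions) // 3
    sc.stop()
  }
}
```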

Added by anticore on Mon, 13 Dec 2021 10:41:45 +0200

Summary of Spark integration with Hive

I won't say much about installing Spark here. To install MySQL and Hive: install the RPM repository package and download MySQL: sudo yum localinstall https://repo.mysql.com//mysql80-community-release-el7-1.noarch.rpm sudo yum install mysql-community-server Start the MySQL service and check its status: systemctl start mysqld.service service ...
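Once MySQL is hosting the Hive metastore and hive-site.xml is on Spark's classpath, Hive tables can be queried from Spark by enabling Hive support on the SparkSession; the database and table names below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object SparkHiveDemo {
  def main(args: Array[String]): Unit = {
    // Requires hive-site.xml (pointing at the metastore) in $SPARK_HOME/conf or on the classpath.
    val spark = SparkSession.builder()
      .appName("spark-hive")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    // demo_db.demo_table is a placeholder Hive table.
    spark.sql("SELECT * FROM demo_db.demo_table LIMIT 10").show()

    spark.stop()
  }
}
```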

Added by jaimitoc30 on Tue, 07 Dec 2021 22:24:11 +0200

[Proficient in Spark series] Is the first step always the hardest? This article makes it easy for you to get started with Spark

🚀 Author: "big data Zen" 🚀 **Introduction**: This article is part of a Spark series. The column will cover Spark from the basics to advanced topics, including an introduction to Spark, cluster construction, core components, RDDs, the use of operators, underlying principles, SparkCore, SparkSQL, SparkStreaming, etc. S ...

Added by stringman on Sun, 05 Dec 2021 18:32:19 +0200

Analysis of missed problems in LeetCode Weekly Contest 269

Weekly Contest 269: the problems were quite easy, but I only got three out and fell a little short on the fourth. I'm still not good enough, so I'll keep working to train my thinking and speed. 5938. Find Target Indices After Sorting Array class Solution(object): def targetIndices(self, nums, target): """ :type nums ...
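The article's own Python solution is cut off above; purely as an illustration of problem 5938 (not the author's code, and written in Scala like the other snippets in this digest), a counting-based sketch:

```scala
object TargetIndices {
  // Indices of `target` in the sorted array: everything smaller than target
  // lands before it, so the answer is the contiguous range [less, less + equal).
  def targetIndices(nums: Array[Int], target: Int): Seq[Int] = {
    val less  = nums.count(_ < target)
    val equal = nums.count(_ == target)
    less until (less + equal)
  }

  def main(args: Array[String]): Unit = {
    println(targetIndices(Array(1, 2, 5, 2, 3), 2)) // Range 1 until 3, i.e. indices 1 and 2
  }
}
```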

Added by Fergal Andrews on Sun, 28 Nov 2021 08:50:58 +0200

Big data Flume enterprise development practice

1 Replication and multiplexing 1.1 Case requirements Flume-1 monitors file changes. Flume-1 passes the changes to Flume-2, which is responsible for storing them to HDFS. At the same time, Flume-1 passes the changes to Flume-3, which is responsible for writing them to the local file system. 1.2 Requirement analysis: single data ...
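A sketch of what the Flume-1 agent configuration for this case could look like (hostnames, ports, and the monitored path are placeholders; Flume-2 and Flume-3 would each run an avro source feeding an HDFS sink and a file_roll sink respectively):

```properties
# Flume-1 (agent a1): replicate one source into two downstream agents
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Monitor file changes (command/path is a placeholder)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# The replicating selector copies every event to both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# k1 forwards to Flume-2 (which writes to HDFS), k2 to Flume-3 (local file system)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
a1.sinks.k2.channel = c2
```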

Added by ss-mike on Fri, 26 Nov 2021 15:40:56 +0200

Spark 3.0.0 environment installation

1. Spark overview 1.1 What is Spark Spark is a fast, general-purpose, and scalable in-memory big data analysis framework. 1.2 Hadoop and Spark Hadoop: a disk-based, single-pass computing framework that is not suitable for iterative computing. When processing data, the framework has to go back to the storage device to read the data out, carry out ...
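To make the in-memory point concrete, a small sketch (the data and loop are made up): caching an RDD keeps it in memory, so each pass of an iterative job reuses it instead of going back to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("iterative"))

    // cache() keeps the dataset in memory across the iterations below.
    val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var center = 0.0
    for (_ <- 1 to 10) {
      // each iteration scans the same cached data again
      center = points.map(x => x - center).mean() + center
    }
    println(center) // converges to the mean of 1..1000000
    sc.stop()
  }
}
```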

Added by Protato on Wed, 27 Oct 2021 15:13:57 +0300

Cloudera series: using DataFrames and Schemas

1. Create DataFrames from Data Sources 1. Data sources for DataFrames DataFrames read data from, and write data to, data sources. Spark SQL supports a wide range of data source types and formats: text files (CSV, JSON, plain text); binary format files (Apache Parquet, Apache ORC, Apache Avro); tables (Hive ...
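A short Scala sketch of the read/write pattern described here (file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object DataSourcesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("data-sources").getOrCreate()

    // Read a CSV text file into a DataFrame, inferring the schema from the data.
    val peopleDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/people.csv")

    peopleDF.printSchema()

    // Write the same data back out in a binary columnar format (Parquet).
    peopleDF.write.mode("overwrite").parquet("/tmp/people_parquet")

    spark.stop()
  }
}
```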

Added by ruach on Thu, 21 Oct 2021 17:47:06 +0300

Spark Doris Connector design

Spark Doris Connector is a new feature introduced in Doris 0.12. With it, users can read and write data stored in Doris directly through Spark, with support for SQL, DataFrame, RDD, and other APIs. From Doris's perspective, bringing its data into Spark gives access to Spark's rich ecosystem of products, bro ...
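A read-side sketch of how this looks from Spark; the format name and option keys follow the connector's documented DataFrame usage, but treat them (and the placeholder host, table, and credentials) as assumptions to verify against your connector version:

```scala
import org.apache.spark.sql.SparkSession

object DorisReadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("doris-read").getOrCreate()

    // Option values below are placeholders.
    val dorisDF = spark.read.format("doris")
      .option("doris.table.identifier", "example_db.example_table")
      .option("doris.fenodes", "fe_host:8030")
      .option("user", "root")
      .option("password", "")
      .load()

    dorisDF.show(10)
    spark.stop()
  }
}
```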

Added by jackyhuphp on Wed, 13 Oct 2021 00:15:42 +0300

Hadoop Learning Notes - Spark for TopN (Python)

Spark implements TopN Experiment requirements Data preparation Expected results Related classes and operators findspark pyspark: SparkContext: parallelize(c, numSlices=None) collect() textFile(name, minPartitions=None, use_unicode=True) map(f, preservesPartitioning=False) cache( ...
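For comparison with the PySpark outline above, a TopN sketch (kept in Scala for consistency with the other snippets in this digest; the sample records are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopNDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("topN"))

    // Sample (key, score) records; a real job would build these from textFile(...) + map(...).
    val records = sc.parallelize(Seq(("a", 12), ("b", 45), ("c", 3), ("d", 27), ("e", 45)))

    // top(n) returns the n largest elements to the driver, here ordered by score.
    val top3 = records.top(3)(Ordering.by(_._2))
    top3.foreach(println)

    sc.stop()
  }
}
```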

Added by jdiver on Sun, 07 Jun 2020 05:11:10 +0300

Hands-on | Writing to Hudi using Spark Streaming

1. Project Background The traditional data warehouse architecture is designed for OLAP (online analytical processing) of offline data. The common way to import data is to use Sqoop or scheduled Spark jobs to import business database data into the warehouse one by one. With the increasing demand for real-time data analysis, hou ...
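One common shape for this kind of job is Structured Streaming's foreachBatch writing each micro-batch through the Hudi datasource; this is a sketch, not the article's code, and the streaming source, paths, and field names are placeholders (the hudi-spark bundle must be on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object StreamToHudiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-to-hudi").getOrCreate()

    // Placeholder streaming source; a real pipeline would typically read from Kafka.
    val stream = spark.readStream.format("rate").load() // columns: timestamp, value

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // Standard Hudi datasource options; table name and path are placeholders.
        batch.write.format("hudi")
          .option("hoodie.table.name", "demo_table")
          .option("hoodie.datasource.write.recordkey.field", "value")
          .option("hoodie.datasource.write.precombine.field", "timestamp")
          .mode(SaveMode.Append)
          .save("/tmp/hudi/demo_table")
      }
      .start()

    query.awaitTermination()
  }
}
```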

Added by robh76 on Sun, 19 Apr 2020 03:03:03 +0300