Analysis of key-value RDDs in Spark
1. partitionBy
1) Function signature
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
2) Function description: repartitions the data according to the specified Partitioner. Spark's default Partitioner is HashPartitioner. Note: partitionBy can only be called on an RDD of key-value (tuple) type.
import org.apache.spark.{HashPartit ...
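The import above is cut off; as a hedged illustration of the same idea, here is a minimal PySpark sketch (the signature shown above is Scala, while pyspark's pair-RDD partitionBy takes a partition count and an optional hash function):
from pyspark import SparkContext

sc = SparkContext("local[*]", "partitionBy-demo")
# partitionBy is only defined on RDDs of (key, value) tuples.
rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)], 2)
# Redistribute the pairs into 2 partitions by key hash (portable_hash by default).
partitioned = rdd.partitionBy(2)
# glom() shows the contents of each partition.
print(partitioned.glom().collect())
sc.stop()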
Added by anticore on Mon, 13 Dec 2021 10:41:45 +0200
Summary of integrating Spark with Hive
I won't say much about installing Spark itself here~
!!! Look! You need to install MySQL and Hive:
Install the MySQL repository RPM package, then install MySQL itself:
sudo yum localinstall https://repo.mysql.com//mysql80-community-release-el7-1.noarch.rpm
sudo yum install mysql-community-server
Start the MySQL service and check its status:
systemctl start mysqld.service
service ...
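Once MySQL is running and Hive's metastore is configured to use it, a minimal PySpark sketch for verifying the integration might look like this (assuming hive-site.xml is on Spark's classpath; the table name is hypothetical):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-hive-check")
         .enableHiveSupport()   # reads hive-site.xml to find the MySQL-backed metastore
         .getOrCreate())
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM default.demo_table LIMIT 10").show()  # demo_table is hypothetical
spark.stop()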
Added by jaimitoc30 on Tue, 07 Dec 2021 22:24:11 +0200
[Proficient in Spark series] Is getting started the hardest part? This article makes it easy to get started with Spark
🚀 Author: "big data Zen"
🚀 **Introduction**: This article is part of a Spark series. The column covers Spark from the basics to advanced topics, including an introduction to Spark, cluster construction, core components, RDD, the use of operators, underlying principles, SparkCore, SparkSQL, SparkStreaming, etc, S ...
Added by stringman on Sun, 05 Dec 2021 18:32:19 +0200
Analysis of missed problems in LeetCode Weekly Contest 269
Weekly Contest 269
The problems were fairly easy; I only finished three, and fell just short on the fourth.
I'm still not good enough. I'll keep working hard to train my thinking and speed.
5938. Find Target Indices After Sorting Array
class Solution(object):
    def targetIndices(self, nums, target):
        """
        :type nums ...
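The excerpt cuts off above; a minimal completed sketch of the usual sort-then-scan approach (my completion under that assumption, not necessarily the author's code):
class Solution(object):
    def targetIndices(self, nums, target):
        # Sort first, then collect every index whose value equals target.
        nums.sort()
        return [i for i, v in enumerate(nums) if v == target]

# Example: Solution().targetIndices([1, 2, 5, 2, 3], 2) returns [1, 2]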
Added by Fergal Andrews on Sun, 28 Nov 2021 08:50:58 +0200
Big data: Flume enterprise development in practice
1 replication and multiplexing
1.1 case requirements
Flume-1 monitors file changes and passes each change to Flume-2, which is responsible for storing it to HDFS. At the same time, Flume-1 passes the change to Flume-3, which is responsible for writing it to the local file system.
1.2 requirement analysis: single data ...
Added by ss-mike on Fri, 26 Nov 2021 15:40:56 +0200
Spark 3.0.0 environment installation
1. Spark overview
1.1 what is Spark
Spark is a fast, general-purpose, and scalable in-memory big data analytics framework.
1.2 Hadoop and Spark
Hadoop MapReduce: a disk-based, one-pass computing framework that is not suitable for iterative computing. When processing data, the framework reads the data from the storage device, carries out ...
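A minimal PySpark sketch of why in-memory caching suits iterative computing (the input path and iteration count are hypothetical):
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")
# Read once, cache in memory, then reuse across iterations.
data = sc.textFile("hdfs:///tmp/input.txt").map(lambda line: float(len(line))).cache()
threshold = 0.0
for _ in range(10):
    # Each pass reuses the cached RDD instead of re-reading from disk,
    # which is where Spark gains over chained disk-based MapReduce jobs.
    threshold = data.filter(lambda x: x > threshold).mean()
print("final threshold:", threshold)
sc.stop()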
Added by Protato on Wed, 27 Oct 2021 15:13:57 +0300
Cloudera series: using DataFrames and Schemas
1. Create DataFrames from Data Sources
1. Data source for DataFrame
DataFrames read data from, and write data to, data sources. Spark SQL supports a wide range of data source types and formats:
Text files: CSV, JSON, plain text
Binary format files: Apache Parquet, Apache ORC, Apache Avro data formats
Tables: Hive ...
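A minimal PySpark sketch of reading DataFrames from two of the formats listed above (file paths are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("data-sources").getOrCreate()
# CSV: a text format, so the header and schema inference must be requested.
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/people.csv")
# Parquet: a binary format that carries its own schema, so no options are needed.
parquet_df = spark.read.parquet("/tmp/people.parquet")
csv_df.printSchema()
parquet_df.show(5)
spark.stop()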
Added by ruach on Thu, 21 Oct 2021 17:47:06 +0300
Spark Doris Connector design
Spark Doris Connector is a new feature introduced in Doris 0.12. With it, users can read and write the data stored in Doris directly from Spark, using SQL, DataFrame, RDD, and other APIs.
From Doris's perspective, bringing its data into Spark makes Spark's rich ecosystem of products available, bro ...
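A hedged sketch of reading a Doris table through the connector's DataFrame source (option names follow the Doris documentation; all values here are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("doris-read").getOrCreate()
doris_df = (spark.read.format("doris")
            .option("doris.table.identifier", "example_db.example_table")
            .option("doris.fenodes", "fe_host:8030")   # Doris frontend host:http_port
            .option("user", "root")
            .option("password", "")
            .load())
doris_df.show(5)
spark.stop()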
Added by jackyhuphp on Wed, 13 Oct 2021 00:15:42 +0300
Hadoop Learning Notes: Spark for TopN (Python)
Implementing TopN with Spark
Experimentation Requirements
Data preparation
Expected results
Related Classes and Operators
findspark
pyspark:
SparkContext:
parallelize(c, numSlices=None)
collect()
textFile(name, minPartitions=None, use_unicode=True)
map(f, preservesPartitioning=False)
cache( ...
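A minimal sketch of a TopN job built from the operators listed above (the input path, record layout, and N=3 are hypothetical):
import findspark
findspark.init()   # make the local Spark installation importable
from pyspark import SparkContext

sc = SparkContext("local[*]", "topn")
# Assume each line is "key,value"; keep the 3 records with the largest values.
lines = sc.textFile("file:///tmp/topn_input.txt")
pairs = lines.map(lambda line: line.split(",")).map(lambda p: (p[0], int(p[1]))).cache()
top3 = pairs.sortBy(lambda kv: kv[1], ascending=False).take(3)
print(top3)
sc.stop()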
Added by jdiver on Sun, 07 Jun 2020 05:11:10 +0300
Hands-on | Writing to Hudi using Spark Streaming
1. Project Background
Traditional data warehouses are organized around the OLAP (Online Analytical Processing) needs of offline data. The common way to import data is to use scheduled Sqoop or Spark jobs to load business-database data into the warehouse table by table. With the increasing demand for real-time data analysis, hou ...
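A hedged sketch of a streaming write into Hudi with Spark Structured Streaming (the Hudi Spark bundle must be on the classpath; the Kafka topic, fields, and paths are all hypothetical, and the "hudi" format name reflects recent Hudi releases):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-hudi").getOrCreate()
source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())
records = source.selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")
query = (records.writeStream.format("hudi")
         .option("hoodie.table.name", "events_hudi")
         .option("hoodie.datasource.write.recordkey.field", "id")
         .option("hoodie.datasource.write.precombine.field", "payload")
         .option("checkpointLocation", "/tmp/checkpoints/events_hudi")
         .outputMode("append")
         .start("/tmp/hudi/events_hudi"))
query.awaitTermination()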
Added by robh76 on Sun, 19 Apr 2020 03:03:03 +0300