Integrated Java and Scala development environment using Maven
Git address: https://gitee.com/jyq_18792721831/sparkmaven.git
Create Project
Let's start by creating a generic Maven project
Create the project, then add a hello module
This is also an ordinary Maven module
Add the Scala dependency
We don't write code in the parent project; it exists only to manage the child modules, so th ...
Added by ahundiak on Fri, 18 Feb 2022 00:15:40 +0200
[3 days to master Spark] - Sogou log statistical analysis contact
SogouQ log analysis
Data research and business analysis
This case uses the [user query log (SogouQ)] data provided by Sogou Lab, and uses the Spark framework to encapsulate the data into RDDs for business data processing and analysis.
1) Data introduction:
The search engine query log database is designed as a collection of Web query log data in ...
Added by mjm7867 on Fri, 11 Feb 2022 20:57:19 +0200
Scheduling Spark tasks in parallel with Python
Background
Translate PySpark code that implements a piece of business logic into Spark SQL, then use the Spark SQL version to backfill the historical data for the past six months (run day by day).
Core points
1) Translate PySpark to Spark SQL; 2) Using Spark SQL, backfill the historical data of the past half year (run day by day).
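The day-by-day backfill in point 2 could be driven by a small scheduler along the following lines. This is only a minimal sketch, not the author's implementation: `daily_partitions`, `run_day`, and `backfill` are hypothetical names, the 180-day window is an assumed stand-in for "six months", and `run_day` is a placeholder where a real job would launch the Spark SQL statement for one date.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_partitions(end: date, days: int) -> list:
    """Return `days` consecutive ISO dates ending at `end` (inclusive), oldest first."""
    start = end - timedelta(days=days - 1)
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def run_day(day: str) -> str:
    """Hypothetical placeholder for one day's job.

    A real version might submit the Spark SQL script for `day`;
    here it only returns a marker so the sketch stays runnable.
    """
    return f"done {day}"


def backfill(end: date, days: int, workers: int = 4) -> list:
    """Run `run_day` for every date in the window, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though days run concurrently.
        return list(pool.map(run_day, daily_partitions(end, days)))


if __name__ == "__main__":
    results = backfill(date(2022, 2, 10), days=180)
    print(len(results), results[-1])
```

Threads are enough here because each worker would mostly wait on an external Spark job; the `workers` limit keeps the number of concurrent submissions within what the cluster can absorb.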
Implementation
1) First, PySpark is tra ...
Added by crimsonmoon on Fri, 11 Feb 2022 03:30:23 +0200
Hadoop + Spark big data analysis: Hadoop cluster setup
Table of contents
Preface
I. Download and configure the cluster environment
1. Download Hadoop
2. Configure Hadoop environment variables
Configure the Hadoop core environment
Configure core-site.xml
Configure hdfs-site.xml
Configure mapred-site.xml
Configure yarn-site.xml
Configure workers
Disable the firewall
II. Clone ...
Added by jonniejoejonson on Tue, 08 Feb 2022 05:25:06 +0200
3. Analyzing flight big data with Spark and D3.js
Experimental resources
1998.csv, airports.csv
Experimental environment
VMware Workstation, Ubuntu 16.04, spark-2.4.5, scala-2.12.10
Experimental content
"We are sorry to inform you that your flight XXXX from XX to XX has been delayed." I believe many passengers waiting at the airport do not want to hear this sentence. With the gradua ...
Added by J-C on Mon, 07 Feb 2022 14:00:50 +0200
Spark learning notes [1] - Scala environment installation and basic syntax
As the saying goes, to do a good job you must first sharpen your tools. Spark's development language is not Java but Scala. Although both run on the JVM, the basic characteristics of the two languages still differ somewhat. Here is a ...
Added by GateGuardian on Sun, 06 Feb 2022 08:36:12 +0200
Apache Hudi source code analysis - Z-order layout optimization
This article aims to build familiarity with the overall architecture of Hudi step by step by following one particular feature; it does not discuss the implementation details of the algorithm.
I am new to Hudi; if you find any problems, please correct them.
Spark version: 3.1.2
Hudi branch: master
Time: 2022/02/06, first edition
Objective: to r ...
Added by blakey on Sun, 06 Feb 2022 06:26:03 +0200
Spark Chasing-Wife Series (Value-type RDDs)
Today is the third day of the lunar new year. Monkey Sai Lei
Small talk
These days I send her a red envelope every night, a New Year's red envelope, sometimes with a sticker added. The Chinese New Year is pleasant enough, but it lacks the festive flavor. My throat hurts from eating melon seeds.
There are many operators in Spark, in ...
Added by Rincewind on Thu, 03 Feb 2022 13:40:07 +0200
Python big data processing library PySpark Practice II
Creating Spark RDDs with PySpark
Each RDD can be divided into multiple partitions. Each partition can be regarded as a fragment of the data set and can be stored on a different node in the Spark cluster. An RDD is fault-tolerant and read-only: it cannot be modified, and a new RDD can only be produced through a transformation. An RDD can be proces ...
Added by pete07920 on Sun, 30 Jan 2022 16:23:19 +0200
Spark SQL for big data
1 Spark SQL overview
1.1 What is Spark SQL
Spark SQL is Spark's module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and acts as a distributed SQL query engine. We have already learned about Hive, which converts Hive SQL into MapReduce and then submits it to the cluster for execution, which great ...
Added by condoug on Sat, 29 Jan 2022 06:48:02 +0200