Integrated Java and Scala development environment using Maven

Git address: https://gitee.com/jyq_18792721831/sparkmaven.git. Create the project: let's start by creating a generic Maven project, then add a hello module, which is also a normal Maven module. Add the Scala dependency: we don't write code in the parent project; the parent project only manages the child projects, so th ...

Added by ahundiak on Fri, 18 Feb 2022 00:15:40 +0200

[3 days to master Spark] - Sogou log statistical analysis contact

SogouQ log analysis. Data research and business analysis: the [user query log (SogouQ)] data provided by Sogou Labs is loaded into RDDs with the Spark framework for business data processing and analysis. 1) Data introduction: the search engine query log database is designed as a collection of Web query log data in ...

Added by mjm7867 on Fri, 11 Feb 2022 20:57:19 +0200

Scheduling Spark tasks in parallel with Python

Background: translate PySpark code that implements a piece of business logic into Spark SQL, then use the Spark SQL to backfill the historical data for the past six months (run by day). Core points: 1) translate PySpark to Spark SQL; 2) based on the Spark SQL, backfill the historical data of the past half year (run by day). Implementation: 1) First, PySpark is tra ...

Added by crimsonmoon on Fri, 11 Feb 2022 03:30:23 +0200
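
The "run by day" backfill described in this entry could be sketched as follows. This is a minimal illustration using only the Python standard library, not the post's actual code: `daily_partitions`, `run_day`, and `backfill` are hypothetical names, and `run_day` is a stand-in for whatever command submits the Spark SQL job for one day.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_partitions(start: date, end: date) -> list[str]:
    """Return every day from start to end (inclusive) as a YYYY-MM-DD string."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def run_day(day: str) -> tuple[str, bool]:
    """Hypothetical per-day job (assumption, not from the post).

    A real version might launch the Spark SQL script with `day` as a
    parameter; this placeholder only reports success.
    """
    return day, True


def backfill(start: date, end: date, max_workers: int = 4) -> dict[str, bool]:
    """Run run_day once per day, with at most max_workers days in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_day, daily_partitions(start, end)))


if __name__ == "__main__":
    results = backfill(date(2021, 8, 11), date(2022, 2, 10))
    failed = [day for day, ok in results.items() if not ok]
    print(f"{len(results)} days run, {len(failed)} failed")
```

Threads are enough here because each worker would mostly wait on an external Spark job; `max_workers` caps how many days hit the cluster at once.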

Hadoop + Spark big data analysis: Hadoop cluster construction

Contents: Preface. 1. Download and configuration of the cluster environment: 1) download Hadoop; 2) configure the Hadoop environment variables; configure the Hadoop core environment (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers); disable the firewall. 2. Clone ...

Added by jonniejoejonson on Tue, 08 Feb 2022 05:25:06 +0200

3. Analyzing flight big data with Spark and D3.js

Experimental resources: 1998.csv, airports.csv. Experimental environment: VMware Workstation, Ubuntu 16.04, spark-2.4.5, scala-2.12.10. Experimental content: "We are sorry to inform you that your flight XXXX from XX to XX has been delayed." I believe many passengers waiting at the airport do not want to hear this sentence. With the gradua ...

Added by J-C on Mon, 07 Feb 2022 14:00:50 +0200

Spark learning notes [1] - Scala environment installation and basic syntax

As the saying goes, to do a good job you must first sharpen your tools. Spark is developed in Scala rather than Java. Although both languages run on the JVM, their basic characteristics are still somewhat different. Here is a ...

Added by GateGuardian on Sun, 06 Feb 2022 08:36:12 +0200

Apache Hudi source code analysis - Z-order layout optimization

This article aims to gradually become familiar with the overall architecture of Hudi by following one feature, and will not discuss the implementation details of the algorithm. I am a Hudi newcomer; if you spot any mistakes, please correct them. Spark: version 3.1.2. Hudi: branch master. Time: 2022/02/06, first edition. Objective: to r ...

Added by blakey on Sun, 06 Feb 2022 06:26:03 +0200

Spark chasing Wife Series (RDD of Value type)

Today is the third day of the Lunar New Year. Monkey Sai Lei. Small talk: these days I send her a New Year's red envelope every night, sometimes with a sticker added. The Chinese New Year feels fine, but it lacks the festive atmosphere, and my throat hurts from eating melon seeds. There are many operators in Spark, in ...

Added by Rincewind on Thu, 03 Feb 2022 13:40:07 +0200

Python big data processing library PySpark Practice II

Creating Spark RDDs with PySpark. Each RDD can be divided into multiple partitions. Each partition can be regarded as a fragment of the data set, and different partitions can be stored on different nodes in the Spark cluster. An RDD is fault-tolerant and read-only: it cannot be modified in place, and new RDDs can only be generated through transformations. An RDD can be proces ...

Added by pete07920 on Sun, 30 Jan 2022 16:23:19 +0200
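
The two properties named in this entry, partitioning and read-only transformation, can be illustrated without a Spark cluster. The sketch below is plain Python, not the PySpark API: `split_into_partitions` and `map_partitions` are hypothetical helpers, and tuples stand in for read-only partitions.

```python
from typing import Callable, Tuple

Partitions = Tuple[Tuple[int, ...], ...]


def split_into_partitions(data: list[int], num_partitions: int) -> Partitions:
    """Split data into num_partitions contiguous, read-only (tuple) fragments."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(data[start:end]))
        start = end
    return tuple(parts)


def map_partitions(parts: Partitions, fn: Callable[[int], int]) -> Partitions:
    """Apply fn to every element and return NEW partitions; the input is untouched."""
    return tuple(tuple(fn(x) for x in part) for part in parts)


if __name__ == "__main__":
    original = split_into_partitions([1, 2, 3, 4, 5], 2)
    doubled = map_partitions(original, lambda x: x * 2)
    print(original)
    print(doubled)
```

In real PySpark the analogous steps would go through a `SparkContext`; this version only mirrors the idea that a transformation yields a new data set while the source stays unchanged.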

Spark SQL for big data

1. Spark SQL overview. 1.1 What is Spark SQL? Spark SQL is the Spark module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and acts as a distributed SQL query engine. We have already learned about Hive, which converts Hive SQL into MapReduce jobs and then submits them to the cluster for execution, which great ...

Added by condoug on Sat, 29 Jan 2022 06:48:02 +0200