Hadoop + Spark big data analysis: building a Hadoop cluster

Contents: Preface. I. Download and configure the cluster environment: 1. Download Hadoop; 2. Configure Hadoop environment variables; configure the Hadoop core environment (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers); disable the firewall. II. Clone ...

Added by jonniejoejonson on Tue, 08 Feb 2022 05:25:06 +0200

CDH 6.1: upgrading Impala to version 3.4 to enable automatic metadata refresh, and solutions to the problems encountered

On CDH 6.1, we upgraded Impala and enabled automatic metadata refresh. We ran into several problems along the way and eventually solved them by checking the logs, reading the source code, searching Google, and so on. This article organizes what we learned and gives it back to the community. The main reference ...

Added by gwydionwaters on Tue, 08 Feb 2022 02:43:35 +0200

Crawling the CSDN hot list in 1 line of code, Python ha beer style writing

Eraser, a funny senior Internet bug. Project background. Group member: Sister Eraser, what is the fewest lines of code needed to crawl the CSDN hot list data? Sister Eraser: Probably 10. Group member: Oh baby, show me your code! Hence this project: crawl the CSDN hot list with the fewest lines of code. The import module ...

Added by J@ystick_FI on Mon, 07 Feb 2022 10:09:55 +0200

Flink deduplication schemes

Flink deduplication. Deduplicated counting is a common metric calculation in data analysis, such as the number of users visiting a website in a day or the number of users clicking on an advertisement. Offline computation is a full, one-time process, and the deduplicated result can usually be obtained by dist ...

Added by cowboy_x on Mon, 07 Feb 2022 05:46:01 +0200

ES introductory learning notes

Introduction: ES is a distributed, document-oriented non-relational database (a document is similar to a single record in a relational database). Every field of a document is indexed by default, so the data in every field can be searched. It can scale horizontally to hundreds of servers to store and process PB-level data. ES is based on ...

Added by daloss on Mon, 07 Feb 2022 03:25:07 +0200

Spark learning notes [1]: Scala environment installation and basic syntax

As the saying goes, to do a good job you must first sharpen your tools. Spark's development language is not Java but Scala. Although both run on the JVM, the basic characteristics of the two languages still differ somewhat. Here is a ...

Added by GateGuardian on Sun, 06 Feb 2022 08:36:12 +0200

Apache Hudi source code analysis: Z-order layout optimization

This article aims to gradually become familiar with the implementation of Hudi's overall architecture by following one feature, and will not discuss the implementation details of the algorithm. I am new to Hudi, so please correct me if you find any problems. Spark: version 3.1.2. Hudi: branch master. Time: 2022/02/06, first edition. Objective: to r ...

Added by blakey on Sun, 06 Feb 2022 06:26:03 +0200

Big data project: Flink real-time data warehouse (DWM layer)

Design ideas: Previously, we split the data into independent Kafka topics through stream splitting and other processing. Next, when processing the data, we should consider the metrics used in real-time calculation. Timeliness is the goal of a real-time data warehouse. Therefore, in some scenarios, it is not necessary to have a ...

Added by SteveMellor on Thu, 03 Feb 2022 21:34:05 +0200

Spark Chasing Wife Series (Value-type RDDs)

Today is the third day of the Lunar New Year. Monkey Sai Lei. Small talk: these days I send her a red envelope every night, a New Year's red envelope, sometimes with a sticker added. The New Year feels fine, but it lacks the New Year atmosphere. My throat hurts from eating melon seeds. There are many operators in Spark, in ...

Added by Rincewind on Thu, 03 Feb 2022 13:40:07 +0200

Getting started with Elasticsearch

Basic concepts. Index: similar to a table in a traditional relational database; it is where related documents are stored. Document type: [removed after version 7.0]. Document (doc): a doc represents one piece of data in the index, like a record in a database table. Docs store data in JSON format. ES architecture design. Simple defini ...

Added by abie10 on Thu, 03 Feb 2022 05:10:31 +0200