Hadoop Learning Notes - Spark for TopN (Python)

Spark implements TopN: experimentation requirements, data preparation, expected results, related classes and operators: findspark, pyspark: SparkContext: parallelize(*c*, *numSlices=None*), collect(), textFile(*name*, *minPartitions=None*, *use_unicode=True*), map(*f*, *preservesPartitioning=False*), cache( ...
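The excerpt above lists the PySpark operators the article uses; as a rough sketch of the TopN-per-key idea itself, here is the same logic in plain Python (no Spark required; the function name and data are illustrative, not from the article):

```python
import heapq
from collections import defaultdict

def top_n_per_key(records, n):
    """Keep the n largest values per key, mimicking a
    map -> group-by-key -> sort pipeline without Spark."""
    heaps = defaultdict(list)  # key -> min-heap of at most n values
    for key, value in records:
        heap = heaps[key]
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            heapq.heapreplace(heap, value)  # evict the smallest kept value
    # Report each key's values in descending order, as a TopN usually is
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

data = [("a", 3), ("a", 7), ("a", 1), ("b", 5), ("b", 9)]
print(top_n_per_key(data, 2))  # {'a': [7, 3], 'b': [9, 5]}
```

In PySpark the same shape is usually expressed with a key-value RDD and a per-key aggregation; the bounded heap keeps memory proportional to n per key rather than to the full value list.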

Added by jdiver on Sun, 07 Jun 2020 05:11:10 +0300

Actual | Write to Hudi using Spark Streaming

1. Project Background Traditional data warehouse structures are designed for OLAP (Online Analytical Processing) over offline data. The common way to import data is to use Sqoop or scheduled Spark jobs to load business database tables into the warehouse one by one. With the increasing demand for real-time data analysis, hou ...

Added by robh76 on Sun, 19 Apr 2020 03:03:03 +0300

Pyspark learning -- 2. Try to run pyspark

pyspark learning -- 2. Attempts at pyspark's running methods and various sample code: running methods (running in PyCharm; running Spark from the system: spark-submit to start a Spark task), sample code (streaming text processing: streaming context, streaming text word count), error reporting summary Operati ...
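The excerpt mentions a streaming word count; the per-batch core of that example, sketched in plain Python with a `Counter` standing in for the DStream aggregation (illustrative only, not the pyspark API):

```python
from collections import Counter

def word_count_batch(lines):
    """Count words in one micro-batch of text lines -- the same
    flatMap -> map -> reduceByKey shape a streaming word count uses."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # flatMap to words, then count
    return counts

batch = ["spark streaming word count", "spark word"]
print(word_count_batch(batch))
```

In the real streaming version this function's body is applied to every micro-batch, and `updateStateByKey` (or similar) carries the counts across batches.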

Added by lemming_ie on Sat, 08 Feb 2020 10:38:57 +0200

mapreduce implementation of big data learning find common friends JobControl implementation of directed acyclic graph

Note: this is just an MR training project. In practical applications, do not use MR for friend-recommendation or directed-acyclic-graph computation logic, because MR has to write intermediate results to disk, and the disk IO greatly reduces efficiency. Hadoop is a bit b ...
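As a sketch of the two-pass common-friends idea the note refers to, here is the same algorithm in plain Python instead of MR (the sample friend lists are made up):

```python
from collections import defaultdict
from itertools import combinations

def common_friends(friend_lists):
    """Pass 1 (the first MR job): invert each friend list to
    friend -> set of people who list that friend.
    Pass 2 (the second MR job): every pair of those people
    shares that friend, so accumulate it under the pair."""
    followers = defaultdict(set)
    for person, friends in friend_lists.items():
        for f in friends:
            followers[f].add(person)
    shared = defaultdict(set)
    for friend, people in followers.items():
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)].add(friend)
    return dict(shared)

data = {"A": {"B", "C"}, "B": {"C"}, "D": {"B", "C"}}
print(common_friends(data))  # e.g. ('A', 'D') share {'B', 'C'}
```

Sorting the pair before emitting it is the same trick the MR version needs so that (A, D) and (D, A) land on the same reduce key.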

Added by Artiom on Mon, 27 Jan 2020 12:45:06 +0200

Spark on K8S (spark on kubernetes operator) FAQ

Spark on K8S (spark on kubernetes operator) environment construction and demo process (2): common problems during the Spark demo (part 2); how to persist logs from Spark's executor/driver; how to make the Spark history server configuration take effect; what the xxxxx webhook under the spark operator namespace does ...

Added by diggysmalls on Fri, 17 Jan 2020 14:10:41 +0200

Resolution of jackson version conflict in spark application

In a Spark program, jackson is used for JSON serialization and deserialization of Scala objects. java.lang.NoClassDefFoundError and java.lang.AbstractMethodError are thrown at runtime. After searching online, it turns out that the jackson/guava versions conflict with Spark's. 1. In IDEA, by adjusting the order of dependencie ...
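Beyond reordering dependencies in the IDE, a common fix for this class of conflict is to pin the clashing artifact to the version the Spark distribution was built against. A hedged Maven fragment (the coordinates and version here are illustrative, not taken from the article; check your Spark version's own jackson dependency first):

```xml
<!-- Pin jackson-databind to the version bundled with the target
     Spark release (version shown is illustrative) -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.6.7</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

`dependencyManagement` forces every transitive request for the artifact to resolve to one version, which avoids the mixed-version classpath that produces AbstractMethodError.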

Added by DocSeuss on Sun, 15 Dec 2019 17:06:15 +0200

Spark obtains a case of a mobile phone number staying under a base station and the location of the current mobile phone

1. Business requirements: given a cell phone number's dwell-time logs at each base station, together with the base station information, compute for each number the (base station, dwell time) and the (current longitude, current latitude). The log records generated when a phone connects to a base station look like the following: ...
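A plain-Python sketch of the dwell-time aggregation described above. The log layout here is an assumption for illustration: (phone, station, timestamp, flag) with flag 1 for connect and 0 for disconnect:

```python
from collections import defaultdict

def dwell_time(logs):
    """Sum the time between each connect (flag 1) and the following
    disconnect (flag 0), per (phone, station) pair."""
    connect_at = {}            # (phone, station) -> open connect timestamp
    totals = defaultdict(int)  # (phone, station) -> accumulated dwell time
    for phone, station, ts, flag in logs:
        key = (phone, station)
        if flag == 1:
            connect_at[key] = ts
        elif key in connect_at:
            totals[key] += ts - connect_at.pop(key)
    return dict(totals)

logs = [
    ("13800000000", "BS1", 100, 1),
    ("13800000000", "BS1", 160, 0),
    ("13800000000", "BS2", 200, 1),
    ("13800000000", "BS2", 230, 0),
]
print(dwell_time(logs))  # {('13800000000', 'BS1'): 60, ('13800000000', 'BS2'): 30}
```

In the Spark version the same pairing is typically done per key after a groupBy on (phone, station), and the station with the largest total is then joined against the base station table for its coordinates.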

Added by compguru910 on Tue, 10 Dec 2019 21:16:10 +0200

Spark source code analysis: Master registration mechanism

Master registration mechanism: application registration. The previous article analyzed the initialization process of the SparkContext, which ends by sending a RegisterApplication message to the Master. Now let's see how the Master responds after receiving these messages. First, the Master class inhe ...
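The message flow above can be caricatured as a tiny dispatcher in Python. The message names mirror Spark's (RegisterApplication, RegisteredApplication), but the code is purely illustrative, not the actual Scala RPC implementation:

```python
class Master:
    """Toy stand-in for Spark's Master RPC endpoint: it keeps a
    registry of applications and acknowledges each registration."""
    def __init__(self):
        self.apps = {}
        self.next_id = 0

    def receive(self, message):
        kind, payload = message
        if kind == "RegisterApplication":
            app_id = f"app-{self.next_id}"
            self.next_id += 1
            self.apps[app_id] = payload  # remember the app description
            return ("RegisteredApplication", app_id)
        return ("UnknownMessage", kind)

master = Master()
print(master.receive(("RegisterApplication", {"name": "demo"})))
```

The real Master pattern-matches on the message type in its `receive` handler the same way, registers the application description, and replies to the driver with the assigned application id.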

Added by Muddy_Funster on Mon, 02 Dec 2019 11:41:42 +0200

Spark custom external data source

Background: sometimes we need to define an external data source and process it with Spark SQL. There are two benefits: (1) once the external data source is defined it is very simple to use and the software architecture stays clear, since it can be queried directly through SQL; (2) it is easy to divide modules into layers and build them up layer by lay ...

Added by kinaski on Tue, 12 Nov 2019 21:13:18 +0200

Best practice | RDS & POLARDB archiving to X-Pack Spark computing

Through external computing resources, the X-Pack Spark service gives the Redis, Cassandra, MongoDB, HBase and RDS storage services the ability to do complex analysis, stream processing, warehousing and machine learning, so as to better solve users' data-processing scenarios. RDS & POLARDB sub-table archiving to X-Pack S ...

Added by cyber_ghost on Thu, 07 Nov 2019 08:55:51 +0200