Cloudera series: working with DataFrames and schemas

1. Creating DataFrames from data sources. A DataFrame reads data from a data source and writes data back to a data source. Spark SQL supports a wide range of data source types and formats: text files (CSV, JSON, plain text), binary-format files (Apache Parquet, Apache ORC, Apache Avro), tables (Hive) ...
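To make the record structure of these text formats concrete, here is a hedged sketch that parses tiny inline CSV and JSON-lines samples with the Python standard library (the sample data is invented; in Spark itself the equivalent readers are spark.read.csv and spark.read.json):

```python
import csv
import io
import json

# Tiny inline samples of two text formats a DataFrame reader handles.
# Parsed with the stdlib here just to show the records each format carries.
csv_text = "name,age\nalice,30\nbob,25\n"
json_lines = '{"name": "alice", "age": 30}\n{"name": "bob", "age": 25}\n'

# CSV: the header row names the columns; every value arrives as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON lines: one self-describing record per line, with typed values.
json_rows = [json.loads(line) for line in json_lines.splitlines()]

print(csv_rows[0]["name"], json_rows[1]["age"])  # alice 25
```

Note the difference the formats impose: CSV needs a schema (or inference) to recover types, while JSON carries types per record, which is why Spark's CSV reader has an inferSchema option.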

Added by ruach on Thu, 21 Oct 2021 17:47:06 +0300

HDFS basic operation

1. Viewing storage system information. hdfs dfsadmin -report [-live] [-dead] [-decommissioning] outputs basic information and related statistics about the file system: [root@master ~]# hdfs dfsadmin -report. Outputting basic information and statistics for the live nodes in the file system: [root@master ~]# hdfs dfsadmin -report ...
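The report is plain "Key: value" text, so it is easy to post-process. A hedged sketch below parses a sample report into a dict; the sample text is an assumption modeled on typical hdfs dfsadmin -report output, not captured from a real cluster:

```python
# Assumed sample of `hdfs dfsadmin -report` output (format is typical,
# values are invented for the example).
sample_report = """\
Configured Capacity: 52710469632 (49.09 GB)
Present Capacity: 43335630848 (40.36 GB)
DFS Remaining: 43334025216 (40.36 GB)
DFS Used: 1605632 (1.53 MB)
Live datanodes (2):
"""

def parse_report(text):
    """Turn 'Key: value' lines into a dict, ignoring section headers."""
    info = {}
    for line in text.splitlines():
        if ": " in line:
            key, _, value = line.partition(": ")
            info[key.strip()] = value.strip()
    return info

stats = parse_report(sample_report)
print(stats["DFS Used"])  # 1605632 (1.53 MB)
```

In practice you would feed the function the captured stdout of the command rather than a literal string.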

Added by elklabone on Wed, 20 Oct 2021 08:55:17 +0300

Installing and configuring the MySQL database on Tencent Cloud and connecting with SQLyog

1. One-click installation and uninstallation commands for MySQL (you can choose either one to execute). (1) Installation commands: sudo apt-get install mysql (downloads the latest version), or sudo apt install mysql-server mysql-client. (2) Uninstallation command: used when there is a problem and MySQL must be reinstalled, such as forgetting the initial ...

Added by ziong on Tue, 19 Oct 2021 04:19:29 +0300

JDBC connection to a database: Shang school notes

JDBC introduction. JDBC (Java Database Connectivity) is a common interface (a set of APIs), independent of any particular database management system, for general SQL database access and operation. It defines the standard Java class libraries used to access databases (java.sql, javax.sql). Using these class libraries, you can easily access database ...
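The access pattern JDBC standardizes (obtain a connection, create a statement, execute SQL, iterate the result set) looks much the same in other languages. As a hedged, self-contained illustration of that flow, here is the equivalent using Python's DB-API with the stdlib sqlite3 module; the table and data are invented for the example:

```python
import sqlite3

# Open a connection (JDBC: DriverManager.getConnection(url, user, password)).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()  # JDBC: connection.createStatement()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Parameterized execute, analogous to a JDBC PreparedStatement.
cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.commit()

cur.execute("SELECT id, name FROM users")
rows = cur.fetchall()  # JDBC: iterate the ResultSet
print(rows)  # [(1, 'alice')]
conn.close()
```

The value of such a standard interface is exactly what the article notes: the application code stays the same while the driver (MySQL, PostgreSQL, Oracle, ...) is swapped underneath.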

Added by djopie on Mon, 18 Oct 2021 21:37:46 +0300

Flink + Hudi integrated lakehouse solution

Abstract: this article introduces in detail how to build a prototype of the Flink + Hudi integrated lakehouse solution. The main contents are as follows: Hudi; how the new architecture integrates with the lakehouse; best practices; Flink on Hudi; Flink CDC 2.0 on Hudi. Tips: FFA 2021 has officially launched. Click "read the original te ...

Added by benzrf on Mon, 18 Oct 2021 07:38:52 +0300

Experiment 3: getting familiar with common HBase operations

1. Experimental purpose: (1) understand the role of HDFS in the Hadoop architecture; (2) become proficient in using common shell commands to operate HDFS; (3) become familiar with the Java APIs commonly used to operate HDFS. 2. Experimental platform. Operating system: Linux (CentOS recommended); Hadoop version: 3.2.2; HBase version: 2.3.6; JDK version: 1.7 or abo ...

Added by pido on Sat, 16 Oct 2021 10:08:47 +0300

Data warehouse tool: Hive

1. What is Hive? 1. Overview. Apache Hive is data warehouse software that provides querying and management of large datasets stored in distributed storage. It is built on Apache Hadoop and mainly provides the following functions: (1) a series of tools that can be used to extract, transform, and load data (ETL); (2) a mechanism that can store, quer ...
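HiveQL is closely modeled on SQL, so the ETL pattern described above can be previewed with ordinary SQL. The hedged sketch below uses the stdlib sqlite3 module as a stand-in engine; the table names and data are invented, and the CREATE TABLE ... AS SELECT step mirrors the CTAS statement HiveQL also supports:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract: load raw records into a staging table.
cur.execute("CREATE TABLE raw_logs (user TEXT, bytes INTEGER)")
cur.executemany("INSERT INTO raw_logs VALUES (?, ?)",
                [("a", 100), ("b", 250), ("a", 50)])

# Transform + Load: aggregate into a reporting table via CTAS,
# the same shape a HiveQL job would take over HDFS-backed tables.
cur.execute("""CREATE TABLE traffic AS
               SELECT user, SUM(bytes) AS total_bytes
               FROM raw_logs GROUP BY user""")

cur.execute("SELECT user, total_bytes FROM traffic ORDER BY user")
print(cur.fetchall())  # [('a', 150), ('b', 250)]
```

The difference in Hive is scale and storage, not syntax: the same statement compiles to distributed jobs over data in HDFS rather than running in-process.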

Added by SleepyP on Sat, 16 Oct 2021 08:51:21 +0300

35 - Blog site database: blog information data operations

(2) Project description: nowadays, microblogs and blogs have become the main systems for publishing and disseminating information. To manage this data, this project mainly operates on the category information table and the blog information table of a blog website. The blog s ...
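The two tables the project operates on suggest a simple one-to-many schema. As a hedged sketch (the column names are assumptions, and sqlite3 stands in for the site's real database), here is that schema with a typical read joining blogs to their categories:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schemas for the two tables the project operates on.
cur.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE blog (id INTEGER PRIMARY KEY, title TEXT,
               category_id INTEGER REFERENCES category(id))""")

cur.execute("INSERT INTO category (name) VALUES ('tech')")
cur.execute("INSERT INTO blog (title, category_id) VALUES ('Hello HBase', 1)")
conn.commit()

# A typical read: list blog posts together with their category names.
cur.execute("""SELECT b.title, c.name FROM blog b
               JOIN category c ON b.category_id = c.id""")
print(cur.fetchall())  # [('Hello HBase', 'tech')]
```

Keeping the category in its own table means renaming a category touches one row instead of every blog post that uses it.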

Added by rich1983 on Fri, 15 Oct 2021 19:29:27 +0300

Spark Doris Connector design

Spark Doris Connector is a new feature introduced in Doris version 0.12. With it, users can read and write the data stored in Doris directly through Spark, with support for SQL, DataFrame, RDD, and other access methods. From Doris's perspective, introducing its data into Spark makes Spark's rich ecosystem of products available, bro ...

Added by jackyhuphp on Wed, 13 Oct 2021 00:15:42 +0300

Building a Hive environment and reading ES data into internal tables

Scenario: the project needs performance optimization, and we need to compare, for the same data, whether querying from Hive or from ES is more efficient. Therefore, we need to synchronize all the data of one ES index to HDFS and query the HDFS data through Hive to compare their efficiency. Step 1: preliminary pre ...
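A fair efficiency comparison needs repeated runs rather than a single measurement. The hedged harness below sketches that idea; the two query functions are placeholders standing in for the real Hive and ES clients, which are not shown in the excerpt:

```python
import time

def timed(query_fn, repeats=5):
    """Run query_fn several times and return the best wall-clock time,
    which damps out one-off noise such as cold caches."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn()
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder workloads; in the real comparison these would issue the
# same query against Hive (over the synced HDFS data) and against ES.
def query_hive():
    sum(range(100_000))

def query_es():
    sum(range(10_000))

hive_t, es_t = timed(query_hive), timed(query_es)
print("faster:", "hive" if hive_t < es_t else "es")
```

When measuring the real systems, the same query text and the same result size should be used on both sides, or the comparison measures serialization cost rather than query cost.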

Added by Naez on Wed, 13 Oct 2021 00:02:23 +0300