Cloudera series: using DataFrames and schemas
1. Create DataFrames from Data Sources
1. Data sources for DataFrames
DataFrames read data from and write data to data sources. Spark SQL supports a wide range of data source types and formats:
Text files: CSV, JSON, plain text
Binary format files: Apache Parquet, Apache ORC, Apache Avro
Tables: Hive ...
Added by ruach on Thu, 21 Oct 2021 17:47:06 +0300
HDFS basic operation
1. Viewing storage system information
hdfs dfsadmin -report [-live] [-dead] [-decommissioning]
Outputs basic information and related statistics for the file system.
[root@master ~]# hdfs dfsadmin -report
Outputs basic information and related statistics for the live nodes in the file system:
[root@master ~]# hdfs dfsadmin -report ...
Added by elklabone on Wed, 20 Oct 2021 08:55:17 +0300
Installing and configuring MySQL on Tencent Cloud and connecting with SQLyog
1. One-click MySQL installation and uninstallation commands (choose either one to execute)
(1) Installation command
sudo apt-get install mysql-server (installs the latest version)
sudo apt install mysql-server mysql-client
(2) Uninstall command: use when you need to reinstall MySQL after a problem, such as forgetting the initial ...
Added by ziong on Tue, 19 Oct 2021 04:19:29 +0300
JDBC database connection: Shang school notes
JDBC introduction
JDBC (Java Database Connectivity) is a common interface (a set of APIs), independent of any specific database management system, for general SQL database access and operation. It defines the standard Java class libraries used to access databases (java.sql, javax.sql). Using these class libraries, you can easily access database ...
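The core idea — one standard interface that each database vendor implements as a driver — exists in other languages too. For comparison, a minimal sketch of the same pattern using Python's standard DB-API with the stdlib sqlite3 driver (the table and data here are made up; the rough JDBC counterparts are noted in comments):

```python
# The JDBC pattern (standard API + vendor driver) sketched via Python's
# DB-API; sqlite3 is the driver, the API shape is driver-independent.
import sqlite3

conn = sqlite3.connect(":memory:")  # like DriverManager.getConnection(url)
cur = conn.cursor()                 # like Connection.createStatement()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
# Parameterized statement, like a PreparedStatement in JDBC:
cur.execute("INSERT INTO users VALUES (?, ?)", (1, "alice"))
conn.commit()

cur.execute("SELECT name FROM users WHERE id = ?", (1,))
row = cur.fetchone()                # like ResultSet.next() + getString()
print(row[0])                       # -> alice
conn.close()
```

Swapping sqlite3 for another DB-API driver (psycopg2, mysqlclient, ...) leaves the calling code essentially unchanged, which is exactly the portability JDBC aims for in Java.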
Added by djopie on Mon, 18 Oct 2021 21:37:46 +0300
Flink + Hudi lakehouse integration solution
Abstract: this article describes in detail how to build a prototype of the Flink + Hudi lakehouse integration scheme. The main contents are as follows:
Hudi
The new architecture and lakehouse integration
Best practices
Flink on Hudi
Flink CDC 2.0 on Hudi
Added by benzrf on Mon, 18 Oct 2021 07:38:52 +0300
Experiment 3: getting familiar with common HBase operations
1. Experimental purpose
(1) Understand the role of HDFS in the Hadoop architecture;
(2) Become proficient at using common HDFS Shell commands;
(3) Become familiar with the Java APIs commonly used to operate HDFS.
2. Experimental platform
Operating system: Linux (CentOS recommended); Hadoop version: 3.2.2; HBase version: 2.3.6; JDK version: 1.7 or abo ...
Added by pido on Sat, 16 Oct 2021 10:08:47 +0300
Data warehouse tool: Hive
1. What's Hive
1. Overview: the Apache Hive data warehouse software provides querying and management of large datasets stored in distributed storage. It is built on top of Apache Hadoop and mainly provides the following functions:
(1) a series of tools that can be used to extract/transform/load (ETL) data;
(2) a mechanism that can store, quer ...
Added by SleepyP on Sat, 16 Oct 2021 08:51:21 +0300
35-Blog Site Database-Blog Information Data Operation (2)
Project description
Nowadays, microblogs and blogs have become primary systems for publishing and disseminating information. To manage this data, this project operates mainly on the classified-information table and blog-information table of a blog website.
The blog s ...
Added by rich1983 on Fri, 15 Oct 2021 19:29:27 +0300
Spark Doris Connector design
Spark Doris Connector is a new feature introduced in Doris 0.12. It lets users read and write the data stored in Doris directly through Spark, and supports SQL, DataFrame, RDD and other access methods.
From Doris's perspective, bringing its data into Spark makes Spark's rich ecosystem of products available, bro ...
Added by jackyhuphp on Wed, 13 Oct 2021 00:15:42 +0300
Building a Hive environment + reading ES data into internal tables
Scenario:
The project needs performance optimization, comparing which is more efficient for querying the same data: Hive or ES. Therefore, we need to synchronize all the data of one ES index to HDFS and query the HDFS data through Hive to compare their efficiency.
Step 1: preliminary pre ...
Added by Naez on Wed, 13 Oct 2021 00:02:23 +0300