Introduction to big data

1. Introduction to big data

1. Data and data analysis

2. Role of data analysis

  • Current situation analysis
  • Cause analysis
  • Forecast analysis

3. Basic steps of data analysis

  • Clarify the purpose of analysis
  • Data collection
  • Data processing
  • Data analysis
  • Data presentation
  • Report writing

4. Big data

  • What is big data
  • The challenge of massive data
  • Characteristics of big data
    • Volume: large amounts of data to collect, store and compute;
    • Variety: many types and sources of data, including structured, semi-structured and unstructured data;
    • Value: value density is relatively low; extracting value is like panning for gold in sand;
    • Velocity: data grows fast, must be processed fast, and timeliness requirements are high;
    • Veracity: the accuracy and trustworthiness of the data, i.e. data quality.
  • Big data application scenarios

5. Distributed technology

  • What is distributed
    • A service is deployed to run on multiple computers that communicate with each other and expose one unified external interface
  • Common distributed solutions
    • Distributed applications (communicating via RPC), distributed storage and distributed computing
  • Distributed vs. cluster
    • Distributed: different functions of the same service are deployed on different computers, which together provide one unified service
    • Cluster: the same service is deployed on several computers, each of which can provide the service independently; together those computers are called a cluster

For example, with two machines:

Computer A: HDFS storage A, MR computation A
Computer B: HDFS storage B, MR computation B

Data stored in Hadoop is split across A and B.

2. Apache Zookeeper

1. Basic knowledge of Zookeeper

  • Introduction

  • Zookeeper is used to coordinate Hadoop services and to give Hadoop high availability (HA) through active/standby services

      Zookeeper is an open-source framework for distributed coordination services, mainly used to solve consistency problems of application systems in distributed clusters
      Zookeeper is essentially a distributed small-file storage system
    
  • Characteristics

    1. Global data consistency
    2. Reliability
    3. Ordering
    4. Atomicity of data updates
    5. Real-time behaviour
  • Cluster role

    1. Leader: schedules and processes transaction requests (write operations) and guarantees the order in which transactions are processed
    2. Follower: schedules and processes non-transaction requests (read operations) and forwards transaction requests to the Leader
    3. Observer: observes the running state of the cluster; it handles read operations independently and forwards write operations to the Leader, but does not take part in voting
  • Cluster construction

    1. Upload the archive to /export/software
    2. Unzip the archive to /export/server
    3. Rename the unzipped directory

cd /export/software
rz                                        # upload the Zookeeper archive to /export/software
tar zxvf zookeeper-3.4.6.tar.gz -C /export/server
cd /export/server
mv zookeeper-3.4.6/ zookeeper             # rename the directory
    4. Configure the environment variables

vim /etc/profile

export ZOOKEEPER_HOME=/export/server/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

After saving and exiting, remember to refresh the environment variables: source /etc/profile
    5. Enter the conf directory of Zookeeper
    6. Rename the configuration file

mv zoo_sample.cfg zoo.cfg

    7. Modify the configuration file

dataDir=/export/data/zkdata
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
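
The sample file already ships with working defaults (in Zookeeper 3.4.x these include tickTime=2000 and clientPort=2181), so only dataDir needs changing and the server.N lines need adding. A minimal resulting zoo.cfg sketch:

tickTime=2000                     # base time unit in milliseconds
initLimit=10                      # ticks a follower may take to connect to the Leader
syncLimit=5                       # ticks a follower may lag behind before being dropped
clientPort=2181                   # port that clients connect to
dataDir=/export/data/zkdata       # where snapshots and the myid file live
server.1=node1:2888:3888          # 2888: follower-to-leader port, 3888: election port
server.2=node2:2888:3888
server.3=node3:2888:3888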
    8. Create the data storage directory

mkdir -p /export/data/zkdata

    9. Write the service id (it must match this machine's server.N number in zoo.cfg)

echo 1 > /export/data/zkdata/myid
    10. Copy the Zookeeper service files from node1 to the other computers

scp -r /export/server/zookeeper root@node2:/export/server
scp -r /export/server/zookeeper root@node3:/export/server

    11. Set the service id on the other computers (2 on node2, 3 on node3, matching the server.N lines)

mkdir -p /export/data/zkdata
echo 2 > /export/data/zkdata/myid     # on node2; write 3 into myid on node3
    12. Copy the environment variable configuration file /etc/profile to the other computers and run source /etc/profile on each
    13. Start the service; on startup Zookeeper writes a small log file (zookeeper.out) into the current directory, which can be viewed with cat

zkServer.sh start     # start the service
zkServer.sh stop      # stop the service
zkServer.sh status    # view the status

Alternatively, enter the /export/server/zookeeper/bin directory and run ./zkServer.sh start directly.
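
Once at least two of the three nodes are running, zkServer.sh status shows which role each node was elected to; the output includes a Mode line (exact wording varies slightly between versions):

zkServer.sh status
# prints, among other lines, one of:
#   Mode: leader      (on exactly one node)
#   Mode: follower    (on the remaining nodes)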

2. Shell operations

1. Connection

zkCli.sh -server node1

2. Create node

 create [-s] [-e] path data [acl]
 -s creates a sequential node
 -e creates an ephemeral (temporary) node
 path is the path of the node to create
 data is the data written to the node
 acl is optional; a default ACL is used when it is omitted
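
A few illustrative calls inside the zkCli.sh shell (the /app paths are just examples):

create /app "hello"         # permanent node /app holding the string "hello"
create -e /app/tmp "x"      # ephemeral node, removed when this session ends
create -s /app/seq- "y"     # sequential node, created as e.g. /app/seq-0000000000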

3. View nodes

ls path     - list the children of a node
ls2 path    - list the children of a node plus its status (stat) information
get path    - show the data stored in a node plus its status information
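
Continuing with the /app node created above:

ls /        # e.g. [zookeeper, app]
ls2 /app    # children of /app plus stat fields such as cZxid and dataVersion
get /app    # "hello" plus the same stat fields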

4. Modification

set path data    - overwrite the data of a node

5. Delete

delete path    - delete a node (the node must have no children)
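
Continuing the example:

set /app "new-data"    # replace the data of /app
delete /app/tmp        # remove the child node; delete fails on a node that still has children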

6. Node quotas

setquota -n|-b val path
-n limits the number of child nodes
-b limits the data size in bytes
val is the limit value
path is the node path

Quotas are soft limits: when one is exceeded, Zookeeper only writes a warning to its log.

delquota path    - delete the quota on a node
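
Illustrative usage:

setquota -n 5 /app      # warn once /app has more than 5 child nodes
listquota /app          # show the quota currently set on /app
delquota /app           # remove the quota
setquota -b 100 /app    # alternatively, limit the data under /app to 100 bytes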

7. Recursive delete

rmr path    - delete a node and all of its children recursively

8. View command history

history    - list the recently executed commands

3. Data model

A tree hierarchy in which every node is called a Znode.

  • A Znode behaves both like a file (it stores data) and like a directory (it can contain child nodes)
  • Operations on a Znode are atomic
  • The data stored in a Znode is limited to 1 MB
  • A Znode is referenced by its path
  • Node information
    • stat: status information describing the node, such as version and permissions
    • data: the data associated with the node
    • children: information about the node's children
  • Node types
    • Ephemeral (temporary) nodes
      • An ephemeral node exists as long as the client session that created it stays connected; once the client disconnects, the node is removed automatically
    • Permanent (persistent) nodes
    • Sequential property
      • Nodes are assigned increasing sequence numbers, which makes them ordered
  • Node attributes
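
A short zkCli.sh session illustrating ephemeral behaviour (the /lock path is illustrative):

# session 1
create -e /lock "owner-1"    # ephemeral node tied to this client session
# session 2
ls /                         # /lock is visible while session 1 is alive
# after session 1 quits (or its session times out)
ls /                         # /lock is gone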

Extension

1. Big data: in essence, a set of theories and techniques

2. Big data manages and maintains large volumes of data through data warehouses

3. The Hadoop technology stack is used to build offline data warehouses

Data sources: business data (MySQL, MongoDB), log data, Excel files



ETL describes how data gets into the data warehouse:

Data extraction: getting data from each source, i.e. the data collection process (event tracking, web crawlers, Flume, Sqoop)

Data transformation: the cleaning process (deduplication, removal of empty data), typically expressed in SQL

Data loading: the process of storing the data into the data warehouse
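
As an illustration of the extraction step, a hedged Sqoop sketch that pulls a MySQL table into HDFS (host, database, table, credentials and target directory are all placeholders):

sqoop import \
  --connect jdbc:mysql://node1:3306/shop \
  --username root \
  --password 123456 \
  --table orders \
  --target-dir /warehouse/ods/orders \
  --num-mappers 1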



Data analysis: by classifying, clustering and correlating the data, the relevant metric data are extracted; the causes behind those metrics are then analysed in order to propose solutions



Distributed technology lets multiple computers provide one unified service

Hadoop is built on distributed technology

Hadoop's core services are HDFS, MR and YARN

Hive is used to add, delete, modify and query the data stored in Hadoop

Zookeeper is used to manage Hadoop nodes and achieve high availability

Keywords: Hadoop, Hive, Zookeeper, distributed
