Introduction to big data

1. Introduction to big data

1. Data and data analysis

2. Role of data analysis

  • Current situation analysis
  • Cause analysis
  • Forecast analysis

3. Basic steps of data analysis

  • Clarify the purpose of analysis
  • Data collection
  • Data processing
  • Data analysis
  • Data presentation
  • Report writing

4. Big data

  • What is big data
  • The challenge of massive data
  • Characteristics of big data
    • Volume: large amounts of data to collect, store and compute;
    • Variety: many types and sources of data, including structured, semi-structured and unstructured data;
    • Value: value density is relatively low; extracting value is like panning for gold in sand;
    • Velocity: data grows fast, must be processed fast, and timeliness requirements are high;
    • Veracity: the accuracy and trustworthiness of the data, i.e. data quality.
  • Big data application scenarios

5. Distributed technology

  • What is distributed
    • A service is deployed to run on multiple computers that communicate with each other and expose one unified external interface
  • Common distributed solutions
    • Distributed applications (communicating via RPC), distributed storage and distributed computing
  • Distributed vs. cluster
    • Distributed: different functions of the same service are deployed on different computers, which together provide one unified service
    • Cluster: the same service is deployed on several computers, each of which can provide the service independently; together those computers are called a cluster

For example, with two machines:

Computer A: HDFS storage A, MR computation A
Computer B: HDFS storage B, MR computation B

Data stored in Hadoop is split across A and B.

2. Apache Zookeeper

1. Basic knowledge of Zookeeper

  • Introduction

  • Zookeeper is used to coordinate Hadoop services and to give Hadoop high availability (HA) through active/standby services

      Zookeeper is an open-source framework for distributed coordination services, mainly used to solve consistency problems of application systems in distributed clusters
      Zookeeper is essentially a distributed small-file storage system
    
  • Characteristics

    1. Global data consistency
    2. Reliability
    3. Ordering
    4. Atomicity of data updates
    5. Real-time behaviour
  • Cluster role

    1. Leader: schedules and processes transaction requests (write operations) and guarantees the order in which transactions are processed
    2. Follower: schedules and processes non-transaction requests (read operations) and forwards transaction requests to the Leader
    3. Observer: observes the running state of the cluster; it handles read operations independently and forwards write operations to the Leader, but does not take part in voting
  • Cluster construction

    1. Upload the archive to /export/software
    2. Unzip the archive to /export/server
    3. Rename the unzipped directory

cd /export/software
rz                                        # upload the Zookeeper archive to /export/software
tar zxvf zookeeper-3.4.6.tar.gz -C /export/server
cd /export/server
mv zookeeper-3.4.6/ zookeeper             # rename the directory
    4. Configure the environment variables

vim /etc/profile

export ZOOKEEPER_HOME=/export/server/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

After saving and exiting, remember to refresh the environment variables: source /etc/profile
    5. Enter the conf directory of Zookeeper
    6. Rename the configuration file

mv zoo_sample.cfg zoo.cfg

    7. Modify the configuration file

dataDir=/export/data/zkdata
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
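
The sample file already ships with working defaults (in Zookeeper 3.4.x these include tickTime=2000 and clientPort=2181), so only dataDir needs changing and the server.N lines need adding. A minimal resulting zoo.cfg sketch:

tickTime=2000                     # base time unit in milliseconds
initLimit=10                      # ticks a follower may take to connect to the Leader
syncLimit=5                       # ticks a follower may lag behind before being dropped
clientPort=2181                   # port that clients connect to
dataDir=/export/data/zkdata       # where snapshots and the myid file live
server.1=node1:2888:3888          # 2888: follower-to-leader port, 3888: election port
server.2=node2:2888:3888
server.3=node3:2888:3888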
    8. Create the data storage directory

mkdir -p /export/data/zkdata

    9. Write the service id (it must match this machine's server.N number in zoo.cfg)

echo 1 > /export/data/zkdata/myid
    10. Copy the Zookeeper service files from node1 to the other computers

scp -r /export/server/zookeeper root@node2:/export/server
scp -r /export/server/zookeeper root@node3:/export/server

    11. Set the service id on the other computers (2 on node2, 3 on node3, matching the server.N lines)

mkdir -p /export/data/zkdata
echo 2 > /export/data/zkdata/myid     # on node2; write 3 into myid on node3
    12. Copy the environment variable configuration file /etc/profile to the other computers and run source /etc/profile on each
    13. Start the service; on startup Zookeeper writes a small log file (zookeeper.out) into the current directory, which can be viewed with cat

zkServer.sh start     # start the service
zkServer.sh stop      # stop the service
zkServer.sh status    # view the status

Alternatively, enter the /export/server/zookeeper/bin directory and run ./zkServer.sh start directly.
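
Once at least two of the three nodes are running, zkServer.sh status shows which role each node was elected to; the output includes a Mode line (exact wording varies slightly between versions):

zkServer.sh status
# prints, among other lines, one of:
#   Mode: leader      (on exactly one node)
#   Mode: follower    (on the remaining nodes)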

2. Shell operations

1. Connection

zkCli.sh -server node1

2. Create node

 create [-s] [-e] path data [acl]
 -s creates a sequential node
 -e creates an ephemeral (temporary) node
 path is the path of the node to create
 data is the data written to the node
 acl is optional; a default ACL is used when it is omitted
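
A few illustrative calls inside the zkCli.sh shell (the /app paths are just examples):

create /app "hello"         # permanent node /app holding the string "hello"
create -e /app/tmp "x"      # ephemeral node, removed when this session ends
create -s /app/seq- "y"     # sequential node, created as e.g. /app/seq-0000000000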

3. View nodes

ls path     - list the children of a node
ls2 path    - list the children of a node plus its status (stat) information
get path    - show the data stored in a node plus its status information
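
Continuing with the /app node created above:

ls /        # e.g. [zookeeper, app]
ls2 /app    # children of /app plus stat fields such as cZxid and dataVersion
get /app    # "hello" plus the same stat fields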

4. Modification

set path data    - overwrite the data of a node

5. Delete

delete path    - delete a node (the node must have no children)
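
Continuing the example:

set /app "new-data"    # replace the data of /app
delete /app/tmp        # remove the child node; delete fails on a node that still has children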

6. Node quotas

setquota -n|-b val path
-n limits the number of child nodes
-b limits the data size in bytes
val is the limit value
path is the node path

Quotas are soft limits: when one is exceeded, Zookeeper only writes a warning to its log.

delquota path    - delete the quota on a node
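
Illustrative usage:

setquota -n 5 /app      # warn once /app has more than 5 child nodes
listquota /app          # show the quota currently set on /app
delquota /app           # remove the quota
setquota -b 100 /app    # alternatively, limit the data under /app to 100 bytes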

7. Recursive delete

rmr path    - delete a node and all of its children recursively

8. View command history

history    - list the recently executed commands

3. Data model

A tree hierarchy in which every node is called a Znode.

  • A Znode behaves both like a file (it stores data) and like a directory (it can contain child nodes)
  • Operations on a Znode are atomic
  • The data stored in a Znode is limited to 1 MB
  • A Znode is referenced by its path
  • Node information
    • stat: status information describing the node, such as version and permissions
    • data: the data associated with the node
    • children: information about the node's children
  • Node types
    • Ephemeral (temporary) nodes
      • An ephemeral node exists as long as the client session that created it stays connected; once the client disconnects, the node is removed automatically
    • Permanent (persistent) nodes
    • Sequential property
      • Nodes are assigned increasing sequence numbers, which makes them ordered
  • Node attributes
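
A short zkCli.sh session illustrating ephemeral behaviour (the /lock path is illustrative):

# session 1
create -e /lock "owner-1"    # ephemeral node tied to this client session
# session 2
ls /                         # /lock is visible while session 1 is alive
# after session 1 quits (or its session times out)
ls /                         # /lock is gone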

Extension

1. Big data: in essence, a set of theories and techniques

2. Big data manages and maintains large volumes of data through data warehouses

3. The Hadoop technology stack is used to build offline data warehouses

Data sources: business data (MySQL, MongoDB), log data, Excel files



ETL describes how data gets into the data warehouse:

Data extraction: getting data from each source, i.e. the data collection process (event tracking, web crawlers, Flume, Sqoop)

Data transformation: the cleaning process (deduplication, removal of empty data), typically expressed in SQL

Data loading: the process of storing the data into the data warehouse
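
As an illustration of the extraction step, a hedged Sqoop sketch that pulls a MySQL table into HDFS (host, database, table, credentials and target directory are all placeholders):

sqoop import \
  --connect jdbc:mysql://node1:3306/shop \
  --username root \
  --password 123456 \
  --table orders \
  --target-dir /warehouse/ods/orders \
  --num-mappers 1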



Data analysis: by classifying, clustering and correlating the data, the relevant metric data are extracted; the causes behind those metrics are then analysed in order to propose solutions



Distributed technology lets multiple computers provide one unified service

Hadoop is built on distributed technology

Hadoop's core services are HDFS, MR and YARN

Hive is used to add, delete, modify and query the data stored in Hadoop

Zookeeper is used to manage Hadoop nodes and achieve high availability

Keywords: Hadoop, Hive, Zookeeper, distributed
