1, Introduction to big data
1. Data and data analysis
2. Role of data analysis
- Current situation analysis
- Cause analysis
- Forecast analysis
3. Basic steps of data analysis
- Clarify the purpose of analysis
- Data collection
- Data processing
- Data analysis
- Data presentation
- Report writing
4. Big data
- What is big data
- The challenge of massive data
- Characteristics of big data
- Volume: large amounts of data to collect, store, and compute;
- Variety: many types and sources of data, including structured, semi-structured, and unstructured data;
- Value: low value density; useful information must be extracted from huge volumes, like panning for gold;
- Velocity: fast data growth and fast processing, with high timeliness requirements;
- Veracity: the accuracy and trustworthiness of data, i.e. data quality.
- Big data application scenarios
5. Distributed technology
- What is distributed
- A service is deployed across multiple computers that can communicate with each other and expose one unified external interface
- Common distributed solutions
- Distributed applications (communicating via RPC), distributed storage, and distributed computing
- Distributed cluster
- Distributed: different functions of the same service are deployed on different computers, which together provide one unified service
- Cluster: the same service is deployed on multiple computers, each of which can provide the service independently; such a group of computers is called a cluster
Computer A: HDFS storage A + MR computation A
Computer B: HDFS storage B + MR computation B
Data stored in Hadoop is distributed across A and B.
2, Apache Zookeeper
1. Basic knowledge of Zookeeper
- Introduction
- Zookeeper is used to manage Hadoop services and to provide high availability (HA) for Hadoop: a standby service backs up the main service
- Zookeeper is an open-source framework for distributed coordination services, mainly used to solve the consistency problems of application systems in distributed clusters; it is essentially a distributed small-file storage system
- Characteristics
- Global data consistency
- Reliability
- Ordering
- Atomicity of data updates
- Real-time
- Cluster roles
- Leader: responsible for scheduling and processing transaction requests (write operations) and for ensuring that transactions are processed in order
- Follower: responsible for scheduling and processing non-transaction requests (read operations); forwards transaction requests to the Leader
- Observer: watches the running state of the cluster; handles read operations independently and forwards write operations to the Leader
- Cluster setup
- Upload the package to /export/software
- Unzip it to /export/server

cd /export/software
rz                                   # upload the Zookeeper tarball to /export/software
tar zxvf zookeeper-3.4.6.tar.gz -C /export/server
cd /export/server
mv zookeeper-3.4.6/ zookeeper        # rename the directory
- Configure environment variables
vim /etc/profile

export ZOOKEEPER_HOME=/export/server/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin

After saving and exiting, remember to reload the environment variables: source /etc/profile
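To confirm the variables are active in the current shell (paths as configured above):

source /etc/profile
echo $ZOOKEEPER_HOME      # expect /export/server/zookeeper
which zkServer.sh         # expect /export/server/zookeeper/bin/zkServer.sh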
- Enter the conf directory of Zookeeper

cd /export/server/zookeeper/conf

- Rename the sample configuration file
mv zoo_sample.cfg zoo.cfg
- Edit the configuration file contents

dataDir=/export/data/zkdata
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
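For context, a fuller zoo.cfg sketch; everything besides dataDir and the server.* lines is a default carried over from zoo_sample.cfg:

tickTime=2000                    # base time unit in milliseconds
initLimit=10                     # ticks a follower may take to connect and sync with the leader
syncLimit=5                      # ticks a follower may lag behind the leader
clientPort=2181                  # port that clients (e.g. zkCli.sh) connect to
dataDir=/export/data/zkdata
server.1=node1:2888:3888         # 2888: follower-to-leader channel, 3888: leader election
server.2=node2:2888:3888
server.3=node3:2888:3888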
- Create data store directory
mkdir -p /export/data/zkdata    # -p also creates any missing parent directories
- Specify the server ID (it must match the server.N number in zoo.cfg)

echo 1 > /export/data/zkdata/myid
- Copy the Zookeeper directory from node1 to the other computers

scp -r /export/server/zookeeper root@node2:/export/server
scp -r /export/server/zookeeper root@node3:/export/server
- Set the server ID on the other computers (matching zoo.cfg: 2 on node2, 3 on node3)

mkdir -p /export/data/zkdata     # on node2
echo 2 > /export/data/zkdata/myid

mkdir -p /export/data/zkdata     # on node3
echo 3 > /export/data/zkdata/myid
- Configure /etc/profile on the other computers in the same way, then run source /etc/profile on each
- Start and manage the service

zkServer.sh start     # start the service
zkServer.sh stop      # stop the service
zkServer.sh status    # view status (shows whether this node is the leader or a follower)

Alternatively, enter /export/server/zookeeper/bin and run ./zkServer.sh start directly.
When the service starts, it writes a small log file in the current directory; view its messages with cat.
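Since zkServer.sh has to be run on every node, a small helper script saves typing. A minimal sketch, assuming passwordless ssh from node1 to all three nodes and the /etc/profile setup above (the name zk-all.sh is just an example):

#!/bin/bash
# zk-all.sh -- run a zkServer.sh action (start|stop|status) on all three nodes
for host in node1 node2 node3; do
  echo "=== $host ==="
  ssh root@$host "source /etc/profile; zkServer.sh $1"
done

Usage: ./zk-all.sh start (or stop, or status).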
2. Shell operations
1. Connection
zkCli.sh -server node1    # connect to the Zookeeper server on node1 (default client port 2181)
2. Create a node
create [-s] [-e] path data acl
  -s      create a sequential node
  -e      create an ephemeral (temporary) node
  path    path of the node to create
  data    data written to the node
  acl     access control list (generated automatically if omitted)
3. View nodes
ls path     # show the node's basic information and whether it has child nodes
ls2 path    # show the node's detailed information and child nodes
get path    # show the node's detailed information and the data written to it
4. Modification
set path data    # update the node's data
5. Delete
delete path    # delete the node (the node must have no children)
6. Node restrictions
setquota -n|-b val path
  -n      limit the number of child nodes
  -b      limit the data size in bytes
  val     the limit value
  path    node path
delquota path    # remove the quota
7. Delete recursively (multi-level)
rmr path    # delete the node and all of its children
8. View command history
history
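A minimal example session tying these commands together (the /app path and the data values are placeholders):

create /app "hello"         # permanent node
create -e /app/tmp "x"      # temporary node, removed when this session closes
create -s /app/seq "y"      # sequential node, named e.g. /app/seq0000000001
ls /app                     # list the children of /app
get /app                    # show /app's data and status
set /app "world"            # update the data
delete /app/tmp             # delete a single node
rmr /app                    # delete /app and everything under it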
3. Data model
Tree hierarchy, each node is called Znode
- A Znode behaves both like a file (it stores data) and like a directory (it can have child nodes)
- Operations on a Znode are atomic
- A Znode can store at most 1 MB of data
- Znode is referenced by path
- Node information
- stat: status information describing the node, such as its version and permissions
- data: the data associated with the node
- children: information about the node's child nodes
- Node types
- Temporary node
- A temporary (ephemeral) node exists only while the client session that created it stays connected; once the client disconnects, the node is removed automatically
- Permanent node
- A permanent node persists until it is explicitly deleted
- Sequential property
- Nodes are assigned an automatically increasing number suffix, making them ordered; the property can be combined with both temporary and permanent nodes
- Node attributes
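The node attributes are the stat fields that get prints alongside a node's data. A quick way to inspect them (field names as printed by Zookeeper 3.4; /app is a placeholder):

zkCli.sh -server node1
get /app
# Typical stat fields:
#   cZxid / mZxid    transaction IDs of the create / last-modify operations
#   ctime / mtime    creation / last-modification timestamps
#   dataVersion      data version, incremented by every set
#   ephemeralOwner   session ID of the owner for a temporary node (0x0 for a permanent node)
#   dataLength       size of the stored data in bytes
#   numChildren      number of child nodes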
3, Extension (summary)
1. Big data is, in essence, a set of theories and methods.
2. Big data manages and maintains large amounts of data through data warehouses.
3. The Hadoop technology stack is used to build offline data warehouses.
- Data sources: business data (MySQL, MongoDB), log data, Excel files
- ETL describes how the data warehouse is populated:
- Extraction: getting data from the sources, i.e. the data collection process (event tracking, web crawlers, Flume, Sqoop)
- Transformation: the cleaning process (deduplication, removing empty values), commonly done with SQL
- Loading: storing the cleaned data into the data warehouse
- Data analysis: classify, cluster, and correlate the data to derive metrics, analyze causes through those metrics, and provide solutions
4. Distributed technology lets multiple computers provide a unified service; Hadoop is built on distributed technology.
5. Hadoop core services: HDFS, MR (MapReduce), and YARN.
6. Hive is used to add, delete, modify, and query the data stored in Hadoop.
7. Zookeeper manages Hadoop nodes to achieve high availability.