1 - learning objectives
- What does InfluxDB do? What are its advantages? Which scenarios is it suited to?
- What are the cluster schemes?
- Compared with other similar databases, what are the advantages and disadvantages?
- Detailed explanation of configuration file, how to optimize performance?
- Basic operation, data backup and recovery
- QA summary
2 - what does InfluxDB do? What are its advantages? Which scenarios is it suited to?
2.1 - Introduction
InfluxDB is a time series database written in Go. It is suited to storing large volumes of timestamped data: monitoring data, logs, application metrics, analytics data and so on.
You do not need to delete old data by hand. Define how long data should be kept and InfluxDB cleans up expired data automatically.
The default port of InfluxDB is 8086, and the default interface is an HTTP API.
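As a rough illustration of the HTTP interface on port 8086, the sketch below builds (but does not send) a write request for the InfluxDB 1.x `/write` endpoint. The host, database name and the sample point are assumptions chosen for the example; the request body uses InfluxDB line protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(host, database, lines, port=8086, precision="ns"):
    """Build (but do not send) a POST to the InfluxDB 1.x /write endpoint."""
    query = urlencode({"db": database, "precision": precision})
    url = f"http://{host}:{port}/write?{query}"
    body = "\n".join(lines).encode("utf-8")  # one line-protocol point per line
    return Request(url, data=body, method="POST")


# Hypothetical point: measurement "cpu", one tag, one field, explicit timestamp.
req = build_write_request(
    "localhost", "testdb", ["cpu,host=server01 usage=0.64 1565308800000000000"]
)
print(req.get_method(), req.full_url)
# To send it against a running server: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example runnable without a server.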
- Some concepts
measurement: similar to a table in MySQL
point: similar to a row of data in MySQL
timestamp: the time of the data point. You can specify it yourself when inserting; if you do not, InfluxDB uses the current server time with nanosecond precision
retention policy: sets how long data is retained (expired data is deleted automatically) and the number of replicas
tag key: comparable to an indexed column in MySQL. The type can only be string
tag value: the value corresponding to a tag key
tag set: the tags contained in a data point. The format is <tag_key>=<tag_value>; multiple tags are separated by commas, for example: <tag_key>=<tag_value>,<tag_key>=<tag_value>
field key: comparable to a column without an index in MySQL. The type can be string, floating point number, etc.
field value: the value corresponding to a field key
field set: the fields contained in a data point, in the same format as the tag set. Each data point must have at least one field
series: within one database, data with the same retention policy, measurement and tag set belongs to one series, which identifies where the data comes from. The data of a series is physically stored in chronological order. Series information is kept in memory, so too many series can lead to OOM
shard: stores the data of a certain time interval. Each directory corresponds to one shard, and the directory name is the shard id. Each shard has its own cache, wal, tsm files and compactor. The purpose is to locate the resources for the queried data quickly by time, speed up queries, and make later batch deletion of data simple and efficient
continuous query: a query statement that runs on a schedule and aggregates multiple sample points in a period of time into one, which compresses the data and speeds up queries
retention policy: automatically deletes old data by setting how long data is retained, so that disk usage does not grow without bound
The two are usually used together: the continuous query compresses the original data and saves it to another table, and the retention policy is responsible for regularly deleting the old original data.
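The interplay of continuous queries and retention policies described above can be sketched in plain Python. This is a simplified model, not how InfluxDB implements either feature: `downsample` stands in for a continuous query that takes the mean per time bucket, and `apply_retention` stands in for a retention policy that drops individual old points (the real server expires data in larger units). All names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Aggregate (timestamp, value) points into one mean value per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points, now, keep_seconds):
    """Drop points older than the retention window."""
    return [(ts, v) for ts, v in points if ts >= now - keep_seconds]


raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0), (120, 9.0)]
summary = downsample(raw, bucket_seconds=60)          # kept long term in a second "table"
raw = apply_retention(raw, now=130, keep_seconds=60)  # old raw points expire
print(summary)
print(raw)
```

The summary keeps one point per minute while the raw data shrinks to the most recent window, which is the storage saving the two features aim for when combined.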
- Hardware recommendation
If your workload meets any of the following conditions, a single node (InfluxDB OSS) may not be enough:
- More than 750,000 field writes per second
- More than 100 queries per second
- A series cardinality (i.e. the number of series) above 10,000,000
Generally, more RAM improves query speed. Your RAM requirements depend mainly on series cardinality: higher cardinality requires more RAM. Regardless of how much RAM is available, a series cardinality above 10 million may lead to OOM (out of memory) failures. In general, you can solve OOM problems by redesigning your schema.
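To make series cardinality concrete, here is a minimal sketch that counts distinct series among a set of points and estimates a worst case from the number of distinct values per tag key. It assumes a single database and retention policy, and the measurement names, tag keys and counts are made up for illustration.

```python
def series_key(measurement, tags):
    """A series is identified by its measurement plus its full tag set."""
    return (measurement, tuple(sorted(tags.items())))


def series_cardinality(points):
    """Count distinct series among (measurement, tags) pairs."""
    return len({series_key(m, t) for m, t in points})


def worst_case_cardinality(tag_value_counts):
    """Upper bound for one measurement: product of distinct values per tag key."""
    total = 1
    for count in tag_value_counts.values():
        total *= count
    return total


points = [
    ("cpu", {"host": "a", "region": "eu"}),
    ("cpu", {"region": "eu", "host": "a"}),  # same series, tags in another order
    ("cpu", {"host": "b", "region": "eu"}),
    ("mem", {"host": "a"}),
]
print(series_cardinality(points))
print(worst_case_cardinality({"host": 1000, "container_id": 20000}))
```

The second call shows how a single high-cardinality tag such as a container id can push the upper bound past the 10 million mark, which is why schema redesign is usually the fix for OOM.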
InfluxDB is designed to run on solid-state drives (SSDs) and memory-optimized cloud instances such as AWS EC2 R5 or R4 instances. InfluxDB has not been tested on hard disk drives (HDDs), and we do not recommend using HDDs in production. For best results, the InfluxDB server must have at least 1000 IOPS on storage to ensure recovery and availability. We recommend at least 2000 IOPS to quickly recover cluster data nodes after downtime.
When running InfluxDB in a production environment, store the wal directory and the data directory on separate storage devices. This optimization can significantly reduce disk contention under heavy write load, which is an important consideration if the write load is highly variable. If the write load does not vary by more than 15%, this optimization may not be required.
2.2 - advantages and disadvantages
For the time series use case, we assume that if the same data is sent multiple times, it is exactly the same data that a client simply sent several times.
Advantages: simplified conflict resolution improves write performance.
Disadvantages: duplicate data cannot be stored; in rare cases, data may be overwritten.
Deletes are rare. When they do occur, they almost always target large ranges of old data that is cold for writes.
Advantages: restricting deletes improves query and write performance.
Disadvantages: delete functionality is severely limited.
Updates to existing data are rare, and contentious updates never happen. Time series data is predominantly new data that is never updated.
Advantages: restricting updates improves query and write performance.
Disadvantages: update functionality is severely limited.
The vast majority of writes are for data with very recent timestamps, and the data is added in ascending time order.
Advantages: adding data in ascending time order performs significantly better.
Disadvantages: writing points with random timestamps, or with timestamps not in ascending order, performs noticeably worse.
Scale matters. The database must be able to handle large volumes of reads and writes.
Advantages: the database can handle high-volume read and write operations.
Disadvantages: the InfluxDB development team was forced to make trade-offs to improve performance.
Being able to write and query data is more important than having a strongly consistent view.
Advantages: multiple clients can write to and query the database under high load.
Disadvantages: if the database is heavily loaded, query results may not include the most recent points.
Many time series are transient. Often a time series appears for only a few hours and then disappears. For example, a new host starts up, reports for a while, and then shuts down.
Advantages: InfluxDB is good at managing discontinuous data.
Disadvantages: the schemaless design means that some database functions are not supported; for example, there are no cross-table joins.
No one point is too important.
Advantages: InfluxDB has very powerful tools for dealing with aggregated data and large data sets.
Disadvantages: points have no traditional ID; they are distinguished by timestamp and series.
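Because a point is identified only by its series and timestamp, writing the same series and timestamp twice does not create a second row. The sketch below models that behaviour with a dictionary: the field sets are merged and the newer values win. It is an illustrative model with hypothetical names, not InfluxDB's storage code.

```python
def write_point(store, measurement, tags, fields, timestamp):
    """Insert a point; an existing point with the same series and timestamp is merged."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    store.setdefault(key, {}).update(fields)  # newer field values overwrite older ones
    return store


store = {}
write_point(store, "cpu", {"host": "a"}, {"usage": 0.5, "idle": 0.5}, 100)
write_point(store, "cpu", {"host": "a"}, {"usage": 0.9}, 100)  # same series + timestamp
write_point(store, "cpu", {"host": "a"}, {"usage": 0.7}, 101)  # new timestamp, new point
print(len(store))
print(store[("cpu", (("host", "a"),), 100)])
```

If two genuinely different observations could share a timestamp, adding a distinguishing tag or using a finer timestamp precision keeps them from overwriting each other.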
2.3 - installation and use
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.0.x86_64.rpm
yum -y localinstall influxdb-1.8.0.x86_64.rpm

# After installing InfluxDB, the following files are present.
# Under /usr/bin:
#   influxd          InfluxDB server
#   influx           InfluxDB command line client
#   influx_inspect   inspection tool
#   influx_stress    stress test tool
#   influx_tsm       database conversion tool (converts a database from b1 or bz1 format to tsm1 format)
# Under /var/lib/influxdb:
#   data             final stored data, in files ending in .tsm
#   meta             database metadata
#   wal              write-ahead log files
# Under /var/log/influxdb:
#   influxd.log      log file
# Under /etc/influxdb:
#   influxdb.conf    configuration file
# Under /var/run/influxdb:
#   influxd.pid      PID file

systemctl start influxdb
systemctl enable influxdb
2.4 - cluster scheme
There is no free cluster scheme: InfluxDB OSS runs as a single node.
3 - comparison with similar databases
4 - detailed explanation of configuration file
5 - how to optimize performance
For details, see https://blog.csdn.net/qq_35550345/article/details/115751138
6 - basic operation
6.1 - database operation
Log in to the database: influx -precision rfc3339 (or simply: influx)
Log in to a specified database: influx -precision rfc3339 -database NOAA_water_database
View existing databases: show databases;
Create database testdb: create database testdb;
Create a database with a retention policy: CREATE DATABASE "NOAA_water_database" WITH DURATION 3d REPLICATION 1 SHARD DURATION 1h NAME "liquid"
Delete database testdb: drop database testdb;
Select database testdb (set as current default): use testdb;
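The `DURATION 3d` and `SHARD DURATION 1h` values in the CREATE DATABASE statement above determine how many shard-group intervals fit in the retention window. The following sketch does that arithmetic; it parses only simple single-unit duration literals, and the function names are hypothetical. The server may briefly hold a little more data than this, since expiry is assumed here to happen a whole shard group at a time.

```python
import re

# Seconds per unit for a subset of InfluxQL duration units.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text):
    """Convert a simple duration literal such as '3d' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def shard_groups_in_window(duration, shard_duration):
    """Number of shard-group intervals needed to span the retention window."""
    window = parse_duration(duration)
    interval = parse_duration(shard_duration)
    return -(-window // interval)  # ceiling division


print(shard_groups_in_window("3d", "1h"))
```

A very short shard duration relative to the retention period means many small shards, so the two values are worth choosing together.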
6.2 - table operation
View existing measurements: show measurements;
Create a measurement: InfluxDB has no statement for creating a measurement; it is created automatically when data is inserted.
Delete measurement weather: drop measurement weather;
6.3 - user management
Create a user: CREATE USER <username> WITH PASSWORD '<password>'
Grant privileges: GRANT [READ,WRITE,ALL] ON <database_name> TO <username>
Create a user with all privileges (admin): CREATE USER <username> WITH PASSWORD '<password>' WITH ALL PRIVILEGES
Revoke privileges: REVOKE ALL PRIVILEGES FROM <username>
Change password: SET PASSWORD FOR <username> = '<password>'
Delete a user: DROP USER <username>
6.4 - addition, deletion, modification and query
# select
SELECT <field_key>[,<field_key>,<tag_key>] FROM <measurement_name>[,<measurement_name>]
SELECT *                              # Returns all fields and tags.
SELECT "<field_key>"                  # Returns a specific field.
SELECT "<field_key>","<field_key>"    # Returns multiple fields.
SELECT "<field_key>","<tag_key>"      # Returns specific fields and tags. When the SELECT clause contains a tag, at least one field must be specified.
SELECT * FROM test WHERE time >= '2019-08-09T00:00:00Z' and time < '2019-08-09T10:00:00Z'

# show
SHOW RETENTION POLICIES ON NOAA_water_database
SHOW SERIES ON NOAA_water_database FROM "h2o_quality" WHERE "location" = 'coyote_creek' LIMIT 2
SHOW TAG KEYS ON "NOAA_water_database"
SHOW TAG VALUES ON "NOAA_water_database" WITH KEY = "randtag"
SHOW FIELD KEYS ON "NOAA_water_database" FROM "h2o_feet"

# delete
DELETE FROM <measurement_name> WHERE [<tag_key>='<tag_value>'] | [<time interval>]
DELETE FROM "h2o_quality" WHERE "randtag" = '3'
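For the "addition" part of this section, data is written as points in line protocol: measurement, optional tag set, field set and optional timestamp. The sketch below is a minimal formatter under stated assumptions: the helper names are hypothetical, the sample values are invented, and only the common escaping rules are handled.

```python
def _escape(text, chars):
    """Backslash-escape the characters that are special in this position."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field_value(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp=None):
    """Format one point; tag values are assumed to be strings."""
    if not fields:
        raise ValueError("a point needs at least one field")
    line = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):  # sorted tag keys give a stable series key
        line += f",{_escape(key, ',= ')}={_escape(value, ',= ')}"
    line += " " + ",".join(
        f"{_escape(key, ',= ')}={format_field_value(value)}"
        for key, value in fields.items()
    )
    if timestamp is not None:
        line += f" {timestamp}"
    return line


print(to_line_protocol("h2o_quality", {"randtag": "3", "location": "coyote_creek"},
                       {"index": 41}, 1565308800000000000))
```

In the influx CLI, a line produced this way can be written by prefixing it with INSERT, for example `INSERT weather,location=us-midwest temperature=82.0`.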
6.5 - other common operations
# Query the number of series in the testdb database, in interactive mode:
select * from _internal.."database" where "database"='testdb' order by time desc limit 1;

# Query the number of series of a single measurement from the command line:
influx -database 'testdb' -execute 'show series from testmeasurement' -format 'csv' | wc -l

# Analyze and execute a statement in interactive mode to obtain the actual execution time:
explain analyze select count(*) from testmeasurement;
7 - QA summary