MySQL chapter of data analysis

1, Database cognition

(1) What is a database

  1. Database:
    1) A warehouse that organizes, stores, and manages data according to data structures
    2) A container that holds organized data
    3) A special file used to store data

(2) Type of database

  1. Commercial databases: Oracle, SQL Server, DB2, etc
  2. Open source databases: MySQL, PostgreSQL, etc
  3. Desktop database: represented by Microsoft Access, it is suitable for desktop applications

(3) Composition of database

  1. Column: a field in a table
  2. Row: a record in a table
  3. Table: table is structured information; Row and column composition table
  4. Database: multiple tables form a database
  5. How much data can be stored in MySQL:
    1) MySQL can create up to 2 billion tables per database
    2) A table allows 1024 columns (i.e. 1024 fields) to be defined
    3) SQL has no direct limit on the number of rows in each table, but it is limited by the storage space of the database

(4) Introduction to SQL

  1. SQL: a database query and programming language for accessing data and querying, updating, and managing relational database systems
  2. The most commonly used data analysis languages are: data query language, data operation language and data definition language, that is, add, delete, modify and query
  3. The most commonly used is "query", that is, data retrieval

2, Data type

(1) Most commonly used data types

  1. Data type summary

    data typeexplain
    intinteger
    floatDecimal, floating point
    varchartext
    datedate
  2. Numerical type

    data typeexplain
    INT(size)Integer specifying the maximum number of digits in parentheses
    FLOAT(size,d)For numbers with floating decimal point, specify the maximum number of digits in parentheses, and d represents the number of digits after the decimal point
    DOUBLE(size,d)For numbers with floating decimal point, specify the maximum number of digits in parentheses, and d represents the number of digits after the decimal point
  3. Date type

    data typeexplain
    YEAR()Year in 2-digit or 4-digit format
    DATE()Date format: YYYY-MM-DD
    TIME()Time format: HH:MM:SS
    DATETIME()Combined format of date and time: YYYY-MM-DD HH:MM:SS
    TIMESTAMP()time stamp
  4. Text type

    data typeexplain
    char(n)Save a fixed length string (which can contain letters, numbers, and special characters)
    varchar()Save variable length strings (which can contain letters, numbers, and special characters)

3, SQL basic query language

  1. Common SQL commands and writing order
    SELECT * (required)
    FROM table
    WHERE conditions
    GROUP BY field
    HAVING conditions
    ORDER BY field
    LIMIT limit the number of data pieces
    Note: SQL is not case sensitive; Commands are generally uppercase, table names and field names are generally lowercase

  2. Retrieve data (SELECT statement)
    Syntax structure:

    SELECT *
    FROM Table name
    

    1) Retrieve single column: select column name from table name
    2) Retrieve multiple columns: separated by "," select column name 1, column name 2 from table name
    3) Retrieve all columns: select*
    4) LIMIT search results: LIMIT
    5) De duplication retrieval: DISTINCT

  3. Qualification (WHERE statement)
    1) Syntax structure:

    SELECT *
    FROM Table name
    WHERE condition
    

    2) Common judgment commands in WHERE conditions
    Value judgment: greater than (>), less than (<), equal to (=), not equal to (< >), greater than or equal to (> =), less than or equal to (< =), between and
    Logical judgment: AND, OR, NOT, IN()
    Fuzzy judgment: LIKE,%_ (underlined), etc., such as LIKE "% Cai%"

    3) Exercise 1:
    Value judgment: retrieve the "investment information with investment amount greater than 100000" in the "investment table"

    SELECT * FROM investmenttab2015
    WHERE investment > 100000;
    

    4) Exercise 2:
    Logical judgment: retrieve the "investment information with investment amount greater than 100000 and investment period of 360 days" in the "investment table"

    SELECT * FROM investmenttab2015
    WHERE investment > 100000
    AND investdays = 360;
    

    5) Case operation:
    Select the investment information with investment period of 270 days or 360 days, investment from January 1, 2018 to December 31, 2019 and investment amount of more than 100000 (including 100000) from the "investment table"

    SELECT * FROM investmenttab2015
    WHERE investdays IN (270,360)
    AND investment >= 100000
    AND investmentdate BETWEEN '2018-01-01' AND '2019-12-31';
    
  4. Calculation field (addition, subtraction, multiplication and division)
    1) Syntax structure:

    SELECT
    		Field 1 + Field 2 AS 'total',
    		Field 1 - Field 2 AS 'differ',
    		Field 1 * Field 2 AS 'product',
    		Field 1 / Field 2 AS 'divide'
    FROM Table name
    WHERE Condition (not required)
    

    2) Case operation:
    Add a new field in "investment table": investment income column income

    SELECT *, investment*(investrate/360)*investdays as income 
    FROM investmenttab2015;
    
  5. Splice field (concat)

    SELECT 
    	concat(Field 1,Field 2) AS 'new field name'
    FROM
    	Table name
    

    Add a new splicing field in the "user table" and splice the province province and city together as the new splicing field

    SELECT
    	*,
    	concat(province,city)
    FROM 
    	users;
    SELECT
    	*,
    	concat(province,"-",city)
    FROM 
    	users;
    
  6. Data grouping (GROUP BY)
    1) Syntax structure:

    SELECT field,Calculation field
    FROM Table name
    WHERE query criteria
    GROUP BY field
    

    2) Case operation:
    Calculate the investment amount of "investment products of various investment periods"

    SELECT investdays,sum(investment) FROM investmenttab2015
    GROUP BY investdays;
    
  7. Data filtering (HAVING)
    1) Syntax structure:

    SELECT field,Calculation field
    FROM Table name
    WHERE query criteria
    GROUP BY field
    HAVING Filter condition
    

    2) Case operation:
    Calculate the investment amount of "various investment periods" and display the investment information with an investment amount greater than 150 million (1.5 billion)

    SELECT investdays,SUM(investment) FROM investmenttab2015
    GROUP BY investdays
    HAVING SUM(investment) > 1500000000;
    
  8. Result sorting (ORDER BY)
    1) Syntax structure:

    SELECT field,Calculation field
    FROM Table name
    WHERE Query criteria (not required)
    GROUP BY field
    HAVING Filter condition (not required)
    ORDER BY field DESC
    

    2) Sort:
    Ascending: default ASC
    Descending: DESC

    3) Case operation:
    Calculate the investment amount of "various investment periods", display the investment information with an investment amount greater than 150 million (1.5 billion), and arrange it in descending order from high to low

    SELECT investdays,SUM(investment) FROM investmenttab2015
    GROUP BY investdays
    HAVING SUM(investment) > 1500000000
    ORDER BY SUM(investment) DESC;
    

4, SQL function

(1) Numerical function

  1. Functions and descriptions are shown in the following table:

    functiondescribe
    Sum (column name)Returns the sum of a column
    AVG (column name)Returns the average value of a column
    Min (column name)Returns the minimum value of a column
    Max (column name)Returns the maximum value of a column
    Count (column name)Returns the number of rows in a column (excluding NULL values)
    COUNT(*)Returns the number of selected rows
    COUNT(distinct column name)Returns the number of distinct results
    ABS (column name)Return absolute value
    QRT (column name)Return square root
  2. Case operation
    Calculate the investment amount and quantity of "various investment term financial products"

    SELECT investdays,SUM(investment),COUNT(investmentlistid)
    FROM investmenttab2015
    GROUP BY investdays;
    

(2) Time date function

  1. Time format:
    1) DATETIME - Format: YYYY-MM-DD HH:MM:SS
    2) DATE - Format: YYYY-MM-DD
    3) YEAR - Format: YYYY

  2. Functions and descriptions are shown in the following table:

    functionexplain
    NOW()Returns the current date and time
    CURDATE()Returns the current date
    CURTIME()Returns the current time
    DATE()Extracts the date part of a date or date / time expression
    EXTRACT()Returns a separate part of the date / time
    DATE_ADD()Adds the specified time interval to the date
    DATE_SUB()Subtracts the specified time interval from the date
    DATEDIFF()Returns the number of days between two dates
    DATE_FORMAT()Display date / time in different formats
  3. Common date commands

    SELECT
    NOW(), # Returns the current date and time
    DATE(NOW()), # Date format
    MONTH(NOW()), # month
    YEAR(NOW()), # particular year
    DATE_FORMATE(NOW(),'%Y-%m-%d'), # Date format
    DATE_FORMATE(NOW(),'%Y-%m'), # Year / month format, English in uppercase and Chinese in lowercase
    DATE_ADD(NOW(),INTERVAL 1 DAY), # Add date plus day / week
    DATE_SUB(NOW(),INTERVAL 1 DAY) # Date minus one day / week
    
    SELECT DATEDIFF('2019-08-31','2019-01-01'); # How many days is the date interval
    SELECT NOW(); # View current date and time
    SELECT CURDATE(); # View current date 
    
  4. Case operation:
    1) View "investment end date"

    SELECT *,DATE_ADD(investmentdate,INTERVAL investdays DAY) AS end_date
    FROM investmenttab2015;
    

    2) View "investment month"

    SELECT *,DATE_FORMAT(investmentdate,"%m") AS investment_month 
    FROM investmenttab2015;
    

(3) Textual function

  1. Functions and descriptions are shown in the following table:
    functiondescribe
    Left (column name)Returns the left character
    Right (column name)Returns the right character
    Length (column name)Returns the length of a field

(4) Control statement

  1. IF function
    Syntax structure:
    IF(condition, true, false)
    
  2. Case operation:
    Query the investment table and add a new column "investment quota type". The investment amount greater than or equal to 100000 is high investment, and the investment amount less than 100000 is low investment
    SELECT *,IF(investment>=100000,'High input','Low input') AS "Investment quota type"
    FROM investmenttab2015;
    
  3. Conditional function (CASE WHEN)
    1) Syntax structure:
    CASE
    	WHEN Condition 1 THEN 'Result 1'
    	WHEN Condition 2 THEN 'Result 2'
    	WHEN Condition 3 THEN 'Result 3'
    	......
    	ELSE 'Default value'
    END AS 'new field name'
    
    2) Case operation:
    Query the investment table and add a new column "investment quota type". The investment amount greater than or equal to 100000 is high investment, the investment amount greater than or equal to 50000 but less than 100000 is medium investment, and the investment amount less than 50000 is low investment
    SELECT *,
    			CASE WHEN investment >= 100000 THEN "High input"
    					 WHEN investment >= 50000 AND investment < 100000 THEN "Medium input"
    	    ELSE "Low input" 
          END AS "Investment quota type"
    FROM investmenttab2015;
    
  4. PivotTable report
    Use CASE WHEN with GROUP B to complete the PivotTable function
    SELECT 
    			investdays,
    			COUNT(CASE WHEN investment >= 100000 THEN "High input" END) AS "High input",
    			COUNT(CASE WHEN investment >= 50000 AND investment < 100000 THEN "Medium input" END) AS "Medium input",
    	    COUNT(CASE WHEN investment < 50000 THEN "Low input" END) AS "Low input"
    FROM investmenttab2015
    GROUP BY investdays;
    # IF implementation idea
    SELECT
    	investdays,
      COUNT(IF(investment >= 100000,"High input",null)) AS "High input",
      COUNT(IF(investment >= 50000 AND investment < 100000,"Medium input",null)) AS "Medium input",
      COUNT(IF(investment < 100000,"Low input",null)) AS "Low input"
    FROM investmenttab2015
    GROUP BY investdays;
    

5, SQL advanced

(1) SQL statement execution order

  1. Execution sequence
    Step 1: determine the data source
    FROM
    JOIN
    ON
    Step 2: filter data
    WHERE
    GROUP BY
    HAVING
    Step 3: query data
    SELECT
    Step 4: display data
    DISTINCT
    ORDER BY
    LIMIT

(2) SQL multi table join query

  1. UNION vertical connection
    1) Why connect (related to the storage form of data): one table has some information and the other table has other information. We want to obtain all the information, so we need to associate the two tables
    2) Purpose: combination
    3) Syntax structure:
    UNION is used to combine the results from many SELECT statements into a result set, also known as UNION query

    SELECT * FROM A
    UNION [ALL | DISTINCT]
    SELECT * FROM B
    UNION [ALL | DISTINCT]
    SELECT * FROM C
    

    a. UNION ALL connect all
    b. UNION(DISTINCT) de reconnection
    4) Case operation:
    Connect the investment table and investment table 2015 into a large table

    # De reconnection (when the data is consistent with the type, it is recognized as duplicate)
    SELECT * FROM investmenttab
    UNION 
    SELECT * FROM investmenttab2015;
    

    Note:
    a. The number of fields must be consistent
    b. The order of fields must be consistent

  2. Transverse connection
    1) Classification
    a. INNER JOIN: the same as INNER JOIN
    Table 1 joins Table 2 and returns the rows common to table 1 and table 2
    b. LEFT JOIN: Table 1 LEFT JOIN table 2, based on table 1, matches the relevant information of Table 2; Use the most; Even if there is no match in the right table, all rows are returned from the left table
    c. RIGHT JOIN: Table 1 RIGHT JOIN table 2, based on table 2, matches the relevant information of Table 1; Even if there is no match in the left table, all rows are returned from the right table
    d. Full connection: full connection cannot be realized directly in MySQL

    2) INNER JOIN
    Case operation:
    Connect the investment tab and users table into a large table to display data with the same userid

    SELECT * 
    FROM investmenttab
    INNER JOIN users 
    ON investmenttab.userid = users.userid;
    # using 
    SELECT * 
    FROM investmenttab
    INNER JOIN users 
    USING(userid);
    # inner ellipsis  
    SELECT * 
    FROM investmenttab
    JOIN users 
    ON investmenttab.userid = users.userid;
    # on can be selected as different fields. You must ensure that the contents of the fields are the same data
    

    3) LEFT JOIN left join
    The table on the left shall prevail:
    Connect the investment tab and users into a large table, and the table on the left shall prevail

    SELECT * FROM investmenttab
    LEFT JOIN users 
    ON investmenttab.userid = users.userid;
    # Some fields in the join table 
    SELECT a.userid,b.age FROM investmenttab AS a 
    LEFT JOIN users AS b 
    ON a.userid = b.userid;
    

    Case operation:
    Case operation: query the number and amount of investment in each region

    SELECT COUNT(DISTINCT a.userid),b.area,SUM(a.investment)
    FROM investmenttab AS a 
    LEFT JOIN users AS b 
    ON a.userid = b.userid
    GROUP BY b.area;
    

    4) RIGHT JOIN right join
    The table on the right shall prevail:
    Connect the investment tab and users into a large table, and the table on the right shall prevail

    SELECT * FROM investmenttab
    RIGHT JOIN users 
    ON investmenttab.userid = users.userid
    WHERE investmenttab.investmentdate='2016-12-31';
    

(3) Subquery (nested query)

  1. Subquery
    1) When a query is part of another query, we call the inner query a sub query (inner query) and the outer query a parent query (main query)
    2) Multi table query can be realized through sub query. The query statement may include keywords such as in, any, all and exists. In addition, it may contain comparison operators

  2. Sub query classification
    1) WHERE subquery
    2) FROM subquery

  3. WHERE subquery
    Syntax structure:

    SELECT * FROM t WHERE col > (SELECT XXX FROM t)
    

    1) Use operator
    Case operation 1:
    Query the investment sheet information in the investment tab where the investment amount is lower than the doc No.'ph-000000001 '

    SELECT * 
    FROM investmenttab
    WHERE investment < (
    	SELECT investment
    	FROM investmenttab
    	WHERE investmentlistid = 'ph-000000001'
    );
    

    Case operation 2:
    Query the investment sheet information whose investment amount and investment cycle are consistent with the doc no. 'ph-000000001' in the investment tab

    SELECT * 
    FROM investmenttab
    WHERE (investment,investdays)  = (
    	SELECT investment,investdays 
    	FROM investmenttab
    	WHERE investmentlistid = 'ph-000000001'
    );
    

    Note:
    a. The field names after where must be in the same order as the field names in the sub query
    b. The number of fields after where should be consistent with the number of fields in the sub query

    2) Use IN keyword
    When the condition of the main query is the result of the sub query, you can use the keyword in to judge. On the contrary, if the condition of the main query is not the result of the sub query, you can use the keyword not in to judge
    Case operation:
    Query the investment information of users in North China

    SELECT * 
    FROM investmenttab
    WHERE userid IN (
    	SELECT userid 
    	FROM users 
    	WHERE area='North China'
    );
    
  4. FROM subquery
    Syntax structure:

    SELECT t1.*
    FROM (SELECT field FROM Table name) AS t1
    WHERE condition
    

    Case operation:
    Query the total investment of each city, rank according to the total investment, and find the top 10 cities

    SELECT * 
    FROM (
    SELECT b.city,SUM(investment),ROW_NUMBER() over(ORDER BY SUM(investment) DESC) AS rank_num
    FROM investmenttab AS a 
    INNER JOIN users AS b 
    ON a.userid = b.userid
    GROUP BY b.city
    ) AS t 
    WHERE t.rank_num <= 10;
    
  5. WITH temporary table
    1) On mysql8 After 0, the WITH table function is added to enable corresponding sub queries
    2) The WITH temporary table, also known as the sub query part, is used to define an SQL fragment that will be used by the whole SQL statement
    3) The return result of the WITH clause is stored in the user's temporary table space. It is queried only once and used repeatedly to improve efficiency
    4) Grammatical structure

    WITH t1 AS (SELECT field FROM Table name)
    SELECT * FROM t1
    

    Case operation:
    Query the total investment of each city, rank according to the total investment, and find the top 10 cities

    WITH t AS (
    	SELECT b.city,SUM(investment),ROW_NUMBER() over(ORDER BY SUM(investment) DESC) AS rank_num
    	FROM investmenttab AS a 
    	INNER JOIN users AS b 
    	ON a.userid = b.userid
    	GROUP BY b.city
    )
    SELECT * 
    FROM t 
    WHERE t.rank_num <= 10;
    
  6. Sorting column problem
    1) There is an order by statement in MySQL to realize sorting, but in practice, we need to add a new column to the sorting result as the sorting sequence, that is, the ranking of the sorting result needs to be displayed together
    2) On mysql5 7, there is no corresponding windowing function, which needs to be compared by using programmed logic, that is, the idea of introducing variables
    3) And on mysql8 0, the corresponding windowing function is provided, which can be used directly
    4)MySQL8.0 sort
    Syntax structure:

    Usage 1: sorting without grouping
    ROW_NUMBER() OVER(ORDER BY field DESC)
    
    For example: ROW_NUMBER() OVER(ORDER BY Student achievement DESC)
    Indicates that the grades of all students are ranked from high to low, regardless of class
    
    Usage 2: Group sorting
    ROW_NUMBER() OVER(PARTITION BY Field 1 ORDER BY Field 2 DESC)
    It means grouping according to field 1 and sorting according to field 2 within the group. The value calculated by this function represents the sequence number after internal sorting of each group
    
    For example: ROW_NUMBER() OVER(PARTITION BY class ORDER BY Student achievement DESC)
    This function indicates the grouping according to "class" and sorting according to "student achievement" within each "class". The value calculated by this function represents the sequence number after sorting within each "class"
    

    5) Explain
    ROW_NUMBER() functions as a number
    PARTITION BY partitions the same data
    ORDER BY causes the data to be sorted in a certain order
    6) Three sorts
    It will be sorted according to the user's investment amount, from large to small, and the sorted results will be displayed in the query results at the same time

    # ROW_NUMBER() simple sort 
    # It will be sorted according to the user's investment amount, from large to small, and the sorted results will be displayed in the query results at the same time
    SELECT investment,ROW_NUMBER() over(ORDER BY investment DESC) AS 'row_number' 
    FROM investmenttab;
    
    # DENSE_RANK() does not occupy the same position 
    SELECT investment,DENSE_RANK() over(ORDER BY investment DESC) AS 'dense_rank' 
    FROM investmenttab;
    
    # RANK() same ranking 
    SELECT investment,RANK() over(ORDER BY investment DESC) AS 'rank' 
    FROM investmenttab;
    

    7) PARTITION BY grouping sort
    Case operation: find the ten cities with the highest investment amount in each region

    SELECT * 
    FROM (
    	SELECT b.area,b.city,ROW_NUMBER() over(PARTITION BY b.area ORDER BY SUM(a.investment) DESC) AS 'rank'
    	FROM investmenttab AS a 
    	INNER JOIN users AS b 
    	ON a.userid = b.userid
    	GROUP BY b.area,b.city
    ) AS t 
    WHERE t.rank <= 10;
    

Keywords: Database MySQL Data Analysis

Added by MicahCarrick on Wed, 05 Jan 2022 18:43:26 +0200