Hello, everyone. I'm Mr. Tony, who only talks about technology and doesn't cut his hair.
Recently, a friend shared that he participated in the internship interview of data analyst of an Internet (live delivery) company. The following is my analysis of relevant SQL test questions. The database used is MySQL 8.0.
If you find the article useful, you are welcome to comment 📝, give the thumbs-up 👍, recommend 🎁
Question 1: popularity value of live broadcasting room
Problem description
The live broadcast record table t1 contains the following fields:
- Anchor ID: author_id
- Live studio ID: live_id
- Duration: live_duration
The live viewing record table t2 contains the following fields:
- Audience ID: user_id
- Live studio ID: live_id
- Viewing duration: watching_duration
It is required to calculate the popularity value of the live broadcasting room, and the output result format is as follows:
Anchor id | Studio id | acu |
---|
Among them, ACU is the Average concurrent users. The calculation method is: the viewing time on the audience side / the start time of a live broadcast. When no one is watching, it is displayed as 0.
Problem analysis
First, the information required to calculate ACU can be obtained by associating the broadcast record table t1 and the viewing record table t2, and then the viewing duration of all viewers can be added up and divided by the broadcast duration.
-- Create sample table CREATE TABLE t1 (author_id integer, live_id integer, live_duration integer); INSERT INTO t1 VALUES (1, 1, 120), (2, 2, 180), (3, 3, 60); CREATE TABLE t2 (user_id integer, live_id integer, watching_duration integer); INSERT INTO t2 VALUES (1, 1, 30), (2, 1, 40), (3, 1, 50); INSERT INTO t2 VALUES (3, 2, 30), (4, 2, 60); -- calculation ACU SELECT t1.author_id, t1.live_id, sum(t2.watching_duration)/t1.live_duration AS acu FROM t1 JOIN t2 ON (t2.live_id = t1.live_id) GROUP BY t1.author_id, t1.live_id; author_id|live_id|acu | ---------+-------+------+ 1| 1|1.0000| 2| 2|0.5000|
The above query uses the inner connection, GROUP BY grouping and SUM aggregation functions for grouping statistics.
However, there is still a problem with the above query: there is no audience in the Live Room 3, resulting in no result returned from the internal connection query. Therefore, we also need to deal with the situation that no one is watching. The method is to use the left connection query:
SELECT t1.author_id, t1.live_id, COALESCE(sum(t2.watching_duration), 0)/t1.live_duration AS acu FROM t1 LEFT JOIN t2 ON (t2.live_id = t1.live_id) GROUP BY t1.author_id, t1.live_id; author_id|live_id|acu | ---------+-------+------+ 1| 1|1.0000| 2| 2|0.5000| 3| 3|0.0000|
In addition to the left connection, we also use COALESCE Function to convert a null value to 0. Of course, it can also be used IFNULL Function.
Question 2: cumulative sales amount
Problem description
The sales record table t3 contains the following fields:
- User ID: user_id
- Sales date: sell_day
- Sales amount: amount
It is required to calculate the cumulative sales amount 30 days after the first sales date of each user. The output result format is as follows:
User id | First sale date | Cumulative sales amount 30 days after the first sales date |
---|
Problem analysis
This problem can be divided into two steps. First, find the first sales date of each user. This can be grouped based on user id through GROUP BY, and then use the aggregation function MIN to return the earliest sales date. The implementation code is as follows:
-- Create sample table CREATE TABLE t3 (user_id integer, sell_day date, amount numeric); INSERT INTO t3 VALUES (1, '2021-01-01', 100), (1, '2021-01-02', 100), (1, '2021-01-29', 100),(1, '2021-02-01', 100); INSERT INTO t3 VALUES (2, '2021-01-10', 200), (2, '2021-01-11', 200), (2, '2021-01-12', 200),(2, '2021-01-13', 200); -- Find the first sales date for each user SELECT user_id, min(sell_day) AS first_day FROM t3 GROUP BY user_id; user_id|first_day | -------+----------+ 1|2021-01-01| 2|2021-01-10|
Then, we can count the cumulative sales amount within 30 days based on these first sales dates:
WITH s AS ( SELECT user_id, min(sell_day) AS first_day FROM t3 GROUP BY user_id ) SELECT s.user_id, s.first_day, sum(t3.amount) total_amount FROM s JOIN t3 ON (t3.user_id = s.user_id AND t3.sell_day BETWEEN s.first_day AND s.first_day + INTERVAL '29' DAY) GROUP BY s.user_id, s.first_day; user_id|first_day |total_amount| -------+----------+------------+ 1|2021-01-01| 300| 2|2021-01-10| 800|
The WITH clause in the above query defines a General table expression , including the first sales date of each user, and then conduct a connection query with t3 to return the required data.
For this problem, we can also use Window function solve. For example:
SELECT * FROM (SELECT user_id, sell_day, amount, RANK() OVER (PARTITION BY user_id ORDER BY sell_day) AS rk, first_value(amount) OVER (PARTITION BY user_id ORDER BY sell_day) AS fisrt_day, sum(amount) OVER (PARTITION BY user_id ORDER BY sell_day RANGE BETWEEN CURRENT ROW AND INTERVAL '30' DAY FOLLOWING) FROM t3) t WHERE rk = 1;
Question 3: proportion of channel sales
Problem description
The merchant sales record table t4 contains the following fields:
- Seller ID: seller_id
- Buyer ID: buyer_id
- Item ID: item_id
- Sales volume: order_cnt
- Unit price: price
- Commodity channel: source_type (1 for self built goods, 0 for others)
It is required to calculate the proportion of self built commodity sales of each merchant in the total sales. The structure of output information is as follows:
Seller id | Proportion of self built commodity sales |
---|
Problem analysis
This problem is relatively simple. You only need to count the sales volume and total sales volume of self built goods according to the merchants, and then divide the two. For example:
-- Create sample table CREATE TABLE t4 (seller_id integer, buyer_id integer, item_id integer, order_cnt integer, price numeric, source_type tinyint); INSERT INTO t4 VALUES (1, 11, 1, 5, 9.9, 1), (1, 12, 2, 15, 16.0, 0); INSERT INTO t4 VALUES (2, 13, 3, 10, 89, 0), (2, 14, 4, 1, 2000, 0); -- Proportion of self built commodity sales of each merchant in total sales SELECT seller_id, sum(CASE source_type WHEN 1 THEN order_cnt ELSE 0 END)/sum(order_cnt) AS ratio FROM t4 GROUP BY seller_id; seller_id|ratio | ---------+------+ 1|0.2500| 2|0.0000|
The above query uses two SUM functions. The first function contains one CASE expression , its function is to count the sales of self built goods. The second SUM function is used to count the total sales of all goods.
Question 4: best selling products
Problem description
The merchant sales record table t5 contains the following fields:
- Seller ID: seller_id
- Buyer ID: buyer_id
- Item ID: item_id
- Item quantity: num
It is required to calculate the goods with the highest sales volume of each seller, and the output information includes the following contents:
Seller id | Item id | Total sales of items |
---|
Problem analysis
This is a group ranking problem, which needs to be grouped according to sellers to calculate the products with the highest sales volume. Such questions use Window function The most convenient solution, for example:
-- Create sample table CREATE TABLE t5 (seller_id int, buyer_id int, item_id int, num int); INSERT INTO t5 VALUES (1, 11, 1, 100), (1, 12, 1, 200), (1, 12, 2, 300); INSERT INTO t5 VALUES (2, 11, 1, 100), (2, 12, 3, 200); -- Calculate the item with the highest sales volume for each seller WITH sales AS ( SELECT seller_id, item_id, sum(num) AS ss FROM t5 GROUP BY seller_id, item_id ), sales_rank AS ( SELECT seller_id, item_id, ss, rank() OVER (PARTITION BY seller_id ORDER BY ss DESC) AS rk FROM sales) SELECT * FROM sales_rank WHERE rk = 1; seller_id|item_id|ss |rk| ---------+-------+---+--+ 1| 1|300| 1| 1| 2|300| 1| 2| 3|200| 1|
First, the WITH Clause defines two common table expressions. Sales includes the total sales of each seller and each item_ Rank calculates the sales ranking of different items of the same seller based on this sales. The last query statement returns the highest ranked product.
Seller 1's sales of item 1 and item 2 are the same, so two records are returned.
summary
For the data analysis post, mastering SQL grouping aggregation, CASE expression and window function is the basic requirement. With the introduction of MySQL 8.0, it makes all this easier.