SQL written test for data analysis post of an Internet company

Hello, everyone. I'm Mr. Tony, who only talks about technology and doesn't cut his hair.

Recently, a friend shared that he participated in the internship interview of data analyst of an Internet (live delivery) company. The following is my analysis of relevant SQL test questions. The database used is MySQL 8.0.

If you find the article useful, you are welcome to comment 📝, give the thumbs-up 👍, recommend 🎁

Question 1: popularity value of live broadcasting room

Problem description

The live broadcast record table t1 contains the following fields:

  • Anchor ID: author_id
  • Live studio ID: live_id
  • Duration: live_duration

The live viewing record table t2 contains the following fields:

  • Audience ID: user_id
  • Live studio ID: live_id
  • Viewing duration: watching_duration

It is required to calculate the popularity value of the live broadcasting room, and the output result format is as follows:

Anchor idStudio idacu

Among them, ACU is the Average concurrent users. The calculation method is: the viewing time on the audience side / the start time of a live broadcast. When no one is watching, it is displayed as 0.

Problem analysis

First, the information required to calculate ACU can be obtained by associating the broadcast record table t1 and the viewing record table t2, and then the viewing duration of all viewers can be added up and divided by the broadcast duration.

-- Create sample table
CREATE TABLE t1 (author_id integer, live_id integer, live_duration integer);
INSERT INTO t1 VALUES (1, 1, 120), (2, 2, 180), (3, 3, 60);

CREATE TABLE t2 (user_id integer, live_id integer, watching_duration integer);
INSERT INTO t2 VALUES (1, 1, 30), (2, 1, 40), (3, 1, 50);
INSERT INTO t2 VALUES (3, 2, 30), (4, 2, 60);

-- calculation ACU
SELECT t1.author_id, t1.live_id, sum(t2.watching_duration)/t1.live_duration AS acu
FROM t1
JOIN t2 ON (t2.live_id = t1.live_id)
GROUP BY t1.author_id, t1.live_id;

author_id|live_id|acu   |
---------+-------+------+
        1|      1|1.0000|
        2|      2|0.5000|

The above query uses the inner connection, GROUP BY grouping and SUM aggregation functions for grouping statistics.

However, there is still a problem with the above query: there is no audience in the Live Room 3, resulting in no result returned from the internal connection query. Therefore, we also need to deal with the situation that no one is watching. The method is to use the left connection query:

SELECT t1.author_id, t1.live_id, COALESCE(sum(t2.watching_duration), 0)/t1.live_duration AS acu
FROM t1
LEFT JOIN t2 ON (t2.live_id = t1.live_id)
GROUP BY t1.author_id, t1.live_id;

author_id|live_id|acu   |
---------+-------+------+
        1|      1|1.0000|
        2|      2|0.5000|
        3|      3|0.0000|

In addition to the left connection, we also use COALESCE Function to convert a null value to 0. Of course, it can also be used IFNULL Function.

Question 2: cumulative sales amount

Problem description

The sales record table t3 contains the following fields:

  • User ID: user_id
  • Sales date: sell_day
  • Sales amount: amount

It is required to calculate the cumulative sales amount 30 days after the first sales date of each user. The output result format is as follows:

User idFirst sale dateCumulative sales amount 30 days after the first sales date

Problem analysis

This problem can be divided into two steps. First, find the first sales date of each user. This can be grouped based on user id through GROUP BY, and then use the aggregation function MIN to return the earliest sales date. The implementation code is as follows:

-- Create sample table
CREATE TABLE t3 (user_id integer, sell_day date, amount numeric);
INSERT INTO t3 VALUES (1, '2021-01-01', 100), (1, '2021-01-02', 100), (1, '2021-01-29', 100),(1, '2021-02-01', 100);
INSERT INTO t3 VALUES (2, '2021-01-10', 200), (2, '2021-01-11', 200), (2, '2021-01-12', 200),(2, '2021-01-13', 200);

-- Find the first sales date for each user
SELECT user_id, min(sell_day) AS first_day
FROM t3
GROUP BY user_id;

user_id|first_day |
-------+----------+
      1|2021-01-01|
      2|2021-01-10|

Then, we can count the cumulative sales amount within 30 days based on these first sales dates:

WITH s AS (
  SELECT user_id, min(sell_day) AS first_day
  FROM t3
  GROUP BY user_id
)
SELECT s.user_id, s.first_day, sum(t3.amount) total_amount
FROM s
JOIN t3
ON (t3.user_id = s.user_id AND t3.sell_day BETWEEN s.first_day AND s.first_day + INTERVAL '29' DAY)
GROUP BY s.user_id, s.first_day;

user_id|first_day |total_amount|
-------+----------+------------+
      1|2021-01-01|         300|
      2|2021-01-10|         800|

The WITH clause in the above query defines a General table expression , including the first sales date of each user, and then conduct a connection query with t3 to return the required data.

For this problem, we can also use Window function solve. For example:

SELECT *
FROM (SELECT user_id, sell_day, amount,
		RANK() OVER (PARTITION BY user_id ORDER BY sell_day) AS rk,
		first_value(amount) OVER (PARTITION BY user_id ORDER BY sell_day) AS fisrt_day,
		sum(amount) OVER (PARTITION BY user_id ORDER BY sell_day RANGE BETWEEN CURRENT ROW AND INTERVAL '30' DAY FOLLOWING)
	  FROM t3) t
WHERE rk = 1;

Question 3: proportion of channel sales

Problem description

The merchant sales record table t4 contains the following fields:

  • Seller ID: seller_id
  • Buyer ID: buyer_id
  • Item ID: item_id
  • Sales volume: order_cnt
  • Unit price: price
  • Commodity channel: source_type (1 for self built goods, 0 for others)

It is required to calculate the proportion of self built commodity sales of each merchant in the total sales. The structure of output information is as follows:

Seller idProportion of self built commodity sales

Problem analysis

This problem is relatively simple. You only need to count the sales volume and total sales volume of self built goods according to the merchants, and then divide the two. For example:

-- Create sample table
CREATE TABLE t4 (seller_id integer, buyer_id integer, item_id integer, order_cnt integer, price numeric, source_type tinyint);
INSERT INTO t4 VALUES (1, 11, 1, 5, 9.9, 1), (1, 12, 2, 15, 16.0, 0);
INSERT INTO t4 VALUES (2, 13, 3, 10, 89, 0), (2, 14, 4, 1, 2000, 0);

-- Proportion of self built commodity sales of each merchant in total sales
SELECT seller_id, 
       sum(CASE source_type WHEN 1 THEN order_cnt ELSE 0 END)/sum(order_cnt) AS ratio
FROM t4
GROUP BY seller_id;

seller_id|ratio |
---------+------+
        1|0.2500|
        2|0.0000|

The above query uses two SUM functions. The first function contains one CASE expression , its function is to count the sales of self built goods. The second SUM function is used to count the total sales of all goods.

Question 4: best selling products

Problem description

The merchant sales record table t5 contains the following fields:

  • Seller ID: seller_id
  • Buyer ID: buyer_id
  • Item ID: item_id
  • Item quantity: num

It is required to calculate the goods with the highest sales volume of each seller, and the output information includes the following contents:

Seller idItem idTotal sales of items

Problem analysis

This is a group ranking problem, which needs to be grouped according to sellers to calculate the products with the highest sales volume. Such questions use Window function The most convenient solution, for example:

-- Create sample table
CREATE TABLE t5 (seller_id int, buyer_id int, item_id int, num int);
INSERT INTO t5 VALUES (1, 11, 1, 100), (1, 12, 1, 200), (1, 12, 2, 300);
INSERT INTO t5 VALUES (2, 11, 1, 100), (2, 12, 3, 200);

-- Calculate the item with the highest sales volume for each seller
WITH sales AS (
SELECT seller_id, item_id, sum(num) AS ss
FROM t5 
GROUP BY seller_id, item_id
),
sales_rank AS (
SELECT seller_id, item_id, ss, rank() OVER (PARTITION BY seller_id ORDER BY ss DESC) AS rk
FROM sales)
SELECT *
FROM sales_rank
WHERE rk = 1;

seller_id|item_id|ss |rk|
---------+-------+---+--+
        1|      1|300| 1|
        1|      2|300| 1|
        2|      3|200| 1|

First, the WITH Clause defines two common table expressions. Sales includes the total sales of each seller and each item_ Rank calculates the sales ranking of different items of the same seller based on this sales. The last query statement returns the highest ranked product.

Seller 1's sales of item 1 and item 2 are the same, so two records are returned.

summary

For the data analysis post, mastering SQL grouping aggregation, CASE expression and window function is the basic requirement. With the introduction of MySQL 8.0, it makes all this easier.

Keywords: Database MySQL SQL Data Analysis

Added by johnbruce on Sun, 07 Nov 2021 22:43:46 +0200