Technology sharing | using graph database to reduce the delay of MySQL in processing multi-layer relationships

Author: Yang Taotao

Senior database expert, specializing in MySQL for more than ten years. He is good at backup and recovery, SQL tuning, monitoring, operation and maintenance, high availability architecture design related to MySQL, PostgreSQL, MongoDB and other open source databases. At present, he works in aikesheng, providing MySQL related technical support and MySQL related course training for major operators and banking and financial enterprises.

Source: original contribution

*It is produced by aikesheng open source community. The original content cannot be used without authorization. For reprint, please contact Xiaobian and indicate the source.

Mysql database can be used to handle most online business scenarios, and it can handle them well. It is very advantageous in terms of single node performance or the overall throughput after multi machine expansion. However, nothing is absolute, and MySQL performance cannot meet expectations in some scenarios. For example, in dealing with various complicated relationships, MySQL has some difficulty in dealing with them. In such scenarios, NoSQL is more suitable than relational database. In this article, we use the graph database Neo4J (a kind of NoSQL) to illustrate how to make up for the shortcomings of MySQL in this scenario.

Here is a simple character relationship to illustrate.

For simple character relationships, for example, the relationship chain between Xiao Song, Xiao Li, Xiao Yang, Xiao AI, Xiao Xu, Xiao Na and Xiao Qiao may be as follows:

  1. Xiao Song "knows" Xiao Li.
  2. Xiao Li knows Xiao Yang.
  3. Xiao Yang "knows" Xiao AI.
  4. Xiao AI "knows" Xiao Xu.
  5. Xiao Xu "knows" Xiao Na.
  6. Xiao Na "knows" Xiao Qiao.

Among them, "cognition" is the relationship between several people. There are many kinds of such relationships, such as "know", "meet", "friend", "colleague", "secret love", "lover", etc. in this article, let's first look at the basic relationship: "know". That is to say, the relationship is just "know", nothing else.

Suppose there are three requirements:
  1. Find the total number of users.
  2. Find out who "knows" Xiao Yang and who Xiao Yang "knows" again.
  3. Find out who Xiao Yang "knows" and "knows" and "knows".
For these requirements, we first design two tables based on MySQL: (if we only realize the last two requirements, we only need Table 2.)

Table 1: user table, storing user records; Table 2: user relationship table, which stores the relationship between users. For simplicity, the two tables have no primary key and are allocated internally by MySQL.

mysql:ytt>create table user1 (name varchar(64));
Query OK, 0 rows affected (0.09 sec)

mysql:ytt>insert user1 values ("Xiao Song"),("petty thief"),("Xiao Yang"),("Little love"),("Xiao Xu"),("Xiao Na"),("Little Joe");
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0


mysql:ytt>create table relation1 (user_name varchar(64),friend_name varchar(64));
Query OK, 0 rows affected (0.07 sec)

mysql:ytt>insert relation1 values ("Xiao Song","petty thief"),("petty thief","Xiao Yang"),("Xiao Yang","Little love"),("Little love","Xiao Xu"),("Xiao Xu","Xiao Na"),("Xiao Na","Little Joe");
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

Let's meet the above three requirements:

1. Find out the total number of users: it's very simple. Just count(*).

   mysql:ytt>select count(*) as total_number from user1;
   +--------------+
   | total_number |
   +--------------+
   |            7 |
   +--------------+
   1 row in set (0.00 sec)

2. Find out the people who "know" Xiao Yang and "know" Xiao Yang again: due to the small number of records, filter the whole table directly.

   mysql:ytt>select * from (select (case when friend_name = "Xiao Yang" then user_name when user_name = "Xiao Yang" then friend_name end) as user_name from relation1) as sub_table where user_name is not null;
   +-----------+
   | user_name |
   +-----------+
   | petty thief      |
   | Little love      |
   +-----------+
   2 rows in set (0.00 sec)

3. Find out the "know" person of "know" of "know" of "know" of Xiao Yang: that is, find the final user name of the four-tier relationship network starting from Xiao Yang.

  mysql:ytt>select d.friend_name from relation1 b 
       -> inner join relation1 a on  b.user_name = "Xiao Yang" and b.friend_name = a.user_name
       -> inner join relation1 c on a.friend_name = c.user_name
       -> inner join relation1 d on c.friend_name = d.user_name;
   +-------------+
   | friend_name |
   +-------------+
   | Little Joe        |
   +-------------+
   1 row in set (0.00 sec)

The above three requirements, especially the third one. Find out the relationship network starting from Xiao Yang. The more layers, the more tables to join. In MySQL, the more the number of table associations, the worse the performance. Later, I will continue to explore this issue in the topic of "SQL optimization".

We use the graph database Neo4J to solve the same needs.

Create user nodes and relationships between users,

neo4j@ytt> create (x1:user {name:"Xiao Song"}),
(x2:user {name:"petty thief"}),
(x3:user {name:"Xiao Yang"}),
(x4:user {name:"Little love"}),
(x5:user {name:"Xiao Xu"}),
(x6:user {name:"Xiao Na"}),
(x7:user {name:"Little Joe"})
with x1,x2,x3,x4,x5,x6,x7
create (x1)-[:know]->(x2),
(x2)-[:know]->(x3),
(x3)-[:know]->(x4),
(x4)-[:know]->(x5),
(x5)-[:know]->(x6),
(x6)-[:know]->(x7);
0 rows
ready to start consuming query after 269 ms, results consumed after another 0 ms
Added 7 nodes, Created 6 relationships, Set 7 properties, Added 7 labels

The corresponding diagram is shown as follows:

The realization of the above requirements in Neo4J is as follows:

  1. Demand 1:
   neo4j@ytt> match(x:user) return count(*) as total_number;
   +--------------+
   | total_number |
   +--------------+
   | 7            |
   +--------------+
   
   1 row
   ready to start consuming query after 21 ms, results consumed after another 1 ms
  1. Demand 2:
   neo4j@ytt> match (y1:user)-[:know]->(x:user {name:"Xiao Yang"})-[:know]->(y2:user) return y1.name,y2.name;
   +-------------------+
   | y1.name | y2.name |
   +-------------------+
   | "petty thief"    | "Little love"    |
   +-------------------+
   
   1 row
   ready to start consuming query after 95 ms, results consumed after another 2 ms
  1. Demand 3:
   neo4j@ytt> match(x:user {name:"Xiao Yang"})-[:know*4]->(y) return y.name;
   +--------+
   | y.name |
   +--------+
   | "Little Joe"   |
   +--------+
   
   1 row
   ready to start consuming query after 398 ms, results consumed after another 174 ms

From the query results of the three requirements alone, MySQL and Neo4J have similar performance, but there are some differences in writing. However, if you enlarge the amount of data, especially the processing of requirement 3, MySQL will be a little difficult.

Let's enlarge the data several times, increase the number of user table records to 1000 and the number of relationship table records to 100000.

Create some data for the user table and relationship table respectively: (user1.csv contains 1100 users and relation1.csv contains 1W records. Each user "knows" about 100 people and adds an index to the relationship table.)

mysql:ytt>truncate table user1;
Query OK, 0 rows affected (0.19 sec)

mysql:ytt>load data infile '/var/lib/mysql-files/user1.csv' into table user1 fields terminated by ',' enclosed by '"';
Query OK, 1100 rows affected (0.10 sec)
Records: 1100  Deleted: 0  Skipped: 0  Warnings: 0


mysql:ytt>truncate table relation1;
Query OK, 0 rows affected (0.11 sec)

mysql:ytt>load data infile '/var/lib/mysql-files/relation1.csv' into table relation1 fields terminated by ',' enclosed by '"';
Query OK, 100000 rows affected (1.60 sec)
Records: 100000  Deleted: 0  Skipped: 0  Warnings: 0

mysql:ytt>alter table relation1 add key idx_user_name (user_name), add key idx_friend_name(friend_name);
Query OK, 0 rows affected (4.09 sec)
Records: 0  Duplicates: 0  Warnings: 0

Some data in the table are as follows:

mysql:ytt>table user1 limit 2;
+-------+
| name  |
+-------+
| user1 |
| user2 |
+-------+
2 rows in set (0.00 sec)

mysql:ytt>table relation1 limit 2;
+-----------+-------------+
| user_name | friend_name |
+-----------+-------------+
| user1     | user101     |
| user2     | user101     |
+-----------+-------------+
2 rows in set (0.00 sec)

Next, realize the third requirement:

Here, the user "Xiao Yang" is replaced by "user1". Because there are many results, it takes only a total of four minutes, which is unacceptable for the user.

mysql:ytt>select count(distinct d.friend_name) as cnt from relation1 b 
    -> inner join relation1 a on  b.user_name = "user1" and b.friend_name = a.user_name
    -> inner join relation1 c on a.friend_name = c.user_name
    -> inner join relation1 d on c.friend_name = d.user_name;


+-----+
| cnt |
+-----+
| 100 |
+-----+
1 row in set (4 min 15.47 sec)


Next, import MySQL data into Neo4J to continue to meet the same requirements:

Import node:

neo4j@ytt> load csv from  "file:///user1.csv" as x create (a:user {name:x[0]});
0 rows
ready to start consuming query after 523 ms, results consumed after another 0 ms
Added 1100 nodes, Set 1100 properties, Added 1100 labels

Import relationship:

neo4j@ytt> load csv from "file:///relation1. CSV "as x with X match (A: user {Name: x [0]}), (B: user {Name: x [1]}) merge (a) - [: know] - (b);

Created 100000 relationships, completed after 94636 ms.

Index nodes:

neo4j@ytt> create index on :user(name);
0 rows
ready to start consuming query after 31 ms, results consumed after another 0 ms
Added 1 indexes

The execution time of neoj is much faster than that of the third one.

neo4j@ytt> match(x:user {name:"user1"})-[*4]->(y) 
return count(distinct y.name) as total_friend;
+--------------+
| total_friend |
+--------------+
| 100          |
+--------------+

1 row
ready to start consuming query after 44 ms,
results consumed after another 692 ms

Summary:

This article gives a brief introduction based on the fact that graph database is better than relational database in dealing with character relationships. For more relationship processing, please continue to read the following chapters.

Added by assessino on Wed, 09 Feb 2022 23:33:31 +0200