Troubleshooting a slow MongoDB find/getMore operation


This article is based on a case shortlisted in the 2021 MongoDB Technology Practice and Application Case Solicitation.

Author: Zhang Jiaqiao

Problem description

This article walks through the detailed process of troubleshooting a slow online query reported by a business team, and the solution that came out of it.

The business team was querying data with find().limit() and hitting a stall: a single operation could hang for up to a minute. The query in question is as follows:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom:"zhang" } ], nid: 50, status: 0 }).sort({ timetag: 1 }).limit(200)

Business feedback: the MongoDB version is 4.2.12. According to the slow log this query takes 40 to 50 seconds, but when it is run directly the first batch of results comes back quickly; only the getMore is slow. Why is getMore slow?

From the problem description, getMore appears to be what causes the query to stall. The likely causes are:

The getMore operation is stuck inside the server itself -- a kernel-level problem is unlikely

The query plan does not match the distribution of the queried data -- highly likely

The troubleshooting below proceeds from these two points.

Reproducing the problem

Restore the data to a standby cluster and run the business query:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200)

The first batch returns immediately. The stall occurs when the 102nd document is fetched, and the operation generates a slow log entry showing that a getMore was triggered. The log output is as follows:

2021-12-15T10:21:55.007+0800 I COMMAND [conn857] command xxx_db.xxx_collection appName: "MongoDB Shell" command: getMore { getMore: 1244654399487918694, collection: "xxx_collection", planSummary: IXSCAN { timetag: 1 } cursorid:1244654399487918694 keysExamined:11338296 docsExamined:11338296 cursorExhausted:1 numYields:88583 nreturned:99 reslen:100170 locks:{ ReplicationStateTransition: { acquireCount: { w: 88584 } }, Global: { acquireCount: { r: 88584 } }, Database: { acquireCount: { r: 88584 } }, Collection: { acquireCount: { r: 88584 } }, Mutex: { acquireCount: { r: 1 } } } storage:{ data: { bytesRead: 15442700982, timeReadingMicros: 40865619 }, timeWaitingMicros: { cache: 33773 } } protocol:op_msg 65270ms

Troubleshooting

Confirming whether getMore itself is the problem

In MongoDB, the default batch size of a query is 101: the initial find returns up to 101 results in one batch, and when the client needs more results, the next batch is fetched through a getMore operation.
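To make the mechanism concrete, here is a minimal sketch of driving the find/getMore protocol by hand from the mongo shell (the shell normally does this transparently while you iterate a cursor; the filter here is only illustrative):

// First batch: the find command returns a cursor id plus up to batchSize documents.
var first = db.runCommand({
    find: "xxx_collection",
    filter: { timetag: { $gt: 1636513802167 } },
    sort: { timetag: 1 },
    batchSize: 101          // same size as the default first batch
});

// Subsequent batches: getMore on the returned cursor id. This is the operation
// that shows up as "getMore: <cursorid>" in the slow log above.
var next = db.runCommand({
    getMore: first.cursor.id,
    collection: "xxx_collection",
    batchSize: 101
});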

The number of documents MongoDB returns per batch, including for getMore, can be controlled with batchSize().

If getMore itself were the problem, adjusting the batchSize should in theory make no difference, so we ran the following experiments.

Set batchSize to 150:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).batchSize(150).limit(200)

The first batch again returns immediately; once the batchSize is exceeded, getMore stalls as before. The problem remains:

2021-12-15T10:25:54.781+0800 I COMMAND [conn859] command xxx_db.xxx_collection appName: "MongoDB Shell" command: getMore { getMore: 8826588127480667691, collection: "xxx_collection", batchSize: 150, planSummary: IXSCAN { timetag: 1 } cursorid:8826588127480667691 keysExamined:11338234 docsExamined:11338234 cursorExhausted:1 numYields:88582 nreturned:50 reslen:50818 locks:{ ReplicationStateTransition: { acquireCount: { w: 88583 } }, Global: { acquireCount: { r: 88583 } }, Database: { acquireCount: { r: 88583 } }, Collection: { acquireCount: { r: 88583 } }, Mutex: { acquireCount: { r: 1 } } } storage:{ data: { bytesRead: 16610295032, timeReadingMicros: 30201139 }, timeWaitingMicros: { cache: 17084 } } protocol:op_msg 53826ms

With a batchSize of 199 the behavior is similar, but with a batchSize of 200 the stall moves into the initial find:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).batchSize(200).limit(200)

The corresponding slow log is as follows:

2021-12-15T10:27:23.729+0800 I COMMAND [conn859] command xxx_db.xxx_collection appName: "MongoDB Shell" command: find { find: "xxx_collection", filter: { timetag: { $gt: 1636513802167.0 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50.0, status: 0.0 }, projection: { $sortKey: { $meta: "sortKey" } }, sort: { timetag: 1.0 }, limit: 200, batchSize: 200, planSummary: IXSCAN { timetag: 1 } keysExamined:11338445 docsExamined:11338445 cursorExhausted:1 numYields:88582 nreturned:200 queryHash:ECA82717 planCacheKey:AC7EC9E3 reslen:202045 locks:{ ReplicationStateTransition: { acquireCount: { w: 88583 } }, Global: { acquireCount: { r: 88583 } }, Database: { acquireCount: { r: 88583 } }, Collection: { acquireCount: { r: 88583 } }, Mutex: { acquireCount: { r: 2 } } } storage:{ data: { bytesRead: 17688667726, timeReadingMicros: 14907251 }, timeWaitingMicros: { cache: 11784 } } protocol:op_msg 36654ms

This basically rules out the getMore operation itself. The slow logs show that the query plan scans the timetag index to fetch documents and then filters them, traversing more than 11 million index keys and documents in the process. The problem is therefore most likely related to the query plan and the data distribution: the stall happens while producing the 199th and 200th results, and traversing the timetag index is simply not fast enough there.

So the analysis now turns to the query plan and the queried data, to confirm whether the chosen plan is a poor fit for the target data.

Analyzing the data distribution of the query

First, we need to understand how the business data is structured and what the query is for. The key fields are as follows:

{ "nto" : , "nfrom" : , "nid" : , "timetag" : , "status" : }

Judging by the database/collection name and the data format, the documents are message delivery records.

The target query is as follows:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200)

The query finds the message records received or sent by a given user after a certain point in time, with nid = 50 and status = 0. The important conditions are therefore the time (timetag) and the user name. Because the timetag condition is an open-ended greater-than, the amount of matching data could be large; this is confirmed below.

Data distribution on timetag

mongos>db.xxx_collection.count()
538058824

The collection contains more than 500 million documents in total, which is large. Below we extract the timetag values of the 199th and 200th results, between which the stall occurs.
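A minimal sketch of how these two boundary values can be pulled out, assuming we simply re-run the original query with a projection on timetag and skip to the 199th and 200th positions:

mongos> db.xxx_collection.find({ timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }, { timetag: 1 }).sort({ timetag: 1 }).skip(198).limit(1)   // 199th result
mongos> db.xxx_collection.find({ timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }, { timetag: 1 }).sort({ timetag: 1 }).skip(199).limit(1)   // 200th result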

199th result timetag:

{ "_id" : ObjectId("618b380a82010a586b4265e6"), "timetag" : NumberLong("1636513802759") }

200th result timetag:

{ "_id" : ObjectId("618ce26682010a586be027f9"), "timetag" : NumberLong("1636622950801") }

Take the time point 1636513802167 from the query condition as T0, the time point 1636513802759 of the 199th result as T1, and the time point 1636622950801 of the 200th result as T2. These three points are used below to analyze the data in segments.

Distribution of query data

Total data volume after T0:

mongos>db.xxx_collection.find({ timetag: { $gt: 1636513802167 }}).count()
191829505

The target time range contains more than 190 million documents in total, which is a relatively large amount of data.

Data volume of T0~T1:

mongos>db.xxx_collection.find({ $and: [{timetag: {$gt: 1636513802167}}, {timetag: {$lte: 1636513802759}}]}).count()
287

Data volume between T1 and T2:

mongos>db.xxx_collection.find({ $and: [{timetag: {$gt: 1636513802759}}, {timetag: {$lt: 1636622950801}}]}).count()
11338157

Data volume after T2:

mongos>db.xxx_collection.find({timetag: {$gte: 1636622950801}}).count()
180491061

Distribution of query results

Total query results:

mongos>db.xxx_collection.find({ timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).count()
428

T0~T1:

mongos>db.xxx_collection.find({ $and:[{timetag: { $gt: 1636513802167 }}, {timetag: {$lte: 1636513802759}}], $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).count()
199

T1~T2:

mongos>db.xxx_collection.find({ $and:[{timetag: { $gt: 1636513802759 }}, {timetag: {$lt: 1636622950801}}], $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).count()
0

After T2:

mongos>db.xxx_collection.find({ timetag: { $gte: 1636622950801 }, $or: [ { nto:"zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0}).sort({ timetag: 1 }).count()
229

From the distribution of the data and the corresponding results, the matching results lie in two segments: T0~T1 and after T2. The T1~T2 interval contains no matching results at all, yet holds more than 11 million non-matching documents. In short, on the timetag axis the results are clustered at the two ends, with a huge hole in the middle.

Analyzing the executed query plans

Original query plan

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200).explain("executionStats")

Query plan obtained:

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$gt" : 1636513802167
}
}
]
},
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 200,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"timetag" : 1
},
"indexName" : "timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"timetag" : [
"(1636513802167.0, inf.0]"
]
}
}
}
}

The above shows that the original query's plan is to scan the timetag index and then filter the fetched records one by one against the remaining conditions. From the data-distribution analysis we know that about 190 million documents satisfy the timetag condition, so scanning them will be very slow even with the index. Because the cursor used by getMore is created from the original query's plan, the same plan is used for the whole query. Next, the original plan is executed against each segment to show that scanning by timetag alone is slow.

Executing the original plan on the T0~T1 segment

Query the T0~T1 data with the original plan. A sketch of the command (assuming the timetag_1 index is forced with hint(), following the same pattern as the later segments) is shown below:
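mongos> db.xxx_collection.find({ $and: [{timetag: { $gt: 1636513802167 }}, {timetag: {$lte: 1636513802759}}], $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag: 1 }).limit(200).hint("timetag_1").explain("executionStats")

Query plan obtained: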

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$lte" : 1636513802759
}
},
{
"timetag" : {
"$gt" : 1636513802167
}
}
]
},
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 200,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"timetag" : 1
},
"indexName" : "timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"timetag" : [
"(1636513802167.0, 1636513802759.0]"
]
}
}
}
},

The result comes back immediately, because only 287 documents need to be scanned in total.

Executing the original plan on the T1~T2 segment

Run the query against the T1~T2 data:

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$lt" : 1636622950801
}
},
{
"timetag" : {
"$gt" : 1636513802759
}
}
]
},
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"timetag" : 1
},
"limitAmount" : 200,
"inputStage" : {
"stage" : "SORT_KEY_GENERATOR",
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "OR",
"inputStages" : [
{
"stage" : "IXSCAN",
"keyPattern" : {
"nto" : 1,
"validflag" : 1,
"timetag" : 1
},
"indexName" : "nto_1_validflag_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nto" : [ ],
"validflag" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nto" : [
"[\"zhang\", \"zhang\"]"
],
"validflag" : [
"[MinKey, MaxKey]"
],
"timetag" : [
"(1636513802759.0, 1636622950801.0)"
]
}
},
{
"stage" : "IXSCAN",
"keyPattern" : {
"nfrom" : 1,
"timetag" : 1
},
"indexName" : "nfrom_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nfrom" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nfrom" : [
"[\"zhang\", \"zhang\"]"
],
"timetag" : [
"(1636513802759.0, 1636622950801.0)"
]
}
}
]
}
}
}
},

The query plan has changed, so we force the timetag_1 index with hint() to reproduce the original plan:

mongos>db.xxx_collection.find({ $and:[{timetag: { $gt: 1636513802759 }}, {timetag:{$lt: 1636622950801}}], $or: [ { nto: "zhang" }, { nfrom:"zhang" } ], nid: 50, status: 0 }).sort({ timetag: 1}).limit(200).hint("timetag_1").explain("executionStats")

Query plan:

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$lt" : 1636622950801
}
},
{
"timetag" : {
"$gt" : 1636513802759
}
}
]
},
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 200,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"timetag" : 1
},
"indexName" : "timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"timetag" : [
"(1636513802759.0, 1636622950801.0)"
]
}
}
}
},

Query time:

2021-12-15T11:18:43.650+0800 I COMMAND [conn913] command xxx_db.xxx_collection appName: "MongoDB Shell" command: find { find: "xxx_collection", filter: { $and: [ { timetag: { $gt: 1636513802759.0 } }, { timetag: { $lt: 1636622950801.0 } } ], $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50.0, status: 0.0 }, projection: { $sortKey: { $meta: "sortKey" } }, sort: { timetag: 1.0 }, hint: { $hint: "timetag_1" }, limit: 200, runtimeConstants: { localNow: new Date(1639538294423), clusterTime: Timestamp(1639538291, 1) }, shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], planSummary: IXSCAN { timetag: 1 } keysExamined:11338157 docsExamined:11338157 cursorExhausted:1 numYields:88579 nreturned:0 reslen:386 locks:{ ReplicationStateTransition: { acquireCount: { w: 88580 } }, Global: { acquireCount: { r: 88580 } }, Database: { acquireCount: { r: 88580 } }, Collection: { acquireCount: { r: 88580 } }, Mutex: { acquireCount: { r: 2 } } } storage:{ data: { bytesRead: 16223299833, timeReadingMicros: 9431804 }, timeWaitingMicros: { cache: 14082 } } protocol:op_msg 29226ms

Querying the empty T1~T2 interval is very slow: it takes 29 seconds, because more than 11 million documents have to be scanned.

Executing the original plan on the segment after T2

mongos> db.xxx_collection.find({timetag: { $gt: 1636622950801 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200).explain("executionStats")

Query plan:

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$gt" : 1636622950801
}
}
]
},
"winningPlan" : {
"stage" : "SORT",
"sortPattern" : {
"timetag" : 1
},
"limitAmount" : 200,
"inputStage" : {
"stage" : "SORT_KEY_GENERATOR",
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "OR",
"inputStages" : [
{
"stage" : "IXSCAN",
"keyPattern" : {
"nto" : 1,
"validflag" : 1,
"timetag" : 1
},
"indexName" : "nto_1_validflag_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nto" : [ ],
"validflag" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nto" : [
"[\"zhang\", \"zhang\"]"
],
"validflag" : [
"[MinKey, MaxKey]"
],
"timetag" : [
"(1636622950801.0, inf.0]"
]
}
},
{
"stage" : "IXSCAN",
"keyPattern" : {
"nfrom" : 1,
"timetag" : 1
},
"indexName" : "nfrom_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nfrom" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nfrom" : [
"[\"zhang\", \"zhang\"]"
],
"timetag" : [
"(1636622950801.0, inf.0]"
]
}
}
]
}
}
}
},

The query plan has changed again, so we force the timetag_1 index with hint() to reproduce the original plan:

mongos>db.xxx_collection.find({ timetag: { $gt: 1636622950801 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200).hint("timetag_1").explain("executionStats")

Query plan:

"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$gt" : 1636622950801
}
}
]
},
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 200,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"$or" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nto" : {
"$eq" : "zhang"
}
}
]
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"timetag" : 1
},
"indexName" : "timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"timetag" : [
"(1636622950801.0, inf.0]"
]
}
}
}
},

The time taken, from the log:

2021-12-15T11:36:34.388+0800 I COMMAND [conn918] command xxx_db.xxx_collection appName: "MongoDB Shell" command: explain { explain: { find: "xxx_collection", filter: { timetag: { $gt: 1636622950801.0 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50.0, status: 0.0 }, sort: { timetag: 1.0 }, hint: { $hint: "timetag_1" }, limit: 200, ****** $db: "xxx_db" } numYields:1109015 reslen:2691 locks:{ ReplicationStateTransition: { acquireCount: { w: 1109016 } }, Global: { acquireCount: { r: 1109016 } }, Database: { acquireCount: { r: 1109016 } }, Collection: { acquireCount: { r: 1109016 } }, Mutex: { acquireCount: { r: 2 } } } storage:{ data: { bytesRead: 195293544507, timeReadingMicros: 518472952 }, timeWaitingMicros: { cache: 272870 } } protocol:op_msg 801697ms

As shown above, querying the data after T2 with the timetag_1 plan takes about 800 seconds, because 180 million documents have to be scanned. Note that in the original query, limit(200) caps the total number of results, and only one more result (the 200th) is needed from this segment, so the original query can return as soon as it finds that single result and does not have to scan the entire post-T2 range.

Summary of the root cause

In conclusion, the root cause is that the query plan chosen by MongoDB does not match the distribution of the target data.

The original plan scans the timetag index and then filters the fetched documents against the other conditions.

The query results are segmented on timetag: they lie at the two ends of the range, with a large block of non-matching data in the middle. Results 1 through 199 fall in the T0~T1 interval, and results from the 200th onward fall after T2.

With the original plan, which scans only the timetag index, retrieving the results after the 199th is very slow because they are buried among roughly 190 million documents. This also explains why the first batch of 101 results is fast: it only has to scan the few hundred documents in the T0~T1 interval. The subsequent getMore, which produces results 102 through 200, is slow because it must scan more than 11 million non-matching documents before it can return. The original plan is simply not selective enough, so the scan is slow.

Solution

Traversing the data with the timetag index alone is inefficient. A compound index can filter during the index scan itself and greatly reduce the number of documents traversed. The plan we want is therefore to scan the compound indexes on (nfrom, timetag) and (nto, timetag) in parallel and merge the two result streams. Since MongoDB's query optimizer cannot rewrite the original query into this plan on its own, we rewrite the query so that the optimizer can recognize it.

The original query is:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200)

It is rewritten into the equivalent optimized query:

db.xxx_collection.find({$or: [ {$and: [{nto: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]}, {$and: [{nfrom: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]} ] }).sort({timetag:1}).limit(200)

We also create compound indexes on (nto, timetag) and (nfrom, timetag) (note: the collection already had the compound index on nfrom and timetag). The index definitions are as follows, with a sketch of the corresponding createIndex commands after them:

{
"v" : 2,
"key" : {
"nfrom" : 1,
"timetag" : 1
},
"name" : "nfrom_1_timetag_1",
"ns" : "xxx_db.xxx_collection"
},
{
"v" : 2,
"key" : {
"nto" : 1,
"timetag" : 1
},
"name" : "nto_1_timetag_1",
"ns" : "xxx_db.xxx_collection"
},
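A sketch of building these indexes from the shell (assuming default options; on a large production collection a rolling build strategy may be preferable):

db.xxx_collection.createIndex({ nto: 1, timetag: 1 })
db.xxx_collection.createIndex({ nfrom: 1, timetag: 1 })   // already existed in this case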

Query plan obtained:

"parsedQuery" : {
"$or" : [
{
"$and" : [
{
"nfrom" : {
"$eq" : "zhang"
}
},
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$gt" : 1636513802167
}
}
]
},
{
"$and" : [
{
"nid" : {
"$eq" : 50
}
},
{
"nto" : {
"$eq" : "zhang"
}
},
{
"status" : {
"$eq" : 0
}
},
{
"timetag" : {
"$gt" : 1636513802167
}
}
]
}
]
},
"winningPlan" : {
"stage" : "SUBPLAN",
"inputStage" : {
"stage" : "LIMIT",
"limitAmount" : 200,
"inputStage" : {
"stage" : "PROJECTION_SIMPLE",
"transformBy" : {
"timetag" : 1
},
"inputStage" : {
"stage" : "SORT_MERGE",
"sortPattern" : {
"timetag" : 1
},
"inputStages" : [
{
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"nfrom" : 1,
"timetag" : 1
},
"indexName" : "nfrom_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nfrom" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nfrom" : [
"[\"zhang\", \"zhang\"]"
],
"timetag" : [
"(1636513802167.0, inf.0]"
]
}
}
},
{
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"nid" : {
"$eq" : 50
}
},
{
"status" : {
"$eq" : 0
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"nto" : 1,
"timetag" : 1
},
"indexName" : "nto_1_timetag_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"nto" : [ ],
"timetag" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"nto" : [
"[\"zhang\", \"zhang\"]"
],
"timetag" : [
"(1636513802167.0, inf.0]"
]
}
}
}
]
}
}
}
},

This query plan meets our expectations.
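To confirm the improvement, the rewritten query can be explained with execution statistics and the totalKeysExamined, totalDocsExamined and executionTimeMillis fields compared against the original; a sketch:

db.xxx_collection.find({$or: [ {$and: [{nto: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]}, {$and: [{nfrom: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]} ] }).sort({timetag:1}).limit(200).explain("executionStats")

With the SORT_MERGE plan, the index bounds on nfrom/nto restrict each scan to the given user's own messages within the timetag range, rather than to the entire timetag range.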

Query equivalence proof

The optimized query can be derived from the original by equivalent Boolean transformations.

The original query is:

db.xxx_collection.find({timetag: { $gt: 1636513802167 }, $or: [ { nto: "zhang" }, { nfrom: "zhang" } ], nid: 50, status: 0 }).sort({ timetag:1 }).limit(200)

We abstract the query criteria of find as follows:

a: timetag > 1636513802167

b: nto = "zhang"

c: nfrom = "zhang"

d: nid = 50

e: status = 0

The original query condition is therefore equivalent to the conjunctive normal form:

$$ a \bigwedge (b \bigvee c) \bigwedge d \bigwedge e $$

By distributing the conjunction over the disjunction, it can be transformed into the disjunctive normal form:

$$ (a \bigwedge b \bigwedge d \bigwedge e) \bigvee (a \bigwedge c \bigwedge d \bigwedge e) $$
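The only step needed is the distributive law of conjunction over disjunction, applied to the $(b \bigvee c)$ term; spelled out:

$$ a \bigwedge (b \bigvee c) \bigwedge d \bigwedge e = ((a \bigwedge d \bigwedge e) \bigwedge b) \bigvee ((a \bigwedge d \bigwedge e) \bigwedge c) $$

Reordering the conjuncts in each branch gives exactly the disjunctive normal form above.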

The corresponding query is as follows:

db.xxx_collection.find({$or: [ {$and: [{nto: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]}, {$and: [{nfrom: "zhang"}, {timetag: {$gt: 1636513802167}}, { nid:50}, {status: 0}]} ] }).sort({timetag:1}).limit(200)


About the author: Zhang Jiaqiao, NetEase Games.

Keywords: MongoDB
