TiDB source code reading: generation of physical plan

Copyright notice: This article is compiled by the team of Digital China cloud base. If you reprint it, please indicate the source.

preface

I don't know how many days have passed since the last source code learning note. You may think I've gone from entry to soil, but I didn't! I'm still studying hard!

There are a lot of things to deal with in the middle of this period. Although most of them are still related to TiDB, it is difficult to have time to calm down and complete a learning note. Just recently, some students asked some questions about the previous notes, so they decided to continue to record their own source code reading notes. In the previous TiDB source code reading series, they only briefly explained the SQL processing process, and there was little in-depth understanding of some specific modules in the middle.

Earlier, we roughly divided the SQL execution engine into four modules: SQL parsing - making plans - optimizing plans - executing plans. In the process of making and optimizing plans, two plans will be generated, one is logical plan and the other is physical plan. When SQL is parsed into abstract syntax tree, it will be transformed into logical plan and optimized (RBO), and then the logical plan will be optimized based on cost (CBO) into a physical plan. The whole process is briefly introduced in the previous article. You can read this article after you have a certain understanding of this piece.

TiDB source code learning notes: making plans

The topic of this article is the generation of physical plan, which is not explained in depth in the previous article. The core of physical plan generation is cost based optimization. In the generation process, each child node (called logical operator) in the logical plan is transformed into a corresponding physical operator, and finally a physical plan tree is formed. In this process, each logical operator can be transformed into multiple physical operators, so there will be a problem, How to judge which logical operator is converted to which physical operator, and what conditions are used to select, at this time, it is necessary to conduct cost evaluation. The cost of executing different physical operators is different. Here, the cost can be simply understood as resource consumption and time consumption, Then when choosing, we must choose the operator with the least cost to generate an optimal physical plan, which is the fastest to execute! Database, what you want is fast!

An official TiDB article has introduced this part and explained it clearly, but its code version is relatively old. Now it is difficult to find the method in the new version. This article will refer to this article for explanation, and the TiDB code version of this article is 4.0.11.

TiDB source code reading series (8) cost based optimization

Logical operator & physical operator

First, introduce the nodes that make up the plan tree, also known as operators. Logical plans are composed of logical operators and physical plans are composed of physical operators. Common logical operators include Projection, Aggregation, Join and DataSource, which are Projection (e.g. Select *), Aggregation (e.g. Sum, Count), Join table (e.g. Left Join) and scan table (e.g. From table). Corresponding. When each logical operator is actually executed, the execution method is different, and it will be converted into different physical operators.

For the simplest example, the same logical operator DataSource and table scanning can have different strategies in different situations. For example, if you can use the index, you can use the index to scan the table (IndexReader) or scan the table (TableReader) through the row ID normally. When selecting, we need to estimate which of the two table scanning methods is better, calculate the cost of the two table scanning methods through the meta information of the table, and select the operator with less cost. Here you can take a diagram of the official document for your reference, listing three common logical operators and their physical operators that can be converted:

Finally, let me ask you a question: when scanning the table, is it faster to use the index to scan the table than the whole table? Less expensive?

Hints & Physical Property

More problems need to be considered in the generation of actual physical cost, such as Hints and Physical Property. As we all know, although the database will choose the fastest and best plan to execute, the database is only a system for storing data, and requires certain performance and speed (emphasizing speed), without strong AI support, There is no time to provide complex algorithms to calculate, so not every time you execute SQL, you can make the best choice and generate the best execution plan.

Sometimes Hints will be used in the scenario of database tuning, that is, Hints should be considered in the generation process of physical plan. For example, when the user specifies to use HashJoin when converting from logical operator Join to physical operator, HashJoin will be preferentially (forcibly) selected as the physical operator during conversion, Then there is no need to consider the cost at this time.

In addition to Hints as an additional condition in selecting a physical plan, there is also a Physical Property, that is, the data requirements (Prop) of the parent node for the child node. In the construction of a physical plan, the physical operator of the parent node will have certain requirements for the data returned by the physical operator of the child node, Therefore, when the child node is selected and transformed into a physical operator, it is necessary to consider whether the physical operator meets the requirements of its parent node.

Take a sub column as an example: referring to the above figure, a parent node is now a Join operator, which is SortMergeJoin when it is converted to a physical operator, so the requirement is that the two tables of the Join are ordered according to the Join Key, and each table should read data through IndexReader for matching, then the requirement will be saved in the Prop and transferred to the child node DataSource, When the child node DataSource considers converting to IndexReader, TableReader and IndexLoopUpReader, it needs to add the conditions in the Prop, and finally select the physical operator IndexReader. At this time, it will not consider its cost.

Physical Property

// PhysicalProperty stands for the required physical property by parents.
// It contains the orders and the task types.
type PhysicalProperty struct {
 Items []Item
 
 // TaskTp means the type of task that an operator requires.
 //
 // It needs to be specified because two different tasks can't be compared
 // with cost directly. e.g. If a copTask takes less cost than a rootTask,
 // we can't sure that we must choose the former one. Because the copTask
 // must be finished and increase its cost in sometime, but we can't make
 // sure the finishing time. So the best way to let the comparison fair is
 // to add TaskType to required property.
 TaskTp TaskType
 
 // ExpectedCnt means this operator may be closed after fetching ExpectedCnt
 // records.
 ExpectedCnt float64
 
 // hashcode stores the hash code of a PhysicalProperty, will be lazily
 // calculated when function "HashCode()" being called.
 hashcode []byte
 
 // whether need to enforce property.
 Enforced bool
}

The physical property structure is used in TiDB to store the parent node's requirements for child nodes. The main properties are Items to store the requirement list. TaskTp stores the task types, including rootTask (TiDB execution) and copTash (TiKV execution). ExpectedCnt represents the number of record lines expected to be read.

Task & Cost

Task is a new version of a physical plan, which mainly stores the cost information of the plan. A task may be RootTask, CopTask, MPPTask or ParellelTask. In fact, only RootTask and CopTask are implemented for the task interface in the TiDB source code. The latter may be derived from the two tasks, which can be understood from the simple structure, A task stores the two most basic attributes of plan and cost. In the implementation of RootTask, there are only these two attributes, while in the implementation of CopTask, there are more attributes. Because it needs to be executed in TiKV, it needs more parameters.

Cost is the cost. In the generation of physical plan, each physical operator will calculate its cost, and then select the physical calculation with the lowest cost for conversion. The calculation of the cost depends on IO overhead, CPU overhead, memory overhead and network overhead. In the code, we can have a careful look at how it is calculated.

Because a physical plan is composed of multiple physical operators, the final comparison is that the sum of the costs of all physical operators is the smallest, that is, the optimal physical plan.

In this place, you can learn that in a complex plan, there may be repeated physical operators for calculation. If you calculate it once every time, the performance will be very poor. Then, use the memory search algorithm in TiDB to save the calculated values. Next time, you can do no more calculation to improve the performance.

Source code

The above introduces a series of concepts involved in the physical plan. Let's go deep into the source code to see the whole physical plan generation process.

The process from logical plan to physical plan is TiDB source code Planner / core / optimizer Go, the annotation of this method is very direct: optimize the logical plan into a physical plan.

DoOptimize

// DoOptimize optimizes a logical plan to a physical plan.
func DoOptimize(ctx context.Context, sctx sessionctx.Context, flag uint64, logic LogicalPlan) (PhysicalPlan, float64, error) {
 // Rule based optimization logic plan
 logic, err := logicalOptimize(ctx, flag, logic)
 
 // Optimize logical plan into physical plan based on cost calculation
 physical, cost, err := physicalOptimize(logic)
 
 
 // Post optimization of physical plan
 finalPlan := postOptimize(sctx, physical)
 return finalPlan, cost, nil
}

This method mainly includes three steps: logical plan optimization (rule-based optimization), logical plan optimization into physical plan (cost based optimization) and some final processing. Here we focus on the second method, physicalOptimize.

physcialOptimize

func physicalOptimize(logic LogicalPlan) (PhysicalPlan, float64, error) {
 // Initialize an empty Prop
 prop := &property.PhysicalProperty{
 TaskTp:      property.RootTaskType,
 ExpectedCnt: math.MaxFloat64,
 }
 // Find the best Task 
 t, err := logic.findBestTask(prop)
 
 return t.plan(), t.cost(), err
}

The PhysicalOptimize method initializes a Prop, that is, the PhysicalProperty structure mentioned earlier is empty at first. When constantly traversing child nodes, the Prop will continue to fill in data.

The following findBestTask method is a key method for physical plan conversion. Its function is to convert the logical plan into the optimal physical plan. You can see that there are seven implementations of this interface, which are different for different logical plans.

You can select one of the following implementations to understand the implementation process. Here, select base logical plan.

baseLogicalPlan.FindBestTask

// findBestTask implements LogicalPlan interface.
func (p *baseLogicalPlan) findBestTask(prop *property.PhysicalProperty) (bestTask task, err error) {
 
   // Look up the task with this prop in the task map.
   // It's used to reduce double counting.
   bestTask = p.getTask(prop)
 
   var plansFitsProp, plansNeedEnforce []PhysicalPlan
   // Maybe the plan can satisfy the required property,
   // so we try to get the task without the enforced sort first.
   plansFitsProp, hintWorksWithProp = p.self.exhaustPhysicalPlans(newProp)
   if !hintWorksWithProp && !newProp.IsEmpty() {
      // If there is a hint in the plan and the hint cannot satisfy the property,
      // we enforce this property and try to generate the PhysicalPlan again to
      // make sure the hint can work.
      newProp.Enforced = true
   }
 
   if newProp.Enforced {
 
 
      plansNeedEnforce, hintCanWork = p.self.exhaustPhysicalPlans(newProp)
 
      if hintCanWork && !hintWorksWithProp {
         plansFitsProp = nil
      }
 
      if !hintCanWork && !hintWorksWithProp && !prop.Enforced {
         plansNeedEnforce = nil
      }
   }
 
   newProp.Enforced = false
   if bestTask, err = p.enumeratePhysicalPlans4Task(plansFitsProp, newProp); err != nil {
      return nil, err
   }
   newProp.Enforced = true
   curTask, err := p.enumeratePhysicalPlans4Task(plansNeedEnforce, newProp)
 
   if curTask.cost() < bestTask.cost() || (bestTask.invalid() && !curTask.invalid()) {
      bestTask = curTask
   }
 
   p.storeTask(prop, bestTask)
   return bestTask, nil
}

The implementation of the method is relatively long. Part of the code is omitted in the paper, and only important codes and comments are retained. You have a basic understanding of the code logic through parameter names and comments. You can see that the Hints props mentioned earlier appear in the code. One is a prompt and the other is a parent node requirement. These two factors should be considered when generating a physical plan. You can also read from the code.

There are two key methods in this code, exhaustPhysicalPlans() and enumerationphysicalplans4task(), plus the method itself findBestTask(), which together constitute the whole process of physical plan generation.

The exhastphysicalplans () method will exhaustively discuss all physical plans that can be converted by a logical plan. For example, if it is a logical operator of a Join, it will discuss all physical operators that can be converted by a Join, such as HashJoin, IndexJoin, MergeJoin, etc.

You can see the annotation and implementation of the interface from the source code. The interface will return two values, one is all physical plans that meet the Prop, and the other is whether Hint works. In the process of findBestTask method above, we can know that when Hints does not work, that is, if Prop is not satisfied, the EnForced property of Prop will be modified to True, and the Prop will be forced to regenerate the physical operator to ensure that Hints can work. If both Prop and Hint cannot be satisfied at the same time, Then there is no physical operator to convert.

You can see that there are 16 implementations of this interface, including various logic plans. Here we choose the above LogicalJoin operator implementation to learn whether all Join physical operators have been discussed.

LogicalJoin.exhaustPhysicalPlans

// LogicalJoin can generates hash join, index join and sort merge join.
// Firstly we check the hint, if hint is figured by user, we force to choose the corresponding physical plan.
// If the hint is not matched, it will get other candidates.
// If the hint is not figured, we will pick all candidates.
func (p *LogicalJoin) exhaustPhysicalPlans(prop *property.PhysicalProperty) ([]PhysicalPlan, bool) {
 
   joins := make([]PhysicalPlan, 0, 8)
   if p.ctx.GetSessionVars().AllowBCJ {
      broadCastJoins := p.tryToGetBroadCastJoin(prop)
      if (p.preferJoinType & preferBCJoin) > 0 {
         return broadCastJoins, true
      }
      joins = append(joins, broadCastJoins...)
   }
   if prop.IsFlashOnlyProp() {
      return joins, true
   }
   // Discuss MergeJoin. If the preferJoinType specifies the physical operator, MergeJoin will be returned directly
   mergeJoins := p.GetMergeJoin(prop, p.schema, p.Stats(), p.children[0].statsInfo(), p.children[1].statsInfo())
   if (p.preferJoinType&preferMergeJoin) > 0 && len(mergeJoins) > 0 {
      return mergeJoins, true
   }
   joins = append(joins, mergeJoins...)
 
   // Discuss IndexJoin. If the preferJoinType specifies the physical operator, IndexJoin will be returned directly
   indexJoins, forced := p.tryToGetIndexJoin(prop)
   if forced {
      return indexJoins, true
   }
   joins = append(joins, indexJoins..
 
   // Discuss HashJoins. If the preferJoinType specifies the physical operator, HashJoins will be returned directly
   hashJoins := p.getHashJoins(prop)
   if (p.preferJoinType&preferHashJoin) > 0 && len(hashJoins) > 0 {
      return hashJoins, true
   }
   joins = append(joins, hashJoins...)
 
   if p.preferJoinType > 0 {
      // If we reach here, it means we have a hint that doesn't work.
      // It might be affected by the required property, so we enforce
      // this property and try the hint again.
      return joins, false
   }
   return joins, true
}

It can be clearly seen that the source code discusses four convertible physical operators: boradCastJoins, MergeJoins, IndexJoins and HashJoins, which will be added to the joins array and returned after each discussion. Of course, if Hint specifies a physical operator (p.preferJoinType), the physical operator will be returned directly after the discussion.

The last key method is enumerationphysicalplans4Task(), which is used to enumerate the tasks of all physical plans, that is, to calculate the cost of physical plans through enumeration. The input parameter is all qualified physical operators generated by the exhaustPhysicalPlans() method.

enumeratePhysicalPlans4Task

func (p *baseLogicalPlan) enumeratePhysicalPlans4Task(physicalPlans []PhysicalPlan, prop *property.PhysicalProperty) (task, error) {
   var bestTask task = invalidTask
   childTasks := make([]task, 0, len(p.children))
   // Traverse the child node and call the child node findBestTask method
   for _, pp := range physicalPlans {
      // find best child tasks firstly.
      childTasks = childTasks[:0]
      for i, child := range p.children {
         childTask, err := child.findBestTask(pp.GetChildReqProps(i))
         ......
         childTasks = append(childTasks, childTask)
      }
 
      // combine best child tasks with parent physical plan.
      curTask := pp.attach2Task(childTasks...)
 
      // Enforce curTask property
      if prop.Enforced {
         curTask = enforceProperty(prop, curTask, p.basePlan.ctx)
      }
 
      // optimize by shuffle executor to running in parallel manner.
      if prop.IsEmpty() {
         curTask = optimizeByShuffle(pp, curTask, p.basePlan.ctx)
      }
 
      // get the most efficient one.
      if curTask.cost() < bestTask.cost() || (bestTask.invalid() && !curTask.invalid()) {
         bestTask = curTask
      }
   }
   return bestTask, nil
}

The code logic is very clear, and we can see a very important information in this method, that is, in this method, we will traverse all the child nodes of the current node and call the findBestTask method of its child nodes to obtain the optimal physical operator that can be converted by the child nodes.

It should be noted that when a leaf node is reached, it is not necessary to continue to traverse its child nodes, because there are no child nodes. In a logical plan, the last leaf node is generally a table scanning by the logical operator DataSource. Therefore, enumerationphysicalplans4task will not be called in the implementation of findBestTask method of DataSource, Instead, we directly discuss the physical operators that DataSource can convert and calculate the cost.

Then, you should have a basic process in mind. How does the whole logical plan tree transform into a physical plan tree

Starting from the root node, call findBestTask to find all convertible physical operators of the root node through the exhastphysicalplans method
Traverse all its child nodes in enumerationphysicalplans4task, and then call findBestTask method on each child node to find all convertible physical operators of the current child node.
Until the last leaf node (usually DataSource), calculate the cost of physical operators from the leaf node
Recursively calculate the total cost of the parent node (including child nodes) and select the path with the lowest cost.
Finally, it returns to the root node and obtains the physical plan with the lowest cost.

On the whole, it can be understood as the cost of depth first traversal and recursive calculation of a tree.

There are too many contents in this article, so more in-depth code will not continue to follow. In fact, after understanding the three key methods findBestTask(), exhastphysicalplans() and enumerationphysicalplans4task(), you basically have a deep understanding of the generation of physical plans, As for the detailed process of each physical operator generation and cost calculation, you need to study it in the code yourself.

This paper does not provide a flow chart for your reference at the end, because the process generated by the physical plan will have different processes according to the complexity of the logical plan. A simple flow chart can not take into account all. The complete flow chart is too complex to help you understand quickly, and even lose the reason to look at the source code after reading it.

Keywords: Database

Added by adguru on Fri, 21 Jan 2022 10:23:18 +0200

Programming VIP