A Deep Dive into Go's Garbage Collection Mechanism

Go's GC was criticized heavily in its early days, but after the tricolor marking collector arrived in v1.5 and the hybrid write barrier in v1.8, typical GC pauses have dropped to the order of tens of microseconds, which is excellent. Let's explore how Go's GC works.

Principle of Tricolor Marking

Let's first look at a picture to get a general idea of how tricolor marking works.

Principle:

  1. At the start, put all objects into the white set.
  2. Traverse the objects reachable from the root objects, moving each white object reached from the white set into the grey set.
  3. Take an object from the grey set, move every white object it references into the grey set, and then move the object itself into the black set.
  4. Repeat step 3 until the grey set is empty.
  5. When step 4 finishes, the objects still in the white set are unreachable, i.e. garbage, and can be collected (see the sketch below).
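
To make these steps concrete, here is a minimal, runnable sketch of the algorithm. The Object type and the explicit white/grey/black sets are invented purely for illustration; the real runtime works on spans and mark bitmaps, not Go maps and slices.

package main

import "fmt"

type Object struct {
    name string
    refs []*Object
}

// mark returns the set of reachable (black) objects and prints the garbage.
func mark(roots, all []*Object) map[*Object]bool {
    // Step 1: every object starts in the white set.
    white := make(map[*Object]bool)
    for _, o := range all {
        white[o] = true
    }
    // Step 2: objects reachable from the roots move from white to grey.
    var grey []*Object
    for _, r := range roots {
        if white[r] {
            delete(white, r)
            grey = append(grey, r)
        }
    }
    // Steps 3-4: drain the grey set.
    black := make(map[*Object]bool)
    for len(grey) > 0 {
        o := grey[len(grey)-1]
        grey = grey[:len(grey)-1]
        for _, ref := range o.refs {
            if white[ref] { // grey every white object referenced by o
                delete(white, ref)
                grey = append(grey, ref)
            }
        }
        black[o] = true // the scanned object turns black
    }
    // Step 5: whatever is still white is unreachable, i.e. garbage.
    for o := range white {
        fmt.Println("garbage:", o.name)
    }
    return black
}

func main() {
    a := &Object{name: "A"}
    b := &Object{name: "B"}
    c := &Object{name: "C"} // never referenced, so it is reported as garbage
    a.refs = []*Object{b}
    mark([]*Object{a}, []*Object{a, b, c})
}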

Write barrier

Go does not stop the world during the concurrent tricolor marking phase, which means objects can still be modified while marking is in progress.

So consider the following scenario.

Suppose that while scanning the grey set we have scanned object A and marked all of A's references, and we are now about to scan object D's references. At this moment another goroutine modifies the reference D -> E, as shown in the figure below.

Could this cause E to be missed by the scan and mistaken for a white object, i.e. garbage?

The write barrier exists to solve exactly this problem. With the write barrier enabled, E is considered live after the steps above. Even if A later drops its reference to E, E will only be collected in the next GC cycle; it will not be collected in the current one.
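
Here is a hedged, illustrative-only sketch of the general form of that race (the Node type and the mutate function are invented for this example; the figure above shows the same idea with objects A, D and E):

// Illustrative only: the kind of mutation that happens while the GC is
// concurrently marking. Assume the GC has already scanned a (black) but
// has not yet scanned d (grey), and e is still white and only reachable
// through d.
type Node struct {
    ref *Node
}

func mutate(a, d, e *Node) {
    a.ref = e   // attach the white object e to the already-scanned a
    d.ref = nil // remove the only path the collector would have used to reach e
    // Without a write barrier neither store is visible to the collector,
    // so e would stay white and be wrongly collected even though it is
    // still reachable through a.
}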

The hybrid write barrier has been enabled since Go 1.8. Its pseudocode is as follows:

writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr

For a pointer write into a target slot, the hybrid write barrier shades both the old pointer being overwritten and the new pointer being written.

The old pointer is shaded because other running threads may simultaneously copy its value into a register or a local variable on the stack.
Such copies do not go through the write barrier, so the pointer could otherwise escape marking. Consider the following scenario:

[go] b = obj
[go] oldx = nil
[gc] scan oldx...
[go] oldx = b.x // Copy b.x to local variables without crossing the write barrier
[go] b.x = ptr // Write barriers should mark the original value of b.x
[gc] scan b...
//If the write barrier does not mark the original value, oldx will not be scanned.

The new pointer is shaded because other running threads may move the pointer into a location that has already been scanned. Consider the following scenario:

[go] a = ptr
[go] b = obj
[gc] scan b...
[go] b.x = a // Write barriers should mark new values of b.x
[go] a = nil
[gc] scan a...
//If the write barrier does not mark the new value, the ptr will not be scanned.

With the hybrid write barrier, the GC no longer needs to re-scan the stack of every G after concurrent marking, which reduces the STW time of Mark Termination.

In addition to the write barrier, all objects newly allocated during GC are immediately marked black, as can be seen in the mallocgc function discussed earlier.

Collection process

Go's GC is a concurrent GC: most of the GC work runs concurrently with ordinary Go code, which makes the whole process more complex.
The GC has four phases:

  • Sweep Termination: finish sweeping any spans left unswept by the previous cycle; a new GC cycle can only start once the previous sweep is complete
  • Mark: scan all root objects and every object reachable from them, marking them so they are not collected
  • Mark Termination: complete marking and re-scan part of the root objects (requires STW)
  • Sweep: sweep spans according to the mark results

The following figure is a relatively complete GC process, and classifies the four stages by color:

There are two kinds of background tasks (Gs) in the GC process: one for marking and one for sweeping.
Mark background tasks are started when needed; the number that may work at the same time is about 25% of the number of Ps, which is the basis for Go's claim that GC uses 25% of the CPU.
The sweep background task is started when the program starts and is woken up when the sweep phase begins.

At present the whole GC process incurs two STW (Stop The World) pauses: the first at the beginning of the Mark phase and the second in the Mark Termination phase.
The first STW prepares the scanning of the root objects and enables the write barrier and mutator assists.
The second STW re-scans part of the root objects and disables the write barrier and mutator assists.
Note that not all root scanning requires STW; for example, scanning the objects on a stack only requires stopping the G that owns that stack.
Because the write barrier is implemented as a hybrid write barrier, the time of the second STW is greatly reduced.
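
A simple way to observe a complete cycle from the outside is to run a program with GODEBUG=gctrace=1 (the per-cycle trace line is printed by gcMarkTermination, analyzed later) or to read the pause statistics directly. A minimal sketch using only the public runtime API:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Allocate something so the collector has work to do.
    data := make([][]byte, 0, 1024)
    for i := 0; i < 1024; i++ {
        data = append(data, make([]byte, 1<<10))
    }
    _ = data

    var m runtime.MemStats
    runtime.GC() // forces a full cycle: sweep termination -> mark -> mark termination -> sweep
    runtime.ReadMemStats(&m)
    // PauseNs is a circular buffer; the most recent STW pause is at index (NumGC+255)%256.
    fmt.Printf("cycles: %d, last pause: %d ns\n", m.NumGC, m.PauseNs[(m.NumGC+255)%256])
}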

Source code analysis

gcStart

func gcStart(mode gcMode, trigger gcTrigger) {
    // Since this is called from malloc and malloc is called in
    // the guts of a number of libraries that might be holding
    // locks, don't attempt to start GC in non-preemptible or
    // potentially unstable situations.
    // Check whether the current g is preemptible; do not trigger GC if it is not
    mp := acquirem()
    if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
        releasem(mp)
        return
    }
    releasem(mp)
    mp = nil

    // Pick up the remaining unswept/not being swept spans concurrently
    //
    // This shouldn't happen if we're being invoked in background
    // mode since proportional sweep should have just finished
    // sweeping everything, but rounding errors, etc, may leave a
    // few spans unswept. In forced mode, this is necessary since
    // GC can be forced at any point in the sweeping cycle.
    //
    // We check the transition condition continuously here in case
    // this G gets delayed in to the next GC cycle.
    // Sweep any spans left unswept from the previous cycle
    for trigger.test() && gosweepone() != ^uintptr(0) {
        sweep.nbgsweep++
    }

    // Perform GC initialization and the sweep termination
    // transition.
    semacquire(&work.startSema)
    // Re-check transition condition under transition lock.
    // Re-check whether the gcTrigger condition still holds
    if !trigger.test() {
        semrelease(&work.startSema)
        return
    }

    // For stats, check if this GC was forced by the user
    // Determine and record whether this GC was forced by the user, e.g. by calling runtime.GC()
    work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle

    // In gcstoptheworld debug mode, upgrade the mode accordingly.
    // We do this after re-checking the transition condition so
    // that multiple goroutines that detect the heap trigger don't
    // start multiple STW GCs.
    // Setting the mode of gc
    if mode == gcBackgroundMode {
        if debug.gcstoptheworld == 1 {
            mode = gcForceMode
        } else if debug.gcstoptheworld == 2 {
            mode = gcForceBlockMode
        }
    }

    // Ok, we're doing it! Stop everybody else
    semacquire(&worldsema)

    if trace.enabled {
        traceGCStart()
    }
    // Start the background mark workers
    if mode == gcBackgroundMode {
        gcBgMarkStartWorkers()
    }
    // Reset the status associated with gc markers
    gcResetMarkState()

    work.stwprocs, work.maxprocs = gomaxprocs, gomaxprocs
    if work.stwprocs > ncpu {
        // This is used to compute CPU time of the STW phases,
        // so it can't be more than ncpu, even if GOMAXPROCS is.
        work.stwprocs = ncpu
    }
    work.heap0 = atomic.Load64(&memstats.heap_live)
    work.pauseNS = 0
    work.mode = mode

    now := nanotime()
    work.tSweepTerm = now
    work.pauseStart = now
    if trace.enabled {
        traceGCSTWStart(1)
    }
    // STW, Stop the World
    systemstack(stopTheWorldWithSema)
    // Finish sweep before we start concurrent scan.
    // Finish sweeping the previous round to make sure the last GC has fully completed
    systemstack(func() {
        finishsweep_m()
    })
    // clearpools before we start the GC. If we wait they memory will not be
    // reclaimed until the next GC cycle.
    // Clear sync.Pool, sched.sudogcache and sched.deferpool; not expanded here (sync.Pool was covered earlier, the rest will be covered in later articles)
    clearpools()

    // Increment the GC cycle count
    work.cycles++
    if mode == gcBackgroundMode { // Do as much work concurrently as possible
        gcController.startCycle()
        work.heapGoal = memstats.next_gc

        // Enter concurrent mark phase and enable
        // write barriers.
        //
        // Because the world is stopped, all Ps will
        // observe that write barriers are enabled by
        // the time we start the world and begin
        // scanning.
        //
        // Write barriers must be enabled before assists are
        // enabled because they must be enabled before
        // any non-leaf heap objects are marked. Since
        // allocations are blocked until assists can
        // happen, we want enable assists as early as
        // possible.
        // Set the status of GC to gcMark
        setGCPhase(_GCmark)

        // Update the status of bgmark
        gcBgMarkPrepare() // Must happen before assist enable.
        // Calculate and queue root scanning tasks and initialize the status of related scanning tasks
        gcMarkRootPrepare()

        // Mark all active tinyalloc blocks. Since we're
        // allocating from these, they need to be black like
        // other allocations. The alternative is to blacken
        // the tiny block on every allocation from it, which
        // would slow down the tiny allocator.
        // Marking tiny objects
        gcMarkTinyAllocs()

        // At this point all Ps have enabled the write
        // barrier, thus maintaining the no white to
        // black invariant. Enable mutator assists to
        // put back-pressure on fast allocating
        // mutators.
        // Set gcBlackenEnabled to 1 to enable mutator assists and background mark workers
        atomic.Store(&gcBlackenEnabled, 1)

        // Assists and workers can start the moment we start
        // the world.
        gcController.markStartTime = now

        // Concurrent mark.
        systemstack(func() {
            now = startTheWorldWithSema(trace.enabled)
        })
        work.pauseNS += now - work.pauseStart
        work.tMark = now
    } else {
        // Non-concurrent mode
        // Record the start time of the mark termination phase
        if trace.enabled {
            // Switch to mark termination STW.
            traceGCSTWDone()
            traceGCSTWStart(0)
        }
        t := nanotime()
        work.tMark, work.tMarkTerm = t, t
        work.heapGoal = work.heap0

        // Perform mark termination. This will restart the world.
        // stw, mark, sweep and start the world
        gcMarkTermination(memstats.triggerRatio)
    }

    semrelease(&work.startSema)
}
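
As the userForced check above suggests, a cycle can be forced from user code with runtime.GC(), and the heap-growth target recorded in memstats.next_gc / work.heapGoal is controlled through runtime/debug.SetGCPercent. A small sketch using only the public API (the trigger bookkeeping shown in gcStart itself is internal):

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Lowering GOGC shrinks how much the heap may grow before the next
    // cycle is triggered (the next_gc / heapGoal values seen in gcStart).
    old := debug.SetGCPercent(50)
    defer debug.SetGCPercent(old)

    var m runtime.MemStats
    runtime.GC() // a user-forced cycle; gcStart records work.userForced = true
    runtime.ReadMemStats(&m)
    fmt.Println("forced cycles:", m.NumForcedGC, "next GC goal:", m.NextGC, "bytes")
}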

gcBgMarkStartWorkers

This function prepares goroutines to perform background mark work. They do not start working immediately; they wait until the GC phase is set to _GCmark (the setGCPhase(_GCmark) call in gcStart above).

func gcBgMarkStartWorkers() {
    // Background marking is performed by per-P G's. Ensure that
    // each P has a background GC G.
    for _, p := range allp {
        if p.gcBgMarkWorker == 0 {
            go gcBgMarkWorker(p)
            // Wait for the bgMarkReady signal of gcBgMarkWorker goroutine to continue
            notetsleepg(&work.bgMarkReady, -1)
            noteclear(&work.bgMarkReady)
        }
    }
}

gcBgMarkWorker

The background mark worker function

func gcBgMarkWorker(_p_ *p) {
    gp := getg()
    // Used to recover p and m after the worker is parked
    type parkInfo struct {
        m      muintptr // Release this m on park.
        attach puintptr // If non-nil, attach to this p on park.
    }
    // We pass park to a gopark unlock function, so it can't be on
    // the stack (see gopark). Prevent deadlock from recursively
    // starting GC by disabling preemption.
    gp.m.preemptoff = "GC worker init"
    park := new(parkInfo)
    gp.m.preemptoff = ""
    // Record m and p in park and pass it to gopark so they can be recovered when woken by gcController.findRunnable
    park.m.set(acquirem())
    park.attach.set(_p_)
    // Inform gcBgMarkStartWorkers that this worker is ready.
    // After this point, the background mark worker is scheduled
    // cooperatively by gcController.findRunnable. Hence, it must
    // never be preempted, as this would put it into _Grunnable
    // and put it on a run queue. Instead, when the preempt flag
    // is set, this puts itself into _Gwaiting to be woken up by
    // gcController.findRunnable at the appropriate time.
    // Wake up the notetsleepg in gcBgMarkStartWorkers so it can continue to the next P
    notewakeup(&work.bgMarkReady)

    for {
        // Go to sleep until woken by gcController.findRunnable.
        // We can't releasem yet since even the call to gopark
        // may be preempted.
        // Let g go to sleep
        gopark(func(g *g, parkp unsafe.Pointer) bool {
            park := (*parkInfo)(parkp)

            // The worker G is no longer running, so it's
            // now safe to allow preemption.
            // Release the currently preempted m
            releasem(park.m.ptr())

            // If the worker isn't attached to its P,
            // attach now. During initialization and after
            // a phase change, the worker may have been
            // running on a different P. As soon as we
            // attach, the owner P may schedule the
            // worker, so this must be done after the G is
            // stopped.
            // Attach to the p that was recorded above
            if park.attach != 0 {
                p := park.attach.ptr()
                park.attach.set(nil)
                // cas the worker because we may be
                // racing with a new worker starting
                // on this P.
                if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
                    // The P got a new worker.
                    // Exit this worker.
                    return false
                }
            }
            return true
        }, unsafe.Pointer(park), waitReasonGCWorkerIdle, traceEvGoBlock, 0)

        // Loop until the P dies and disassociates this
        // worker (the P may later be reused, in which case
        // it will get a new worker) or we failed to associate.
        // Check if the gcBgMarkWorker of P is consistent with the current G and terminate the current task when inconsistent
        if _p_.gcBgMarkWorker.ptr() != gp {
            break
        }

        // Disable preemption so we can use the gcw. If the
        // scheduler wants to preempt us, we'll stop draining,
        // dispose the gcw, and then preempt.
        // The gopark unlock function released m; re-acquire it here
        park.m.set(acquirem())

        if gcBlackenEnabled == 0 {
            throw("gcBgMarkWorker: blackening not enabled")
        }

        startTime := nanotime()
        // Setting the start time of gcmark
        _p_.gcMarkWorkerStartTime = startTime

        decnwait := atomic.Xadd(&work.nwait, -1)
        if decnwait == work.nproc {
            println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
            throw("work.nwait was > work.nproc")
        }
        // Switch to g0
        systemstack(func() {
            // Mark our goroutine preemptible so its stack
            // can be scanned. This lets two mark workers
            // scan each other (otherwise, they would
            // deadlock). We must not modify anything on
            // the G stack. However, stack shrinking is
            // disabled for mark workers, so it is safe to
            // read from the G stack.
            // Set the state of G to wait so that another G can scan its stack (two G can scan each other's stack)
            casgstatus(gp, _Grunning, _Gwaiting)
            switch _p_.gcMarkWorkerMode {
            default:
                throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
            case gcMarkWorkerDedicatedMode:
                // Dedicated mode: concentrate on marking until preempted
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
                if gp.preempt {
                    // Preempted: move all Gs in the local run queue to the global run queue
                    // We were preempted. This is
                    // a useful signal to kick
                    // everything out of the run
                    // queue so it can run
                    // somewhere else.
                    lock(&sched.lock)
                    for {
                        gp, _ := runqget(_p_)
                        if gp == nil {
                            break
                        }
                        globrunqput(gp)
                    }
                    unlock(&sched.lock)
                }
                // Go back to draining, this time
                // without preemption.
                // Continue marking
                gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
            case gcMarkWorkerFractionalMode:
                // Fractional mode: mark until preempted
                gcDrain(&_p_.gcw, gcDrainFractional|gcDrainUntilPreempt|gcDrainFlushBgCredit)
            case gcMarkWorkerIdleMode:
                // Idle mode: mark only while the P would otherwise be idle
                gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
            }
            // Switch G back from _Gwaiting to _Grunning
            casgstatus(gp, _Gwaiting, _Grunning)
        })

        // If we are nearing the end of mark, dispose
        // of the cache promptly. We must do this
        // before signaling that we're no longer
        // working so that other workers can't observe
        // no workers and no work while we have this
        // cached, and before we compute done.
        // Handle local caches in time and submit them to global queues
        if gcBlackenPromptly {
            _p_.gcw.dispose()
        }

        // Account for time.
        // Accumulate the time spent marking
        duration := nanotime() - startTime
        switch _p_.gcMarkWorkerMode {
        case gcMarkWorkerDedicatedMode:
            atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
            atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
        case gcMarkWorkerFractionalMode:
            atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
            atomic.Xaddint64(&_p_.gcFractionalMarkTime, duration)
        case gcMarkWorkerIdleMode:
            atomic.Xaddint64(&gcController.idleMarkTime, duration)
        }

        // Was this the last worker and did we run out
        // of work?
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait > work.nproc {
            println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
                "work.nwait=", incnwait, "work.nproc=", work.nproc)
            throw("work.nwait > work.nproc")
        }

        // If this worker reached a background mark completion
        // point, signal the main GC goroutine.
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // Make this G preemptible and disassociate it
            // as the worker for this P so
            // findRunnableGCWorker doesn't try to
            // schedule it.
            // Detach this worker from p and release m
            _p_.gcBgMarkWorker.set(nil)
            releasem(park.m.ptr())

            gcMarkDone()

            // Disable preemption and prepare to reattach
            // to the P.
            //
            // We may be running on a different P at this
            // point, so we can't reattach until this G is
            // parked.
            park.m.set(acquirem())
            park.attach.set(_p_)
        }
    }
}
gcDrain

The core implementation of tricolor marking

gcDrain scans the roots and the queued objects, greying and blackening them, until all roots and objects have been marked.

func gcDrain(gcw *gcWork, flags gcDrainFlags) {
    if !writeBarrier.needed {
        throw("gcDrain phase incorrect")
    }

    gp := getg().m.curg
    // Whether this drain should return when the G is preempted
    preemptible := flags&gcDrainUntilPreempt != 0
    // Whether to wait for a task when there is no task
    blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainFractional|gcDrainNoBlock) == 0
    // Whether to flush scan credit to the global account so mutator assists can be reduced and waiting Gs woken
    flushBgCredit := flags&gcDrainFlushBgCredit != 0
    // Whether to perform markup tasks in idle time
    idle := flags&gcDrainIdle != 0
    // Record the initial scanned tasks that have been performed
    initScanWork := gcw.scanWork

    // checkWork is the scan work before performing the next
    // self-preempt check.
    // Setting Work Check Function for Corresponding Mode
    checkWork := int64(1<<63 - 1)
    var check func() bool
    if flags&(gcDrainIdle|gcDrainFractional) != 0 {
        checkWork = initScanWork + drainCheckThreshold
        if idle {
            check = pollWork
        } else if flags&gcDrainFractional != 0 {
            check = pollFractionalWorkerExit
        }
    }

    // Drain root marking jobs.
    // If the root objects have not all been scanned yet, scan them first
    if work.markrootNext < work.markrootJobs {
        for !(preemptible && gp.preempt) {
            job := atomic.Xadd(&work.markrootNext, +1) - 1
            if job >= work.markrootJobs {
                break
            }
            // Perform root scan tasks
            markroot(gcw, job)
            if check != nil && check() {
                goto done
            }
        }
    }

    // Drain heap marking jobs.
    // Cycle until preempted
    for !(preemptible && gp.preempt) {
        // Try to keep work available on the global queue. We used to
        // check if there were waiting workers, but it's better to
        // just keep work available than to make workers wait. In the
        // worst case, we'll do O(log(_WorkbufSize)) unnecessary
        // balances.
        if work.full == 0 {
            // Balance the work: if the global mark queue is empty, move part of the local work to it
            gcw.balance()
        }

        var b uintptr
        if blocking {
            b = gcw.get()
        } else {
            b = gcw.tryGetFast()
            if b == 0 {
                b = gcw.tryGet()
            }
        }
        // Getting the task failed and jumping out of the loop
        if b == 0 {
            // work barrier reached or tryGet failed.
            break
        }
        // Scanning the acquired object
        scanobject(b, gcw)

        // Flush background scan work credit to the global
        // account if we've accumulated enough locally so
        // mutator assists can draw on it.
        // If the current number of scans exceeds gcCreditSlack, add the number of scanned objects to the global number and update them in batches.
        if gcw.scanWork >= gcCreditSlack {
            atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
            if flushBgCredit {
                gcFlushBgCredit(gcw.scanWork - initScanWork)
                initScanWork = 0
            }
            checkWork -= gcw.scanWork
            gcw.scanWork = 0
            // If the scan work has reached checkWork, the target for the next self-preemption check, call the check function for the corresponding mode
            // (pollWork for idle mode, pollFractionalWorkerExit for fractional mode, set near the top of this function)
            if checkWork <= 0 {
                checkWork += drainCheckThreshold
                if check != nil && check() {
                    break
                }
            }
        }
    }

    // In blocking mode, write barriers are not allowed after this
    // point because we must preserve the condition that the work
    // buffers are empty.

done:
    // Flush remaining scan work credit.
    if gcw.scanWork > 0 {
        // Add the number of scanned objects to the global
        atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
        if flushBgCredit {
            gcFlushBgCredit(gcw.scanWork - initScanWork)
        }
        gcw.scanWork = 0
    }
}
markroot

This is used for root object scanning

func markroot(gcw *gcWork, i uint32) {
    // TODO(austin): This is a bit ridiculous. Compute and store
    // the bases in gcMarkRootPrepare instead of the counts.
    baseFlushCache := uint32(fixedRootCount)
    baseData := baseFlushCache + uint32(work.nFlushCacheRoots)
    baseBSS := baseData + uint32(work.nDataRoots)
    baseSpans := baseBSS + uint32(work.nBSSRoots)
    baseStacks := baseSpans + uint32(work.nSpanRoots)
    end := baseStacks + uint32(work.nStackRoots)

    // Note: if you add a case here, please also update heapdump.go:dumproots.
    switch {
    // Release span in mcache
    case baseFlushCache <= i && i < baseData:
        flushmcache(int(i - baseFlushCache))
    // Scan the data section (initialized global variables)
    case baseData <= i && i < baseBSS:
        for _, datap := range activeModules() {
            markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData))
        }
    // Scan the BSS section (uninitialized global variables)
    case baseBSS <= i && i < baseSpans:
        for _, datap := range activeModules() {
            markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS))
        }
    // Scanning Finalizer Queue
    case i == fixedRootFinalizers:
        // Only do this once per GC cycle since we don't call
        // queuefinalizer during marking.
        if work.markrootDone {
            break
        }
        for fb := allfin; fb != nil; fb = fb.alllink {
            cnt := uintptr(atomic.Load(&fb.cnt))
            scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw)
        }
    // Free the stacks of terminated Gs
    case i == fixedRootFreeGStacks:
        // Only do this once per GC cycle; preferably
        // concurrently.
        if !work.markrootDone {
            // Switch to the system stack so we can call
            // stackfree.
            systemstack(markrootFreeGStacks)
        }
    // Scanning MSpan.specials
    case baseSpans <= i && i < baseStacks:
        // mark MSpan.specials
        markrootSpans(gcw, int(i-baseSpans))

    default:
        // the rest is scanning goroutine stacks
        // Get the g that needs to be scanned
        var gp *g
        if baseStacks <= i && i < end {
            gp = allgs[i-baseStacks]
        } else {
            throw("markroot: bad index")
        }

        // remember when we've first observed the G blocked
        // needed only to output in traceback
        status := readgstatus(gp) // We are not in a scan state
        if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
            gp.waitsince = work.tstart
        }

        // scang must be done on the system stack in case
        // we're trying to scan our own stack.
        // Transfer to g0 for scanning
        systemstack(func() {
            // If this is a self-scan, put the user G in
            // _Gwaiting to prevent self-deadlock. It may
            // already be in _Gwaiting if this is a mark
            // worker or we're in mark termination.
            userG := getg().m.curg
            selfScan := gp == userG && readgstatus(userG) == _Grunning
            // If it scans itself, it changes the state of its g
            if selfScan {
                casgstatus(userG, _Grunning, _Gwaiting)
                userG.waitreason = waitReasonGarbageCollectionScan
            }

            // TODO: scang blocks until gp's stack has
            // been scanned, which may take a while for
            // running goroutines. Consider doing this in
            // two phases where the first is non-blocking:
            // we scan the stacks we can and ask running
            // goroutines to scan themselves; and the
            // second blocks.
            // Scanning g's stack
            scang(gp, gcw)

            if selfScan {
                casgstatus(userG, _Gwaiting, _Grunning)
            }
        })
    }
}
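
The finalizer queue scanned in the fixedRootFinalizers case above is filled through runtime.SetFinalizer; a small usage sketch (the resource type is invented for this example):

package main

import (
    "fmt"
    "runtime"
    "time"
)

type resource struct{ name string }

func main() {
    r := &resource{name: "demo"}
    // Register a finalizer for r; queued finalizers are among the roots
    // that markroot scans in the fixedRootFinalizers case.
    runtime.SetFinalizer(r, func(r *resource) {
        fmt.Println("finalizing", r.name)
    })
    r = nil
    runtime.GC()                       // r is now unreachable, so its finalizer gets queued
    time.Sleep(100 * time.Millisecond) // give the finalizer goroutine a chance to run
}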
markrootBlock

Scans the region [b0, b0+n0) according to ptrmask0.

func markrootBlock(b0, n0 uintptr, ptrmask0 *uint8, gcw *gcWork, shard int) {
    if rootBlockBytes%(8*sys.PtrSize) != 0 {
        // This is necessary to pick byte offsets in ptrmask0.
        throw("rootBlockBytes must be a multiple of 8*ptrSize")
    }

    b := b0 + uintptr(shard)*rootBlockBytes
    // If the block area to be scanned is beyond b0+n0, return directly
    if b >= b0+n0 {
        return
    }
    ptrmask := (*uint8)(add(unsafe.Pointer(ptrmask0), uintptr(shard)*(rootBlockBytes/(8*sys.PtrSize))))
    n := uintptr(rootBlockBytes)
    if b+n > b0+n0 {
        n = b0 + n0 - b
    }

    // Scan this shard.
    // Scanning shard for a given block
    scanblock(b, n, ptrmask, gcw)
}
scanblock
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
    // Use local copies of original parameters, so that a stack trace
    // due to one of the throws below shows the original block
    // base and extent.
    b := b0
    n := n0

    for i := uintptr(0); i < n; {
        // Find bits for the next word.
        // Find the corresponding bits in the bitmap
        bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
        if bits == 0 {
            i += sys.PtrSize * 8
            continue
        }
        for j := 0; j < 8 && i < n; j++ {
            if bits&1 != 0 {
                // If the address contains a pointer
                // Same work as in scanobject; see comments there.
                obj := *(*uintptr)(unsafe.Pointer(b + i))
                if obj != 0 {
                    // If an object is found at this address, grey it
                    if obj, span, objIndex := findObject(obj, b, i); obj != 0 {
                        greyobject(obj, b, i, span, gcw, objIndex)
                    }
                }
            }
            bits >>= 1
            i += sys.PtrSize
        }
    }
}
greyobject

Greying an object means finding its mark bit in the bitmap, marking it as alive, and putting it into the work queue.

func greyobject(obj, base, off uintptr, span *mspan, gcw *gcWork, objIndex uintptr) {
    // obj should be start of allocation, and so must be at least pointer-aligned.
    if obj&(sys.PtrSize-1) != 0 {
        throw("greyobject: obj not pointer-aligned")
    }
    mbits := span.markBitsForIndex(objIndex)

    if useCheckmark {
        // checkmark is a debug mode used to verify that all reachable objects are correctly marked
        if !mbits.isMarked() {
            // This object is not marked
            printlock()
            print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n")
            print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n")

            // Dump the source (base) object
            gcDumpObject("base", base, off)

            // Dump the object
            gcDumpObject("obj", obj, ^uintptr(0))

            getg().m.traceback = 2
            throw("checkmark found unmarked object")
        }
        hbits := heapBitsForAddr(obj)
        if hbits.isCheckmarked(span.elemsize) {
            return
        }
        hbits.setCheckmarked(span.elemsize)
        if !hbits.isCheckmarked(span.elemsize) {
            throw("setCheckmarked and isCheckmarked disagree")
        }
    } else {
        if debug.gccheckmark > 0 && span.isFree(objIndex) {
            print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n")
            gcDumpObject("base", base, off)
            gcDumpObject("obj", obj, ^uintptr(0))
            getg().m.traceback = 2
            throw("marking free object")
        }

        // If marked we have nothing to do.
        // Objects are correctly marked and no other operations are required
        if mbits.isMarked() {
            return
        }
        // mbits.setMarked() // Avoid extra call overhead with manual inlining.
        // Mark the object
        atomic.Or8(mbits.bytep, mbits.mask)
        // If this is a noscan object, fast-track it to black
        // instead of greying it.
        // If the object contains no pointers, marking it is enough; it does not need to be queued, which is equivalent to marking it black directly
        if span.spanclass.noscan() {
            gcw.bytesMarked += uint64(span.elemsize)
            return
        }
    }

    // Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
    // seems like a nice optimization that can be added back in.
    // There needs to be time between the PREFETCH and the use.
    // Previously we put the obj in an 8 element buffer that is drained at a rate
    // to give the PREFETCH time to do its work.
    // Use of PREFETCHNTA might be more appropriate than PREFETCH
    // Put the object into the work queue (fast path first); once queued, the greying step is complete
    if !gcw.putFast(obj) {
        gcw.put(obj)
    }
}
gcWork.putFast

gcWork keeps two buffers, wbuf1 and wbuf2, to hold grey objects. Grey objects are first added to wbuf1; when wbuf1 fills up, wbuf1 and wbuf2 are swapped, so wbuf2 is promoted to wbuf1 and continues receiving grey objects. When both buffers are full, the full one is handed over to the global queue and an empty one is requested from it.

putFast tries to put the object into the wbuf1 buffer.

func (w *gcWork) putFast(obj uintptr) bool {
    wbuf := w.wbuf1
    if wbuf == nil {
        // wbuf1 has not been allocated yet, return false
        return false
    } else if wbuf.nobj == len(wbuf.obj) {
        // wbuf1 queue full, return false
        return false
    }

    // Add objects to a queue that is not full of wbuf1
    wbuf.obj[wbuf.nobj] = obj
    wbuf.nobj++
    return true
}
gcWork.put

put not only tries to place the object into wbuf1; when wbuf1 is full it swaps the roles of wbuf1 and wbuf2, and if both are full it hands the full buffer over to the global queue and gets an empty one back.

func (w *gcWork) put(obj uintptr) {
    flushed := false
    wbuf := w.wbuf1
    if wbuf == nil {
        // If wbuf1 does not exist, initialize wbuf1 wbuf2 two queues
        w.init()
        wbuf = w.wbuf1
        // wbuf is empty at this point.
    } else if wbuf.nobj == len(wbuf.obj) {
        // Wbuf1 is full, change the role of wbuf1 wbuf2
        w.wbuf1, w.wbuf2 = w.wbuf2, w.wbuf1
        wbuf = w.wbuf1
        if wbuf.nobj == len(wbuf.obj) {
            // After changing roles, wbuf1 is full, indicating that both queues are full.
            // Hand wbuf1 over globally and get an empty queue
            putfull(wbuf)
            wbuf = getempty()
            w.wbuf1 = wbuf
            // Set the flag bit for queue submission
            flushed = true
        }
    }

    wbuf.obj[wbuf.nobj] = obj
    wbuf.nobj++

    // If we put a buffer on full, let the GC controller know so
    // it can encourage more workers to run. We delay this until
    // the end of put so that w is in a consistent state, since
    // enlistWorker may itself manipulate w.
    // A full buffer was handed to the global queue, so notify the GC controller so it can start more mark workers
    if flushed && gcphase == _GCmark {
        gcController.enlistWorker()
    }
}

Now, let's go on to analyze the functions in gcDrain and track how our gray objects are blackened.

gcw.balance()

Back in gcDrain, what does the gcw.balance() call actually do?

func (w *gcWork) balance() {
    if w.wbuf1 == nil {
        // Here the wbuf1 wbuf2 queue has not been initialized
        return
    }
    // If wbuf2 is not empty, hand it to the global queue and get an empty buffer for wbuf2
    if wbuf := w.wbuf2; wbuf.nobj != 0 {
        putfull(wbuf)
        w.wbuf2 = getempty()
    } else if wbuf := w.wbuf1; wbuf.nobj > 4 {
        // Divide the unfilled wbuf1 into two halves and submit half of them to the global queue
        w.wbuf1 = handoff(wbuf)
    } else {
        return
    }
    // We flushed a buffer to the full list, so wake a worker.
    // A buffer was flushed to the global full list, so other workers now have work to pick up
    if gcphase == _GCmark {
        gcController.enlistWorker()
    }
}
gcw.get()

Also called from gcDrain: get first tries to take an object from the local wbuf1; if wbuf1 is empty it tries wbuf2, and if both are empty it tries to fetch a full buffer from the global queue and take an object from that.

func (w *gcWork) get() uintptr {
    wbuf := w.wbuf1
    if wbuf == nil {
        w.init()
        wbuf = w.wbuf1
        // wbuf is empty at this point.
    }
    if wbuf.nobj == 0 {
        // Wbuf1 is empty, change the role of wbuf1 wbuf2
        w.wbuf1, w.wbuf2 = w.wbuf2, w.wbuf1
        wbuf = w.wbuf1
        // The original wbuf2 is also empty, trying to get a full queue from the global queue
        if wbuf.nobj == 0 {
            owbuf := wbuf
            wbuf = getfull()
            // If not, return
            if wbuf == nil {
                return 0
            }
            // Upload the empty queue to the global empty queue and take the full queue as its own wbuf1
            putempty(owbuf)
            w.wbuf1 = wbuf
        }
    }

    // TODO: This might be a good place to add prefetch code

    wbuf.nobj--
    return wbuf.obj[wbuf.nobj]
}

gcWork.tryGet() and gcWork.tryGetFast() follow similar logic and are relatively simple, so we will not analyze them here.

scanobject

Back in gcDrain, once a pointer b has been taken from the queue, scanobject consumes it.

func scanobject(b uintptr, gcw *gcWork) {
    // Find the bits for b and the size of the object at b.
    //
    // b is either the beginning of an object, in which case this
    // is the size of the object to scan, or it points to an
    // oblet, in which case we compute the size to scan below.
    // Get bits corresponding to b
    hbits := heapBitsForAddr(b)
    // Get the span where b is located
    s := spanOfUnchecked(b)
    n := s.elemsize
    if n == 0 {
        throw("scanobject n == 0")
    }
    // If the object is too large, split it into oblets and scan them separately; maxObletBytes is 128KB
    if n > maxObletBytes {
        // Large object. Break into oblets for better
        // parallelism and lower latency.
        if b == s.base() {
            // It's possible this is a noscan object (not
            // from greyobject, but from other code
            // paths), in which case we must *not* enqueue
            // oblets since their bitmaps will be
            // uninitialized.
            // If the object contains no pointers, just count it as marked and return, which is equivalent to marking it black
            if s.spanclass.noscan() {
                // Bypass the whole scan.
                gcw.bytesMarked += uint64(n)
                return
            }

            // Enqueue the other oblets to scan later.
            // Some oblets may be in b's scalar tail, but
            // these will be marked as "no more pointers",
            // so we'll drop out immediately when we go to
            // scan those.
            // Cut by maxObletBytes and put in the queue
            for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
                if !gcw.putFast(oblet) {
                    gcw.put(oblet)
                }
            }
        }

        // Compute the size of the oblet. Since this object
        // must be a large object, s.base() is the beginning
        // of the object.
        n = s.base() + s.elemsize - b
        if n > maxObletBytes {
            n = maxObletBytes
        }
    }

    var i uintptr
    for i = 0; i < n; i += sys.PtrSize {
        // Find bits for this word.
        // Get the corresponding bits
        if i != 0 {
            // Avoid needless hbits.next() on last iteration.
            hbits = hbits.next()
        }
        // Load bits once. See CL 22712 and issue 16973 for discussion.
        bits := hbits.bits()
        // During checkmarking, 1-word objects store the checkmark
        // in the type bit for the one word. The only one-word objects
        // are pointers, or else they'd be merged with other non-pointer
        // data into larger allocations.
        if i != 1*sys.PtrSize && bits&bitScan == 0 {
            break // no more pointers in this object
        }
        // Not a pointer. Continue.
        if bits&bitPointer == 0 {
            continue // not a pointer
        }

        // Work here is duplicated in scanblock and above.
        // If you make changes here, make changes there too.
        obj := *(*uintptr)(unsafe.Pointer(b + i))

        // At this point we have extracted the next potential pointer.
        // Quickly filter out nil and pointers back to the current object.
        if obj != 0 && obj-b >= n {
            // Test if obj points into the Go heap and, if so,
            // mark the object.
            //
            // Note that it's possible for findObject to
            // fail if obj points to a just-allocated heap
            // object because of a race with growing the
            // heap. In this case, we know the object was
            // just allocated and hence will be marked by
            // allocation itself.
            // Find the object corresponding to the pointer and gray it
            if obj, span, objIndex := findObject(obj, b, i); obj != 0 {
                greyobject(obj, b, i, span, gcw, objIndex)
            }
        }
    }
    gcw.bytesMarked += uint64(n)
    gcw.scanWork += int64(i)
}

In summary, greying an object means marking it and putting it into the queue, while blackening means it has been marked and taken off the queue for scanning; so once a grey object is removed from the queue and scanned, it can be considered black.

That completes the analysis of gcDrain's marking work; now let's go back to gcBgMarkWorker.

gcMarkDone

gcMarkDone transitions the GC from mark 1 to mark 2, and from mark 2 to mark termination.

Mark 1 phase: includes all root marking, the global cache queue and the per-P local cache queues

Mark 2 phase: the per-P local cache queues are disabled

func gcMarkDone() {
top:
    semacquire(&work.markDoneSema)

    // Re-check transition condition under transition lock.
    if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
        semrelease(&work.markDoneSema)
        return
    }

    // Disallow starting new workers so that any remaining workers
    // in the current mark phase will drain out.
    //
    // TODO(austin): Should dedicated workers keep an eye on this
    // and exit gcDrain promptly?
    // Forbid starting new mark workers
    atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
    prevFractionalGoal := gcController.fractionalUtilizationGoal
    gcController.fractionalUtilizationGoal = 0

    // gcBlackenPromptly means the local cache queues must be flushed to the global queue immediately and local caching is disabled; if it is not yet set, we are transitioning from mark 1 to mark 2
    if !gcBlackenPromptly {
        // Transition from mark 1 to mark 2.
        //
        // The global work list is empty, but there can still be work
        // sitting in the per-P work caches.
        // Flush and disable work caches.

        // Disallow caching workbufs and indicate that we're in mark 2.
        // Disable local caching queues and enter mark2 phase
        gcBlackenPromptly = true

        // Prevent completion of mark 2 until we've flushed
        // cached workbufs.
        atomic.Xadd(&work.nwait, -1)

        // GC is set up for mark 2. Let Gs blocked on the
        // transition lock go while we flush caches.
        semrelease(&work.markDoneSema)
        // Switch to g0 to flush the local caches to the global queue
        systemstack(func() {
            // Flush all currently cached workbufs and
            // ensure all Ps see gcBlackenPromptly. This
            // also blocks until any remaining mark 1
            // workers have exited their loop so we can
            // start new mark 2 workers.
            forEachP(func(_p_ *p) {
                wbBufFlush1(_p_)
                _p_.gcw.dispose()
            })
        })

        // Check that roots are marked. We should be able to
        // do this before the forEachP, but based on issue
        // #16083 there may be a (harmless) race where we can
        // enter mark 2 while some workers are still scanning
        // stacks. The forEachP ensures these scans are done.
        //
        // TODO(austin): Figure out the race and fix this
        // properly.
        // Check that all roots are marked
        gcMarkRootCheck()

        // Now we can start up mark 2 workers.
        atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
        gcController.fractionalUtilizationGoal = prevFractionalGoal

        incnwait := atomic.Xadd(&work.nwait, +1)
        // If there are no more tasks, the second call is executed, from mark2 to mark termination
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // This loop will make progress because
            // gcBlackenPromptly is now true, so it won't
            // take this same "if" branch.
            goto top
        }
    } else {
        // Transition to mark termination.
        now := nanotime()
        work.tMarkTerm = now
        work.pauseStart = now
        getg().m.preemptoff = "gcing"
        if trace.enabled {
            traceGCSTWStart(0)
        }
        systemstack(stopTheWorldWithSema)
        // The gcphase is _GCmark, it will transition to _GCmarktermination
        // below. The important thing is that the wb remains active until
        // all marking is complete. This includes writes made by the GC.

        // Record that one root marking pass has completed.
        work.markrootDone = true

        // Disable assists and background workers. We must do
        // this before waking blocked assists.
        atomic.Store(&gcBlackenEnabled, 0)

        // Wake all blocked assists. These will run when we
        // start the world again.
        // Wake up all blocked assist Gs
        gcWakeAllAssists()

        // Likewise, release the transition lock. Blocked
        // workers and assists will run when we start the
        // world again.
        semrelease(&work.markDoneSema)

        // endCycle depends on all gcWork cache stats being
        // flushed. This is ensured by mark 2.
        // Calculate the trigger ratio for the next GC
        nextTriggerRatio := gcController.endCycle()

        // Perform mark termination. This will restart the world.
        // start the world and enter the completion phase
        gcMarkTermination(nextTriggerRatio)
    }
}
gcMarkTermination

End marking and clean up, etc.

func gcMarkTermination(nextTriggerRatio float64) {
    // World is stopped.
    // Start marktermination which includes enabling the write barrier.
    atomic.Store(&gcBlackenEnabled, 0)
    gcBlackenPromptly = false
    // Setting the phase identifier of GC
    setGCPhase(_GCmarktermination)

    work.heap1 = memstats.heap_live
    startTime := nanotime()

    mp := acquirem()
    mp.preemptoff = "gcing"
    _g_ := getg()
    _g_.m.traceback = 2
    gp := _g_.m.curg
    // Set the current g state to wait state
    casgstatus(gp, _Grunning, _Gwaiting)
    gp.waitreason = waitReasonGarbageCollection

    // Run gc on the g0 stack. We do this so that the g stack
    // we're currently running on will no longer change. Cuts
    // the root set down a bit (g0 stacks are not scanned, and
    // we don't need to scan gc's internal state).  We also
    // need to switch to g0 so we can shrink the stack.
    systemstack(func() {
        // Run the mark termination work (gcMark) on the g0 stack
        gcMark(startTime)
        // Must return immediately.
        // The outer function's stack may have moved
        // during gcMark (it shrinks stacks, including the
        // outer function's stack), so we must not refer
        // to any of its variables. Return back to the
        // non-system stack to pick up the new addresses
        // before continuing.
    })

    systemstack(func() {
        work.heap2 = work.bytesMarked
        if debug.gccheckmark > 0 {
            // Run a full stop-the-world mark using checkmark bits,
            // to check that we didn't forget to mark anything during
            // the concurrent mark process.
            // If gccheckmark is enabled, check that all reachable objects have markers
            gcResetMarkState()
            initCheckmarks()
            gcMark(startTime)
            clearCheckmarks()
        }

        // marking is complete so we can turn the write barrier off
        // Set the GC phase to _GCoff, which turns off the write barrier
        setGCPhase(_GCoff)
        // Start cleaning
        gcSweep(work.mode)

        if debug.gctrace > 1 {
            startTime = nanotime()
            // The g stacks have been scanned so
            // they have gcscanvalid==true and gcworkdone==true.
            // Reset these so that all stacks will be rescanned.
            gcResetMarkState()
            finishsweep_m()

            // Still in STW but gcphase is _GCoff, reset to _GCmarktermination
            // At this point all objects will be found during the gcMark which
            // does a complete STW mark and object scan.
            setGCPhase(_GCmarktermination)
            gcMark(startTime)
            setGCPhase(_GCoff) // marking is done, turn off wb.
            gcSweep(work.mode)
        }
    })

    _g_.m.traceback = 0
    casgstatus(gp, _Gwaiting, _Grunning)

    if trace.enabled {
        traceGCDone()
    }

    // all done
    mp.preemptoff = ""

    if gcphase != _GCoff {
        throw("gc done but gcphase != _GCoff")
    }

    // Update GC trigger and pacing for the next cycle.
    // Update the trigger ratio for the next GC
    gcSetTriggerRatio(nextTriggerRatio)

    // Update timing memstats
    // Update time
    now := nanotime()
    sec, nsec, _ := time_now()
    unixNow := sec*1e9 + int64(nsec)
    work.pauseNS += now - work.pauseStart
    work.tEnd = now
    atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
    atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
    memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
    memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
    memstats.pause_total_ns += uint64(work.pauseNS)

    // Update work.totaltime.
    sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
    // We report idle marking time below, but omit it from the
    // overall utilization here since it's "free".
    markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
    markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
    cycleCpu := sweepTermCpu + markCpu + markTermCpu
    work.totaltime += cycleCpu

    // Compute overall GC CPU utilization.
    totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
    memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)

    // Reset sweep state.
    // Reset the state of cleaning
    sweep.nbgsweep = 0
    sweep.npausesweep = 0

    // If this GC was forced by the user, increment the forced-GC counter
    if work.userForced {
        memstats.numforcedgc++
    }

    // Bump GC cycle count and wake goroutines waiting on sweep.
    // Count the number of times GC is executed and wake up the G waiting to be cleaned
    lock(&work.sweepWaiters.lock)
    memstats.numgc++
    injectglist(work.sweepWaiters.head.ptr())
    work.sweepWaiters.head = 0
    unlock(&work.sweepWaiters.lock)

    // Finish the current heap profiling cycle and start a new
    // heap profiling cycle. We do this before starting the world
    // so events don't leak into the wrong cycle.
    mProf_NextCycle()
    // start the world
    systemstack(func() { startTheWorldWithSema(true) })

    // Flush the heap profile so we can start a new cycle next GC.
    // This is relatively expensive, so we don't do it with the
    // world stopped.
    mProf_Flush()

    // Prepare workbufs for freeing by the sweeper. We do this
    // asynchronously because it can take non-trivial time.
    prepareFreeWorkbufs()

    // Free stack spans. This must be done between GC cycles.
    systemstack(freeStackSpans)

    // Print gctrace before dropping worldsema. As soon as we drop
    // worldsema another cycle could start and smash the stats
    // we're trying to print.
    if debug.gctrace > 0 {
        util := int(memstats.gc_cpu_fraction * 100)

        var sbuf [24]byte
        printlock()
        print("gc ", memstats.numgc,
            " @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
            util, "%: ")
        prev := work.tSweepTerm
        for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
            if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
            prev = ns
        }
        print(" ms clock, ")
        for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
            if i == 2 || i == 3 {
                // Separate mark time components with /.
                print("/")
            } else if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
        }
        print(" ms cpu, ",
            work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
            work.heapGoal>>20, " MB goal, ",
            work.maxprocs, " P")
        if work.userForced {
            print(" (forced)")
        }
        print("\n")
        printunlock()
    }

    semrelease(&worldsema)
    // Careful: another GC cycle may start now.

    releasem(mp)
    mp = nil

    // now that gc is done, kick off finalizer thread if needed
    // If concurrent sweep is disabled, yield so that queued finalizers get a chance to run
    if !concurrentSweep {
        // give the queued finalizers, if any, a chance to run
        Gosched()
    }
}
gcSweep

Starts the sweep work.

func gcSweep(mode gcMode) {
    if gcphase != _GCoff {
        throw("gcSweep being done but phase is not GCoff")
    }

    lock(&mheap_.lock)
    // sweepgen grows by 2 after each GC, and the roles of the two sweepSpans lists swap after each GC
    mheap_.sweepgen += 2
    mheap_.sweepdone = 0
    if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
        // We should have drained this list during the last
        // sweep phase. We certainly need to start this phase
        // with an empty swept list.
        throw("non-empty swept list")
    }
    mheap_.pagesSwept = 0
    unlock(&mheap_.lock)
    // If concurrent sweep is disabled, or a blocking GC was forced
    if !_ConcurrentSweep || mode == gcForceBlockMode {
        // Special case synchronous sweep.
        // Record that no proportional sweeping has to happen.
        lock(&mheap_.lock)
        mheap_.sweepPagesPerByte = 0
        unlock(&mheap_.lock)
        // Sweep all spans eagerly.
        // Sweep all spans
        for sweepone() != ^uintptr(0) {
            sweep.npausesweep++
        }
        // Free workbufs eagerly.
        // Release all workbufs
        prepareFreeWorkbufs()
        for freeSomeWbufs(false) {
        }
        // All "free" events for this mark/sweep cycle have
        // now happened, so we can make this profile cycle
        // available immediately.
        mProf_NextCycle()
        mProf_Flush()
        return
    }

    // Background sweep.
    lock(&sweep.lock)
    // Wake up the background sweep task (the bgsweep function); its sweeping process is similar to the synchronous sweep above
    if sweep.parked {
        sweep.parked = false
        ready(sweep.g, 0, true)
    }
    unlock(&sweep.lock)
}
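
The key trick in gcSweep is that mheap_.sweepgen grows by 2 every cycle, which makes the two sweepSpans lists silently swap roles: sweepSpans[sweepgen/2%2] is the swept in-use list, and sweepSpans[1-sweepgen/2%2] is the unswept list that sweepone pops from below. A tiny standalone sketch of just that index arithmetic (the real runtime stores span queues at these indexes):

package main

import "fmt"

// Toy model of how sweepgen selects the "swept" and "unswept" lists.
func main() {
	sweepgen := uint32(4) // some even value; it grows by 2 per GC cycle
	for cycle := 0; cycle < 3; cycle++ {
		sweepgen += 2 // what gcSweep does under mheap_.lock
		swept := sweepgen / 2 % 2
		unswept := 1 - swept
		fmt.Printf("cycle %d: sweepgen=%d swept list=%d unswept list=%d\n",
			cycle, sweepgen, swept, unswept)
	}
	// The indexes alternate 0/1 each cycle: spans pushed onto this cycle's
	// "swept" list automatically become next cycle's "unswept" list, with no copying.
}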
sweepone

Next, let's analyze the sweep process in sweepone.

func sweepone() uintptr {
    _g_ := getg()
    sweepRatio := mheap_.sweepPagesPerByte // For debugging

    // increment locks to ensure that the goroutine is not preempted
    // in the middle of sweep thus leaving the span in an inconsistent state for next GC
    _g_.m.locks++
    // Check whether sweeping has already finished
    if atomic.Load(&mheap_.sweepdone) != 0 {
        _g_.m.locks--
        return ^uintptr(0)
    }
    // Increment the count of active sweepers
    atomic.Xadd(&mheap_.sweepers, +1)

    npages := ^uintptr(0)
    sg := mheap_.sweepgen
    for {
        // Pop spans from the unswept list until we find one we can sweep
        s := mheap_.sweepSpans[1-sg/2%2].pop()
        if s == nil {
            atomic.Store(&mheap_.sweepdone, 1)
            break
        }
        if s.state != mSpanInUse {
            // This can happen if direct sweeping already
            // swept this span, but in that case the sweep
            // generation should always be up-to-date.
            if s.sweepgen != sg {
                print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
                throw("non in-use span in unswept list")
            }
            continue
        }
        // s.sweepgen == h->sweepgen - 2 means the span needs sweeping
        // s.sweepgen == h->sweepgen - 1 means the span is being swept
        // Here we check the span's state and try to atomically move it from "needs sweeping" to "being swept".
        if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            continue
        }
        npages = s.npages
        // Sweep this single span
        if !s.sweep(false) {
            // Span is still in-use, so this returned no
            // pages to the heap and the span needs to
            // move to the swept in-use list.
            npages = 0
        }
        break
    }

    // Decrement the number of active sweepers and if this is the
    // last one print trace information.
    // This worker has finished sweeping; decrement the active sweeper count
    if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
        if debug.gcpacertrace > 0 {
            print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
        }
    }
    _g_.m.locks--
    return npages
}
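
sweepone relies on a small state machine encoded in s.sweepgen relative to mheap_.sweepgen (sg): sg-2 means the span still needs sweeping, sg-1 means someone is sweeping it right now, and sg means it has already been swept this cycle. The atomic.Cas from sg-2 to sg-1 is what lets several sweepers (the background bgsweep goroutine and direct sweeping) race for the same span safely. A self-contained toy sketch of that claim/finish protocol, using made-up names and a plain uint32 slice instead of the runtime's real structures:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Toy model of the sweep claim protocol: workers race to claim "spans" by
// CAS-ing their generation from sg-2 ("needs sweep") to sg-1 ("being swept"),
// then publish sg ("swept") when done. Each span is swept exactly once.
func main() {
	const sg = 10 // stands in for mheap_.sweepgen
	spans := make([]uint32, 100)
	for i := range spans {
		spans[i] = sg - 2 // all spans need sweeping
	}

	var swept int64
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // four concurrent sweepers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range spans {
				if atomic.CompareAndSwapUint32(&spans[i], sg-2, sg-1) {
					// ... the real runtime sweeps the span's objects here ...
					atomic.StoreUint32(&spans[i], sg) // serialization point
					atomic.AddInt64(&swept, 1)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("spans swept exactly once:", swept) // always 100
}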
mspan.sweep
func (s *mspan) sweep(preserve bool) bool {
    // It's critical that we enter this function with preemption disabled,
    // GC must not start while we are in the middle of this function.
    _g_ := getg()
    if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 {
        throw("MSpan_Sweep: m is not locked")
    }
    sweepgen := mheap_.sweepgen
    // Only spans in the in-use, being-swept state (sweepgen == h->sweepgen - 1) can be processed here
    if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
        print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
        throw("MSpan_Sweep: bad span state")
    }

    if trace.enabled {
        traceGCSweepSpan(s.npages * _PageSize)
    }
    // First update the count of pages swept
    atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))

    spc := s.spanclass
    size := s.elemsize
    res := false

    c := _g_.m.mcache
    freeToHeap := false

    // The allocBits indicate which unmarked objects don't need to be
    // processed since they were free at the end of the last GC cycle
    // and were not allocated since then.
    // If the allocBits index is >= s.freeindex and the bit
    // is not marked then the object remains unallocated
    // since the last GC.
    // This situation is analogous to being on a freelist.

    // Unlink & free special records for any objects we're about to free.
    // Two complications here:
    // 1. An object can have both finalizer and profile special records.
    //    In such case we need to queue finalizer for execution,
    //    mark the object as live and preserve the profile special.
    // 2. A tiny object can have several finalizers setup for different offsets.
    //    If such object is not marked, we need to queue all finalizers at once.
    // Both 1 and 2 are possible at the same time.
    specialp := &s.specials
    special := *specialp
    // For each object with special records: decide whether it survives and whether it has at least one finalizer; free the specials of dead objects without finalizers, and queue the finalizers of dead objects that have them
    for special != nil {
        // A finalizer can be set for an inner byte of an object, find object beginning.
        objIndex := uintptr(special.offset) / size
        p := s.base() + objIndex*size
        mbits := s.markBitsForIndex(objIndex)
        if !mbits.isMarked() {
            // This object is not marked and has at least one special record.
            // Pass 1: see if it has at least one finalizer.
            hasFin := false
            endOffset := p - s.base() + size
            for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next {
                if tmp.kind == _KindSpecialFinalizer {
                    // Stop freeing of object if it has a finalizer.
                    mbits.setMarkedNonAtomic()
                    hasFin = true
                    break
                }
            }
            // Pass 2: queue all finalizers _or_ handle profile record.
            for special != nil && uintptr(special.offset) < endOffset {
                // Find the exact byte for which the special was setup
                // (as opposed to object beginning).
                p := s.base() + uintptr(special.offset)
                if special.kind == _KindSpecialFinalizer || !hasFin {
                    // Splice out special record.
                    y := special
                    special = special.next
                    *specialp = special
                    freespecial(y, unsafe.Pointer(p), size)
                } else {
                    // This is profile record, but the object has finalizers (so kept alive).
                    // Keep special record.
                    specialp = &special.next
                    special = *specialp
                }
            }
        } else {
            // object is still live: keep special record
            specialp = &special.next
            special = *specialp
        }
    }

    if debug.allocfreetrace != 0 || raceenabled || msanenabled {
        // Find all newly freed objects. This doesn't have to be
        // efficient; allocfreetrace has massive overhead.
        mbits := s.markBitsForBase()
        abits := s.allocBitsForIndex(0)
        for i := uintptr(0); i < s.nelems; i++ {
            if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) {
                x := s.base() + i*s.elemsize
                if debug.allocfreetrace != 0 {
                    tracefree(unsafe.Pointer(x), size)
                }
                if raceenabled {
                    racefree(unsafe.Pointer(x), size)
                }
                if msanenabled {
                    msanfree(unsafe.Pointer(x), size)
                }
            }
            mbits.advance()
            abits.advance()
        }
    }

    // Count the number of free objects in this span.
    // Count how many objects in this span are still allocated (marked as live)
    nalloc := uint16(s.countAlloc())
    // sizeclass 0 means a large-object span; if nothing in it is still allocated, mark it to be released back to mheap.
    if spc.sizeclass() == 0 && nalloc == 0 {
        s.needzero = 1
        freeToHeap = true
    }
    nfreed := s.allocCount - nalloc
    if nalloc > s.allocCount {
        print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n")
        throw("sweep increased allocation count")
    }

    s.allocCount = nalloc
    // Check whether the span had no free slots before this sweep (i.e. it was full)
    wasempty := s.nextFreeIndex() == s.nelems
    // Reset freeindex
    s.freeindex = 0 // reset allocation index to start of span.
    if trace.enabled {
        getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize
    }

    // gcmarkBits becomes the allocBits.
    // get a fresh cleared gcmarkBits in preparation for next GC
    // Reset allocBits to gcMarkBits
    s.allocBits = s.gcmarkBits
    // Reset gcMarkBits
    s.gcmarkBits = newMarkBits(s.nelems)

    // Initialize alloc bits cache.
    // Update allocCache
    s.refillAllocCache(0)

    // We need to set s.sweepgen = h.sweepgen only when all blocks are swept,
    // because of the potential for a concurrent free/SetFinalizer.
    // But we need to set it before we make the span available for allocation
    // (return it to heap or mcentral), because allocation code assumes that a
    // span is already swept if available for allocation.
    if freeToHeap || nfreed == 0 {
        // The span must be in our exclusive ownership until we update sweepgen,
        // check for potential races.
        if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
            print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
            throw("MSpan_Sweep: bad span state after sweep")
        }
        // Serialization point.
        // At this point the mark bits are cleared and allocation ready
        // to go so release the span.
        atomic.Store(&s.sweepgen, sweepgen)
    }

    if nfreed > 0 && spc.sizeclass() != 0 {
        c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed)
        // Release span to mcentral
        res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty)
        // MCentral_FreeSpan updates sweepgen
    } else if freeToHeap {
        // Here the span of a large object is freed, mirroring the large-object allocation path in mallocgc shown earlier
        // Free large span to heap

        // NOTE(rsc,dvyukov): The original implementation of efence
        // in CL 22060046 used SysFree instead of SysFault, so that
        // the operating system would eventually give the memory
        // back to us again, so that an efence program could run
        // longer without running out of memory. Unfortunately,
        // calling SysFree here without any kind of adjustment of the
        // heap data structures means that when the memory does
        // come back to us, we have the wrong metadata for it, either in
        // the MSpan structures or in the garbage collection bitmap.
        // Using SysFault here means that the program will run out of
        // memory fairly quickly in efence mode, but at least it won't
        // have mysterious crashes due to confused memory reuse.
        // It should be possible to switch back to SysFree if we also
        // implement and then call some kind of MHeap_DeleteSpan.
        if debug.efence > 0 {
            s.limit = 0 // prevent mlookup from finding this span
            sysFault(unsafe.Pointer(s.base()), size)
        } else {
            // Release the span to mheap
            mheap_.freeSpan(s, 1)
        }
        c.local_nlargefree++
        c.local_largefree += size
        res = true
    }
    if !res {
        // The span has been swept and is still in-use, so put
        // it on the swept in-use list.
        // If the span was not released to mcentral or mheap, it is still in use
        mheap_.sweepSpans[sweepgen/2%2].push(s)
    }
    return res
}
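
The special-record handling above is what gives runtime.SetFinalizer its observable behavior: when sweep finds an unmarked object that has a finalizer, it re-marks the object, queues the finalizer, and the memory can only be reclaimed in a later cycle. A minimal sketch of observing that from user code (the number of GC calls needed and the timing of the finalizer goroutine are not guaranteed, so this is illustrative only):

package main

import (
	"fmt"
	"runtime"
	"time"
)

type blob struct{ data [1 << 16]byte }

func main() {
	done := make(chan struct{})

	b := &blob{}
	runtime.SetFinalizer(b, func(*blob) {
		// Runs on the finalizer goroutine after sweep has queued it.
		close(done)
	})
	b = nil // drop the only reference so the object becomes unreachable

	// One cycle marks the object dead and queues its finalizer during sweep;
	// the memory itself is only reclaimed in a later cycle. A second forced
	// GC makes it more likely the sweep has actually run.
	runtime.GC()
	runtime.GC()

	select {
	case <-done:
		fmt.Println("finalizer ran")
	case <-time.After(time.Second):
		fmt.Println("finalizer not run yet (scheduling is not guaranteed)")
	}
}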

OK, at this point the whole GC process of Go has been walked through; it may be easier to follow alongside the figure at the beginning of the article.

Reference Documents

  1. Exploration of Golang Source Code (3): The Implementation Principle of GC
  2. Go Language Learning Notes
  3. A picture for tricolor marking
