Fix the problem that WebClient calls cannot be tracked through Skywalking in the Spring Cloud Gateway project

Solve the problem that WebClient calls cannot be tracked in the Spring Cloud Gateway project

Problem description

Skywalking brings non intrusive distributed link collection to java applications through java agent.

In the microservice architecture, as a business Gateway, Spring Cloud Gateway generally needs to customize the Filter and call other service interfaces to verify user identity or judge permissions. The Gateway process is configured with Skywalking Agent(8.8.0), but using WebClient to call remote services in Filter may generate multiple call links and cannot be tracked correctly.

Principle of Skywalking Jave Agent collecting link information

Before troubleshooting, learn how Skywalking Jave Agent collects link information.

Plugin Development Guide

Maintenance of trace status for synchronous call in a single process

Skywalking Jave Agent via org apache. skywalking. apm. agent. core. context. Contextmanager to manage Trace context.

Create a Span through ContextManager#createEntrySpan, ContextManager#createLocalSpan, ContextManager#createExistSpan and other methods.

  • EntrySpan indicates a remote adjusted span
  • LocalSpan represents the local span within a process
  • ExistSpan refers to the span of one remote main dispatching

When creating a Span, if the link context TraceContext has not been created, the Trace will be created first and the TracerContext will be saved to the ThreadLocal ContextManager managed by the ContextManager In context. The newly created Span will use the context information of TracerContext.

Because TracerContext exists in ThreadLocal, multiple spans created in the same thread will be concatenated using the same TracerContext.

State maintenance of asynchronous call trace in a single process

When using Spring WebFlux or vert X and other asynchronous frameworks, the logic of a call transaction may be scheduled in different threads.

Because the ContextManager uses ThreadLocal to maintain TracerContext, each time a Span is created in a single call transaction chain, it may correspond to a different TracerContext Finally, multiple links appear in the Skyawalking console.

For example, when Spring Mvc receives an Http request, it creates an EntrySpan and needs to call a remote service in the next business logic, so it needs to create an ExitSpan. However, when creating an ExitSpan, it has been switched to other threads due to multiple asynchronous calls, and the ContextManager cannot obtain the original TracerContext, so it creates a new one, At this time, the EntrySpan and ExitSpan do not belong to the same Trace.

For asynchronous calls, Skywalking Agent provides a ContextSnapshot for sharing TracerContext between threads

When implementing the plug-in of asynchronous framework, you need to use contextmanager after creating the first Span Capture () obtains the ContextSnapshot and places it in the context of the asynchronous framework itself for transmission.

Then, when creating subsequent spans, you need to obtain contextsnapshots from the context of the framework, and then use contextmanger The continued method restores the ContextSnapshot to the current Span.

Cross process call trace state transfer

Skywalking passes the link context according to different network protocols or frameworks (such as Http Header or Kafka Message Header). The implementation steps are as follows:

  1. The calling end creates an ExitSpan through contextmanger Inject (contextcarrier carrier) injects the context information into the carrier. Through the carrier, you can obtain the Header information to be transmitted, and then inject the Header information into the corresponding calling framework (such as HttpRequest).
  2. The called party obtains the Header information from the framework and encapsulates it into a ContextCarrier, and then uses the carrier to call the ContextManager#createEntrySpan to create an EntrySpan, which can string the main call with the called Trace context.

Implementation logic and problem recurrence of Spring WebFlux Webclient plug-in.

Since the WebClient call will cause multiple traces to be generated, check spring-weblux-5 The code of x-WebClient plugin (version 8.8.0).

Via org apache. skywalking. apm. plugin. spring. webflux. v5. webclient. define. Webflux web client instrumentation can see the plug-in through org apache. skywalking. apm. plugin. spring. webflux. v5. webclient. Webfluxwebclientinterceptor intercepted org springframework. web. reactive. function. client. Exchangefunctions $defaultexchangefunction #exchange method.

Then continue to check the code of WebFluxWebClientInterceptor:

@Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, MethodInvocationContext context) throws Throwable {
        //..

        
        ClientRequest request = (ClientRequest) allArguments[0];
        final ContextCarrier contextCarrier = new ContextCarrier();

        URI uri = request.url();
        final String requestURIString = getRequestURIString(uri);
        final String operationName = requestURIString;
        final String remotePeer = getIPAndPort(uri);
        
        // Create ExitSpan directly without using contextmanager Continued to recover the context
        AbstractSpan span = ContextManager.createExitSpan(operationName, contextCarrier, remotePeer);

        //...

        //user async interface
        span.prepareForAsync();
        ContextManager.stopSpan();
        context.setContext(span);
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, Object ret, MethodInvocationContext context) throws Throwable {
        // fix the problem that allArgument[0] may be null
        if (allArguments[0] == null) {
            return ret;
        }
        Mono<ClientResponse> ret1 = (Mono<ClientResponse>) ret;
        AbstractSpan span = (AbstractSpan) context.getContext();
        return ret1.doOnSuccess(clientResponse -> {
            //...
        }).doOnError(error -> {
            span.log(error);
        }).doFinally(s -> {
            span.asyncFinish();
        });
    }

You can see that ExitSpan is directly created in WebFluxWebClientInterceptor#beforeMethod without using contextmanger Continue to recover the context. Because Spring WebFlux is based on the Reactor asynchronous framework, the creation of the current Span and the pre created EntrySpan are not in the same thread, and the two spans belong to two different tracks

Bug recurrence, create a Spring WebFlux project, and write the test Controller as follows

@SpringBootApplication
@RestController
public class SpringWebfluxProjectApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringWebfluxProjectApplication.class, args);
    }

    @GetMapping("test")
    public Mono<String> hello() {
        return WebClient.create("http://localhost:8080/foo")
                .get()
                .retrieve()
                .bodyToMono(String.class)
                .flatMap(s -> {
                    return WebClient.create("http://localhost:8080/bar")
                            .get()
                            .retrieve()
                            .bodyToMono(String.class);
                });

    }

    @GetMapping("foo")
    public Mono<String> baidu(ServerWebExchange exchange) {
        return Mono.just("foo");

    }

    @GetMapping("bar")
    public Mono<String> qq(ServerWebExchange exchange) throws InterruptedException {
        return Mono.just("bar").delayElement(Duration.ofMillis(100));

    }

}

Configure the relevant JVM parameters of skywalking agent, run the project and request http://localhost:8080/test , check the skywalking panel. Multiple spans are indeed generated, but the TraceId of each Span is different.

Solution

Based on the analysis in the previous section, the fundamental problem is that the context is not restored when creating ExitSpan, so you need to find a method to obtain the upstream ContextSnapshot and restore it.

Spring Webflux is based on the Reactor framework and can be accessed through Reactor Context To pass the ContextSnapshot

The default plug-in for Skywalking includes mvc-annotation-5 X-plugin, check the corresponding code, and find that the plug-in intercepts Spring Mvc related annotation methods, creates EntrySpan before the annotation methods, and uses the synchronous method. Moreover, the return value of the interception method is not necessarily Mono or Flux. It is difficult to put ContextSnapshot into Reactor Context in this place. Spring-weblux-5 is also available in the optional plugin X-plugin plug-in, which intercepts org springframework. web. reactive. Dispatcherhandler #handle to create an EntrySpan. Dispatcherhandler #handle returns Mono, where you can insert a ContextSnapshot

The specific implementation is as follows:

//org.apache.skywalking.apm.plugin.spring.webflux.v5.DispatcherHandlerHandleMethodInterceptor#afterMethod
 @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes,
                              Object ret) throws Throwable {

        ServerWebExchange exchange = (ServerWebExchange) allArguments[0];

        AbstractSpan span = (AbstractSpan) exchange.getAttributes().get("SKYWALKING_SPAN");
        Mono<Void> monoReturn = (Mono<Void>) ret;

        
        // add skywalking context snapshot to reactor context.
        EnhancedInstance instance = getInstance(allArguments[0]);
        if (instance != null && instance.getSkyWalkingDynamicField() != null) {
            monoReturn = monoReturn.subscriberContext(
                    c -> c.put("SKYWALKING_CONTEXT_SNAPSHOT", instance.getSkyWalkingDynamicField()));
        }

        return monoReturn.doFinally(s -> {

            if (span != null) {
                maybeSetPattern(span, exchange);
                try {

                    HttpStatus httpStatus = exchange.getResponse().getStatusCode();
                    // fix webflux-2.0.0-2.1.0 version have bug. httpStatus is null. not support
                    if (httpStatus != null) {
                        Tags.HTTP_RESPONSE_STATUS_CODE.set(span, httpStatus.value());
                        if (httpStatus.isError()) {
                            span.errorOccurred();
                        }
                    }
                } finally {
                    span.asyncFinish();
                }
            }
        });
    }

Personalized WebFluxWebClientInterceptor obtains ContextSnapshot from Reactor Context:

public class WebFluxWebClientInterceptor implements InstanceMethodsAroundInterceptorV2 {

    @Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, MethodInvocationContext context) throws Throwable {
        // The Reactor context cannot be obtained in the before method, and the original logic is deleted directly
        // In ExchangeFunctions$DefaultExchangeFunction, only the Reactor chain is built and not really executed, so the original logic can be delayed until the context is obtained in the subscriberContext.
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, Object ret, MethodInvocationContext context) throws Throwable {
        // fix the problem that allArgument[0] may be null
        if (allArguments[0] == null) {
            return ret;
        }
        Mono<ClientResponse> ret1 = (Mono<ClientResponse>) ret;
        // Get snapshot from context 
        return Mono.subscriberContext().flatMap(ctx -> {

            ClientRequest request = (ClientRequest) allArguments[0];
            URI uri = request.url();
            final String operationName = getRequestURIString(uri);
            final String remotePeer = getIPAndPort(uri);
            AbstractSpan span = ContextManager.createExitSpan(operationName, remotePeer);

            // get ContextSnapshot from reactor context,  the snapshot is set to reactor context by any other plugin
            // such as DispatcherHandlerHandleMethodInterceptor in spring-webflux-5.x-plugin
            final Optional<Object> optional = ctx.getOrEmpty("SKYWALKING_CONTEXT_SNAPSHOT");
            optional.ifPresent(snapshot -> ContextManager.continued((ContextSnapshot) snapshot));

            //set components name
            span.setComponent(ComponentsDefine.SPRING_WEBCLIENT);
            Tags.URL.set(span, uri.toString());
            Tags.HTTP.METHOD.set(span, request.method().toString());
            SpanLayer.asHttp(span);

            final ContextCarrier contextCarrier = new ContextCarrier();
            ContextManager.inject(contextCarrier);
            if (request instanceof EnhancedInstance) {
                ((EnhancedInstance) request).setSkyWalkingDynamicField(contextCarrier);
            }

            //user async interface
            span.prepareForAsync();
            ContextManager.stopSpan();
            return ret1.doOnSuccess(clientResponse -> {
                HttpStatus httpStatus = clientResponse.statusCode();
                if (httpStatus != null) {
                    Tags.HTTP_RESPONSE_STATUS_CODE.set(span, httpStatus.value());
                    if (httpStatus.isError()) {
                        span.errorOccurred();
                    }
                }
            }).doOnError(span::log).doFinally(s -> {
                span.asyncFinish();
            });
        });
    }
    
}

Recompile the plug-in and add spring-weblux-5 X-plugin and spring-weblux-5 Copy the two plug-ins of x-webclient-plugin to the directory of Skywalking Agent plugin, re run the test code, find and solve the problem, and all calls are connected.

Repair code merged into skywalking-java Trunk( #114 ), expected to be released in version 8.10.0.

Note 1: because spring-weblux-5 The x-plugin is in the optional plugins directory and needs to be manually copied to the plugins directory.

The Spring Cloud Gateway project needs to manually copy gateway-3 x-plugin.

Note 2: Srping MVC plug-in apm-springmvc-annotation-5 The x-plugin takes effect by default when compared with spring-weblux-5 When x-plugins exist at the same time, two entryspans will be generated in one call, and although the EntrySpan generated by mvc plug-in can be linked with the ExitSpan generated by Webclient with the same TraceId, it still has no Parent/Child relationship. If you mind, put spring-weblux-5 in Spring Webflux project Move x-plugin out of ${agetn/path}/plugin directory.

reference resources

  1. Plugin Development Guide
  2. Reactor Context

Keywords: Spring WebFlux Spring Cloud reactor skywalking

Added by Maknis on Sat, 05 Mar 2022 11:40:11 +0200