How to handle registry downtime correctly
Microservices are a major trend today. The registry is both the most basic and the most central component of a microservice architecture: it provides clients with the list of all callable services. For this reason, the architecture demands high availability from the registry. In practice, although we can keep improving registry availability by various means, no registry can guarantee an SLA (Service Level Agreement) of 100% availability. So what should we do to limit the impact when the registry goes down?
1. Responsibilities of the registry
Before analyzing the impact of registry downtime, let's look at what the registry is responsible for. The following figure shows its main responsibilities and the three roles involved.
- Registry: accepts registrations from servers and lets clients discover the registered services.
- Server: provides backend services to callers and registers its own service information with the registry.
- Client: obtains the registration information of a remote service from the registry and then makes a remote procedure call.
When the registry goes down, service registration, service discovery, and the pushing of service node changes are all affected. Without any cache, calls between services are affected as well.
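The three roles above can be illustrated with a minimal in-memory sketch. It assumes Java 11 or later; `InMemoryRegistry` and `RegistryRolesDemo` are hypothetical names invented for this illustration, and a real registry such as Consul runs as a separate process with its own API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-memory registry, used only to illustrate the three roles.
class InMemoryRegistry {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();
    private volatile boolean up = true;

    // Simulates the registry going down (false) or coming back (true).
    void setUp(boolean up) {
        this.up = up;
    }

    // Server role: publish an address such as "host:port" under a service name.
    void register(String serviceName, String address) {
        ensureUp();
        instances.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    // Client role: look up all known addresses of a service.
    List<String> discover(String serviceName) {
        ensureUp();
        return new ArrayList<>(instances.getOrDefault(serviceName, List.of()));
    }

    private void ensureUp() {
        if (!up) {
            throw new IllegalStateException("registry unavailable");
        }
    }
}

class RegistryRolesDemo {
    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        registry.register("order-service", "10.0.0.1:8080");    // server registers itself
        System.out.println(registry.discover("order-service")); // client discovers it

        registry.setUp(false); // registry goes down
        try {
            registry.discover("order-service");
        } catch (IllegalStateException e) {
            // Without a cache, the client has nothing to call.
            System.out.println("discovery failed: " + e.getMessage());
        }
    }
}
```

Once `setUp(false)` is called, both registration and discovery fail, which is the situation the rest of this article deals with.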
2. Service registration and discovery in Spring Cloud
To better explain the problems caused by an unavailable registry, let's look at how service registration and discovery work in Spring Cloud, using the code of Spring Cloud Consul as an example.
Service registration
    public void register(ConsulRegistration reg) {
        log.info("Registering service with consul: " + reg.getService());
        try {
            client.agentServiceRegister(reg.getService(), properties.getAclToken());
            if (heartbeatProperties.isEnabled() && ttlScheduler != null) {
                ttlScheduler.add(reg.getInstanceId());
            }
        }
        catch (ConsulException e) {
            if (this.properties.isFailFast()) {
                log.error("Error registering service with consul: " + reg.getService(), e);
                ReflectionUtils.rethrowRuntimeException(e);
            }
            log.warn("Failfast is false. Error registering service with consul: " + reg.getService(), e);
        }
    }
The registration source code does three things:
- Initiates service registration through the Consul client SDK.
- If heartbeats are enabled, adds the instance to a scheduler that sends a heartbeat at regular intervals.
- When registration throws an exception, rethrows it if FailFast is enabled; otherwise it only logs a warning.
Service discovery
    private List<ConsulServer> getServers() {
        if (this.client == null) {
            return Collections.emptyList();
        }
        String tag = getTag(); // null is ok
        Response<List<HealthService>> response = this.client.getHealthServices(
                this.serviceId, tag, this.properties.isQueryPassing(),
                createQueryParamsForClientRequest(), this.properties.getAclToken());
        if (response.getValue() == null || response.getValue().isEmpty()) {
            return Collections.emptyList();
        }
        return transformResponse(response.getValue());
    }

    /**
     * Transforms the response from Consul in to a list of usable {@link ConsulServer}s.
     *
     * @param healthServices the initial list of servers from Consul. Guaranteed to be non-empty list
     * @return ConsulServer instances
     * @see ConsulServer#ConsulServer(HealthService)
     */
    protected List<ConsulServer> transformResponse(List<HealthService> healthServices) {
        List<ConsulServer> servers = new ArrayList<>();
        for (HealthService service : healthServices) {
            ConsulServer server = new ConsulServer(service);
            if (server.getMetadata().containsKey(this.properties.getDefaultZoneMetadataName())) {
                server.setZone(server.getMetadata().get(this.properties.getDefaultZoneMetadataName()));
            }
            servers.add(server);
        }
        return servers;
    }
Service discovery is simpler: the Consul client is used to fetch the nodes registered under a service name, optionally restricted to nodes whose health checks are passing (isQueryPassing). Now that we know how registration and discovery work, let's look at how they behave when the registry is down.
3. Service registration and discovery under registry downtime
Here we describe the exceptions that occur while the registry is down in three scenarios:
- A newly started service
- A running service
- A running service that is restarted
(1) Exceptions in a newly started service
- Service registration
While the registry is down, users may still need to scale out or start a new service. At startup, the registration call made by the Consul client throws an exception. If FailFast is on (the default), the exception propagates and the service fails to start. If FailFast is off, the service only logs a warning and startup is not affected.
So while the registry is down, services that are being scaled out or newly started can be launched with FailFast turned off. They then register automatically through the heartbeat, which reduces the impact on the service.
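The combination of turning off FailFast and registering later through the heartbeat can be sketched as follows. This is a simplified stand-alone illustration, not Spring Cloud Consul code: `RegistrationClient` and `LenientRegistrar` are hypothetical names, and the heartbeat tick is called by hand instead of by a scheduler.

```java
// Hypothetical stand-in for a registry client; this is not the Consul SDK.
interface RegistrationClient {
    void register(String instanceId) throws Exception;
}

class LenientRegistrar {
    private final RegistrationClient client;
    private final String instanceId;
    private final boolean failFast;
    private boolean registered;

    LenientRegistrar(RegistrationClient client, String instanceId, boolean failFast) {
        this.client = client;
        this.instanceId = instanceId;
        this.failFast = failFast;
    }

    // Called once at startup. With failFast=true a registry outage aborts startup;
    // with failFast=false the failure is logged and startup continues.
    void registerOnStartup() {
        try {
            client.register(instanceId);
            registered = true;
        } catch (Exception e) {
            if (failFast) {
                throw new IllegalStateException("registration failed, aborting startup", e);
            }
            System.err.println("WARN registration failed, will retry on heartbeat: " + e.getMessage());
        }
    }

    // Called on every heartbeat tick. Retries registration until it succeeds.
    void onHeartbeat() {
        if (registered) {
            return; // a real client would send its heartbeat to the registry here
        }
        try {
            client.register(instanceId);
            registered = true;
        } catch (Exception e) {
            System.err.println("WARN registry still unavailable: " + e.getMessage());
        }
    }

    boolean isRegistered() {
        return registered;
    }
}
```

With `failFast` set to `false`, `registerOnStartup()` returns normally during an outage and a later `onHeartbeat()` call completes the registration. In Spring Cloud Consul the corresponding switch is, to the best of my knowledge, the `spring.cloud.consul.discovery.fail-fast` property; verify it against the version you use.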
- Service discovery
Service discovery is unavailable: the registry cannot be reached, and a newly started service has no other data source from which to read the nodes of the services it calls.
(2) Exceptions in a running service
- Service registration
A running service has already registered successfully, but all of its heartbeat requests fail during the downtime.
- Service discovery
During registry downtime, service discovery requests fail.
Here we can degrade gracefully by adding two levels of cache: a memory cache and a file cache. When the registry is found to be unavailable, the data in the memory cache is used; if the memory cache has no data, the file cache is read. With this fallback, calls between the service nodes that are currently running are not affected. The limitation is that callers cannot perceive nodes that start, go down, or go offline during the outage.
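The memory cache plus file cache fallback can be sketched as below. This is a minimal illustration under stated assumptions, not the Spring Cloud Consul implementation: `NodeSource` and `CachedDiscovery` are hypothetical names, nodes are plain `host:port` strings, the cache file holds one node per line, and Java 11 or later is assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry lookup; it throws when the registry is unreachable.
interface NodeSource {
    List<String> fetch(String serviceId) throws Exception;
}

class CachedDiscovery {
    private final NodeSource registry;
    private final Path cacheDir; // must already exist
    private final Map<String, List<String>> memory = new ConcurrentHashMap<>();

    CachedDiscovery(NodeSource registry, Path cacheDir) {
        this.registry = registry;
        this.cacheDir = cacheDir;
    }

    List<String> getNodes(String serviceId) {
        List<String> nodes;
        try {
            nodes = registry.fetch(serviceId);
        } catch (Exception e) {
            return fromCache(serviceId); // registry unavailable: degrade
        }
        // Registry answered: refresh both cache levels.
        memory.put(serviceId, List.copyOf(nodes));
        persist(serviceId, nodes);
        return nodes;
    }

    // Memory cache first; the file cache is the last resort (for example after a restart).
    private List<String> fromCache(String serviceId) {
        List<String> cached = memory.get(serviceId);
        if (cached != null && !cached.isEmpty()) {
            return cached;
        }
        Path file = fileFor(serviceId);
        try {
            if (Files.exists(file)) {
                List<String> fromFile = Files.readAllLines(file);
                memory.put(serviceId, List.copyOf(fromFile));
                return fromFile;
            }
        } catch (IOException e) {
            System.err.println("WARN cannot read cache file " + file + ": " + e.getMessage());
        }
        return List.of();
    }

    private void persist(String serviceId, List<String> nodes) {
        try {
            Files.write(fileFor(serviceId), nodes); // one node per line, UTF-8
        } catch (IOException e) {
            System.err.println("WARN cannot write cache file: " + e.getMessage());
        }
    }

    private Path fileFor(String serviceId) {
        return cacheDir.resolve(serviceId + ".nodes");
    }
}
```

A failed write to the file cache is only logged, so a disk problem never turns a successful registry lookup into a failed call. The sketch makes no attempt to detect stale entries, which matches the limitation described above.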
(3) Running services that are restarted
This scenario covers healthy services that are restarted while the registry is down.
- Service registration
During downtime, all heartbeat requests and service registration requests fail.
When restarting the service, turn off FailFast so that it can start. The node can then handle upstream traffic normally.
- Service discovery
During registry downtime, service discovery requests fail.
Degrade through the memory cache plus file cache. During service discovery, if the registry is unavailable, try the data in the memory cache first. If the memory cache is empty, as it is right after a restart, recover the data from the local file.
4. Exceptions during registry recovery
These degradation methods effectively reduce the business impact of registry exceptions. But is that enough? Consider the sequence of events when the registry recovers:
- The registry fails.
- The service caller's requests to the registry fail, so it degrades to the data in its cache.
- The registry recovers. At this point many nodes may not yet have sent their scheduled heartbeat, so all service nodes are marked unhealthy (this applies when the Consul health check is configured as a TTL check).
- The service caller pulls fresh data from the registry, but every node in it is unhealthy.
- Service calls fail.
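Why nodes look unhealthy right after recovery can be seen in a simplified model of a TTL check. The sketch below is an assumption-laden illustration, not Consul's implementation: `TtlHealthModel` is a hypothetical name, and time is passed in explicitly as milliseconds so the example stays deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of a TTL health check.
class TtlHealthModel {
    private final long ttlMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    TtlHealthModel(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // A node reports that it is alive at time nowMillis.
    void heartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    // A node is passing only if its last heartbeat is younger than the TTL.
    boolean isPassing(String nodeId, long nowMillis) {
        Long last = lastHeartbeat.get(nodeId);
        return last != null && nowMillis - last < ttlMillis;
    }
}

class TtlRecoveryDemo {
    public static void main(String[] args) {
        TtlHealthModel health = new TtlHealthModel(30_000);
        health.heartbeat("node-a", 0);
        System.out.println(health.isPassing("node-a", 10_000)); // true: within the TTL

        // The registry is down from t=10s to t=70s, so heartbeats in that window are lost.
        System.out.println(health.isPassing("node-a", 70_000)); // false: TTL expired during the outage

        health.heartbeat("node-a", 75_000); // first heartbeat after recovery
        System.out.println(health.isPassing("node-a", 76_000)); // true again
    }
}
```

Between the recovery at 70 seconds and the next heartbeat at 75 seconds, a caller that asks only for passing nodes receives none, even though `node-a` never stopped serving.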
How do we handle this scenario? Set a cooling-off period: after the Consul client's first successful pull of service nodes, wait two heartbeat cycles before switching from the degraded cache data to the real node data from the registry. During the wait, keep using the cache data. Once the called services have completed their heartbeats and refreshed the health status of their nodes, switch to the real data.
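A cooling-off period of two heartbeat cycles can be sketched as follows. This is one possible shape under stated assumptions, not an existing Spring Cloud Consul feature: `CoolingOffSelector` and its method names are hypothetical, the clock is injected so the behaviour can be tested, the class is not thread-safe, and Java 11 or later is assumed.

```java
import java.util.List;
import java.util.function.LongSupplier;

// Decides whether the caller should use cached nodes or fresh registry data.
class CoolingOffSelector {
    private final long coolingOffMillis;
    private final LongSupplier clock;       // returns the current time in milliseconds
    private boolean degraded;               // true while the registry is unreachable
    private long recoveredAt = -1;          // time of the first successful pull after an outage
    private List<String> cached = List.of();

    CoolingOffSelector(long heartbeatIntervalMillis, LongSupplier clock) {
        this.coolingOffMillis = 2 * heartbeatIntervalMillis; // two heartbeat cycles
        this.clock = clock;
    }

    // Called when a pull from the registry fails: keep serving cached nodes.
    List<String> onPullFailure() {
        degraded = true;
        recoveredAt = -1;
        return cached;
    }

    // Called when a pull succeeds and returns the nodes the registry considers healthy.
    List<String> onPullSuccess(List<String> healthyNodes) {
        long now = clock.getAsLong();
        if (degraded) {
            // The first success after an outage starts the cooling-off period.
            degraded = false;
            recoveredAt = now;
        }
        if (recoveredAt >= 0 && now - recoveredAt < coolingOffMillis) {
            return cached; // health data in the registry may still be stale
        }
        recoveredAt = -1;
        cached = List.copyOf(healthyNodes);
        return cached;
    }
}
```

For example, with a heartbeat interval of 10 seconds, pulls that succeed within 20 seconds of the first successful pull after an outage still return the cached nodes; the first pull at or after the 20-second mark switches to the registry's data. The trade-off is that real node changes are also hidden from the caller during those two cycles.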