Construction and optimization of a unified push platform

Preface

Requirements background

Snowball's user base and product lines have grown rapidly in recent years. To better support the company's business growth and users' individual needs, the platform set out to achieve the following goals:

  • Deliver information to users in a timely manner
  • Increase coverage across users' device models
  • Improve user satisfaction and product experience

The Snowball unified push platform was built to meet these goals. Push is a key channel in app operations, and used sensibly it goes a long way toward achieving them. The platform currently serves users in business scenarios such as posts from followed users, comment replies, stock price alerts, individual stock announcements and portfolio rebalancing, and helps operators deliver selected content such as 7×24 news and market flashes to target users as quickly as possible.

Product design

Snowball's early in-house push relied on a self-built long connection plus third-party services. Because of accumulated business redundancy and poor code maintainability, it had the following problems:

Problem | Reason
Lack of ACK mechanism | Push is asynchronous, so there is no way to know whether a message was delivered. The third-party service may be affected by its other push customers, causing delay and loss
Lack of message persistence | Messages are pushed one at a time, and their history and status cannot be traced
Lack of idempotent retransmission mechanism | The push path has many links; a failure in any one of them loses the message, which can then never be received
Complex client access logic | Every new app requires the same integration work again, and the code cannot be reused
Strong coupling between client and push service SDK | The interfaces offered by push providers are not uniform, so replacing the client means rewriting the integration
Lack of data monitoring and statistics | No record of how many messages were pushed, how many succeeded and how many failed

To address these problems, we surveyed and compared industry implementations and, taking Snowball's situation into account, designed and built an in-house push platform on top of the mainstream vendors' channels, which resolves the practical difficulties of the push scenario at the root.

Channel capacity building

Android channel

There are many Android phone manufacturers, so a push service inevitably has to deal with fragmentation. Snowball push currently integrates the native channels of Huawei, Xiaomi, OPPO, VIVO and Meizu; all other devices are reached through the third-party Umeng (Youmeng) channel.
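The routing rule described above can be sketched as follows. This is only an illustration: the class name, the brand strings and the "UMENG" label are assumptions, not the platform's actual identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical router: picks a native vendor channel by device brand, else a third-party fallback. */
final class ChannelRouter {
    // Brands that have a native vendor channel, per the list in the text.
    private static final Set<String> NATIVE_BRANDS = new HashSet<>(
            Arrays.asList("HUAWEI", "XIAOMI", "OPPO", "VIVO", "MEIZU"));

    // Label for the third-party fallback channel (assumed name).
    static final String FALLBACK = "UMENG";

    static String route(String deviceBrand) {
        if (deviceBrand == null) {
            return FALLBACK;
        }
        // Normalize case and whitespace so "vivo" and " Huawei " still match.
        String brand = deviceBrand.trim().toUpperCase(Locale.ROOT);
        return NATIVE_BRANDS.contains(brand) ? brand : FALLBACK;
    }
}
```

Calling ChannelRouter.route("Samsung") falls through to the fallback label, while ChannelRouter.route("vivo") selects the native channel.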

The major phone vendors each have their own platform rules for push content review, quotas and flow control. To address these shared problems, the platform has been following the progress of the Unified Push Alliance, promoted by the China Academy of Information and Communications Technology, since the platform was founded. Given Snowball's current situation, however, now is not yet the right time to adopt it.

The optimizations below are those applied by the Snowball push platform. Vendors that are not mentioned have either not reached these limits or not shown similar problems at the current stage.

Limit on the total amount of push in a single day

The vendors' limits on total daily pushes are shown in the following table:

Channel | Status code | Official description
Xiaomi | 200001 | When the push volume exceeds the daily limit, the request fails and error code 200001 is returned
OPPO | 33 | If the number of messages exceeds the daily limit, the interface returns: the number of messages exceeded the daily limit
VIVO | 10070 | The number of users targeted by single-push and group-push messages must not exceed the daily limit on total pushes

Solution:

  1. Optimize the push logic for business content, define a distribution strategy for each vendor channel, and guarantee delivery of key content
  2. Apply to the vendor for a higher push limit, according to the app's category and the vendor's rules

Each message identifies the business type it belongs to, and that type determines its priority so that key content is delivered first:

message Message {
    int64 messageId = 1; // Push message batch number
    string title = 2; // Push message title
    string payload = 3; // Push content body
    string description = 4; // Description above notification bar (summary)
    string callback = 5; // Callback address, e.g. http://example.com
    string summary_callback = 6; // Notification picture address, e.g. http://img.com
    Type type = 7; // Business type: distribution priority is divided according to business type
    Application app = 8; // For the pushed client, multi terminal apps are distributed by the same platform
    repeated int64 target = 9; // Push target user (detailed push user, array format)
    int64 created = 10; // Message creation time
    int32 ttl = 11; // Expiration time of message (unit: ms)
    map<string, string> ext = 12; // Other custom fields
    map<string, string> version_filter = 13; // Version filtering, funnel mode
    Application targetType = 14; // id type of push target user
}
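One way the business type could protect key content under a vendor's daily cap is to hold part of the quota in reserve. The sketch below is an assumption about how such a gate might look, not the platform's implementation; the class name, the reserve idea and the boolean "key content" flag are all hypothetical stand-ins for a decision derived from the message's type field.

```java
/** Hypothetical daily quota gate that holds back part of a vendor's cap for key content. */
final class DailyQuotaGate {
    private final long dailyLimit;  // vendor's daily cap (assumed to be known in advance)
    private final long keyReserve;  // share of the cap that only key content may use
    private long used;              // targets already sent today

    DailyQuotaGate(long dailyLimit, long keyReserve) {
        if (dailyLimit < 0 || keyReserve < 0 || keyReserve > dailyLimit) {
            throw new IllegalArgumentException("require 0 <= keyReserve <= dailyLimit");
        }
        this.dailyLimit = dailyLimit;
        this.keyReserve = keyReserve;
    }

    /** Consumes quota and returns true if the batch may still be sent today. */
    synchronized boolean tryAcquire(boolean keyContent, long targets) {
        // Ordinary content stops at (limit - reserve); key content may use the full limit.
        long ceiling = keyContent ? dailyLimit : dailyLimit - keyReserve;
        if (targets <= 0 || targets > ceiling - used) {
            return false;
        }
        used += targets;
        return true;
    }

    /** Call once per day when the vendor's counter resets. */
    synchronized void resetForNewDay() {
        used = 0;
    }

    synchronized long used() {
        return used;
    }
}
```

With a limit of 100 and a reserve of 30, ordinary content is refused once 70 targets have been sent, while key content can still consume the remaining 30.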

Real time push rate limit

Snowball is a wealth management app in which trading, market quotes and content news have always been what users care about most. Push therefore has strict real-time requirements, and the push service has to cope with high data volume and QPS. The vendors' rate limits are as follows:

Channel | Status code | Official description
Xiaomi | 200002 | Xiaomi allocates push rate (QPS) in tiers according to the app's daily number of connected MIUI devices. When QPS exceeds the limit, error code 200002 is returned
Huawei | HTTP 503 | Push count limit: no more than 3,000 messages per day to one app on one device. Beyond 3,000 messages the app is throttled (it recovers 24 hours after throttling begins)
VIVO | 10072 | Push QPS is adjusted automatically according to the number of SDK subscriptions. The default is 3,000 messages/s

Solution:

  1. As with the quota solution, optimize the distribution logic and guarantee delivery of key content
  2. Make full use of the batch distribution interfaces provided by the vendors
  • Xiaomi and Huawei limit the QPS of interface calls. Before data reaches the vendor channel it is therefore aggregated by the users' delivery channel, and batch sending is used wherever possible. (For example, according to Xiaomi's official description one request can carry up to 1,000 target devices, so at 3,000 QPS up to 3 million devices can be pushed per second.)
  • OPPO and VIVO apply different rate limits to the batch interface and the single-message interface. During distribution the interface is chosen according to the message content, and when the batch interface hits its limit, sending switches to the single-message interface.
  3. Set a validity period on each message. When a vendor's QPS limit is triggered, the channel layer puts the message back into the push queue
// When the Xiaomi push channel triggers rate limiting, decide whether to retry based on the status code

...

String responseBody = URLDecoder.decode(response.body().string(), "UTF-8");
JsonNode obj = MAPPER.readTree(responseBody);

...

if ("200002".equals(obj.get("code").asText())) {
    // 200002: rate limited, retry later
    limitCounter.increment();
    LOGGER.warn("Xiaomi API call hit rate limiting, retrying users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", uidList, responseBody, message.getApp().name(), message.getMessageId());
    pushStatusProducer.sendMessageRetry(message.toBuilder().clearTarget().addAllTarget(uidList).build());
    return;

}

Points to note when requeueing messages for retry:

  • Messages need a deadline. Otherwise a message may eventually be delivered after its content has gone stale, which hurts the user experience
  • Retransmission has to be idempotent. In edge cases such as weak networks, retransmission can cause duplicate pushes and hurt the user experience. The idempotency parameters offered by the vendors are as follows:
Channel | Idempotency parameter | Description
Xiaomi | notify_id | To show multiple push messages in the notification bar, set a different notify_id for each message (a notification with the same notify_id overwrites the previous one). notify_id must be an integer from 0 to 2147483647
Huawei | notify_id | Push NC automatically generates a unique identifier for each message; different notification bar messages can share the same notifyId so that a new message overwrites the previous one
OPPO | app_message_id | For API pushes, app_message_id is user-defined. API pushes with the same app_message_id are delivered only once

Beyond the parameters in the table, these vendors offer other similar mechanisms. For example, Xiaomi's 'extra.jobkey' field or Huawei's 'group' field can collapse messages into a group and improve the user experience.
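The deadline check that should precede any requeue can be sketched as below. It assumes the message's created field is an epoch timestamp in milliseconds (the source only states that ttl is in milliseconds), and the attempt cap is an extra safeguard introduced here for illustration.

```java
/** Hypothetical retry guard: requeue only while the message is still within its validity window. */
final class RetryPolicy {

    /** True once the validity window (created + ttl) has passed. All values in milliseconds. */
    static boolean isExpired(long createdMillis, long ttlMillis, long nowMillis) {
        return nowMillis - createdMillis >= ttlMillis;
    }

    /** A message is requeued only if it is unexpired and under the attempt cap (assumed parameter). */
    static boolean shouldRetry(long createdMillis, long ttlMillis, long nowMillis,
                               int attempts, int maxAttempts) {
        return attempts < maxAttempts && !isExpired(createdMillis, ttlMillis, nowMillis);
    }
}
```

A rate-limited message whose window has already closed is dropped rather than requeued, so stale content never reaches the user late.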

// Build the Xiaomi API request body; the notify_id parameter keeps message distribution idempotent
RequestBody requestBody = new FormBody.Builder()
        .add("payload", MAPPER.valueToTree(messageTemplate).toString())
        .add("restricted_package_name", packageName)
        .add("description", (messageTemplate.getDescription().length() > 120 ? messageTemplate.getDescription().substring(0, 120) + CutString.SUB_TAIL: messageTemplate.getDescription()))
        .add("extra.notification_large_icon_uri", StringUtils.trimToEmpty(message.getSummaryCallback()))
        .add("title", messageTemplate.getTitle().length() > 50 ? messageTemplate.getTitle().substring(0, 50) : messageTemplate.getTitle())
        .add("pass_through", "0")
        .add("notify_type", "-1")
        // When sending a message, developers can set the group ID(JobKey) of the message. Messages with the same group ID will be aggregated into a message group
        .add("extra.jobkey", String.valueOf(messageTemplate.getMessageId() & Integer.MAX_VALUE))
        .add("registration_id", StringUtils.join(deviceTokens, ","))
        // By default the notification bar shows only one push message. To show several, set a different notify_id for each message
        .add("notify_id", String.valueOf(messageTemplate.getMessageId() & Integer.MAX_VALUE))
        .build();
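The expression messageId & Integer.MAX_VALUE in the builder above keeps only the low 31 bits of the 64-bit message ID, which is what fits notify_id into the 0 to 2147483647 range quoted in the table. A small standalone sketch of that mapping (the class and method names are illustrative):

```java
/** Illustrates how a 64-bit message ID is folded into the 0..2147483647 range required for notify_id. */
final class NotifyIds {

    static int fromMessageId(long messageId) {
        // Integer.MAX_VALUE is 0x7FFFFFFF, so the mask clears bit 31 and everything above it.
        // The result is always non-negative and fits in an int.
        return (int) (messageId & Integer.MAX_VALUE);
    }
}
```

One consequence worth keeping in mind: two message IDs that differ only above bit 30 map to the same notify_id, in which case the later notification would overwrite the earlier one.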

iOS & other channels

Distribution to Apple devices goes through the official APNs service. It was initially implemented on top of the JDK, but because of poor performance it now uses an open-source third-party SDK: push

Problems occur occasionally in use, but most are caused by the network path. Our research suggested one remedy: deploy the service nodes that run iOS push tasks close to the APNs servers. Given actual usage and current iOS business requirements, however, this remains a discussion only.

The Meizu channel is integrated according to the official API documentation and meets current QPS and total-volume needs, so it is not discussed further here.

With the major vendor channels above in place, other third-party channels such as Umeng (Youmeng) or Aurora serve two purposes:

  1. Reach users on other phone brands to improve push coverage
  2. Act as a fallback in the system design to keep the system robust

Platform capacity building

Beyond channel capabilities, the push platform also builds out system, data and business capabilities.

System capability

The push platform currently runs on eight servers (4 vCPU, 8 GiB each). It distributes more than 800,000 messages per second in total and meets the business target of 1+ billion messages per day (the current performance bottleneck is the vendors' limits). Keeping the system highly available and stable takes more than a sound initial architecture: it also needs continuous optimization, iteration and tracking, with a metrics system that supports early warning and problem analysis.

Many problems came up while optimizing push channel distribution. Two representative ones are described here:

Manufacturer channel call selection

Channel distribution was initially built on the SDKs provided by the vendors. Most of these packages conflicted with the company's infrastructure and caused many problems in performance tuning and business compatibility: logging component conflicts, difficulty adjusting SDK thread pools and upgrading versions, incomplete response content from the HTTP interface, and so on. In the end we chose to wrap the vendors' API interfaces directly, parse each channel's message protocol ourselves, and unify the push channel integration standard.

For these reasons, the platform uses a message bus and asynchronous OkHttp requests, which unifies the data format, code model and performance targets.

// call_before: build the request body in a unified format before OkHttp sends it
public static RequestBody requestBodyFormat(MessageProto.Message message, String packageName, List<String> deviceTokens, boolean channelSwitch) throws UnsupportedEncodingException {
    MessageTemplate messageTemplate = MessageTemplate.messageConvert(message, MessageProto.Platform.XIAOMI);
    messageTemplate.setTitle(StringUtils.isEmpty(messageTemplate.getTitle()) ? PushTitleUtils.getTitleFromAPP(message.getApp()) : messageTemplate.getTitle());
    RequestBody requestBody = new FormBody.Builder()
    .add("payload", MAPPER.valueToTree(messageTemplate).toString())
    .add("restricted_package_name", packageName)
    .add("description", (messageTemplate.getDescription().length() > 120 ? messageTemplate.getDescription().substring(0, 120) + CutString.SUB_TAIL: messageTemplate.getDescription()))
    .add("extra.notification_large_icon_uri", StringUtils.trimToEmpty(message.getSummaryCallback()))
    .add("title", messageTemplate.getTitle().length() > 50 ? messageTemplate.getTitle().substring(0, 50) : messageTemplate.getTitle())
    .add("pass_through", "0")
    .add("notify_type", "-1")
    // When sending a message, developers can set the group ID(JobKey) of the message. Messages with the same group ID will be aggregated into a message group
    .add("extra.jobkey", String.valueOf(messageTemplate.getMessageId() & Integer.MAX_VALUE))
    // The batch interface accepts up to 1000 device tokens per request; using it fully improves system throughput
    .add("registration_id", StringUtils.join(deviceTokens, ","))
    // By default the notification bar shows only one push message. To show several, set a different notify_id for each message
    .add("notify_id", String.valueOf(messageTemplate.getMessageId() & Integer.MAX_VALUE))
    .build();
    return requestBody;
}
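The request builder above takes a list of device tokens that is presumably already capped at the batch size. A minimal sketch of that pre-aggregation step, together with the throughput arithmetic quoted earlier, might look like this. The class and method names are hypothetical, and the 1000-token cap is the figure quoted in the article for Xiaomi.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits device tokens into batches sized for a vendor's batch interface. */
final class TokenBatcher {
    // Per-request target cap quoted in the article for Xiaomi's batch interface.
    static final int MAX_TARGETS_PER_REQUEST = 1000;

    /** Splits tokens into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> tokens, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < tokens.size(); from += batchSize) {
            int to = Math.min(from + batchSize, tokens.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(tokens.subList(from, to)));
        }
        return batches;
    }

    /** Upper bound on devices reached per second: requests per second times targets per request. */
    static long maxDevicesPerSecond(int qps, int targetsPerRequest) {
        return (long) qps * targetsPerRequest;
    }
}
```

For 2,500 tokens this yields three requests instead of 2,500, and maxDevicesPerSecond(3000, 1000) gives the 3 million devices per second mentioned for a 3,000 QPS allocation.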

// call: send the channel message with OkHttp
public void send(List<UserStateProto.Device> deviceList, MessageProto.Message message, RequestBody requestBody) {
    List<Long> uidList_GE = deviceList.stream().map(m -> m.getUid()).collect(Collectors.toList());
    try {
        LOGGER.info("Xiaomi API call about to send, users uid list:{} | Message sent:{} | Push APP: {} | Batch messageId: {}", uidList_GE, OkHttp3ConvertUtils.requestBodyURLToString(requestBody), message.getApp().name(), message.getMessageId());
        Request request = new Request.Builder()
                .url(xiaomiSendUrl)
                .addHeader("Authorization", String.format("key=%s", accessToken))
                .post(requestBody)
                .build();
        Call call = okHttpClient.newCall(request);
        call.enqueue(new XiaomiResponseCall(deviceList, message, pushStatusProducer));
    } catch (Exception e) {
        exceptionCounter.increment(deviceList.size());
        LOGGER.error("Xiaomi API call threw an exception, failed users uid list:{} | Failure reason:{} | Push APP: {} | Batch messageId: {}", uidList_GE, e.getMessage(), message.getApp().name(), message.getMessageId(), e);
        pushStatusProducer.sendByDeviceList(PushResultEnum.FAIL, PushFailedTypeEnum.SYSTEM_ERROR, e.getMessage(), deviceList, message);
    }
}

// call_back: OkHttp asynchronous result callback
public void onResponse(Call call, Response response) throws IOException {
    String responseBody = URLDecoder.decode(response.body().string(), "UTF-8");
    if (response.isSuccessful()) {
        JsonNode obj = MAPPER.readTree(responseBody);
        if ("0".equals(obj.get("code").asText())) {
            JsonNode jsonNode = obj.findPath("data").findPath("bad_regids");
            if (jsonNode.isMissingNode()) {
                successCounter.increment(deviceList.size());
                LOGGER.info("Xiaomi API call succeeded for all users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", uidList, responseBody, message.getApp().name(), message.getMessageId());
                pushStatusProducer.sendByDeviceList(PushResultEnum.SUCCESS, PushFailedTypeEnum.NULL, "SUCCESS", deviceList, message);
            } else {
                List<String> failedTokenList = new ArrayList<>();
                for (String objNode : jsonNode.textValue().split(",")) {
                    failedTokenList.add(objNode);
                }
                List<UserStateProto.Device> failedList = deviceList.stream().filter(f -> failedTokenList.contains(f.getDeviceToken())).collect(Collectors.toList());
                failedCounter.increment(failedList.size());
                LOGGER.info("Xiaomi API call failed for some users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", failedList.stream().map(m -> m.getUid()).collect(Collectors.toList()), responseBody, message.getApp().name(), message.getMessageId());
                pushStatusProducer.sendByDeviceList(PushResultEnum.IGNORE, PushFailedTypeEnum.CHANNEL_ERROR, responseBody, failedList, message);
                List<UserStateProto.Device> successedList = deviceList.stream().filter(f -> !failedTokenList.contains(f.getDeviceToken())).collect(Collectors.toList());
                successCounter.increment(successedList.size());
                LOGGER.info("Xiaomi API call succeeded for some users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", successedList.stream().map(m -> m.getUid()).collect(Collectors.toList()), responseBody, message.getApp().name(), message.getMessageId());
                pushStatusProducer.sendByDeviceList(PushResultEnum.SUCCESS, PushFailedTypeEnum.NULL, "SUCCESS", successedList, message);
            }
        } else if ("200002".equals(obj.get("code").asText())) {
            // 200002: rate limited, retry later
            limitCounter.increment();
            LOGGER.warn("Xiaomi API call hit rate limiting, retrying users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", uidList, responseBody, message.getApp().name(), message.getMessageId());
            pushStatusProducer.sendMessageRetry(message.toBuilder().clearTarget().addAllTarget(uidList).build());
            return;
        } else {
            failedCounter.increment(deviceList.size());
            LOGGER.warn("Xiaomi API call failed for all users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", uidList, responseBody, message.getApp().name(), message.getMessageId());
            pushStatusProducer.sendByDeviceList(PushResultEnum.IGNORE, PushFailedTypeEnum.CHANNEL_ERROR, responseBody, deviceList, message);
        }
    } else {
        failedCounter.increment(deviceList.size());
        LOGGER.error("Xiaomi API call returned an error, failed users uid list:{} | Response body:{} | Push APP: {} | Batch messageId: {}", uidList, responseBody, message.getApp().name(), message.getMessageId());
        pushStatusProducer.sendByDeviceList(PushResultEnum.IGNORE, PushFailedTypeEnum.CHANNEL_ERROR, responseBody, deviceList, message);
    }
}
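The partial-failure branch of the callback above splits one batch into accepted and rejected devices using the comma-separated bad_regids value. The same step can be sketched in isolation as follows, working on plain token strings rather than the platform's device objects. The class name is hypothetical, and a HashSet replaces the list lookup purely as an illustrative choice for large batches.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper that splits a sent batch into accepted and rejected tokens. */
final class BadRegIdSplitter {

    /**
     * Returns the sent tokens keyed by outcome: true for tokens the vendor accepted,
     * false for tokens listed in the comma-separated bad_regids value.
     */
    static Map<Boolean, List<String>> split(List<String> sentTokens, String badRegIdsCsv) {
        Set<String> rejected = new HashSet<>();
        if (badRegIdsCsv != null && !badRegIdsCsv.isEmpty()) {
            for (String token : badRegIdsCsv.split(",")) {
                rejected.add(token.trim());
            }
        }
        // partitioningBy always returns both keys, even when one side is empty.
        return sentTokens.stream()
                .collect(Collectors.partitioningBy(token -> !rejected.contains(token)));
    }
}
```

Each side of the returned map can then be reported separately, mirroring the two sendByDeviceList calls in the callback.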

Full-link tracking of push messages

Because offline push does not go through the self-built long connection, locating the current state of each push message for each user is a problem that cannot be ignored. Each vendor's push console includes a debugging tool, so the platform's tracking data must record the vendor's trace_id from the API response for troubleshooting and data analysis.

For example, Xiaomi requires the IMEI and the batch ID returned by the interface; with these, the vendor-side delivery status can be looked up in the Xiaomi console.

Data capability

Once messages are being pushed, the next step is closed-loop management and effect tracking for each business and scenario, quantifying push effectiveness through a data dashboard. The dashboard currently covers dozens of business scenarios across three apps and provides both real-time data and offline analysis.

To build this data capability, the architecture sends data from every layer of the system link directly over the message bus. The format of each message is refined: msg_id + uid serves as the unique identifier, and the application side uniformly uses event_tracking as the push platform's tracking field, which standardizes the data metrics system and its access conventions.

// Data format specification for real-time push data on the message bus

public void sendByDevice(PushResultEnum pushResultEnum, PushFailedTypeEnum pushFailedTypeEnum, String reason, UserStateProto.Device device, MessageProto.Message message) {
    MessageAck messageAck = new MessageAck();
    messageAck.setUploadTime(System.currentTimeMillis());
    messageAck.setMsgId(message.getMessageId());
    messageAck.setUid(device.getUid());
    messageAck.setChannel(device.getDeviceChannel());
    messageAck.setResult(pushResultEnum.getTypeName());
    messageAck.setFailedType(pushFailedTypeEnum.getTypeName());
    messageAck.setFailedReason(reason);
    messageAck.setAppVersion(device.getAppVersion());
    messageAck.setToken(device.getDeviceToken());
    messageAck.setDescription(message.getDescription());
    messageAck.setApp(message.getApp().name());
    messageAck.setBizType(message.getExtMap().get(TrackingExtKey.BIZ_TYPE));
    // Extensible K/V field for ad hoc requirements
    messageAck.setExt(message.getExtMap());
    messageAck.setCallback(message.getCallback());
    sendMessageACK(messageAck);
}

With this push data capability we can analyze the app uninstall rate (this depends on the vendors' push tokens, so the data is for reference only), optimize popularity tags for push content, improve the delivery rate of each vendor channel, and improve the user experience of push-based features.
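As an illustration of the per-channel delivery rate mentioned above, acknowledgement records can be aggregated by channel. The Ack class here is a deliberately reduced stand-in for the MessageAck record, keeping only a channel name and a success flag; all names are assumptions made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical aggregation of acknowledgement records into a success rate per channel. */
final class DeliveryStats {

    /** Reduced stand-in for an ack record: only the channel and the outcome are kept. */
    static final class Ack {
        final String channel;
        final boolean success;

        Ack(String channel, boolean success) {
            this.channel = channel;
            this.success = success;
        }
    }

    /** Returns successes divided by total acks for each channel, in first-seen order. */
    static Map<String, Double> successRateByChannel(List<Ack> acks) {
        Map<String, long[]> counts = new LinkedHashMap<>(); // value: {successes, total}
        for (Ack ack : acks) {
            long[] c = counts.computeIfAbsent(ack.channel, k -> new long[2]);
            if (ack.success) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        // Every channel in the map has at least one ack, so the total is never zero.
        counts.forEach((channel, c) -> rates.put(channel, (double) c[0] / c[1]));
        return rates;
    }
}
```

In a real pipeline this aggregation would more likely run in the analytics layer that consumes the message bus; the in-memory version only shows the calculation.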

Business capability

A capable push operations console offers not only basic push distribution but also push effect analysis for operators. For each push message, detailed data is recorded at every stage to form a funnel analysis. Through the console, operators can follow a message's life cycle, quantify its effect, and refine subsequent topics and audience groups.
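The funnel described above reduces to step-to-step conversion between consecutive stages. The sketch below assumes a stage order such as sent, delivered, displayed and clicked; the article does not list the actual stages, so both the order and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical funnel calculation over per-stage counts for one push message. */
final class PushFunnel {

    /**
     * Returns the conversion from each stage to the next: counts[i] / counts[i - 1].
     * A stage whose predecessor has a zero count yields 0.0 rather than dividing by zero.
     */
    static List<Double> stepConversion(long[] stageCounts) {
        List<Double> rates = new ArrayList<>();
        for (int i = 1; i < stageCounts.length; i++) {
            long previous = stageCounts[i - 1];
            rates.add(previous == 0 ? 0.0 : (double) stageCounts[i] / previous);
        }
        return rates;
    }
}
```

Given counts of 1000, 800, 400 and 100 for the four assumed stages, the three conversions are 80%, 50% and 25%, which shows at a glance where a message loses most of its audience.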

Operation side

Operational decisions change constantly. Beyond basic functions such as scheduled task distribution, the platform's architecture isolates the functional layer from the data layer, which makes it easier to combine big data and algorithms for dynamic target selection and personalization.

Audit side

Vendors have their own strict standards for push content, and the domestic regulatory environment imposes strict rules on operations and on the handling of user data. The push platform therefore modularizes its data flow processing so that audit rules can be adjusted dynamically.

Summary

This article shared some of the problems faced and solved while building and optimizing the push platform, focusing on architecture and technology selection and on vendor channel optimization. Two points stand out:

  • Architecturally, decouple business functions from the data system as far as possible, using the message bus to separate business logic from data analysis
  • For channel distribution, interact through the vendors' API interfaces, which eases later maintenance, performance tuning and support for custom business needs

With these solutions and techniques, the problems listed at the beginning of the article are addressed as follows:

Problem | Implementation
Lack of ACK mechanism | The callback result of each HTTP interface call feeds back the vendor channel's ACK status in real time
Lack of message persistence | Each message is identified by msg_id + uid, and the data capability supports message tracking and interception
Lack of retransmission mechanism | The vendors' idempotency parameters are used directly to retransmit and deliver failed messages
Complex client access logic | Work with the front-end infrastructure team to consolidate basic capabilities and components for reuse and fast integration
Strong coupling between client and push service SDK | Standardize the tracking fields of all vendor interfaces, which slims down front-end code and standardizes the data flow
Lack of data monitoring and statistics | Enrich system monitoring and link tracing, and separate data code from feature code to make metrics easier to quantify

Future outlook

Intelligent frequency control and do-not-disturb design

Use the platform's overall resources more efficiently, reduce unnecessary interruptions, and devote resources to what users care about most.

Synchronized in-app and out-of-app push design

Work with the in-app waterfall-feed reminders to combine the vendors' offline push with online push over the long connection, reducing pressure on the push platform.

Complementary distribution of SMS and push

Combine push with SMS reminders to improve the delivery rate of key information and the product experience.

Reference link

APNs / MiPush / HMS / Opush / Vpush / meizu push

About the authors

He Kuang Province and Wang Wenwen, from Snowball's community platform / basic components team.


Keywords: Java

Added by Trent Hatred on Fri, 31 Dec 2021 08:34:44 +0200