From Code Quality to Integration: Optimizing Alibaba’s Blink Testing Framework

Alibaba completes testing and optimization for a revolutionary distributed open-source computing framework in just one year.

This article is part of Alibaba’s Flink series.

As demands on network equipment and infrastructures grow, there is an ever-increasing need for powerful solutions that can still operate when certain components are down. As one such solution, distributed computing frameworks effectively overcome potential network downtime by spreading loads over the entire network.

Blink is a distributed open-source computing framework for data stream processing and batch processing, supporting the real-time processing of big data for thousands of services in the Alibaba Group. To ensure its reliability, the Blink test team was set up in 2017 to establish a complete Blink test system from scratch and guarantee the quality of Blink.

In this article, we look at the fundamentals of open-source computing frameworks before exploring the Blink test platform in detail.

Open-source Computing Frameworks

Apache Flink is a widely used distributed open-source computing framework for data stream processing and batch processing. In 2016, Alibaba began building the Blink framework on top of Flink. In 2017, Alibaba consolidated all of its stream computing products and decided to build a world-leading real-time computing engine based on Blink.

For the 2017 11.11 Global Shopping Festival, Blink supported more than 20 business units and groups by running thousands of real-time computing jobs simultaneously, processing as many as 470 million logs per second. At this scale, assuring Blink’s reliability and stability is extremely important. To this end, the Blink test team established a comprehensive testing system in just over one year, progressing from code quality verification to continuous integration and pre-testing, and greatly improved Blink’s quality.

Blink Test Platform

The test team has tailored the Blink test platform to optimize its quality, as shown in the following figure:

As outlined above, the Blink test platform consists of three test phases.

First, in code quality verification, the static code scan, unit test, and minicluster-based test are performed.

Second, in integration testing, the functional test, performance test, and destructive stability test are performed.

Finally, in the pre-test, the simulation test is performed by simulating the user’s jobs, and the final version compatibility test is performed before the release.

The platform includes a subset of the test collection in precommit verification to detect problems in the code as early as possible. Large-scale functional, performance, and stability tests, meanwhile, usually run as part of the daily builds. The Blink test platform has also established a relatively complete quality measurement system: in addition to statistical analysis and change analysis of code coverage, it can generate test reports with one click and compare the quality of different versions side by side.
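The split between the precommit subset and the daily-build collection can be pictured as tag-based selection. The following is a minimal sketch in Python, not the platform's actual mechanism: the `SuiteCase` class, the tag names, and the case names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuiteCase:
    """Hypothetical test case descriptor; not a Blink platform type."""
    name: str
    tags: frozenset = field(default_factory=frozenset)


def select(cases, tag):
    """Return the names of all cases carrying the given tag, in order."""
    return [case.name for case in cases if tag in case.tags]


# Illustrative catalogue: fast checks run on precommit, heavy ones daily.
CASES = [
    SuiteCase("style_scan", frozenset({"precommit", "daily"})),
    SuiteCase("unit_tests", frozenset({"precommit", "daily"})),
    SuiteCase("minicluster_smoke", frozenset({"precommit", "daily"})),
    SuiteCase("tpch_full_run", frozenset({"daily"})),
    SuiteCase("failover_soak", frozenset({"daily"})),
]

precommit_suite = select(CASES, "precommit")
daily_suite = select(CASES, "daily")
```

Keeping the precommit subset small keeps feedback fast, while the daily build still exercises every case.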

Code quality verification phase

The code quality verification phase is the foundation of Blink quality assurance. It consists mainly of unit testing, code style scanning, and a standalone minicluster-based integration test. Blink code may be submitted to the project’s Git repository only after it passes all three, as shown in the following figure:

Functional Test

Blink’s functional testing is built on Defender, a framework derived from pytest that supports Blink SQL tests and third-party plugins. It can run accurate end-to-end tests of a scenario on a test cluster (as shown in the figure below). It supports two trigger modes, IDE and Jenkins, and three dispatch modes for running cases: yarn_job, yarn_session, and local. After execution, it presents the results as web pages or emails and persists them. Its advantages include:

1. Unified case dispatch and fine-grained management

At present, Blink has more than 4,000 cases across 12 scenarios on Defender, which run automatically at a fixed time every day. If cases in a particular category run into problems, that category can be executed separately and the details displayed on the page.

2. Three running modes meeting the test requirements of different scenarios

Of these, the yarn_session mode is best suited to modules that contain SQL cases, as it greatly reduces the time spent interacting with YARN.

3. Flexible case configuration

Besides the system-wide configuration, each case set can separately configure the resources it requires (slots, memory, and so on) and other cluster settings.

4. A case can support both batch and stream types of processing

5. Flexible extension of the client type

Existing data stores and services can be integrated and extended. Currently supported clients cover reading from and writing to multiple types of data stores, starting a yarn_session, interacting with Blink jobs, and so on.
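To make the dispatch modes and per-case-set configuration more concrete, here is a minimal sketch in Python. It does not reproduce Defender's actual API, which this article does not show: `CaseSet`, `plan_submissions`, and the resource fields are hypothetical names. It also assumes that yarn_session cases in a module share one session while each yarn_job case needs its own submission, which would explain why the session mode reduces interaction with YARN.

```python
from dataclasses import dataclass
from enum import Enum


class DispatchMode(Enum):
    YARN_JOB = "yarn_job"
    YARN_SESSION = "yarn_session"
    LOCAL = "local"


@dataclass
class CaseSet:
    """Hypothetical case set with its own mode and resource settings."""
    module: str
    mode: DispatchMode
    cases: list
    slots: int = 1
    memory_mb: int = 1024


def plan_submissions(case_set):
    """Count cluster submissions needed to run a case set (assumed model).

    yarn_job:     one YARN application per case.
    yarn_session: one shared session for the whole set.
    local:        no YARN interaction at all.
    """
    if case_set.mode is DispatchMode.YARN_JOB:
        return len(case_set.cases)
    if case_set.mode is DispatchMode.YARN_SESSION:
        return 1 if case_set.cases else 0
    return 0


sql_cases = ["agg_basic", "join_window", "topn"]
per_job = CaseSet("sql", DispatchMode.YARN_JOB, sql_cases)
shared = CaseSet("sql", DispatchMode.YARN_SESSION, sql_cases, slots=4, memory_mb=2048)
```

Under this model, the three SQL cases cost three submissions in yarn_job mode but only one in yarn_session mode, and each case set carries its own slot and memory settings.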

Performance Test

As a real-time big data processing engine, Blink places stringent requirements on both the amount of data processed per unit time and the real-time performance of that processing. Performance testing is therefore a very important part of Blink testing and one of the core criteria for deciding whether a new version of Blink can be released.

Blink’s performance test mainly includes operator performance testing, SQL performance testing, and runtime performance testing:

Operator performance testing

An operator is an atomic operation that makes up the semantics of SQL, such as Sum or Aggregate, and cannot be split any further. The operator performance test monitors how the performance of individual operators changes throughout development, to safeguard the optimization and improvement of local processing. Currently, operator testing is divided into performance tests of a single operator and performance tests of a combination of operators. Performance changes are reported through daily runs.

SQL performance testing

This test monitors how the performance of individual SQL queries changes during version development. TPC-H and TPC-DS are the industry-standard SQL performance benchmarks, with 22 and 103 test cases, respectively. The test platform incorporates both into the Blink performance test to measure Blink’s performance changes more fully.
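One simple way to turn per-query timings into a regression signal is to compare each query's runtime on the candidate version against a baseline. The sketch below is illustrative only: the 10% tolerance, the function name, and the timings are assumptions, not figures or tooling from the Blink team.

```python
def find_regressions(baseline, candidate, tolerance=0.10):
    """Return {query: slowdown_ratio} for queries slower than allowed.

    baseline, candidate: dicts mapping query name to runtime in seconds.
    tolerance: allowed relative slowdown (0.10 means 10%), an assumed value.
    Queries missing from the candidate are skipped here.
    """
    regressions = {}
    for query, base_time in baseline.items():
        new_time = candidate.get(query)
        if new_time is None or base_time <= 0:
            continue
        ratio = new_time / base_time
        if ratio > 1.0 + tolerance:
            regressions[query] = round(ratio, 2)
    return regressions


# Made-up timings for three benchmark-style queries.
baseline = {"q1": 12.0, "q3": 8.0, "q6": 4.0}
candidate = {"q1": 12.5, "q3": 10.0, "q6": 3.8}
slow = find_regressions(baseline, candidate)
```

With these made-up numbers only `q3` is flagged, because it runs 25% slower than the baseline while `q1` stays within tolerance and `q6` improves.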

Runtime performance testing

This test ensures that performance at the runtime level does not regress. It consists mainly of end-to-end performance testing and module performance testing. The end-to-end performance test first defines the test scenarios and then measures the job processing time of each scenario under a specified data volume. The module performance test covers network layer performance, dispatching layer performance, failover performance, and so on, and focuses on the job processing time in a specific scenario.

The plan for performance testing is to integrate the end-to-end (E2E) performance test, module-level performance test, and parameter tuning, so that together they help developers find the root cause of performance problems and see the effect of parameter tuning (as shown below).

Stability Test

For distributed systems that involve high concurrency, many nodes, and complex physical environments, physical node anomalies such as full disks and network latency are difficult to avoid. As a highly available distributed system, Blink must keep running stably and produce normal output under abnormal conditions. Adopting the approach that “the best way to avoid failure is to continually fail,” the team simulated the exceptions that can occur during a Blink task run in order to verify Blink’s stability.

Abnormal scenarios are divided into two categories:

· The first is related to the running environment, including machine restarts, network exceptions, disk exceptions, CPU exceptions, and so on. These exceptions are mainly simulated with shell commands.

· The second is related to Blink jobs, including RPC message timeouts, task exceptions, heartbeat message timeouts, and so on. These exceptions are mainly implemented through integration with Byteman.

In the stability test, these predefined abnormal scenarios are randomly selected and combined.

The stability test runs as an iterative loop. Each round involves triggering the abnormal scenario, submitting the task, waiting for the job to recover, and verifying the result. Verification mainly checks whether the checkpoint, container, and slot resources meet expectations. If a check fails, an alarm is issued. If verification passes, the next iteration starts, so that the stability of the task is verified over long-term operation.
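The iterative loop can be sketched as follows. This is a schematic in Python under stated assumptions: the fault names mirror the two categories listed above, while `pick_faults`, `stability_loop`, and the injected callables are hypothetical stand-ins for the real fault injection (shell commands or Byteman rules) and for the checkpoint, container, and slot checks.

```python
import random

ENV_FAULTS = ["machine_restart", "network_exception", "disk_exception", "cpu_exception"]
JOB_FAULTS = ["rpc_timeout", "task_exception", "heartbeat_timeout"]


def pick_faults(rng, max_faults=2):
    """Randomly select and combine predefined abnormal scenarios."""
    count = rng.randint(1, max_faults)
    return rng.sample(ENV_FAULTS + JOB_FAULTS, count)


def stability_loop(rounds, inject, submit, wait_recovered, verify, alarm, seed=0):
    """Run rounds of: trigger faults, submit task, wait for recovery, verify.

    All callables are supplied by the caller; this function only fixes the
    order of the steps. It raises an alarm and stops on the first failed
    check, and returns the number of rounds that passed verification.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(rounds):
        faults = pick_faults(rng)
        inject(faults)
        job = submit()
        wait_recovered(job)
        if not verify(job):
            alarm(job, faults)
            break
        passed += 1
    return passed
```

Passing the steps in as callables keeps the loop itself free of cluster details, so the same skeleton can be exercised with stubs before it is pointed at a real cluster.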

The stability test architecture is divided into four layers (as shown in the figure below):

· The component layer mainly includes testing Blink jobs, monkeys, and the dumper.

· The action layer includes job start, status check, output check, and so on.

· The execution layer includes services, abnormal scenario operations, and so on; abnormal scenario operations are carried out over SSH on the specific machine.

· The top layer is the UI.

Pre-test

The Blink pre-test phase mainly tests complex business logic and large data volumes by cloning real online tasks and data. It therefore supplements the code quality verification and integration tests, and is the last line of defense before a Blink version is released.

The Blink pre-test is divided into simulation testing and compatibility testing.

Simulation testing

The simulation test further measures the basic test indicators such as Blink’s functionality, performance, and stability, and compares the developing version with the current online version. As a result, the simulation test can identify various problems related to features, performance degradation, and stability at an earlier stage, improving the quality of the online version.

The simulation test is divided into three stages: environment cloning, environment adaptation, and test run.

Environment cloning

Environment cloning is the basis for implementing the entire simulation testing, including the selection and cloning of online tasks and the sampling of test data (as shown below).

Blink’s online tasks are numerous and scattered across many different projects. Although each online task has its own internal business logic, tasks can be classified by their main processing logic, for example into a task set based on Agg operations, a task set based on Sum operations, and so on. The Blink simulation test therefore needs to classify the online tasks and select the most representative ones.

The test data set of the simulation test is a sample of the input data of the current online task. It differs from the online data only in scale, and the scale can be adjusted dynamically according to the test requirements, allowing the test target to be measured accurately.
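Selecting representative tasks and sampling their input can be sketched like this. The grouping key (`main_op`), the task records, and the choice of the highest-traffic task in each group as the "most representative" one are assumptions made for illustration; the article does not say how the team ranks tasks or draws samples.

```python
import random
from collections import defaultdict


def pick_representatives(tasks):
    """Group tasks by main operator and keep the busiest one per group."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["main_op"]].append(task)
    return {
        op: max(members, key=lambda t: t["records_per_sec"])["name"]
        for op, members in groups.items()
    }


def sample_input(records, fraction, seed=0):
    """Draw a reproducible random sample; fraction sets the data scale."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    size = max(1, int(len(records) * fraction))
    return random.Random(seed).sample(records, size)


tasks = [
    {"name": "orders_agg", "main_op": "Agg", "records_per_sec": 90_000},
    {"name": "clicks_agg", "main_op": "Agg", "records_per_sec": 250_000},
    {"name": "gmv_sum", "main_op": "Sum", "records_per_sec": 40_000},
]
representatives = pick_representatives(tasks)
sample = sample_input(list(range(1000)), fraction=0.05)
```

Changing `fraction` is the only knob needed to rescale the test data, which matches the idea that the sample differs from the online input only in size.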

Environment Adaptation

Environment adaptation is the initialization phase of the simulation test process; its main purpose is to modify the test cases so that they run normally. The process consists of two main steps: changing the test data input source and the test result output address, and updating the resource configuration of the task.
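The two adaptation steps can be illustrated with a small sketch. The task layout (a dictionary with `source`, `sink`, and `resources` keys) and the `online://` and `test://` addresses are hypothetical; real Blink job definitions are not shown in this article.

```python
import copy


def adapt_task(online_task, test_source, test_sink, resources):
    """Return a test copy of a cloned task without mutating the original.

    Step 1: point the input source and result output at test locations.
    Step 2: update the resource configuration for the test cluster.
    """
    task = copy.deepcopy(online_task)
    task["source"] = test_source
    task["sink"] = test_sink
    task["resources"] = {**task.get("resources", {}), **resources}
    return task


online = {
    "name": "clicks_agg",
    "source": "online://clicks",
    "sink": "online://clicks_agg_result",
    "resources": {"slots": 64, "memory_mb": 4096},
}
adapted = adapt_task(online, "test://clicks_sample", "test://clicks_agg_out", {"slots": 8})
```

Working on a deep copy means the cloned online definition stays untouched, so the same task can be adapted again for a different test cluster.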

Test Run

Test run is the actual execution module in the simulation test process, including the test case run and result feedback.

Blink simulation testing includes the functional test, performance test, and stability test. Different test modules have different metrics and feedback methods. The test results of these test modules, together with the results of the code quality check and integration test, form the result set of the Blink test (as shown below).

The performance test and the functional test use simulation tasks and sample data as input to compare and analyze the execution and output of tasks on different execution engines. The performance test focuses on performance metrics such as resource utilization and throughput of the different execution engines during execution, while the functional test compares the final results of the execution. Note that in the functional test, the result of the online version is assumed to be correct: when the execution result of the online version differs from that of the developing version, the developing version is considered wrong, and the error in it needs to be fixed.
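The functional comparison, with the online version treated as the reference, might look like the following sketch. It assumes that outputs can be compared as unordered collections of rows; the function name and the sample rows are hypothetical.

```python
from collections import Counter


def compare_outputs(online_rows, dev_rows):
    """Compare outputs as multisets, treating the online result as truth.

    Returns (passed, missing, unexpected), where missing holds rows the
    developing version failed to produce and unexpected holds extra rows.
    Row order is ignored, which is an assumption of this sketch.
    """
    expected = Counter(online_rows)
    actual = Counter(dev_rows)
    missing = expected - actual
    unexpected = actual - expected
    return (not missing and not unexpected), missing, unexpected


online_rows = [("shop_a", 3), ("shop_b", 5)]
dev_rows = [("shop_b", 5), ("shop_a", 4)]
passed, missing, unexpected = compare_outputs(online_rows, dev_rows)
```

Here the developing version produced a different count for `shop_a`, so the check fails and reports the online row as missing and the developing row as unexpected.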

The stability test focuses on the stability of simulation tasks in the cloned online environment, with large data volumes and long running times. It uses the Blink developing version as the only execution engine, and measures stability metrics such as resource utilization, throughput, and failover during execution.

Compatibility Test

The Blink compatibility test is mainly used to discover compatibility issues between new and old Blink versions. At present, compatibility testing is mainly divided into two stages: static checking and dynamic execution checking.

Static Check

Static check is mainly used to analyze differences in the generated execution plan of the online task under different execution engines and includes two aspects:

· The correctness of the execution plan generated by the new execution engine and the length of generation time.

· The compatibility of plans generated by new and old execution engines, respectively.

The static check is considered a failure if the new execution engine cannot generate an execution plan correctly, or if generating the plan takes longer than expected; either case indicates an exception or defect in the new version of Blink. When the new version generates the execution plan correctly but the plan is not compatible with the one generated by the old version, the comparison result is fed back to the developer to determine whether the change in the execution plan is expected. If the execution plans are compatible, or the change is expected, the runtime test can be run directly for the new version.
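The decision logic of the static check can be written down as a small function. This is a sketch of the rules as described in the paragraph above; the enum, the parameter names, and the time limit are hypothetical, and the article does not say what happens when a developer judges a plan change to be unexpected, so this sketch simply leaves such cases in review.

```python
from enum import Enum


class StaticCheck(Enum):
    FAILED = "failed"
    NEEDS_REVIEW = "needs_developer_review"
    RUN_RUNTIME_TEST = "run_runtime_test"


def static_check(plan_ok, gen_seconds, max_seconds, compatible, change_expected=None):
    """Decide the outcome of the static check.

    plan_ok: the new engine produced a correct execution plan.
    gen_seconds / max_seconds: plan generation time and its allowed limit.
    compatible: the new plan is compatible with the old engine's plan.
    change_expected: developer verdict on an incompatible plan, or None
    if no verdict has been given yet.
    """
    if not plan_ok or gen_seconds > max_seconds:
        return StaticCheck.FAILED
    if compatible or change_expected is True:
        return StaticCheck.RUN_RUNTIME_TEST
    # Incompatible plan without a positive verdict: hand back for review.
    return StaticCheck.NEEDS_REVIEW
```

For example, `static_check(True, 1.2, 5.0, compatible=False)` returns `StaticCheck.NEEDS_REVIEW`, and the same call with `change_expected=True` proceeds to the runtime test.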

Dynamic Runtime Test

The Blink dynamic runtime test uses the functional test module of the simulation test to run the task, and is the last round of testing before upgrading to the new version of Blink. If the task starts normally and the test result is in line with expectations, the task is considered capable of automatic upgrade; otherwise, it must be upgraded manually.

Outlook

After just over a year of hard work, the overall quality of Blink has been greatly improved, and the testing methods and tools are becoming increasingly mature. As the Blink user base grows, demand from Blink service developers for quality assurance of their service tasks is increasing dramatically. Moving forward, the Blink test team aims to provide more quality assurances and development tools to further improve Blink developers’ engineering efficiency.

(Original article by Zhao Li 赵丽)
