When Machines Fix Themselves: Alibaba’s Self-healing Hardware Solution

Written by alibabatech | Published 2018/12/12
Tech Story Tags: devops | big-data | cloud-computing | system-architecture | machines-fix-themselves

Alibaba is implementing a closed-loop self-healing hardware strategy to resolve failures automatically.

In big data systems, operation and maintenance work plays a crucial role in ensuring that service interruptions from hardware and software failures do not threaten the overall stability of platforms. Given how difficult this is in massive data environments, groups such as Alibaba are increasingly turning to automated solutions that reduce the burden on the personnel responsible, one recent effort being self-healing hardware systems.

At Alibaba, the offline computing platform MaxCompute handles a massive 95 percent of all data storage and computing needs, a scale that already spans hundreds of thousands of servers as the business continues to grow. Among these machines, hardware failures are difficult to detect at the software level because the platform operates offline. Meanwhile, Alibaba’s unified threshold for hardware malfunction alerting often misses hardware failures that impact applications, and each missed failure poses a significant threat to the stability of the cluster.

Dealing with these issues first involves the timely discovery of hardware failures, followed by effective migration of services off the failed machines. This article looks at how Alibaba has enabled machines to perform these tasks independently, first exploring the Tianji application management system in Alibaba’s Apsara operating system and then introducing DAM, Alibaba’s automated self-healing hardware platform.

Application Management in Tianji

The MaxCompute platform is built on the Apsara operating system running in Alibaba’s data centers, within which all applications are managed by Tianji. As an automated data center management system, Tianji manages the hardware lifecycle of various types of static resources in the data center, including programs, configurations, operating system images, and data.

Alibaba’s self-healing hardware system is closely integrated with Tianji, using Tianji’s healing mechanism to build a closed-loop system oriented to detecting hardware failures and performing automatic repair for complex services.

Tianji makes it possible to issue restart, reload, and repair instructions to the physical machines in the system. Tianji translates these instructions for each application on the physical machine, allowing the application to decide how to respond based on its own business characteristics and self-healing scenarios.
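
To make this concrete, the sketch below shows how such an instruction hand-off might look from an application’s side. Tianji’s actual API is not described in this article, so the instruction set, class names, and handler methods here are all assumptions for illustration:

```python
from enum import Enum

class Instruction(Enum):
    RESTART = "restart"  # reboot the physical machine
    RELOAD = "reload"    # reinstall the machine's software image
    REPAIR = "repair"    # take the machine offline for hardware repair

class DemoApp:
    """Hypothetical stand-in for an application managed by Tianji."""
    def on_repair(self):
        print("draining traffic and migrating replicas off this machine")
    def on_reload(self):
        print("checkpointing state that a reinstall would destroy")
    def on_restart(self):
        print("flushing buffers; a reboot is survivable")

def handle(app: DemoApp, instruction: Instruction) -> None:
    # Each application decides how to respond based on its own
    # business characteristics and self-healing scenarios.
    dispatch = {
        Instruction.REPAIR: app.on_repair,
        Instruction.RELOAD: app.on_reload,
        Instruction.RESTART: app.on_restart,
    }
    dispatch[instruction]()

handle(DemoApp(), Instruction.REPAIR)
```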

Discovering Hardware Failures

The most important hardware failures to detect are those concerning the hard disk, memory, CPU, network card, and power supply. The following sections detail some of the tools and approaches available for dealing with these common problems.

Hard disk failures account for more than half of all hardware failures. Of these, the most common type is hard disk media failure, which usually surfaces as file read/write operations failing, hanging, or running slowly. However, such read/write symptoms do not by themselves prove a media failure; the appearance of a media failure first needs to be corroborated at each layer, as discussed in the following sections.

Kernel log

A kernel log error refers to an error such as the following appearing in /var/log/messages:
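
(The article’s original excerpt is omitted here; the lines below are representative Linux kernel messages for a disk media error, not Alibaba’s actual logs.)

```
kernel: sd 0:0:10:0: [sdk] Sense Key : Medium Error [current]
kernel: sd 0:0:10:0: [sdk] Add. Sense: Unrecovered read error - auto reallocate failed
kernel: blk_update_request: I/O error, dev sdk, sector 104857600
kernel: Buffer I/O error on device sdk1, logical block 13107193
```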

Tsar io

A tsar io indicator change refers to any shift or jump in indicators such as rs, ws, await, svctm, and util. Since read/write operations pause during the error-alarming period, the pause is usually reflected in iostat and from there collected into tsar.

Among the tsar io indicators there is a rule for judging whether the hard disk is working properly, expressed as qps = rs + ws < 100 while util > 90. If no large-scale kernel problem is present, this combination generally indicates a hard disk failure.
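
As a rough illustration (not Alibaba’s implementation), the rule can be expressed as a simple predicate over the collected indicators:

```python
def disk_looks_faulty(rs: float, ws: float, util: float) -> bool:
    """Rule of thumb from the article: qps = rs + ws < 100 while util > 90.

    A disk that is over 90 percent busy yet completes fewer than 100
    requests per second is stalling rather than serving load, which
    points to hardware, absent a large-scale kernel problem.
    """
    qps = rs + ws  # reads/s + writes/s, as reported by iostat/tsar
    return qps < 100 and util > 90

print(disk_looks_faulty(rs=12, ws=3, util=98))  # True: suspicious disk
```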

System indicator

System indicator changes are usually a downstream effect of io changes. For instance, processes stuck in uninterruptible sleep (D state) drive up the load average.
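
For reference, D-state processes can be counted straight from /proc on any Linux host; this generic sketch is not part of Alibaba’s tooling:

```python
import os

def count_d_state() -> int:
    """Count processes in uninterruptible sleep ('D' in /proc/<pid>/stat).

    Each D-state process adds one to the load average while using no CPU,
    which is why a stalling disk shows up as rising load.
    """
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                # The state field follows the parenthesized command name.
                state = f.read().rsplit(")", 1)[1].split()[0]
            if state == "D":
                count += 1
        except OSError:
            pass  # the process exited between listdir() and open()
    return count

print(count_d_state())
```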

SMART value hopping

SMART value hopping refers specifically to jumps in attribute 197 (Current_Pending_Sector) and attribute 5 (Reallocated_Sector_Ct). These two values relate to read/write exceptions in two ways. First, after a media read/write exception occurs, 197 (pending) increments by 1 in the SMART data, indicating there is a sector awaiting confirmation. Second, when the hard disk is idle, it re-verifies the sectors pending in attribute 197: if the read/write passes, 197 (pending) decrements by 1; if the read/write fails, 197 (pending) decrements by 1 and 5 (reallocated) increments by 1.
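
One minimal way to watch these two attributes is to parse the output of smartctl -A. The sketch below assumes the standard ATA attribute table layout and is illustrative only:

```python
import re
import subprocess

def pending_and_reallocated(device: str) -> tuple[int, int]:
    """Return the raw values of SMART attributes 197 and 5 for a device.

    Parses `smartctl -A <device>` output; each attribute line starts with
    its ID and ends with its raw value in the standard ATA table layout.
    """
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
    values = {}
    for line in out.splitlines():
        m = re.match(r"\s*(5|197)\s+\S+.*\s(\d+)\s*$", line)
        if m:
            values[int(m.group(1))] = int(m.group(2))
    return values.get(197, 0), values.get(5, 0)

# A jump in 197 that later turns into a jump in 5 means a pending sector
# was re-verified, failed the read/write, and was remapped.
pending, reallocated = pending_and_reallocated("/dev/sda")
print(f"Current_Pending_Sector={pending} Reallocated_Sector_Ct={reallocated}")
```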

General discovery conclusions

Across the entire error-alarming chain, observing any single stage in isolation is inadequate; a comprehensive analysis of multiple stages is needed to conclusively prove a hardware problem. Because media failures can be proven rigorously this way, the derivation can also be reversed, making it possible to quickly distinguish between software and hardware problems when an unknown problem occurs.

The tools discussed above are based on operation and maintenance experience and on lessons learned from failure scenarios. Even so, a single source of discovery is far from enough, which is why Alibaba has introduced additional sources of hardware failure detection and combined various inspection methods.

Converging on Hardware Issues

While the tools and pathways discussed in the previous section can be applied to discover hardware failures, not every discovery should necessarily be reported. Alibaba Group observes two key principles when converging on hardware issues.

First, indicators are kept as independent of applications and services as possible. Some application indicators that correlate strongly with hardware failures are monitored without being treated as an actual source of discovery. For instance, when io util rises above 90 percent, the hard disk is especially busy but not necessarily failing; read/write hotspots can produce the same picture. An actual problem is only weighed when the hard disk stays in the state “io util > 90 and iops < 30” for more than 10 minutes.

Second, Alibaba follows a practice of being sensitive in collection and cautious in convergence. While every possible attribute of a hardware fault should be collected, most should serve only as reference and not as a basis for repair in the final automatic convergence analysis. To continue the previous example, if io util > 90 and iops < 30 are the only conditions present, the hard disk will not be automatically repaired, because a kernel problem can also lead to such a scenario. Only when explicit faults appear in the indicators, such as a smartctl timeout or fault sectors, can the hard disk be diagnosed as faulty. Short of these conditions, the hard disk is isolated but not repaired.
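
Taken together, the two principles amount to a convergence decision along these lines; the thresholds come from the article, while the function and fault flags are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class DiskObservation:
    util: float             # io util, percent
    iops: float             # rs + ws
    busy_minutes: float     # how long the util/iops condition has held
    smartctl_timeout: bool  # explicit fault evidence
    fault_sectors: bool     # explicit fault evidence

def converge(obs: DiskObservation) -> str:
    """Decide among 'ignore', 'isolate', and 'repair'.

    Collection is sensitive (everything is recorded), but convergence is
    cautious: soft indicators alone only isolate the disk, and an explicit
    fault is required before a repair is ordered.
    """
    suspicious = obs.util > 90 and obs.iops < 30 and obs.busy_minutes >= 10
    if not suspicious:
        return "ignore"   # e.g. a busy disk serving a read/write hotspot
    if obs.smartctl_timeout or obs.fault_sectors:
        return "repair"   # explicit fault: diagnose the disk as faulty
    return "isolate"      # a kernel problem could still explain this

print(converge(DiskObservation(95, 12, 15, False, False)))  # isolate
```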

Rate of coverage

Taking the IDC work orders of a production cluster in a given month x of a given year 20xx as an example, statistics were compiled on hardware failures and work orders. Excluding out-of-band (oob) faults, the discovery rate for hardware failures in that month was 97.6 percent.

Self-healing for Hardware Failures

For every machine that experiences a hardware problem, Alibaba will open an automatic rotation work order to follow up.

Currently, Alibaba uses two self-healing processes: the with-application repair process and the no-application repair process. The former is used for failures of hot-pluggable hard drives, while the latter is used for all other hardware failures across the rest of the machine.

Alibaba’s automation process features several subtly effective designs.

First, the process is designed around diskless diagnosis. Downed machines that cannot enter ramos trigger the “downtime without a reason” maintenance work order. This design greatly reduces false positives and in turn the workload of service desk workers. Further, diskless stress tests completely eliminate the influence of the current kernel or software version, and can thus truly determine whether or not the hardware is experiencing performance issues.

Second, the process is designed to determine the impact range of a problem and to perform impact upgrades. For with-application repairs, the method checks whether a process is down. If it has been down for more than 10 minutes, it can be inferred that the impact of the hard disk failure has extended to the whole machine, which requires rebooting it. If the machine cannot restart when this happens, there is still no need for manual intervention, as an impact upgrade is performed directly: the with-application repair process is upgraded to a no-application repair process.
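
A minimal sketch of this escalation ladder follows; the Machine class and its methods are hypothetical stand-ins, with only the 10-minute threshold taken from the article:

```python
import time

PROCESS_DOWN_LIMIT_S = 10 * 60  # the article's threshold: down for 10+ minutes

class Machine:
    """Hypothetical stand-in for the platform's view of one server."""
    def __init__(self, process_down_since=None):
        self.process_down_since = process_down_since  # epoch seconds, or None
    def repair_disk_in_place(self):
        print("with-application repair: fixing the hot-pluggable disk only")
    def reboot(self) -> bool:
        print("impact reached the whole machine: rebooting")
        return False  # pretend the reboot failed
    def start_no_application_repair(self):
        print("impact upgrade: switching to the no-application process")

def escalate(m: Machine) -> None:
    if m.process_down_since is None:
        m.repair_disk_in_place()  # the process is up; impact is one disk
    elif time.time() - m.process_down_since > PROCESS_DOWN_LIMIT_S:
        if not m.reboot():  # reboot failed: upgrade with no manual step
            m.start_no_application_repair()

escalate(Machine(process_down_since=time.time() - 3600))  # down for an hour
```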

Third, the process features automatic backup handling for unknown problems. On some machines, diskless diagnosis is only reached after the machine has already gone down, yet the stress test detects no hardware problem. In such cases, the only remaining step is to reinstall the machine, and in some cases the hardware problem is detected and fixed during the reinstallation.

Finally, the process is designed to enable downtime analysis during the handling of hardware failures. It should be noted that the overall process is oriented mainly toward solving problems; the downtime analysis is a by-product of the system. Alibaba has also automatically fed downtime diagnosis results into this analysis, achieving a “1+1>2” effect.

Statistical analysis of processes

When a single hardware problem repeatedly triggers the self-healing process, statistics on process work orders become a means of discovering it. For example, in the case of a virtual serial port problem on the Lenovo RD640 machine, statistics revealed that self-healing recurred on machines of the same model and that the same problem reappeared even after a machine was reinstalled. Alibaba therefore isolated those machines to ensure the stability of the cluster while an investigation into the problem proceeded.

Misconceptions about service association

With Alibaba’s complete self-healing system, the category of “unknown problems” can end up absorbing problems that actually stem from services, kernels, or software. However, solving service problems with hardware self-healing methods tends to perpetuate a dependence on it as a backup solution for problems that remain unexplained.

Presently, Alibaba is gradually removing the processing of non-hardware problems and refocusing the system on hardware self-healing. While general self-healing software can also be run on some systems, the strong coupling between such scenarios and their services makes them unsuitable for generalization across the entire Alibaba Group. This refocusing is more conducive to classifying hardware and software problems and to discovering unknown problems.

Evolving Architectures

Alibaba Group’s self-healing architecture has gone through extensive changes over the course of its development. The original version ran on the control machine of each cluster, since operation and maintenance personnel at that time dealt with the various problems occurring on the control machine directly. As automation continuously deepened, this architecture proved a serious hindrance to opening up the data, leading Alibaba to rebuild it on a more centralized basis; this in turn raised the challenge of processing massive data, for which a small number of servers proved inadequate.

Cloudization

To address the above challenges, Alibaba further reconstructed its system in a distributed, service-oriented way that supports large-scale business scenarios, disassembles the architecture into its component modules, and introduces three key cloud products: Alibaba Cloud Log Service (SLS), Alibaba Cloud StreamCompute (Blink), and Alibaba Cloud Analytic Database (ADS). In this arrangement, the various collection and analysis tasks are offloaded to the cloud products, and the server retains only the most central functions of hardware analysis and decision-making.

(The original article includes a diagram comparing the architectures of self-healing platform versions DAM1 and DAM3.)

Data-oriented architecture

Through sustained exploration of its self-healing system, Alibaba has found that each stage yields a stable output of data, and that higher-dimensional analysis of this data reveals clear, valuable information. The group has also reduced the dimensionality of these higher-dimensional analysis results into a health score for each machine. With this health score, operation and maintenance personnel can quickly determine the hardware situation of a single machine, an entire cabinet, or a whole cluster.
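
As a toy illustration of the scoring idea (the components and weights here are invented, not Alibaba’s), a per-machine health score might collapse normalized per-component signals into a single number:

```python
def health_score(signals: dict[str, float]) -> float:
    """Collapse per-component signals (each 0..1, where 1 = healthy)
    into a single 0-100 machine health score.

    The components and weights below are invented for illustration.
    """
    weights = {
        "disk": 0.4,     # media errors, SMART trend
        "memory": 0.2,   # corrected-ECC error rate
        "cpu": 0.2,      # throttling, machine-check events
        "network": 0.2,  # link flaps, NIC error counters
    }
    return 100 * sum(w * signals.get(name, 1.0) for name, w in weights.items())

# A machine with a degrading disk but otherwise healthy components:
print(health_score({"disk": 0.3, "memory": 1.0, "cpu": 1.0, "network": 0.95}))
# -> 71.0: worth a closer look at the machine, cabinet, or cluster view
```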

Service-oriented architecture

Based on its control of full-link data, Alibaba can provide the entire failure self-healing system as a standardized hardware-lifecycle service to different product lines. With decision-making fully abstracted, the system exposes specific thresholds for perception, supports customization by different product lines, and forms a full-lifecycle service suited to individual needs.

A Closed-loop Self-healing System for Failures

The most common application scenario for the AIOps closed loop of perception, decision-making, and execution is hardware and software self-healing. Industry-wide, automatically resolving failures is the first point at which AIOps is implemented. For Alibaba, providing a common closed-loop system for self-healing is the cornerstone of its work on both AIOps and NoOps (unmanned operation and maintenance). Such a system is especially important for coping with operation and maintenance needs in a massive system.

In a complex distributed system, operational conflicts between different architectures are inevitable because of information asymmetry between them. This asymmetry arises because each distributed software architecture is designed to run as its own closed loop. Such conflicts can be worked around through the strategic use of various mechanisms and operation and maintenance tools, but these methods amount to patching, and amid continuous architecture upgrades the need for such patches grows relentlessly. The process must therefore be abstracted into self-healing and declared explicitly at the architectural level, so that software can take part in the entire self-healing process and what were originally conflicts can be transformed into synergy.

Presently, Alibaba is dedicating its efforts to the largest issues in operation and maintenance scenarios, to hardware and software conflicts, to architecture construction and product design, and to improving the overall robustness of complex distributed systems through self-healing.

Key findings from Alibaba’s work with self-healing hardware across a large number of machines include that the side effects of the operation and maintenance tools incorporated into the self-healing system gradually decrease as the tools are steadily refined. Further, the manual operation and maintenance activities incorporated into the system have gradually become automated. Each O&M operation now carries a stable SLA commitment time and is no longer an operation and maintenance script that can fail at any moment.

In all, the DAM self-healing system’s function is to achieve universality by fully abstracting operation and maintenance automation on top of complex distributed systems. By then constructing a closed-loop architecture for this purpose, the architecture ecosystem as a whole becomes more fully coordinated.

(Original article by Zhong Jiong’en 钟炯恩)

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.

