Why ‘Miles per Disengagement’ misses the point as an autonomous vehicle success metric

Written by walter_90506 | Published 2019/02/18
Tech Story Tags: self-driving-cars | trucking | safety | transportation | autonomous-vehicles


Recently I came across yet another autonomous vehicle disengagement report claiming thousands of autonomous miles driven per disengagement.

Essentially, most current attempts to evaluate the readiness of autonomous technology rely heavily on one popular metric: “miles per disengagement”. Many use it, but few really understand what it means. I think it’s a good time to demystify it.

California requires developers who test Level 3 and above automated driving systems in the state to report some basic statistics about vehicle operations each year, including the number of vehicles currently in operation, miles driven and the number of disengagements. These last two figures are combined to calculate miles per disengagement.

Basically, miles per disengagement shows how often human drivers are forced to take control of their self-driving vehicles.
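To make the arithmetic concrete, here is a minimal Python sketch of how the headline figure is derived from the two reported numbers (the values are illustrative, not from any actual report):

```python
def miles_per_disengagement(autonomous_miles: float, disengagements: int) -> float:
    """Autonomous miles driven divided by the number of reported disengagements."""
    if disengagements == 0:
        # A zero-disengagement period has no finite ratio -- one reason
        # the headline number can be hard to interpret on its own.
        return float("inf")
    return autonomous_miles / disengagements

# Illustrative values only (not from any company's actual report):
print(miles_per_disengagement(50_000, 25))  # -> 2000.0 miles per disengagement
```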

This metric is closely watched by industry observers and always gets lots of media attention. At this point, you may be eager to know who has the highest miles per disengagement today. But here’s the catch: it doesn’t really matter.

Starsky Robotics’ trucks do zero-disengagement runs on a daily basis. But the thing is, there’s no clear link between these numbers and public safety.

There is no target value of miles per disengagement a company could hit that would definitively make the case that its autonomous vehicles were ready to be deployed. That’s because a few basic facts about the statistic make it difficult for anyone outside the reporting company to understand what the numbers really mean.

[Image: Starsky Robotics autonomous truck]

RAND has published a study, Measuring Automated Vehicle Safety: Forging a Framework, that goes into these considerations of metrics and safety in great detail. The authors discuss metrics and divide them into two broad types: leading indicators and lagging indicators. Leading indicators should help give clues about safety before accidents happen; lagging indicators are mostly statistics about actual harm after the accidents happen.

Most U.S. DOT policies are built around tracking lagging indicators. The U.S. government tracks traffic accidents, injuries, fatalities, and economic damage. These data are used to look for trends, to inform policy decisions, and to track the impact of new technologies. Ultimately, most people expect AVs to be judged by how their safety track record compares to that of human drivers. AVs should be as safe as or safer than human drivers to justify their presence on public roads.

Unfortunately, there are two problems with this approach. The first is that because accidents are a lagging indicator, we won’t actually know how AVs are doing until they are already on the road driving. The second is that it is practically impossible to prove that AVs are as safe as humans without letting an enormous automated fleet drive for a very long time (billions of fleet miles). And if we wait years to accumulate those statistics before allowing AVs on public roads, we will likely cause more harm than good.
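A rough back-of-envelope calculation shows why the mileage requirement is so enormous. Assuming, for illustration, that failures follow a Poisson process, the number of failure-free miles needed just to be confident an AV matches a human baseline works out as follows:

```python
import math

# Back-of-envelope sketch (my assumption: failures follow a Poisson process).
# If the true failure rate is r per mile, the chance of seeing zero failures
# in m miles is exp(-r * m). To claim, with confidence 1 - alpha, that the
# rate is at or below a baseline, we need:  m >= -ln(alpha) / baseline_rate.
def failure_free_miles_needed(baseline_rate_per_mile: float, alpha: float = 0.05) -> float:
    return -math.log(alpha) / baseline_rate_per_mile

# Human drivers have a fatality rate on the order of 1 per 100 million miles.
print(f"{failure_free_miles_needed(1e-8):,.0f}")  # ~300,000,000 failure-free miles
# And that only *matches* the human rate; demonstrating a statistically
# significant improvement takes far more, hence "billions of fleet miles."
```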

Our engineering practices are built around leading indicators. At Starsky Robotics, safety is our number one priority. So as we design, build, and test our systems, we are always considering ways to check that things are working correctly, and to flag the conditions or situations in which they are not. These measurements drive our engineering team to identify problems and fix them as we work. Disengagements are certainly one type of leading indicator.

Disengagements happen either when a safety driver detects bad behavior and takes control of the vehicle, or when the vehicle itself detects something wrong and calls for a human to take over. Some disengagements are “unwanted”: we do a test drive and hope everything goes well, but it doesn’t. Others are deliberate: we might be trying to ensure that our vehicle rejects situations outside of its operational design domain (ODD), for example, or running planned experiments to make sure the AV responds appropriately to various types of failure.

Disengagements as a metric make sense to us, as the developer, because we understand when and how each disengagement happened. The same numbers would mean very little to someone outside the company without that context. One week we might try a new, difficult highway (with curves, construction, and heavy traffic) to challenge our system and have multiple disengagements; the next week we might go back to a well-understood, easy route and have zero. Does this mean we’ve made progress? Maybe, maybe not. It all depends on the context.
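The context problem becomes clear if you imagine what a useful disengagement record would have to contain. Here is a hypothetical sketch in Python (my illustration, not our actual schema) of the fields you would need before a raw count told you anything:

```python
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    SAFETY_DRIVER_TAKEOVER = "safety driver detected bad behavior"
    SYSTEM_REQUESTED = "vehicle detected a problem and called for a human"

@dataclass
class DisengagementRecord:
    trigger: Trigger
    deliberate: bool   # planned test (e.g., an ODD-boundary check) vs. a surprise
    route: str         # easy, well-understood highway vs. curves and construction
    conditions: str    # traffic, weather, etc.
    notes: str

# Two weeks with identical disengagement counts can mean opposite things
# once these fields are filled in, and none of them appear in the
# statewide reports.
```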

Similarly, miles driven can be misleading, especially for a trucking company. Starsky is built around the idea that the long, straight highways that cross the U.S. are the routes that are hardest for humans to drive day after day but easiest for an AV. Driving from Houston, TX to Phoenix, AZ along I-10 is more than 1,100 miles. How would the difficulty of that task compare to driving 1,100 miles within an urban area like New York City, Los Angeles, or San Francisco? Surely it would be easier to accumulate miles by driving long highway stretches than by making many short trips through a complex urban environment. We could engage our system only when and where we knew we performed best, and claim very long runs with no disengagements. That wouldn’t be wrong, but it also wouldn’t tell the whole story of our system’s actual capabilities.

But because miles per disengagement sounds important and gets a lot of public attention, companies might be tempted to optimize for it.

Miles per disengagement can be easily “gamed”, which can lead to perverse outcomes; we’ve actually seen this at Starsky. At one point, early in our development, we had a big status board that tracked various system metrics, with miles per disengagement front and center. Management made it clear that this was a metric that mattered to investors, and sure enough, we saw the number improve. After a few weeks of improvement, our CTO went out with the truck and realized what was really going on. The truck still had issues, but the safety driver thought they were doing the company a favor by not disengaging as long as no other traffic was threatened by the truck. Our miles per disengagement went up! Yikes! This prompted a rethinking of our entire outlook on metrics, and changed how we approach communication and education with investors. Safety cannot be summarized in that one number.

Making progress is simply impossible without disengagements.

To move forward, to challenge and evolve a system, we need to take on harder tasks every time, and harder tasks mean more disengagements. Focusing on reporting “good” disengagement rates may distract companies from making actual progress.

Miles per disengagement might still be a metric we use internally, for comparing runs under similar conditions, to track the performance of different versions of software or hardware. (Even then, we’re more likely to look at more direct measures of performance, such as an actual measure of lane centering.) Disengagements are looked at individually, with the goal of understanding the specifics of each one and seeing what lessons can be learned.
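As an illustration of what a more direct measure might look like, here is a toy sketch of one lane-centering statistic, the RMS lateral offset from lane center over a run (hypothetical values and code, not our actual pipeline):

```python
import math

def rms_lane_offset(offsets_m):
    """Root-mean-square lateral offset from lane center, in meters."""
    return math.sqrt(sum(x * x for x in offsets_m) / len(offsets_m))

# Hypothetical logged offsets from two software builds on the *same* route:
new_build = [0.05, -0.10, 0.08, 0.02, -0.04]
old_build = [0.20, -0.25, 0.18, 0.15, -0.22]
print(rms_lane_offset(new_build))  # ~0.065 m
print(rms_lane_offset(old_build))  # ~0.203 m
# Comparing like with like (same route, similar conditions) is what makes
# the number meaningful; a single statewide ratio can't offer that.
```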

So, “number of disengagements” doesn’t mean much without context, and “total miles driven” doesn’t mean much without context. This means the metric “miles per disengagement” just doesn’t mean much without context either. Context that you won’t find in the reports to the state of California.

