Why AI progress is faster than Moore’s Law — the age of the algorithm

Written by anthony_sarkis | Published 2018/08/29
Tech Story Tags: artificial-intelligence | machine-learning | programming | future | ai-moores-law


Moore’s original 1965 paper, “Cramming More Components onto Integrated Circuits,” contained a number of incredible insights. One of them has been shorthanded into “computers get 2x faster every 2 years.”

Moore’s vision has come to pass. You can buy a “handy home computer” at Walmart and pick up some deodorant at the same time.

Computer algorithms for non-machine-learning problems have not seen nearly as much improvement. Quicksort, for example, a commonly used sorting algorithm, turns 60 years old next year.

Animation of quicksort, from Wikipedia
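For reference, here is a minimal quicksort sketch in Python. It is not the in-place partitioning variant, just the clearest way to show an algorithm whose core idea has been essentially unchanged for decades:

```python
def quicksort(items):
    """Plain recursive quicksort: pick a pivot, partition, recurse."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```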

Deep learning algorithm improvement

Machine learning models have been progressing at a much faster rate. Here’s an example: a comparison of a subsystem used in many object detection systems, called “region proposal networks”:

mAP (mean average precision) is a measure of how good the network is. The goal here is to achieve a similar score in much less time. Faster R-CNN is roughly 250x faster than the original R-CNN approach and about 10x faster than Fast R-CNN.

So how long did it take Faster R-CNN to get that 10x improvement? A few years?

No.

Both Fast and Faster R-CNN were published in the same year, 2015. Yes, that’s right: we saw a 10x speedup within a single year.

We see the error, or how many mistakes the AI makes, drop over time too. For example, in the ImageNet Large Scale Visual Recognition Challenge, the error has gone down from 28% to about 2% with Squeeze-and-Excitation networks. The big jump in 2012 came from the switch to a deep-learning-based approach with AlexNet, which has an incredible 27,571 citations as of this writing.

See http://www.image-net.org/challenges/LSVRC/

Algorithms that benefit from more data

Deep learning models benefit from more data. They make use of more data at train time to produce a better quality model. When the model gets used (at test time) it runs at a similar speed, regardless of the amount of data the original network was trained on³.
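As a rough sketch of that claim (not from the original post), here is a toy NumPy example: a fixed-size linear model is fit on increasingly large amounts of random data, yet a single prediction always costs the same one matrix multiply. The sizes are arbitrary assumptions for illustration.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def train(n_examples, n_features=512, n_outputs=10):
    """Toy 'training': least-squares fit on random data.
    The learned weight matrix is always n_features x n_outputs,
    no matter how many examples we fit it on."""
    X = rng.normal(size=(n_examples, n_features))
    Y = rng.normal(size=(n_examples, n_outputs))
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

x = rng.normal(size=(1, 512))
for n in (1_000, 10_000, 100_000):
    W = train(n)          # training cost grows with n...
    t0 = time.perf_counter()
    _ = x @ W             # ...but inference is one fixed-size matrix multiply
    print(f"trained on {n:>7} examples, predict took {time.perf_counter() - t0:.6f} s")
```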

That is in direct contrast to “traditional” programming, which typically tries to reduce the amount of data the system has to touch. In fact, there’s a whole notation, Big O, dedicated to reasoning about exactly this.

The availability of data is increasing too; camera data is one example. One way this helps machine learning models is transfer learning: more data can mean more powerful pre-trained models and easier access to new data for fine-tuning (a minimal sketch follows the chart below).

North America camera module market by application, 2012–2022 (USD million)
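Here is a minimal transfer learning sketch in PyTorch, assuming an ImageNet-pretrained ResNet-18 and a hypothetical 5-class task; it reuses the pretrained features and trains only a new classification head.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (downloads weights on first use).
model = models.resnet18(pretrained=True)

# Freeze the pretrained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Swap in a new head for a hypothetical 5-class problem; only it will train.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One training step on a random batch, purely as a shape/flow check.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```

With a real dataset you would loop this over batches of your own labeled images, but the shape of the idea is the same: most of the “knowledge” comes for free from the pretrained weights.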

Specialized hardware

Specialized hardware, such as GPUs, is also helping. This has been covered in a lot of depth by others, so I’ll just briefly show this example:

https://www.rtinsights.com/gpus-the-key-to-cognitive-computing/

Here we see roughly a 10x performance improvement over 6 years, whereas doubling every 2 years over the same period would equate to only an 8x improvement. Further gains are likely with even more specialized circuit designs.
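To get a feel for what that specialization buys, here is a small PyTorch sketch that times the same large matrix multiply on a CPU and, if one is available, on a GPU; the matrix size is an arbitrary assumption and the exact ratio depends entirely on your hardware.

```python
import time
import torch

def time_matmul(device, n=4096):
    """Time one n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure setup has finished
    t0 = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the GPU kernel to complete
    return time.perf_counter() - t0

print("cpu :", time_matmul("cpu"))
if torch.cuda.is_available():
    print("cuda:", time_matmul("cuda"))
```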

Progress

In contrast to traditional programming, AI algorithms have been improving at a faster pace. Hardware designed for these new algorithms drives further progress.

Algorithms + specialized hardware is driving this

But why is the hardware part outpacing Moore's law? The difference is hardware designed for a specific purpose versus simply adding more general-purpose power. A graphics unit, for example, is not always better than a central processor.

If your cellphone ran exclusively on a graphics unit, your battery might go dead halfway through the day, and common tasks would likely feel sluggish.

However, specialized hardware does see a big return on tasks it is good at, such as deep learning. This specialized hardware can be constructed in a way that bypasses CPU limitations.

“And these cores can be added with a linear increase in computational ability, bypassing today’s Moore’s Law limits…” — Bruce Pile¹

OK, but what about algorithms? How are they really different?

In instruction-driven programming we have provably optimal ways of doing certain things. If the assumptions underlying the proof hold, the proposed method is the best we think we will ever have for that problem.
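A classic example of what “provably optimal” means, added here as a sketch: the decision-tree lower bound for comparison-based sorting, which algorithms like mergesort already match to within a constant factor.

```latex
% Any comparison sort must distinguish all $n!$ possible input orderings,
% so its decision tree has at least $n!$ leaves and therefore height at least
\[
  \log_2(n!) \;\ge\; \log_2\!\left(\frac{n}{2}\right)^{n/2}
  \;=\; \frac{n}{2}\,\log_2\frac{n}{2}
  \;=\; \Omega(n \log n) .
\]
% No comparison-based sort can beat this, which is the sense in which
% existing sorting algorithms are already essentially optimal.
```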

We haven’t gotten to “provably optimal” yet with deep learning based approaches!

There is a huge effort in the community to advance the state of the art, to discover entirely new approaches and optimize existing ones. A specific example of this is capsule networks.

At their “lowest” level, capsule networks operate on a vector of values, e.g. [1, 4, 65, 1], whereas the neurons in a normal network each operate on a single value, e.g. 1. As you can imagine, going from one number to a whole vector of numbers opens up a new world of possibilities.
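A tiny NumPy sketch of that difference, using the “squash” nonlinearity from the capsule networks paper; the input and capsule dimensions here are made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                 # 8 input features

# A standard neuron: a weight vector maps the input to one scalar activation.
w = rng.normal(size=8)
scalar_out = np.maximum(0.0, w @ x)    # ReLU(w·x): just a single number

# A capsule: a weight matrix maps the input to a small vector, then "squash"
# shrinks its length into (0, 1) while preserving its direction.
W = rng.normal(size=(4, 8))            # hypothetical 4-dimensional capsule
s = W @ x
norm = np.linalg.norm(s)
capsule_out = (norm**2 / (1 + norm**2)) * (s / norm)

print("standard neuron output:", scalar_out)    # one scalar
print("capsule output vector :", capsule_out)   # a 4-vector with length < 1
```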

Why does this matter?

A key measure of the usefulness of computers has been the software that runs on them. Imagine a smartphone without your favorite app or a TV without Netflix.

While software capabilities have been advancing, the ability to deliver those advances has mostly come from better hardware, not from improved algorithms.

Algorithms underpinning traditional software systems may have reached diminishing returns. In addition to the example above, consider the A* algorithm.

A* illustration from Wikipedia

First published in 1968, it remains one of the best general approaches to pathfinding and is therefore still taught in computer science classrooms today.
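For the curious, here is a compact A* sketch in Python over a small grid with a Manhattan-distance heuristic; the grid and heuristic are illustrative assumptions, but the algorithm itself is the same one described in 1968.

```python
import heapq

def a_star(grid, start, goal):
    """A* over a 2D grid of 0 (free) / 1 (wall) cells.
    Returns the path from start to goal, or None if unreachable."""
    def h(cell):  # Manhattan-distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # entries are (f = g + h, g, cell)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        f, g, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:   # walk back to reconstruct the path
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + dr, current[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue                  # off the grid
            if grid[nxt[0]][nxt[1]] == 1:
                continue                  # wall
            new_g = g + 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                came_from[nxt] = current
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))
```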

The age of the algorithm

We are now living in a time where deep learning algorithms are seeing massive advancements, far outpacing previous algorithmic improvement. This is compounding with improvements in hardware, leading to incredible growth in AI effectiveness.

Region proposal networks, a building block of many object detection models, saw a 10x speedup within a single year.

Fast growing technologies create hard to predict outcomes.

http://www.nydailynews.com/news/world/check-contrasting-pics-st-peter-square-article-1.1288700

We saw entire industries get disrupted over the last 30 years with technology progressing at a rate of 2x every 2 years. The compound effect of algorithmic improvement plus hardware improvement will make that previous progress look slow.

We weren't ready for the last wave of disruption, and we are doubly unprepared for the next one. If we thought the last 30 years were disruptive, we will be blown away by the next 30.

This progress will continue to drive benefits as diverse as self-driving cars, skin cancer detection, and smarter grammar checks. It may also lead to autonomous weapons and affect the majority of jobs that exist today.

A call for rethinking the meaning of work

People often define themselves by their work. You can't go to a social event without the inevitable "so, what do you do?" question popping up. Maybe you are even the one asking it!

But how do you define yourself by your work if an AI can do your job better than you can? What happens when the estimated time for an AI to be able to do a role is shorter than the time it takes a human to train for it?

We have always created new jobs in the past — but we have now reached a tipping point when it comes to training and learning new skills.

“It’s quite obvious that we should stop training radiologists,” — Geoffrey Hinton, computer scientist (link below)

Kai-Fu Lee has some great ideas in his TED talk; it centers on working with AIs.

Source: Kai-Fu Lee https://www.youtube.com/watch?v=ajGgd9Ld-Wc&t=5s

I think education will play a big role, not just for people developing software but more and more for people using it. Companies like Udacity are making world-class education more accessible than ever before, as of course MIT and Harvard have been doing for a while with MIT OpenCourseWare and HarvardX.

We need a growing awareness of how much these behind-the-scenes algorithms already run our daily lives, and how much more they will in the future.

In the age of the algorithm how will you define yourself? Could an AI do your job? What about your co-workers?

Thanks for reading!

1 https://www.forbes.com/sites/kenkam/2018/04/23/how-moores-law-now-favors-nvidia-over-intel/#7c498c3f5e42

2 https://www.economist.com/leaders/2018/06/07/ai-radiology-and-the-future-of-work

3 Technically, if you keep increasing the data you train on, you may need a larger network (more "representation capacity"). But assuming you keep the network the same size and are only comparing the cost of adding data at train time, the network trained on more data will get better-quality results (from better weights) while its running time stays the same.

