The Future of Machine Learning Hardware

Written by philjama | Published 2016/09/03
Tech Story Tags: artificial-intelligence | hardware | machine-learning | gpu | fpga

Let's take a high-level look at the evolution of computational hardware, focusing on applications to machine learning (ML) and using cryptocurrency mining as an analogy.

I posit that the machine learning industry is undergoing the same progression of hardware as cryptocurrency did years ago.

Machine learning algorithms often consist of matrix (and tensor) operations. These calculations benefit greatly from parallel computing, which is why model training is increasingly performed on graphics cards rather than only on the CPU.
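
As a rough illustration, here is a minimal NumPy sketch (the layer sizes are made up for this example) showing that the forward pass of a fully connected layer boils down to a single large matrix product, exactly the kind of work a GPU parallelizes well:

    import numpy as np

    # Made-up sizes for illustration: a batch of 64 inputs through one dense layer.
    batch, n_in, n_out = 64, 1024, 512
    x = np.random.randn(batch, n_in).astype(np.float32)   # input activations
    W = np.random.randn(n_in, n_out).astype(np.float32)   # layer weights
    b = np.zeros(n_out, dtype=np.float32)                  # biases

    y = x @ W + b      # one forward pass: a (64 x 1024) by (1024 x 512) matrix product
    print(y.shape)     # (64, 512)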

The natural progression of computational hardware goes:

  1. Central Processing Unit (CPU)
  2. Graphics Processing Unit (GPU)
  3. Field Programmable Gate Array (FPGA)
  4. Application-Specific Integrated Circuit (ASIC)

Each step in this progression of technologies produces tremendous performance advantages.

Performance can be measured in a number of ways (a small worked example follows this list):

  • computational capacity (or throughput)
  • energy-efficiency (computations per Joule)
  • cost-efficiency (throughput per dollar)
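
To make the energy-efficiency and cost-efficiency metrics concrete, here is a small Python sketch; the numbers are purely hypothetical, not benchmarks of any particular device:

    # Hypothetical figures for one accelerator, chosen only to show the arithmetic.
    throughput = 6.0e12      # operations per second (assumed)
    power_watts = 250.0      # sustained board power (assumed)
    price_dollars = 1200.0   # purchase price (assumed)

    energy_efficiency = throughput / power_watts    # computations per Joule
    cost_efficiency = throughput / price_dollars    # throughput per dollar

    print(f"{energy_efficiency:.2e} ops per Joule")
    print(f"{cost_efficiency:.2e} ops/s per dollar")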

Orders of Magnitude

For comparison, let’s consider the task of mining cryptocurrencies, which demands substantial computing power in exchange for financial gain. Since the introduction of Bitcoin in 2009, the crypto-mining industry evolved from using CPUs, to GPUs, to FPGAs, and finally to ASIC systems.

Each step in the hardware evolution provided orders of magnitude in performance improvement. Below is an approximation of performance relative to a single-core CPU representing 1 computational unit:

  • Single-core CPU: 1
  • Multi-core CPU: 10
  • GPU: 100
  • FPGA: 1,000
  • ASIC: 10,000 to 1,000,000

These numbers are based on the performance factors (such as throughput and efficiency) observed through the cryptocurrency-mining evolution. [1,2]

General-Purpose Computing (CPU & GPU)

Prior to 2001, general-purpose computing was done on the CPU, while GPUs handled only the computation needed to render graphics.

General-purpose computing on graphics cards became practical when computer scientists developed matrix multiplication and factorization techniques that ran faster and more efficiently on the GPU than on the CPU.

Since then, there have been notable efforts to create programming frameworks that allow general-purpose computing on GPUs, most notably CUDA and OpenCL.
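
As a minimal sketch using the TensorFlow 1.x-era API (contemporary with this article, and built on top of CUDA), a large matrix multiply can be pinned to the GPU in a few lines; the matrix sizes here are arbitrary:

    import tensorflow as tf

    # Request placement on the first CUDA-capable GPU.
    with tf.device('/gpu:0'):
        a = tf.random_normal([2048, 2048])
        b = tf.random_normal([2048, 2048])
        c = tf.matmul(a, b)          # executes as a single GPU kernel

    # allow_soft_placement falls back to the CPU if no GPU is present.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(c).shape)     # (2048, 2048)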

NVIDIA Titan X Graphics Card

However, GPUs are notoriously power-hungry. Nvidia rates its Titan X graphics card at 250W and recommends a 600W system power supply. At $0.12/kWh, a 600W system running around the clock translates to roughly $50 per month in electricity (see the quick arithmetic below). Nvidia will likely continue to address these concerns in future products.
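
The back-of-the-envelope arithmetic, assuming the system draws the full 600W continuously:

    # Monthly electricity cost for a 600 W system running 24/7 at $0.12/kWh.
    power_kw = 0.6
    hours_per_month = 24 * 30
    rate_per_kwh = 0.12

    monthly_cost = power_kw * hours_per_month * rate_per_kwh
    print(f"${monthly_cost:.2f} per month")   # about $51.84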

Specialized Hardware: FPGA

Field-programmable gate arrays (FPGA) are integrated circuits whose logic blocks can be programmed and reconfigured using a hardware description language (HDL).

In the case of cryptocurrency, FPGA boards marked the transition to mining with specialized hardware.

A series of FPGA-based mining systems provided the next order-of-magnitude increase in throughput, as well as in energy efficiency (as electricity costs pushed the break-even point in favor of low-power systems).

Efforts are underway to implement machine learning models using FPGAs. For instance, Altera showcases an implementation of the AlexNet convolutional neural network used to classify images. [3]

In late 2012, Microsoft started exploring FPGA-based processors to accelerate its Bing search engine. [4]

Currently, FPGAs only match GPUs in throughput; however, they consume less energy for the same workload, making them more feasible in low-power environments (such as self-driving cars).

Purpose-Built ASICs

Cryptocurrency mining continued its evolution to specialized hardware and ASICs quickly became the only competitive option.

The same trend has already started in machine learning.

TPU servers, AlphaGo with Lee Sedol

In May 2016, engineers at Google announced that they had created an ML-specialized ASIC called the Tensor Processing Unit (TPU). [5]

TPU servers power Google's RankBrain search-ranking system [6], Street View, and even the AlphaGo system that beat world champion Lee Sedol.

Google has been using TPUs since 2015 and has found them to deliver an order of magnitude better-optimized performance per watt for machine learning. [5]

This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore’s Law).
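
As a rough sanity check on that claim, assuming one performance doubling roughly every two years, a 10x gain does work out to about three doublings and roughly seven years:

    import math

    # A 10x gain expressed as doublings, at ~2 years per doubling (Moore's Law).
    gain = 10.0
    doublings = math.log2(gain)    # ~3.3 doublings ("generations")
    years = 2 * doublings          # ~6.6 years
    print(f"{doublings:.1f} doublings, about {years:.1f} years")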

Future and Next Steps

It appears that demand for deep learning and statistical inference is driving the hardware industry towards ML-specialized hardware.

Currently, Google leads with ASICs, their top competitors run FPGAs, and the rest of us are heating our homes with GPUs.

When will ML-specialized ASIC technology become commercially available?

Will the industry adopt an open framework such as OpenCL as a basis for heterogeneous computing? Progress is already being made by popular ML libraries such as TensorFlow and Caffe.

Will this exponential evolution continue, or plateau at some physical barrier? The next steps in this hardware evolution may include new materials, biological computing, or quantum computing.

Imagine specialized ASIC chips thousands of times more powerful than today’s top ML hardware. What new AI applications will become feasible? What will become possible when their energy efficiency makes them viable for embedded devices such as smartphones, IoT, and wearables?

As AI applications expand, demand for ML-specialized devices is driving hardware into the next phases of its evolution. It will be fascinating to see the impact of these technologies in healthcare, medicine, transportation, and robotics. Many exciting steps in the evolution of machine learning still remain.

References

  1. Non-specialized hardware comparison: https://en.bitcoin.it/wiki/Non-specialized_hardware_comparison

  2. Mining hardware comparison: https://en.bitcoin.it/wiki/Mining_hardware_comparison

  3. CNN Implementation on Altera FPGA Using OpenCL: https://www.altera.com/solutions/technology/machine-learning/overview.highResolutionDisplay.html

  4. Microsoft Working on Re-configurable Processors to Accelerate Bing: http://www.datacenterknowledge.com/archives/2014/06/27/programmable-fpga-chips-coming-to-microsoft-data-centers/

  5. Google supercharges machine learning tasks with TPU custom chip: https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html

  6. Google Turning Its Lucrative Web Search Over to AI Machines: http://www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines

  7. High-Performance Hardware for Machine Learning: https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf

