The Mac Pro: A Case for Expansion

Written by tompark | Published 2017/04/14
Tech Story Tags: apple | mac-pro | deep-learning | gpu | multi-gpu


Some things I learned while building a GPU rig for deep learning

During this month’s briefing on the Mac Pro, Apple said (I’m paraphrasing) that they tried to put a TITAN X in the 2013 Mac Pro, but it ran too hot and caused the CPU to throttle. That was my interpretation, at least.

Apple has a tendency to put underpowered GPUs in their desktops, so it’s great to hear they’re seeking to improve this for the next Mac Pro. They’ve gotten plenty of feedback that “pro” users want more powerful GPUs, such as NVIDIA’s GTX 1080 or 1080 Ti.

2017 Mac Pro concept by Pascal Eggert — [Source]

Decades ago, CPUs added SIMD units (e.g. Intel’s MMX) for number crunching. Despite techniques to improve use of the CPU, there’s no denying a recent shift away from the CPU to the GPU for this kind of computation.

GPUs have become so powerful that we’re using them for much more than graphics, which has led to the rise of general-purpose computing on GPUs. It’s a prominent time for GPUs: AMD will be competing at the high end again with its Vega series, and NVIDIA’s Volta is coming soon after that.

“Most of the software out there that’s been written to target [certain kinds of high-end cinema production tasks] doesn’t know how to balance itself well across multiple GPUs but can scale across a single large GPU.”~ Craig Federighi

It’s true that some apps don’t take much advantage of multiple GPUs, but the real point is that two older, weaker GPUs can’t substitute for a single new, powerful one, and we shouldn’t generalize beyond that context. It’s still fairly likely that Apple will sell a Mac Pro configuration in 2019 containing two GPUs.

So you need a fast computer

A few years ago, if you were to buy a new Mac Pro, you could spend more to upgrade the CPU ($3500) than the whole base machine cost by itself ($3000).

[Source]

Earlier this month, Apple updated their Mac Pro configurations and pricing to something much more reasonable by today’s standards. A CPU upgrade from 6-core to 12-core is now merely $2000:

[Source]

That cost roughly matches the difference in Intel’s list prices between the 6-core ($580) and 12-core ($2600) processors, so Apple is simply passing along Intel’s suggested pricing.

Note that the 6-core part is clocked about 30% higher than the 12-core one, so you’d be upgrading to slower cores. Apps that rely on single-core performance will actually run slower on the more expensive CPU.

Now consider that $700 will buy you an NVIDIA GTX 1080 Ti, which as of last month was the most powerful GPU available to the public. For about the same cost as adding 6 CPU cores, you can have 3 of the most powerful consumer GPUs in the world.

It’s a shame you won’t be able to plug all three of those monsters into the next Mac Pro. Assuming, that is, that Apple designs it with expansion constraints similar to those of the cheese grater tower, which was the most expandable Mac Pro they’ve ever made.

Two mid-range GPUs. Or a single high-end one.

Apple supported dual GPUs before the 2013 trash can Mac Pro. In 2012, the cheese grater Mac Pro had a stock configuration that came with two AMD Radeon HD 5770 GPUs.

A dual-GPU, modular Mac Pro (mid-2012) — Photo: Tom Park / CC BY-SA

As with the GPUs in the subsequent 2013 Mac Pro, these were considered mid-range, not high-end. They are rated under 110 watts TDP, so they draw less power and generate less heat than the high-end GPUs (250 watts TDP is typical nowadays) you’d want in a workstation-class tower.

“But you could replace those cards with more powerful ones,” one might say. While that’s possible, the box wasn’t designed to allow it.

It has a decent power supply, 980 watts. However, it provides only 150 watts to each video card, or a total of 300 watts. To support a pair of 250-watt GPUs, like the GTX 1080 Ti, you’d have to supply at least 200 watts more power somehow, possibly by wiring additional power cables from the drive bay ports or from the remaining two expansion slots.

Back in 2009–2012 when this Mac Pro model was being built, high-end GPUs tended to be 180–210 watts each, but even then you’d still be short at least 60 watts. In fact, Apple offered an alternative Mac Pro configuration with a single high-end AMD Radeon HD 5870, rated at 228 watts TDP. You couldn’t insert another one without additional custom wiring. The cheese grater case was simply not designed for a pair of high-end GPUs.
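
To put numbers on that shortfall, the budget works out roughly like this (using the TDP figures cited above):

```python
# Back-of-the-envelope power budget for the cheese grater's two GPU slots.
slot_budget = 2 * 150    # watts the machine supplies for its two video cards

modern_pair = 2 * 250    # two GTX 1080 Ti-class cards
era_pair    = 2 * 180    # two high-end cards from the 2009-2012 era

print(modern_pair - slot_budget)   # 200 watts short for a modern pair
print(era_pair - slot_budget)      # 60 watts short even back then
```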

On the other hand, this Mac Pro had a replaceable CPU module that allowed upgrading the system from single to dual CPUs. That’s pretty cool — in most other computers, you’d have to replace the motherboard to upgrade to dual CPUs. But nowadays you’d want this kind of upgradeability with GPUs.

OK, so let’s suppose Apple’s next Mac Pro is modular and expandable, like the cheese grater tower, and fitted with the latest tech: PCIe 4.0, Oculink-2, Thunderbolt 3, USB-C 3.1, M.2 NVMe SSDs, and LGA-2066 or possibly LGA-3647 sockets for high-end CPUs with 44 lanes of PCIe connectivity.

That sounds promising, but how many high-end GPUs will it support? Maybe two, if you fiddle with it?

Pfft, how many GPUs do you really need?

This question reminds me of those people last year who were saying no one legitimately needs more than 16 GB of RAM in a MacBook Pro.

“Own the world’s greatest gaming computer and convince everyone it’s for your research” — http://imgur.com/wiLsGqA

Three GPUs in a computer might seem ridiculous. Among the single-digit percentage of Mac customers who buy a Mac Pro, an even smaller fraction would need more than two GPUs.

There’d never be a controversy about Apple supporting only two GPUs. But if one came up, a person might claim that no one legitimately needs more than two GPUs. That person would be wrong.

Two weeks, two weeks, two weeks

If you try to build an AI program that recognizes 1000 types of objects in a photo, one of the first things you’ll discover is that even with multiple GPUs it can take weeks or months to train a neural net.

We can train a model from scratch to its best performance on a desktop with 8 NVIDIA Tesla K40s in about 2 weeks.~ Jon Shlens, Google Research

The common advice for avoiding this delay is to adapt a neural net that’s already been trained, using a technique called “transfer learning”. But depending on what you want to do, you can’t always build on someone else’s pre-trained net — at some point you’re stuck in a process where each iteration could take weeks. If at that point you’re using only one GPU, adding a couple more would reduce the turnaround time significantly.
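
For a sense of what transfer learning looks like in practice, here’s a minimal sketch using PyTorch and torchvision (my choice of tools for illustration; the pretrained ResNet-50 and NUM_CLASSES are placeholder assumptions, not anything specific to the work quoted above):

```python
# A minimal transfer-learning sketch, assuming PyTorch and torchvision are installed.
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 10  # hypothetical number of categories in your own dataset

# Start from a network already trained on ImageNet.
model = models.resnet50(pretrained=True)

# Freeze the pretrained layers so only the new head learns.
for param in model.parameters():
    param.requires_grad = False

# Swap in a classification layer sized for your problem.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# From here you train just model.fc (and optionally fine-tune deeper layers)
# on your own data, which usually takes hours rather than weeks.
```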

Suddenly synthesized

A common problem with neural nets created by “supervised training” is that they need a huge amount of training data, or else slight variations in input will fool them. Suppose, for example, you train a neural net to recognize cats, but your training data consists of photos where all the cats are sitting upright. It might not recognize a cat that’s upside down. Your program will handle these kinds of variations better if you augment the training set with rotated and resized copies of the originals.
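
That kind of augmentation is easy to do on the fly. Here’s a minimal sketch using torchvision transforms (assuming the same PyTorch/torchvision setup as above; the specific rotation and crop settings are arbitrary examples):

```python
# On-the-fly augmentation: rotated and resized variants of each training image.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                 # tilt the cats a bit
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # vary size and framing
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Pass this as the transform of an image dataset, e.g.
#   datasets.ImageFolder("cats/", transform=augment)
# so every epoch sees slightly different variants of the same photos.
```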

In a notable example, Baidu Research made a breakthrough in speech recognition, aided by layering noise over their initial data set.

Baidu gathered about 7,000 hours of data on people speaking conversationally, and then synthesized a total of roughly 100,000 hours by fusing those files with files containing background noise.~ Derrick Harris, GigaOm

But then they had roughly 14 times more data, which takes that much longer to process. A few years ago Baidu used 8 GPUs; more recently they have reported training with as many as 40 or even 128 GPUs.
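
The mixing step itself is conceptually simple. Here’s a rough NumPy sketch of the idea (my own illustration of noise overlay, not Baidu’s actual pipeline), assuming the speech and noise are waveforms sampled at the same rate:

```python
import numpy as np

def mix_with_noise(speech, noise, snr_db=10.0):
    """Overlay background noise on a speech waveform at a target signal-to-noise ratio."""
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    # Scale the noise so the speech-to-noise power ratio hits the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Fusing each clean recording with many different noise files (and SNR levels)
# multiplies the effective size of the training set.
```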

Now you have two problems

Recently, some of the coolest results in AI have come from a technique in which two neural nets compete against each other (“generative adversarial networks”): one learns to generate data while the other learns to recognize whether that data is real.

With a larger model and dataset, Ian [Goodfellow] needed to parallelize the model across multiple GPUs. Each job would push multiple machines to 90% CPU and GPU utilization, but even then the model took many days to train.~ OpenAI Blog

That means you’re training two neural nets, not just one. Again, you’ll want as much computing power as you can get.
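
For a sense of what “parallelize the model across multiple GPUs” can look like, here’s a minimal sketch using PyTorch’s DataParallel wrapper (one common approach, not necessarily the setup described in the OpenAI post; the two single-layer “networks” are placeholders):

```python
# Minimal sketch of spreading GAN training across GPUs with PyTorch's DataParallel.
import torch
import torch.nn as nn

generator = nn.Linear(100, 784)       # placeholder generator
discriminator = nn.Linear(784, 1)     # placeholder discriminator

if torch.cuda.is_available():
    if torch.cuda.device_count() > 1:
        # Each forward pass now splits its batch across all visible GPUs.
        generator = nn.DataParallel(generator)
        discriminator = nn.DataParallel(discriminator)
    generator = generator.cuda()
    discriminator = discriminator.cuda()

# The usual GAN training loop then alternates updates to the two networks,
# so you really are training two models at once.
```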

Would you pay extra for expansion slots?

When I started doing deep learning work, I used GPU spot instances on Amazon cloud. After running up a small bill, I began looking into external GPU boxes, but ended up assembling a PC.

The GPUs on AWS are now rather slow (one GTX 1080 is four times faster than a AWS GPU) and prices have shot up dramatically in the last months. It now again seems much more sensible to buy your own GPU.~ Tim Dettmers

I bought just one GPU. But knowing that I might need to add more, I made sure that the PC would be able to take up to four.

It’d been a long time since I’d built a computer from components, and way back then I wasn’t paying attention to how many GPUs I could put in it. To my surprise, many computers cannot support two GPUs at full bandwidth, and most cannot support more than two. There are only a few expensive motherboards that can support 4 GPUs simultaneously with x16 lanes of PCIe 3.0 connectivity.

The ability to add 3 more GPUs cost me about $600. That’s about one-third of what the base system (without GPUs) would cost if built with standard components. This amount represents the higher prices for a quad-GPU-capable motherboard, a big 1500-watt power supply, and a CPU with 40 PCIe lanes (as opposed to 28). I literally paid extra for expandability itself.
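
If you want to check what your own machine actually gives each card, NVIDIA’s management library can report the negotiated PCIe link width. A minimal sketch, assuming NVIDIA GPUs and the pynvml package:

```python
# Report the current vs. maximum PCIe link width of each NVIDIA GPU.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    curr = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    peak = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
    print(f"GPU {i} ({name}): running at x{curr} of a possible x{peak}")
pynvml.nvmlShutdown()
```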

Would Apple sell additional expansion as an option?

The Mac Pro is positioned as Apple’s most powerful computer, but computing power is coming in various forms. High-speed expansion slots provide the ability to upgrade computing power in whatever form it takes in the future. We’re not just talking about GPUs here — it might be FPGAs, or ASICs like Google’s TPU. Or fast storage.

Computer manufacturers generally do not sell a single line of desktop computers with different cases. For example, you don’t usually see something like this:

2013 Mac Pro concept by Scott Richardson —[Source]

Instead, the models of a computer line are positioned according to CPU speed, storage, and/or screen size. So this is what’s typical:

https://www.apple.com/mac-pro/specs/

But doesn’t it make sense to sell optional expandability? It has value. You can put a price on it.

C’mon, you have to admit it’d be pretty cool 😎

I know what you’re thinking: “There’s no way Apple would do this. It’s not worthwhile because there aren’t enough customers for the bigger Mac Pro.”

You’re probably right.

And besides, it’ll have Thunderbolt 3, and maybe Oculink-2, which is even faster, so people can connect external GPUs if they need them… you know, like this:

Photo: Peter Wiggins — [Source]

Anyway, I’m looking forward to seeing how Apple rethinks the Mac Pro.

When we hit three GPUs, we are technically in a niche category… From talks with ASUS, despite the fact that a product may be geared towards a niche market, that product may sell well to the standard market if it is perceived to be good.~ Ian Cutress, AnandTech (review of multi-GPU boards)

Q&A

Asking the tough questions.

What’s the big deal about multiple GPUs — weren’t bitcoin miners putting 6 or 8 GPUs on a PC in a milk crate?

Yes, but cryptocurrency algorithms can run on a GPU without much data transfer, so miners could use PCIe riser cables or a PCIe splitter and connect each GPU over just one PCIe lane. That means they didn’t need a 40-lane CPU or a quad-x16-slot motherboard.

In bitcoin mining, GPUs were overtaken by ASICs. Won’t that happen in AI too?

Probably; Nervana and Groq, at least, are working on that. In the meantime, GPUs have a versatility that makes them useful for the foreseeable future.

ASICs have lower power requirements than GPUs, so wouldn’t a big power supply be unnecessary for multiple ASICs?

Maybe, but then you might just get more ASICs on each expansion card.

You mentioned FPGAs and ASICs, but what about quantum processors?

Oh yeah, those too I guess.

Why do you need 16 lanes per GPU? Games don’t show any difference between GPUs running on x8 vs x16.

It’s true that games show no significant difference, but that’s probably because they’re tuned to run well on GPUs at x8. Games don’t saturate an x16 PCIe 3.0 connection, but other apps, like deep learning programs, can.
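
A crude way to see where your own setup tops out is to time host-to-GPU copies of a large buffer. A rough sketch with PyTorch (assuming a CUDA GPU; real numbers vary with pinned memory, driver, and slot):

```python
# Rough host-to-device copy bandwidth test. An x16 PCIe 3.0 slot typically
# measures around 12-13 GiB/s in practice; x8 comes in at about half that.
import time
import torch

data = torch.randn(256, 1024, 1024).pin_memory()   # ~1 GiB of float32, pinned

torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    gpu_copy = data.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.time() - start

gib_moved = 10 * data.numel() * 4 / 2**30
print(f"~{gib_moved / elapsed:.1f} GiB/s host-to-device")
```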

How does a CPU with only 40 PCIe lanes connect to 4 GPUs that each use 16 lanes?

Right, 4×16 = 64, which is more than 40. The motherboard has two PCIe switches, each of which multiplexes 16 lanes from the CPU out to 32 lanes feeding two slots. So the 4 slots consume only 32 lanes from the CPU.

Apple used a similar kind of PCIe switch for the Thunderbolt 2 controllers in the 2013 Mac Pro.

Don’t those PCIe switches introduce a lot of latency?

Some people say the latency is so bad that it’s better to just use 8 lanes, but that’s a myth. Apparently this idea spread due to game testing that showed slightly higher frame rates on x8 GPUs than on switched x16 GPUs. But games are not a good benchmark for this because they don’t exceed x8 bandwidth; you can get similar results on an unswitched x16 slot.

I haven’t seen many rigorous tests, but an NVIDIA benchmark showed very little added latency or bandwidth loss across a PLX (Avago/Broadcom) switch.

Show us your rig. Pics or it didn’t happen.

Here it is next to the cheese grater:

An open case rig next to a Mac Pro 5,1 — Photo: Tom Park / CC BY-SA

What the… ugly. Why no case?

OK, so I have to vent a little about motherboard and case design. Cool desktop computer designs are totally thwarted by the way uATX, mITX, and ATX boards are standardized. This sector has so much room for innovation with non-standard motherboards and backplanes.

A lot of people are calling for Apple to just put the next Mac Pro in a big box case, but I think it’d be a shame if that’s exactly what Apple does. They went to extremes with the 2013 Mac Pro but reverting to a standard tower box isn’t so appealing either.

I wanted a smallish form factor but the motherboard is so big (extended ATX) that it would have to go in a fairly large case, and I didn’t like anything I saw. So I experimented with MakerBeam rails instead of imitating NVIDIA’s DevBox with a Carbide Air 540 case. It ended up being taller than I intended, but slim in other dimensions.

A single-fan liquid cooler has been completely adequate for the CPU. The GPU is connected with an x16 riser cable and can slide all the way over the CPU, since the CPU doesn’t have a huge heat sink on top of it. That means multiple GPUs can be spaced apart with wide air gaps.

It’s surprisingly quiet with two 140mm fans, even under sustained heavy load. What did make a huge amount of noise was a 5TB hard drive that was hammering away while feeding images to a convnet. After a couple of annoying days of that, I switched to an SSD. The rig has no drive bays — there’s an M.2 drive, and additional SSDs can be velcroed onto a side rail.

