Why Dataproc — Google’s managed Hadoop and Spark offering is a game changer.

Written by thetinot | Published 2017/01/04

So far I’ve written articles on Google BigQuery (1,2,3,4,5), on cloud-native economics (1,2), and even on ephemeral VMs (1). One product that really excites me is Google Cloud Dataproc — Google’s managed Hadoop, Spark, and Flink offering. In what seems at first glance to be a fully commoditized market, Dataproc manages to create significant differentiated value that promises to transform how folks think about their Hadoop workloads.

Jobs-first Hadoop+Spark, not Clusters-first

The typical mode of operation for Hadoop — on premises or in the cloud — requires you to deploy a cluster, and then you proceed to fill up said cluster with jobs, be it MapReduce jobs, Hive queries, SparkSQL, etc. Pretty straightforward stuff.

The standard way of running Hadoop and Spark.

Services like Amazon EMR go a step further and let you run ephemeral clusters, enabled by the separation of storage and compute through EMRFS and S3. This means you can discard your cluster once the workload completes, while keeping state on S3.

Google Cloud Platform has two critical differentiating characteristics:

  • Per-minute billing (Azure has this as well)
  • Very fast VM boot up times

When your clusters start in well under 90 seconds (under 60 seconds is not unusual), and when you do not have to worry about wasting that hard-earned cash on your cloud provider’s pricing inefficiencies, you can flip this cluster->jobs equation on its head. You start with a job, and you acquire a cluster as a step in job execution.

If you have a MapReduce job, then as long as you’re okay with paying the roughly 60-second initial boot-up tax, rather than submitting the job to an already-deployed cluster, you submit it to Dataproc, which creates a cluster on your behalf, on demand. A cluster is now a means to an end for job execution.

Demonstration of my exquisite art skills, plus illustration of the jobs before clusters concept realized with Dataproc.
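
To make the jobs-first pattern concrete, here is a rough sketch of the scripted create, submit, and delete flow using the gcloud CLI from Python. The cluster name, region, worker count, and example jar path are placeholders of my own choosing and may differ by Dataproc image version:

```python
import subprocess

CLUSTER = "ephemeral-job-cluster"  # placeholder name
REGION = "us-central1"             # placeholder region

def run(cmd):
    """Echo and run a gcloud command, failing loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Acquire a cluster as a step in job execution.
run(["gcloud", "dataproc", "clusters", "create", CLUSTER,
     "--region", REGION, "--num-workers", "2"])

try:
    # 2. Run the job (the stock SparkPi example that ships on Dataproc images).
    run(["gcloud", "dataproc", "jobs", "submit", "spark",
         "--cluster", CLUSTER, "--region", REGION,
         "--class", "org.apache.spark.examples.SparkPi",
         "--jars", "file:///usr/lib/spark/examples/jars/spark-examples.jar",
         "--", "1000"])
finally:
    # 3. Discard the cluster; durable inputs and outputs live in GCS, not HDFS.
    run(["gcloud", "dataproc", "clusters", "delete", CLUSTER,
         "--region", REGION, "--quiet"])
```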

Again, this is only possible with Google Dataproc because of:

  • high granularity of billing (per-minute)
  • very low tax on initial boot-up times
  • separation of storage and compute (and ditching HDFS as the primary store).

Operational and economic benefits are obvious and easily realized:

  • Resource isolation through tenancy segregation avoids non-obvious bottlenecks and resource contention between jobs.
  • Simplicity of management — no need to actually manage the cluster or juggle resource allocation and priorities through things like the YARN resource manager. Your dev/stage/prod workloads are now intrinsically separate — and what a pain that is to resolve and manage elsewhere!
  • Simplicity of pricing — no need to worry about rounding up to the nearest hour.
  • Simplicity of cluster sizing — to get the job done faster, simply ask Dataproc to deploy more resources for the job. When you pay per-minute, you can start thinking in terms of VM-minutes (see the quick math after this list).
  • Simplicity of troubleshooting — resources are isolated, so you can’t blame your problems on other tenants.
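
To illustrate the VM-minutes framing from the cluster-sizing bullet, here is the smallest possible bit of arithmetic. The hourly rate is a made-up placeholder, and it assumes the job scales roughly linearly with worker count:

```python
# Back-of-the-envelope VM-minutes math; assumes the job scales ~linearly
# and uses a made-up hourly rate, not anything from a real price sheet.
RATE_PER_VM_HOUR = 0.10  # hypothetical $/VM-hour

def job_cost(workers, minutes):
    vm_minutes = workers * minutes
    return vm_minutes, vm_minutes / 60 * RATE_PER_VM_HOUR

# Same amount of work, two cluster sizes:
print(job_cost(10, 60))  # 600 VM-minutes, $1.00, done in an hour
print(job_cost(20, 30))  # same 600 VM-minutes, $1.00, done in half the time
```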

I’m sure I’m forgetting others. Feel free to leave a comment here to add color. Best response gets a collectors’ edition Google Cloud Android figurine!

Dataproc is as close as you can get to serverless and cloud-native pay-per-job with VM-based architectures — across the entire cloud space. There’s nothing even close to it in that regard.

Dataproc does have a 10-minute minimum for pricing. Add the sub-90-second cluster creation time, and you rule out many relatively lightweight ad-hoc workloads. In other words, this works for big, serious batch jobs, not ad-hoc SQL queries that you want to run in under 10 seconds. I write on this topic here. (Do let us know if you have a compelling use case that leaves you asking for less than a 10-minute minimum.)
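
To see why the minimum stings for short jobs but washes out for long ones, here is a simplified back-of-the-envelope model. It glosses over exactly how Dataproc rounds usage; the numbers are only meant to show the shape of the overhead:

```python
import math

MINIMUM_MINUTES = 10   # Dataproc's billing minimum, per the text above
BOOT_MINUTES = 1.5     # the ~90-second cluster creation tax

def billed_minutes(job_minutes):
    """Rough model: pay for boot + job time, but never less than the minimum."""
    return max(MINIMUM_MINUTES, math.ceil(job_minutes + BOOT_MINUTES))

for job in (2, 10, 60, 240):
    billed = billed_minutes(job)
    print(f"{job:>3}-minute job -> billed {billed:>3} min "
          f"({billed / job:.1f}x the useful work)")
```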

The rest of the Dataproc goodies

Google Cloud doesn’t stop there. There are a few other benefits of Dataproc that truly make your life easier and your pockets fuller:

  • Custom VMs — if you know the typical resource utilization profile of your job in terms of CPU/RAM, you can tailor-make your own instances with that CPU/RAM profile (see the sketch after this list). This is really, really cool, you guys.
  • Preemptible VMs — I wrote on this topic recently. Google’s alternative to Spot instances is just great. Flat 80% off, and Dataproc is smart enough to repair your jobs in case instances go away. I beat this topic to death in the blog post, and in my biased opinion it’s worth a read on its own.
  • Best pricing in town. Google Compute Engine is the industry price leader for comparably sized VMs. In some cases, up to 40% less than EC2.
  • Gobs of ephemeral capacity — yes, you can run your Spark jobs on thousands of Preemptible VMs, and we won’t make you sign a big commitment, as this gentleman found out (TL;DR: running 25,000 Preemptible VMs).
  • GCS is fast fast fast — when ditching HDFS in favor of object stores, what matters is the overall pipe between storage and instances. Mr. Jornson details the performance characteristics of GCS and comparable offerings here.
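
As a rough sketch of how Custom VMs and Preemptible VMs get spelled out at cluster-creation time, here is the same gcloud-from-Python approach as above. The custom machine type (8 vCPUs, 16384 MB of RAM), worker counts, and region are placeholders, and the flag names reflect the CLI as it stood when this was written:

```python
import subprocess

# Illustrative only: the custom machine type, worker counts, and region are
# placeholders, not recommendations.
subprocess.run([
    "gcloud", "dataproc", "clusters", "create", "tailored-cluster",
    "--region", "us-central1",
    "--worker-machine-type", "custom-8-16384",  # Custom VM sized to the job's CPU/RAM profile
    "--num-workers", "2",                       # standard workers
    "--num-preemptible-workers", "20",          # cheap, discardable burst capacity
], check=True)
```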

Dataproc for stateful clusters

Now, if you are running a stateful cluster with, say, Impala and HBase on HDFS, Dataproc is a nice offering here too, if for some reason you don’t want to run Bigtable + BigQuery.

If you are after the biggest, baddest disk performance on the market, why not go with something that resembles RAM more than SSD in terms of performance — Google’s Local SSD? Mr. Dinesh does a great job comparing Amazon’s and Google’s offerings here. Cliff notes — Local SSD is really, really, really good — really.

Finally, Google’s Sustained Use Discounts automatically reward folks who run their VMs for longer periods of time, up to 30% off. No contracts and no commitments. And, thank goodness, no managing your Reserved Instance bills.

You win if you use Google’s VMs for short bursts, and you win when you use Google for longer periods.

Economics of Dataproc

We discussed how Google’s VMs are typically much cheaper thanks to Preemptible VMs, Custom VMs, Sustained Use Discounts, and even lower list pricing. Some folks find Google to be 50% cheaper overall!

Two things that studying Economics taught me (put down your pitchforks, I also did Math) — the difference between soft and hard sciences, and the ability to tell a story with two-dimensional charts.

Let’s assume a worst-case scenario, in which EMR and Dataproc VM prices are equal. We get this chart, which hopefully requires no explanation:

Which line would you rather be on?
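
The chart boils down to hourly rounding versus per-minute metering. Here is a quick sketch of that difference, using a made-up $1.00-per-hour cluster rate and the 10-minute minimum mentioned earlier:

```python
import math

RATE_PER_HOUR = 1.00  # made-up hourly price for the whole cluster

def hourly_billed(minutes):
    # Round the runtime up to the next full hour: the "extra blue area".
    return math.ceil(minutes / 60) * RATE_PER_HOUR

def per_minute_billed(minutes):
    # Pay for what you use, subject to Dataproc's 10-minute minimum.
    return max(minutes, 10) / 60 * RATE_PER_HOUR

for m in (10, 61, 90, 125):
    print(f"{m:>3} min job: hourly rounding ${hourly_billed(m):.2f} "
          f"vs per-minute ${per_minute_billed(m):.2f}")
```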

If you believe our good friend thehftguy’s claims that Google is 50% cheaper (after things like Preemptible VMs, Custom VMs, Sustained Use Discounts, etc.), you get this compelling chart:

Same chart, but with some more aggressive assumptions.

When you’re dishing out your precious shekels to your cloud provider, think of all that extra blue area you’re volunteering to pay for, entirely unnecessarily. This is why many of Dataproc’s customers don’t mind paying egress from their non-Google cloud vendors to GCS!

Summary

Google Cloud has the advantage of being a second-comer. Things are simpler, cheaper, and faster. Lower-level services like instances (GCE) and storage (GCS) are more powerful and easier to use. This, in turn, lets higher-level services like Dataproc be more effective:

  • Cheaper — per-minute billing, Custom VMs, Preemptible VMs, sustained use discounts, and cheaper VM list prices.
  • Faster — rapid cluster boot-up times, best-in-class object storage, best-in-class networking, and RAM-like performance characteristics of Local SSDs.
  • Easier — lots of capacity, less fragmented instance type offerings, VPC-by-default, and images that closely follow Apache releases.

Fundamentally, Dataproc lets you think in terms of jobs, not clusters. You start with a job, and you get a cluster as just another step in job execution. This is a very different mode of thinking, and we feel that you’ll find it compelling.

You don’t have to take my word for it — the good folks at O’Reilly had this to say about Dataproc and EMR.

Find me on Twitter at @thetinot. Happy to chat further!

