Machine learning on machines: building a model to evaluate CPU performance

Written by srobtweets | Published 2018/11/13
Tech Story Tags: machine-learning | linear-regression | bigquery | hardware | cpu


Wouldn’t it be cool if we could train a machine learning model to predict machine performance? In this post, we’ll look at a linear regression model I built with BigQuery Machine Learning (BQML) to predict the performance of a machine given its hardware and software specifications.

All of the work I do is software-related, so I rarely think about the hardware running my code. This project gave me an opportunity to learn about the hardware side of things.

The dataset: SPEC

To train this model I used data from SPEC*, an organization that builds tools for evaluating computer performance and energy efficiency. In 2017 they published a series of benchmark results that use 43 different tests to evaluate the performance of specific hardware. The benchmarks are divided into four categories: integer and floating point tests, each measuring performance by both time (called SPEC speed) and throughput (called SPEC rate):

The four categories of SPEC 2017 benchmarks.

For this post, I’ll focus mostly on the Integer Speed tests.

Each test includes specs on the hardware and software of the system where the test was run, along with results for several benchmarks. The benchmarks are intended to simulate different types of applications or workloads.

Here’s an example of the hardware and software data for one test:

And here’s a subset of the SPEC benchmarks for the above system:

Why so many benchmarks, and what do they all mean? SPEC’s goal is to evaluate machine performance across a variety of common workloads, all outlined here. For example, the 631.deepsjeng_s benchmark evaluates the speed of a machine running the alpha-beta search algorithm for playing a game of chess. SPEC also has benchmarks for other ML workloads, ocean modeling, video compression, and more.

Is there a relationship between hardware, software, and benchmark results?

The short answer is — yes! But I wanted to confirm this before naively dropping inputs into a model and hoping for the best.

To take a closer look at the data, I wrote a Python script to parse the CSV files for each test using the CPU2017 Integer Speed results. The script extracts data on the hardware and software used in each test, along with the score for each of the benchmarks run on that machine. In the script, I first initialize a dict() with the data I want to capture from each test:
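The original snippet isn’t reproduced here, but the dict looked roughly like the sketch below. The field names are my own approximations based on the inputs described later in the post, with one key per benchmark score (only one benchmark is shown):

```python
# Sketch of the fields collected for each test result.
# Field and benchmark names are illustrative, not the exact keys from the original script.
test_data = {
    'vendor': '',
    'model_name': '',
    'nominal_mhz': '',
    'max_mhz': '',
    'num_cores': '',
    'memory_channels': '',
    'l1_cache': '',
    'l2_cache': '',
    'l3_cache': '',
    'memory_gb': '',
    'memory_speed': '',
    'os': '',
    'compiler': '',
    'sponsor': '',
    '_631_deepsjeng_s': '',  # one key per benchmark score
}
```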

Then I write the header row to the CSV using the keys from the dict above:
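Assuming a dict like the one sketched above, writing the header takes a couple of lines with the csv module (the output file name is a placeholder):

```python
import csv

# test_data is the dict from the sketch above; its keys become the CSV header.
with open('int_speed_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(list(test_data.keys()))
```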

Here’s an example CSV that I’ll be iterating over in the script. The next step is to iterate over the local directory where I’ve saved all the CSV files for the integer speed tests, and use the Python csv module to read them:
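The directory walk might look something like this (the directory name is a placeholder for wherever the SPEC result files were saved):

```python
import csv
import os

results_dir = 'cint2017_speed_results'  # local folder of downloaded SPEC CSV files

for filename in os.listdir(results_dir):
    if not filename.endswith('.csv'):
        continue
    with open(os.path.join(results_dir, filename)) as f:
        rows = list(csv.reader(f))
    # ...extract the hardware/software specs and benchmark scores from `rows`...
```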

The files aren’t typical CSV format, but they all follow the same pattern so I can grab the benchmark data with the following:
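I don’t have the exact marker strings the script matched on, but the idea is to find the row indices that open and close the benchmark results section. A sketch, with placeholder markers:

```python
# Find the rows that delimit the benchmark results table.
# The marker strings below are placeholders, not the exact text from the SPEC CSVs.
benchmark_start = None
benchmark_end = None
for i, row in enumerate(rows):
    if row and benchmark_start is None and 'Full Results Table' in row[0]:
        benchmark_start = i + 1
    elif row and benchmark_start is not None and 'Notes' in row[0]:
        benchmark_end = i
        break
```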

Using the benchmark start and end indices, I can iterate over the rows of the CSV that contain the benchmarks. Then I look for the specific data points I’m collecting within a CSV. Here’s how I grab the vendor and machine model name:
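Continuing the sketch above, benchmark scores come from the rows between those two indices, and individual specs are matched by their label text (the labels and column positions are assumptions, not the exact values from the SPEC CSVs):

```python
# Benchmark scores live between the start/end indices found earlier.
for row in rows[benchmark_start:benchmark_end]:
    if row and row[0].strip() == '631.deepsjeng_s':
        test_data['_631_deepsjeng_s'] = row[2]  # assumed score column

# Hardware/software specs are matched by the label in the first column.
for row in rows:
    if len(row) < 2:
        continue
    label, value = row[0], row[1]
    if 'Hardware Vendor' in label:
        test_data['vendor'] = value
    elif 'Model Name' in label:
        test_data['model_name'] = value
```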

When I’m done collecting all of the machine specs, I create a CSV string of the data and write it to my file:
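Something along these lines, reusing the dict and output file from the earlier sketches (csv.writer would also work here and handles quoting for free):

```python
# Join the collected values in the same order as the header row
# and append the line to the output CSV.
row_string = ','.join(str(v) for v in test_data.values())
with open('int_speed_results.csv', 'a') as f:
    f.write(row_string + '\n')
```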

With all of the machine data and benchmark scores in a CSV file, I used a Python notebook and matplotlib to explore relationships between machine specs and benchmark results to see if there was a linear relationship. Here I’ve plotted the nominal speed of a machine and the 631.deepsjeng_s (alpha-beta search) benchmark:
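The plot itself isn’t reproduced here, but generating it from the CSV takes only a few lines with pandas and matplotlib (column names follow the earlier sketches):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV produced by the parsing script and plot one feature
# against the benchmark score.
df = pd.read_csv('int_speed_results.csv')
df.plot.scatter(x='nominal_mhz', y='_631_deepsjeng_s')
plt.show()
```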

Looks like I could draw a line or curve that roughly estimates the relationship above. Let’s take a look at another input. Here’s a plot of a machine’s number of cores and the 631.deepsjeng_s benchmark:

While there’s some relationship between the inputs and benchmark measured above, it’s clear that we couldn’t use a single feature (like number of cores) to accurately predict a machine’s speed. There are many features of a machine that affect its performance, and to make use of all of them we’ll build a linear regression model to evaluate the speed of a machine. Here’s what we’ll use as inputs:

- The hardware vendor
- The machine model name
- Nominal and max speed of the microprocessor in MHz
- Number of cores
- Number of memory channels
- Size of L1, L2, and L3 caches
- Memory in GB
- Memory speed
- The OS running on the system
- The compiler being used
- The company running the test (Lenovo, Huawei, Dell, etc.)

A linear regression model typically outputs a single numeric value, so we’ll need to create separate models to predict each benchmark score. It’s important to note that not all of our inputs are numerical — vendor name, model name, and OS name are categorical, string inputs. Since a linear regression model expects numerical inputs, we’ll need a way to encode these as integers before feeding them into our model.

Building a linear regression model with BigQuery Machine Learning

We could write the code for our linear regression model by hand, but I’d like to focus on the predictions generated by my model rather than the nitty gritty of the model code. BigQuery Machine Learning (BQML) is a great tool for this job. If I create a BigQuery table with my input data and benchmarks, I can write a SQL query to train a linear regression model. BQML will handle the underlying model code, hyperparameter tuning, splitting my data into training and test sets, and converting categorical data to integers.

Since I’ve already got the data for my model as a CSV, I can use the BigQuery web UI to upload this data into a table. Here’s the schema for the table:
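The schema screenshot isn’t reproduced here, but it maps one column to each of the inputs listed above plus the benchmark scores. Expressed as DDL, it would look roughly like this (names and types are my approximation, not the exact schema):

```sql
-- Approximate table schema; project, dataset, column names, and types are placeholders.
CREATE TABLE `my_project.spec_benchmarks.cpu2017_int_speed` (
  vendor STRING,
  model_name STRING,
  nominal_mhz INT64,
  max_mhz INT64,
  num_cores INT64,
  memory_channels INT64,
  l1_cache STRING,
  l2_cache STRING,
  l3_cache STRING,
  memory_gb INT64,
  memory_speed STRING,
  os STRING,
  compiler STRING,
  sponsor STRING,
  _631_deepsjeng_s FLOAT64
);
```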

And here’s a preview of the data:

To get a sense of one of the categorical features we’re working with, let’s take a look at the sponsor field in our table, which refers to the company that ran an individual test. Here’s the query we’ll run to get a breakdown of our data by test sponsor:
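The original query isn’t reproduced here, but it’s a standard GROUP BY aggregation along these lines (the project, dataset, and table names are placeholders):

```sql
SELECT
  sponsor,
  COUNT(*) AS num_tests
FROM
  `my_project.spec_benchmarks.cpu2017_int_speed`
GROUP BY
  sponsor
ORDER BY
  num_tests DESC
```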

The results:

And here are the top operating systems used in the tests:
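That breakdown comes from the same kind of aggregation, just grouped on the OS column (column and table names assumed):

```sql
SELECT
  os,
  COUNT(*) AS num_tests
FROM
  `my_project.spec_benchmarks.cpu2017_int_speed`
GROUP BY
  os
ORDER BY
  num_tests DESC
LIMIT 10
```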

I can train my model with a single SQL query in BQML:
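I don’t have the exact statement from the original post, but a BQML training query for this setup looks roughly like the following (project, dataset, model, and column names are placeholders matching the earlier sketches):

```sql
CREATE OR REPLACE MODEL
  `my_project.spec_benchmarks.deepsjeng_model`
OPTIONS
  (model_type='linear_reg', input_label_cols=['_631_deepsjeng_s']) AS
SELECT
  vendor,
  model_name,
  nominal_mhz,
  max_mhz,
  num_cores,
  memory_channels,
  l1_cache,
  l2_cache,
  l3_cache,
  memory_gb,
  memory_speed,
  os,
  compiler,
  sponsor,
  _631_deepsjeng_s
FROM
  `my_project.spec_benchmarks.cpu2017_int_speed`
WHERE
  _631_deepsjeng_s IS NOT NULL
```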

This took 1 minute and 14 seconds to train. When it completes, I can look at the stats for each iteration of training:

Loss metrics for each epoch of training in BQML
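Those per-iteration stats are shown in the BigQuery UI, and can also be pulled with an ML.TRAINING_INFO query (model name is a placeholder):

```sql
SELECT
  iteration,
  loss,
  eval_loss
FROM
  ML.TRAINING_INFO(MODEL `my_project.spec_benchmarks.deepsjeng_model`)
ORDER BY
  iteration
```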

The number I want to focus on here is the Evaluation Data Loss — this is the loss metric calculated after each iteration of training. We can see that it has steadily decreased from the first to the last iteration. To get additional evaluation metrics, I can run an ML.EVALUATE query. Here are the results:

Evaluation metrics on my BQML model
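For reference, the evaluation query itself is a one-liner over the trained model (model name is a placeholder); called without an input table, it evaluates against the data BQML held out during training:

```sql
SELECT
  *
FROM
  ML.EVALUATE(MODEL `my_project.spec_benchmarks.deepsjeng_model`)
```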

Mean squared error (MSE) measures the average squared difference between the values our model predicted on the test set and the actual values. You can also think of it as the distance between your regression (best fit) line and the actual data points. A smaller value is better, and our model’s MSE of 0.0121 is very good.

Since this is a regression model (predicting a continuous numerical value), the best way to see how it performed is to evaluate the difference between the value predicted by the model and the ground truth benchmark score. We can do this with an ML.PREDICT query.

When this query runs, it’ll create a new field prefixed with predicted_ followed by the name of our label column (in this case _631_deepsjeng_s). Let’s run ML.PREDICT across the original dataset and output a few features of the system being tested, the actual speed test result, and the predicted benchmark score:
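The prediction query wraps a SELECT over the table in ML.PREDICT (names are placeholders consistent with the sketches above):

```sql
SELECT
  vendor,
  model_name,
  num_cores,
  _631_deepsjeng_s,
  predicted__631_deepsjeng_s
FROM
  ML.PREDICT(MODEL `my_project.spec_benchmarks.deepsjeng_model`,
    (SELECT * FROM `my_project.spec_benchmarks.cpu2017_int_speed`))
LIMIT 10
```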

Our predictions are very close to the actual score values!

To measure this another way, we can subtract the predicted score from the actual score, and then get the average of the absolute values of these differences:
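That’s a single aggregation over the same ML.PREDICT output (names are placeholders):

```sql
SELECT
  AVG(ABS(_631_deepsjeng_s - predicted__631_deepsjeng_s)) AS avg_abs_error
FROM
  ML.PREDICT(MODEL `my_project.spec_benchmarks.deepsjeng_model`,
    (SELECT * FROM `my_project.spec_benchmarks.cpu2017_int_speed`))
```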

On average, our model’s prediction is only 0.04 off from the actual benchmark score, which is pretty impressive.

Generating predictions on new data

Now that I’ve got a trained model, I can use it to predict the speed benchmarks for a machine that wasn’t part of my training set with the following query:
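The query passes the new machine’s specs as literal values inside ML.PREDICT. The sketch below shows the structure only; the spec values are placeholders, not the actual system from the original post:

```sql
SELECT
  predicted__631_deepsjeng_s
FROM
  ML.PREDICT(MODEL `my_project.spec_benchmarks.deepsjeng_model`,
    (SELECT
      'ExampleVendor' AS vendor,
      'ExampleModel' AS model_name,
      2500 AS nominal_mhz,
      3800 AS max_mhz,
      16 AS num_cores,
      6 AS memory_channels,
      '32 KB per core' AS l1_cache,
      '1 MB per core' AS l2_cache,
      '22 MB shared' AS l3_cache,
      192 AS memory_gb,
      '2666 MHz' AS memory_speed,
      'ExampleOS' AS os,
      'ExampleCompiler' AS compiler,
      'ExampleVendor' AS sponsor))
```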

The SPEC speed score for this machine on the 631.deepsjeng_s benchmark should be 5.19. When I run this query, my model predicts 5.20, which is very close to the actual value. I can now use this model to generate predictions for new combinations of hardware and software specs that weren’t included in the original dataset. In addition to the alpha-beta search benchmark, I created models for the other Integer Speed benchmarks with similar accuracy to this one; let me know if you’re interested in these results and I can share details.

Because I didn’t have to write any of the underlying model code, I was able to focus on feature engineering and finding the right combination of inputs to build a high accuracy model — thanks BQML!

More with SPEC data

If you’re interested in experimenting with this data and creating your own models, check out the Integer Speed benchmark results here. It’s also worth noting that SPEC has lots of data on previous CPU benchmarks. The 2006 SPEC CPU benchmarks have many more years’ worth of data (though SPEC has retired this suite), so you could use a similar approach to train a model that predicts the performance of older machines.

This post focused on the integer speed benchmarks. I haven’t included the other SPEC benchmarks here (floating point speed and the throughput tests) because they didn’t perform as well with linear regression models (MSE in the 100s, compared to less than 0.1 for the integer speed models). I’m planning to experiment with these tests using other types of models, so stay tuned :)

In addition to SPEC, there are many other performance benchmarks you could use, like LINPACK, which is used to rank the TOP500 list of supercomputers.

Have feedback?

This is my first foray into hardware performance data, so I’d love any feedback you have on the features I used as inputs, other types of models to try, or anything else. Leave a comment below or find me on Twitter at @SRobTweets.

If you want to learn more about BQML, my teammates have some great posts: this one covers predicting Stack Overflow response times, and this one walks you through building a model to predict flight delays.

Thank you to these awesome people for their contributions and feedback: Brian Kang, Shane Glass, Abhishek Kashyap, Eric Van Hensbergen, Jon Masters, and a few other SPEC experts at Red Hat 😀

*********************

*Data source: https://www.spec.org/cpu2017/results/cint2017.html Data retrieved on November 11 2018. SPEC Fair Use Rules: https://www.spec.org/fairuse.html

