Building a botnet on PyPi

Or being able to

An update — September 2017

A week or so ago, some students applied this concept to typosquatting (registering malicious packages with names similar to popular libraries). By getting a university to issue a security notice, they generated some interest, which finally resulted in some changes to pypi/warehouse to address these issues.

I decided to take another look at the download figures for my packages, and see what damage my malicious alter-ego could have wreaked.

Across the 12 system module packages I’m hosting, I’m getting on average 1.5 thousand downloads per day, via pip. This adds up to 491,292 downloads so far this year. I’m hoping to hit 500k downloads before my packages are deleted!

By package, the download ratios pretty much match the numbers from May:

There’s a plan to delete my fake packages now that restrictions have been added to prevent this sort of attack, but it was fun while it lasted!

Intro

At a London python dojo in October last year, we discovered that PyPi allows packages to be registered with builtin module names.

So what? you might ask. Who would pip install a system package? Well, the story goes something like this:

An inexperienced Python developer/deployer realises they need X functionality
Googling/asking around, they find out that to install packages, people use pip
Developer happily types in e.g. pip install sys
Baddie has registered the sys package on pypi, and included a malicious payload
Developer is now pwned by the malicious package, but import sys in python still works, and imports a functional sys module, so nobody notices.
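To make the collision concrete, here is a minimal sketch that checks whether a proposed package name matches a module that already ships with the interpreter. The function name shadows_stdlib is made up for this example. It relies on sys.stdlib_module_names, which only exists on Python 3.10 and later; on older versions it falls back to the compiled-in modules listed in sys.builtin_module_names, so it will miss pure-Python standard library modules there.

```python
import sys


def shadows_stdlib(package_name: str) -> bool:
    """Return True if package_name matches a built-in or standard library module.

    Hypothetical helper for illustration only, not part of pip or PyPI.
    """
    # Modules compiled into the interpreter itself, e.g. "sys".
    known = {name.lower() for name in sys.builtin_module_names}
    # Full standard library listing; only available on Python 3.10+.
    known |= {name.lower() for name in getattr(sys, "stdlib_module_names", ())}
    # PyPI treats names case-insensitively, so compare in lower case.
    return package_name.lower() in known


if __name__ == "__main__":
    for candidate in ("sys", "not_a_real_stdlib_module_xyz"):
        print(candidate, shadows_stdlib(candidate))
```

On a typical CPython install the real module keeps winning the import, because built-in modules and the standard library directories are consulted before site-packages. That is why the victim in the story sees nothing unusual after installing the fake package.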

When we discovered this, I was pretty interested in how plausible this was as an attack vector, so did a few things:

Emailed the pypi security contacts listed on pypi
Proactively registered all the common system module names that I could think of, as packages
Uploaded an empty package to each of them that does nothing other than immediately traceback:

raise RuntimeError("Package 'json' must not be downloaded from pypi")

Why upload anything?

It’s perfectly possible to squat on a pypi package and not upload any files. But by adding an empty package, I could track the downloads from the pypi download stats.

Pypi uploads its access logs (sans identifying information) to Google BigQuery, which is pretty awesome, and allows us to get a good idea of how many systems each package ends up on.

How effective is this attack vector?

BigQuery says that so far this year (as of 19th May 2017), my dummy packages have been downloaded ~244k times. Lucky they’re benign, huh? Otherwise that’s a quarter of a million infected machines!

Some of the downloads will be people using custom scrapers, others may be automated build jobs, running over and over, but I used some tactics to gauge the quality of this data:

The pypi download logs include a column, installer.name, which seems equivalent to an HTTP user agent string. By only selecting rows where installer.name is pip, we’re more likely to be counting actual installs, rather than scrapers or other bots.
Another column, system.release, tracks very high-level system version information (for example 4.1.13-18.26.amzn1.x86_64). By including this in the counts, we can see that lots of different types of setups are downloading these packages, suggesting it’s not just a few bots scraping the site. 3.1k different system versions have downloaded my packages this year, compared with 33k total unique versions across the whole of pypi.
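The two checks above can be sketched in plain Python over a handful of rows. This is not the BigQuery query itself, just an illustration of the counting logic. The sample rows, the dictionary keys (project, installer, system_release) and the function name summarise_downloads are all invented for the example.

```python
from collections import Counter

# Hypothetical sample rows standing in for the download logs.
# The keys loosely mirror the installer.name and system.release columns.
SAMPLE_ROWS = [
    {"project": "json", "installer": "pip", "system_release": "4.1.13-18.26.amzn1.x86_64"},
    {"project": "json", "installer": "pip", "system_release": "4.4.0-75-generic"},
    {"project": "sys", "installer": "bandersnatch", "system_release": None},
    {"project": "sys", "installer": "pip", "system_release": "4.4.0-75-generic"},
    {"project": "requests", "installer": "pip", "system_release": "4.4.0-75-generic"},
]


def summarise_downloads(rows, my_packages):
    """Count pip downloads per package and distinct system releases seen."""
    # Keep only rows for the squatted packages that were fetched by pip,
    # which drops mirrors, scrapers and other non-pip clients.
    pip_rows = [
        row for row in rows
        if row["project"] in my_packages and row["installer"] == "pip"
    ]
    per_package = Counter(row["project"] for row in pip_rows)
    # Many distinct releases suggest many different machines, not a few bots.
    releases = {row["system_release"] for row in pip_rows if row["system_release"]}
    return per_package, len(releases)


if __name__ == "__main__":
    counts, distinct_releases = summarise_downloads(SAMPLE_ROWS, {"json", "sys"})
    print(dict(counts), distinct_releases)
```

With the sample rows, the non-pip download of sys is dropped and the requests row is ignored because it is not one of the squatted names.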

The query I used is here:

What now?

I never actually received a reply to my email, so in January I raised an issue on the official pypi GitHub issue tracker. This also got no reply.

I’m currently squatting all the system package names that seem most at risk, and doing so with benign packages, so I don’t see much risk in disclosing this now.