Machine Learning the GitHub API with Meeshkan

Written by meeshkan | Published 2017/12/25


Happy New Year, Hacker Noon readers!

We’re proud to announce that we’ve finished our first full-length course called Meeshkan: Machine Learning the GitHub API. The course is available on Udemy and you can follow it free of charge.

Meeshkan: Machine Learning the GitHub API | Udemy. Learn how to plan, deploy and run a Machine Learning problem on AWS and Meeshkan. Free course. www.udemy.com

The basic idea of the tutorial is the following. Let’s say that you are a venture capitalist or talent hunter or you offer a service to developers and you would like to discover exciting new projects on GitHub to invest in them, or hire the authors, or strike up a partnership. What type of projects will really take off? Maybe Machine Learning can be used to help us hone our intuition about projects on GitHub.

Meeshkan has a crush on the Octocat.

To cut to the chase, by the end of the tutorial you’ll see that [drumroll] dense, fully connected neural networks learn faster about stars than about forks. The models also perform surprisingly well on two measures: mean squared error between actual and predicted star count, and categorical cross-entropy for classifying star counts above a certain threshold. This holds even after introducing satisficing criteria to account for the skewed distribution of some targets (i.e. a threshold of stars > 80000 ? 1 : 0 will only have a handful of projects in the 1 category while the rest of us mortals hang out with the 0s). We figure this out by feeding Meeshkan lots of different webhooks that serve data collected from the GitHub API, uploading various Keras models to Meeshkan, and passing the results to a small Express server in order to visualize how our models are performing.
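As a rough illustration of the two measures mentioned above, here is a minimal sketch in plain Python. The course itself uses Keras, whose built-in losses would normally do this work; the function names and the sample star counts below are made up for illustration, and only the stars > 80000 threshold comes from the text.

```python
import math

def threshold_label(stars, threshold):
    """One-hot label: [1, 0] for the 0 category, [0, 1] for the 1 category."""
    return [0.0, 1.0] if stars > threshold else [1.0, 0.0]

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted star counts."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def categorical_cross_entropy(one_hot, probabilities, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probabilities))

# Made-up star counts, for illustration only.
stars = [12, 340, 95000, 7]
labels = [threshold_label(s, 80000) for s in stars]
positives = sum(label[1] for label in labels)  # how many land in the 1 category
```

With a threshold this high, almost every label ends up in the 0 category, which is the skew the satisficing criteria are meant to account for.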

Results displayed on our data visualizer, which you can find at https://github.com/meeshkan/data-visualizer. Our model learns pretty darn fast but then plateaus. Epochs are on the X axis, loss on the Y axis. I ran a first batch on a dataset with around 200 000 data points, which was obviously overkill. The lowest line shows predictions for a higher star-count threshold (0 for under 1500, 1 for over), whereas the upper two are for a threshold of 300 stars.

The course takes a day or so to complete, and by the end you’ll not only have done some Machine Learning, but you’ll also have deployed an entire environment to Amazon Web Services that automatically ingests publicly available API data for Machine Learning. If you want to modify the tutorial to learn from one of the hundreds of other open APIs on the internet, go for it! I’m excited to see what you learn :-)

In this article, I’d like to talk about the making of the tutorial and, specifically, reveal a few neat things I learned about using Meeshkan to analyze results of large data-collection projects on public APIs.

308 550

That is the number of EC2 t2.micro server instances that were spawned on Amazon to analyze and inspect the GitHub API. The batch collected data on 995 977 repos and a whopping 14 329 073 commits. The average lifespan of a server was 203.45 seconds, and we bid 0.0040 USD per hour per server. So, when you do the math, it cost 308550 * 203.45 * 0.0040 / 3600 = 69.75 USD to collect all that data. The main reason I let the job run so long is that parts of the setup (EC2, GitHub, MySQL, the tutorial’s code) can sometimes flake out unexpectedly, which means that you won’t always get the number of commits you need to do machine learning. For example, only 125 082 of the repos in the database have over 50 commits. If I run:

SELECT full_name, stargazers_count, COUNT(*) AS commit_count
FROM repos
JOIN commits ON repos.id = commits.repo_id
GROUP BY repos.id
ORDER BY commit_count ASC, stargazers_count DESC
LIMIT 5;

I get:

+--------------------+------------------+--------------+
| full_name          | stargazers_count | commit_count |
+--------------------+------------------+--------------+
| impress/impress.js |            32868 |            1 |
| Automattic/kue     |             6852 |            1 |
| cdnjs/cdnjs        |             5801 |            1 |
| square/cube        |             3871 |            1 |
| enyojs/enyo        |             1941 |            1 |
+--------------------+------------------+--------------+
5 rows in set (7.51 sec)

In other words, these are false positives: none of those repos actually has only 1 commit, so the rest of their history simply was not collected. But even when we only study repos with more than 20 commits, 100 000 data points is more than enough to get us started.
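One way to keep such incomplete repos out of a training set is to filter on commit count in the query itself. The sketch below is an assumption about how that could look, not the tutorial’s code: it reuses the repos and commits table and column names from the query above, but runs against an in-memory SQLite database as a stand-in for MySQL, and the helper name and the two rows of toy data are invented for the example.

```python
import sqlite3

def repos_with_more_commits_than(conn, min_commits):
    """Return (full_name, stargazers_count, commit_count) for repos
    that have more than min_commits commits in the database."""
    query = """
        SELECT full_name, stargazers_count, COUNT(*) AS commit_count
        FROM repos
        JOIN commits ON repos.id = commits.repo_id
        GROUP BY repos.id
        HAVING COUNT(*) > ?
        ORDER BY stargazers_count DESC
    """
    return conn.execute(query, (min_commits,)).fetchall()

# Toy schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repos (id INTEGER PRIMARY KEY, full_name TEXT, stargazers_count INTEGER);
    CREATE TABLE commits (id INTEGER PRIMARY KEY, repo_id INTEGER);
""")
conn.executemany(
    "INSERT INTO repos VALUES (?, ?, ?)",
    [(1, "example/complete", 10), (2, "example/truncated", 500)],
)
# 25 commits for the first repo, a single commit for the second.
conn.executemany("INSERT INTO commits (repo_id) VALUES (?)", [(1,)] * 25 + [(2,)])

rows = repos_with_more_commits_than(conn, 20)
```

Here only example/complete survives the filter; the repo with a single recorded commit is dropped regardless of its star count.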

Our database is ingesting a blazing 40 repos per second and around 1000 commits per second! You’ll learn how to deploy this setup in a few clicks on the Udemy tutorial.
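The cost of the data-collection run quoted earlier in this section is easy to double-check. The following is a small sketch of that arithmetic; the function name is made up, and the three inputs are the figures given in the text.

```python
def spot_cost_usd(instances, avg_lifespan_seconds, bid_usd_per_hour):
    """Total cost of a fleet, given the average lifespan per instance and an hourly bid."""
    total_hours = instances * avg_lifespan_seconds / 3600
    return total_hours * bid_usd_per_hour

# Figures from the text: 308 550 instances, 203.45 s average lifespan, 0.0040 USD per hour.
cost = spot_cost_usd(308_550, 203.45, 0.0040)
print(f"{cost:.2f} USD")  # prints "69.75 USD"
```

The same function makes it easy to see how the bill would change with a different bid or a longer average lifespan.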

Also, there are some fun diamonds in the rough. The command:

SELECT full_name, size, stargazers_count
FROM repos
WHERE size = 2
ORDER BY size ASC, stargazers_count DESC
LIMIT 5;

yields:

+-------------------------------+------+------------------+
| full_name                     | size | stargazers_count |
+-------------------------------+------+------------------+
| atg/chocolat-public           |    2 |              197 |
| edankwan/Jesus.js             |    2 |              114 |
| bancek/django-smtp-ssl        |    2 |               84 |
| boucher/stripe-webhook-mailer |    2 |               81 |
| tlatsas/bash-spinner          |    2 |               50 |
+-------------------------------+------+------------------+
5 rows in set (0.45 sec)

Yes, https://github.com/edankwan/Jesus.js is really a repo, yes, it has almost nothing in it, and yes, it has managed to rack up 114 stars. Actually, now 115 with one from me :-)

If you’re not following the tutorial on Udemy and have at least a Yellow Belt in JavaScript and AWS infrastructure, you can deploy this all yourself with just a few clicks from https://github.com/meeshkan/github-tutorial-stack. Pull requests are welcome!

73

This is the number of webhooks that I passed to Meeshkan to get the results for this tutorial. Because the webhook generates the data dynamically based on the path, it is really easy to keep the model the same but change the webhook.

A screencast from when I was putting together the tutorial: launching a (batch) job in Meeshkan takes less than 30 seconds!

For example, the same model can be used to analyze how many distinct authors there are over a 3 or 10 commit window just by changing a parameter in the webhook. The model doesn’t change at all. For each webhook, I uploaded between 3 and 5 models, all of which run in parallel. We see the download icon for the first results within minutes, and even the longest jobs finish within a few hours, at the Meeshkan network’s very reasonable price of 0.08 USD per hour.
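To make the "distinct authors over a commit window" idea concrete, here is a hedged sketch of what such a feature could look like. It assumes one possible reading, namely that the window covers the first N commits of a repo; how the tutorial’s webhook actually computes it is not shown here, and the function name and author names are hypothetical.

```python
def distinct_authors(commit_authors, window):
    """Number of distinct authors among the first `window` commits.

    commit_authors is a list of author names in commit order (oldest first).
    """
    return len(set(commit_authors[:window]))

# Hypothetical commit history, oldest commit first.
authors = ["ann", "bob", "ann", "cat", "cat", "dan"]

over_3 = distinct_authors(authors, 3)    # ann, bob, ann
over_10 = distinct_authors(authors, 10)  # the whole history, since it has fewer than 10 commits
```

Changing the window from 3 to 10 changes the feature but not the shape of the input the model sees, which is why the model itself can stay the same.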

4

This is the number of things I can think of off the top of my head to do after you finish following the course.

  • I would choose to focus only on stars with a threshold of around 1 000 stars for a project, running a few deeper neural nets on just this range.
  • I would run this against my validation set, which is really easy. Instead of passing the webhook tutorials.meeshkan.io/github/80_10_10_/train/..., I use tutorials.meeshkan.io/github/80_10_10_/validate/... and voilà, we are on our validation set. This way, we’ll know if we need to regularize our model. Because you are not prematurely regularizing, right? RIGHT??!?
  • I’d get more data to look at a slightly longer commit history and write a new Keras model that treats commits as a sequence of events.
  • I’d get a version of the model into a production environment ASAP and start using it on new GitHub repos. That way, if we decide that an early version of the model is good enough to start helping us make business decisions, we have a small stash of predictions already waiting for us.
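The train/validate switch in the second item above amounts to swapping one path segment. Below is a minimal sketch under that assumption; the helper names are hypothetical, the trailing ... is left exactly as elided in the text, and the loss values in the usage lines are placeholders rather than results.

```python
def to_validation_webhook(train_webhook):
    """Swap the /train/ path segment for /validate/, leaving the rest untouched."""
    if "/train/" not in train_webhook:
        raise ValueError("expected a webhook containing a /train/ segment")
    return train_webhook.replace("/train/", "/validate/", 1)

def generalization_gap(train_loss, validation_loss):
    """Validation loss minus training loss; a large positive gap hints at overfitting."""
    return validation_loss - train_loss

train_hook = "tutorials.meeshkan.io/github/80_10_10_/train/..."
validate_hook = to_validation_webhook(train_hook)

# Placeholder numbers, only to show the call.
gap = generalization_gap(train_loss=0.25, validation_loss=0.5)
```

If the gap stays small, there is little reason to regularize yet; if validation loss sits well above training loss, that is the signal to start.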

No, wait, there’s a fifth thing…you should crack open a nice bottle of Penfolds Grange Hermitage 1951 because you’re going to be making bank with your awesome AI ninja skills after you take our Udemy course :-)

Your very own API-crawler on Meeshkan

The Meeshkan Public Beta is itching to do your Machine Learning! We are a small company with a big heart that is taking on the likes of Google and Amazon by offering a low-cost Machine Learning sandbox where anyone can explore new ideas. Here are some great things to do once you sign into Meeshkan:

  • Deploy the Udemy tutorial materials. Following a tutorial is great, and building upon it is even better. There are endless hyper-parameters to tune. So once you’ve finished the tutorial, tune away! And remember: if you don’t want to follow the full tutorial, you can just use the webhook https://tutorials.meeshkan.io/github/... to get started, where ... is explained in video 3 of the tutorial, using a model from video 5 of the tutorial.
  • Book a time to chat. Machine Learning is for everyone, not just big companies with lots of money, but sometimes it can be daunting to get started. After you watch our tutorial, we hope you’ll have ideas for cool projects you can do on Meeshkan. We’d love to help you get up and running! Sign up for a free consultation through your Meeshkan Dashboard or directly on Calendly.
  • Join our Slack group. You can join the group through the Meeshkan console or by signing up using this form.

Thank you very much for checking out our Machine Learning service. We think you’ll like it a lot and we are working every day to make it faster, cheaper and easier to use.

Did I mention you get 100 free hours of Machine Learning and a free fifteen-minute consultation to get your ML job up and running? See you on Meeshkan!


Published by HackerNoon on 2017/12/25