A Dive into FreeCodeCamp Stargazers

Written by preview | Published 2017/11/01
Tech Story Tags: javascript | github | open-source

Please note that my opinions are automatically those of everyone at my company, of France, and more generally of every resident of planet Earth.

To celebrate the soon-to-be first GitHub repository to reach 300k stars, I decided to do a deeper analysis of this metric.

Before starting, and in order to be completely honest, I must admit I'm a bit biased, primarily due to their way of making newcomers believe they were required to star their repository.

For people unfamiliar with what I'm referring to: the getting started challenge basically used to instruct everyone to create a GitHub account and go star freeCodeCamp (FCC for short) as a mandatory step. After some time, I created a pull request to remove the imperative wording, as a first step towards removing the instruction completely. It was rejected at first, then approved, sort of reverted in another merge, but Quincy finally decided to delete the offending sentence a few months later.

Now that you have the background that will also be helpful later, let’s proceed to the first mandatory step before making some visualizations!

Data: The Gathering

An interesting thing to know is that the GitHub REST API is limited to the first 400 pages of 100 stargazers, as we discovered the hard way with Starveller, but that is another story. Parsing the web page was not an option either, since it only displays the last 5100 people. Fortunately, the GraphQL endpoint doesn't have this limitation (at least for now ;)).

Our query is pretty basic and retrieves only the information we might need later, namely location, createdAt and all the entity counts of a user. The maximum number of results per page you can retrieve is almost always 100; I used 50 here since I experienced some API timeouts otherwise.
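
The gist embedded in the original post isn't reproduced here, but a minimal sketch of what the query could look like follows. I'm writing it as a template string right away so the cursor (more on that below) can be interpolated; the starredAt field on each edge and the exact entity counts kept are assumptions on my part:

```js
// Sketch of the stargazers query. The per-user fields (location, createdAt,
// a few totalCount entity counts) follow the description above; starredAt is
// assumed here since we'll need the starring date later on.
const buildQuery = (endCursor) => `{
  repository(owner: "freeCodeCamp", name: "freeCodeCamp") {
    stargazers(first: 50${endCursor ? `, after: "${endCursor}"` : ''}) {
      pageInfo {
        endCursor
        hasNextPage
      }
      edges {
        starredAt
        node {
          login
          location
          createdAt
          followers { totalCount }
          repositories { totalCount }
          starredRepositories { totalCount }
        }
      }
    }
  }
}`;
```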

Executing such a query through the GitHub GraphQL explorer would give us the following data (trimmed down to the first stargazer for display purposes):

The cursor accessible in the pageInfo field will help us iterate through the results: it gets passed as a new after argument to the stargazers query, instructing GraphQL to retrieve results starting from the previous endCursor. We can pass it easily using an ES6 template string that you can look at here (a simplified version is sketched above).

With this, we can now create our script to fetch all that good stuff. Because it took a couple of hours spread over multiple days to get everything, I couldn't simply keep it all in memory since I had to pause and restart a few times, so I used lowdb. For people who don't know it or won't take the time to look at the readme, it is a local JSON database with lodash utilities.

The logic consists of a simple function calling itself with the previous cursor, or grabbing the last one from the database on the first iteration. The resulting file size is 194MB, but I leveraged git-lfs in order to make it available to you all so you can use it better than I will (feel free to diss me in the comments).
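
The real script is in the repository linked at the end; a simplified sketch of that logic, assuming lowdb 1.x with the FileSync adapter, node-fetch, a token in GITHUB_TOKEN and the buildQuery helper sketched earlier, could look like this:

```js
const fetch = require('node-fetch');
const low = require('lowdb');
const FileSync = require('lowdb/adapters/FileSync');

const db = low(new FileSync('stargazers.json'));
db.defaults({ stargazers: [], lastCursor: null }).write();

// Tiny helper posting a query to the GitHub GraphQL endpoint.
const graphql = (query) =>
  fetch('https://api.github.com/graphql', {
    method: 'POST',
    headers: { Authorization: `bearer ${process.env.GITHUB_TOKEN}` },
    body: JSON.stringify({ query }),
  }).then((res) => res.json());

// Fetch one page, persist it, then call itself with the new cursor.
// On the first iteration the cursor is whatever was last saved in the db.
const fetchPage = async (cursor = db.get('lastCursor').value()) => {
  const { data } = await graphql(buildQuery(cursor));
  const { edges, pageInfo } = data.repository.stargazers;

  db.get('stargazers').push(...edges).write();
  db.set('lastCursor', pageInfo.endCursor).write();

  if (pageInfo.hasNextPage) return fetchPage(pageInfo.endCursor);
};

fetchPage().catch(console.error);
```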

Visualizations

Now that we have our raw data, let's play with it. The first interesting thing we could look at is the overall evolution of stargazers. For this article, I'm going to pick a charting library using the amazing RANDOM.org API combined with an open-source script that anyone can read, ensuring impartiality.
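
For the curious, delegating the choice really only takes a few lines; something like this toy snippet (the candidate list here is purely illustrative, not the one actually used):

```js
const fetch = require('node-fetch');

// Ask RANDOM.org for a single integer and use it as an index into the
// candidates. Again, this list is an example, not the one from the article.
const candidates = ['react-vis', 'recharts', 'victory', 'chart.js', 'highcharts'];
const url =
  'https://www.random.org/integers/?num=1&min=0' +
  `&max=${candidates.length - 1}&col=1&base=10&format=plain&rnd=new`;

fetch(url)
  .then((res) => res.text())
  .then((n) => console.log(`Winner: ${candidates[parseInt(n, 10)]}`));
```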

Well, that was quite unexpected: out of all of them, the one made by my team at Uber got picked. That will have to do, I guess…

We are going to need to group the stars by date to be able to tell how many stars the repo got per day, and to look at the overall evolution for meaningful signals.

The resulting byDay array is going to be saved into its own file that will be passed to our chart, in order to avoid serving the full database to you, wonderful reader.
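
A sketch of that grouping step, assuming the starredAt timestamps from the query sketch above (ISO strings, so the first 10 characters are the day):

```js
const fs = require('fs');
const low = require('lowdb');
const FileSync = require('lowdb/adapters/FileSync');

const db = low(new FileSync('stargazers.json'));

// Count stars per calendar day...
const perDay = db
  .get('stargazers')
  .countBy((edge) => edge.starredAt.slice(0, 10))
  .value();

// ...then walk the days in order, keeping a running total for the evolution.
let total = 0;
const byDay = Object.keys(perDay)
  .sort()
  .map((day) => ({ day, stars: perDay[day], total: (total += perDay[day]) }));

// Only this small array gets shipped to the chart, not the 194MB database.
fs.writeFileSync('byDay.json', JSON.stringify(byDay));
```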

After all this text and snippets, can we finally see something!?! Yes. Using a react-vis LineSeries we can plot the overall evolution, along with the per-day value represented by a BarSeries. We're also going to do a couple of cheap tricks with the axis ticks in order to show the values corresponding to the nearest hovered X. Go ahead and play with it, it's interactive! (Sorry mobile users, I had to remove the second one.)
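
The interactive charts don't survive outside of Medium, but a bare-bones version of the two plots could look like this (a VerticalBarSeries for the per-day values, a LineSeries for the total; the axis-tick hover trick is omitted):

```jsx
import React from 'react';
import { XYPlot, XAxis, YAxis, LineSeries, VerticalBarSeries } from 'react-vis';
import byDay from './byDay.json';

const cumulative = byDay.map(({ day, total }) => ({ x: new Date(day).getTime(), y: total }));
const perDay = byDay.map(({ day, stars }) => ({ x: new Date(day).getTime(), y: stars }));

// Two stacked plots: overall evolution on top, stars per day below.
const StarCharts = () => (
  <div>
    <XYPlot width={700} height={250} xType="time">
      <XAxis />
      <YAxis />
      <LineSeries data={cumulative} />
    </XYPlot>
    <XYPlot width={700} height={250} xType="time">
      <XAxis />
      <YAxis />
      <VerticalBarSeries data={perDay} />
    </XYPlot>
  </div>
);

export default StarCharts;
```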

The first thing we notice is the pretty abnormal peak between July 9th and 10th of 2015, where we got a total of roughly 2900 stargazers. Let's play investigative journalists and go back in time to see what we can find at these dates, in order to pinpoint the cause of this growth.

One article was published on June 18th on Wired but seems to have had no impact, while another one published on July 9th on Lifehacker (ranked 209th in the US as of now) appears to have contributed. We are also going to use Twitter's advanced search, since they will most likely have tweeted about big news concerning the site. And here it is: along with responses that make it look like the site just went live, there is one tweet announcing they are the #1 trending repo on GitHub. I'm not sure why they weren't trending already since June 24th with ~200 stars a day (I don't really know how GitHub used to do the ranking), but this must be the main source.

The first drop, on July 6th 2016, corresponds to the commit that completely removed the text referring to starring the repo, but still showed a gif of it, thus still drawing some people into doing the same action. The point where we get back to completely normal growth in June 2017 matches the revamp of the site along with the on-boarding process, which is confirmed by looking at the web archive of the site, which started picking up snapshots at that date. Funnily enough, the diff and this gif show they are now using this growth hacking technique with YouTube subscribers, but that's another story.

Wayback Machine snapshots for FreeCodeCamp January-August 2017

But what I'm really interested in analysing with more scrutiny is the Smithes. What are they exactly? Well, it's the codename I gave to users who starred FCC, and nothing else. Not all of them are necessarily completely inactive, but because they only ever gave this one star, I assumed they would never have given it had they not been instructed to do so.

At first I thought representing the Smithes would be a good way to display this data, but I quickly realized that rendering their opposites, the Neos, would be a better way to show the proportion on the full timeline.

Thus, we are going to make our Y axis represent the percentage of Neo stars for each day, along with a flame-graph-like color scale to clearly show potential differences from one day to another. Hovering will give you the exact percentage and the total number of Neos for a specific day.
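
The condition from the actual script isn't embedded here, but the idea boils down to something like this, assuming "only starred FCC" translates to starredRepositories.totalCount === 1:

```js
const low = require('lowdb');
const FileSync = require('lowdb/adapters/FileSync');

const db = low(new FileSync('stargazers.json'));

// A Neo starred at least one other repository; a Smith only ever starred FCC.
const isNeo = ({ node }) => node.starredRepositories.totalCount > 1;

// Bucket stars per day and count how many of them come from Neos.
const buckets = {};
db.get('stargazers').value().forEach((edge) => {
  const day = edge.starredAt.slice(0, 10);
  buckets[day] = buckets[day] || { neos: 0, total: 0 };
  buckets[day].total += 1;
  if (isNeo(edge)) buckets[day].neos += 1;
});

// Day-by-day percentage of Neo stars, ready to feed to the chart.
const neosByDay = Object.keys(buckets).sort().map((day) => ({
  day,
  neos: buckets[day].neos,
  percent: (100 * buckets[day].neos) / buckets[day].total,
}));
```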

Ultimately, you would want this to be a red rectangle.

As you can see, it's very unlikely for early stargazers from before Operation Black Storm (as I like to call it) to be Smithes, since this limited group of people were either relatives of the project owners or already familiar with open source.

After June 23rd, with FCC going mainstream at more than 100 stars a day, the Smithes started making their appearance, en masse. The only time the Neos largely surpassed their counterparts was around September 24th 2015. I tried to go back to the source, but it was a bit cloudy. One thing's for sure, they didn't win because of this.

The slow fall of the Neos up until the liberating on-boarding rework can be explained by the fact that older accounts had more time to learn what starring really means and/or to star other projects. Still, it would be surprising to see the latest batches of 2017 going past 60 or even 70% in 2019, even if they follow the same pattern as the ones from two years ago.

For the sake of curiosity, what do you think the final star count would really look like if we got rid of all the Smithes at once? A single filter will give us the long-awaited answer you have all been craving.
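
Said filter, reusing the same Smith assumption and lowdb handle as in the previous sketch:

```js
// Keep only the stargazers who starred at least one other repository.
const neoStars = db
  .get('stargazers')
  .filter(({ node }) => node.starredRepositories.totalCount > 1)
  .size()
  .value();

console.log(`Star count without the Smithes: ${neoStars}`);
```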

Do I sound smarter using a LaTeX font? I know I do.

That's still a huge number, you would say, but at least it wouldn't be the top repository anymore, thanks to Bootstrap beating it by a few thousand. One can only imagine what this number would be without any deception, or if my Smith condition were stricter.

Just because I could, I did another iteration of our first visualization using the Neo-rmalized data to see what it could have been:

The fluctuations do not differ that much, only the units do.

Cartography

Because I really like maps, I thought it would also be cool to pin stargazers to world countries based on their publicly available location. GitHub uses a simple text input, allowing pretty much anything, with potentially different formats and little jokes like “localhost” or “Earth” (you're not unique, by the way: there were respectively 20 and 243 of you in the whole dataset), so we will need to normalize that in some way.

The solution I went with was to use the Google Geocoding API. The script is composed of node-geocoder and p-queue to manage query concurrency. With their 2500 requests per day limit on the free plan, however, we will have to find a way around it, or wait a number of days equal to the amount of remaining locations to geocode divided by 2500:

We probably don’t want to wait that long
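
For reference, the skeleton of that first geocoding pass might look like this (the provider options, record shape and concurrency value are my assumptions):

```js
const NodeGeocoder = require('node-geocoder');
const PQueue = require('p-queue');

// Google provider behind a small queue so we don't hammer the API.
const geocoder = NodeGeocoder({ provider: 'google', apiKey: process.env.GOOGLE_KEY });
const queue = new PQueue({ concurrency: 5 });

const geocodeUser = (user) =>
  queue.add(async () => {
    const [match] = await geocoder.geocode(user.location);
    if (match) {
      // Keep only what we need for the map: country and city.
      user.country = match.country;
      user.city = match.city;
    }
    return user;
  });
```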

Since the database is in a single JSON file, we also need to be careful about how we do the writes. I got burned multiple times when I killed my scripts at the exact moment they were writing, resulting in the loss of all data fetched since my last commit. To minimize that, I simply created a cache of users to update, and a 10-second interval function that writes all of them at once. I won't embed gists of these scripts here because they're pretty ugly, but you can still go ahead and look at them if you feel like it.
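
Roughly, that safety net could be as simple as this (assuming the geocoding scripts work on a flattened users collection keyed by login, with a lowdb handle like the one used earlier):

```js
// Stage updates in memory...
const pending = new Map();
const stageUpdate = (user) => pending.set(user.login, user);

// ...and flush them all to lowdb in a single write every 10 seconds.
setInterval(() => {
  if (pending.size === 0) return;
  [...pending.values()].forEach((user) =>
    db.get('users').find({ login: user.login }).assign(user).value()
  );
  pending.clear();
  db.write();
}, 10 * 1000);
```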

After getting throttled a few times, I realized OpenStreetMap had a geocoding equivalent called Nominatim. You still need to be respectful of their policy and limit queries so as not to DDoS their service, so unlike with Google I kept a concurrency of 1. To avoid some queries entirely, we can build a mapping of already geocoded cities by name and try to match it against user locations. This way, we won't need to geocode NYC 3012 times, for example. There is one little issue though: apparently the world's countries never agreed on the uniqueness of city names, which is shocking. I also came across a few oddities: for example, there is a city called Paris in Texas, where they have a little Eiffel Tower with a cowboy hat!

Chapeau.

But let's get back to business. Since some cities are duplicated between countries, often within the Canada, United Kingdom and USA trio, I arbitrarily gave priority to some cities depending on their importance, which may skew some of the results. Given that's how geocoders work anyway (by returning an array of possible results sorted by importance/relevance), the impact of this method compared to querying every location should be pretty limited. I also grabbed a JSON of all countries to populate users who only specified that information.
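
Putting the Nominatim pass together with the city cache, a simplified sketch could look like this (the priority handling between duplicated city names is left out):

```js
const NodeGeocoder = require('node-geocoder');
const PQueue = require('p-queue');

// OpenStreetMap provider (Nominatim), with a concurrency of 1 to respect
// their usage policy, plus a cache so each distinct location string is
// only ever geocoded once.
const geocoder = NodeGeocoder({ provider: 'openstreetmap' });
const queue = new PQueue({ concurrency: 1 });
const cache = new Map();

const resolveLocation = (location) => {
  if (cache.has(location)) return Promise.resolve(cache.get(location));
  return queue.add(async () => {
    const [match] = await geocoder.geocode(location);
    const result = match ? { city: match.city, country: match.country } : null;
    cache.set(location, result);
    return result;
  });
};
```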

With something like 4% of addresses geocoded, I was able to match 39102 user locations on the first pass of the script, leaving 94247 unassigned. After a few back-and-forths between the geocoding and deduplication scripts, I should be able to get all the information I need.

It's worth emphasizing that geocoding is not an exact science, especially when you don't have Google's resources. While looking at my output, I saw a couple of really bad results, like “Venice” getting resolved to Canada, or even “Johannesburg” to a bus stop in Israel. I fixed some of these, but probably not most of them, so our data is not 100% accurate; still good enough to get a global idea!

For the display, I'll be using react-simple-maps, which lets me render the countries as SVG components, with a d3 color scale based on the number of stargazers. Go ahead and click any country to get the top cities (only on desktop though; click again to un-focus).
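
The component itself, stripped of the click-to-focus behaviour and assuming a small starsByCountry.json aggregate, could look like this (the render-prop style matches the 0.x versions of react-simple-maps):

```jsx
import React from 'react';
import { ComposableMap, Geographies, Geography } from 'react-simple-maps';
import { scaleLinear } from 'd3-scale';
import starsByCountry from './starsByCountry.json';

// Color each country based on its number of stargazers.
const colorScale = scaleLinear()
  .domain([0, Math.max(...Object.values(starsByCountry))])
  .range(['#ffedea', '#ff5233']);

const WorldMap = () => (
  <ComposableMap>
    <Geographies geography="/world-110m.json">
      {(geographies, projection) =>
        geographies.map((geography, i) => (
          <Geography
            key={i}
            geography={geography}
            projection={projection}
            style={{
              default: {
                // Matching on properties.name is an assumption about the topojson used.
                fill: colorScale(starsByCountry[geography.properties.name] || 0),
              },
            }}
          />
        ))
      }
    </Geographies>
  </ComposableMap>
);

export default WorldMap;
```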

Yes, the France count is purely coincidental.

I could have used another of our frameworks, deck.gl, to render every single city, but I wasn't sure how that would play with Medium embedding, and especially with mobile users, so I refrained from doing so and stuck to the lightweight solution.

Conclusion

I believe you should not take the darker path to make your project popular at all costs, even if it serves a good cause. Resorting to subterfuge to take advantage of mostly inexperienced people doesn't seem right, especially in the open-source sphere, and even more so if you're specifically targeting people who want to learn.

I might be too harsh, but in any case it was instructive for me to create visualizations about a subject I care deeply about, while sharing my thought process at the same time. I hope someone will appreciate it.

All the code is available here.

