One More Analysis of GitHub and StackOverflow Data with Google BigQuery

Written by sAbakumoff | Published 2017/01/30
Tech Story Tags: github | bigquery | stackoverflow | google-cloud-platform | open-data


TL;DR: I built a web-site where you can explore the Stack Overflow questions referenced in source code on GitHub. Check it out at http://sociting.biz

Motivation

I am a big fan of Google Cloud Platform, and I especially love its data warehouse, BigQuery. In the summer of 2016, GitHub and Google made open-source data available to everyone in BigQuery. Here are the mind-boggling numbers:

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

I have never since tired of exploring this data, uncovering interesting patterns or extreme samples and publishing articles about my findings.

In December 2016 the Google team woke up my “researcher within” again: they added Stack Overflow’s history of questions and answers to the collection of public datasets on BigQuery. In practice, that means the most popular programming chat in the world can now be analyzed with the power of Google Cloud Platform. For example, one can run sentiment analysis on the Stack Overflow data and find out that Python developers post the lowest percentage of negative comments overall! What excites me the most, though, is the ability to join the Stack Overflow data with other publicly available datasets. For example, one can try to find out whether the weather affects the probability of a Stack Overflow question being answered by using the NOAA dataset (I am actually going to conduct this research soon).

In the introduction to the Stack Overflow data availability, Felipe Hoffa provided a sample that joins GitHub and Stack Overflow data to find the most referenced Stack Overflow questions in GitHub code, specifically JavaScript. It caught my attention because I noticed a couple of limitations:

* The query searches only for the stackoverflow.com/questions/([0-9]+)/ pattern in the source code. However, there are alternative forms of referencing questions: it could be the short form stackoverflow.com/q/([0-9]+)/, or it could be a direct reference to one of the answers, like stackoverflow.com/answers/([0-9]+)/
* The query deals only with JavaScript sources, but there are plenty of other programming languages.

So I set out to build a catalog of the Stack Overflow questions referenced in GitHub sources for popular programming languages.

Getting the data

Step 1. Find the lines of code in GitHub sources that reference Stack Overflow questions or answers. The contents_top_repos_top_langs table, which holds file contents for the top languages from the top repos, was kindly provided by Felipe Hoffa.

The result of this step was saved in a new table called so_ref_top_repos_top_langs.
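As a rough illustration of the matching involved in Step 1, here is a minimal Python sketch. It is not the BigQuery query used for the analysis. The `SO_REF` pattern and the `extract_refs` helper are hypothetical names; the pattern covers the three URL forms discussed earlier, plus the `/a/` short form for answers, which is my own addition. It also does not require a trailing slash after the ID.

```python
import re

# Assumed pattern, not the original query: matches the long and short
# question forms and the answer forms ("/a/" is an addition made here).
SO_REF = re.compile(r"stackoverflow\.com/(questions|q|answers|a)/([0-9]+)")


def extract_refs(line):
    """Return (kind, post_id) pairs for every Stack Overflow link in a line of code."""
    refs = []
    for form, post_id in SO_REF.findall(line):
        # Both question forms map to "question"; both answer forms to "answer".
        kind = "question" if form in ("questions", "q") else "answer"
        refs.append((kind, int(post_id)))
    return refs
```

Keeping the kind next to the ID matters for the next step, because a question ID and an answer ID have to be resolved against different tables.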

Step 2. Join the result with the Stack Overflow data. The query should handle both the question IDs and the answer IDs extracted from the source code.
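The idea behind Step 2 can be sketched with an in-memory SQLite database standing in for BigQuery. This is an illustration under assumptions, not the query that produced the results: the `refs`, `questions` and `answers` tables and the `resolve_refs` helper are hypothetical, and the only structural assumption about the Stack Overflow data is that each answer row carries the ID of its parent question.

```python
import sqlite3


def resolve_refs(refs, questions, answers):
    """Map (kind, post_id, path) references to (path, question_id, title) rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE refs (kind TEXT, post_id INTEGER, path TEXT);
        CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE answers (id INTEGER PRIMARY KEY, parent_id INTEGER);
    """)
    db.executemany("INSERT INTO refs VALUES (?, ?, ?)", refs)
    db.executemany("INSERT INTO questions VALUES (?, ?)", questions)
    db.executemany("INSERT INTO answers VALUES (?, ?)", answers)
    rows = db.execute("""
        SELECT r.path, q.id, q.title
        FROM refs r
        -- Answer references are first resolved to their parent question.
        LEFT JOIN answers a ON r.kind = 'answer' AND a.id = r.post_id
        JOIN questions q
          ON q.id = CASE r.kind WHEN 'question' THEN r.post_id ELSE a.parent_id END
        ORDER BY r.path
    """).fetchall()
    db.close()
    return rows
```

References whose ID matches no known question or answer simply drop out of the result, which is usually what you want for IDs extracted from arbitrary source text.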

The query result contained roughly 31K records. The next question was how to visualize them.

Building the web-site

First of all, I moved the resulting data to a SQLite database, creating a separate table for each programming language. Then I built the [web-site](http://sociting.biz), which lets you navigate the data by switching between languages and jump to the GitHub source code to see how the information from the questions and answers was applied in specific scenarios. I also took this opportunity to play with ASP.NET Core and implement the web-site on my MacBook Pro, without Windows being involved. The resulting application uses the cross-platform ASP.NET Core MVC Web API on the back end and React + Redux on the front end. The source code is fully available in the GitHub repo.
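For readers who want to try the one-table-per-language layout, here is a minimal Python sketch using the standard sqlite3 module. It is not taken from the site's repository: the `store_by_language` helper, the `refs_` table prefix and the column names are all hypothetical.

```python
import re
import sqlite3


def store_by_language(db, records):
    """Write (language, question_id, title, github_path) records, one table per language."""
    for language, question_id, title, path in records:
        # Table names cannot be bound as parameters, so build a safe identifier.
        # "+" and "#" are spelled out so that C, C++ and C# do not collide.
        name = language.lower().replace("+", "p").replace("#", "sharp")
        table = "refs_" + re.sub(r"[^a-z0-9]+", "_", name).strip("_")
        db.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" '
            "(question_id INTEGER, title TEXT, github_path TEXT)"
        )
        db.execute(f'INSERT INTO "{table}" VALUES (?, ?, ?)', (question_id, title, path))
    db.commit()
```

A single table with a language column would work just as well; separate tables mainly keep each language's data independent when switching between them.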



Published by HackerNoon on 2017/01/30