How I built a spreadsheet app with Python to make data science easier

Written by ricklamers | Published 2019/07/30
Tech Story Tags: python | data-science | machine-learning | projects | open-source | latest-tech-stories | programming | hackernoon-top-story

TLDR "Grid studio" is a web-based spreadsheet application with full integration of the Python programming language. Its killer feature is the easy-to-use read and write interface to your spreadsheet. The application runs in a Docker container which gives you easy access to a fully packed and isolated UNIX environment (even on Windows!) with everything ready to go: Python, scikit-learn, numpa, pandas, terminal, wget, zip, zip and much more. Grid studio will be available for free and open source through the GitHub repository.via the TL;DR App

Today I'm open sourcing "Grid studio", a web-based spreadsheet application with full integration of the Python programming language.
About a year ago I started tinkering with the idea of building the data science IDE that I had always wanted. Having worked extensively with Microsoft Excel, R (Studio) and Python, I envisioned how some integrated version of those would make my life easier.

Why?

The main issue that I set out to solve with Grid studio is the scattered workflow that I was experiencing while going back and forth between multiple tools such as R studio and Excel while working on data science projects.
While exporting that CSV file for the gazillionth-time, running into freezing up of application windows when my row count was too high or trying to do something straightforward such as reading in a JSON file I had had enough. The existing tools did not provide me with the environment and associated workflow that enabled me to be productive.
That's why I decided to build something that would integrate my workflow into a single, modern and easy to use application fit for my data science needs.

How does it work?

Grid studio is a web-based application that looks remarkably similar to an ordinary spreadsheet program such as Google Sheets or Microsoft Excel. However, its killer feature is the deep integration of the Python language.
Viewing data in a tabular structure and manipulating it directly feels naturally to almost everybody who has used a computer.
Combining this simple UI with the power of a full fledged programming language such as Python really makes it stand out.
Scripting with Python is as straightforward as possible: just write a few lines and execute it directly.

Core integration: Reading and Writing to sheets

At the core of the Python integration is the read and write interface to your spreadsheet. A high performance connection between your sheets' data and data in your Python process.
Simply write to the sheet like so:
sheet("A1:A3", [1, 2, 3])
And read from the sheet like this:
my_matrix = sheet("A1:A3")
With this simple yet powerful function you can read and write directly from and to sheets in order to automate data entry, extraction, visualization and more.

Writing custom spreadsheet functions

While reading and writing gives you a lot of flexibility through a simple interface, sometimes it makes a lot of sense to write custom functions that can be called directly in your spreadsheet.
Common spreadsheet functions such as AVERAGE, SUM, IF, etc. are already available by default. But what if you need more?
Simply write the function you need!
def UPPERCASE(a):
    return str(a).uppercase()
Now call this function in your spreadsheet, just like you would a regular function.

Leveraging Python ecosystem

By leveraging the power of the Python ecosystem you get instant access to state of the art data science tools:
This enables simple access to powerful models, such as a linear regression and SVMs for modelling your data.

Docker runtime

The application runs in a Docker container which gives you easy access to a fully packed and isolated UNIX environment (even on Windows!) with everything ready to go: Python, scikit-learn, numpa, pandas, terminal, wget, zip, and much more.
This makes installing Grid studio as simple as downloading the prebuilt docker image and running a single command.

Data visualization

A common task in data science is visualizing your data. Given its importance Grid studio has built in support for advanced plotting by integrating interactive plotting library Plotly.js and Python's standard Matplotlib. This provides you with advanced plotting capabilities in vector sharp format.
To give you some ideas about how to use Grid studio's features we show how they can combined with some concrete examples.

Example: Scraping the web

This example shows you the power of having Python at your fingertips. Something that typically requires some back- and forth between tools and files can now be integrated into a single script.
Above, you see how a short script easily loads news articles from Hacker News directly into the sheet.
Source: scrape.py

Example: Estimating a normal distribution

This example shows a somewhat silly use case of estimating the normal distribution with ever higher fidelity visualized with Plotly.js. Here you can see how interactive plotting can give you a sense of what is going on.

How can I use it?

Installing Grid studio locally is very simple:
(Make sure you have Docker installed)

1. Clone the repository with this command:

git clone https://github.com/ricklamers/gridstudio

2. Run the bash script (on Windows use e.g. Git Bash) with this command:

cd gridstudio && ./run.sh

3. Go to http://127.0.0.1:8080 in your browser
Git Bash for Windows - Docker install
Note: if you run into issues feel free to open a issue on GitHub, I'll try to assist/fix as quickly as possible.
Note: on Linux you might need to run (for step 2):
cd gridstudio && sudo ./run.sh
as Docker requires sudo access to run.

Release + Future development

As mentioned in the introduction, today Grid studio will be available for free and open source through the GitHub repository.
If you made it here you are very welcome to try it out yourself and submit any feedback and/or contributions to the project on GitHub.
I have some ideas about which functionality or features could be added in the future to improve Grid studio. However, since the project is now open source I think it would be wise to track these on GitHub and see which have the highest priority according to all involved.
- Expanding the number of 'native' functions available in the spreadsheet (like AVERAGE, SUM, IF, etc), maybe even reaching parity (and consistency) with some existing packages like Libre Office's Calc or Excel
- Syntax highlight/function tooltips when typing formulas in the spreadsheet
- Advanced sorting and filtering in the spreadsheet
- Extended controls for interactive plotting with Plotly.js
- Sharing of workspaces/code more easily (i.e. export workspace)
- Forms of real-time collaboration (this might be too difficult)
- Some sort of API/interface for add-ons/extensions
- Upgrade formula parser to a real grammar based parser
- Python autocompletion
- Performance optimization
- Core Python/sheet integration robustness (no character/sequence breakage)

Open source motivation

While this project was originally intented for commercial release, I've decided it might be better off as an open source project for everyone to experiment with and potentially be developed by a small community of interested data enthousiasts.
The reason for this is that during the initial development of the project I've discovered a number of projects that offer similar functionality to Grid studio.
First, there is an open source plug-in that integrates Python directly into Microsoft Excel called xlwings. Although it does not really integrate spreadsheets and Python into a single coherent product, it does offer the advantage of giving users access to the 'real' fully loaded Excel environment they are already familiar with.
Second, Python has evolved from IPython to Jupyter Notebooks to JupyterLab. It enjoys a lot of popularity and rightly so, it offers a very nice work environment for data scientists with a strong emphasis on explainable code through long form Notebooks. Although, it lacks any kind of spreadsheet functionality that, in my opinion, is so appealing to novice data scientists because of its intuitive behaviour.
Overall, projects like these meant that commercializing Grid studio would mean competing with these product substitutes that are frankly available for the incredibly low price of free.
Regardless, I sincerely believe that Grid studio does have something unique to offer over existing alternatives and could be the tool of choice for quite a few use cases.
Grid studio is available for free and open source through theĀ GitHub repository.

Written by ricklamers | Co-Founder of Orchest. Improving data science tools one commit at a time.
Published by HackerNoon on 2019/07/30