How to Create an Authentic Data Science Project for your Portfolio

Written by feliche93 | Published 2020/10/03
Tech Story Tags: data-science | analytics | portfolio | guide | tutorial | web-scraping | artificial-intelligence | careers | web-monetization

TLDR Felix is an entrepreneur currently building a legal tech start-up for document automation. He wants to show you how he develops interesting data science project ideas and implements them step by step, following along as he explored Germany’s largest travel forum, Vielfliegertreff. He used scrapy as the framework for the following reasons: there were no JavaScript-enabled elements hiding data; the website structure was complex, requiring the crawler to go from each forum subject to all of its threads, and from each thread to all of its post pages; and with scrapy you can easily implement such logic by yielding requests that lead to new callback functions in an organized way.

Follow me along as I explore Germany’s largest travel forum, Vielfliegertreff. As an aspiring data scientist, building interesting portfolio projects is key to showcasing your skills. When I learned coding and data science as a business student through online courses, I disliked that datasets were either made up of fake data or had already been solved many times before, like the Boston House Prices or the Titanic dataset on Kaggle.
In this blogpost, I want to show you how I develop interesting data science project ideas and implement them step by step, such as exploring Germany’s biggest frequent flyer forum, Vielfliegertreff. If you are short on time, feel free to skip to the conclusion and TLDR.

Step 1: Choose a passion topic that is relevant

As a first step, I think about a potential project that fulfills the following three requirements to make it the most interesting and enjoyable:
  • Solving my own problem or burning question
  • Connected to some recent event to be relevant or especially interesting
  • Has not been solved or covered before
As these ideas are still quite abstract, let me give you a rundown of how my projects fulfilled these requirements.
As a beginner, do not strive for perfection; choose something you are genuinely curious about and write down all the questions you want to explore in your topic.

Step 2: Start Scraping together your own dataset

Given that you followed my third requirement, there will be no dataset publicly available, and you will have to scrape the data together yourself. Having scraped a couple of websites, there are three major frameworks I use, depending on the scenario.
For Vielfliegertreff, I used scrapy as the framework for the following reasons:
  • There were no JavaScript-enabled elements that were hiding data.
  • The website structure was complex: the crawler has to go from each forum subject to all of its threads, and from each thread to all of its post pages. With scrapy you can easily implement such logic by yielding requests that lead to new callback functions in an organized way.
  • There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to scrape websites asynchronously at an incredible speed.
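To make the second point concrete, here is a minimal sketch of what such a multi-level spider can look like. The start URL and the XPath selectors are hypothetical placeholders for illustration, not the forum’s actual markup:

```python
import scrapy


class ForumSpider(scrapy.Spider):
    name = "vielfliegertreff"
    start_urls = ["https://www.vielfliegertreff.de/"]  # placeholder start page

    def parse(self, response):
        # Level 1: follow every forum subject to its thread overview.
        for href in response.xpath('//a[@class="forum-link"]/@href').getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # Level 2: follow every thread in the subject, page by page.
        for href in response.xpath('//a[@class="thread-link"]/@href').getall():
            yield response.follow(href, callback=self.parse_thread)
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_forum)

    def parse_thread(self, response):
        # Level 3: yield one item per post on the page.
        for post in response.xpath('//div[@class="post"]'):
            yield {
                "post_id": post.xpath('./@id').get(),  # unique per post (placeholder)
                "author": post.xpath('.//a[@class="author"]/text()').get(),
                "date": post.xpath('.//time/@datetime').get(),
                "text": post.xpath('.//div[@class="message"]').get(),
            }
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_thread)
```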
To give you just an idea of how powerful scrapy is, I quickly benchmarked my MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports) with a 2.3 GHz Quad-Core Intel Core i5, which was able to scrape around 3,000 pages per minute.
To be nice and not get blocked, it is important that you scrape gently, for example by enabling scrapy’s auto-throttle feature. Furthermore, I saved all data to a SQLite database via an items pipeline to avoid duplicates, and logged each URL request to make sure I would not put extra load on the server if I stopped and restarted the scraping process.
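As a sketch of how this can look: the throttling and caching options below are real scrapy settings, while the pipeline and its posts table are a hypothetical illustration of deduplicating into SQLite via a primary key.

```python
# settings.py: scrape gently and avoid re-requesting pages after a restart.
AUTOTHROTTLE_ENABLED = True             # adapt the delay to server response times
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
HTTPCACHE_ENABLED = True                # serve repeated requests from a local cache

# pipelines.py: drop duplicate posts and persist the rest to SQLite.
import sqlite3

from scrapy.exceptions import DropItem


class SqlitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("posts.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(post_id TEXT PRIMARY KEY, author TEXT, date TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        try:
            self.conn.execute(
                "INSERT INTO posts VALUES (?, ?, ?, ?)",
                (item["post_id"], item["author"], item["date"], item["text"]),
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            raise DropItem(f"Duplicate post: {item['post_id']}")
        return item

    def close_spider(self, spider):
        self.conn.close()
```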
Knowing how to scrape gives you the freedom to collect datasets by yourself, and it also teaches you important concepts about how the internet works, what a request is, and how HTML/XPath are structured.
For my project I ended up with 1.47 GB of data, which was close to 1 million posts from the forum.

Step 3: Cleaning your dataset

With your own messy, scraped dataset comes the most challenging part of the portfolio project, the part where data scientists spend on average 60% of their time: data cleaning.
Unlike clean Kaggle datasets, your own dataset allows you to build skills in data cleaning and show a future employer that you are ready to deal with real life messy datasets. Additionally, you can explore and take advantage of the python ecosystem by leveraging libraries that solve some common data cleaning tasks that others solved before.
For my dataset from Vielfliegertreff, there were a couple of common tasks, like turning the dates into pandas timestamps, converting numbers from strings into actual numeric data types, and cleaning the very messy HTML post text into something readable and usable for NLP tasks. While some tasks are a bit more complicated, I would like to share my top 3 favourite libraries that solved some of my common data cleaning problems (a short sketch of them in action follows the list):
  • dateparser: Easily parse localized dates in almost any string format commonly found on web pages.
  • clean-text: Preprocess your scraped data with clean-text to create a normalized text representation. It is also amazing for removing personally identifiable information, such as emails or phone numbers.
  • fuzzywuzzy: Fuzzy string matching like a boss.
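Here is a quick sketch of all three libraries on a single example; the input strings are invented purely for illustration:

```python
import dateparser
from cleantext import clean
from fuzzywuzzy import fuzz

# dateparser handles localized date strings, e.g. German forum dates.
posted_at = dateparser.parse("3. Oktober 2020")

# clean-text normalizes messy text and can strip PII such as emails.
text = clean(
    "Great flight!   Contact me at jane@example.com",
    no_emails=True,  # replaces the address with a placeholder token
    lower=False,
)

# fuzzywuzzy scores near-matches, e.g. inconsistent airline spellings.
score = fuzz.ratio("Lufthansa", "Luftansa")

print(posted_at, text, score)
```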

Step 4: Data Exploration and Analysis

When completing the Data Science Nanodegree on Udacity, I came across the Cross-Industry Standard Process for Data Mining (CRISP-DM), which I thought was quite an interesting framework to structure your work in a systematic way.
With our current flow, we implicitly followed the CRISP-DM process for our project:
We expressed business understanding by coming up with the following questions in step 1:
  • How is COVID-19 impacting online frequent flyer forums like Vielfliegertreff?
  • What are some of the best posts in the forums?
  • Who are the experts that I should follow as a new joiner?
  • What are some of the worst or best things people say about airlines or airports?
And with the scraped data, we are now able to translate our initial business questions from above into specific exploratory data questions (a sketch answering the first two follows the list):
  • How many posts are posted on a monthly basis?
  • Did the posts decrease in the beginning of 2020 after COVID-19?
  • Is there also some sort of indication that less people joined the platform not being able to travel?
  • What are the top 10 posts by number of likes?
  • Who posts the most and also receives, on average, the most likes per post? These are the users I should follow regularly to see the best content.
  • Could a sentiment analysis of every post, combined with named entity recognition to identify cities/airports/airlines, surface interesting positive or negative comments?
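A minimal sketch for the first two questions, assuming the posts sit in the SQLite database from step 2 and the date column has already been cleaned into parseable timestamps in step 3:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("posts.db")  # hypothetical database from step 2
posts = pd.read_sql("SELECT * FROM posts", conn, parse_dates=["date"])

# Posts per month: resample on the timestamp index.
monthly = posts.set_index("date").resample("M").size()

# Compare early 2020 with the same months one year earlier.
print(monthly.loc["2019-01":"2019-06"])
print(monthly.loc["2020-01":"2020-06"])
```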
For the Vielfliegertreff project, one can definitely say that there has been a trend of declining posts over the years. With COVID-19, we can clearly see a rapid decrease in posts from January 2020 onwards, when Europe was shutting down and closing borders, which also heavily affected traveling.
User sign-ups have also gone down over the years, and the forum seems to see less and less of the rapid growth it enjoyed after its start in January 2009.
Last but not least, I wanted to check what the most liked post was about. Unfortunately, it is in German, but it was indeed a very interesting post, in which a German member was allowed to spend some time on a US aircraft carrier and experienced a catapult take-off in a C-2 airplane. The post has some very nice pictures and interesting details. Feel free to check it out here if you can understand some German.

Step 5: Share your work via a Blogpost or Web App

Once you are done with those steps, you can go one step further and create a model that classifies or predicts certain data points. For this project, I did not take machine learning any further, although I had some interesting ideas about classifying the sentiment of posts in connection with certain airlines.
In another project, however, I built a price prediction model that allows a user to get a price estimate for any type of tractor. The model was then deployed with the awesome streamlit framework, which can be found here (be patient, it might load a bit slowly).
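To show how little code such a deployment takes, here is a minimal streamlit sketch; the pickled model file and the two input features are hypothetical stand-ins, not the actual tractor model:

```python
# app.py: run with `streamlit run app.py`
import pickle

import pandas as pd
import streamlit as st

st.title("Tractor Price Estimator")

# Load a previously trained model (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Collect the user's inputs in the sidebar.
year = st.sidebar.number_input("Year built", 1980, 2020, 2010)
hours = st.sidebar.number_input("Operating hours", 0, 50000, 5000)

# Predict and display the price estimate.
features = pd.DataFrame([{"year": year, "hours": hours}])
price = model.predict(features)[0]
st.write(f"Estimated price: {price:,.0f} €")
```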
Another way to share your work is, like me, through blog posts on Medium, Hackernoon, KDNuggets or other popular websites. When writing blog posts, whether about portfolio projects or other topics such as interactive AI applications, I always try to make them as fun, visual and interactive as possible. Here are some of my top tips:
  1. Include nice pictures for easy understanding and to break up long stretches of text
  2. Include interactive elements, like tweets or videos, that let the reader interact
  3. Replace boring tables or static charts with interactive ones through tools and frameworks like airtable or plotly (see the sketch below)
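As a small illustration of the last tip, this sketch turns monthly post counts into an interactive HTML chart you can embed; the numbers are made up purely to show the workflow:

```python
import pandas as pd
import plotly.express as px

# Made-up monthly post counts, purely to illustrate the embedding workflow.
monthly = pd.Series(
    [1200, 950, 400],
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
    name="posts",
).rename_axis("date")

fig = px.line(monthly.reset_index(), x="date", y="posts",
              title="Posts per month on Vielfliegertreff")
fig.write_html("posts_per_month.html")  # interactive chart ready to embed
```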

Conclusion & TLDR

Come up with a blog post idea that answers a burning question you had or solves your own problem.
Ideally, the topic is timely and has not been analysed by anyone else before.
Based on your experience, the website structure, and its complexity, choose a framework that matches the scraping job best.
During data cleaning, leverage existing libraries to solve painful tasks like parsing timestamps or cleaning text.
Finally, choose how you can best share your work. Both an interactive deployed model/dashboard and a well-written Medium blog post can differentiate you from other applicants on the journey to becoming a data scientist.
As always feel free to share with me some great data science resources or some of your best portfolio projects!

Written by feliche93 | Felix is an entrepreneur currently building a legal tech start-up for document automation.
Published by HackerNoon on 2020/10/03