10 Best Reddit Datasets for NLP and Other ML Projects

In this post, I wanted to share a Reddit dataset list that gained a lot of traction on social media when it was first posted.

Known as “the front page of the internet,” Reddit is part forum, part social media site, where users can post virtually anything and everything.

Unlike users of Facebook, Twitter, or Instagram, most Reddit users remain anonymous. Moderators curate the subforums, known as subreddits, and remove content that breaks their rules.

However, anonymity lets people say what they want in whatever manner they wish. That range of unfiltered language makes Reddit comments and posts well suited to training and testing natural language processing (NLP) models.

Warning: Some of the datasets below were compiled specifically for the training of content moderation models. Therefore, the data may include explicit content.

Reddit Comments Datasets

1. Cryptocurrency Reddit Comments Dataset – This dataset contains comments from the subreddit r/cryptocurrency. The data consists of comments posted over five months from November 2017 to March 2018.

2. Donald Trump Comments on Reddit – A simple dataset containing thousands of comments crawled from Reddit that mention Donald Trump.

3. Reddit Comment Score Prediction – This dataset was built to help create a model that can predict whether a Reddit comment will be upvoted or downvoted. The dataset includes 4 million Reddit comments: 2 million poor-performing (downvoted) and 2 million high-performing (upvoted).

Reddit News Datasets

4. Daily News for Stock Market Prediction – As the title suggests, this dataset was originally made to create models that could predict stock market fluctuations. The data consists of news crawled from r/worldnews from June 2008 to July 2016, as well as Dow Jones Industrial Average stock data.

5. World News on Reddit – Taken from the r/worldnews subreddit, this dataset contains info about all of the news posted on this subreddit dating back to 2008. The dataset includes the following info: date created, upvotes and downvotes, title, author, and whether or not the news contains mature content.
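As a rough illustration of working with a file like this, the sketch below reads a small inline CSV sample with Python's standard csv module and keeps the well-upvoted headlines that are not flagged as mature. The column names (date_created, up_votes, down_votes, title, author, over_18) and the sample rows are assumptions made up for the example, so check the header of the file you actually download.

```python
import csv
import io

# Hypothetical sample: column names and rows are invented for illustration
# and may not match the real dataset's header.
SAMPLE_CSV = """date_created,up_votes,down_votes,title,author,over_18
2008-01-25,3,0,Example headline one,user_a,False
2008-01-26,120,0,Example headline two,user_b,False
2008-01-27,45,0,Example headline three,user_c,True
"""

def load_posts(file_obj):
    """Parse CSV rows into dicts with integer vote counts and a boolean flag."""
    posts = []
    for row in csv.DictReader(file_obj):
        posts.append({
            "date_created": row["date_created"],
            "up_votes": int(row["up_votes"]),
            "down_votes": int(row["down_votes"]),
            "title": row["title"],
            "author": row["author"],
            "over_18": row["over_18"].strip().lower() == "true",
        })
    return posts

def popular_safe_titles(posts, min_up_votes):
    """Return titles of non-mature posts with at least min_up_votes upvotes."""
    return [
        p["title"]
        for p in posts
        if not p["over_18"] and p["up_votes"] >= min_up_votes
    ]

posts = load_posts(io.StringIO(SAMPLE_CSV))
print(popular_safe_titles(posts, min_up_votes=10))
```

To run this against a real file, pass an opened file object to load_posts in place of the io.StringIO wrapper and adjust the column names to match.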

Other Data from Reddit

6. Reddit’s Top 1000 – This dataset contains the top 1,000 posts of all time, ranked by upvotes, from 18 subreddits. For each post, the CSV files contain the title of the post and the username of the poster. The number of upvotes and downvotes, subreddit name, URL, and other metadata are also included.

7. Reddit Usernames – A simple dataset containing a CSV file of 26 million usernames of Reddit users. Furthermore, the dataset includes the total number of comments each user has made.

8. SARC: Self-Annotated Reddit Corpus for Sarcasm – This dataset consists of over 1.3 million sarcastic comments and posts crawled from Reddit. As the name indicates, the sarcasm labels are self-annotated, meaning they come from the comment authors themselves. The username of the poster, the topic, and the context are also included with each statement.

9. Science and Tech Acronyms from Reddit – This dataset contains over 140,000 acronyms found on subreddits about science, biology, technology, and futurology. The data is in the form of a CSV file which includes the comment ID, time, username, subreddit name, and the acronym mentioned.

10. Things on Reddit (products) – This product dataset is a collection of the top 100 Amazon products from every subreddit in which an Amazon product was posted between 2015 and 2017. Each CSV file in the dataset includes the name of the product, its category, and the URL to the product. The total mentions on Reddit and total subreddit mentions are also included in the data.

The datasets above could be used to help train sentiment analysis models, text classifiers, predictive models, and other NLP algorithms.
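To give a feel for the text classification use case, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with only the Python standard library. The training sentences and the "positive"/"negative" labels are toy examples invented for illustration, not taken from any dataset listed here. For real work you would more likely reach for an established library such as scikit-learn.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            # Add the smoothed log likelihood of each token.
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy, made-up training data (hypothetical comments and labels).
train_texts = [
    "great advice thank you",
    "this is really helpful",
    "terrible take and wrong",
    "awful post completely wrong",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
print(model.predict("really great and helpful"))
print(model.predict("completely wrong"))
```

With a real dataset, the texts would be comment bodies and the labels would come from a column such as a score bucket or a sarcasm flag, depending on what the file provides.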

For more datasets, please view our related resources.

Also published on: https://lionbridge.ai/datasets/top-10-reddit-datasets-for-machine-learning/

Lead image via Erik Mclean on Unsplash