Reddit Data Collector

Large-Scale Reddit Data Collection and Processing for Research Analytics

 

PROBLEM STATEMENT

The client needed Reddit data for a research project: a very large dataset collected without being blocked, then parsed, cleaned, and stored in Google Cloud BigQuery. Code optimization was a critical requirement, since the dataset ran to millions of entries.

SOLUTION

To get a complete picture of every post made in a subreddit since its creation, I implemented a multi-pronged approach. I used the Pushshift dumps to fetch the IDs of all posts ever made in a subreddit, then fed those IDs into PRAW (a Python wrapper for the Reddit API), which returned the latest data for each post. For speed, I used multi-threading, with several accounts retrieving data in parallel.
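The ID-batching and multi-threaded fetch described above can be sketched roughly as follows. Credential dicts, batch sizes, and the selected fields are illustrative assumptions; PRAW's `reddit.info()` accepts fullnames of the form `t3_<id>` in batches of up to 100:

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(ids, size=100):
    """Split Pushshift-sourced post IDs into batches of `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_batch(reddit, batch):
    """Fetch current data for one batch of post IDs via PRAW."""
    fullnames = [f"t3_{post_id}" for post_id in batch]
    return [
        {"id": s.id, "subreddit": str(s.subreddit), "score": s.score,
         "created_utc": s.created_utc, "num_comments": s.num_comments}
        for s in reddit.info(fullnames=fullnames)
    ]


def fetch_all(post_ids, accounts):
    """One PRAW client per account; batches fan out across a thread pool."""
    import praw  # imported lazily: the batching helper works without PRAW installed
    clients = [praw.Reddit(**creds) for creds in accounts]  # hypothetical credential dicts
    batches = chunk(post_ids)
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = [
            pool.submit(fetch_batch, clients[i % len(clients)], batch)
            for i, batch in enumerate(batches)
        ]
        return [row for future in futures for row in future.result()]
```

Round-robin assignment of batches to clients spreads the request load evenly across the accounts, which is what keeps any single account under Reddit's rate limits.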

De-duplication was enforced through multiple checks, and a proxy service (Bright Data) was used to avoid blocking. The gathered data was divided into the following tables:

  • Subreddits (subreddit name, created date, description, etc.)
  • Posts (post ID, moderator ID, subreddit, comments with complete hierarchy, upvotes, timestamp, etc.)
  • Moderators (subreddit name, moderator name, communities moderated, karma, etc.)
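One of the de-duplication checks mentioned above can be sketched as an in-memory seen-set applied within a collection run (the function and key names are illustrative; a second, persistent check against the BigQuery table would catch duplicates across runs):

```python
def dedupe(rows, seen=None, key="id"):
    """Drop rows whose key has already been collected in this run.

    `seen` can be shared across calls so that successive batches
    from different worker threads are checked against one set.
    """
    seen = set() if seen is None else seen
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique
```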

Input

List of Subreddits

Output

The data (posts, moderators, subreddit metadata) for all the subreddits was stored in BigQuery tables
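Loading the output into BigQuery might look like the sketch below. The table name and row schema are assumptions based on the tables listed earlier; `insert_rows_json` is the streaming-insert call in the `google-cloud-bigquery` client library:

```python
def to_post_row(post):
    """Shape a fetched post dict into a Posts-table row (illustrative schema)."""
    return {
        "post_id": post["id"],
        "subreddit": post["subreddit"],
        "upvotes": post["score"],
        "timestamp": post["created_utc"],
    }


def load_posts(rows, table="my-project.reddit.posts"):
    """Stream shaped rows into a BigQuery table; failures are reported per row."""
    from google.cloud import bigquery  # lazy import: requires GCP credentials
    client = bigquery.Client()
    errors = client.insert_rows_json(table, [to_post_row(r) for r in rows])
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```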

Tools & Technologies

  • Python (Scrapy)
  • Reddit API
  • PRAW
  • BigQuery
  • Bright Data
  • GCP
  • Cronjob

