Reddit Data Collector
Large-Scale Reddit Data Collection and Processing for Research Analytics
PROBLEM STATEMENT
The client needed Reddit data for a research project. They wanted to collect a very large dataset without being rate-limited or blocked. The data had to be parsed, cleaned, and stored in GCP BigQuery. Code optimization was critical, as the dataset spanned millions of entries.
SOLUTION
To get a complete picture of every post made in a subreddit since the beginning of its history, I implemented a multi-pronged approach. I used the Pushshift dumps to fetch the IDs of all posts ever made in a subreddit. These IDs were then fed into PRAW (a Python wrapper for the Reddit API), which returned the latest data for each post. For speed, I used multi-threading, with several Reddit accounts retrieving data in parallel.
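The parallel-fetch step can be sketched as below. This is a minimal illustration, not the project's actual code: `fetch_posts` is a hypothetical stand-in for the real PRAW call (in practice, each thread would use its own authenticated `praw.Reddit` client and fetch posts by ID), and `chunk` / `fetch_in_parallel` are names invented here.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_posts(account_name, post_ids):
    # Hypothetical stand-in for the real per-account PRAW fetch.
    # The real project would query Reddit here with that account's client.
    return [(account_name, pid) for pid in post_ids]

def chunk(ids, size):
    """Split the Pushshift ID list into fixed-size batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def fetch_in_parallel(accounts, post_ids, batch_size=100):
    """Round-robin ID batches across accounts and fetch them concurrently."""
    batches = chunk(post_ids, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=len(accounts)) as pool:
        futures = [
            pool.submit(fetch_posts, accounts[i % len(accounts)], batch)
            for i, batch in enumerate(batches)
        ]
        for f in futures:
            results.extend(f.result())
    return results
```

One worker per account keeps each account's request rate low, which is what makes the multi-account approach useful against rate limits.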
De-duplication was enforced through multiple checks, and a proxy service (Bright Data) was used to handle IP blocking. The gathered data was divided into the following tables:
- Subreddits (subreddit name, created date, description, etc.)
- Posts (post ID, moderator ID, subreddit, comments with complete hierarchy, upvotes, timestamp, etc.)
- Moderators (subreddit name, moderator name, communities moderated, karma, etc.)
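Because post IDs arrive from two sources (Pushshift dumps and live PRAW fetches), one of the de-duplication checks can be sketched as keeping a single record per post ID, preferring the freshest data. This is an illustrative sketch, not the project's actual implementation; the field names `post_id` and `timestamp` are assumptions taken from the Posts table above.

```python
def deduplicate(posts):
    """Keep one record per post_id, preferring the most recent timestamp."""
    latest = {}
    for post in posts:
        pid = post["post_id"]
        if pid not in latest or post["timestamp"] > latest[pid]["timestamp"]:
            latest[pid] = post
    return list(latest.values())
```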
Input
List of Subreddits
Output
The data (posts, moderators, subreddit metadata) for all the subreddits was stored in BigQuery tables
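A BigQuery table schema for the Posts table could look like the sketch below, expressed in BigQuery's JSON schema format as a Python structure. The field names and types here are illustrative assumptions based on the table descriptions above, not the project's real schema; note how a `REPEATED RECORD` field can hold the nested comment hierarchy.

```python
# Illustrative BigQuery JSON schema for the Posts table (assumed fields).
POSTS_SCHEMA = [
    {"name": "post_id",      "type": "STRING",    "mode": "REQUIRED"},
    {"name": "subreddit",    "type": "STRING",    "mode": "REQUIRED"},
    {"name": "moderator_id", "type": "STRING",    "mode": "NULLABLE"},
    {"name": "upvotes",      "type": "INTEGER",   "mode": "NULLABLE"},
    {"name": "timestamp",    "type": "TIMESTAMP", "mode": "NULLABLE"},
    # Comments stored as a repeated record to preserve the reply hierarchy.
    {"name": "comments", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "comment_id", "type": "STRING", "mode": "REQUIRED"},
        {"name": "parent_id",  "type": "STRING", "mode": "NULLABLE"},
        {"name": "body",       "type": "STRING", "mode": "NULLABLE"},
    ]},
]
```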
Tools & Technologies
Python (Scrapy)
Reddit API
PRAW API
BigQuery
Bright Data
GCP