
PROBLEM STATEMENT
Our client wanted to download all data (including post text and media) on demand from any given subreddit for data analysis and machine learning. The app had to be fast and reliable so that a large amount of data could be collected with nothing missing.
SOLUTION
The app crawled all posts made in a specified subreddit during a given time period. For each scraped post, the extracted data included the title, timestamp, post text, permalink, category, votes, media type (audio/video), and the media itself. The textual data was saved in MongoDB, while the media files were saved on the client's server.
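To make the extracted fields concrete, here is a minimal sketch of how one scraped post might be represented before being saved. The class name PostRecord, the field names, and the media_url field are illustrative assumptions, not the app's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PostRecord:
    """One scraped post. Field names are assumptions for illustration."""
    title: str
    timestamp: datetime
    text: str
    permalink: str
    category: str
    votes: int
    media_type: Optional[str] = None  # "audio", "video", or None for text-only posts
    media_url: Optional[str] = None   # hypothetical: where the media was fetched from

    def to_document(self) -> dict:
        """Return a JSON-serializable dict suitable for a document store."""
        doc = asdict(self)
        # Store the timestamp as an ISO 8601 string in UTC
        doc["timestamp"] = self.timestamp.astimezone(timezone.utc).isoformat()
        return doc


# Example values are made up for the sketch
post = PostRecord(
    title="Example post",
    timestamp=datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc),
    text="Body text",
    permalink="/r/example/comments/abc123/example_post/",
    category="Discussion",
    votes=42,
    media_type="video",
    media_url="https://example.com/clip.mp4",
)
document = post.to_document()
serialized = json.dumps(document, indent=2)
```

Keeping the media reference as a separate field reflects the split described above: the text fields travel together as one document, while the media itself is handled as a file.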
Input
The app took three inputs:
- Name of subreddit to be extracted
- Number of posts to be extracted
- Time range of observation
Output
The extracted data was saved as follows:
- JSON files -> MongoDB
- Media files -> Client's server
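The split between the two destinations can be sketched as below. To stay self-contained, a plain Python list stands in for the MongoDB collection and a temporary directory stands in for the client's server. The function store_post, the post_id and media_path fields, and the file-extension mapping are all assumptions made for illustration. With the pymongo driver, the append call would presumably be replaced by the collection's insert_one method.

```python
import json
import tempfile
from pathlib import Path


def store_post(document, media_bytes, documents, media_dir):
    """Split one scraped post: JSON document to the store, media to disk."""
    doc = dict(document)
    if media_bytes is not None:
        media_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical naming scheme: post id plus an extension from the media type
        extension = {"audio": ".mp3", "video": ".mp4"}.get(doc.get("media_type"), ".bin")
        media_path = media_dir / f"{doc['post_id']}{extension}"
        media_path.write_bytes(media_bytes)
        # Record where the file went so the document can be joined to its media
        doc["media_path"] = str(media_path)
    # Round-trip through JSON so only serializable documents reach the store
    documents.append(json.loads(json.dumps(doc)))
    return doc


with tempfile.TemporaryDirectory() as tmp:
    documents = []  # stand-in for a MongoDB collection
    stored = store_post(
        {"post_id": "abc123", "title": "Example post", "media_type": "video"},
        b"\x00\x01",
        documents,
        Path(tmp) / "media",
    )
    saved_bytes = Path(stored["media_path"]).read_bytes()
```

Storing the file path inside the document is one way to keep the text record and its media linked once they live in different places.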