In case you don’t know yet, my partner and I created a site called beautydupes.xyz where users can search for beauty dupes. We monitor Reddit closely to collect supporting data for the dupes we recommend on the site. And for those of you who also monitor Reddit posts, whether to help brands with social listening or for other data-collection purposes, I thought it would be helpful to share the script we use.
import requests
import pandas as pd

subreddits = ['30PlusSkinCare', 'Skincare_Addiction', 'AsianBeauty', 'MakeupAddiction']
limit = 400
timeframe = 'week'  # hour, day, week, month, year, all
listing = 'best'  # controversial, best, hot, new, random, rising, top
1. Import packages.
2. Create a list of the subreddits you want to monitor.
3. Set a limit on the number of posts you want to collect from each subreddit.
4. Select the timeframe of the data pull: the last hour, day, week, month, or year, or all time.
5. Select how the posts are sorted; results come back in descending order. Definitions of the sort types:
- Top is the raw score: upvotes minus downvotes.
- Best weighs the proportion of upvotes to downvotes.
- Controversial gives high scores to posts that have a lot of both upvotes and downvotes: a lot of people like it, but a lot of people also dislike it.
- Hot shows posts that have received a lot of votes recently, either up or down, so it surfaces relatively new posts that are getting attention.
- Rising shows posts that are getting a lot of activity (comments and upvotes) right now.
- New sorts posts by submission time, with the newest at the top of the page.
Using the settings above, we define a get_reddit() function.
def get_reddit(subreddit, listing, limit, timeframe):
    # Build the JSON endpoint URL for this subreddit, sort type, post limit, and timeframe
    base_url = f'https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}&t={timeframe}'
    try:
        # Reddit rejects requests without a custom User-Agent, so set one for your bot
        response = requests.get(base_url, headers={'User-agent': 'yourbot'})
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'An error occurred: {e}')
        return None
    return response.json()
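As a quick sanity check (the subreddit and settings here are just example values), you can call it for a single subreddit and peek at the first post:

r = get_reddit('AsianBeauty', 'top', 5, 'day')
if r is not None:
    # Title of the first post in the listing
    print(r['data']['children'][0]['data']['title'])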
The result we get back from the function is nested JSON, so we define another function to pull out the fields we need and build a DataFrame showing each post's title, URL, score, and number of comments.
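For reference, each post in the response sits under ['data']['children'], roughly in this shape (only the fields we extract are shown; the real payload carries many more):

sample_response = {
    'data': {
        'children': [
            {'data': {'title': '...', 'url': '...', 'score': 123, 'num_comments': 45}},
        ]
    }
}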
def get_results(r):
    # Map each post title to the fields we care about
    myDict = {}
    for post in r['data']['children']:
        myDict[post['data']['title']] = {
            'url': post['data']['url'],
            'score': post['data']['score'],
            'comments': post['data']['num_comments'],
        }
    # Build a DataFrame with the titles as the index
    df = pd.DataFrame.from_dict(myDict, orient='index')
    return df
Then we loop through each subreddit we want to pull data for and stack the results into one DataFrame.
frames = []
for sub in subreddits:
    result = get_reddit(sub, listing, limit, timeframe)
    if result is not None:
        frames.append(get_results(result))
# DataFrame.append was removed in pandas 2.0, so concatenate the frames instead
df_f = pd.concat(frames).reset_index()
df_f
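From here you can slice the table however you like; for example, to see the ten highest-scoring posts across all the subreddits at once:

# Top 10 posts across every subreddit, by score
df_f.sort_values('score', ascending=False).head(10)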
Voilà! You get your table of all the posts, their URLs, scores, and comment counts. You can keep the script running on a server and send yourself alerts of new posts to keep track of the top posts on any subreddit!
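As a rough sketch of that alerting idea (the snapshot filename and the use of the URL as a stable post identifier are my assumptions, not part of the original script), you could compare each run against the previous one and flag anything new:

import os

snapshot = 'reddit_posts.csv'  # hypothetical file holding the previous run's results
if os.path.exists(snapshot):
    previous = pd.read_csv(snapshot)
    # Treat the URL as a stable post identifier (an assumption)
    new_posts = df_f[~df_f['url'].isin(previous['url'])]
    if not new_posts.empty:
        # Swap this print for an email, Slack webhook, or whatever alert you prefer
        print(f'{len(new_posts)} new post(s) since the last run')
# Overwrite the snapshot so the next run compares against this one
df_f.to_csv(snapshot, index=False)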