Cakeday Crawler Information



About the idea

When I discovered the subreddit /r/cakeday, I began to wonder how cakedays are spread across the year. So I decided to scratch that itch, along with finding a way to visualize the results nicely.

About the programming

The description below reflects the version of PRAW I was using when I made this project in 2015. Since then, PRAW has undergone some major updates; you can now access each user's account-creation timestamp directly from the API, so urllib2 isn't needed anymore.

Within Python, I used a mix of PRAW (the Python Reddit API Wrapper) and urllib2. Of course, PRAW was used to access Reddit and gather usernames. However, I needed urllib2 for the reason Tom Chapin (the creator of redditcakeday.com) explained on his site: each Reddit user has an "about.json" file that contains the timestamp of when the user created their account.

Since PRAW did not provide a way to access the information in that file, I needed urllib2 to gather the data.
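As a rough illustration of that original approach, here is a minimal Python 2 sketch, assuming only the public about.json layout; the user agent string and function name are placeholders, not the project's actual code. (In current PRAW, reddit.redditor(name).created_utc gives the same value directly.)

```python
# Python 2 sketch of the 2015-era approach: fetch a user's about.json with
# urllib2 and read the account-creation timestamp.
import json
import urllib2

def get_created_utc(username):
    url = "https://www.reddit.com/user/%s/about.json" % username
    # Reddit rejects urllib2's default user agent, so send a descriptive one.
    request = urllib2.Request(url, headers={"User-Agent": "cakeday-crawler example"})
    about = json.load(urllib2.urlopen(request))
    return about["data"]["created_utc"]  # Unix timestamp, in seconds
```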

With the data in hand, I used cal-heatmap, a JavaScript module for displaying calendar heatmap data (the kind of chart sites like GitHub use to display users' contributions).

One of the features I really liked about cal-heatmap was the ability to highlight individual days. However, the only way to do that is through JavaScript code, so I made the filter bar as a bridge between the user and the code, allowing you to select specific dates.

About the data gathering

Since the source code does most of the technical explaining, I just wanted to answer a few of the 'how' and 'why' questions people may ask.

I restricted the users I recorded to accounts that were at least a week old. Since I was running the crawler non-stop for a few days, I didn't want any bias toward brand-new users.
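The age check itself is just a comparison against the current time; a minimal sketch, assuming the timestamp is the created_utc value in Unix seconds:

```python
import time

WEEK_IN_SECONDS = 7 * 24 * 60 * 60

def is_at_least_a_week_old(created_utc):
    # created_utc comes from about.json (or redditor.created_utc in current PRAW).
    return time.time() - created_utc >= WEEK_IN_SECONDS
```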

For gathering users, the process ran in a fairly simple loop: I used PRAW's "get_random_subreddit" function to pick a target community, then crawled its posts randomly until I had collected at most 1000 users from it. Most subreddits have far more than 1000 users, but I wanted to diversify the users I pulled as much as possible.

Between PRAW's limit on how many requests can be made within a certain time, the slow process of reading each user's cakeday from their about.json file, and the check to make sure the user wasn't already recorded, the crawler was pulling in roughly one user every second or two. My goal was to get 1 million people, but that would have taken about 2.5 weeks of constant running.
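Putting those pieces together, the loop looked roughly like the sketch below. This is not the original 2015 code: it uses current PRAW names (random_subreddit, hot, created_utc), placeholder credentials, and a hot-listing crawl as a stand-in for the random post crawl, but it shows the week-old filter, the duplicate check, and the 1000-user cap per subreddit.

```python
import time

import praw

WEEK_IN_SECONDS = 7 * 24 * 60 * 60
MAX_USERS_PER_SUBREDDIT = 1000

# Placeholder credentials; fill in your own script-app values.
reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="cakeday-crawler example")

seen = set()    # usernames already recorded, to avoid double-counting
cakedays = []   # (username, created_utc) pairs

while True:  # run until stopped
    subreddit = reddit.random_subreddit()
    collected = 0
    for submission in subreddit.hot(limit=None):
        author = submission.author
        if author is None or author.name in seen:
            continue
        created = author.created_utc
        if time.time() - created < WEEK_IN_SECONDS:
            continue  # skip accounts younger than a week
        seen.add(author.name)
        cakedays.append((author.name, created))
        collected += 1
        if collected >= MAX_USERS_PER_SUBREDDIT:
            break
```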

I decided not to include the year as a separating factor, because I was more interested in the time of year people joined than in a timeline of Reddit itself.

I also wanted the displayed calendar to include the leap day (February 29th), so I set the calendar to show the year 2016. Because of the day's rarity, I had expected to see a large number of people signing up on it, but unfortunately only a small number did.
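For a concrete sense of that mapping, here is a small sketch of how the per-day counts could be collapsed onto 2016 and written out as the {unix-timestamp: count} object that cal-heatmap reads; the function and file names are illustrative, not the project's actual ones.

```python
import calendar
import json
import time
from collections import Counter

def to_cal_heatmap_json(created_timestamps, path="cakedays.json"):
    # Count cakedays per (month, day), ignoring the year each account was created.
    counts = Counter()
    for ts in created_timestamps:
        utc = time.gmtime(ts)
        counts[(utc.tm_mon, utc.tm_mday)] += 1

    # Map every (month, day) onto 2016, which has a February 29th.
    # cal-heatmap reads {"<unix timestamp in seconds>": count, ...}.
    data = {}
    for (month, day), count in counts.items():
        ts_2016 = calendar.timegm((2016, month, day, 0, 0, 0, 0, 0, 0))
        data[str(ts_2016)] = count

    with open(path, "w") as f:
        json.dump(data, f)
```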