Collecting Reblog Data Like It’s My Summer Job
Have you been wondering how I spent the month of July? In addition to sweating through the hottest Ontario summer I’ve ever experienced, I’ve been collecting tags and added commentary from 136,199 reblogs of Martin and Bosco’s post. I’m using the dataset I built from Tumblr’s /notes API endpoint, which you might vaguely recall from the data dashboard in a previous blog post. Since the /posts API only lets me fetch one reblog at a time, the data collection process is slow and repetitive. I have to keep reminding myself it’s a marathon, not a sprint.
So, Laura, how is the data collection going?
Figure 1: 54% complete after 39 days of collecting tags and commentary from the reblogs. I’ll be spending the month of August doing more of the same.
In case you’re not fluent in Tumblr yet:
A reblog is when someone shares another user’s post to their own Tumblr blog. When a person reblogs a post, they can also add their own thoughts as commentary or tags.
Tags work differently on Tumblr than on other social media platforms; they’re not primarily used for searching. Instead, people tag a post to provide personal context, add witty insight, or share stories. It’s considered less intrusive than adding commentary directly to the original post. Tags are typically written as short phrases, such as #martin and bosco, #the boys are back!, or #fun we can afford in this economy.
Why Is Data Collection from the /posts API Taking So Long?
Here’s a summary of how the /posts API data collection is going:
- Collection Start Date: June 23, 2025
- Total Reblogs: 136,199
- Current Progress: Retrieved metadata from 54% of reblogs (as of August 1, 2025)
- R Script Run: 92 times (so far)
- Estimated Completion Date: Early September? I sure hope so.
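For anyone who wants to check the "early September" guess, here is a rough back-of-the-envelope projection built only from the numbers in the summary above. It is a Python sketch, not the R script used for the actual collection, and it assumes the pace of the first 39 days simply continues. The function name is made up for illustration.

```python
from datetime import date, timedelta

# Figures taken from the summary above.
TOTAL_REBLOGS = 136_199
START = date(2025, 6, 23)
AS_OF = date(2025, 8, 1)
FRACTION_DONE = 0.54


def project_completion(total, start, as_of, fraction_done):
    """Linear extrapolation: assumes the average daily pace so far holds."""
    days_elapsed = (as_of - start).days
    collected = total * fraction_done
    per_day = collected / days_elapsed
    days_left = (total - collected) / per_day
    finish = as_of + timedelta(days=round(days_left))
    return days_elapsed, per_day, finish


if __name__ == "__main__":
    days, per_day, finish = project_completion(
        TOTAL_REBLOGS, START, AS_OF, FRACTION_DONE
    )
    print(f"{days} days so far, about {per_day:.0f} reblogs per day")
    print(f"Projected finish: {finish.isoformat()}")  # 2025-09-03
```

At roughly 1,900 reblogs per day, the remaining 46% works out to about 33 more days, which lands on September 3. Any skipped day pushes that date later.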
Tumblr’s API has strict limits on both how much data I can collect and how fast I can collect it. Every reblog I’m analyzing requires its own request to the /posts endpoint, and I need to make over 136,000 individual requests to complete the dataset.
Unfortunately, Tumblr imposes several overlapping limits on API use, including:
- 300 requests per minute (per IP address)
- 1,000 requests per hour (per account)
- 5,000 requests per day (per account)
These rate limits don’t play nicely together. Even if I stay under the per-minute cap, I might still hit the hourly or daily one. And if I hit the API too aggressively, there’s a real risk Tumblr will permanently block my access, with no way to appeal.
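To see why the three limits clash, it helps to convert them to a common unit. The Python sketch below is an illustration only, not part of the actual R workflow. It uses the limits exactly as listed above, and the function names are hypothetical.

```python
import math

# Limits as listed in this post (requests allowed per window).
LIMITS = {"minute": 300, "hour": 1_000, "day": 5_000}
MINUTES_IN = {"minute": 1, "hour": 60, "day": 24 * 60}
TOTAL_REBLOGS = 136_199


def sustained_rate_per_minute(window):
    """Average requests per minute that one limit allows over its window."""
    return LIMITS[window] / MINUTES_IN[window]


def minutes_to_exhaust(window, requests_per_minute):
    """Minutes until a window's allowance is used up at a constant rate."""
    return LIMITS[window] / requests_per_minute


def minimum_days(total_requests):
    """Fewest calendar days possible if every day hits the daily cap."""
    return math.ceil(total_requests / LIMITS["day"])


if __name__ == "__main__":
    for window in LIMITS:
        rate = sustained_rate_per_minute(window)
        print(f"per-{window} limit allows {rate:.2f} requests/minute")
    print(f"hourly cap gone in {minutes_to_exhaust('hour', 300):.2f} min")
    print(f"best case: {minimum_days(TOTAL_REBLOGS)} days")
```

Running flat out at the per-minute cap would use up the hourly allowance in a little over three minutes. The daily cap is the tightest of the three: it averages out to about 3.5 requests per minute, so even a perfectly paced collection would need at least 28 days.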
I put a considerable amount of thought into how my R script behaves. Each data collection session is capped at 825 reblogs, which takes about 54 minutes to complete. After each API request (which retrieves data for one reblog), the script pauses for 2 seconds. Once it has collected 50 reblogs, it takes a longer 90-second break to cool down, then automatically resumes the loop. When the session ends, I stop for at least four hours before starting the loop again. I typically run two sessions per day, occasionally three, which means I stay well below the daily limit of 5,000 requests. It’s a tedious and repetitive process, but it keeps me safely within Tumblr’s rate limits. I’m gradually grinding away at the full dataset while embodying the old joke: How do you eat an elephant? One bite at a time.
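The pacing described above can be sketched in a few lines. This is a Python illustration, not the R script itself. The `fetch` argument is a hypothetical stand-in for whatever makes the real /posts request, and the sketch assumes the 90-second cooldown comes on top of the usual 2-second pause rather than replacing it.

```python
import time
from itertools import islice


def run_session(reblog_ids, fetch, sleep=time.sleep,
                session_cap=825, pause_s=2.0,
                batch_size=50, cooldown_s=90.0):
    """Fetch up to session_cap reblogs, pausing as described in the post.

    fetch is a hypothetical callable that retrieves one reblog.
    sleep is injectable so the loop can be dry-run without waiting.
    """
    results = []
    capped = islice(reblog_ids, session_cap)
    for count, reblog_id in enumerate(capped, start=1):
        results.append(fetch(reblog_id))
        sleep(pause_s)  # short pause after every request
        if count % batch_size == 0:
            sleep(cooldown_s)  # longer break after each batch of 50
    return results


if __name__ == "__main__":
    # Dry run: record the pauses instead of actually sleeping.
    pauses = []
    data = run_session(range(2000), fetch=lambda i: {"id": i},
                       sleep=pauses.append)
    print(len(data), sum(pauses) / 60)  # 825 51.5
```

The dry run gives 51.5 minutes of pauses for 825 requests, which leaves about two and a half minutes of actual request time to reach the 54 minutes per session mentioned above. Capping a session at 825 also keeps it under the 1,000-per-hour limit, and three sessions a day come to 2,475 requests, about half of the daily limit.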
Stay Tuned
Now that half of the data from the /posts API endpoint has been collected, we can get a first glimpse of how Martin and Bosco traveled across the Tumblr community through reblog networks. The tags will tell us how people feel about Martin and Bosco’s post appearing on their dashboard. We’ll also gain insight into what people think about the blaze feature, Tumblr’s version of a paid or sponsored post. Martin and Bosco’s post has been blazed by users 16 times, likely more than any other post on the platform. Since those blaze campaigns were all user-sponsored, the tags offer a rare window into how the Tumblr community responds to seeing the same post repeatedly promoted.
This is where the project starts getting exciting!
In the next blog post, I’ll share an interactive dendrogram that explores initial themes that emerged from how people used the word “blaze” in their reblog tags. If you’re not sure what a dendrogram is, then don’t worry! I promised in my welcome post that you would learn a little data science along the way.
Thumbnail Image Credit
The thumbnail image used in the preview for this blog post, a weathered roadworks sign stuck in a pile of dirt, was taken by Sergei Starostin and is available on Pexels. The person on the sign is literally digging a hole in the dirt, which is metaphorically perfect for my project:
Slow, manual effort. No end in sight, but still making progress.