July Update: 39 Days of Data Collection

Collecting Reblog Data Like It’s My Summer Job

Have you been wondering how I spent the month of July? In addition to sweating through the hottest Ontario summer I’ve ever experienced, I’ve been collecting tags and added commentary from 136,199 reblogs of Martin and Bosco’s post. I’m using the dataset I built from Tumblr’s /notes API endpoint, which you might vaguely recall from the data dashboard in a previous blog post. Since the /posts API only lets me fetch one reblog at a time, the data collection process is slow and repetitive. I have to keep reminding myself it’s a marathon, not a sprint.

So, Laura, how is the data collection going?

Progress on Collecting Tags and Commentary from the Reblogs

Figure 1: Progress bar (0 to 100%) showing 54% complete after 39 days of collecting tags and commentary from the reblogs. I’ll be spending the month of August doing more of the same.

In case you’re not fluent in Tumblr yet:

Tumblr Terminology: What are reblogs and tags?

A reblog is when someone shares another user’s post to their own Tumblr blog. When a person reblogs a post, they can also add their own thoughts as commentary or tags.

Tags work differently on Tumblr than on other social media platforms; they’re not primarily used for searching. Instead, people tag a post to provide personal context, add witty insight, or share stories. It’s considered less intrusive than adding commentary directly to the original post. Tags are typically written as short phrases, such as #martin and bosco, #the boys are back!, or #fun we can afford in this economy.

Why Is Data Collection from the `/posts` API Taking So Long?

Here’s a summary of how the /posts API data collection is going:

Collection Start Date: June 23, 2025
Total Reblogs: 136,199
Current Progress: Retrieved metadata from 54% of reblogs (as of August 1, 2025)
R Script Run: 92 times (so far)
Estimated Completion Date: Early September? I sure hope so.
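
For anyone who wants to check the "early September" guess, here is a rough back-of-the-envelope sketch. It is written in Python purely for illustration and is not taken from the R script. It assumes the pace described further down in this post (sessions of 825 reblogs, run two or three times a day); the function name `estimate_completion` is made up for this example.

```python
import math
from datetime import date, timedelta

# Figures quoted in this post
TOTAL_REBLOGS = 136_199
FRACTION_DONE = 0.54          # progress as of August 1, 2025
AS_OF = date(2025, 8, 1)
SESSION_CAP = 825             # reblogs collected per session


def estimate_completion(total, fraction_done, as_of, per_session, sessions_per_day):
    """Return (reblogs left, sessions left, projected finish date).

    Assumes every remaining session collects a full `per_session` reblogs
    and that no days are skipped, so this is an optimistic estimate.
    """
    remaining = total - round(total * fraction_done)
    sessions_left = math.ceil(remaining / per_session)
    days_left = math.ceil(sessions_left / sessions_per_day)
    return remaining, sessions_left, as_of + timedelta(days=days_left)


two_a_day = estimate_completion(TOTAL_REBLOGS, FRACTION_DONE, AS_OF, SESSION_CAP, 2)
three_a_day = estimate_completion(TOTAL_REBLOGS, FRACTION_DONE, AS_OF, SESSION_CAP, 3)

print(two_a_day[0], two_a_day[1])  # 62652 76
print(two_a_day[2])                # 2025-09-08
print(three_a_day[2])              # 2025-08-27
```

Under these assumptions the finish line lands somewhere between late August and the second week of September, which is in line with the estimate above.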

Tumblr’s API has strict limits on both how much data I can collect and how fast I can collect it. Every reblog I’m analyzing requires its own request to the /posts endpoint, and I need to make over 136,000 individual requests to complete the dataset.

Unfortunately, Tumblr imposes several overlapping limits on API use, including:

300 requests per minute (per IP address)
1,000 requests per hour (per account)
5,000 requests per day (per account)

These rate limits don’t play nicely together. Even if I stay under the per-minute cap, I might still hit the hourly or daily one. And if I hit the API too aggressively, there’s a real risk Tumblr will permanently block my access, without any way to appeal.
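
To see how the three caps interact, here is a small arithmetic sketch. It is in Python for illustration only (not the R script), it uses only the limits listed above, and the helper names are made up for this example.

```python
import math

# Limits and totals quoted in this post
TOTAL_REBLOGS = 136_199
PER_MINUTE_CAP = 300
PER_HOUR_CAP = 1_000
PER_DAY_CAP = 5_000


def sustainable_per_hour(per_minute, per_hour, per_day):
    """Most requests possible in one hour without breaking any single cap."""
    return min(per_minute * 60, per_hour, per_day)


def minutes_until_hourly_cap(per_minute, per_hour):
    """How long the per-minute cap can be maxed out before the hourly cap bites."""
    return per_hour / per_minute


def min_days(total, per_day):
    """Fewest calendar days needed if the daily cap were used in full every day."""
    return math.ceil(total / per_day)


hourly = sustainable_per_hour(PER_MINUTE_CAP, PER_HOUR_CAP, PER_DAY_CAP)
minutes = minutes_until_hourly_cap(PER_MINUTE_CAP, PER_HOUR_CAP)
days = min_days(TOTAL_REBLOGS, PER_DAY_CAP)

print(hourly)             # 1000
print(f"{minutes:.1f}")   # 3.3
print(days)               # 28
```

In other words, running at the full per-minute rate would use up the hourly allowance in a little over three minutes, and even a script that used every request the daily cap allows would need at least 28 days to cover all 136,199 reblogs.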

I put a considerable amount of thought into how my R script behaves. Each data collection session is capped at 825 reblogs, which takes about 54 minutes to complete and keeps each session under the hourly limit of 1,000 requests. After each API request (which retrieves data for one reblog), the script pauses for 2 seconds. After every 50 reblogs, it takes a longer 90-second break to cool down, then automatically resumes the loop. When the session ends, I wait at least four hours before starting the next one. I typically run two sessions per day, occasionally three, so I make at most 2,475 requests in a day, well below the daily limit of 5,000. It’s a tedious and repetitive process, but it keeps me safely within Tumblr’s rate limits. I’m gradually grinding away at the full dataset while embodying the old joke: How do you eat an elephant? One bite at a time.
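
The pacing described above can be sketched roughly as follows. This is a Python illustration, not the actual R script. The numbers come from this post, but the function names are hypothetical, `fetch` is a stand-in for whatever makes a single /posts request, and the sketch assumes the 90-second break is taken in addition to the usual 2-second pause.

```python
import time

SESSION_CAP = 825          # reblogs per session
PAUSE_SECONDS = 2          # pause after every request
BATCH_SIZE = 50            # reblogs between longer breaks
BATCH_BREAK_SECONDS = 90   # length of the longer break


def run_session(reblog_ids, fetch, sleep=time.sleep):
    """Collect up to SESSION_CAP reblogs, pausing between requests.

    `fetch` is a hypothetical callable that retrieves one reblog.
    `sleep` is injectable so the pacing can be dry-run without waiting.
    """
    results = []
    for count, reblog_id in enumerate(reblog_ids[:SESSION_CAP], start=1):
        results.append(fetch(reblog_id))
        sleep(PAUSE_SECONDS)
        if count % BATCH_SIZE == 0:
            sleep(BATCH_BREAK_SECONDS)  # longer cool-down after each batch of 50
    return results


def planned_pause_seconds(n_requests):
    """Total time spent pausing in a session of `n_requests` requests."""
    batches = n_requests // BATCH_SIZE
    return n_requests * PAUSE_SECONDS + batches * BATCH_BREAK_SECONDS


# Dry run: record the pauses instead of actually sleeping
pauses = []
collected = run_session(list(range(1_000)), fetch=lambda i: {"id": i}, sleep=pauses.append)

print(len(collected))      # 825
print(sum(pauses) / 60)    # 51.5
```

Under these assumptions the pauses alone add up to 51.5 minutes, which leaves about two and a half minutes of a 54-minute session for the requests themselves, or roughly 0.2 seconds per request.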

Stay Tuned

Now that more than half of the data from the /posts API endpoint has been collected, we can get a first glimpse of how Martin and Bosco traveled across the Tumblr community through reblog networks. By looking at the tags people added to their reblogs, we can get a sense of how they felt about the post appearing on their dashboards.

We’ll also gain insight into what people thought about the blaze feature, Tumblr’s version of a paid or sponsored post. Martin and Bosco’s post has been blazed by users sixteen times, likely more than any other post on the platform. Since those blaze campaigns were all user-sponsored, the tags offer a rare window into how the Tumblr community responded to seeing the same post repeatedly promoted.

This is where the project starts getting exciting!

In the next blog post, I’ll introduce a five-part series on the data science techniques behind building a human-involved text classifier. The goal of this work is to reduce the time-consuming labour of manually sorting hundreds of blaze-related tags into subthemes. Each post in the series will feature a different step of the process, from creating a labeled dataset to testing whether the classifier actually works. There will also be fun visualizations like storyboards and an interactive dendrogram!

A five-part series on semi-supervised machine learning techniques? Did anyone actually ask for this? No one? But I promised this project would give you opportunities to learn a bit of data science. XD

Thumbnail Image Credit

The thumbnail image used in the preview for this blog post, a weathered roadworks sign stuck in a pile of dirt, was taken by Sergei Starostin and is available on Pexels. The person on the sign is literally digging a hole in the dirt, which is metaphorically perfect for my project:

Slow, manual effort. No end in sight, but still making progress.
