Part 1: Towards Automatic Classification of Blaze Tags

I need a vacation.

— The Terminator, Cyberdyne Systems Model 101 or T-800.

It’s been a hot girl summer of data collection, but I’ve finally gathered enough to start being nosy. We can begin exploring what people are saying in the tags of reblogs of Martin and Bosco’s post. I’m especially keen to see what people think about the blaze feature, Tumblr’s version of a sponsored post. Martin and Bosco’s post has been blazed 16 times, which is remarkable given how notoriously hostile Tumblr users are to being monetized. The tags offer a rare window into how the community reacts to seeing the same post repeatedly promoted.

Unfortunately, Tumblr tags are messy. There’s no standard spelling, no consistent phrasing, and I can’t borrow from an existing system that groups Tumblr tags into tidy categories. If I want to study how people react to blaze, I need to create a system that organizes the tags.

I could sort the tags by hand, but I don’t want to. Even in library school, I wasn’t thrilled about my required cataloguing courses. A few hundred tags would be manageable, but thousands would consume the rest of my summer. I would much rather spend that time sitting on a beautiful Lake Erie beach than sorting tags into subthemes. The obvious solution to my tag classification problem is to train a text classifier (i.e., a robot) that can learn from examples and automatically sort new tags. The catch is that a classifier needs a labeled dataset to learn from, and no one has built one for Tumblr blaze tags.

This post is the first in a five-part series. Making sense of the blaze tags in Martin and Bosco’s post involves many complex moving parts, so I’ve split the story into separate blog posts. This way I can explain each step clearly without overwhelming the reader. I know five posts sounds like a deep dive into my bizarre niche interest (similar to my obsession with Russell Crowe playing an exorcist in B movies). However, blaze is just the example theme I’m using to explain a process I’ll be applying to other tag themes in the future. And I promised you would learn a bit of data science along the way.

Outline for the series of blog posts:

Post 1 (today): Why label blaze tags, and how was the labeled dataset created?
Post 2: What does each subtheme mean, and how do the subthemes connect? (Bonus: You’ll get to learn what a dendrogram is!)
Post 3: How can a text classifier learn from the dataset?
Post 4: Does the blaze tag classifier actually work? Will Laura get to go to the beach?
Post 5: What are people really saying about blaze? (Finally!)

The Storyboard

Before I start droning on about the details of classifying blaze tags, here’s a quick overview of the process.

ⓘ The yellow panel highlights the most recent finished step.

Panel 1: 4,283 reblogs with tags
Panel 2: 8,185 total tags
Panel 3: 392 blaze-related tags
Panel 4: Create labeled dataset by manually sorting blaze tags.
Panel 5: Train a text classifier.
Panel 6: Robot classifies new blaze tags. Laura at the beach.

Figure 1: The comic strip-style layout shows the process of creating a labeled dataset by manually sorting blaze-related tags from reblogs of Martin and Bosco’s post into subthemes. The labeled dataset will form the basis for training a text classifier.

Introducing the Labeled Dataset

A labeled dataset is just a dataset where each item in one column has been assigned a category in a second column. In my case, that means taking blaze-related tags from reblogs of Martin and Bosco’s post and manually sorting them into subthemes based on meaning and intent. For example:

Tag: “the only good blazed post”
Subtheme: positive_feelings
Tag: “i love blaze as a function so much”
Subtheme: blaze_function
Tag: “by natural blaze aka reblogs”
Subtheme: organic_reblog
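For readers who think better in code, here is a minimal Python sketch of what those three examples look like as a two-column labeled dataset. This is only an illustration: the variable names are made up, and the real dataset was built with my own workflow, not this snippet.

```python
from collections import Counter

# The three example pairs from this post. A full labeled dataset would
# hold one (tag, subtheme) row for every blaze-related tag.
labeled_tags = [
    ("the only good blazed post", "positive_feelings"),
    ("i love blaze as a function so much", "blaze_function"),
    ("by natural blaze aka reblogs", "organic_reblog"),
]

# Column 1 is the text a classifier reads; column 2 is the answer key.
texts = [tag for tag, _ in labeled_tags]
labels = [subtheme for _, subtheme in labeled_tags]

# Count how many examples each subtheme has.
subtheme_counts = Counter(labels)
print(subtheme_counts)
```

Counting examples per subtheme is a quick way to spot categories that have too few tags to learn from.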

This hand-labeled dataset is what will later be used to teach the classifier (or robot) how to categorize new tags. Future blog posts will address questions of methodological rigor and transparency. In Post 2, I will explain how I used repeat passes and rules to keep my labeling consistent. Post 3 will discuss how the text classifier learns from the labeled dataset. A statistical method called stratified cross-validation will be used to assess whether the model is actually learning patterns it can apply to new blaze tags: the labeled tags are split into several groups that each keep roughly the same mix of subthemes, and the model is repeatedly trained on most of the groups and tested on the one it has not seen. Think of it like quizzing yourself with new flashcards to see if you really learned the material for the exam.
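If the flashcard analogy feels abstract, the sketch below shows the "stratified" part of stratified cross-validation in plain Python. It is a simplified illustration, not the code used in this project: the function name `stratified_folds`, the label counts, and the choice of three folds are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign item indices to k folds so each fold has a similar subtheme mix."""
    rng = random.Random(seed)

    # Group item positions by their subtheme label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this subtheme's items out like cards, one fold at a time.
        for index in indices:
            folds[next_fold % k].append(index)
            next_fold += 1
    return folds

# Made-up label counts, purely for illustration.
labels = (["positive_feelings"] * 6
          + ["blaze_function"] * 3
          + ["organic_reblog"] * 3)
folds = stratified_folds(labels, k=3)

for held_out, test_indices in enumerate(folds):
    train_indices = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A real run would train on train_indices and score on test_indices here.
    print(f"fold {held_out}: train on {len(train_indices)}, test on {len(test_indices)}")
```

Each fold takes one turn as the unseen flashcards while the other folds are used for studying. Dealing the tags out one subtheme at a time is what keeps a rare subtheme from landing entirely in a single fold.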

As a gentle reminder, what I am sharing in these blog posts is exploratory work. Poking at the dataset sparks questions and often reveals surprises long before the full data collection is finished. For me, the fun is in watching the story take shape as the data grows.

About the Data

Information about the reblogs of Martin and Bosco’s post was collected using an R script and Tumblr’s rickety /posts API endpoint. The snapshot below shows the slice of data I used for this exploratory analysis. Data collection is still relentlessly grinding away in the background; you can read more about the tedious process here. The reblogs in this snapshot span the entire life of the post (July 2022 to June 2025), since I did not sort the reblog information chronologically before calling the /posts API. This means the data slice is not limited to a specific time period.

Data Collection Period: June 23, 2025 – July 21, 2025
Total Valid Reblogs Collected: 42,300
Reblogs with Tags: 4,283 (10.1%)
Total Tags Across All Reblogs: 8,185
Tags Mentioning “Blaze”: 392 (4.79%)
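The percentages in the snapshot come from simple division, and a few lines of Python make it easy to check them. The variable names are mine, and the tags-per-reblog average is an extra figure derived from the same counts.

```python
# Counts taken directly from the snapshot above.
valid_reblogs = 42_300
reblogs_with_tags = 4_283
total_tags = 8_185
blaze_tag_mentions = 392

# Share of valid reblogs that carry at least one tag.
share_tagged = reblogs_with_tags / valid_reblogs * 100

# Share of all tags that mention blaze.
share_blaze = blaze_tag_mentions / total_tags * 100

# Average number of tags on a reblog that has any tags at all.
tags_per_tagged_reblog = total_tags / reblogs_with_tags

print(f"{share_tagged:.1f}% of valid reblogs have tags")
print(f"{share_blaze:.2f}% of tags mention blaze")
print(f"{tags_per_tagged_reblog:.2f} tags per tagged reblog on average")
```

Note that the two percentages use different denominators: the first is out of all valid reblogs, while the second is out of all tags.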

It’s often estimated that about “80% of data analysis is spent on cleaning and preparing the data” (Donoho, 2017). Blaze tags were no exception, and they needed a few processing steps before labeling.

Panel 1: All tags were split, cleaned, and converted to lowercase.
Panel 2: Filtered tags for “blaze” variations.
Panel 3: Counted every blaze tag occurrence.

Figure 2: The comic strip-style layout shows the processing steps used to isolate blaze-related tags for labeling.

To isolate blaze-related tags, I filtered the /posts dataset to only include reblogs with tags. Multi-tag entries were split into individual phrases, cleaned (punctuation and extra spaces removed), and converted to lowercase for consistency. I then used a regex filter to only keep tags that contained the word blaze or variations like blazes, blazed, or blazing. The “Tags Mentioning Blaze” number (392) reflects total occurrences of blaze-related tags, not distinct phrases. If the same tag appeared 12 times, it was counted 12 times.
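The processing steps described above can be sketched in Python roughly as follows. Treat this as an approximation under stated assumptions: my actual script is written in R, and the sample tag strings, the comma delimiter, the helper name `clean_tag`, and the exact regex pattern are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Hypothetical input: one comma-separated string of tags per reblog.
raw_tag_strings = [
    "Blazed again!!, cats",
    "the only good blazed post",
    "BLAZED AGAIN, ablaze with joy",
    "blazing saddles,  tumblr   blaze",
]

def clean_tag(tag):
    """Lowercase a tag, strip punctuation, and collapse extra spaces."""
    tag = tag.lower()
    tag = re.sub(r"[^\w\s]", "", tag)       # remove punctuation
    tag = re.sub(r"\s+", " ", tag).strip()  # collapse repeated whitespace
    return tag

# Assumed pattern: blaze, blazes, blazed, or blazing as a whole word.
BLAZE_PATTERN = re.compile(r"\bblaz(?:e|es|ed|ing)\b")

# Step 1: split multi-tag entries, clean each tag, and drop empty leftovers.
all_tags = [clean_tag(part) for entry in raw_tag_strings for part in entry.split(",")]
all_tags = [tag for tag in all_tags if tag]

# Step 2: keep only tags that match a blaze variation.
blaze_tags = [tag for tag in all_tags if BLAZE_PATTERN.search(tag)]

# Step 3: count every occurrence, so repeated tags are counted repeatedly.
blaze_counts = Counter(blaze_tags)

print(len(blaze_tags), "blaze tag occurrences")
print(len(blaze_counts), "distinct blaze tags")
```

In this toy example the word-boundary pattern skips "ablaze with joy". Whether a tag like that should count as a blaze tag is exactly the kind of decision the filter has to make explicit.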

Wait. Did your eyes glaze over while skimming this section?
That’s totally fair!

Why include details about processing the data?

These technical details might seem pedantic, but they prompt the reader to consider four important questions:

What’s the size and scope of the dataset?
Are 392 labeled tags a large or small amount?
What actually counts as a “blaze” tag in this analysis?
Did Laura count every occurrence, or just unique phrases?

Without this context, it would be easy to misinterpret the dataset used in the analysis. Knowing that the counts are total mentions (not unique tags) tells you the numbers capture intensity as well as variety: a phrase that many people used counts once for every person who used it. While 392 labeled examples would be considered very small in machine learning terms, for Tumblr tags on a single post it’s a reasonable starting point for exploring patterns. Knowing this dataset is part of a much larger collection hints that there’s more story to come. Together, these details help you understand the labeled dataset, which will serve as the foundation for all the work that follows.

Stay Tuned

I’ll be back.

— The Terminator, Cyberdyne Systems Model 101 or T-800.

In the second post of the series, you’ll learn about the blaze subthemes I identified in the dataset and the rules I used to group the tags. Each subtheme will be defined, and I’ll explain how they connect to each other. You’ll also have the opportunity to explore an interactive dendrogram! Who doesn’t want to learn about a hierarchical tree diagram? The fun never stops in this project.

Thumbnail Image Credit

The thumbnail image used in the preview for this blog post, one wooden peg standing apart from a clustered group on a navy blue background, was taken by Ann H. and is available on Pexels.

This tiny wooden peg drama says it all. I’m the lone peg off to the side, doing all the manual tagging by hand. The cozy cluster? That’s the future I’m hoping for. A well-trained text classifier labeling all the blaze tags while I enjoy the beautiful blue water of Lake Erie.

References

Cameron, J. (Director). (1991). Terminator 2: Judgment day [Film]. Carolco Pictures; Lightstorm Entertainment; Pacific Western Productions; Le Studio Canal+.

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734