New digital methods can be used to analyse linguistic terms and better understand Reddit communities

By Tim Squirrell

Reddit is now the fourth most visited website in the US. Yet, surprisingly, given its position as an extremely large community, it has been the subject of relatively little research. Tim Squirrell has developed methods of studying the genealogy, spread, and use of particular words on Reddit, as demonstrated by this case study of The_Donald, the largest pro-Trump community on the web. The methods outlined here could be used to explore and understand any community or linguistic term on Reddit over time, with only a basic knowledge of structured query language required.

Reddit recently announced that it was now the fourth most visited site in the US, behind only Facebook, Google, and YouTube. Despite this, there has been relatively little research conducted into the communities that dwell there. As part of the Alt-Right Open Intelligence Initiative at the Digital Methods Initiative, I have developed methods for studying Reddit.

Through the lens of a case study of The_Donald, the largest pro-Trump community on the web, I’ll walk you through methods of studying the genealogy, spread, and use of particular words on Reddit, and show how these can be put to use in the social sciences. I’ll be illustrating how what we now think of as one entity, the Alt-Right, is actually composed of disparate groups who do not necessarily share one coherent ideology.

The majority of methods here draw on the Reddit Comments dataset on Google’s BigQuery platform. They rely on a number of scripts, to be published on the DMI website in due course.

Who is the Alt-Right?

The_Donald has, over the past year and a half, become the single biggest meeting place for individuals who identify as part of the “Alt-Right”, a political movement that came to prominence in the wake of the 2016 US Presidential Election. Following up on anecdotal evidence that suggested the movement incorporates individuals from the “manosphere”, anti-progressives from the “GamerGate” movement, 4chan trolls, far-right conservatives, racists, and conspiracy theorists, we wanted to understand whether there was indeed such a thing as a cohesive Alt-Right.

To do this, we selected words that were emblematic of different aspects of the identity of The_Donald, based on their appearance in a list of most-used words that weren’t seen elsewhere on Reddit. The dynamic word cloud below gives a taste of the most frequent words in The_Donald over time, reflecting both the core facets of the group (words like “cuck”, “kek”, “trump” and “MAGA”) and issues they were concerned with at different points.
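As a rough illustration of what "most-used words that weren’t seen elsewhere" might mean in practice, the sketch below ranks words by how much more frequent they are in one community than in a background sample. It is not the script used for this project: the smoothed frequency ratio, the function names, and the toy comments are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())

def distinctive_words(target_comments, background_comments, min_count=2, top_n=5):
    """Rank words by how over-represented they are in the target community."""
    target = Counter(w for c in target_comments for w in tokenize(c))
    background = Counter(w for c in background_comments for w in tokenize(c))
    target_total = sum(target.values())
    background_total = sum(background.values())
    vocab = len(set(target) | set(background))
    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # ignore words too rare to be called "most-used"
        # Add-one smoothing (an assumption) avoids dividing by zero for
        # words that never appear in the background sample.
        p_target = (count + 1) / (target_total + vocab)
        p_background = (background[word] + 1) / (background_total + vocab)
        scores[word] = p_target / p_background
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy, made-up comments standing in for one subreddit and the rest of Reddit.
community = ["maga maga build the wall", "kek the pede is based", "maga kek"]
elsewhere = ["the cat sat on the mat", "the game is good", "build the thing"]

top = distinctive_words(community, elsewhere)
print(top)
```

Common words such as "the" score low because they are just as frequent in the background sample, which is the property the selection step relies on.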

From here, I used a word association script to produce lists of words that were most likely to co-occur with a list of eight words: “cuck”, “kek”, “pepe”, “sjw”, “maga”, “4chan”, “pede” (for “centipede”, what they call themselves), and “based” (a term used to describe people they like). I used stripped-back network-modelling software to show the associations between words.
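The idea behind a word association script can be sketched as follows, assuming that "co-occur" means "appear in the same comment". The real script may weight or filter associations differently; the function names, the `min_weight` threshold, and the toy comments here are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Return the set of distinct lowercase words in a comment."""
    return set(re.findall(r"[a-z0-9_']+", text.lower()))

def cooccurrences(comments, seed_words, stopwords=frozenset()):
    """For each seed word, count the other words sharing a comment with it."""
    seeds = set(seed_words)
    counts = defaultdict(Counter)
    for comment in comments:
        tokens = tokenize(comment) - stopwords
        for seed in seeds & tokens:
            for other in tokens - {seed}:
                counts[seed][other] += 1
    return counts

def to_edges(counts, min_weight=2):
    """Flatten the counts into (seed, word, weight) edges for a network tool."""
    return sorted(
        (seed, word, weight)
        for seed, neighbours in counts.items()
        for word, weight in neighbours.items()
        if weight >= min_weight
    )

# Toy, made-up comments.
comments = [
    "based pede says maga",
    "maga pede energy",
    "kek pepe meme",
    "pepe kek again",
]

edges = to_edges(cooccurrences(comments, ["maga", "kek"]))
print(edges)
```

Each edge links a seed word to a frequent neighbour, and the weight could drive edge thickness or node proximity in whichever network-modelling tool is used.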

Figure 1: Word association network from The_Donald, March 2017 – produced using Halfviz (enlarged, dynamic version available here).

We noted that a core identity cluster appeared to form over time, with the words “cuck”, “maga” and “pede” appearing and then coalescing into a coherent vernacular. Around the peripheries of the community were the terms “sjw” (denoting the anti-progressive element), “globalist” (used primarily by conspiracy theorists and far-right posters), “pepe”, and “kek” (coming from “shitposters” – trolls and meme enthusiasts).

Our hypothesis, then, was that there is no coherent Alt-Right; rather, there are disparate groups who come together due to their convergent interests in the election and policies of Donald Trump.

Dissecting communities with Subreddit Algebra

To test this hypothesis, I used a tool developed by Trevor Martin which had previously been used to associate The_Donald with hate groups. The Subreddit Algebra tool uses latent semantic analysis to find subreddits with the largest overlap between commenters, and then normalises them for size, before allowing the user to “subtract” or “add” one subreddit to another with a simple interface.
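To make the "add" and "subtract" idea concrete, here is a heavily simplified sketch. It assumes each subreddit can be described by a vector of commenter overlaps, normalised to unit length, and compared by cosine similarity. It skips the dimensionality-reduction step that latent semantic analysis would involve, and it is not Trevor Martin’s implementation; the community names and user IDs are invented.

```python
import math

def overlap_vectors(commenters):
    """Describe each subreddit by how many commenters it shares with every
    subreddit in the sample (itself included)."""
    names = sorted(commenters)
    return {
        a: [float(len(commenters[a] & commenters[b])) for b in names]
        for a in names
    }

def unit(vector):
    """Scale a vector to length 1 so that large subreddits do not dominate."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

def combine(vectors, plus, minus=()):
    """Add and subtract normalised subreddit vectors."""
    dims = len(next(iter(vectors.values())))
    result = [0.0] * dims
    for name in plus:
        result = [r + x for r, x in zip(result, unit(vectors[name]))]
    for name in minus:
        result = [r - x for r, x in zip(result, unit(vectors[name]))]
    return result

def nearest(vectors, query, exclude=()):
    """Rank the remaining subreddits by cosine similarity to the query."""
    q = unit(query)
    scored = [
        (sum(a * b for a, b in zip(q, unit(vec))), name)
        for name, vec in vectors.items()
        if name not in exclude
    ]
    return [name for _, name in sorted(scored, reverse=True)]

# Invented communities and commenters: "mixed" draws on two distinct crowds.
toy = {
    "mixed": {"u1", "u2", "u3", "u4", "u5", "u6"},
    "politics": {"u1", "u2", "u3", "u7"},
    "news": {"u1", "u2", "u8"},
    "gaming": {"u4", "u5", "u6", "u9"},
    "memes": {"u4", "u5", "u10"},
}

vectors = overlap_vectors(toy)
query = combine(vectors, plus=["mixed"], minus=["politics"])
ranking = nearest(vectors, query, exclude=("mixed", "politics"))
print(ranking)
```

In this toy sample, removing the "politics" component from "mixed" leaves a vector closest to the gaming and meme communities, which is the kind of result the subtraction is meant to surface.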

With this tool, I found that subtracting the most obviously “mainstream” politics from The_Donald resulted in primarily gamers and “shitposters”. Likewise, subtracting “4chan” resulted in primarily mainstream and far-right political subreddits, as did subtracting “CoonTown”, a now-banned explicitly racist community.

Figure 2: Subreddit Algebra interface showing The_Donald minus commenters from /r/Conservative.

This appeared to corroborate the “no homogeneous Alt-Right” hypothesis.

Tracking words over time: “Social Justice Warriors” and “cuck”

The term “SJW” is hugely popular as a derisory name for progressives in various circles. “Cuck” is short for “cuckold”, a word historically used for a man whose wife has been cheating on him and who may now be raising another man’s children. The former term has achieved some mainstream popularity. The latter seemed to be limited in its non-ironic use to The_Donald (its origins are the subject of a blog post). A final way for us to corroborate our hypothesis was to track these terms over time and see where on Reddit they were used.

To do this, I used a simple script that returns a list of the subreddits in which a word occurred most frequently, iterating this by month from 2012 onwards for “SJW” and from August 2014 onwards for “cuck”. I put the results into RAWGraphs, producing normalised and non-normalised time series graphs for the use of each term.
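The kind of query involved can be sketched with an in-memory SQLite database standing in for BigQuery. The table name `comments` and the columns `subreddit`, `created_utc`, and `body` are assumptions for this example, the rows are invented, and date functions differ between SQLite and BigQuery. "Normalised" is read here as the share of a subreddit’s comments in a given month that contain the term, which is only one possible interpretation.

```python
import sqlite3
from datetime import datetime, timezone

def ts(year, month, day):
    """Unix timestamp (UTC) for a given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

QUERY = """
SELECT
  strftime('%Y-%m', created_utc, 'unixepoch') AS month,
  subreddit,
  SUM(CASE WHEN LOWER(body) LIKE '%' || :term || '%' THEN 1 ELSE 0 END) AS matches,
  COUNT(*) AS total
FROM comments
GROUP BY month, subreddit
ORDER BY month, matches DESC
"""

def term_by_month(conn, term):
    """Return (month, subreddit, matches, share) for subreddits using the term."""
    rows = conn.execute(QUERY, {"term": term.lower()}).fetchall()
    return [
        (month, subreddit, matches, matches / total)
        for month, subreddit, matches, total in rows
        if matches > 0
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, created_utc INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        # Invented rows for two hypothetical subreddits.
        ("sub_a", ts(2015, 1, 10), "that SJW thread"),
        ("sub_a", ts(2015, 1, 12), "nothing here"),
        ("sub_a", ts(2015, 2, 3), "sjw again"),
        ("sub_a", ts(2015, 2, 4), "more sjw talk"),
        ("sub_b", ts(2015, 2, 5), "sjw"),
        ("sub_b", ts(2015, 2, 6), "unrelated"),
        ("sub_b", ts(2015, 2, 7), "unrelated"),
        ("sub_b", ts(2015, 2, 8), "unrelated"),
    ],
)

results = term_by_month(conn, "sjw")
for row in results:
    print(row)
```

The `matches` column corresponds to a non-normalised graph and the share to a normalised one. Note that a plain substring match will also count longer words containing the term, so a real query would probably need word-boundary handling.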

Figure 3: Frequency of “SJW” by subreddit, January 2013 to July 2017, non-normalised (enlarged version available here).


Figure 4: Frequency of “cuck” by subreddit, August 2014 to December 2015, non-normalised.


Figure 5: Frequency of “cuck” by subreddit, January 2016 to May 2017.

What these graphs appear to show is that the term “SJW” emerged in the TumblrInAction subreddit (in red) and spread from there through other anti-progressive and gaming subreddits, before being taken up by The_Donald (in yellow) when it emerged as a primary actor in the Alt-Right in February 2016. However, it is not a popular term in mainstream conservative communities like /r/Conservative.

Similarly, “cuck” began its life in the “manosphere” of TheRedPill, a misogynist subreddit, as “cuckold”. It appears to have spread from there to 4chan, and from there to KotakuInAction, a gamer-oriented misogynist hate movement. It is not popular as a term outside of The_Donald and a few related subreddits, except where it is used ironically to mock those who use it seriously.


Our initial hypothesis, that The_Donald was not a monolithic entity but instead reflected disparate groups with converging interests, appears to be the best explanation for the observed phenomena.

Perhaps more importantly for readers of this blog, the methods outlined here could be used to explore and understand any community or linguistic term on Reddit over time, with only a basic knowledge of SQL (structured query language) required.

Given the lack of mainstream research on Reddit, and its position as an extremely large community, additional research in this area would be welcome. It needn’t necessarily even involve exploring hate groups.

Featured image credit: Reddit Logo by Mechatronics Guy. This work is licensed under a CC BY 2.0 license.

Tim Squirrell is an ESRC-funded PhD researcher in the Department of Science, Technology and Innovation Studies at the University of Edinburgh. His thesis tackles the construction and negotiation of authority and expertise in online spaces, looking at fitness and nutrition communities on Reddit. You can follow him on Twitter @timsquirrell.

This post was originally published on The London School of Economics and Political Science’s Media Policy Blog. Republished here under a Creative Commons License.
