Differences in topic usage between AO3 D:BH fics of different ratings

This is a follow-up of sorts to my first investigation (using unigrams).
The TL;DR of that post:
1) Mature fics, and to a lesser degree teen fics, typically use more aggression-related words (e.g. blood) than explicit fics do.
2) Explicit fics strongly use smut-centric language. They’re followed by mature fics, then teen and gen fics (with little difference between teen and gen fics; thanks to writers who tag appropriately!)
3) It was really hard to understand what characterised teen and especially gen fics just looking at unigrams.

So - how else to understand what characterises/distinguishes the content of gen/teen/mature/explicit D:BH fics?

Quite recently I was introduced to the structural topic model (STM). I’ve talked about applying vanilla latent Dirichlet allocation (LDA) before. Vanilla LDA really just retrieves the topics in the corpus and the topic distribution of each document.

STM goes a step further - it can incorporate metadata about each document in the modeling (e.g. what rating was the story assigned?)*. This metadata may affect topic prevalence (e.g. how often does a topic appear in gen fics vs mature fics?) and/or topic content (e.g. how does the vocabulary of a topic about school differ between gen fics vs teen fics?). With this knowledge of the metadata, you can do pretty cool stuff like compare how topics are used differently by different authors, for example, or track topic prevalence over time.
*Note that STM is not just the vanilla LDA algorithm + metadata. STM falls under the family of topic models but is a separate algorithm. LDA can be used to initialise an STM model, but the default in the stm R package is spectral initialisation, which is based on a non-negative matrix factorisation of the word co-occurrence matrix.
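To make "metadata may affect topic prevalence" a bit more concrete, here is a rough Python sketch of the idea (not the stm package itself). In STM, each document's topic proportions come from a logistic-normal distribution whose mean depends on the document's covariates. The three topics, the coefficient values in GAMMA and the independent noise are all made up for illustration; the real model estimates the coefficients and a full covariance matrix from the corpus.

```python
import math
import random

RATINGS = ["gen", "teen", "mature", "explicit"]
TOPICS = ["smut", "injury", "resting"]  # hypothetical 3-topic model

# Hypothetical prevalence coefficients: one row per rating, one entry per topic.
# The last topic is the reference and stays fixed at 0.
GAMMA = {
    "gen": [-2.0, -0.5, 0.0],
    "teen": [-1.5, 0.3, 0.0],
    "mature": [0.0, 0.6, 0.0],
    "explicit": [1.5, -0.4, 0.0],
}


def softmax(values):
    """Map real-valued scores to proportions that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def draw_topic_proportions(rating, noise_sd=0.5, rng=random):
    """Draw one document's topic proportions given its rating.

    Simplification: independent Gaussian noise per topic instead of
    the full covariance matrix used by STM.
    """
    mean = GAMMA[rating]
    eta = [m + rng.gauss(0.0, noise_sd) for m in mean[:-1]] + [0.0]
    return softmax(eta)


def average_proportions(rating, n_docs=2000, seed=0):
    """Average topic proportions over many simulated documents of one rating."""
    rng = random.Random(seed)
    totals = [0.0] * len(TOPICS)
    for _ in range(n_docs):
        for k, p in enumerate(draw_topic_proportions(rating, rng=rng)):
            totals[k] += p
    return [t / n_docs for t in totals]


if __name__ == "__main__":
    for rating in RATINGS:
        avg = average_proportions(rating)
        print(rating, {t: round(p, 2) for t, p in zip(TOPICS, avg)})
```

The point is only that the rating shifts where the topic proportions are centred; vanilla LDA has no equivalent of GAMMA.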

In this analysis, I focus on how the rating of a fic may affect the prevalence of topics that appear in it. In other words, how are different topics used to different extents between the four ratings?

Note that there are already some sanity checks in place for this analysis:
1) Explicit fics should have relatively higher prevalence of a smut topic(s) versus all other ratings.
2) Based on my first analysis, mature fics should hopefully be characterised by more aggressive/violence-related topics.
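The first sanity check can be sketched as a plain comparison of mean topic proportions per rating. This is a simplified Python stand-in, not what STM does under the hood (STM fits a regression of topic proportions on the covariates and propagates uncertainty). The DOCS list and its numbers are hypothetical placeholders for the fitted document-topic proportions.

```python
from collections import defaultdict

# Hypothetical (rating, topic proportions) pairs standing in for a fitted model.
DOCS = [
    ("gen", {"smut": 0.01, "injury": 0.05}),
    ("gen", {"smut": 0.03, "injury": 0.03}),
    ("teen", {"smut": 0.02, "injury": 0.12}),
    ("teen", {"smut": 0.04, "injury": 0.10}),
    ("mature", {"smut": 0.10, "injury": 0.15}),
    ("mature", {"smut": 0.14, "injury": 0.13}),
    ("explicit", {"smut": 0.30, "injury": 0.04}),
    ("explicit", {"smut": 0.40, "injury": 0.06}),
]


def mean_prevalence(docs, topic):
    """Mean proportion of one topic within each rating."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rating, theta in docs:
        sums[rating] += theta.get(topic, 0.0)
        counts[rating] += 1
    return {rating: sums[rating] / counts[rating] for rating in sums}


def prevalence_difference(docs, topic, rating_a, rating_b):
    """Positive when the topic is more prevalent in rating_a than in rating_b."""
    means = mean_prevalence(docs, topic)
    return means[rating_a] - means[rating_b]


def passes_smut_check(docs):
    """Sanity check 1: explicit has the highest mean smut prevalence."""
    means = mean_prevalence(docs, "smut")
    return all(means["explicit"] > m for r, m in means.items() if r != "explicit")


if __name__ == "__main__":
    print(mean_prevalence(DOCS, "smut"))
    print(passes_smut_check(DOCS))
```

The same prevalence_difference call with "injury", "mature" and "explicit" would correspond to the second sanity check.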

Content

1) Topics retrieved by STM
2) Comparison of topic prevalence between different ratings
3) Comparison of vocabulary of seemingly similar topics

Full blog post on preprocessing and training here. STM done and images generated using the stm R package. No fancy interactives because I’m dog vomit bad at R.

Topics retrieved by STM

Vocabulary of topics

Topic labels are not prescriptive; I decided on them manually
Prob = words with highest probability for the topic
FREX = words both frequently appearing and relatively exclusive to the topic
Words stemmed by the Porter stemmer (truncate word to (pseudo) root form; e.g. cities/city to citi)
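To show how Prob and FREX can rank words differently, here is a small Python sketch. As far as I understand it, FREX is a weighted harmonic mean of a word's frequency rank and exclusivity rank within a topic (both as empirical CDF values), with an equal weighting of 0.5 as the usual default. The two-topic BETA matrix and its numbers are invented for illustration, so treat this as a sketch of the idea rather than the package's exact implementation.

```python
VOCAB = ["finger", "cock", "moan", "kiss", "door"]

# Hypothetical word probabilities per topic (rows: smut, touch).
BETA = [
    [0.30, 0.25, 0.15, 0.25, 0.05],
    [0.40, 0.01, 0.02, 0.27, 0.30],
]


def ecdf_ranks(values):
    """Empirical CDF value of each entry: share of entries <= it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex_scores(beta, topic, weight=0.5):
    """Weighted harmonic mean of exclusivity rank and frequency rank."""
    freq = beta[topic]
    n_words = len(freq)
    # Exclusivity: this topic's share of the word's probability across topics.
    col_totals = [sum(row[v] for row in beta) for v in range(n_words)]
    excl = [freq[v] / col_totals[v] for v in range(n_words)]
    f_rank = ecdf_ranks(freq)
    e_rank = ecdf_ranks(excl)
    return [
        1.0 / (weight / e + (1.0 - weight) / f)
        for e, f in zip(e_rank, f_rank)
    ]


def top_words(scores, vocab, n=3):
    """Words with the highest scores, best first."""
    order = sorted(range(len(vocab)), key=lambda v: scores[v], reverse=True)
    return [vocab[v] for v in order[:n]]


if __name__ == "__main__":
    print("Prob:", top_words(BETA[0], VOCAB))
    print("FREX:", top_words(frex_scores(BETA, 0), VOCAB))
```

In this toy data "finger" is the most probable word in the first topic, but it is shared heavily with the second topic, so a more exclusive word moves to the top of the FREX ranking.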

Topic | Label | Top 5 Prob words | Top 5 FREX words
1 | Art | paint, play, book, man, music | paint, danc, song, music, art
2 | Android biocomponents | thirium, memori, bodi, damag, pump | wire, repair, regul, thirium, biocompon
3 | Crime cases | case, lieuten, scene, offic, crime | crime, victim, evid, scene, crime_scene
4 | Android stress/error | level, stress, system, error, stress_level | stress_level, error, stress, level, instabl
5 | DPD | detect, desk, coffe, offic, partner | desk, break_room, coffe, detect, captain
6 | Medical | detect, chest, keep, shoulder, son | doctor, technician, hospit, slave, put_hand
7 | Building interiors | door, room, floor, wall, step | stair, hallway, hall, door, door_open
8 | Family/other AU | brother, boy, mother, child, man | alpha, brother, twin, demon, mother
9 | Swearing | fuck, shit, hell, ass, fuckin | shit, fuck, fuckin, fucker, get_fuck
10 | Negative emotion | tear, hurt, cri, breath, happen | tear, cri, sob, trembl, hurt
11 | Expression | word, express, question, mind, tone | tone, express, convers, intent, goal
12 | Driving | car, seat, drive, park, phone | car, passeng, driver, park_lot, passeng_seat
13 | Resting | bed, sleep, couch, night, room | couch, sleep, bed, pillow, blanket
14 | Deviancy | devianc, mission, emot, machin, deviat | program, machin, human, emot, mission
15 | Aggression | hiss, fist, yank, snarl, smoke | man, stare, feet, back, air
16 | Injury | blood, pain, wound, knife, leg | knife, wound, bleed, blood, pain
17 | Touch | finger, lip, skin, chest, breath | fingertip, palm, thumb, brow, flutter
18 | Smut | finger, cock, mouth, hip, kiss | moan, cock, hip, thrust, thigh
19 | Feelings | felt, took, knew, came, found | felt, felt_hand, made_feel, felt_someth, took_moment
20 | Setting | dog, water, snow, tree, sky | tree, sky, fish, sun, grass
21 | Dialogue 1 | talk, nod, tell, took, went | said_look, thank_said, know_said, said_smile, look_said
22 | Dialogue 2 | give, tell, nod, sit, shake | goe, say_look, shake_head, say_voic, say_know
23 | Revolution | peopl, human, other, leader, everyon | leader, group, freedom, ship, revolut
24 | Affection | love, kiss, lip, cheek, press | love, love_love, kiss, kiss_cheek, boyfriend
25 | Food/eating | kitchen, food, eat, tabl, cloth | food, cook, bowl, dish, eat
26 | Movement | walk, nod, man, led, room | bus, girl, nod_head, walk, place_hand
27 | Guns | gun, shot, kill, bullet, shoot | gun, bullet, shoot, aim, trigger
28 | School | friend, talk, kid, year, day | school, dad, friend, date, kid
29 | Technology | model, inform, use, design, screen | tablet, data, project, test, product
30 | Time | day, noth, everyth, left, done | spent, day, gotten, week, gone

Prevalence of topics in corpus of fics

Topic label followed by top 5 FREX words
image

Comparison of topic prevalence between different ratings

Focusing on Gen

Some results unlike the previous unigram comparison! Versus the other ratings, D:BH Gen is characterised by slightly greater use of the art, resting, feelings, setting, affection, food/eating, movement, and school topics. These are rather reminiscent of fluff fics (to me; you may disagree). Check out the plots below:

Gen vs Teen
image
Gen vs Mature
image
Gen vs Explicit
image

Focusing on Teen

D:BH Teen and D:BH Mature don’t seem to diverge greatly, beyond Mature making more use of the touch and smut topics (physicality). In contrast, versus D:BH Explicit, D:BH Teen makes greater use of topics like deviancy, injury, guns, crime cases (the building interiors topic is probably related to crime scene setups), android stress, and revolution. This closely matches the results from the unigram comparison (sanity check ticked). Note too how Gen and Teen are similar in their (lesser) usage of the smut topic versus Explicit (another sanity check ticked). Check out the plots below:

Teen vs Mature
image
Teen vs Explicit
image

Focusing on Mature

Mature vs Explicit
Like D:BH Teen, D:BH Mature has greater usage of topics like crime cases, android stress, negative emotion, injury, guns, and revolution versus D:BH Explicit. D:BH Explicit is largely distinguished from its Mature counterpart by the smut topic (and to a much lesser degree, the touch and affection topics). These closely follow the unigram comparison (sanity check ticked). Check out the plot for the topic usage comparison below:
image
Note how strongly the smut topic characterises Explicit fics (more so than violence/crime-related topics characterise Mature fics) - this also matches the H-statistic of the unigrams. If you return to the charts from that post (unigrams characteristic of mature fics, unigrams characteristic of explicit fics), you’ll see that the H-statistic (and thus the corresponding effect size) of smut words for Explicit fics is way larger than that of the violence-related words for Mature fics.

Comparison of vocabulary of seemingly similar topics

This is tangential to characterising the different ratings. But when I was labeling the topics, there were a couple of topics that seemed ambiguously similar to me. STM offers a feature to compare the vocabulary of two topics - I found this very helpful.

Note: Words further along the left/right axis are more closely related to the topic labelled on the left/right respectively. Words in the middle (around the dotted line) are central to both topics. Word size is relative to its frequency within the two topics (bigger = more frequent).
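A rough Python sketch of how such a two-topic layout could be computed. The position and size formulas here are my own illustrative assumptions, not necessarily what the STM plotting code does, and the word probabilities for the two topics are made up.

```python
# Hypothetical word probabilities for two topics.
TOUCH = {"finger": 0.10, "skin": 0.08, "palm": 0.05, "lip": 0.06}
SMUT = {"finger": 0.09, "cock": 0.12, "moan": 0.07, "lip": 0.05}


def perspective_layout(left, right, top_n=5):
    """Return (word, position, size) tuples ordered from left to right.

    position: -1 means the word only appears in the left topic,
              +1 only in the right topic, 0 equally in both.
    size:     combined probability across the two topics.
    """
    rows = []
    for word in sorted(set(left) | set(right)):
        p_left = left.get(word, 0.0)
        p_right = right.get(word, 0.0)
        total = p_left + p_right
        if total == 0.0:
            continue
        rows.append((word, (p_right - p_left) / total, total))
    # Keep the most frequent words, then order them along the axis.
    rows.sort(key=lambda row: row[2], reverse=True)
    return sorted(rows[:top_n], key=lambda row: row[1])


if __name__ == "__main__":
    for word, position, size in perspective_layout(TOUCH, SMUT):
        print(f"{word:8s} position={position:+.2f} size={size:.2f}")
```

Words shared by both topics (here "finger" and "lip") land near the middle, which is what the dotted line in the plots represents.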

Aggression vs Injury

image

Injury vs Guns

image

Touch vs Smut

image

Touch vs Affection

image

Ending notes

1) Though I accounted for story publishing date in the modeling, I did not show the results here. I’ve previously tried a dynamic topic model on this corpus and didn’t see a lot of changes in topics over time. I think the Detroit fandom is still too new, but it would be exciting to apply this to Harry Potter’s.

2) I’m concerned by how small the confidence intervals are in the topic prevalence comparison charts; I’m not really sure if this is expected behaviour (KIV-ing this).

3) Overall I’m still quite thrilled by what it’s achieved in this very specific corpus of fiction. The STM would be very cool to apply to more social science-driven corpora (as suggested by its original authors). However, I’m not entirely clear about covariate selection for modeling, or about assessing the ‘quality’ of the covariates used and the model fit. I hope to find more readings on this.