Differences in word use between AO3 D:BH fics of different ratings

Do mature-rated fics use different language from explicit-rated fics? What words are generally used more in teen-rated fics versus mature-rated fics?

We can already guess how explicit fics may differ from the rest (sMuT) so my interest in running this analysis was really to see:
1) What differentiates the general audience/teen/mature fics from each other, and
2) What differentiates mature/explicit fics (since mature-rated fics may also make references to sex/violence).

Note: I dropped Not Rated fics for this analysis, leaving 12647 fics for analysis. Full blog post on preprocessing and other details here.

To recap AO3’s rating system, taken from here:

Analysis done

I was admittedly not very sure how to proceed! Based on past readings, log-likelihood ratio tests and chi-square tests seem to be the tests of choice for comparing whether certain words occur more often in one corpus versus another (e.g. men vs women).

However, I recently came across this interesting paper which argues for the suitability of using other tests instead (i.e., t-tests, Mann-Whitney U-tests, bootstrap tests). The Mann-Whitney U-test is commonly thought of as the non-parametric equivalent of the t-test (very loosely put). Given that I have 4 groups (4 ratings) instead of 2, I went with what’s commonly called the non-parametric equivalent (again, very loosely put!) of the ANOVA - the Kruskal-Wallis test. I discuss that and the corrections for multiple comparisons in my blog post.

What I would like to clarify here is that the Kruskal-Wallis does not by default test for differences in group means or medians. (If you want to assume it’s testing for medians, you have to assume that the distribution across all your groups are identical). Basically, K-W is a rank-based test.

In this scenario, we’d rank all 12647 fics by their length-corrected (so we don’t have a bias towards longer fics) frequencies of perhaps the word “love”. So the fic that’s used the word “love” least is given the rank 1, the fic that’s used it the most 12647, and everyone else in between. Then we sum the ranks within each of the 4 groups and get the average rank from each group; this information is used in the test for statistical significance.

Results

I wasn’t sure how best to show the results, so I went with a dot graph displaying the mean ranks for each of the ratings.

Note:

1. Words generally used the most by mature fics

Note: I’m only showing words that were significantly different between mature/explicit fics. I wanted to know what differentiated these two rather similar ratings.
Click here for the interactive chart.

Key takeaways from chart

1) Mature fics generally have a focus on more aggressive words than explicit fics (e.g. blood, bitch, broken).
2) Even teen fics generally use words that suggest aggression/harm more than explicit fics (e.g. hurt, cases, stress, died, shoot). So while AO3’s explicit rating does encapsulate graphic violence, it does seem the D:BH explicit fics centre more around the smut aspect than gore/violence/etc).
3) While mature fics generally use female pronouns (she/her) and the collective ‘people’ the most, explicit fics use them the least. Not entirely surprising given all the M/M ships in D:BH, but it was nice that the test picked it up.

2. Words generally used the most by explicit fics

Note: Again, I’m only showing words that were significantly different between mature/explicit fics.
Click here for the interactive chart.

Key takeaways from chart

1) Really clean patterning of mean ranks! The self-rating system seems to work pretty well (??). Sexual terms are far and away the most used by explicit fics, suggesting a strong focus on smut.
2) Teen and gen fics are generally similar in their lack of usage of smut-suggestive terms like nipples, prostate, lust, etc.

3. Words generally used the most by teen fics

Note: This chart isn’t as big as the first two because the number of words were teen fics ranked the first in mean ranks just wasn’t as large.
Click here for the interactive chart.

Key takeaways from chart

1) Teen fics generally use certain function words the most (e.g. pronouns, common verbs like am/be).
I’m not really sure why this is the case, but it’s interesting to see that explicit fics have the lowest mean ranks for some of these terms. My uneducated guess is that these represent some different emphases in story construction - but these are just words out-of-context (i.e., don’t read too much into them)!

4. Words generally used the most by gen fics

Note: Again, this chart is tiny because there just weren’t a lot of words where gen fics ranked the top in mean ranks.
Click here for the interactive chart.

Key takeaways from chart

Not much really. From the overall analysis, it seems like gen fics are just characterised by an absence of stuff - which was quite disappointing for me. But understandable at the same time, as this analysis was rather crude. This is something I’d like to work more on, perhaps by incorporating some information about story structure or syntax.