program-800 · 4 years
(An attempt at) visualising AO3 D:BH fics based on verb usage
[Image: gif of the t-SNE visualisation]
WIP. Interactive visualisation on my GitHub page (link in description/Tumblr heading).
I’ve been terribly busy (at work, but also mostly in my own head). This is something I’ve been working on for a couple weekends, but I can’t get cleaner results so this write-up’s going up first while I slowly figure out improvements.
The idea was simple: can I cluster fics based on the actions that occurred within them? Obviously there’s going to be a couple clusters for smut, but how about fics which focus around character introspection, fics which focus on fluff dates, etc?
That’s what I tried here, and as is clear from the gif (t-SNE visualisation), the clustering didn’t work out fantastically. Details of the process are under the cut. The dataset is 16,211 D:BH fics published on AO3 between May 2018 and June 2020.
1. Clean up the fics. I didn’t do any stopword removal here because of step 2. Just removal of funny symbols, etc.
2. Pull out and clean the verbs. I used spaCy’s part-of-speech tagger for this. I also lemmatised all the verbs pulled (so you may see some odd-looking words if you do explore the visualisation). Of course, the tagger sometimes misidentifies words as verbs, so you may see what should be nouns, etc. I’ve tried to remove character names at least, relying on the long list of names I created from running topic modelling on this corpus some time back.
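As a rough sketch of step 2 (not the exact code used for this project): the function below keeps only tokens tagged as verbs, lemmatises them, and drops anything on a name list. The function name, the `names_to_skip` set and the example sentence are made up for illustration. It relies only on the `pos_` and `lemma_` attributes that spaCy tokens expose.

```python
from typing import Iterable, List, Set


def extract_verb_lemmas(doc: Iterable, names_to_skip: Set[str]) -> List[str]:
    """Return lowercased lemmas of tokens tagged as verbs, minus known names.

    `doc` can be a spaCy Doc, or any iterable of objects exposing the
    `pos_` and `lemma_` attributes that spaCy tokens have.
    """
    verbs = []
    for token in doc:
        if token.pos_ != "VERB":      # keep only what the tagger calls a verb
            continue
        lemma = token.lemma_.lower()
        if lemma in names_to_skip:    # drop character names mis-tagged as verbs
            continue
        verbs.append(lemma)
    return verbs


# Usage with spaCy (assumes the "en_core_web_sm" model is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   extract_verb_lemmas(nlp("Connor walked home and smiled."), {"connor"})
```

Filtering on the lemma rather than the raw token means one lowercase name list can catch a name however it was capitalised in the fic.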
3. TF-IDF weight the words. Basically, count how many times a verb appears in a fic, and then multiply that count by the inverse of how many fics the verb appears in, so verbs that show up in nearly every fic get weighted down. Words which are more ‘important’ or ‘representative’ of the fic in question should be weighted higher.
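To make the weighting in step 3 concrete, here is a minimal pure-Python sketch. It assumes the plain textbook form, count × log(number of fics / number of fics containing the verb). Library implementations such as scikit-learn's typically use a smoothed variant and normalise each row, so actual numbers will differ. The function name and the toy verb lists are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(fics: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each verb in each fic by count * log(N / document frequency)."""
    n_fics = len(fics)

    # Document frequency: in how many fics does each verb appear at least once?
    doc_freq = Counter()
    for verbs in fics:
        doc_freq.update(set(verbs))

    weighted = []
    for verbs in fics:
        counts = Counter(verbs)  # term frequency within this fic
        weighted.append(
            {verb: count * math.log(n_fics / doc_freq[verb])
             for verb, count in counts.items()}
        )
    return weighted


# Toy example: "walk" is in every fic, so its weight drops to zero.
weights = tfidf([["kiss", "walk", "walk"], ["walk", "think"]])
```

With this unsmoothed form, a verb that appears in every fic gets a weight of exactly zero, which is the "too common to be representative" idea taken to its extreme.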
4. Perform non-negative matrix factorisation (NMF) for dimensionality reduction. At this point, I could possibly have gone straight to t-SNE for visualisation with the TF-IDF weights, but the results were even worse (if I recall). So I performed NMF. As with other topic modelling methods such as LDA, the number of dimensions/topics to go for is user-prescribed.
I ran NMF with 10 to 60 topics. At each point, I calculated the mean cosine similarity between all topics (we’d want something lower; topics that are too similar to each other aren’t great). I picked 40 topics in the end, since that’s where the mean cosine similarity seems to level off. This is definitely a subjective call, but since I’m really just doing NMF for faster t-SNE visualisation, I was OK just eyeballing the graph and rolling with this.
[Image: graph of mean cosine similarity between topics against number of NMF topics]
(The top 15 keywords for each topic can also be viewed on my GitHub page.)
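The topic-count check in step 4 can be sketched like this. It assumes each topic is a vector of verb weights (for example, one row of the topic-by-verb matrix that an NMF fit produces) and that the score is the average cosine similarity over every pair of topics. The post doesn't spell out exactly how the average was taken, so treat this as one plausible reading. The function names are illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(topics: List[Sequence[float]]) -> float:
    """Average cosine similarity over all distinct pairs of topic vectors."""
    if len(topics) < 2:
        raise ValueError("need at least two topics to compare")
    pairs = list(combinations(topics, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy example: two identical topics and one unrelated topic.
score = mean_pairwise_cosine([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Computing this score for each candidate number of topics and plotting it gives the kind of curve that was eyeballed for a levelling-off point.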
5. Run t-SNE. I used sklearn’s implementation. I also normalised my NMF weights before submitting them for the run. I’m terribly new to t-SNE, so I’m pretty sure the parameters I selected weren’t great.
I went with a perplexity of 350, PCA initialisation, a learning rate of 100, and a maximum of 30,000 iterations, stopping early if there is no improvement after 500 iterations. tl;dr, the output of t-SNE is heavily dependent on the parameters you set (there’s a great Distill article out there on this), and with my lack of experience I may have bungled this.
6. Visualise the t-SNE output. I used plotly’s scatterplot for this. Each dot is one fic. If you hover over a dot, you can see its top-5 TF-IDF-weighted verbs (the most ‘important’ verbs to that fic, according to the TF-IDF metric).
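The hover labels in step 6 boil down to picking the five highest-weighted verbs per fic. A minimal sketch, assuming each fic's weights are held in a dict of verb to TF-IDF weight (the function names and toy weights are made up for illustration):

```python
import heapq
from typing import Dict, List


def top_verbs(weights: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k verbs with the highest weights, highest first."""
    best = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    return [verb for verb, _ in best]


def hover_text(weights: Dict[str, float], k: int = 5) -> str:
    """Join the top-k verbs into one string suitable for a hover label."""
    return ", ".join(top_verbs(weights, k))


# Toy example with made-up weights for a single fic.
label = hover_text(
    {"walk": 0.1, "kiss": 0.9, "think": 0.5, "run": 0.3, "say": 0.05, "smile": 0.7}
)
```

One string per fic like this could then be passed as the per-point hover text of a plotly scatter trace (for instance via the `text` or `hovertext` argument of `go.Scatter`).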
Overall, the visualisation is more or less just a hairball, but to my amusement/chagrin, there appears to be a little nest of smutty-seeming fics lurking at the bottom:
[Image: close-up of the cluster of smutty-seeming fics at the bottom of the plot]
Mycroft Holmes submission form that i will probably regret doing in the cold sober light of day
Name: Cat(herine)
Age: 19
Highest level of education: BA (currently doing first year)
Occupation: Student. Apprentice/novice historian
Height: 5’ 1"
Gender: Female
Noteworthy Skill sets: lots of acting experience (been doing it since I was 4) and in writing, large vocabulary, good memory. My judgement calls tend to also pay off, but maybe that’s just luck. 
Negative aspects: I’m a massive worrier. I’m quite lazy. Shy. Cautious. Cynical. I find it quite hard to stand up for myself, especially with family members.  I have to consciously make myself have any kind of social life. I also write very detailed answers to submission forms.
And now I’ve just overshared. G r e a t.
Languages spoken: Conversational French and Spanish, but very rusty. My German is even rustier. I have an incy wincy bit of Ancient Greek. I don’t speak Latin, but I like learning to read it.
Best academic subject: Classics or History
Favorite academic subject: H I S T O R Y 
Worst academic subject: Geography? Some parts were interesting but it just never struck a spark. Also maths when it involves shapes and graphs. Arithmetic and mental maths and probability, I’m good with but all the graph translations and areas and curves….nope.
Level of fitness: I am really quite puny. I have plenty of stamina and I can swim (my diving is embarrassing though) but I have no muscle strength or flexibility at all. My BMI is healthy, though.
Feline, canine or both: feline
How would you rate your IQ: tests tend to put it 150-155. Where would I rate it? ehhh unless I’m very self-aware I couldn’t pinpoint it with accuracy, so. 
On a scale of 1 to 5 (with 1 being low and 5 being the highest) how would you rate your self-confidence? ____3.59__today, because I’m in a good mood
Would you say that you lean more to intellectual intelligence, intuitive intelligence or somewhere in between: Personally or in other people? Personally I’d say… intellectual.
Name the last book you read: For study or for fun? I’ll assume fun. Quiet by Susan Cain and I Claudius by Robert Graves. 
Please bold all that apply to the sentence below.
I want to ________with Mycroft Holmes.
A) have a meaningful friendship
B) have a successful mentorship
C) have a romantic relationship
D) have his babies and grow old- deep down in my wish fulfilment heart of hearts, only I’ve got to be realistic
Knowledge portion:
Solve the problems without cheating and bold your answers.
Consider the functions f(x)=
. In the standard (x, y) coordinate plane, y = f(g(x)) passes through (4,6). What is the value of b? __I’ve got up to 6=f(28+b) and f(4)= +2 or -2. Now I’m stuck._______
The closest star to the sun is Proxima Centauri. In which direction would we need to look in order to see it in the night sky? __I know that the earth goes round the sun!__________
What is the name of the galaxy group to which the Milky Way belongs? _Nestle galaxies. I don’t know._________
The Treaty of Frankfurt was signed 10 May 1871 between which two countries?
France and Prussia. It was a peace treaty. Germany and Italy were still very young countries at this point, in terms of being unified nations.
Missing angles
OK, so the other angles in the yellow triangle are 45 degrees and 90 degrees. The area of a triangle is half base times perpendicular height. All angles in a triangle must add up to 180 degrees. Maybe you could use the sine, cosine and tangent rules to get the answer but I’m stuck again. Oh and the square of the hypotenuse is equal to the sum of the squares of the other two sides.
Genetic evidence shows that the people with whom a majority of the population of Britain share the closest genetic link are:
Certainly the Anglo-Saxons, given that the Saxon part comes from Saxony in Germany. Depends on the region I suppose.  The northern parts had the “danelaw” because they were settled by vikings.
How long did the “Hundred Years’ War” between England and France last?
116 years!
What is the heaviest breed of bear?
Polar bear?
Koala bears are small and polar bears live in colder regions than black bears so they probably have more fat which makes them heavier?
When grilling, where does the heat source come from?
Electricity. I don’t know.
How many of the original 51 Member States of the United Nations are still members under their original names?
C) 43
Mostly because of the impact of the break up of original member state USSR and the end of the Cold War. Yugoslavia broke up (messily) in the 1990s. Czechoslovakia split to become Czech Republic and Slovakia. 
Don’t know what the art is from. Looks Greco-Roman but in good condition so maybe Georgian? I appreciate it aesthetically, at least.
Counterpoint? 5th species counterpoint? What are species doing in music, I thought they were biology? basic error? w h a a a a atttttt
I only know basic music like treble clefs and minims and crotchets and beats in a bar. I don’t even know where to begin ha.
In my defense, there weren’t any literature or classics questions on this knowledge test. And hardly any history questions.  
Mycroft's Answer:
You're a bit hard to pin down but you have great promise for improvement given enough time and clemency. Mycroft loves scholars and will 'shoot the breeze', so to speak, to impart knowledge on his favorite subjects with just about anyone willing to suffer him. As long as you can keep up with his intellectual pursuits you should be fine.
have a meaningful friendship: 7/10
have a successful mentorship: 9/10
 have a romantic relationship: 6/10 (at least until you hit 21)
have his babies and grow old: 6/10 (at least until you hit 21)
