#and romantic needs are some permutation of relational needs | Explore Tumblr posts and blogs

schweeeppess · 6 years ago

Text

“But I must also feel it as a man.”

A/N: I don’t need an excuse for writing cdbfvw what are you talking abouttttt.

This takes place in the regular comics, with only two things amended: Bruce slit the side of Jason's neck open like he did in Under The Red Hood, and Jason hasn't attacked Tim, at all. Damian isn't in the picture, and neither are Cass and Duke.

Jason's about nineteen-twentyish.I really hope you guys enjoy! I banged this out in, like, three hours? Four, tops?

Warnings: Jason curses

Batman stared at the boy-come-man sitting across from him, hands cuffed to the table, green eyes bright in self-vindication. The white streak of hair over his eyes didn’t lessen the passionate fire that burned in the green oceans.

“You killed again.”

It wasn’t a question, but Jason still answered him.

“Yes.”

Shoulders tightening at the answer, and fists curling, Bruce said, “We don’t kill, Jason.”

“I haven’t been part of your collective ‘we’ since you slit my throat,” Jason hissed, the fiery green in his eyes flaring up, his own hands balling into fists, and the muscles on his arms tensing and visibly straining under the fabric of his grey long-sleeved shirt. “So, next.”

“You promised.”

“They were child traffickers that raped the kids and got them hooked on drugs to force them to stay! You know how my mom died! You know my childhood!”

“We had a deal, Jason.”

“I don’t give a shit about your moral superiority, Bruce!” Jason shouted, slamming a fist down hard on the table, making the glasses of water jump a little.

Bruce merely blinked, unbothered by the outburst, and Jason kept talking.

“Fuck you, Bruce, fuck you,” he seethed, eyes flashing again. “Maybe I’d put up with this shit if it was on a different case—just maybe—but no. No. Not this one. They were child traffickers. They were rapists. They were drug dealers that forced addictions on these children.”

Jason’s hands shook and Bruce was struck with the feeling that Jason wasn’t seeing him, instead seeing something or someone else.

His voice lowered to a whisper.

“Do you know how many bodies I found, B? How many ten-year-olds and eight-year-olds died of overdosing?” His eyes re-focused on Bruce, and Bruce simply sat there.

“Ask me.”

Bruce decided to comply, though dread was coursing throughout him as surely as blood did, weighing his heart and limbs down like anchors.

He inhaled a little then forced the question out.

“How many?”

“Thirty-eight.”

Bruce’s mind spun at the number and his breath hitched. He was left reeling, struggling to maintain his composure, and Jason sat there, gazing emptily at a wall.

“Thirty-eight,” Jason repeated, voice lackadaisical. “And I don’t know how many more are going to be struggling with their addictions and rehab.”

Still struggling, Bruce chose that sentence to grasp hold of and re-center himself with. “Didn’t you count the children?”

“Oh yeah. I know how many I removed from those warehouses.” Jason’s eyes lazily slid back over to Bruce, and his body loosened up as he leaned back in his seat, feigning boredom or indifference. “I just don’t know how many of them are going to kill themselves.”

And that kicked Bruce back down the flight of stairs he’d just struggled to climb with truth.

“I…” Bruce closed his eyes and took a breath. “I’m sorry.”

He heard a loud crash and his eyes snapped open to see Jason standing, chest heaving, his glass of water in shards on the ground.

“You’re fucking sorry?! Are you fucking kidding me?” Jason laughed bitterly, throwing his head back.

“Oh, oh that is just rich!”

Jason looked back at Bruce, who hadn’t said a word, and still smiling joylessly asked, “And what, dare I ask, exactly are you sorry for?”

Bruce didn’t hesitate with his answer.

“All of it.”

He stood, Jason tilting his head back a fraction to look Bruce in the lenses of his cowl, and moved to take the cuffs off.

Stunned, Jason merely stood there as the hand cuffs fell off his wrists and Bruce returned to his seat, easing himself into the chair again.

“I know that saying I’m sorry won’t do anything, Jay, and that is… unfortunate. So, I’d like to try this all again.” He gestured to the empty chair with a gauntleted hand.

When Jason didn’t budge—didn’t even blink, actually—Bruce sighed and lowered his cowl.

Lifting his blue eyes that seemed to grey right with his hair, Bruce met Jason’s strong green gaze.

Jason was in every way Bruce’s opposite, but in those very ways he was almost exactly like Bruce himself. His passion that fueled what he did, his stubbornness, his iron will. They even bore physical likeness. Jason was almost as tall as Bruce, and the older man had a suspicion his son would grow taller, his broad shoulders, his massive build. One would look at Jason beside Bruce and could be forgiven for mistaking them as biologically related.

But certain aspects of Jason were far from similar to Bruce’s. The way he forgave, the thirst for approval, the need for acceptance, his compassion, the way he would do anything for family, friends, and even literal strangers. Jason planned things to the very letter, but would throw those plans right out the window for somebody. Bruce planned things the same way, and was only forced to abandon them; he never willingly disregarded them for anybody. The only exception had been Jason, in Ethiopia, but nothing had been planned, then.

None of that was Bruce.

Yes, he forgave, but not as readily as Jason did. Yes, he was compassionate, but not to Jason’s level. Yes, he would do anything for his family and friends, but that was where he drew the line at anything, and even then he had one exception: He wouldn’t kill, not even for them. Jason had no such reservations.

And it was just part of what made Jason, Jason. Bruce hated that he was only just realizing that it was as much a part of his son as Batman was a part of Bruce; he hated that it took Jason’s fury, Bruce containing him in a room against his will, and a shouting match for him to understand and even try to grasp the concept.

Because Jason was still his son.

Jason died.

But he came back.

Maybe he came back different, but he came back all the same.

As Jason gritted his teeth and forced himself to take his seat again, it suddenly wasn’t grown Jason moving.

It was Robin Jason, and he was steaming about a case involving a rapist and murderer, those teal eyes aflame in the very same passion that burned within adult Jason’s permuted green eyes.

Bruce’s mouth went dry.

Jason crossed his arms, suspicion pulling at his lips and eyes.

“Try what again,” Jason deadpanned. “This?” he gestured between himself and Bruce, “This isn’t one of your romantic relationships, where you can fix it with a box of chocolates, some roses, and an ‘I’m sorry, I’ll never do it again’ Bruce.”

“I understand,” Bruce said, clearing his throat a little as his mind shook off the brief memory. “And I’m not going to treat it like one. I want to try and fix it, Jason, but only if you will let me.”

Jason didn’t say anything for what felt to be an eternity, instead looking for something in Bruce’s eyes.

Bruce waited.

When he finally seemed satisfied, Jason’s entire body went lax and he leaned back in his seat and gazed up at the rocky ceiling.

“Why now? You just arrested me for murder, and I’m pretty sure you were gonna cart me off to Blackgate or Arkham as soon as you realized that I’m not like your Rouges gallery: I’m irredeemable, unable to rehabilitate, and a psychotic killer who just adores breaking promises he makes to his friends, family, and himself.”

Where there was once a raging fury and passion in Jason’s eyes, there was now nothing but exhaustion and pain. There was confusion and resignation.

Bruce owed him a straight answer after two lifetimes without one.

“Because I love you, son.” His grey-blue eyes flicked to the scar just visible over the collar of Jason’s jacket, and he grimaced. “Hurting you is something I regret every second of every day. Adding a scar to the myriad on your body makes me want to puke, and I hate myself for it with every breath I take.

“I… didn’t react the way I should have when you came back, Jason,” Bruce whispered, looking back to Jason’s eyes. “And I am very sorry for that. I tried not to let my emotions bias my actions, but regardless they did, and I see that now.

“When you were Robin I told you that everyone, no matter who they are, has a chance at changing their lives. I told you that it’s why we never kill; because every life taken is one that didn’t get a second chance. The Joker I don’t believe will ever change, and I don’t kill him now because that would be betraying myself. If I were to stab him, or shoot him, or break his neck, I’d be stabbing myself. I’d be shooting myself. I’d be killing myself. If I kill, Jason, I’m betraying all of you right along with my morals.

“I expected you to hold fast to my morals—expected you to take them as your own—so I was hurt when you took lives. I feel betrayed when you kill. And I realize and acknowledge my mistake now. I thought that you killed because it was the easiest thing to do.

“Jason, I am sorry for forcing my values onto you and holding you to them. I am sorry for hurting you, I’m sorry for fighting you and making you feel unwelcome, unloved, and abandoned.

“I’m sorry for betraying you.”

Bruce held his son’s gaze throughout the entirety of his speech before pausing to close his eyes, collect himself again, and opening them to continue.

“I can’t promise that it won’t hurt me in the future if or when you kill, but I can promise that I won’t make you feel like an outsider or an outcast because of it. You are my son, you always have been, and I truly apologize for being blinded by the very thing many people believe me incapable of having—my feelings. If you’d give me the chance to be part of your life again—to be your father—I promise I won’t throw it away or mess up as badly as I did the first time.

“Even if you don’t want me to be your dad, and reject this, I ask that you at least try and not be so openly hostile with your brothers. They all want you back, Jason, even Timothy, who hasn’t even met you yet. Don’t let your grudge against me get in the way of having your brothers, at least, if you feel you won’t have me.”

Now finished, Bruce waited for Jason.

Minutes passed in silence before Jason did anything more than breathe and blink, and the moment he seemed to return to himself moisture collected in his eyes.

His voice cracked a little as he spoke, and it broke something in Bruce’s heart to hear his son’s voice so vulnerable.

“But you replaced me, Bruce. How am I supposed to take that? I die, and not six months later you’ve got a new Robin flying next to you.” Jason didn’t bother to wipe at his eyes as they welled up with tears, instead continuing to stare at Bruce unashamedly. “What am I supposed to do then?”

Pain floored Bruce internally, and he felt his own eyes start tearing up with Jason’s. He tried to blink away the tears.

“Tim never replaced you, Jason. Never.”

“Yeah? ‘Cause that’s a pretty convincing suit he puts on, albeit with pants.”

“I didn’t go looking for him.”

“Did you go looking for me?”

“I—you know what I mean. He came to me, Jason. Literally knocked on the door and blackmailed Dick and myself. What was I supposed to do? He knew I was Batman, that Dick was the first Robin, and that you were the second. He knew that the cover story was a lie. He knew, Jason. What was I supposed to do? I couldn’t risk Dick’s safety, Alfred’s safety, just because the neighbor boy was insistent. I tried using the Brucie act, but he stole the Robin suit and went out. I sent him home. He did it again the next night. And the next, and the next, until I took him in because he and had no training, and the approval of Dick and Alfred. He would have gotten killed if I didn’t, and damn me if I was going to watch another Robin die.”

His voice nearly took a pleading note as he asked Jason, “What was I supposed to do?”

The tears in Jason’s eyes finally started slipping down his face as he tried to keep his twitching lips in a thin line and he fisted the cloth on his arms.

Bruce licked his lips and finished with, “I love you Jason. Nobody could ever replace you.”

“The plaque says ‘A Good Soldier’,” Jason tried to argue in a shaky voice.

“That was me being an ignorant idiot, and I’ll go destroy the case as soon as—”

“You’d really do that?”

Bruce paused, brow twitching a little in confusion. “What? Destroy that insult to you? Of course, son. I’m sorry I ever put it up.”

Jason stood, shoving himself up and out of the seat, and pointed a shaking finger at Bruce.

“If you mean that, then let’s go. Let’s destroy that thing right the fuck now.”

And Bruce stood up and said, “Okay.”

He walked out of the interrogation room he had in the Batcave and left the door open for Jason, who was right on his heels, heading for the emergency axe he had by the Batcomputer.

Bruce grabbed it and handed it to Jason, taking a second for himself.

For a minute his hand was there, extended out toward Jason, holding the axe for his son to take, and Jason hesitated.

Then he snatched the weapon and Bruce went to walk toward the memorial case.

The entire time both were silent. The entire cave was, even the bats who usually squeaked often enough for there never to be true quiet. Dick was on patrol with Tim. Alfred was probably in the Manor. It didn’t matter, because that meant that this moment would belong to Bruce and Jason alone, as they stood before the case.

Bruce’s gut twisted in disgust as he re-read the plaque.

In memory of Jason Todd

Robin

A good soldier

What had he been thinking, making that thing?

Face contorting into a scowl Bruce lifted the axe and swung as hard as he could at the glass, hitting it hard enough to watch several satisfying cracks spiderweb from the spot he’d hit it.

He did it again, and it cracked even more.

The third time, Jason swung before he could, hitting it with force enough that probably matched Bruce’s.

The glass was almost broken.

The final blow Jason and Bruce made at the same time.

As the case cracked and splintered to the ground, Jason threw the axe on the ground and himself at Bruce, arms wrapping tight around his father as he started crying into the Batsuit.

Bruce dropped his own axe and returned the embrace tightly, putting a hand on the back of Jason’s head as the other held his son close. He rested his mouth on Jason’s head, squeezing his eyes shut as emotion flooded him.

The last time he’d held Jason he couldn’t remember, and he hated himself even more for that.

A tear slipped down his cheek and he held his trembling and sobbing son closer.

“I’m so sorry,” he whispered. “I’m so, so, sorry.”

Jason replied between his silent sobs.

“I forgive you, dad.”

“I love you, Jaylad. God I missed you.”

And they both cried in each other’s arms.

“I shall do so,

But I must also feel it as a man.

I cannot but remember such things were

That were most precious to me. Did heaven look on

And would not take their part? Sinful Macduff,

They were all struck for thee! Naught that I am,

Not for their own demerits, but for mine,

Fell slaughter on their souls. Heaven rest them now.”

—Macduff in Macbeth by William Shakespeare

The title comes from Macbeth :D

Tags: @mizmahlia @boosyboo9206 @an-all-write-life @lovelywally-deactivated20181210 @avengerdragoness @crazyfreckledginger @red-balistic @solis200213 @emmadevr @tomscaprisun @queen-fighter @jaybird-rednerd @shirokokuro @aaren-27 @osejn @v01d-ch1ld @angstytodd

#my writing #my work #my fanfiction #fanfiction #Bruce Wayne #Jason Todd #Batman #Red Hood #Batman & Red Hood #Bruce Wayne & Jason Todd #Jason Todd & Bruce Wayne #Red Hood & Batman #Father/son bonding #no inscest #like #never

117 notes · View notes

vaguely-concerned · 6 years ago

Text

Since he’s apparently the one thing from Mass Effect Andromeda that won’t leave my brain alone even after two years, have some assorted Reyes Vidal thoughts

- The one thing that really keeps niggling at me: he tells you that he came to Andromeda “To be someone” -- in a moment of rare unguarded honesty; I’m willing to bet that’s the only thing he says in the entire game that’s completely true lol -- and yet he (admittedly with great success) goes about doing it specifically in such a way that no one fucking knows who he is. So... be someone in the eyes of who, exactly? Just himself? He clearly has an ego, but it’s curiously introverted; he doesn’t seem to give much of a fuck what other people think of him, so long as he knows in himself that he’s running circles around them. Where the hell does he come from that this is so important to him?

And yet there is a clear tension between that whole wreathed in secrets business (which he equates with safety and survival in the dancing scene) and the wish (need?) in him to be seen and known, in a personal sense. This is why I do think he genuinely likes and cares about Ryder. He offers them the truth of why he came here as a first hint of vulnerability and if they don’t meet him on that -- if you blow it off with a joke, “Wow, finally an honest answer” -- he slips the mask back on and the romance doesn’t go through. If you answer his claim that he doesn’t want any more secrets between you with (a probably accurate and clear eyed) “dude you live and breathe secrets don’t go promising things you can’t give just don’t lie to me about important things” he is touched by the fact that you know him for the sneaky bitch he is and still want to stay. So there’s clearly something genuine going on here, under all the layers of bullshit haha

- Considering all the resources he has at his fingertips literally the only conceivable reason he’s consistently sneaking out of paying for his drinks at Kralla’s is the instinct to be a little shit. reyes umi can & will stab you, pls rethink your life choices

- I’m pretty sure he’s the romanceable character where the player has the most agency over how Ryder responds/relates to him. There’s really a quite wide range of outcomes here: if you side with the Collective you can romance him, either all the way or break up with him once you know who he is, or you could build up what looks like an actual burgeoning friendship or at least alliance, or you can make it clear you’re only working with him out of necessity and not because you like him (if you refuse the handshake towards the end he goes “You know, Ryder... you’re kind of a dick. But I’ve worked with worse” and I think that is beautiful)

Even on the Outcast side there’s a lot of nuance depending on what you do and the preexisting relationship: if you shot him in the back after romancing him his email reads as barely holding a grudge at all, really, in a ‘ah well c’est la vie guess I should have seen it coming this one’s on me’ way. If you shot him in the back and you’re not even bros he’s p i s s e d and vowing revenge (the only time he resorts to naked hostility). If you’re friendly with him and just let him get away it reads more like a warning about Sloane/trying to manipulate you and give you doubts but without personal animosity.

I actually super enjoy this range of possible responses and outcomes -- with a lot of the squad mates I feel really railroaded into becoming BFFs or romantic no matter what bullshit they pull (pEEBEE), but considering all the ways his storyline can end... romancing Reyes is a very bad decision you have to consistently pursue for quite a while lol

- I’d say the two times he seems to be openly really angry is when Zia is having a go at Ryder for admittedly being naive about him (something about that ”Leave him/her out of this” would make me rethink trying to fuck with the Charlatan i m m e d i a t e l y lol), and in one permutation of the email he sends after you’ve shot him in the back when there’s no friendship established (and you sort of have to go out of your way not to be at least friendly with him, I’m pretty sure).

- He is 100% a control freak disguised as a laid back charming laissez-faire sort of dude (more or less the same character type as Iron Bull, only without y’know the loyalty or basic morality or extenuating circumstances of being raised under the Qun lol)

He has eyes and ears everywhere, knows exactly what strings to pull, and moreover seems to take a special enjoyment in not only outsmarting people but also doing it without ever giving away who he actually is ...but interestingly lying to Ryder about who he is quickly appears to mostly make him mildly uneasy, there’s none of that triumph or smugness there in that specific relationship/situation (unless you uh well shoot him in the back, which... actually, fair, hard to begrudge him that)

- Keema seems genuinely quite fond of him and thinks him stressing about what Ryder is going to think once they know the truth is equally parts hilarious and cute lol you get the feeling this hasn’t happened before while they’ve known each other

- The only goddamn explanation for why a romanced Ryder is SO upset with him over hiding his identity despite them knowing each other for like two weeks max is that they absolutely fucked at least once and it was fantastic, I don’t see any other way to parse that

BONUS THOUGHT: the fact that these two assholes hold hands like goddamn school children as they run away after stealing Sloane’s whiskey and that Reyes smiles into every kiss except the one you surprise him with is STUPID and AWFUL and UNFAIR and I resent it all deeply

#mass effect andromeda #mass effect #reyes vidal #reyder #meta #maybe it's just nicholas boulton's voice that's hotwired to my heart but yeah here's my shameful secret he's my ultimate me:a bae #ryder really comes alive in the kadara arc too there's some real chemistry there no matter what #(not in the romantic sense necessarily of course it just brings out a lot of cool things in ryder to have someone to play off of)

19 notes · View notes

so-shiny-so-chrome · 6 years ago

Text

Witness: Owlship

Creator name (AO3): Owlship

Creator name (Tumblr): v8roadworrier

Link to creator works: https://www.archiveofourown.org/users/owlship

Q: Why the Mad Max Fandom?

A: i am still asking myself this question! something about fury road grabbed me at just the right point in my life to interest me, and the people & community i found have been just wonderful at keeping me feeling interested & connected. i love that the world presented is clearly well thought-out and cohesive, while at the same time allowing for a huge variety of explorations even while staying strictly within the bounds of canon.

Q: What do you think are some defining aspects of your work? Do you have a style? Recurrent themes?

A: well, it's pretty clear that i adore the relationship between max & furiosa, since they star in 90% of my fics, and au's are kind of my thing. i don't consciously have a style that i write in- i just try and write more-or-less what i think could reasonably happen, i suppose, and to be honest i think of my actual writing as pretty utilitarian, rather than anything with a nice artistic style. probably the most frequent recurring theme in my fics is pining leading up to a happy ending, and i like to think i flirt with miller's idea of "engage to heal" pretty frequently as well.

Q: Which of your works was the most fun to create? The most difficult? Which is your most popular? Most successful? Your favourite overall?

A: i have fun with all my fics, or else they don't get written! i'm not good at making myself do things i don't want to do, especially if the only reason to be writing fic is to have fun in the first place. most difficult would probably be "birds in last year's nest" (the omega!max fic) because i really wanted to handle the issues in it well, while the easiest to get written was "out of the bag" (cat!furiosa) despite its length because it basically just wrote itself. my most popular is definitely "around the corner" (petshop au), which has a very dear place in my heart even if it's not the most polished of my fics. my favorite is usually whichever i've published most recently :)

Q: How do you like your wasteland? Gritty? Hopeful? Campy? Soft? Why?

A: hopeful above all, with a good balance of gritty and soft, depending on the particular fic. i like to explore the realistic effects of things, but i'm also happy to gloss over the tricky details in favor of fluff. i've only written one fic with an unhappy ending so far and i don't see myself adding to that number anytime soon, and i am just not great at humor so i avoid trying to be funny.

Q: Walk us through your creative process from idea to finished product. What's your prefered environment for creating? How do you get through rough patches?

A: my writing process is simple: i get an idea (usually i steal it), i bundle myself up in bed, and then i do other things while writing a sentence or two every few hours. sometimes i get into the groove and can bash out a few thousand words in a day, other times i flounder for weeks without anything holding my interest. when i do write i always work chronologically, which means finding the actual start of the fic can take a few tries, and figuring out the end can be difficult if i haven't really filled in the details in my head yet. for rough patches i put my head down and try to force words out, but if it doesn't want to happen i just let it go and move on, unless it's for a gift, or something like nanowrimo where i want those bragging rights. i don't use written outlines or keep notes of anything, which is a bad habit but one i can't shake. if it's not important enough for me to remember, how important was it really in the first place?

Q: What is your biggest challenge as a creator?

A: right now it's finding the motivation to write when i've got other stuff going on in my life, especially on days when i am tired out even on my days off. other than that- staying focused on a project long enough to get it finished! i also struggle with juggling multiple characters especially in the same scene, making sure that everyone gets their turn and sounds authentic.

Q: How have you grown as a creator through your participation in the Mad Max Fandom? How has your work changed? Have you learned anything about yourself?

A: my writing, both in terms of technical skills and how i compose a story, has just improved leaps and bounds since i started writing fics, thanks in large part to the feedback i'm lucky enough to get, as well as the sheer volume i've been able to put out. i've definitely learned a lot about what kinds of ideas interest me to write, which is not necessarily the same things i want as a reader.

Q: Which character do you relate to the most, and how does that affect your approach to that character? Is someone else your favourite to portray? How has your understanding of these characters grown through portraying them?

A: i probably relate to max the most, or at least the version of him that lives in my head- it's easy for me to get inside his pov, but that means i have to stop myself from making *every* fic his pov! furiosa is a close runner up in terms of how much i like writing her, which is lucky because she's the other 50% of my fics, but it's a lot harder for me to get inside her head, so i have to pay attention more to what i'm doing when i write her.

Q: Do you ever self-insert, even accidentally?

A: i probably do, but not intentionally. of course i use my own experiences and feelings when writing, but i always try to translate them to the mindset of whoever i am writing. it's just been drilled into my head too many times that writing yourself as a character is not what you are supposed to do, i think.

Q: Do you have any favourite relationships to portray? What interests you about them?

A: max & furiosa, 100%. platonic, romantic, as soulmates, as enemies- i love every possible permutation of how they can interact with each other since they're so similar but still very distinct. i love how much of their relationship is unspoken but perfectly understood- or not, and how that can set up their interactions.

Q: How does your work for the fandom change how you look at the source material?

A: i pay a hell of a lot more attention to what's happening in canon, and pick apart even minor gestures or bits of speech to really drill down into the character's heads. if i was just watching the movie(s) to enjoy them, i'd stay a lot more surface level instead of analyzing details like what the interior of the war rig says about furiosa, or what's in max's kit at the beginning of the movie vs the middle, etc.

Q: Do you prefer to create in one defined chronology or do your works stand alone? Why or why not?

A: nearly all of my works are unrelated. i love coming up with little tweaks that don't really effect anything but might contradict each other (which of the wives takes on what role post-canon, how long it takes before max comes back for the first time, etc), and writing in a single series would mean i'd have to address those differences. short fluff or pwp pieces where the entire fic is just a single scene tend to share enough similarities that you could imagine they take place in the same 'verse, but to be honest, that's just me being lazy ;b

Q: To break or not to break canon? Why?

A: canon is fake and the author is dead! that said, i do actually try and stick as close to the canon facts as possible unless it's something i'm deliberately changing, because after all without canon there wouldn't be any shared understanding of the characters that makes fanfic possible. this is one of the trickiest parts about writing an au, because i have to find the right balance of familiarity to canon with what's different about each au in order to have the changes i make to the characters/setting/etc make sense to the reader.

Q: Where do you get your ideas for your AUs?

A: all sorts of places! some of them are given to me- i love prompts- others i steal from other fandoms, like bodyswap or wings or turning furiosa into a cat, some i search out via idea generators, and at this point i honestly can't watch/read any new stories without going "but how can i turn this into an au??" i also like to say "what if" almost *constantly* and sometimes that leads to full fics, other times i just make a post on tumblr with some half-baked ideas of how it could work out. what if furiosa's mother didn't die before the movie? what if max had a pet dragon? what if it started raining and didn't stop? it's honestly harder for me to write a strictly canon fic at this point :)

Q: Share some headcanons.

A: i actually don't have a ton that apply to every fic, because i like switching things up- but here's some ones taken for granted in 99.99% of my canonverse fics: furiosa lives after the end of the movie without any major complications, max comes back to the citadel at some point, furiosa has her own room with not much more than a bed, a workbench, and a window, the war boys are willing to accept the wives as the new rulers (and that the wives form a council rather than a dictatorship), and somehow the bullet farm & gastown fall into line with the citadel's new way of thinking. also, max has a sweet tooth and furiosa doesn't remember most of her dreams.

Q: What advice can you give someone who is struggling to make their own works more interesting, compelling, cohesive, etc.?

A: something i try to keep in mind at all times is: write for yourself and not your audience. does your heart of hearts want to ship those two characters? hell yeah make 'em kiss. have a scene that is super cliche or over the top but you can't stop thinking about? write it! your stories need to be interesting to you first and foremost, because a reader absolutely can sniff out the difference between a scene you thought would be "good" and one you had fun with. you can always edit later to shape your fic into a different direction if you feel like you need to.

Q: Have you visited or do you plan to visit Australia, Wasteland Weekend, or other Mad Max place?

A: i've been to wasteland weekend twice now and hope to visit many more times in the future! it's a super fun experience in general, and it's also helped me get a feel for what a mad max world would really be like, rather than just relying on my imagination. i'd love to visit australia some day, both for mad max and other reasons, but ideally not while there's an apocalypse going on.

Q: Tell us about a current WIP or planned project.

A: *throws dart at gdocs* let's see.... i've got a fic started where furiosa is a viking, and after a raid gone wrong she ends up injured at max's farm where she has to learn the language and customs and come to terms with being his slave (until they fall in love, obviously). haven't worked on that one since july but hey, it's not going anywhere.

Thank you @v8roadworrier

#mad max fanfic #mad max fandom #Mad Max Fandom Spotlight #Mad Max Fandom Creator Spotlight #mad max fanfic author spotlight #fury road fanfic #fury road fandom #owlship #v8roadworrier

24 notes · View notes

lovelyagnesblog-blog · 7 years ago

Video

youtube

Oh Agnes, won't you go with me? We'll be married in style And we'll cross Lake Michigan, so blue and so wide We'll cross over Lake Michigan, 'til we come to the shore And our orchards will blossom for our babies as they're born

Two wonderful friends of mine sang this song for me and my husband at our wedding. It’s a lovely tune, and it’s obvious from this chorus that it’s a nice choice for a wedding. Once you check out the verses, though, there are some bits that some folks might find questionable--no one said anything, because we only invited well-mannered people who wouldn’t make obnoxious comments, but I can imagine some people might have been like why am I crying over this song about a 92-year-old woman and her weirdly-named children.

Some people might have preferred to feature a song that primarily focused on romantic love, rather than (grand)motherhood, for their wedding. I get that. You’re young, you’re madly in love, you’re the hottest you’ve ever been (and probably ever will be because never again will you spend this much time and money on looking good for one day,) why not have someone sing about all of that? It makes sense. I’m not knocking it.

But from the first time my wonderful friends sang “Lovely Agnes” to me (yes I was weeping openly by the third line, what’s wrong with that) I knew that it was perfect for our wedding, and that it would be very important to me for a very long time.

Obviously, I am a deeply sentimental person. I’m the type of person who makes playlists to recall the mood of very specific moments in my life; I’m the type of person who keeps mementos for way too long given how much I hate dealing with clutter; I’m the type of person who almost never fails to fill up every square inch of the inside of a greeting card with my sappy ramblings to mark an occasion. But it’s not just my deep-seated love for the saccharine that draws me to this song to the point of naming my blog after it.

See, in addition to my sentimentality, one of my defining characteristics is my domesticity.

What is she talking about, say my friends, who know I haven’t seen the floor of my bedroom or the bottom of the kitchen sink in weeks. You always tell people you love to bake, but when was the last time you made me a cupcake?

You are not wrong, friends. But I didn’t say one of my defining characteristics is being good at domesticity. I love and value domestic life incredibly highly; that doesn’t mean I do it well.

The only life goal I have consistently, consciously dreamed of and planned for is homemaking and the raising of children. I used to read my parents’ parenting books (I’m talking from the age of eight or nine, here.) I loved when pastors would give sermons on parenting. I whiled away many an hour comparing the parenting techniques I saw in my own mom and dad with my friends’ parents, my neighbors, my aunts and uncles, TV characters, anyone I could think of. As a sentimental thirteen-year-old, I would write letters to my future husband, and plans for our kids were heavily featured (alongside promises of extreme chastity.) When I decided it was my destiny to be a high school English teacher, one of my primary reasons was because it would allow me to spend more time with my kids. Even in college and early adulthood, behind every permutation of what I thought my future would look life, I knew that what I really wanted more than anything was to care for a home and a family. I have read the complete archives of more than one mothering-focused blog. I was forever trying out new apps, systems and methods to try to get my life together and become the tidy, presentable, Ideal Housewife type I dreamed of being.

Here’s the major problem, the reason I feel I need to write about this, the reason behind the tagline--I am so not that type.

There are so many complicating factors in my pursuit of domestic bliss. First and foremost, I have ADHD, which really effs my ability to form habits, stick to routines and follow through on tasks, making it super difficult to live up to my own standards for cooking and cleaning and whatnot. Secondly, not to be disregarded, there’s feminism--I’m a smart, well-educated, independent woman, damnit! Shouldn’t I want more out of life than motherhood? Shouldn’t I be reaching for some grand career? Shouldn’t I do something with all this potential, since I was lucky enough to be born in a society that would allow me to?

On a related note, there are the privileges I hold in class and race. Does the world need another upper-middle-class white mom? Wouldn’t I contribute more to the human species if I focused on a profession that would help lots of people, instead of dedicating most of my time and energy to my spouse and offspring alone? Isn’t it selfish of me?

And yet, as I’m entering my late 20s, having tried my hand at a few wildly varying jobs; having considered more school and having experienced un-, under- and over-full-time-employment, I can’t shake the feeling that domesticity is what I’m made for. I’ve never felt more deeply content than the few scattered stretches in which I’ve had the time and energy I need to keep up with household chores to the extent that satisfies me. I’ve never been more excited about the future than I am now, when I’m about to move across the country and start over completely with a new living space and new habits.

So that’s where Lovely Agnes comes in. Having my friends sing about Agnes at my wedding gave me a sense of security and comfort; it was a way for me to feel that my goals and dreams were being acknowledged. My husband and I had had plenty of times when our little romance was celebrated--maybe sometime I’ll write out the whole story of how he proposed, or the time when one of our best friends acted as chauffeur for an anniversary dinner, or our romantic week in Spain--and I wanted our wedding to be about more than our feelings. Our wedding was about the life we were going to build together and the community that was going to help us. It was about crossing the wild, blue lake--not to seek mad adventure, but to seek the orchards that would nourish our souls with their beautiful blossoms as well as our bodies with their fruit; not to change the world, but to create a refuge where the family can all join in the summertime.

I keep starting new paragraphs that go into completely tangential ideas that I really need to save for posts of their own, so I’m going to end with this: I’m starting this blog to process the complicated thoughts and feelings I have about womanhood, femininity, domesticity, homemaking, marriage, emotional labor and parenting in the 21st century. I will also probably write much less serious posts about the struggles of moving from the Midwest to the East Coast, finding a new job, getting a dog for the first time, and any other life happenings that I flatter myself might amuse people. I hope somebody finds it interesting, because nothing motivates me like internet notifications.

#writing #millennials #femininity #domesticity #homemaking #lovely agnes

11 notes · View notes

pellucidthings · 8 years ago

Text

Turnadette Fic Recs Part 1: Season 2

@gabolange and I talk about fic in this fandom a lot--many words, all the time--and are regularly reccing (or at this point, re-reccing, since we’ve both read everything at least once) things to each other. We thought it might be fun to post about some of our favorites. This post looks at season 2 fics, and then pop over to gabolange’s post on married Turner fics.

I love the Turners at nearly all points in the series, but I will always have a super soft spot for season 2. Slow burn falling in love! We want to be together but can’t even talk about it because she’s a nun! Talk about one of my pretty much bulletproof ship buttons…

So while I do enjoy the established relationship stuff, my favorite fic tends to do interesting things with our happy couple before they’re a happy couple (or at least before they’re a happy married couple). I could live in season 2 permutations forever: so many gaps to fill in, so many AU possibilities! Here are a few of the existing stories that do some of this stuff really well, divided into some genre categories:

Pre-Misty Road, Canon-Compliant

This is the stuff for which I most want all the fic. True story, I watched s2 for the first time on Netflix, completely unspoiled, and between the PBS/Netflix cuts and my attention being on other characters, I pretty much entirely missed any lead up to 2.05. And I remember texting gabolange all, “WTF, the doctor just kissed Sister Bernadette’s hand what is going on????” It felt so out of the blue! Like, they had one conversation about cigarettes and suddenly they’re in love? I fell for them anyway from that point forward (because my buttons), and then when I rewatched the BBC version while paying closer attention to them specifically, I saw it wasn’t quite so out of the blue. But I still always want more fic that unpacks all of that.

Sister Bernadette in the sanatorium is a bit of a genre, and there have been several really lovely stories exploring her journey and decision to leave the order. My favorite of these is The Open Window, by This Unruly Heart. Alternating Sister Julienne and Sister Bernadette perspectives, the Sister Bernadette struggle feels beautifully authentic (the faith stuff is really hard to get right, I think; a lot of people, including the writers on the show itself, either gloss over it, perhaps because it’s outside their experience, or alternately portray it in the kind of legalistic terms that doesn’t feel true to Anglican tradition or to these particular characters--this story gets it right). I also love that Sister Julienne’s surprise is genuine, which feels very in character (I don’t think she had any idea ahead of time), and the image of the letters bursting out of the box is excellent.

Fannish good manners say you don’t rec yourself, but it’s also true that we write the stories we want to read. And I had noticed, in my months of reading all the stories, that a) there’s very little s2 stuff from Patrick’s perspective, about his journey, and b) there’s very little that takes on what I think are inevitable missing scenes in between the ones we see, especially leading up to 2.06. So Good-Morrow to Our Waking Souls by pellucid (me!) tries to take that on. And because reccing yourself is weird, I will let gabolange jump in here and say why she thought it was successful:

gabolange here! (Proper disclaimer: I betaed this story and still think it has too many commas.) This story is successful for me on several levels. The first is the slow evolution of Patrick noticing Sister Bernadette--both in moments we saw and moments that are new to us. The build to his realization that he loves her, and then to the moment when he figures out that this isn’t one-sided, is developed in a way that feels absolutely genuine for the character. The second is that the missing moments really work and illuminate the characters beautifully. Timothy and Sister Bernadette bonding over bad science jokes? Of course. But more than that, these quiet moments between Patrick and Sister Bernadette that allow their relationship to grow while still conforming to all its limitations--she’s a nun and they’re definitely not supposed to be in love with each other--are just so well realized you wonder why we didn’t see them on screen. And finally, this story isn’t about Patrick’s faith, except it is: there are these wonderful threads about his belief (or not) in God, his relationship with the church, the way he sees Sister Bernadette in relation to those contexts, and those things are so rarely explored and so rarely explored as well. So, you know. Thumbs up.

Post-Misty Road, Canon Compliant (at least at time of writing)

Of course Patrick and Shelagh become rather more fun to the shippers once they manage to get themselves sorted in a truly excellent romantic scene. And there are also a lot of lovely stories set between the end of 2.08 and their wedding. (An aside: a lot of people have done the what happens immediately post-misty road thing, and with the exception of an AU rendition--see below--there’s no one version that entirely works for me, much as I enjoy parts of many of them, so I’m not reccing any of those at the moment. Maybe one of those things I’ll have to write for myself one of these days…)

Probably the most-recced story in this fandom is Timothy Turner and the Entertainment Badge, by Kathryn Wemyss.To be honest, I don’t love this story as much as many people do, though I do love it a whole lot. But there are a few too many tangents with original characters for my taste, which makes the pacing uneven. I put it on the list, though, because it’s still very good, because I think it’s important, and because there are good reasons why fandom loves it so much. This is one of the small handful of stories in this fandom that really illustrates what fic can do when it’s done well: it opens up doors that the show never could or should, it digs deep into our characters’ histories and into their present (at the time of writing), and it’s just flat-out well done. It also shows Patrick and Shelagh as grown ups who are trying to figure out this major life change as people no longer young, with histories and baggage, that they need to--and do!--talk through. I particularly like the contrast between their adulthood and Timothy’s childhood here. Not all important stories in this fandom need to look like this, but this fandom needs more important stories--more stories that are doing something no one here has done yet, and doing it well enough that everyone sits up and notices.

The show lets the relationship between Shelagh and Patrick unfold very privately, which also says something about their characters. This was a quiet, beautiful, rather secret thing for a long time. Timothy and Sister Julienne are brought in to see pieces, but for the most part, no one has any idea this is happening until suddenly, omg, Sister Bernadette left the order to marry Dr. Turner! What that looked like to their friends and community is a fascinating dynamic, and writergal85/@superfluousbananas does a great job exploring it in Found. Lovely views of their very new relationship from various perspectives. Fred is my favorite!

Alternate Universe

My favorite story in this fandom, it turns out, is a modern AU. It’s not my favorite story because it’s a modern AU, though it does that well (which is a difficult thing to pull off: in general, people either err on the side of hewing too closely to canon, so it feels artificial and repetitive, or they stray too far, so the characters no longer feel like themselves). Modern Love by ithinkyourewonderful changes just enough that the story feels authentically set in the 2010s, but not so much that these aren’t completely, 100% our characters. And it’s the characterization that makes me adore it and reread it every couple of weeks. This is as close as we’ve yet come to the season 2 story of my dreams: both of them on this journey, falling in love, and trying to figure out what on earth to do with that. The world is richly developed, and there are so many lovely moments between Patrick and Shelagh, both new imaginings of what we saw on screen and new moments altogether. This is also the post-misty road story that does everything I want it to do, even if it’s all necessarily AU. Highly recommended! (Though as a caveat, the formatting is awful, and the first time I opened it I almost didn’t read it because of that. But worth wading through the big blocks of text, I promise!)

And finally, the season 2 sneaking around AU that is my favorite new thing in this fandom, The Best of What Might Be series by @gabolange. A disclaimer: she basically wrote these for me, and I betaed them. Disclaimers aside, though, I think it’s important that stories like these exist in fandom and are well done, as these are. Would these characters have actually started having sex while she was still a nun? Probably not. But the point of fanfic is that we get to ask but what if they did?????? How does that change things? What do these characters--and even if the decision itself feels a bit out of character, once you handwave that, these are totally our characters--do when they stumble into this thing they really shouldn’t be doing but can’t quite quit? Maybe this sort of story isn’t everyone’s cup of tea, but it’s an absolute fandom staple--every fandom has its versions of this story, or should--and we’re incredibly lucky that our version happens to be by a truly excellent writer. And did I mention they’re really, really hot??? I love them all (and there are more to come!), but the second one in particular, just...guh. (Rated E)

#fic recs #turnadette #turnadette fic recs #fic is important #why smut matters

14 notes · View notes

trulycertain · 8 years ago

Text

Tru’s Writing Notes

I’ve had people ask me after seeing my feedback on stories if I’m as overanalytical with my own stuff. The answer is yes. My stuff may often be written at 4 AM and typo-laden, but yes.

Because of that and @thesecondsealwrites talking about process (though unlike her post, this is more the why/how than the everyday practicalities of writing), here are some of the notes I’ve left myself in my journal. These apply mostly to the way I write my original rather than my fic, but they can apply to both. Can I add: a lot of these probably seem very obvious, I know, and I don’t always manage to bear them in mind. Also, I’m not a pro or even a talented amateur, and these aren’t addressing an audience, they’re addressing me - and they apply more to the way I write than writing in general. But if anyone might find this interesting or wants to know if I worry about my writing, here’s your answer.

People tend to like a strong story, with good reason. The best plots tend to be simple, and then you build outwards and maybe twist. A compelling central arc, certain genre tropes or something familiar tend to be what work: forbidden romance, or an unsolved murder and a maverick. We have a fair idea of what’s going to happen, but it’s the anticipation - and/or the eventual subversion - that brings the fun. Plot and drive.

Again, try to have a strong idea of where it’s going, or the spirit of it. Terry Pratchett once said that you want to be able to write your own blurb: it’s a good sign if you can distil the essence of your story into a hundred words or so.

Just like real people, characters have verbal tics, peculiar turns of phrase and certain mannerisms. Learn them, and use but don’t overuse. Keep it natural.

Some people just don’t like present tense, or past, or first person, for whatever reason. You may be buggered from the start, and sometimes all you can do is try. Try and know your audience, try your best. Try not to bang your head against a wall.

However, present tense is a slippery bastard. At its best, there’s almost nothing that can match it for immediacy and visceral intensity. At its worst, it can either be staccato, bleak and overly clinical - or at the other end of the scale, it can be overwrought sensory overload. Either way, a reader will be put off. Ideally, I try to balance the two and end up somewhere in the middle: punch and verve, but with restraint and room for the reader to infer. I rarely manage this, but God do I try.

Speaking of inference: don’t assume the reader is an idiot. Sometimes the best punchline or explanation is the one that’s never given. Myself, my favourite horror stories are the ones that don’t go for shlock and shocks: they’re the ones where I finish them feeling mildly unsettled, go and do the washing-up while my mind puts the pieces together, and then go, five or ten minutes later, “Oh God, it was behind the door the whole time! That’s... Argh.”

People are terrifyingly complicated. Every reader brings something to the text, whether they’re aware of it or not. This can add unexpected beauty or poignancy, but it can also make implication, idioms, dialect and offence into total minefields. People can come out with things that would never have occurred to you. Something might fly over someone’s head, or something might turn out to be an incredibly offensive phrase in their country and perfectly innocuous in yours; someone might find your happy ending the most depressing thing in the known universe, and someone else might hate your likeable romantic hero because he reminds them of their arsehole ex. Sometimes you can anticipate this and take countermeasures for clarity’s sake; often you don’t need to because theirs is a perfectly valid interpretation and part of the joy of making a cake is seeing people eat it; and mostly you just can’t know, because people come in so many different permutations and you’re not actually psychic, so leave them to it. Gah.

Watch your tenses. Things like flashbacks are nightmare territory and ripe for grammar slippage. Never be afraid or too proud to read up on usage.

Same with semicolons. Tricky little gits.

People mangle language. Doesn’t matter whether you’ve had the “perfect” education, everyone does it at least sometimes. People lose words, misuse vocabulary (me, all the time), go for double negatives, mix metaphors. You always want your dialogue to be readable, and you don’t want your portrayals to be hackneyed or offensive, but it’s generally unnecessary to aim for perfection in dialogue unless it’s for effect: say, if you want to make a character less approachable, if you want to show they’re not human, or if rose-tinted dialogue is a stylistic choice. Generally, true-to-life dialogue is inherently descriptive rather than prescriptive.

Sometimes said mangling leads to fascinating new quirks, dialect and expressions.

Speech is very different from thought. A character’s narrative voice is often quite different to their dialogue voice. Thought is much faster than speech, and sometimes someone will answer their own question before they’ve finished saying it. Thought is by nature more disjointed, and thought is also a monologue, unless everyone’s suddenly turned telepathic or you’re dealing with dissociation/multiple personalities. In contrast, speech has a listener, which changes it. Nerves can make phrases choppy or make them fail completely. Often people interrupt each other. Realistic dialogue should reflect this.

On a similar note, let your characters talk. Know where to draw the line - no-one wants the tension ruined by a half-hour conversation about socks - but very few people are all business or all dramatic emotion all the time. (Those who seemingly are will have reasons for it, and those are often worth exploring, too.) Unless you’re on a particular word and/or time limit, let your characters occasionally be real people whose eyeliner runs, or who dislike artichokes, or who make bad jokes - and people who don’t revolve completely around your protagonist, with their own internal lives. When done right, relateable is not boring - especially if you’re working in a fantastic or dramatic canon. The odd anchor to reality can grab your heart and tug.

But do know where to draw the line. Let them be enigmatic and heroic when they need to, because often the magic is in that contrast between the epic and the mundane. Characters can do and be what we can’t. Don’t take away all their mystery and more idealised qualities.

There’s no one way to do funny, and there’s no way to write an instruction manual for it. Again, like most other things, it’s a matter of interpretation: everyone’s tickled by different things. But often humour relies on the subversion of expectation - bathos and anticlimax, for example, or giving an established word/phrase an entirely new meaning - or it relies on particular character idiosyncrasies, or on the other side, the utter, crushing fulfilment of expectations. (”Save the world, they said. It’ll be fun, they said.”) A good source of jokes is often that “I bloody knew it!” feeling.

Characters have biases, too. Always try and account for this in the narrative.

Foreshadowing is your friend, and often a key to emotional closure for the reader. Unless you can do some serious, stylish authorial sleight-of-hand, deus ex machina endings will prompt pissed-offness rather than satisfied applause. Even if you don’t introduce your secret weapon/s early on - best right near the beginning, if possible - at least get the key themes and characters down. You want to get an, “Oh, of course,” not “Well, that was a total arse-pull.”

Screenwriters sometimes talk of an A-plot and a B-plot. The A-plot’s the main one, and B is a seemingly separate subplot that inevitably turns out to be all tangled up with A. It’s pretty standard for detective dramas: there’s a murder, they start investigating, and the seemingly unrelated corpse on the other side of town always ends up being central to the case. A and B always converge. Often, if it’s a story with depth and a well-reasoned plot, the B plot will grow naturally. Of course, that’s only one way of doing it: some stories have a strong, driving A plot that drives everything and stands on its own, and have some C, D, E, F, so on plots. I admit, I’m not much good at the A + B plot thing, so I don’t tend to do it. If I have subplots, they tend to be less connected and a bit more character-driven, rather than about world-saving/murder-solving like the A plot. (I tend to half-jokingly call these C plots, where the C stands for “character” or “crying.”) Good characters usually write their own C plots - they have ulterior motives, hidden aspects, unexpected connections, and if you let them wander off they’ll make trouble for themselves. C plots are connected to the main plot, but unlike B plots, not a fundamental part of it. Sorry, screenwriters, for the terminology mangling.

Another trick to nick from Hollywood: the meet-cute. Sometimes you want someone to enter the narrative sneakily and unobtrusively, but often, especially with protagonists and love interests, never underestimate the power of a good, memorable character introduction. Audiences remember the ways they meet your characters, and the ways that characters meet eaxch other.

It’s not necessary for every story, but often it’s good to have a rock-bottom moment where everything looks hopeless. It reminds your audience viscerally of the stakes and penalties for failure, and it makes eventual victory even sweeter because it’s against the odds. Unless the light at the end of the tunnel is an oncoming train. In that case, rock on with your downer-ending self.

Often the best plot comes from character. (After all, Greek dramatists went on about this all the time with concepts like hubris and hamartia.) Even when nations clash, nations are run by flawed, corrupt people. Antagonists ought to have strong motivations unless you’re writing senseless violence/cruelty intentionally. So on. People often talk about the heart of drama being conflict, and some people, taking that to heart, write a war or their couple arguing. Yeah, that can work brilliantly, but there are other ways to do it, and conflict can be smaller-scale, too. It can be as simple as different aspects of the same character clashing; for instance, if they’re torn between love and duty (there’s a reason that one’s so popular), or the conflict between their past and present selves.

#writing #tru talks #tru rambles #long post #god i probably sound so bloody arrogant #but did i mention this is mostly me prodding myself and my stories?#not anyone else necessarily

31 notes · View notes

motherhoodapocalypse · 4 years ago

Text

Notes on Labor, Maternity, and the Institution

March 9, 2011 by Jaleh Mansoor

Pro labor activism will not begin to overcome the injustices and indignities it purports to redress until it addresses an irreducibly (for now) gendered form of labor: labor, as in, going into labor, giving birth (or adopting). While much recent discourse attempts to account for the industrial or “fordist” to post-industrial shift in forms of labor, patterns into which workers are set, employment, and unemployment (I am thinking of the Italian Autonomist Marxists and Virno, Negri and Hardt in particular), and while so many statistics tell us that more women are in the workforce than men (in the aftermath of the economic crisis of 2008 to the present), maternity is scotomized. Is this just another not-so-subtle form of gynophobia? A fear on the part of feminists of essentialism? A critique of the emphasis French Feminists of the 70s placed on maternity? An innocent oversight in recent iterations of Marxist analyses?

Artistic practices of the last decade highlight the remunerative system of a global service industry, one in which “art” takes its place fully embedded in–rather than at an interval of either autonomy or imminence–the fluid, continuous circulation of goods and services: Andrea Fraser’s Untitled (2002) in which Fraser had her gallery, Friedrich Petzel, arrange to have a collector purchase her sexual services for one night, Santiago Sierra’s 250 cm Line Tattooed on Six Paid People (1999) in which the artist paid six unemployed men in Old Havana, Cuba thirty dollars each to have a line tattooed across their back. Fraser’s work was characteristically “controversial” in the most rehearsed ways, and Sierra’s drew criticism for having permanently disfigured six human beings. The misprision and naivete of the critics spectacularized both, of course. Sierra’s retort involved a set of references to global economic conditions that the critics may not have liked to hear: “The tattoo is not the problem. The problem is the existence of social conditions that allow me to make this work. You could make this tattooed line a kilometer long, using thousands and thousands of willing people.”1 Both Fraser and Sierra point to the quasi-universality of what autonomist Marxist theorist Paolo Virno calls a “post-fordist” regime of “intellectual labor” to describe the shift from the assembly line to a wide range of labor in which traditional boundaries and borders no longer apply. Virno says, “By post-Fordism, I mean instead a set of characteristics that are related to the entire contemporary workforce, including fruit pickers and the poorest of immigrants.”2 This post-fordist regime is characterized by flexibility, deracination, and the shift from habituated work to contingency. Concomitantly, the post-fordist laborer does not take his or her place in the ranks of he masses, but flows into a multitude, differentiated by numerous factors, among them, post-coloniality, endless permutations at the level of gender, ethnicity, race.

For Virno and the autonomists, art and culture are no longer instantiations of exemplarity and exceptionality, as for Adorno, but rather “are the place in which praxis reflects on itself and results in self-representation.” In other words, the cultural work operates as a supplement, a parergonal addition to an already existing logic. It neither passively reflects nor openly resists. There is no vantage or “outside” from which art could dialectically reflect and resists, as Adorno would have it. Long since the work came off its pedestal and out of its frame, from the gallery to the street, the ostensibly non-site to the site as Robert Smithson put it, cultural production is too embedded in social and economic circulation to reflect let alone critique. Virno sees this limitation—the absence of an outside—as one shared with that of activism and other forms of tactical resistance: “The impasse that seizes the global movement comes from its inherent implication in the modes of production. Not from its estrangement or marginality, as some people think.”3 Ironically, the luxury of estrangement and marginalization enjoyed by the avant-garde and neo avant-garde is no longer available.And yet, it is “precisely because, rather than in spite, of this fact that it presents itself on the public scene as an ethical movement.”4 For if work puts life itself to work, dissolving boundaries between labor and leisure, rest and work, any action against it occupies the same fabric.

Among others, a problem that surfaces [too quietly and too politely, with a kind of ashamed and embarrassed reserve] is that of gender. The issue is not merely that Fraser puts her body at risk while Sierra remunerates others to place at risk, and in pain, their bodies, that corpus on which habeus corpus is founded. Needless to say, Sierra has organized projects around male prostitutes, such as that of 160 cm Line Tattooed on Four People, executed for the contemporary art museum in Salamanca, Spain, in 1999.

The problem is that the category of disembodied labor, or intellectual labor as Virno alternately calls it to describe its reliance on abstraction, scotomizes a form of irreducibly gendered embodied labor: labor. Now let the cries of essentialism! ring. Where is Julia Kristeva when you need her? Hélène Cixous telling us to allegorically write with our breast milk?5

Many feminist artists of the 1970s—in a historical moment that has both formed and been occluded by the artistic pratices of the last decade which I mention above–explicitly addressed the category of unremunerated labor: Martha Rosler’s Semiotics of the Kitchen (1973-4), for instance; Chantal Akerman’s Jeanne Dielman which explicitly draws an analogy between house-work and prostitution. Mary Kelly’s Post Partum Document (1979) elevates maternity to the level of analytical research, part of the putative archival impulse. Merle Laderman Ukeles tacitly situates domestic work in a category with the service industry understood historically, before all labor became maintenance labor, as “maintenance.”6 Ukeles’s differentiation of production and maintenance almost seems romantic in hindsight. As though there were creation/production rather than reproduction. And yet…..

Radical Marxist and feminist activist Silvia Federici, author of Genoa and the Anti Globalization Movement (2001) andPrecarious Labor: A Feminist Viewpoint (2008) argues against the gender neutrality of precarious labor theory, that of the Marxist autonomists Paolo Virno and Antonio Negri.7 Federici situates the commonality of rape and prostitution as well as violence against women within a systematized appropriation of female labor that operates as accumulation, much as accumulation did atavistically, long before the formation of commodity economies, or the development of general equivalence. Atavism as a repressed matrix for putative modernity—a modernity in which gender determination describes one of the greatest forms of uneven development—supports Ariella Azoulay’s claim, in The Civil Contract of Photography, that modernity did little to alter women’s positions in relation to discourse, the institution, and civil rights greater than the vote. Just as for Foucault the modern biopolitical regime compounds the old to achieve a more thorough penetration of everyday life, modernity permutes previous hegemonies “shaped and institutionalized over thousands of years.” In twentieth-century battles for the right to corporeal self-determination, to reproductive rights, for instance, “the body itself underwent a process of secularization, …this body came into the world without any of the normative defenses of citizenship to regulate it.”8 Under “Universal” rights, the contingencies of the body, deemed particular, did not become part of the discourse around citizenship, thus abandoning it to a renaturalized precariousness. Premised on a set of Enlightenment Universalist claims purportedly neutral to the particularities of corporeality, modernity failed to account for the specificities of women’s lives. Instead, the body, or “bare life” tacitly continues to be the way women are viewed, here commodified and sexually fetishized (neo-liberal “Western” democracies), there regulated within disciplinary, and often violent, parameters, as in Islamist cultures.9 These differences in hegemonic models of femininity may be theorized;10 the process of biological labor, however, slips the grasp of discourse, and, with it, policy. This last term would include international policies in which Enlightened self-interest are legitimated by the roles of women, of women’s bodies to be more precise.

Federici links her notion of atavistic forms of reserve—the accumulation of women’s labor—to colonial expropriation. She argues that the IMF, World Bank and other proxy institutions as engaging in a renewed cycle of primitive accumulation, by which everything held in common from water to seeds, to our genetic code become privatized in what amounts to a new round of enclosures.

Pop culture, as always a place where cultural articulations happen within normative parameters that may differ from “discourse,” presents the most direct expression of this that I have yet to come across. The high/low binary was a false product of fordism, one that no longer operates. When a famous male rapper says, “gonna get a child outta her,” he is speaking hegemony, not “marginalization.”

II.

Labor: If Virno is “correct,” in his analysis, there can be no “perspective” from which to think labor. From what fold within labor might I think it? I’ve worked as an hourly wage earner, a mother, and a salaried “professional.” One of these three terms is incongruous; discourse has hit a false note. My description of something about which I should know a great deal, my own history as a laborer, has already committed a rather egregious crime according to the law of discourse. As De Man has famously said, “abuse of language is, of course, itself the name of a trope: catachresis. …something monstrous lurks in the most innocent of catachreses: when one speaks of the legs of a table or the face of a mountain, catachresis is already turning into prosopopeia and one begins to perceive a world of potential ghosts and monsters.” What thwarted terms, or monsters, are barred from an account of my accounts? Discourse be damned, or in this case, personified; I am using “I.”

At 13, 22 years ago, I was what Siegfred Kracauer might have referred to as “a little shop girl,” working at a T shirt store for 3.75 an hour, selling 20 dollar Joy Division T-shirts and 5 dollar Grateful Dead stickers to other, older, teenagers [with allowances or their own jobs]. My mom had to accompany me to the first day to make good on PA labor laws. 7 hours of my labor/boredom would have bought me one of the T-shirts I sold. I’ve worked, like so many artists and academics, as a museum guard, 17 years ago, for 7/hr, or 10.50/hr for working past the 8-hour shift. Needless to say, none of these jobs had benefits. I’ve written articles for prominent scholarly journals where the pay may roughly be calculated at 3 cents/word, 1 percent of what a glossy magazine would pay for non-scholarly work. Let’s not get distracted by the amount of time that scholarship requires: travel; archives; dozens if not hundreds of books read; writing; and editing. But that “let’s not” is a sliding glass door of sorts: it articulates the injustice of unremunerated work, but it also stands as a reminder that the pleasure [and/or displeasure] of some work is irreducible to money, acts as an irreducible quality. But isn’t everything held in the matrix of currency [fiction]? All process, a term inclusive of work, skilled or unskilled, is irreducible to the monetary value assigned it. A bibliography supportive of that last statement alone would entail a foray into a discursive terrain bordered by Vico, Marx, Weber, The Frankfurt School, Foucault, Post Structuralism and practically every title in Verso, Stanford’s Crossing the Meridian and the University of Minnesota press, and the work of countless others. Irreducible labor. Or as Thomas Keenan has recently put it, the irreducible “jelly” of work that remains after the abstractions of exchange value is “accounted.”11

I’ve worked for 19 thousand a year as a gallery receptionist 14 years ago; for nothing, in monetary terms, writing a proto-book as a PhD candidate to produce a dissertation, partially about labor and art in reconstruction era Italy; for a stipend of 18 thousand per annum teaching college students courses that full [celebrity] professors were also teaching; for one glorious year at 55+ thousand a year as a “term” assistant professor at a prominent women’s college affiliated with an ivy league university; and some ten k (+) less a year as a tenure track assistant professor at a state institution. The latter ostensibly includes compensation for teaching Art History to undergraduates and studio practitioners, directing advises toward theirs MAs or MFAs, and coming to countless faculty meetings. I can retain that salaried position if I produce enough of those journal articles, at 3 cents a word, so let us include the latter, now that I HAVE a tenure track position, in that before-taxes salary. And I get benefits. I am by all [ac]counts VERY lucky and yet the contradictions in the remunerative system are too many to count. I am not compensated in any way—including in University evaluations and other assorted forms of self-regulative beaurocracy—for the 5 or so, sometimes more, hour (+)-long studio visits I conduct every week. An aside on the studio visit: it is by far more intense than an equal measure of time, the hour, of teaching, advising, or any other form of labor but one. And that latter, around which I skirt, is a term from which I steal to work. “Robbing peter to pay Paul.” Wait, I thought I was the one getting paid?

And I “speak” from a vantage of extreme privilege, of multiple privileges, of all privileges but one, to which I stand in a relation of excess and lack. That excess and lack revolves a particular embodied form of labor, a production that is a non productive labor unlike the non accumulative labor of which the autonomists speak…

The discursively impossible: I have given birth through the labor process to a child. “Let’s not,” in the interest of not getting caught in the sliding glass door, “count” pregnancy, or post pardum recovery or breast-feeding. Let’s try to isolate labor in order to attempt to, tautologically, quantify it, as the issue of labor conventionally requires us to do. That labor was 32 hours long. Not one of those 32 hours was commensurable with any other hour. Time contracted, not necessarily in rhythm with those of my womb (hystery in Greek), time dilated, not necessarily in tandem with my cervix. It was working parallel to me; no, those organs were working in tension against me. Dissonance. I have never been capable of thinking my body’s labor in what I will call, despite the need to shore it up by the labor of discursive legitimation, my experiential time. This time shrank and stretched like hot taffy. I would need the proper name “Deleuze” here, and The Logic of Sense, to get the discursive sanction I need to support this last claim. That would take a little labor, labor time I could punch in as academics will no doubt do some time soon, or rather do now however elliptically in requisite annual self reports. But those 32 child labor hours defy break down into 32 units of 60 minutes, 1920 units of 60 seconds, etc. This form of labor slips the grip of discourse; even metaphor.

Catachresis is not monstrous enough to operate as a medium for the articulation of this [non] event. There was, however, a quantifyable cost for the hospital ante-chamber, the delivery room, the “recovery” room, and the first examination of the infant. And there were more complex “costs;” I was “let go” of the second year of my position as a term assistant professor at a prominent women’s college associated with an ivy-league university. The Chair responsible for my firing, I mean, liberation, is a “feminist,” and a mother of two. She thought it would be “for the best,” for me to have time off. I never asked for time off. This did allow her to win a point or two for her annual docket; I was hired back on the adjunct salary of 3 thousand per class the next semester. This allowed the department to save 50 thousand dollars in 2007-2008, and the cost of benefits. Did I mention that the semester after giving birth, after having been “let go,” I still made it to campus to attend all advising sessions? 50K in savings that the institution no doubt never even registered, my loss. But who cares, I had a healthy beautiful bright baby!….. to love AND support. BTW, diapers are 20/box. Currently, I calculate that I make about 12 dollars and fifty cents an hour given that I work at least sixty hours a week. Ergo, a box of diapers is equal to over an hour and a half of work. I go through many of these per month still. At the time of being fired/demoted/whatever, I lived in NYC, where diapers cost more than 20/box. And I made, about 4.16 and hour. A box of diapers cost 5 hours of work. But like many women, and unlike many others, I had assistance, that of a partner and that of a parent. Let’s not address the emotional and psychological cost of the latter; let’s please not address the price dignity paid. Oops, prosopopeia. Does dignity have agency? I hope the reader knows by now that I find calculations to be absurd. “How do I love Thee, [dear child, dear student, dear reader,] Let me count the ways….” I am, however, serious in the following query: how do others less lucky than I make it in the global service industry (in which education and so called higher education now takes it place, now that Professors at State schools are classified as mid level managers?) How do women who have babies and work make it? They pay to work; they pay with their children. Sacrificial economies.

Now again, let’s not get caught in that door by even discussing the 24/7 labor of parenting. The pleasures of this last, and the agonies, are irreducible. But, again, isn’t everything? So: Suspended. Bracketed, a priori. A discursive delimitation or repression? It is in such poor taste to discuss this: bad form. Just a note, daycare is 10 thousand dollars per anum. A baby sitter charges 10-15 an hour. I over identify with the sitter and guiltily–as though I even had the luxury of being a fat cat liberal riddled with guilt–pay said sitter 20. But no worries: I don’t believe in baby-sitting. I have no life outside of the working and the parenting, no leisure. I mistrust the latter. I dislike being appeased. No compensatory blah blah for me. I do, however, want the hours taken away from my child by studio visits and the like to be remunerated HER. She keeps track of when I am missing. I can’t keep count. Guilty interstitial pleasure: Facebook, whom (uh oh) I can credit for the honor [snarkery free] of labor on the present piece.

III.

Like most institutions of its kind, the University at which I have a tenure track position, for which I am reminded to be eternally thankful—and I AM—does not have maternity leave. Were I to choose to have a second child (this statement requires an exegesis into the word “choice”), I would take sick-leave, as though giving-birth were an illness; as though [biological] labor were a subtraction from the forward march of time, of production and productivity, of progress. Sick-leave, time taken while ill ad ostensibly unproductive. Sick leave, the concept if not the necessary practice, is sick. More perverse still is the idea that populating the next generation, however selfish this may or may not be in many way, however narcissistic or not, is not a form of non-productivity. The double negative in this last should raise some flags in the space of textual analysis, labor analysis, gender analysis. An aside: I never felt less ill than during pregnancy, childbirth, and so called recovery. The use of the word biology will deliver the present text, again, to the accusation of essentialism. I will add that it goes without saying that maternity need not be biological. But it is still labor. A colleague recently adopted a child. Said colleague travelled to a distant continent to retrieve the child with whom she had spent a year establishing an intimate, if painfully digitally mediated, long term relationship. She took family medical (sick) leave. It, apparently, is against an ethics of work to be preoccupied with a new baby.

Moreover, were I to have a second child, my tenure clock would stop if I took that odiously named family/sick leave. My opportunity to make a case for my own worth via tenure review would be deferred. Of course, were we unionized, there may be a fighting chance, were our esteemed male colleagues to support us, for maternity leave, or, more unthinkably, paid maternity leave and no punitive tenure clock [beyond the normative punitive parameters]. “We” are our worst obstacle. As a prominent political science academic and feminist recently pointed out to me, one of the greatest obstacles to unionization or any form of collectivization, for artists and academics, is that they think of themselves as “professionals” and associate unions with blue color workers. Were they to peek around, they would note that these workers are practically extinct. We are all in an endless lateral plane of service. As one student told me, “my parents pay your salary,” to which I responded, “like the cleaning lady.” Note that there is no “liberal elitism” lurking here. We are all, to some extent, unless we work for JPMorgan Chase or some hedge fund, the cleaning lady (many nannies, like many cabbies, have a string of PhDs. My republican aunt once told me with delight that her cleaning lady had worked with my dissertation adviser when she, “the cleaning lady” was in grad school). Anyway, the student just nodded. I told him he should work to get his parents’ money’s worth.

Professors and academics like to think that they transcend as they were believed to do in a previous disciplinary socio-cultural regime. Jackson Pollock thought that too. He was an easy puppet in Cold War politics. Teaching undergrads in a core curriculum of an ivy league university that shores its superiority and identity around said core curriculum of old master literature, art and music—in other words, utterly dependent on a labor pool of graduate students—I participated in the effort to unionize. The threats were not subtle. The University’s counter argument was that students study; they don’t labor.

And women work, they don’t labor. There is no language.

1 Marc Spiegler. “When Human Beings are the Canvas.” Art News. June, 2003.

2 Interview with Paolo Virno. Branden W. Joseph, , Alessia Ricciardi trans. Grey Room No. 21 (Fall 2005): 26-37.

3 Ibid. P. 35.

4 Ibid.

5 The Laugh of Medusa.

6 For an excellent panoramic overview of these practices, see Helen Molesworth. “House Work and Art Work.” October No. 92 (Spring 2000).

7 Reprinted in Occupy Everything January 2011. http://occupyeverything.com/news/precarious-labor-a-feminist-viewpoint/

8 Ariella Azoulay. The Civil Contract Of Photography. New York: Zone Books, 2008. P. 226.

9 Ibid. For a discussion of the blind spot of sexuality and embodiment in Enlightenment thinking, see Jacques Lacan’s “seminal” “Kant with Sade.” Critique (April, 1963).

10 “Nothing, we are told by Western Hegemonic discourse, so differentiates “us” from “them” as the lack of freedom for women in Islamist societies. It needs to be noted, however, that far from silencing the power of women, Islamist regimes highlight it, acknowledging through severe and violent restrictions that what women do is crucial to political and social order. The argument justifying the strict codes of conduct, based on respect for women (in contrast to the Western commodification of women and their disparagement as sex objects), has a dialectical dynamic that can lead to its own undoing.” Susan Buck-Morss. Thinking Past Terror. P. 12. London: Verso, 2003. P. 12.

11 Thomas Keenan. “The Point is to (Ex) Change It: Reading ‘Capital’ Rhetorically.” Fables of Responsibility. Stanford: Stanford UP, 2007.

#labor #maternity #motherhood #jaleh mansoor

0 notes

tak4hir0 · 6 years ago

Link

Transformers from scratch Transformers are a very exciting family of machine learning architectures. Many good tutorials exist (e.g. [1, 2]) but in the last few years, transformers have mostly become simpler, so that it is now much more straightforward to explain how modern architectures work. This post is an attempt to explain directly how modern transformers work, and why, without some of the historical baggage. I will assume a basic understanding of neural networks and backpropagation. If you'd like to brush up, this lecture will give you the basics of neural networks and this one will explain how these principles are applied in modern deep learning systems. A working knowledge of Pytorch is required to understand the programming examples, but these can also be safely skipped. Self-attention The fundamental operation of any transformer architecture is the self-attention operation. We'll explain where the name "self-attention" comes from later. For now, don't read too much in to it. Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let's call the input vectors $\x_1, \x_2, \ldots \x_t$ and the corresponding output vectors $\y_1, \y_2, \ldots, \y_t$. The vectors all have dimension $k$. To produce output vector $\y_\rc{i}$, the self attention operation simply takes a weighted average over all the input vectors $$ \y_\rc{i} = \sum_{\gc{j}} w_{\rc{i}\gc{j}} \x_\gc{j} \p $$ Where the weights sum to one over all $\gc{j}$. The weight $w_{\rc{i}\gc{j}}$ is not a parameter, as in a normal neural net, but it is derived from a function over $\x_\rc{i}$ and $\x_\gc{j}$. The simplest option for this function is the dot product: $$ w'_{\rc{i}\gc{j}} = {\x_\rc{i}}^T\x_\gc{j} \p $$ This gives us a value anywhere between negative and positive infinity, so we apply a softmax to map the values to $[0, 1]$ and to ensure that they sum to 1 over the whole sequence: $$ w_{\rc{i}\gc{j}} = \frac{\text{exp } w'_{\rc{i}\gc{j}}}{\sum_\gc{j} \text{exp }w'_{\rc{i}\gc{j}}} \p $$ And that's the basic operation of self attention. A visual illustration of basic self-attention. Note that the softmax operation over the weights is not illustrated. A few other ingredients are needed for a complete transformer, which we'll discuss later, but this is the fundamental operation. More importantly, this is the only operation in the whole architecture that propagates information between vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors. Understanding why self-attention works Despite its simplicity, it's not immediately obvious why self-attention should work so well. To build up some intuition, let's look first at the standard approach to movie recommendation. Let's say you run a movie rental business and you have some movies, and some users, and you would like to recommend movies to your users that they are likely to enjoy. One way to go about this, is to create manual features for your movies, such as how much romance there is in the movie, and how much action, and then to design corresponding features for your users: how much they enjoy romantic movies and how much they enjoy action-based movies. If you did this, the dot product between the two feature vectors would give you a score for how well the attributes of the movie match what the user enjoys. If the signs of a feature match for the user and the movie—the movie is romantic and the user loves romance or the movie is unromantic and the user hates romance—then the the resulting dot product gets a positive term for that feature. If the signs don't match—the movie is romantic and the user hates romance or vice versa—the corresponding term is negative. Furthermore, the magnitudes of the features indicate how much the feature should contribute to the total score: a movie may be a little romantic, but not in a noticeable way, or a user may simply prefer no romance, but be largely ambivalent. Of course, gathering such features is not practical. Annotating a database of millions of movies is very costly, and annotating users with their likes and dislikes is pretty much impossible. What happens instead is that we make the movie features and user features parameters of the model. We then ask users for a small number of movies that they like and we optimize the user features and movie features so that their dot product matches the known likes. Even though we don't tell the model what any of the features should mean, in practice, it turns out that after training the features do actually reflect meaningful semantics about the movie content. The first two learned features from a basic matrix factorization model. The model had no access to any information about the content of the movies, only which users liked the. Note that movies are arranged from low-brow to high-brow horizontally, and from mainstream to quirky vertically. From [4]. See this lecture for more details on recommender systems. For now, this suffices as an explanation of how the dot product helps us to represent objects and their relations.This is the basic principle at work in the self-attention. Let's say we are faced with a sequence of words. To apply self-attention, we simply assign each word $\bc{t}$ in our vocabulary an embedding vector $\v_\bc{t}$ (the values of which we'll learn). This is what's known as an embedding layer in sequence modeling. It turns the word sequence $$\bc{\text{the}}, \bc{\text{cat}}, \bc{\text{walks}}, \bc{\text{on}}, \bc{\text{the}}, \bc{\text{street}}$$ into the vector sequence $$\v_\bc{\text{the}}, \v_\bc{\text{cat}}, \v_\bc{\text{walks}}, \v_\bc{\text{on}}, \v_\bc{\text{the}}, \v_\bc{\text{street}} \p $$ If we feed this sequence into a self-attention layer, the output is another sequence of vectors $$\y_\bc{\text{the}}, \y_\bc{\text{cat}}, \y_\bc{\text{walks}}, \y_\bc{\text{on}}, \y_\bc{\text{the}}, \y_\bc{\text{street}} $$ where $\y_\bc{\text{cat}}$ is a weighted sum over all the embedding vectors in the first sequence, weighted by their (normalized) dot-product with $\v_\bc{\text{cat}}$. Since we are learning what the values in $\v_\bc{t}$ should be, how "related" two words are is entirely determined by the task. In most cases, the definite article the is not very relevant to the interpretation of the other words in the sentence; therefore, we will likely end up with an embedding $\v_\bc{\text{the}}$ that has a low or negative dot product with all other words. On the other hand, to interpret what walks means in this sentence, it's very helpful to work out who is doing the walking. This is likely expressed by a noun, so for nouns like cat and verbs like walks, we will likely learn embeddings $\v_\bc{\text{cat}}$ and $\v_\bc{\text{walks}}$ that have a high, positive dot product together. This is the basic intuition behind self-attention. The dot product expresses how related two vectors in the input sequence are, with "related" defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products. Before we move on, it's worthwhile to note the following properties, which are unusual for a sequence-to-sequence operation: There are no parameters (yet). What the basic self-attention actually does is entirely determined by whatever mechanism creates the input sequence. Upstream mechanisms, like an embedding layer, drive the self-attention by learning representations with particular dot products (although we'll add a few parameters later). Self attention sees its input as a set, not a sequence. If we permute the input sequence, the output sequence will be exactly the same, except permuted also (i.e. self-attention is permutation equivariant). We will mitigate this somewhat when we build the full transformer, but the self-attention by itself actually ignores the sequential nature of the input. In Pytorch: basic self-attention What I cannot create, I do not understand, as Feynman said. So we'll build a simple transformer as we go along. We'll start by implementing this basic self-attention operation in Pytorch. The first thing we should do is work out how to express the self attention in matrix multiplications. A naive implementation that loops over all vectors to compute the weights and outputs would be much too slow. We'll represent the input, a sequence of $t$ vectors of dimension $k$ as a $t$ by $k$ matrix $\X$. Including a minibatch dimension $b$, gives us an input tensor of size $(b, t, k)$. The set of all raw dot products $w'_{\rc{i}\gc{j}}$ forms a matrix, which we can compute simply by multiplying $\X$ by its transpose: import torch import torch.nn.functional as F # assume we have some tensor x with size (b, t, k) x = ... raw_weights = torch.bmm(x, x.transpose(1, 2)) # - torch.bmm is a batched matrix multiplication. It # applies matrix multiplication over batches of # matrices. Then, to turn the raw weights $w'_{\rc{i}\gc{j}}$ into positive values that sum to one, we apply a row-wise softmax: weights = F.softmax(raw_weights, dim=2) Finally, to compute the output sequence, we just multiply the weight matrix by $\X$. This results in a batch of output matrices $\Y$ of size (b, t, e) whose rows are weighted sums over the rows of $\X$. y = torch.bmm(weights, x) That's all. Two matrix multiplications and one softmax gives us a basic self-attention. Additional tricks The actual self-attention used in modern transformers relies on three additional tricks. 1) Queries, keys and values Every input vector $\x_\rc{i}$ is used in three different ways in the self attention operation: It is compared to every other vector to establish the weights for its own output $\y_\rc{i}$ It is compared to every other vector to establish the weights for the output of the $\gc{j}$-th vector $\y_\gc{j}$ It is used as part of the weighted sum to compute each output vector once the weights have been established. These roles are often called the query, the key and the value (we'll explain where these names come from later). In the basic self-attention we've seen so far, each input vector must play all three roles. We make its life a little easier by deriving new vectors for each role, by applying a linear transformation to the original input vector. In other words, we add three $k \times k$ weight matrices $\W_q$, $\W_k$,$\W_v$ and compute three linear transformations of each $x_\rc{i}$, for the three different parts of the self attention: $$ \begin{align*} \q_\rc{i} &= \W_q\x_\rc{i} & \k_\rc{i} &= \W_k\x_\rc{i} & \v_\rc{i} &= \W_v\x_\rc{i} \end{align*} $$ $$ \begin{align*} w'_{\rc{i}\gc{j}} &= {\q_\rc{i}}^T\k_\gc{j} \\ w_{\rc{i}\gc{j}} &= \text{softmax}(w'_{\rc{i}\gc{j}})\\ \y_\rc{i} &= \sum_\gc{j} w_{\rc{i}\gc{j}} \v_\rc{i}\p\\ \end{align*} $$ This gives the self-attention layer some controllable parameters, and allows it to modify the incoming vectors to suit the three roles they must play. Illustration of the self-attention with key, query and value transformations.2) Scaling the dot product The softmax function can be sensitive to very large input values. These kill the gradient, and slow down learning, or cause it to stop altogether. Since the average value of the dot product grows with the embedding dimension $k$, it helps to scale the dot product back a little to stop the inputs to the softmax function from growing too large: $$ w'_{\rc{i}\gc{j}} = \frac^T\k_\gc{j}}{\sqrt{k}} $$ Why $\sqrt{k}$? Imagine a vector in ${\mathbb R^k}$ with values all $c$. Its Euclidean length is $\sqrt{k}c$. Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.3) Multi-head attention Finally, we must account for the fact that a word can mean different things to different neighbours. Consider the following example. $$\bc{\text{mary}}, \bc{\text{gave}}, \bc{\text{roses}}, \bc{\text{to}}, \bc{\text{susan}}$$ We see that the word gave has different relations to different parts of the sentence. mary expresses who's doing the giving, roses expresses what's being given, and susan expresses who the recipient is. In a single self-attention operation, all this information just gets summed together. If Susan gave Mary the roses instead, the output vector $\y_\bc{\text{gave}}$ would be the same, even though the meaning has changed. We can give the self attention greater power of discrimination, by combining several self attention mechanisms (which we'll index with $\bc{r}$), each with different matrices $\W_q^\bc{r}$, $\W_k^\bc{r}$,$\W_v^\bc{r}$. These are called attention heads. For input $\x_\rc{i}$ each attention head produces a different output vector $\y_\rc{i}^\bc{r}$. We concatenate these, and pass them through a linear transformation to reduce the dimension back to $k$. In Pytorch: complete self-attention Let's now implement a self-attention module with all the bells and whistles. We'll package it into a Pytorch module, so we can reuse it later. Combining three attention heads into one matrix multiplication (for the queries). import torch from torch import nn import torch.nn.functional as F class SelfAttention(nn.Module): def __init__(self, k, heads=8): super().__init__() self.k, self.heads = k, heads We think of the $h$ attention heads as $h$ separate sets of three matrices $\W^\bc{r}_q$, $\W^\bc{r}_k$,$\W^\bc{r}_v$, but it's actually more efficient to combine these for all heads into three single $k \times hk$ matrices, so that we can compute all the concatenated queries, keys and values in a single matrix multiplication. # These compute the queries, keys and values for all # heads (as a single concatenated vector) self.tokeys = nn.Linear(k, k * heads, bias=False) self.toqueries = nn.Linear(k, k * heads, bias=False) self.tovalues = nn.Linear(k, k * heads, bias=False) # This unifies the outputs of the different heads into # a single k-vector self.unifyheads = nn.Linear(heads * emb, emb) We can now implement the computation of the self-attention (the module's forward function). First, we compute the queries, keys and values: def forward(self, x): b, t, k = x.size() h = self.heads queries = self.toqueries(x).view(b, t, h, k) keys = self.tokeys(x) .view(b, t, h, k) values = self.tovalues(x) .view(b, t, h, k) The output of each linear module has size (b, t, h*k), which we simply reshape to (b, t, h, k) give each head its own dimension. Next, we need to compute the dot products. This is the same operation for every head, so we fold the heads into the batch dimension. This ensures that we can use torch.bmm() as before, and the whole collection of keys, queries and values will just be seen as a slightly larger batch. Since the head and batch dimension are not next to each other, we need to transpose before we reshape. (This is costly, but it seems to be unavoidable.) # - fold heads into the batch dimension keys = keys.transpose(1, 2).contiguous().view(b * h, t, k) queries = queries.transpose(1, 2).contiguous().view(b * h, t, k) values = values.transpose(1, 2).contiguous().view(b * h, t, k) As before, the dot products can be computed in a single matrix multiplication, but now between the queries and the keys. # - get dot product of queries and keys, and scale dot = torch.bmm(queries, keys.transpose(1, 2)) # - dot has size (b*h, t, t) containing raw weights dot = F.softmax(dot, dim=2) # - dot now contains row-wise normalized weights We apply the self attention to the values, giving us the output for each attention head # apply the self attention to the values out = torch.bmm(dot, values).view(b, h, t, e) To unify the attention heads, we transpose again, so that the head dimension and the embedding dimension are next to each other, and reshape to get concatenated vectors of dimension $kh$. We then pass these through the unifyheads layer to project them back down to $k$ dimensions. # swap h, t back, unify heads out = out.transpose(1, 2).contiguous().view(b, t, h * e) return self.unifyheads(out) And there you have it: multi-head, scaled dot-product self attention. You can see the complete implementation here. Building transformers A transformer is not just a self-attention layer, it is an architecture. It's not quite clear what does and doesn't qualify as a transformer, but here we'll use the following definition: Any architecture designed to process a connected set of units—such as the tokens in a sequence or the pixels in an image—where the only interaction between units is through self-attention. As with other mechanisms, like convolutions, a more or less standard approach has emerged for how to build self-attention layers up into a larger network. The first step is to wrap the self-attention into a block that we can repeat. The transformer block There are some variations on how to build a basic transformer block, but most of them are structured roughly like this: That is, the block applies, in sequence: a self attention layer, layer normalization, a feed forward layer (a single MLP applied independently to each vector), and another layer normalization. Residual connections are added around both, before the normalization. The order of the various components is not set in stone; the important thing is to combine self-attention with a local feedforward, and to add normalization and residual connections. Normalization and residual connections are standard tricks used to help deep neural networks train faster and more accurately. The layer normalization is applied over the embedding dimension only. Here's what the transformer block looks like in pytorch. class TransformerBlock(nn.Module): def __init__(self, k, heads): super().__init__() self.attention = SelfAttention(k, heads=heads) self.norm1 = nn.LayerNorm(k) self.norm2 = nn.LayerNorm(k) self.ff = nn.Sequential( nn.Linear(k, 4 * k), nn.ReLU(), nn.Linear(4 * k, k)) def forward(self, x): attended = self.attention(x) x = self.norm1(attended + x) fedforward = self.ff(x) return self.norm2(fedforward + x) We've made the relatively arbitrary choice of making the hidden layer of the feedforward 4 times as big as the input and output. Smaller values may work as well, and save memory, but it should be bigger than the input/output layers. Classification transformer The simplest transformer we can build is a sequence classifier. We'll use the IMDb sentiment classification dataset: the instances are movie reviews, tokenized into sequences of words, and the classification labels are positive and negative (indicating whether the review was positive or negative about the movie). The heart of the architecture will simply be a large chain of transformer blocks. All we need to do is work out how to feed it the input sequences, and how to transform the final output sequence into a a single classification. The whole experiment can be found here. We won't deal with the data wrangling in this blog post. Follow the links in the code to see how the data is loaded and prepared. Output: producing a classification The most common way to build a sequence classifier out of sequence-to-sequence layers, is to apply global average pooling to the final output sequence, and to map the result to a softmaxed class vector. Overview of a simple sequence classification transformer. The output sequence is averaged to produce a single vector representing the whole sequence. This vector is projected down to a vector with one element per class and softmaxed to produce probabilities. Input: using the positions We've already discussed the principle of an embedding layer. This is what we'll use to represent the words. However, as we've also mentioned already, we're stacking permutation equivariant layers, and the final global average pooling is permutation invariant, so the network as a whole is also permutation invariant. Put more simply: if we shuffle up the words in the sentence, we get the exact same classification, whatever weights we learn. Clearly, we want our state-of-the-art language model to have at least some sensitivity to word order, so this needs to be fixed. The solution is simple: we create a second vector of equal length, that represents the position of the word in the current sentence, and add this to the word embedding. There are two options. position embeddings We simply embed the positions like we did the words. Just like we created embedding vectors $\v_\bc{\text{cat}}$ and $\v_\bc{\text{susan}}$, we create embedding vectors $\v_\bc{\text{12}}$ and $\v_\bc{\text{25}}$. Up to however long we expect sequences to get. The drawback is that we have to see sequences of every length during training, otherwise the relevant position embeddings don't get trained. The benefit is that it works pretty well, and it's easy to implement. position encodings Position encodings work in the same way as embeddings, except that we don't learn the position vectors, we just choose some function $f: {\mathbb N} \to {\mathbb R}^k$ to map the positions to real valued vectors, and let the network figure out how to interpret these encodings. The benefit is that for a well chosen function, the network should be able to deal with sequences that are longer than those it's seen during training (it's unlikely to perform well on them, but at least we can check). The drawbacks are that the choice of encoding function is a complicated hyperparameter, and it complicates the implementation a little. For the sake of simplicity, we'll use position embeddings in our implementation. Pytorch Here is the complete text classification transformer in pytorch. class Transformer(nn.Module): def __init__(self, k, heads, depth, seq_length, num_tokens, num_classes): super().__init__() self.num_tokens = num_tokens self.token_emb = nn.Embedding(k, num_tokens) self.pos_emb = nn.Embedding(k, seq_length) # The sequence of transformer blocks that does all the # heavy lifting tblocks = [] for i in range(depth): tblocks.append(TransformerBlock(emb=emb, heads=heads)) self.tblocks = nn.Sequential(*tblocks) # Maps the final output sequence to class logits self.toprobs = nn.Linear(emb, num_classes) def forward(self, x): """ :param x: A (b, t) tensor of integer values representing words (in some predetermined vocabulary). :return: A (b, c) tensor of log-probabilities over the classes (where c is the nr. of classes). """ # generate token embeddings tokens = self.token_emb(x) b, t, e = tokens.size() # generate position embeddings positions = torch.arange(t) positions = self.pos_emb(positions)[None, :, :].expand(b, t, e) x = tokens + positions x = self.tblocks(x) # Average-pool over the t dimension and project to class # probabilities x = self.toprobs(x.mean(dim=1)) return F.log_softmax(x, dim=1) At depth 6, with a maximum sequence length of 512, this transformer achieves an accuracy of about 85%, competitive with results from RNN models, and much faster to train. To see the real near-human performance of transformers, we'd need to train a much deeper mode on much more data. More about how to do that later. Text generation transformer The next trick we'll try is an autoregressive model. We'll train a character level transformer to predict the next character in a sequence. The training regime is simple (and has been around for far longer than transformers have). We give the sequence-to-sequence model a sequence, and we ask it to predict the next character at each point in the sequence. In other words, the target output is the same sequence shifted one character to the left: With RNNs this is all we need to do, since they cannot look forward into the input sequence: output $i$ depends only on inputs $0$ to $i$. With a transformer, the output depends on the entire input sequence, so prediction of the next character becomes vacuously easy, just retrieve it from the input. To use self-attention as an autoregressive model, we'll need to ensure that it cannot look forward into the sequence. We do this by applying a mask to the matrix of dot products, before the softmax is applied. This mask disables all elements above the diagonal of the matrix. Masking the self attention, to ensure that elements can only attend to input elements that precede them in the sequence. Note that the multiplication symbol is slightly misleading: we actually set the masked out elements (the white squares) to $-\infty$ Since we want these elements to be zero after the softmax, we set them to $-\infty$. Here's how that looks in pytorch: indices = torch.triu_indices(k, k, offset=0) matrices[:, indices[0], indices[1]] = float('-inf') After we've handicapped the self-attention module like this, the model can no longer look forward in the sequence. We train on the standard enwik8 dataset (taken from the Hutter prize), which contains $10^8$ characters of Wikipedia text (including markup). During training, we generate batches by randomly sampling subsequences from the data. We train on sequences of length 256, using a model of 12 transformer blocks and 256 embedding dimension. After about 24 hours training on an RTX 2080Ti (some 170K batches of size 32), we let the model generate from a 256-character seed: for each character, we feed it the preceding 256 characters, and look what it predicts for the next character (the last output vector). We sample from that with a temperature of 0.5, and move to the next character. The output looks like this: 1228X Human & Rousseau. Because many of his stories were originally published in long-forgotten magazines and journals, there are a number of [[anthology|anthologies]] by different collators each containing a different selection. His original books have been considered an anthologie in the [[Middle Ages]], and were likely to be one of the most common in the [[Indian Ocean]] in the [[1st century]]. As a result of his death, the Bible was recognised as a counter-attack by the [[Gospel of Matthew]] (1177-1133), and the [[Saxony|Saxons]] of the [[Isle of Matthew]] (1100-1138), the third was a topic of the [[Saxony|Saxon]] throne, and the [[Roman Empire|Roman]] troops of [[Antiochia]] (1145-1148). The [[Roman Empire|Romans]] resigned in [[1148]] and [[1148]] began to collapse. The [[Saxony|Saxons]] of the [[Battle of Valasander]] reported the y Note that the Wikipedia link tag syntax is correctly used, that the text inside the links are represent reasonable subjects for links. Most importantly, note that there is a rough thematic consistency; the generated text keeps on the subject of the bible, and the Roman empire, using different related terms at different points. While this is far form the performance of a model like GPT-2, the benefits over a similar RNN model are clear already: faster training (a similar RNN model would take many days to train) and better long-term coherence. In case you're curious, the Battle of Valasander seems to be an invention of the network. At this point, the model achieves a compression of 1.343 bits per byte on the validation set, which is not too far off the state of the art of 0.93 bits per byte, achieved by the GPT-2 model (described below). Design considerations To understand why transformers are set up this way, it helps to understand the basic design considerations that went into them. The main point of the transformer was to overcome the problems of the previous state-of-the-art architecture, the RNN (usually an LSTM or a GRU). Unrolled, an RNN looks like this: The big weakness here is the recurrent connection. while this allows information to propagate along the sequence, it also means that we cannot compute the cell at time step $i$ until we've computed the cell at timestep $i - 1$. Contrast this with a 1D convolution: In this model, every output vector can be computed in parallel with every other output vector. This makes convolutions much faster. The drawback with convolutions, however, is that they're severely limited in modeling long range dependencies. In one convolution layer, only words that are closer together than the kernel size can interact with each other. For longer dependence we need to stack many convolutions. The transformer is an attempt to capture the best of both worlds. They can model dependencies over the whole range of the input sequence just as easily as they can for words that are next to each other (in fact, without the position vectors, they can't even tell the difference). And yet, there are no recurrent connections, so the whole model can be computed in a very efficient feedforward fashion. The rest of the design of the transformer is based primarily on one consideration: depth. Most choices follow from the desire to train big stacks of transformer blocks. Note for instance that there are only two places in the transformer where non-linearities occur: the softmax in the self-attention and the ReLU in the feedforward layer. The rest of the model is entirely composed of linear transformations, which perfectly preserve the gradient. I suppose the layer normalization is also nonlinear, but that is one nonlinearity that actually helps to keep the gradient stable as it propagates back down the network. Historical baggage If you've read other introductions to transformers, you may have noticed that they contain some bits I've skipped. I think these are not necessary to understand modern transformers. They are, however, helpful to understand some of the terminology and some of the writing about modern transformers. Here are the most important ones. Why is it called self-attention? Before self-attention was first presented, sequence models consisted mostly of recurrent networks or convolutions stacked together. At some point, it was discovered that these models could be helped by adding attention mechanisms: instead of feeding the output sequence of the previous layer directly to the input of the next, an intermediate mechanism was introduced, that decided which elements of the input were relevant for a particular word of the output. The general mechanism was as follows. We call the input the values. Some (trainable) mechanism assigns a key to each value. Then to each output, some other mechanism assigns a query. These names derive from the datastructure of a key-value store. In that case we expect only one item in our store to have a key that matches the query, which is returned when the query is executed. Attention is a softened version of this: every key in the store matches the query to some extent. All are returned, and we take a sum, weighted by the extent to which each key matches the query. The great breakthrough of self-attention was that attention by itself is a strong enough mechanism to do all the learning. Attention is all you need, as the authors put it. They key, query and value are all the same vectors (with minor linear transformations). They attend to themselves and stacking such self-attention provides sufficient nonlinearity and representational power to learn very complicated functions. The original transformer: encoders and decoders But the authors did not dispense with all the complexity of contemporary sequence modeling. The standard structure of sequence-to-sequence models in those days was an encoder-decoder architecture, with teacher forcing. The encoder takes the input sequence and maps it to a single latent vector representing the whole sequence. This vector is then passed to a decoder which unpacks it to the desired target sequence (for instance, the same sentence in another language). Teacher forcing refers to the technique of also allowing the the decoder access to the input sentence, but in an autoregressive fashion. That is, the decoder generates the output sentence word for word based both on the latent vector and the words it has already generated. This takes some of the pressure off the latent representation: the decoder can user word-for-word sampling to take care of the low-level structure like syntax and grammer and use the latent vector to capture more high-leve semantic structure. Decoding twice with the same latent vector would, ideally, give you two different sentences with the same meaning. In later transformers, like BERT and GPT-2, the encoder/decoder configuration was entirely dispensed with. A simple stack of transformer blocks was found to be sufficient to achieve state of the art in many sequence based tasks. This is sometimes called a decoder-only transformer (for an autoregressive model) or an encoder-only transformer (for a model without masking). Modern transformers Here's a small selection of some modern transformers and their most characteristic details. BERT was one of the first models to show that transformers could reach human-level performance on a variety of language based tasks: question answering, sentiment classification or classifying whether two sentences naturally follow one another. BERT consists of a simple stack of transformer blocks, of the type we've described above. This stack is pre-trained on a large general-domain corpus consisting of 800M words from English books (modern work, from unpublished authors), and 2.5B words of text from English Wikipedia articles (without markup). Pretraining is done through two tasks: MaskingA certain number of words in the input sequence are: masked out, replaced with a random word or kept as is. The model is then asked to predict, for these words, what the original words were. Note that the model doesn't need to predict the entire denoised sentence, just the modified words. Since the model doesn't know which words it will be asked about, it learns a representation for every word in the sequence. Next sequence classificationTwo sequences of about 256 words are sampled that either (a) follow each other directly in the corpus, or (b) are both taken from random places. The model must then predict whether a or b is the case. BERT uses WordPiece tokenization, which is somewhere in between word-level and character level sequences. It breaks words like walking up into the tokens walk and ##ing. This allows the model to make some inferences based on word structure: two verbs ending in -ing have similar grammatical functions, and two verbs starting with walk- have similar semantic function. The input is prepended with a special token. The output vector corresponding to this token is used as a sentence representation in sequence classification tasks like the next sentence classification (as opposed to the global average pooling over all vectors that we used in our classification model above). After pretraining, a single task-specific layer is placed after the body of transformer blocks, which maps the general purpose representation to a task specific output. For classification tasks, this simply maps the first output token to softmax probabilities over the classes. For more complex tasks, a final sequence-to-sequence layer is designed specifically for the task. The whole model is then re-trained to finetune the model for the specific task at hand. In an ablation experiment, the authors show that the largest improvement as compared to previous models comes from the bidirectional nature of BERT. That is, previous models like GPT used an autoregressive mask, which allowed attention only over previous tokens. The fact that in BERT all attention is over the whole sequence is the main cause of the improved performance. This is why the B in BERT stands for "bidirectional". The largest BERT model uses 24 transformer blocks, an embedding dimension of 1024 and 16 attention heads, resulting in 340M parameters. GPT-2 is the first transformer model that actually made it into the mainstream news, after the controversial decision by OpenAI not to release the full model. The reason was that GPT-2 could generate sufficiently believable text that large-scale fake news campaigns of the kind seen in the 2016 US presidential election would become effectively a one-person job.The first trick that the authors of GPT-2 employed was to create a new high-quality dataset. While BERT used high-quality data (lovingly crafted books and well-edited wikipedia articles) this creates a certain lack of diversity in the writing style. To collect more diverse data without sacrificing quality the authors used the social media site Reddit to find a large collection of writing with a certain minmum level of social support (expressed on Reddit as karma). GPT2 is fundamentally a language generation model, so it uses masked self-attention like we did in our model above. It uses byte-pair encoding to tokenize the language, which , like the WordPiece encoding breaks words up into tokens that are slightly larger than single characters but less than entire words. GPT2 is built very much like our text generation model above, with only small differences in layer order and added tricks to train at greater depths. The largest model uses 48 transformer blocks, a sequence length of 1024 and an embedding dimension of 1600, resulting in 1.5B parameters. They show state-of-the art performance on many tasks. On the wikipedia compression task that we tried above, they achieve 0.93 bits per byte. While the transformer represents a massive leap forward in modeling long-range dependency, the models we have seen so far are still fundamentally limited by the size of the input. Since the size of the dot-product matrix grows quadratically in the sequence length, this quickly becomes the bottleneck as we try to extend the length of the input sequence. Transformer-XL is one of the first succesful transformer models to tackle this problem. During training, a long sequence of text (longer than the model could deal with) is broken up into shorter segments. Each segment is processed in sequence, with self-attention computed over the tokens in the curent segment and the previous segment. Gradients are only computed over the current segment, but information still propagates as the segment window moves through the text. In theory at layer $n$, information may be used from $n$ segments ago. A similar trick in RNN training is called truncated backpropagation through time. We feed the model a very long sequence, but backpropagate only over part of it. The first part of the sequence, for which no gradients are computed, still influences the values of the hidden states in the part for which they are. To make this work, the authors had to let go of the standard position encoding/embedding scheme. Since the position encoding is absolute, it would change for each segment and not lead to a consistent embedding over the whole sequence. Instead they use a relative encoding. For each output vector, a different sequence of position vectors is used that denotes not the absolute position, but the distance to the current output. This requires moving the position encoding into the attention mechanism (which is detailed in the paper). One benefit is that the resulting transformer will likely generalize much better to sequences of unseen length. Sparse transformers tackle the problem of quadratic memory use head-on. Instead of computing a dense matrix of attention weights (which grows quadratically), they compute the self-attention only for particular pairs of input tokens, resulting in a sparse attention matrix, with only $n\sqrt{n}$ explicit elements. This allows models with very large context sizes, for instance for generative modeling over images, with large dependencies between pixels. The tradeoff is that the sparsity structure is not learned, so by the choice of sparse matrix, we are disabling some interactions between input tokens that might otherwise have been useful. However, two units that are not directly related may still interact in higher layers of the transformer (similar to the was a convolutional net builds up a larger receptive field with more convolutional layers). Beyond the simple benefit of training transformers with very large sequence lengths, the sparse transformer also allows a very elegant way of designing an inductive bias. We take our input as a collection of units (words, characters, pixels in an image, nodes in a graph) and we specify, through the sparsity of the attention matrix, which units we believe to be related. The rest is just a matter of building the transformer up as deep as it will go and seeing if it trains. Going big The big bottleneck in training transformers is the matrix of dot products in the self attention. For a sequence length $t$, this is a dense matrix containing $t^2$ elements. At standard 32-bit precision, and with $t=1000$ a batch of 16 such matrices takes up about 250Mb of memory. Since we need at least four of them per self attention operation (before and after softmax, plus their gradients), that limits us to at most twelve layers in a standard 12Gb GPU. In practice, we get even less, since the inputs and outputs also take up a lot of memory (although the dot product dominates). And yet models reported in the literature contain sequence lengths of over 12000, with 48 layers, using dense dot product matrices. These models are trained on clusters, of course, but a single GPU is still required to do a single forward/backward pass. How do we fit such humongous transformers into 12Gb of memory? There are three main tricks: Half precisionOn modern GPUs and on TPUs, tensor computations can be done efficiently on 16-bit float tensors. This isn't quite as simple as just setting the dtype of the tensor to torch.float16. For some parts of the network, like the loss, 32 bit precision is required. But most of this can be handled with relative ease by existing libraries. Practically, this doubles your effective memory. Gradient accumulationFor a large model, we may only be able to perform a forward/backward pass on a single instance. Batch size 1 is not likely to lead to stable learning. Luckily, we can perform a single forward/backward for each instance in a larger batch, and simply sum the gradients we find (this is a consequence of the multivariate chain rule). When we hit the end of the batch, we do a single step of gradient descent, and zero out the gradient. In Pytorch this is particulary easy: you know that optimizer.zero_grad() call in your training loop that seems so superfluous? If you don't make that call, the new gradients are simply added to the old ones. Gradient checkpointingIf your model is so big that even a single forward/backward won't fit in memory, you can trade off even more computation for memory efficiency. In gradient checkpointing, you separate your model into sections. For each section, you do a separate forward/backward to compute the gradients, without retaining the the intermediate values for the rest. Pytorch has special utilities for gradient checkpointng. For more information on how to do this, see this blogpost. Conclusion The transformer may well be the simplest machine learning architecture to dominate the field in decades. There are good reasons to start paying attention to them if you haven't been already. Firstly, the current performance limit is purely in the hardware. Unlike convolutions or LSTMs the current limitations to what they can do are entirely determined by how big a model we can fit in GPU memory and how much data we can push through it in a reaosnable amount of time. I have no doubt, we will eventually hit the point where more layers and and more data won't help anymore, but we don't seem to have reached that point yet. Second, transformers are extremely generic. So far, the big successes have been in language modelling, with some more modest achievements in image and music analysis, but the transformer has a level of generality that is waiting to be exploited. The basic transformer is a set-to-set model. So long as your data is a set of units, you can apply a transformer. Anything else you know about your data (like local structure) you can add by means of position embeddings, or by manipulating the structure of the attention matrix (making it sparse, or masking out parts). This is particularly useful in multi-modal learning. We could easily combine a captioned image into a set of pixels and characters and design some clever embeddings and sparsity structure to help the model figure out how to combine and align the two. If we combine the entirety of our knowledge about our domain into a relational structure like a multi-modal knowledge graph (as discussed in [3]), simple transformer blocks could be employed to propagate information between multimodal units, and to align them with the sparsity structure providing control over which units directly interact. So far, transformers are still primarily seen as a language model. I expect that in time, we'll see them adopted much more in other domains, not just to increase performance, but to simplify existing models, and to allow practitioners more intuitive control over their models' inductive biases. References [1] The illustrated transformer, Jay Allamar. [2] The annotated transformer, Alexander Rush. [3] The knowledge graph as the default data model for learning on heterogeneous knowledge Xander Wilcke, Peter Bloem, Victor de Boer [4] Matrix factorization techniques for recommender systems Yehuda Koren et al.

0 notes