#machine language learning
lingthusiasm · 8 days ago
Text
Lingthusiasm Episode 98: Helping computers decode sentences - Interview with Emily M. Bender
When humans learn a new word, we're learning to attach that word to a set of concepts in the real world. When a computer "learns" a new word, it's creating associations between that word and other words it has seen before, which can sometimes give it the appearance of understanding. But it doesn't have that real-world grounding, which can lead to spectacular failures: hilariously implausible from a human perspective, but just as plausible as anything else from the computer's.
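For a concrete, if toy, picture of what that word-to-word association looks like, here's a minimal Python sketch (not from the episode; the tiny corpus and window size are invented for illustration), where a word's "meaning" is nothing more than counts of the words that appear near it:

from collections import Counter, defaultdict

# Toy corpus: the model only ever sees which words occur near which other words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
window = 2  # count neighbours up to two positions away on each side

cooccurrence = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooccurrence[word][corpus[j]] += 1

# "cat" and "dog" end up with nearly identical neighbour counts ("the", "sat", "on"),
# so they look related -- with no link to any real-world cats or dogs.
print(cooccurrence["cat"].most_common(4))
print(cooccurrence["dog"].most_common(4))

In a setup like this, "cat" and "dog" come out looking similar because they keep the same company, even though nothing in the system knows what a cat or a dog is, which is exactly the missing grounding discussed in the episode.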
In this episode, your host Lauren Gawne gets enthusiastic about how computers process language with Dr. Emily M. Bender, who is a linguistics professor at the University of Washington, USA, and co-host of the podcast Mystery AI Hype Theater 3000. We talk about Emily's work trying to formulate a list of rules that a computer can use to generate grammatical sentences in a language, the differences between that and training a computer to generate sentences using the statistical likelihood of what comes next based on all the sentences it has already seen, and the further differences between both of those things and how humans map language onto the real world. We also talk about paying attention to communities, not just data, the labour practices behind large language models, and how Emily's persistent questions led to the creation of the Bender Rule (always state the language you're working on, even if it's English).
Click here for a link to this episode in your podcast player of choice or read the transcript here.
Announcements: The 2024 Lingthusiasm Listener Survey is here! It’s a mix of questions about who you are as our listener, as well as some fun linguistics experiments for you to participate in. If you have taken the survey in previous years, there are new questions, so you can participate again this year.
In this month’s bonus episode we get enthusiastic about three places where we can learn things about linguistics!! We talk about two linguistically interesting museums that Gretchen recently visited: the Estonian National Museum, as well as Mundolingua, a general linguistics museum in Paris. We also talk about Lauren's dream linguistics travel destination: Martha's Vineyard.
Join us on Patreon now to get access to this and 90+ other bonus episodes. You’ll also get access to the Lingthusiasm Discord server where you can chat with other language nerds.
Also, Patreon now has gift memberships! If you'd like to get a gift subscription to Lingthusiasm bonus episodes for someone you know, or if you want to suggest them as a gift for yourself, here's how to gift a membership.
Here are the links mentioned in the episode:
Emily Bender
Emily Bender on Bluesky and Twitter
Mystery AI Hype Theater 3000
Mystery AI Hype Theater 3000: The Newsletter
The AI Con by Emily M. Bender and Alex Hanna
'Data Sovereignty and the Kaitiakitanga License' on Te Hiku
wordfreq by Robyn Speer on GitHub
Lingthusiasm Episode ‘Making machines learn language - Interview with Janelle Shane’
Bonus with Janelle Shane: we do a dramatic reading of the funniest auto-generated Lingthusiasm episodes
You can listen to this episode via Lingthusiasm.com, Soundcloud, RSS, Apple Podcasts/iTunes, Spotify, YouTube, or wherever you get your podcasts. You can also download an mp3 via the Soundcloud page for offline listening.
To receive an email whenever a new episode drops, sign up for the Lingthusiasm mailing list.
You can help keep Lingthusiasm ad-free, get access to bonus content, and more perks by supporting us on Patreon.
Lingthusiasm is on Bluesky, Twitter, Instagram, Facebook, Mastodon, and Tumblr. Email us at contact [at] lingthusiasm [dot] com
Gretchen is on Bluesky as @GretchenMcC and blogs at All Things Linguistic.
Lauren is on Bluesky as @superlinguo and blogs at Superlinguo.
Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our production editor is Sarah Dopierala, our production assistant is Martha Tsutsui Billins, and our editorial assistant is Jon Kruk. Our music is ‘Ancient City’ by The Triangles.
This episode of Lingthusiasm is made available under a Creative Commons Attribution Non-Commercial Share Alike license (CC BY-NC-SA 4.0).
19 notes · View notes
river-taxbird · 3 months ago
Text
AI hasn't improved in 18 months. It's likely that this is it. There is currently no evidence the capabilities of ChatGPT will ever improve. It's time for AI companies to put up or shut up.
I'm just reiterating this excellent post from Ed Zitron, but it's not left my head since I read it and I want to share it. I'm also taking some talking points from Ed's other posts. So basically:
We keep hearing AI is going to get better and better, but these promises seem to be coming from a mix of companies engaging in wild speculation and lying.
ChatGPT, the industry-leading large language model, has not materially improved in 18 months. For something that's supposedly getting exponentially better, it sure is the same shit.
Hallucinations appear to be an inherent aspect of the technology. Since it's based on statistics and AI doesn't know anything, it can never know what is true. How could I possibly trust it to get any real work done if I can't rely on its output? If I have to fact-check everything it says, I might as well do the work myself.
For "real" AI that does know what is true to exist, it would require us to discover new concepts in psychology, math, and computing, which OpenAI is not working on, and seemingly no other AI companies are either.
OpenAI has seemingly already slurped up all the data from the open web. ChatGPT 5 would take 5x more training data than ChatGPT 4 to train. Where is this data coming from, exactly?
Since improvement appears to have ground to a halt, what if this is it? What if ChatGPT 4 is as good as LLMs can ever be? What use is it?
As Jim Covello, a leading semiconductor analyst at Goldman Sachs, said (on page 10, and that's big finance, so you know they only care about money): if tech companies are spending a trillion dollars to build up the infrastructure to support AI, what trillion-dollar problem is it meant to solve? AI companies have a unique talent for burning venture capital, and it's unclear if OpenAI will be able to survive more than a few years unless everyone suddenly adopts it all at once. (Hey, didn't crypto and the metaverse also require spontaneous mass adoption to make sense?)
There is no problem that current AI is a solution to. Consumer tech is basically solved: normal people don't need more tech than a laptop and a smartphone. Big tech has run out of innovations, and they're desperately looking for the next thing to sell. It happened with the metaverse, and it's happening again.
In summary:
AI hasn't materially improved since the launch of ChatGPT 4, which wasn't that big of an upgrade over 3.
There is currently no technological roadmap for AI to become better than it is. (As Jim Covello said in the Goldman Sachs report, the evolution of smartphones was openly planned years ahead of time.) The current problems are inherent to the current technology, and nobody has indicated that there is any way to solve them in the pipeline. We have likely reached the limits of what LLMs can do, and they still can't do much.
Don't believe AI companies when they say things are going to improve from where they are now before they provide evidence. It's time for the AI shills to put up or shut up.
5K notes · View notes
reasonsforhope · 22 days ago
Text
"As a Deaf man, Adam Munder has long been advocating for communication rights in a world that chiefly caters to hearing people. 
The Intel software engineer and his wife — who is also Deaf — are often unable to use American Sign Language in daily interactions, instead defaulting to texting on a smartphone or passing a pen and paper back and forth with service workers, teachers, and lawyers. 
It can make simple tasks, like ordering coffee, more complicated than it should be. 
But there are life events that hold greater weight than a cup of coffee. 
Recently, Munder and his wife took their daughter in for a doctor’s appointment — and no interpreter was available. 
To their surprise, their doctor said: “It’s alright, we’ll just have your daughter interpret for you!” ...
That day at the doctor’s office came at the heels of a thousand frustrating interactions and miscommunications — and Munder is not isolated in his experience.
“Where I live in Arizona, there are more than 1.1 million individuals with a hearing loss,” Munder said, “and only about 400 licensed interpreters.”
In addition to being hard to find, interpreters are expensive. And texting and writing aren’t always practical options — they leave out the emotion, detail, and nuance of a spoken conversation. 
ASL is a rich, complex language with its own grammar and culture; a subtle change in speed, direction, facial expression, or gesture can completely change the meaning and tone of a sign. 
“Writing back and forth on paper and pen or using a smartphone to text is not equivalent to American Sign Language,” Munder emphasized. “The details and nuance that make us human are lost in both our personal and business conversations.”
His solution? An AI-powered platform called Omnibridge. 
“My team has established this bridge between the Deaf world and the hearing world, bringing these worlds together without forcing one to adapt to the other,” Munder said. 
Trained on thousands of signs, Omnibridge is engineered to transcribe spoken English and interpret sign language on screen in seconds...
“Our dream is that the technology will be available to everyone, everywhere,” Munder said. “I feel like three to four years from now, we're going to have an app on a phone. Our team has already started working on a cloud-based product, and we're hoping that will be an easy switch from cloud to mobile to an app.” ...
At its heart, Omnibridge is a testament to the positive capabilities of artificial intelligence. "
-via GoodGoodGood, October 25, 2024. More info below the cut!
To test an alpha version of his invention, Munder welcomed TED associate Hasiba Haq on stage. 
“I want to show you how this could have changed my interaction at the doctor appointment, had this been available,” Munder said. 
He went on to explain that the software would generate a bi-directional conversation, in which Munder’s signs would appear as blue text and spoken word would appear in gray. 
At first, there was a brief hiccup on the TED stage. Haq, who was standing in as the doctor’s office receptionist, spoke — but the screen remained blank. 
“I don’t believe this; this is the first time that AI has ever failed,” Munder joked, getting a big laugh from the crowd. “Thanks for your patience.”
After a quick reboot, they rolled with the punches and tried again.
Haq asked: “Hi, how’s it going?” 
Her words popped up in blue. 
Munder signed in reply: “I am good.” 
His response popped up in gray. 
Back and forth, they recreated the scene from the doctor’s office. But this time Munder retained his autonomy, and no one suggested a 7-year-old should play interpreter. 
Munder’s TED debut and tech demonstration didn’t happen overnight — the engineer has been working on Omnibridge for over a decade. 
“It takes a lot to build something like this,” Munder told Good Good Good in an exclusive interview, communicating with our team in ASL. “It couldn't just be one or two people. It takes a large team, a lot of resources, millions and millions of dollars to work on a project like this.” 
After five years of pitching and research, Intel handpicked Munder’s team for a specialty training program. It was through that backing that Omnibridge began to truly take shape...
“Our dream is that the technology will be available to everyone, everywhere,” Munder said. “I feel like three to four years from now, we're going to have an app on a phone. Our team has already started working on a cloud-based product, and we're hoping that will be an easy switch from cloud to mobile to an app.” 
In order to achieve that dream — of transposing their technology to a smartphone — Munder and his team have to play a bit of a waiting game. Today, their platform necessitates building the technology on a PC, with an AI engine. 
“A lot of things don't have those AI PC types of chips,” Munder explained. “But as the technology evolves, we expect that smartphones will start to include AI engines. They'll start to include the capability in processing within smartphones. It will take time for the technology to catch up to it, and it probably won't need the power that we're requiring right now on a PC.” 
At its heart, Omnibridge is a testament to the positive capabilities of artificial intelligence. 
But it is more than a transcription service — it allows people to have face-to-face conversations with each other. There’s a world of difference between passing around a phone or pen and paper and looking someone in the eyes when you speak to them. 
It also allows Deaf people to speak ASL directly, without doing the mental gymnastics of translating their words into English.
“For me, English is my second language,” Munder told Good Good Good. “So when I write in English, I have to think: How am I going to adjust the words? How am I going to write it just right so somebody can understand me? It takes me some time and effort, and it's hard for me to express myself actually in doing that. This technology allows someone to be able to express themselves in their native language.” 
Ultimately, Munder said that Omnibridge is about “bringing humanity back” to these conversations. 
“We’re changing the world through the power of AI, not just revolutionizing technology, but enhancing that human connection,” Munder said at the end of his TED Talk. 
“It’s two languages,” he concluded, “signed and spoken, in one seamless conversation.”"
-via GoodGoodGood, October 25, 2024
452 notes · View notes
mostlysignssomeportents · 1 year ago
Text
How plausible sentence generators are changing the bullshit wars
This Friday (September 8) at 10hPT/17hUK, I'm livestreaming "How To Dismantle the Internet" with Intelligence Squared.
On September 12 at 7pm, I'll be at Toronto's Another Story Bookshop with my new book The Internet Con: How to Seize the Means of Computation.
In my latest Locus Magazine column, "Plausible Sentence Generators," I describe how I unwittingly came to use – and even be impressed by – an AI chatbot – and what this means for a specialized, highly salient form of writing, namely, "bullshit":
https://locusmag.com/2023/09/commentary-by-cory-doctorow-plausible-sentence-generators/
Here's what happened: I got stranded at JFK due to heavy weather and an air-traffic control tower fire that locked down every westbound flight on the east coast. The American Airlines agent told me to try going standby the next morning, and advised that if I booked a hotel and saved my taxi receipts, I would get reimbursed when I got home to LA.
But when I got home, the airline's reps told me they would absolutely not reimburse me, that this was their policy, and they didn't care that their representative had promised they'd make me whole. This was so frustrating that I decided to take the airline to small claims court: I'm no lawyer, but I know that a contract takes place when an offer is made and accepted, and so I had a contract, and AA was violating it, and stiffing me for over $400.
The problem was that I didn't know anything about filing a small claim. I've been ripped off by lots of large American businesses, but none had pissed me off enough to sue – until American broke its contract with me.
So I googled it. I found a website that gave step-by-step instructions, starting with sending a "final demand" letter to the airline's business office. They offered to help me write the letter, and so I clicked and I typed and I wrote a pretty stern legal letter.
Now, I'm not a lawyer, but I have worked for a campaigning law-firm for over 20 years, and I've spent the same amount of time writing about the sins of the rich and powerful. I've seen a lot of threats, both those received by our clients and sent to me.
I've been threatened by everyone from Gwyneth Paltrow to Ralph Lauren to the Sacklers. I've been threatened by lawyers representing the billionaire who owned NSO Group, the notorious cyber arms-dealer. I even got a series of vicious, baseless threats from lawyers representing LAX's private terminal.
So I know a thing or two about writing a legal threat! I gave it a good effort and then submitted the form, and got a message asking me to wait for a minute or two. A couple minutes later, the form returned a new version of my letter, expanded and augmented. Now, my letter was a little scary – but this version was bowel-looseningly terrifying.
I had unwittingly used a chatbot. The website had fed my letter to a Large Language Model, likely ChatGPT, with a prompt like, "Make this into an aggressive, bullying legal threat." The chatbot obliged.
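Mechanically, there isn't much to a site like that: it wraps the visitor's text in an instruction and forwards it to a hosted model. Here's a rough sketch of the pattern in Python; call_llm is a stand-in for whatever chatbot API the website actually used, and the prompt wording is guessed, not known:

def call_llm(prompt: str) -> str:
    # Stand-in for an HTTP call to some hosted large language model;
    # the real site's provider, model, and parameters are unknown.
    raise NotImplementedError

def escalate_letter(user_letter: str) -> str:
    # Wrap the visitor's draft in an instruction and let the model rewrite it.
    prompt = (
        "Rewrite the following demand letter as an aggressive, intimidating "
        "legal threat, keeping all of the facts unchanged:\n\n"
        + user_letter
    )
    return call_llm(prompt)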
I don't think much of LLMs. After you get past the initial party trick of getting something like, "instructions for removing a grilled-cheese sandwich from a VCR in the style of the King James Bible," the novelty wears thin:
https://www.emergentmind.com/posts/write-a-biblical-verse-in-the-style-of-the-king-james
Yes, science fiction magazines are inundated with LLM-written short stories, but the problem there isn't merely the overwhelming quantity of machine-generated stories – it's also that they suck. They're bad stories:
https://www.npr.org/2023/02/24/1159286436/ai-chatbot-chatgpt-magazine-clarkesworld-artificial-intelligence
LLMs generate naturalistic prose. This is an impressive technical feat, and the details are genuinely fascinating. This series by Ben Levinstein is a must-read peek under the hood:
https://benlevinstein.substack.com/p/how-to-think-about-large-language
But "naturalistic prose" isn't necessarily good prose. A lot of naturalistic language is awful. In particular, legal documents are fucking terrible. Lawyers affect a stilted, stylized language that is both officious and obfuscated.
The LLM I accidentally used to rewrite my legal threat transmuted my own prose into something that reads like it was written by a $600/hour paralegal working for a $1500/hour partner at a white-shoe law firm. As such, it sends a signal: "The person who commissioned this letter is so angry at you that they are willing to spend $600 to get you to cough up the $400 you owe them. Moreover, they are so well-resourced that they can afford to pursue this claim beyond any rational economic basis."
Let's be clear here: these kinds of lawyer letters aren't good writing; they're a highly specific form of bad writing. The point of this letter isn't to parse the text, it's to send a signal. If the letter was well-written, it wouldn't send the right signal. For the letter to work, it has to read like it was written by someone whose prose-sense was irreparably damaged by a legal education.
Here's the thing: the fact that an LLM can manufacture this once-expensive signal for free means that the signal's meaning will shortly change, forever. Once companies realize that this kind of letter can be generated on demand, it will cease to mean, "You are dealing with a furious, vindictive rich person." It will come to mean, "You are dealing with someone who knows how to type 'generate legal threat' into a search box."
Legal threat letters are in a class of language formally called "bullshit":
https://press.princeton.edu/books/hardcover/9780691122946/on-bullshit
LLMs may not be good at generating science fiction short stories, but they're excellent at generating bullshit. For example, a university prof friend of mine admits that they and all their colleagues are now writing grad student recommendation letters by feeding a few bullet points to an LLM, which inflates them with bullshit, adding puffery to swell those bullet points into lengthy paragraphs.
Naturally, the next stage is that profs on the receiving end of these recommendation letters will ask another LLM to summarize them by reducing them to a few bullet points. This is next-level bullshit: a few easily-grasped points are turned into a florid sheet of nonsense, which is then reconverted into a few bullet-points again, though these may only be tangentially related to the original.
What comes next? The reference letter becomes a useless signal. It goes from being a thing that a prof has to really believe in you to produce, whose mere existence is thus significant, to a thing that can be produced with the click of a button, and then it signifies nothing.
We've been through this before. It used to be that sending a letter to your legislative representative meant a lot. Then, automated internet forms produced by activists like me made it far easier to send those letters and lawmakers stopped taking them so seriously. So we created automatic dialers to let you phone your lawmakers, this being another once-powerful signal. Lowering the cost of making the phone call inevitably made the phone call mean less.
Today, we are in a war over signals. The actors and writers who've trudged through the heat-dome up and down the sidewalks in front of the studios in my neighborhood are sending a very powerful signal. The fact that they're fighting to prevent their industry from being enshittified by plausible sentence generators that can produce bullshit on demand makes their fight especially important.
Chatbots are the nuclear weapons of the bullshit wars. Want to generate 2,000 words of nonsense about "the first time I ate an egg," to run overtop of an omelet recipe you're hoping to make the number one Google result? ChatGPT has you covered. Want to generate fake complaints or fake positive reviews? The Stochastic Parrot will produce 'em all day long.
As I wrote for Locus: "None of this prose is good, none of it is really socially useful, but there’s demand for it. Ironically, the more bullshit there is, the more bullshit filters there are, and this requires still more bullshit to overcome it."
Meanwhile, AA still hasn't answered my letter, and to be honest, I'm so sick of bullshit I can't be bothered to sue them anymore. I suppose that's what they were counting on.
If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
https://pluralistic.net/2023/09/07/govern-yourself-accordingly/#robolawyers
Image: Cryteria (modified) https://commons.wikimedia.org/wiki/File:HAL9000.svg
CC BY 3.0
https://creativecommons.org/licenses/by/3.0/deed.en
2K notes · View notes
prokopetz · 9 months ago
Text
I suppose the thin silver lining to the discoverability of online resources going to shit because of SEO exploitation is that all the folks who responded to reasonable questions with snarky "let me Google that for you" links which now lead to nothing but AI-generated gibberish look like real assholes now.
950 notes · View notes
advancedtreelover · 9 months ago
Text
Đ“Đ”Đč, ŃĐŸĐșĐŸĐ»Đž
You won't believe what Google Translate did to the Ukrainian title of the 19th-century song - "Đ“Đ”Đč, ŃĐŸĐșĐŸĐ»Đž" ("Hey, Falcons" - it's very, very popular in Poland too as "Hej sokoƂy" - it's one of those things we share)
[Screenshots of the Google Translate output]
I won't even comment on other parts of the translation which is... er... but from now on I'm going to call this song "Gay Falcons", and I snorted tea through my nose at work.
521 notes · View notes
Text
AUTOMATIC CLAPPING XBOX TERMINATOR GENISYS
107 notes · View notes
birgfan · 7 months ago
Text
A phonetic alphabet for sperm whales proposed by Daniela Rus, Antonio Torralba and Jacob Andreas.
The open-access study, published in Nature Communications and titled 'Contextual and combinatorial structure in sperm whale vocalisations', analysed sperm whale vocalisations and, as part of that, proposed a phonetic 'alphabet' for them.
So cool. Dolphins next please.
MIT news also did an article on this
57 notes · View notes
probablyasocialecologist · 2 years ago
Link
MW: I don’t think there’s any evidence that large machine learning models—that rely on huge amounts of surveillance data and the concentrated computational infrastructure that only a handful of corporations control—have the spark of consciousness.
We can still unplug the servers, the data centers can flood as the climate encroaches, we can run out of the water to cool the data centers, the surveillance pipelines can melt as the climate becomes more erratic and less hospitable.
I think we need to dig into what is happening here, which is that, when faced with a system that presents itself as a listening, eager interlocutor that’s hearing us and responding to us, that we seem to fall into a kind of trance in relation to these systems, and almost counterfactually engage in some kind of wish fulfillment: thinking that they’re human, and there’s someone there listening to us. It’s like when you’re a kid, and you’re telling ghost stories, something with a lot of emotional weight, and suddenly everybody is terrified and reacting to it. And it becomes hard to disbelieve.
FC: What you said just now—the idea that we fall into a kind of trance—what I’m hearing you say is that’s distracting us from actual threats like climate change or harms to marginalized people.
MW: Yeah, I think it’s distracting us from what’s real on the ground and much harder to solve than war-game hypotheticals about a thing that is largely kind of made up. And particularly, it’s distracting us from the fact that these are technologies controlled by a handful of corporations who will ultimately make the decisions about what technologies are made, what they do, and who they serve. And if we follow these corporations’ interests, we have a pretty good sense of who will use it, how it will be used, and where we can resist to prevent the actual harms that are occurring today and likely to occur. 
258 notes · View notes
lingthusiasm · 8 days ago
Text
Transcript Episode 98: Helping computers decode sentences - Interview with Emily M. Bender
This is a transcript for Lingthusiasm episode ‘Helping computers decode sentences - Interview with Emily M. Bender’. It’s been lightly edited for readability. Listen to the episode here or wherever you get your podcasts. Links to studies mentioned and further reading can be found on the episode show notes page.
[Music]
Lauren: Welcome to Lingthusiasm, a podcast that’s enthusiastic about linguistics! I’m Lauren Gawne. Today, we’re getting enthusiastic about computers and linguistics with Professor Emily M. Bender.
But first, November is our traditional anniversary month! This year, we’re celebrating eight years of Lingthusiasm. Thank you for sharing your enthusiasm for linguistics with us. We’re also running a Lingthusiasm listener survey for the third and final time. As part of our anniversary celebrations, we’re running the survey as a way to learn more about our listeners, get your suggestions for topics, and to run some linguistics experiments. If you did the survey in a previous year, there’re new questions, so you can totally participate again this year. There’s also a spot for asking us your linguistics advice questions, since our first linguistics advice bonus episode was so popular.
You can hear about the results of the previous surveys in two bonus episodes, which we’ll link to in the show notes. We’ll have the results from this year’s survey in an episode for you next year. To do the survey or read more details, go to bit.ly/lingthusiasmsurvey24 – that’s bit.ly/lingthusiasmsurvey24 (the numbers 2 and 4) – before December 15 anywhere on Earth. This project has ethics board approval from La Trobe University, and we’re already including results from previous surveys into some academic papers. You, too, could be part of science if you do the survey.
Our most recent bonus episode was a linguistics travelogue. We discuss Gretchen’s recent trip to Europe where she saw cool language museums, and what she did to prepare for encountering several different languages on the way, as well as planning our fantasy linguistic excursion to Martha’s Vineyard. Go to patreon.com/lingthusiasm to hear this and many more bonus episodes and to help keep the show running ad-free.
Also, very exciting news from Patreon, which is that they’re finally adding the ability to buy Patreon memberships as a gift for someone else. If you’d be excited to receive a Patreon membership to Lingthusiasm as a gift, we’ll have a link in the show notes for you to forward to your friends and/or family with a little wink wink, nudge nudge. We also have lots of Lingthusiasm merch that makes a great gift for the linguistics enthusiast in your life.
[Music]
Lauren: Today, I am delighted to be joined by Emily M. Bender who is a professor at the University of Washington in the Department of Linguistics. She is the director of the Computational Linguistics Laboratory there. Emily’s research and teaching expertise is in multilingual grammar engineering and societal impacts of language technologies. She runs the live-streaming podcast Mystery AI Hype Theater 3000 with sociologist Dr. Alex Hanna. Welcome to the show, Emily!
Emily: I am so enthusiastic to be on Lingthusiasm.
Lauren: We are so delighted to have you here today. Before we ask you about some of your current work with computational linguistics, how did you get into linguistics?
Emily: It was a while ago. Back when I was in high school, we didn’t have things like the Lingthusiasm podcast – or podcasts for that matter – to spread the word about what linguistics was. I actually hadn’t heard about linguistics until I got to university. Someone gave me the excellent advice to get the course catalogue ahead of time – it was a physical book in those days – and just flip through it and circle anything that looked interesting. There was this one class called “An Introduction to Language.” In my second term, I was looking for a class that would fulfil some kind of requirements, and it did, and I took it. Let me tell you, I was hooked on the first day. Even though the first day was actually about the bee dance and other animal communication, I just fell in love with it immediately. I think, honestly, I had always been a linguist. I loved studying languages. My ideal undergraduate course of study would’ve been, like, take the first year of all the languages I could.
Lauren: That would be an amazing degree. Just like, “I have a bachelors in introductory language.”
Emily: Yeah, I mean, speaking now as a university educator, I think there’s some things missing from that, but as a linguist, how much fun would that be. I didn’t know there was a way to study how languages work without studying all the languages. When I found it, I was just thrilled.
Lauren: Excellent. I think that’s such a typical experience of a lot of people who get to university, and they’re intrigued by something that’s like, “How can it be an intro to language when I’ve learnt a bunch of languages?” And then you discover there’s linguistics, which brings you into the whole systematic nature of things.
Emily: Absolutely. My other favourite story to tell about this is I have a memory of being 11 or 12 and day dreaming and trying to figure out what the difference was between a consonant and a vowel.
Lauren: Amazing.
Emily: Because we were taught the alphabet. There’s five vowels and sometimes Y, and the other ones are consonants. What’s the difference? My regret with this story is that I didn’t record what it was that I came up with. I have no idea if I was anywhere near the right track. But I don’t think that your average non-linguist does things like that.
Lauren: That’s extremely proto-linguist behaviour. I love it. I’m sad we don’t have 11-year-old Emily’s figuring out from first principles of the IPA.
Emily: Emily who definitely went on to be a syntax / semantics side linguist and not a phonetics / phonology side linguist.
Lauren: How did you become a syntax-semantics linguist? How did you get into your research topic of interest?
Emily: In undergrad, it was definitely the syntax class that I connected with the most. I got to study Construction Grammar with Chuck Fillmore and Paul Kay at UC Berkeley, which was amazing, and sort of was aware at the time that at Stanford there was work going on on two other frameworks called Lexical-Functional Grammar and Head-Driven Phrase-Structure Grammar. These are different ways of building up representations of language. I went to grad school at Stanford with the idea that I was going to create a generalised Bay Area grammar and bring together everything that was best about each of the frameworks. They are similar in spirit. They’re sometimes described as “cousins.” Then I got to Stanford, and I took a class with Joan Bresnan on Lexical-Functional Grammar and a class with Ivan Sag on Head-Driven Phrase-Structure Grammar. I realised that it’s actually really valuable to have different toolkits because they help you focus on different aspects of the grammars of languages. Merging them all together really wasn’t gonna be a valuable thing to do.
Lauren: It’s good that you could see what each of them was bringing to – that we have syntax, and there’s structure, but different ways of explaining it give different perspectives on things.
Emily: Exactly, and lead linguists to want to go explore different things about different languages. If you’re working with Lexical-Functional Grammar, then languages that do radical things with their word order, like some of the languages of Australia, are particularly interesting, and languages that put a lot of information into the morphology – so the parts of the words – are really interesting. If you’re doing Head-Driven Phrase-Structure Grammar, then it’s things like getting deep into the idiosyncrasies of particular languages – the idioms and the sub-patterns – and making them work together with the major patterns is a big focus of HPSG. You’re just gonna work on different problems using the different frameworks.
Lauren: I love that. An incredibly annoying undergraduate proto-linguist behaviour I still remember in my syntax class – because you learn to draw syntax trees. One of my fellow students and I were like, “Trees are fine, but we need to keep extending them down because they only go as far as words,” and there’s all this stuff happening in the morphology. We thought we were very clever for having this very clever thought. We were very lucky that our syntax professor was Rachel Nordlinger, who is another person who works with Lexical-Functional Grammar, which, as you said, is really interested in morphology. You could tell she was just like, “You guys are gonna be so happy when we get to advanced syntax, but just hold on. We’re just doing trees for now.” That’s how I got introduced to different forms of syntax helping answer different questions. It’s like, “Oh, this is one that accounts for the things that are happening inside words as well.” It’s really cool.
Emily: One of the things about both LFG and HPSG is that they’re associated with these long-term computational projects where people aren’t just working out the grammars of languages with pen and paper but actually codifying them in rules that both people and computers can deal with. I got involved with the HPSG project like that as a graduate student at Stanford, and then later on while – my first job, actually – that’s not true. My first job out of grad school was teaching for a year at UC Berkeley, but then I had a year after that where I was working in industry at a start up called “YY Technologies” that was using a large-scale grammar of English to create automated customer service response. You’ve got an email coming in, and the idea is that we parse the email, get some representation of what’s being asked, look up in a database what an appropriate answer would be, and then send that answer back. The goal was to do it on the easy cases so that the harder cases that the automated system couldn’t handle would get passed through to a representative. The start up was doing that for English, and they wanted to expand to Japanese. I had been working on the English grammar, actually, as a graduate student at Stanford because it’s an open source grammar, and I speak Japanese, and so I got to do this job where it was literally my job to build a grammar of Japanese on a computer. It was so cool. That was a fantastic job. In the course of that year, there was a project starting up in Europe that was interested in building more of these grammars for more languages. I picked up the task of saying, “How can we abstract out of this big grammar for English,” which at that point was about seven years old, still under development. It is quite a bit older now, quite a bit bigger.
Lauren: Amazing.
Emily: “How can we take what we’ve learned about doing this for English and make it available for people to build grammars more quickly of other languages?” I took that English grammar and held it up next to the Japanese grammar I was working on and basically just stripped out everything that the Japanese made look English-specific and said, “Okay, here’s a starter kit. This is the start of the grammar matrix that you can use to build a new grammar.” That’s the beginning of that project. I have since been developing that – we can talk more about what “developing it” means – together with students, now, for 23 years. It’s a really long-standing project.
Lauren: Amazing. That is – in terms of linguistics research projects and, especially, computational linguistics projects – a really long time. It speaks to the fact that computers don’t process language the same way we do. A human by the age of 23 is fully functional in a language and can be sharing that language with other people, but for a computer, you’re finding more and more – I assume at this point it’s really specific rules or factors or edge cases.
Emily: For the English grammar that I was describing, yes, it’s basically that. The grammar matrix grows when people add facilities to it for handling new things that happen across languages. For example, in some languages, you have a situation where, instead of having just one verb to say something like “bathe,” it requires two words together. You might have a verb like “take” that doesn’t mean very much on its own and then the noun “bath,” and “take a bath” means the same thing as “bathe.” This phenomenon, which is called “light verb constructions,” shows up in many different languages around the world in slightly different ways. When the student is done with her master’s thesis, you’ll be able to go to the grammar matrix website and enter in a description of light verb constructions in a language and have a grammar come out that can handle them.
Lauren: So excellent. And not something, if we were only working in English, that we would think about, but light verbs show up across different language families and across the grammars of languages that you want to build computational resources for, so it makes sense to add this kind of functionality.
Emily: Exactly. And light verbs do happen in English, but they happen in different ways and more intensively in other languages. You can kind of ignore them in English and get pretty far, but in a language like Bardi, for example, in Australia, you aren’t gonna be able to do very much if you don’t handle the light verbs.
Lauren: And now, hopefully at the end of this MA, we’ll be able to.
Emily: Yes, exactly.
Lauren: Why is it useful to have resources and grammars that can be used for computers for languages like Bardi or, I mean, even large languages like Japanese?
Emily: Why would you want to build a grammar like this? Sometimes, it’s because you want to build a practical application where you can say, “Okay, I’m gonna take in this Japanese string, and I’m going to check it for grammatical errors,” or “I’m going to come up with a very precise representation of what it means that I can then use to do better question answering,” or things like that. But sometimes, what you’re really interested in is just what’s going on in that language. The cool thing about building grammars in a computer is that your analysis of light verb constructions has to work together with your analysis of coordination and your analysis of negation and your analysis of adverbs because they aren’t separate things, they’re all part of one grammar.
Lauren: And so, if we can make computers understand it, it’s a good way of validating that we have understood it and that we’ve described the phenomenon sufficiently.
Emily: And on top of that, if you have a collection of texts in the language, and you’ve got your grammar that you’ve built, and you wanna find what you haven’t yet understood about the language, you try running that text through your grammar and find all of the places where the grammar can’t process the sentence. That’s indicative of something new to look into.
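As a toy illustration of that coverage-testing workflow (a sketch added here for readers, far simpler than the HPSG-style grammars being discussed, with an invented mini-grammar and lexicon): write down a few rules, run sentences through them, and treat every parse failure as a pointer to something not yet analysed.

# Toy phrase-structure grammar: each rule rewrites a symbol into a sequence
# of symbols or words. Real grammar-engineering systems are far richer.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Det", "N"], ["Name"]],
    "VP":   [["V"], ["V", "NP"]],
    "Det":  [["the"], ["a"]],
    "N":    [["dog"], ["bath"]],
    "Name": [["Kim"]],
    "V":    [["barked"], ["took"]],
}

def derives(symbol, words):
    """True if `symbol` can expand to exactly the word sequence `words`."""
    if symbol not in GRAMMAR:                      # terminal word
        return list(words) == [symbol]
    return any(matches(expansion, words) for expansion in GRAMMAR[symbol])

def matches(symbols, words):
    """Can this sequence of symbols expand to exactly these words?"""
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    return any(
        derives(first, words[:split]) and matches(rest, words[split:])
        for split in range(len(words) + 1)
    )

# Coverage check: sentences the grammar rejects point to analyses still missing.
# The light-verb construction "Kim took a bath" parses, but "Kim bathed" fails
# because "bathed" isn't in the toy lexicon yet.
for sentence in ["the dog barked", "Kim took a bath", "Kim bathed"]:
    status = "parses" if derives("S", sentence.split()) else "FAILS"
    print(status, "->", sentence)

Real grammar-engineering systems attach feature structures and semantics to every rule, but the principle of running a corpus through the grammar and collecting the failures is the same.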
Lauren: It’s thanks to this kind of computational linguistics that all those blue squiggles turn up on my word processing, and I don’t make major syntactic mess ups while I’m writing.
Emily: That’s actually an interesting case. Historically, yes, the blue squiggles came from grammar engineering. I believe they are now done with the large language models. We can talk about that some if you want.
Lauren: Okay, sure. But it was that kind of grammar engineering that led to those initial developments in spell checkers and those kind of things.
Emily: Yes, exactly.
Lauren: Amazing. Attempting to get computers to understand human language has been something that has been part of the interest of computational scientists since the early days of 20th Century computing. I feel like a question that keeps popping up when you read the history of this is like, “And then someone figured something out, and they figured we’d solve language in five years.” Why haven’t we solved getting computers to understand language yet?
Emily: I think part of it is that getting computers to understand language is a very imprecise goal, and it is one where, if you really want the computer to behave the same way that a person would behave if they heard something and understood it, then you need way more than linguistics. You need something – and I really hate the term “artificial intelligence” – but you basically need to solve all of the problems that building artificial intelligence – if that were a worthy goal – would require solving. You can ask much narrower questions and build useful language technology – so grammar checkers, spell checkers – that is computers processing natural languages to good effect. Machine translation – it’s not the case that the computer has understood and then is giving you a rendition in the output language. Machine translation is just “Well, we’re gonna take this string of characters and turn it into that string of characters because, according to all of the data that was used to develop the system, those patterns relate to each other.”
Lauren: I think it’s also easier to understand from a linguistic perspective that when people say, “solve language,” they have this idea of language as a single, unified thing, but so far, we’ve only been talking about written things and the issues that are around syntax and meaning. But dealing with understanding or processing written language versus processing voice going in versus creating voice – they’re all different skills. They require different linguistic and computational skills to do well. Solving language involves solving, actually, hundreds and thousands of tiny different problems.
Emily: Many, many different problems, and they’re problems that, you say, involve different skills. So, are you dealing with sound files? Are you dealing with if you actually wanted to process something more like what a person is doing? Do you have video going on? Are you capturing the gesture and figuring out what shades of meaning the gesture is adding?
Lauren: Nodding vigorously here.
Emily: I know I don’t need to tell you that. [Laughs] But also pragmatics, right, we can get to a pretty clear representation for English at least of the “Who did what to whom?” in a sentence – the bare bones meaning in the form of semantics. But if we want to get to “Okay, but what did the person mean by saying that? How does that fit in with what we’ve been discussing so far and the best understanding possible of what the person is trying to do with those words?” that’s a whole other set of problems – that’s called “pragmatics” – that is well beyond anything that’s going on right now. There’s tiny little forays into computational pragmatics, but if you really want to understand language – a language, right, most of this work happens in English. We have a pretty good idea about how languages vary in their syntax. Variation at the level of semantics, less well studied. Variation in pragmatics, even less so. If we were going to solve language, we need to say which language.
Lauren: Which raises a very important point. As you’ve said, most of this work happens in English. In terms of computational linguistics, there’s been the sense that people are very pleased that we’ve now got maybe a few hundred languages that we have pretty good models for, but there’s still thousands of languages that we don’t have any good computational models for. What is required to make that happen? If you had a very large budget and a great deal many computational linguists to train at your disposal, what’s the first thing you would need to start doing?
Emily: The very first thing that I would start doing, I think, is engaging with communities and seeing which communities actually want computational work done on their languages. And then my ideal use of those resources would be to find the communities that want to do that, find the people in those communities who want to be computational linguists, and train them up rather than what’s usually a much more extractive, “We’re gonna grab your data and build something” kind of a thing. And then it becomes a question of “Okay, well, what do you want computers to be able to do with your language?” – a question to the community. Do you want to be able to translate in and out of, maybe, English or French or some other world or colonial language? Do you want a spell checker? Do you want a grammar checker? Do you want a dialogue partner for people who are learning the language? Do you want a dictionary that makes it easier to look up words? If your language is the kind of language that has a whole bunch of prefixes, just alphabetical order, you know, the words, isn’t gonna be very helpful. What’s needed? And then it depends – do you want automatic transcription? Do you want text-to-speech? Then depending on what the community is trying to build, you have different data requirements. If you wanna build a dictionary like that, that’s a question of sitting down and writing the rules of morphology for the language and collecting a big lexicon. If you want text-to-speech, you need lots and lots of recordings that have been transcribed in the language. If you want machine translation, you need lots and lots of parallel text between that language and the language you’re translating into.
Lauren: And so, a lot of that will use the same computational grammar models but will have slightly different takes on what those models are and will need different data to help those models do their job.
Emily: In some cases, the same models, in some cases, different. I think if we’re talking speech processing, automatic transcription, or speech-to-text, we’re definitely in machine learning territory, and so that’s one kind of model. Machine translation can be done in a model of the grammar mapped to semantics form, or it can be done with machine learning. The spell checker, especially if you’re dealing with a language that doesn’t have enormous amounts of texts to start with, you definitely want to do that in a someone-writes-down-the-rules kind of a fashion. That’s a kind of grammar engineering, but it’s distinct from the kind that I do with syntax.
Lauren: And so, it just starts to unpack how complicated this idea of “Computers do language” is because they’re doing lots of different things, and they need lots of different data. Obviously, we say “data” as though it’s some kind of objective, general pot of things, but when we say “data,” we mean maybe people’s recordings, maybe people’s stories, maybe knowledge and language that they don’t want people outside of their community to have. That creates different imperatives around whether these models are gonna be a way forward or useful for people.
Emily: And at the moment, we don’t have very many great models for collecting data and then handling it respectfully. There are some great models, and then there’s a lot of energy behind not doing that. The best example that I like to point to is the work of Te Hiku Media in Aotearoa (New Zealand). This is an organisation that grew out of a radio project for Te Reo Māori. They were at a community level collecting transcriptions of radio shows in Te Reo Māori, which is the Indigenous language of Aotearoa (New Zealand). Forgive my pronunciation; I’m trying my best. They had been approached over the years many, many times by big tech saying, “Give us that data. We’d like to buy that data,” and they said, “No, this belongs to the community.” They have developed something called the “Kaitiakitanga License,” which is a way that works for them of granting access to the data and keeping data sovereignty – basically keeping community control of the data. There are ways of thinking about this, but it really requires strength and community against the interests of big tech that takes a very extractivist view of data.
Lauren: It’s good that there are some models that are being developed and normalising of this as one possible way of going forward. As you’ve said, you’ve spent a lot of time working to build a grammar matrix for lots of different languages. This goes against a general trend of focusing on technologies in major languages where there’re clear commercial and large-audience imperatives. Part of this work has been making visible the fact that English is very much a default language in the computational linguistics space. Can you give us an introduction to the way that you started going about making the English-centric nature of computational linguistics more visible?
Emily: I think that this really came to a head in 2019 when I was getting very fed up with people writing about English as if it weren’t language. They would say, “Here’s an algorithm for doing machine reading comprehension,” or “Here’s an algorithm for doing spell checking,” or whatever it is. If it were English, they wouldn’t say that. It seems like, “Well, that’s a general solution,” and then anybody working on any other language would have to say, “Well, here’s a system for doing spell checking in Bardi,” or “Here’s a system for doing spell checking in Swahili,” or whatever it is. Those papers tended to get read as, “Well, that’s only for Bardi,” or “That’s only for Swahili,” where the English ones – because English was treated as default – were taken as general. I made a pest of myself at a conference in 2019 – the conference is called “NAACL” – where I basically just, after every talk where people didn’t mention the name of the language, went to the microphone, introduced myself, and said, “Excuse me, what language was this on?” which is a ridiculous question, right, because it's obvious that it’s English. It’s sort of face threatening. It’s impolite because it’s “Why are you asking this question?” but it’s also embarrassing as the asker. Like, “Why would you ask this silly question?” But I was just making a point. Somewhere along the line, people dubbed that the “Bender Rule,” that you have to name the language that you’re working on, especially if it’s English.
Lauren: I really appreciate your persistence, and I appreciate people who codified it into the Bender Rule because now it’s actually less threatening for me to “I’m just gonna evoke the Bender Rule and just check if this was just on English.” You’ve given us a very clear model where we can all very politely make pests of ourselves to remind people that solving something for English or improving a process for English doesn’t automatically translate to that working for other languages as well.
Emily: Exactly. And I like to think that, basically, by lending my name to it, I’m allowing people to ask that question while blaming it on me.
Lauren: Great. Thank you very much. I do blame it on you all the time in the nicest possible way.
Emily: Excellent.
Lauren: This seems to be part of a larger process you’ve been working on. Obviously, there’s people working on computational processes for English, and you’re trying to be very much a linguist at them, but it seems like you also are spending a lot of time, especially in terms of ethical use of computational processes, trying to explain linguistics to computer scientists as well. How is that work going? Are computer scientists receptive to what linguistics has to offer?
Emily: Computer scientists are a large and diverse group in terms of their attitudes. They are an unfortunately un-diverse group in other ways. It’s an area of research and development that has a lot of money in it right now. There’s always new people coming in, and so it feels like no matter how much teaching of linguistics I do, there is still just as many people who don’t know about it as there ever were because new people are coming in. That said, I think it’s going well. I have written two books that I call, informally, “The 100 Things” books because they started off as tutorials at these computational linguistics conferences with the title, “100 Things You Always Wanted to Know About Linguistics But Were Afraid to Ask” and then subtitle, “For Fear of Being Told 1,000 More.” [Laughter]
Lauren: I mean, it’s not a mischaracterisation of linguists, that’s for sure.
Emily: We’re gonna keep linguisting at you, right. In both cases, the first one is about morphology and syntax. I basically just wrote down, literally, 100 things that I wish that people working in natural language processing in general knew about how language works because they tend to see language as just strings of words without structure. Worse than that, they tend to see language as directly being the information they’re interested in. I used to have really confusing conversations with colleagues in computer science here – people who were interested in gathering information from large collections of texts, like the web (this is a process called “information extraction”) – and when I finally realised that we were focusing on different things – I was interested in the language, and they were interested in the information that was expressed in the language – the conversations started making sense. I came up with a metaphor to help myself, which is, if you live somewhere rainy, can you picture you’ve got a rain-splattered window. You can focus on the raindrops, or you can focus on the scene through the window distorted by the raindrops. Language and its structures are the raindrops, which have an effect on what it is that you can see through the window, but it is very easy to look right through them and imagine you’re just seeing the information of the world outside. When I realised that, as a computational linguist, I’m interested in the raindrops, but some of these people working in computer language processing are just staring straight through them at the stuff outside, it helped me communicate a lot better.
Lauren: I feel like I’ve had a lot of conversations with computational scientists where they’re like, “Ah, we did a big semantic analysis of –” so there’s a process you can apply where you have a whole bunch of processes and algorithms that run, and it says, “80% of the people in this chat –” or this series of, I think they’re like, used to pulling things from Reddit. You could do that easily. It’s like, “80% of people in this hate chocolate ice cream.” I’d always be like, “Okay, but did you account for the person who’s like, ‘Oh my god, I hate how delicious this ice cream is’?” And they’re just like, “Ah
 well, no, because we just used – ‘hate’ was negative so
 ‘delicious’ was positive, so this person probably came out in the wash.” I’m like, “No, this is a person who extremely likes this ice cream,” and it’s also a very idiomatic, informal kind of English. I certainly wouldn’t write that in a professional reference for someone – “I hate how amazing this person is. You should hire them.” As a linguist, I’m really interested in these nuanced, novel edge cases, and as a computational scientist, they’re like, “Oh, we just hope we get enough data that they disappear in the noise.”
Emily: And the words are the data. The words are the meaning. There’s no separation there. There’s no structure to the raindrops. “If I have the words, I have the meaning” seems to be the attitude.
Lauren: Well, it’s great that you’re doing the work of slowly letting them down from that assumption.
Emily: We’re trying. Oh, one other thing about these books. The first one is morphology and syntax, the second one is semantics and pragmatics. In both of them – the second one is co-authored with Alex Lascarides – in both of them I have the concept index and the index of languages. Every time we have an example sentence, it shows up as an entry in the index for languages. There’s an index entry for English. Even though it indexes almost every single page in the book, it’s in there because English is a language.
Lauren: There’s this thing called the “Bender Rule.” I don’t know if you’ve heard of it, but I’m really glad that you’re following its principles. A lot of the work you’ve been doing is with a type of computational linguistics where you are building rules to process language and create useful computational outputs, but there are other models for how people can use language computationally.
Emily: I tend to do symbolic or rule-based computational linguistics. I’m really interested in “What are the rules of grammar for this language or for this phenomenon across languages? How can I encode them so that I can get the machine to test them, but also, I can still read them?” But a lot of work in computational linguistics, instead, uses statistical models, so building models that can represent patterns across large bodies of text.
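(A minimal sketch, in Python, of what readable, machine-testable grammar rules can look like – the toy rules and vocabulary below are invented for illustration and are nothing like the scale of a real grammar engineering project.)

```python
# Toy grammar rules a human can read and a machine can test.
# Invented for illustration; real grammars encode far richer rules than this.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the", "a"},
    "N":   {"linguist", "sentence", "computer"},
    "V":   {"parses", "sleeps"},
}

def derives(symbol, words):
    """True if the word sequence can be built from `symbol` using RULES and LEXICON."""
    if symbol in LEXICON:
        return len(words) == 1 and words[0] in LEXICON[symbol]
    for rhs in RULES.get(symbol, []):
        if len(rhs) == 1 and derives(rhs[0], words):
            return True
        if len(rhs) == 2 and any(
            derives(rhs[0], words[:i]) and derives(rhs[1], words[i:])
            for i in range(1, len(words))
        ):
            return True
    return False

print(derives("S", "the linguist parses a sentence".split()))  # True
print(derives("S", "linguist the sleeps".split()))             # False
```

Writing rules this way has the property Emily comes back to below: when a sentence fails, you can point at exactly which rule was or wasn’t used.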
Lauren: Oh, so that’s like predictive text on my mobile phone where it’s so used to reading all of the data that it has from other people’s text messages and my text messages that sometimes it can just predict its way through a whole message for me.
Emily: Yes, exactly. And in fact, I don’t know if this is so true anymore, but for a while, you could see that the models were all different on different phones. Remember we used to play that game where you typed in, “Sorry I’m late, I
” and then just picked the middle option over and over again, and people would get different, fun answers.
Lauren: Yes, and you’d get wildly different answers.
Emily: That reflects local statistics being gathered based on how you’ve been using that phone versus a model that it may have started with that was based on something more generic. That is, yes, an example of statistical patterns. You also see these – and this is fun in automatic transcriptions, like the closed captioning in TV shows if you’re thinking about live news or something where it wasn’t done ahead of time, and they get to a name of a person or a place which clearly wasn’t in the training data already represented in the model, and ridiculous, funny things come out because the system has to fall back to statistical patterns about what that word might have been, and it reveals interesting things about the training data.
Lauren: We used to always put the show through a first pass on YouTube, where Lingthusiasm is also hosted, before Sarah Dopierala came in and transformed our lives by being an amazing transcriptionist. For years, when YouTube would transcribe “Lingthusiasm” – a word it had never encountered before, in its defence, as a computer – it would come up with “Link Susy I am” most often. We still occasionally refer to “Link Susy I am.” It was interesting when it finally, clearly, had enough episodes with Lingthusiasm in our manually updated transcripts that it got the hang of it, but that was definitely a case where it needed to learn. We definitely have a much higher success rate of perfect, first-time transcripts with Sarah.
Emily: That pattern that you saw happening with YouTube, that change, shows you that Google was absolutely taking your data and using it to train their models. In the podcast that I run, Mystery AI Hype Theater 3000, we have some phrases that are uncommon, and we do use a first-pass auto-transcriber. For example, we refer to the so-called AI models as “Mathy Maths.”
Lauren: “Mathy Maths,” yeah.
Emily: That’ll come out as like, “Matthew Math.”
Lauren: Oh, my good friend Matthew Math.
Emily: [Laughs] And the phrase “stochastic parrots” sometimes comes out as like, “sarcastic parrots” or things like that.
Lauren: And you and Alex both have, I would say, relatively standard North American English accents, which is really important for these models because, so far, we’ve just been talking about data where it’s found, and like, we’re linguists working with it and processing it before the computer gets to it. But with a lot of these new statistical models, it’s just taking what you give it. That means, as an Australian English speaker, I’m relatively okay, but it’s not as good for me as it is for a Brit or an American. And then if you’re a Singaporean English or Indian English speaker, even as a native English speaker, the models aren’t trained with you in mind as the default user. It just gets more and more challenging.
Emily: Exactly. Some of that is a question of “What could the companies training these models easily get their hands on?” But some of it is also a question of “Who were they designing for in the first instance? Whose data did they think of as ‘normal data’ that they wanted to collect?”
Lauren: These are deliberate choices that are being made.
Emily: Absolutely.
Lauren: With these statistical models, how do they differ from the grammars that you’ve created?
Emily: In a rule-based grammar system, somebody is sitting down and actually writing all the rules. Then when you try a sentence, and it doesn’t work as expected, you can trace through “What rule was used and shouldn’t have been used?” “What rule did you expect to have showing up in that analysis that wasn’t there?” and you can debug like that. The statistical models, instead, you build the model that’s the receptacle for the statistics. You gather a whole bunch of data, and then you use this receptacle model to process the data item by item and have it output according to its current statistics, likely answers, and then compare them to what’s actually there, and then update the statistics every time it’s wrong. You do that over and over and over again, and it becomes more and more effective at closely modelling the patterns in the data, but you can’t open it up and say, “Okay, this part is why it gives that output, and I want to change that.” It’s much more amorphous, in a sense, much more of a “black box” is the terminology that gets used a lot.
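(A minimal sketch of that “likely answer, compare, update” loop – a toy error-driven classifier on invented sentences, echoing the ice-cream example from earlier. None of this is any real system; it only illustrates the shape of the training loop.)

```python
# Toy error-driven training loop: guess, compare with the real label, nudge the statistics.
# The sentences and labels are invented for illustration.
LABELED = [
    ("i hate this flavour", -1),
    ("this ice cream is delicious", +1),
    ("i hate how delicious this ice cream is", +1),  # the tricky idiomatic case from earlier
]

weights = {}  # the "receptacle for the statistics": one number per word

def predict(text):
    score = sum(weights.get(word, 0.0) for word in text.split())
    return +1 if score >= 0 else -1

for _ in range(20):                      # do that over and over again...
    for text, actual in LABELED:
        if predict(text) != actual:      # compare the likely answer with what's actually there
            for word in text.split():
                weights[word] = weights.get(word, 0.0) + actual  # update when it's wrong

print(weights)  # just a bag of numbers at the end
```

What you end up with is a pile of numbers, not rules you can read – which is the “black box” contrast Emily is drawing.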
Lauren: In 2020 we were really lucky to have Janelle Shane join us on the show and walk us through one of these generative statistical models from that era. She generated some Lingthusiasm transcripts based off the first 40 or so episodes of transcripts that we had. When it generated transcripts, the model had this real fixation on soup. It got the intro to Lingthusiasm right because we say that 40 times across 40 episodes, and then it would have us be like, “Today, we’re talking about soup.” And we were like, “Janelle, what’s with the soup?” and she’s like, “I can’t tell you. It’s a black box in there” – literally referred to as hidden layers in the processing. So, because we don’t know why it was fixated on soup, there’s some great fake Lingthusiasm transcripts that we read – very soup-focused, and very focused on a couple of major pieces of IP that are classic fan fiction favourites, because it had read a bunch of fan fiction as well. You can make some guesses about why it’s talking about wizards a whole bunch, but you can’t make many guesses about why it’s talking about soup a whole bunch, and that makes it hard to debug that issue.
Emily: Hard to debug, yeah. But also, if you don’t know the original training data – so it sounds like she took a model that had been trained on some collection of data –
Lauren: Yes, so that it could be coherent with only those 40 transcripts.
Emily: Exactly, yeah. But if you don’t know what’s in that training data, then you are even more poorly placed to figure out “Why soup?”
Lauren: And since we did that episode, I think the big thing that’s changed is that the models are being given enough extra data that they’re no longer fixated on soup, but they’ve also just become easier for everyday people to use. Part of why we were really grateful for her to come on the show is that she walked us through the fact that she was still using scripting language to ingest those transcripts and to generate the new fabricated text. It all looked very straightforward if you’re a computer person, but you need to be a person who’s comfortable with scripting languages. That’s no longer the case with these new chat-based interfaces. That’s really changed the extent to which people interact with these models.
Emily: Yes, exactly. There’s a few things that have changed. One is there’s been some engineering that allowed companies to make models that could actually take advantage of very large data sets. There has been the collection of very large data sets in a not very consent-based fashion. Then there has been the establishment of these chat interfaces, as you say, where you can just go and poke at it and get something back. Honestly, the biggest thing that happened – the reason that all of a sudden everybody’s talking about ChatGPT and so-called “AI” – was that OpenAI set up this interface where anybody could go poke at it, and then they had a million people sharing their favourite examples. It was this marketing win for OpenAI and a big loss for the rest of us.
Lauren: I think the sharing of examples is really important as well because people don’t talk very often about the human curation that goes into picking funny or coherent or relevant examples. We had to junk so many of those fake transcripts to find the handful that were funny enough to pretend read and give a rendition of. When people are sharing their favourite things that come out of these machines, that’s a level of human interaction with them that I think is often missing but making it very easy for people to generate a whole bunch of content and then pick their favourite and share it has really normalised the use of these large language model ways of playing with language.
Emily: Exactly. If you were someone who’s not playing with it, or even if you are, most of the output you’re going to see is other people sharing their favourites. You get a very distorted view of what it’s doing.
Lauren: In terms of what it is doing, you know, we talked before about when a computer is doing translation between two languages, it’s not that it’s understanding, it’s replacing one string of texts with another string of text with these generative models that are creating this text that, on an initial read, reads like English. What are some of the limitations of these models?
Emily: Just like with machine translation, it’s not understanding. The chat interface encourages you to think that you are asking the chat bot a question, and it is answering you. This isn’t what’s happening. You are inputting a string, and then the model is programmed to come up with a likely continuation of that string. But a lot of its training data is dialogues, and so something that takes the form of a question provokes as a likely continuation an answer. But it hasn’t understood. It doesn’t have a database that it’s consulting. It doesn’t have access to factual information. It’s just coming out with a likely next string given what you put in. Any time it seems to make sense, it’s because the person using it is the one making sense of it.
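(A minimal sketch of “a likely continuation of that string”: a toy next-word model built from counts over a made-up corpus. Real systems involve vastly more data and machinery, but the output is still just a likely next string, not an answer.)

```python
import random
from collections import Counter, defaultdict

# Tiny invented "training data"; real models ingest billions of words.
corpus = "the cat sat on the mat . the cat saw the dog . the dog sat on the rug .".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1            # how often does this word follow that word?

def continue_string(prompt, length=6):
    out = prompt.split()
    word = out[-1]
    for _ in range(length):
        options = following.get(word)
        if not options:
            break
        # sample the next word in proportion to how often it followed the current one
        word = random.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    return " ".join(out)

print(continue_string("the cat"))  # a plausible-looking continuation, with no understanding behind it
```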
Lauren: And because it’s had enough input – because it basically took large chunks of the English-speaking internet – there’s a statistical likelihood it is going to say something that is correct, but that is only a statistical chance. It doesn’t actually have the ability to verify its own factual information.
Emily: Exactly. I really dislike this term, but people talk about “hallucinations” with these models to describe cases where it outputs something that is not factually correct.
Lauren: Okay, why is “hallucination” not an appropriate word for you?
Emily: There’s two problems with it. One speaks to what you were just talking about which is if it says something that is factually correct, that is also just by chance. It’s always doing the same thing; it’s just that sometimes it corresponds to something we take to be true and sometimes it doesn’t. But also, if you think about the term “hallucination,” it refers to perceiving things that aren’t there. That suggests that these chat bots are perceiving things, which they very much aren’t. That’s why I don’t like the term.
Lauren: Fair enough. It’s a bit too human for what they’re actually doing, which is a pretty cool party trick, but it is just a party trick. One thing I’ve really appreciated about your critiquing of these systems is that you situate the linguistic issues around lack of actual understanding and real pragmatic capability, but you also talk about it in terms of these larger systems issues in terms of problems with the data and problems with the amount of computer processing it takes to perform this party trick, which are a combination of alarming issues. Can you talk to some of those issues and maybe some of the other issues that you’ve seen crop up with these models?
Emily: It’s so vexed. So, one place to start is a paper that I wrote with six other people in late 2020 called “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🩜” and then the parrot emoji is part of the title.
Lauren: Excellent.
Emily: This paper became famous in large part because five of the co-authors were at Google, and Google decided, after approving it for submission to a conference, that, in fact, it should be either retracted or have their names taken off of it, and, ultimately, three of the authors took their names off, and two others got fired over it.
Lauren: Right, okay. That is big impact for a conference paper.
Emily: In part, the paper’s impact in the aftermath of that was enhanced by the fact that the first author to get fired, Dr. Timnit Gebru, was masterful at taking the ensuing media attention and using it to shine a light on the mistreatment of black women in tech. She did an amazing job. Dr. Margaret Mitchell was the other one who got fired. It took a couple more months in her case.
Lauren: Oh, you mean, her name is not “Shmargaret Shmitchell”? [Laughter] That was a pseudonym?
Emily: That was a pseudonym, yeah. Who would’ve thought?
Lauren: I can’t believe it.
Emily: We wrote that paper because Dr. Gebru came to me in a Twitter DM in September of 2020 saying, “Hey, has anyone written about the problems with these large language models and what we should be considering?” because she was a research scientist in AI ethics at Google. It was literally her job to research this stuff and write about it. She had seen people around her pushing for ever bigger language models. This is 2020. So, the 2020 language models are small compared to the ones that we have now. Doing her job, she said, “Hey, we should be looking into what to look out for down this path.” I wrote back saying, “I don’t know of any such papers, but off the top of my head, here are the issues that I would expect to find in one based on independent papers,” so looking at things one by one in the literature. That was things like environmental impact, like the fact that they pick up biases and systems of oppression from the training data, like the fact that if you have a system that can output plausible-looking synthetic text that nobody is accountable for, that can cause various problems down the road when people believe it to be a real text. Then a beat or so later, I said, “Hey, this looks like a paper outline. Do you wanna write it?” That’s how the paper came to be. There’s two really important things that we didn’t realise at the time. One is the extent to which creating these systems relies on exploitative labour practices. That is both basically stealing everybody’s text without consent, but then also, in order to keep the systems from routinely outputting bigoted garbage, there’s this extra layer of so-called training where poorly paid workers, working long hours without psychological support, have to look at all the awful stuff and say, “That’s bad. That’s bad. This one’s okay,” and so on. This tends to be outsourced. There’s famously workers in Kenya who had been doing this. We didn’t know about that at the time, though some of the information was available, we could have.
Lauren: And it keeps outputting highly bigoted, disgusting text because it’s been trained on the internet, which as we all know is a bastion of enlightened and equal opportunity conversation.
Emily: Yes. But even if you go with only, for example, scientific papers, which are supposed to not be awful, guess what? There’s such a thing as scientific racism, and it is well embedded in the scientific literature. There was a large language model that Meta put together called “Galactica.” It came out right before ChatGPT. It was built as a way to access the world’s scientific knowledge, which of course it isn’t because if you take a whole bunch of scientific text, chop it up, and turn it into paper mñché, what you get out is not science but paper mñché, right. But anyway, people were poking at this and very quickly got it to say racist and otherwise terrible things in the guise of being scientific. I think it was the linguist Rikker Dockum who asked it something about stigmatisation of language varieties, and it came out with something about how African Americans don’t have a language of their own.
Lauren: Oh. A thing that we don’t even need to fact check because that is incorrect.
Emily: Anyway, you can certainly get to bigoted stuff starting with things less awful than the stuff that’s out there on the internet, but also, these models are trained on what’s out there on the internet. Labour exploitation was one thing that we missed. The other thing that we missed in the stochastic parrots paper was we had no idea that people were gonna get so excited about synthetic text. In the section where we actually introduce the term “stochastic parrot” to describe these machines that are outputting text with no understanding and no accountability, we thought we were going out on thin ice. Like, “People aren’t really gonna do this.” But now, it’s all over the place, and everyone is trying to sell it to you as something you might pay for.
Lauren: Yes, in many ways it’s a paper that was very prescient about a technology that has very quickly become normalised, which creates a compounding effect in terms of data, because now everyone’s sharing the synthetic text that they’re creating for fun, but people are also using it to populate webpages, and heaven knows a lot of the spam in my inbox is getting longer because it can just be generated with these machines and processes as well. Where it used to be human-created data that these models were trained on, now, if you try to scrape the internet, there’d be all of this synthetic machine-created language as well. They would just start training on their own output, which – I’m not a computational linguist, but that just sounds like it’s not a great idea.
Emily: If you think about what it is that you want to use these for, then ultimately, data quality really, really matters and, ideally, data quality that is not only good data but well-documented data, so you can decide, “Hey, is this good for my use case?” The ability to use the web as corpus to do linguistic studies is rapidly degrading. In fact, there’s a computational linguist named Robyn Speer who used to maintain a project called “wordfreq” which counted frequencies of words in web text over time. She has discontinued it because she says, “There’s too much synthetic garbage out there anymore. I can’t actually do anything reliable here. So, this is done.”
Lauren: So, it’s bad for computational linguistics. It’s bad for linguistics. And just to be clear, with these models, there’s no magic tweak that we can make to make them be factual.
Emily: No. Not at all. Because they’re not representing facts. They’re representing co-occurrences of words in text. Does this spelling happen a lot next to that spelling? Do they happen in the same places? Then they’re likely to be output in the same places. That sometimes reflects things that happen in the world, because sometimes the training text is things that people said while describing the actual world, but if it outputs something factual, it’s just by accident.
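(A minimal sketch of “co-occurrences of words in text”: toy counts of which spellings appear near which other spellings, plus a similarity score over those counts. The corpus is invented and purely illustrative – note that nothing here knows or checks any facts.)

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus; the point is only co-occurrence, not facts.
corpus = ("the cat chased the mouse . the dog chased the cat . "
          "the dog ate the food . the cat ate the food .").split()

window = 2
cooc = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[word][corpus[j]] += 1   # which spellings show up near which other spellings

def similarity(w1, w2):
    """Cosine similarity of two words' co-occurrence vectors: do they occur in the same places?"""
    shared = set(cooc[w1]) & set(cooc[w2])
    dot = sum(cooc[w1][c] * cooc[w2][c] for c in shared)
    norm = (math.sqrt(sum(v * v for v in cooc[w1].values())) *
            math.sqrt(sum(v * v for v in cooc[w2].values())))
    return dot / norm if norm else 0.0

print(similarity("cat", "dog"))  # higher: they keep similar company in this toy text
print(similarity("cat", "ate"))  # lower: different distributional company
```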
Lauren: So, your work on the stochastic parrots paper really set the tone for this conversation in linguistics. And you’ve been continuing to talk about the issues and challenges with these large language models and other kinds of generative models because, obviously, similar processes are used for image creation, and we’ve only really talked about the text-based stuff, and there’s a whole bunch of things happening with audio and spoken language as well. But there’ll be heaps more of that on Mystery AI Hype Theater 3000, and also in your book The AI Con, which is coming out in spring 2025.
Emily: Yes, I am super excited for this book. It was a delight to work with Dr. Alex Hanna, who is my co-host on Mystery AI Hype Theater 3000, to put together a book that is for popular audiences. One of the things that I think worked really well is that she’s a sociologist, and I’m a linguist, and so we have different technical terms. We were able to basically catch each other, it’s like, “I don’t really know what that word means,” and so the general audience isn’t gonna know what that word means. Hopefully, it will be nice and accessible. The subtitle, by the way – so the title, The AI Con, and the subtitle is “How to Fight Big Tech’s Hype and Create the Future We Want.” It’ll be out in May of 2025.
Lauren: And it seems like, given the limitations of these big models, there’s still lots of space for the kind of symbolic grammar-processing work that you do.
Emily: Yes, there’s definitely space for symbolic grammar-based work, especially if you’re interested in something that will get a correct answer if it gets an answer at all, and you’re in a scenario where it’s okay to say, “No possibility here. Let’s send this on to a human,” for example. But also, there’s a lot of room for linguistics in designing better statistical natural language processing – in understanding what it is that the person is going to be doing with the computer, and how people relate to language – so that we can design systems that are not misleading but, in fact, are useful tools.
Lauren: If you could leave people knowing one thing about linguistics, what would it be?
Emily: In light of this conversation, the thing that I would want people to know is that linguistics is the area that lets us zoom in on language and pick apart the raindrops and understand their structure so that we can then zoom back out and have a better idea of what’s going on with the language in the world.
Lauren: Thank you so much for joining us today, Emily.
Emily: It’s been an absolute pleasure.
[Music]
Lauren: For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on all of the podcast platforms or lingthusiasm.com. You can get transcripts of every episode on lingthusiasm.com/transcripts. You can follow @lingthusiasm on all social media sites. You can get scarves with lots of linguistics patterns on them including IPA, branching tree diagrams, bouba and kiki, and our favourite esoteric Unicode symbols, plus other Lingthusiasm merch – like our “Etymology isn’t Destiny” t-shirts and Gavagai pin buttons – at lingthusiasm.com/merch.
My social media and blog is Superlinguo. Links to Gretchen’s social media can be found at gretchenmcculloch.com. Her blog is AllThingsLinguistic.com. Her book about internet language is called Because Internet.
Lingthusiasm is able to keep existing thanks to the support of our patrons. If you want to get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just want to help keep the show running ad-free, go to patreon.com/lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include behind-the-scenes on the Tom Scott Language Files with Tom and team, linguistics travel, and also xenolinguistics and what alien languages might be like. If you can’t afford to pledge, that’s okay, too. We really appreciate it if you can recommend Lingthusiasm to anyone in your life who’s curious about language.
Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, and our Editorial Assistant is Jon Kruk. Our music is “Ancient City” by The Triangles.
Emily: Stay lingthusiastic!
[Music]
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
river-taxbird · 1 year ago
Text
There is no such thing as AI.
How to help the non-technical and less online people in your life navigate the latest techbro grift.
I've seen other people say stuff to this effect but it's worth reiterating. Today in class, my professor was talking about a news article where a celebrity's likeness was used in an AI image without their permission. Then she mentioned a guest lecture about how AI is going to help finance professionals. Then I pointed out, those two things aren't really related.
The term AI is being used to obfuscate details about multiple semi-related technologies.
Traditionally in sci-fi, AI means artificial general intelligence like Data from Star Trek, or the Terminator. This, I shouldn't need to say, doesn't exist. Techbros use the term AI to trick investors into funding their projects. It's largely a grift.
What is the term AI being used to obfuscate?
If you want to help the less online and less tech literate people in your life navigate the hype around AI, the best way to do it is to encourage them to change their language around AI topics.
By calling these technologies what they really are, and encouraging the people around us to know the real names, we can help lift the veil, kill the hype, and keep people safe from scams. Here are some starting points, which I am just pulling from Wikipedia. I'd highly encourage you to do your own research.
Machine learning (ML): an umbrella term for solving problems for which developing algorithms by human programmers would be cost-prohibitive; instead, the problems are solved by helping machines "discover" their "own" algorithms, without needing to be explicitly told what to do by any human-developed algorithms. (This is the basis of most of the technology people call AI.)
Language model (LM or LLM): a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on. (This would be your ChatGPT.)
Generative adversarial network (GAN): a class of machine learning framework and a prominent framework for approaching generative AI. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss. (This is the source of some AI images and deepfakes.)
Diffusion models: models that generate the probability distribution of a given dataset. In image generation, a neural network is trained to denoise images with added Gaussian noise by learning to remove the noise. After the training is complete, it can then be used for image generation by starting with a random noise image and denoising that. (This is the more common technology behind AI images, including Dall-E and Stable Diffusion. I added this one to the post after, as it was brought to my attention it is now more common than GANs.)
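(A heavily simplified sketch of that last item – what a single diffusion training example looks like: noisy image in, noise out. The array sizes and noise schedule are invented for illustration, and no real network is trained here.)

```python
import numpy as np

rng = np.random.default_rng(0)
clean_image = rng.random((8, 8))        # stand-in for a real training image, values in [0, 1]

def make_training_pair(image, t, num_steps=1000):
    """Noise the image to level t; the denoising network's target is the noise itself."""
    keep = 1.0 - t / num_steps          # toy schedule: how much of the original survives
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(keep) * image + np.sqrt(1.0 - keep) * noise
    return noisy, noise                 # (network input, network target)

noisy, target = make_training_pair(clean_image, t=500)
# Generation runs the learned denoiser in reverse: start from pure noise and repeatedly
# strip away predicted noise until an image-like array remains.
```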
I know these terms are more technical, but they are also more accurate, and they can easily be explained in a way non-technical people can understand. The grifters are using language to give this technology its power, so we can use language to take its power away and let people see it for what it really is.
reasonsforhope · 2 years ago
Link
“A 27-year-old PhD scholar finally cracked the riddle which has defeated Sanskrit experts since the 5th Century BC—by decoding a rule taught by “the father of linguistics” Pāáč‡ini.
The discovery makes it possible to ‘derive’ any Sanskrit word—to construct millions of grammatically correct words including ‘mantra’ and ‘guru’—using Pāáč‡ini’s revered ‘language machine’ which is widely considered to be one of the great intellectual achievements in history.
Leading Sanskrit scholars have described the discovery as ‘revolutionary’—and it now means that Pāáč‡ini’s grammar can be taught to computers for the first time...
Pāáč‡ini’s system—4,000 rules detailed in his greatest work, the AáčŁáč­ÄdhyāyÄ« which is thought to have been written around 500 BC—is meant to work like a machine. Feed in the base and suffix of a word and it should turn them into grammatically correct words and sentences through a step-by-step process.
However, until now, there had been a huge problem. Scientists say that, often, two or more of Pāáč‡ini’s rules are simultaneously applicable at the same step, leaving scholars to agonize over which one to choose...
Thought to have lived in a region in what is now north-west Pakistan and south-east Afghanistan, Pāáč‡ini taught a ‘metarule’ to help decide which rule should be applied in the event of a conflict...
Traditionally, scientists have interpreted Pāáč‡ini’s metarule as meaning: in the event of a conflict between two rules of equal strength, the rule that comes later in the grammar’s serial order wins.
Rajpopat rejects this, arguing instead that Pāáč‡ini meant that, between rules applicable to the left and right sides of a word respectively, Pāáč‡ini wanted us to choose the rule applicable to the right side. Employing this interpretation, Rajpopat found Pāáč‡ini’s language machine produced grammatically correct words with almost no exceptions...
“This discovery will revolutionize the study of Sanskrit at a time when interest in the language is on the rise.”
Sanskrit is an ancient and classical Indo-European language from South Asia. It is the sacred language of Hinduism, but also the medium through which much of India’s greatest science, philosophy, poetry, and other secular literature have been communicated for centuries.
While only spoken in India by an estimated 25,000 people today, Sanskrit has influenced many other languages and cultures around the world.
Rajpopat, who was born in Mumbai and learned Sanskrit in high school, explained, “Some of the most ancient wisdom of India has been produced in Sanskrit and we still don’t fully understand what our ancestors achieved.
“I hope this discovery will infuse students in India with confidence, pride, and hope that they too can achieve great things.”
He said that a major implication of his discovery is that now we have the algorithm that runs Pāáč‡ini’s grammar, we could potentially teach this grammar to computers.
“Computer scientists working on Natural language processing gave up on rule-based approaches over 50 years ago. So teaching computers how to combine the speaker’s intention with Pāáč‡ini’s rule-based grammar to produce human speech would be a major milestone in the history of human interaction with machines, as well as in India’s intellectual history.”” -via Good News Network, 12/16/22
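(A toy sketch of the conflict-resolution reading described in the article above – the rule names and positions here are invented placeholders, not Pāáč‡ini's actual rules.)

```python
# Rajpopat's reading of the metarule: when two rules are applicable at the same step,
# prefer the one that applies to the right-hand part of the word.
def pick_rule(applicable):
    """applicable: list of (rule_name, position_where_it_applies) pairs."""
    return max(applicable, key=lambda rule: rule[1])   # rightmost application wins

conflict = [("rule_affecting_left_part", 1), ("rule_affecting_right_part", 4)]
print(pick_rule(conflict))   # -> ('rule_affecting_right_part', 4)
```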
underlockv · 1 year ago
Text
Interestingly enough, I think calling large language models a.i. is doing too much to humanize them. Because of how scifi literature has built up a.i. as living beings with actual working thought processes deserving of the classification of person (Bicentennial Man, etc.), a lot of people want to view a.i. as entities. And corporations pushing a.i. can take advantage of your soft feelings toward it like that. But LLMs are nowhere close to that, and tbh I don't even feel the way they learn approaches it. Word-order guessing machines can logic their way to a regular-sounding sentence, but that's not anything approaching having a conversation with a person. Remembering what you said is just storing the information you are typing into it; it's not any kind of indication of existence. And yet, so many people online are acting like when my grandma was convinced Siri was actually a lady living in her phone. I think we need to start calling large language models "LLMs" and not giving the corps pushing them more of an in with the general public. It's marketing spin, stop falling for it.
awkward-teabag · 8 months ago
Text
I have to wonder how many people celebrating AI translation also complain about "broken English" and how obvious it is something was Google translated from another language without a fluent English speaker involved to properly clean up the translation/grammar.
Because I bet it's a lot.
I know why execs are all for it—AI is the new buzzword and it lets them cut jobs thus "save" money and not have to worry about pesky labour laws when one employs humans—but everyone else?
There was some outcry when Crunchyroll fired many of their translators in favour of AI translation (with some people to "clean up the AI's work") but I can't help but think that was in part because it was Japanese-to-English and personally affected them. Same when Duolingo fired many of their translators in favour of LLM translation. Meanwhile companies are firing staff when it's English to another language and there's this idea that that's fine or not as big a deal because English is "easy" to translate and/or because people don't think of how it will impact people in non-English countries.
Also it doesn't affect native English speakers so it doesn't get much headway in the news cycle or online anyway because so much of the dominant media is from English-speaking countries and English-speakers dominate social media.
But different languages have different grammar structures that LLMs don't do, and I grew up on "jokes" about people speaking in "broken English" and mocking people who use the wrong word when it was clearly a literal translation but the meaning was obvious long before LLMs were a thing, too. In fact, the specific way a character spoke broken English has been a way to denote their native tongue for decades, usually in a racist way.
Then Google translate came out and "Google-translated English" became an insult for people and criticism of companies because it was clearly wonky to native speakers. Even now, LLMs—which are heavily trained on English compared to other languages—don't have a natural output so native English speakers can clock LLM-generated text if it's longer than a sentence or two.
But, for whatever reason, it's not seen as a problem when it goes the other way because fuck non-English readers or people who want to read in their native tongue I guess.
#and it's not like no people were doing translations so wonky translations were better than nothing#it's actual translators being fired for a subpar replacement#and anyone who keeps their job suddenly being responsible for cleaning up llm output rather than what they trained in#(which can take just as much time or longer than doing the translation by hand from scratch)#(if you want it done right anyway)#hell to this day i hear people complain about written translations of indigenous words and how they 'aren't english enough'#even though they're using the ipa and use a system white english people came up with in the first place#and you can easily look up the proper pronunciation and hear it spoken#but there's such a double-standard where it's expected that other languages cater to english/english speakers#but that grace and accommodation doesn't go the other way#and it's the failing of non-english speakers when an english translation is broken#you see it whenever monolingual english speakers travel to other countries and utterly refuse to learn the language#but if someone doesn't speak in unaccented (to them) english fluently in their home country the person 'isn't trying hard enough'#this is just the new version of that where non-english speakers are supposed to do more work and put up with subpar translations#even as a native english speaker/writer i get a (much) lesser version of this because i write with canadian spelling#and some people get pissed if their internet experience is disrupted by 'ou' instead of 'o' or '-re' instead of '-er'#because dialects and regional phrasing/spelling is a thing#human translators can (or should) be able to account for it but llms are not smart enough to do so#and that's not even getting into slang and how llms don't account for it#or how llms can put slurs into translations because it doesn't do nuance or context and doesn't know the language#if you ever complained about buying something from another country that came with machine-translated instructions#you should be pissed at companies cutting english-to-[language] staff in favour of glorified google translate#because the companies are effectively saying they're fine with non-native speakers getting a wonky/broken version
linguisticdiscovery · 1 year ago
Text
Cetacean Translation Initiative
The Cetacean Translation Initiative (CETI) is a nonprofit team of researchers applying advanced machine learning to understand whale communication!
If you want to learn more about animal communication, check out my curated list of pop science books on Animal Communication & Cognition!