#machine language learning
Text
Lingthusiasm Episode 98: Helping computers decode sentences - Interview with Emily M. Bender
When a human learns a new word, we're learning to attach that word to a set of concepts in the real world. When a computer "learns" a new word, it is creating some associations between that word and other words it has seen before, which can sometimes give it the appearance of understanding, but it doesn't have that real-world grounding, which can sometimes lead to spectacular failures: hilariously implausible from a human perspective, just as plausible from the computer's.
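A minimal sketch of that word-association idea (a toy illustration, not any particular model): a co-occurrence counter whose entire "world" is other words, which is roughly what statistical language models scale up to enormous size.

```python
from collections import Counter, defaultdict

# Toy corpus: the only "world" this model ever sees is other words.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog chased the cat",
]

# Count which words appear next to which (a crude stand-in for the
# associations a language model learns at vast scale).
cooccur = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for neighbor in words[max(0, i - 1):i] + words[i + 1:i + 2]:
            cooccur[w][neighbor] += 1

# "cat" is now associated with "the", "sat", "chased" -- but nothing
# here connects "cat" to any actual cat in the real world.
print(cooccur["cat"].most_common(3))
```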
In this episode, your host Lauren Gawne gets enthusiastic about how computers process language with Dr. Emily M. Bender, who is a linguistics professor at the University of Washington, USA, and cohost of the podcast Mystery AI Hype Theater 3000. We talk about Emily's work trying to formulate a list of rules that a computer can use to generate grammatical sentences in a language, the differences between that and training a computer to generate sentences using the statistical likelihood of what comes next based on all the other sentences, and the further differences between both those things and how humans map language onto the real world. We also talk about paying attention to communities not just data, the labour practices behind large language models, and how Emily's persistent questions led to the creation of the Bender Rule (always state the language you're working on, even if it's English).
Click here for a link to this episode in your podcast player of choice or read the transcript here.
Announcements: The 2024 Lingthusiasm Listener Survey is here! It's a mix of questions about who you are as our listener, as well as some fun linguistics experiments for you to participate in. If you have taken the survey in previous years, there are new questions, so you can participate again this year.
In this month's bonus episode we get enthusiastic about three places where we can learn things about linguistics! We talk about two linguistically interesting museums that Gretchen recently visited: the Estonian National Museum, as well as Mundolingua, a general linguistics museum in Paris. We also talk about Lauren's dream linguistics travel destination: Martha's Vineyard.
Join us on Patreon now to get access to this and 90+ other bonus episodes. You'll also get access to the Lingthusiasm Discord server where you can chat with other language nerds.
Also, Patreon now has gift memberships! If you'd like to get a gift subscription to Lingthusiasm bonus episodes for someone you know, or if you want to suggest them as a gift for yourself, here's how to gift a membership.
Here are the links mentioned in the episode:
Emily Bender
Emily Bender on Bluesky and Twitter
Mystery AI Hype Theater 3000
Mystery AI Hype Theater 3000: The Newsletter
The AI Con by Emily M. Bender and Alex Hanna
'Data Sovereignty and the Kaitiakitanga License' on Te Hiku
wordfreq by Robyn Speer on GitHub
Lingthusiasm Episode 'Making machines learn language - Interview with Janelle Shane'
Bonus with Janelle Shane: we do a dramatic reading of the funniest auto-generated Lingthusiasm episodes
You can listen to this episode via Lingthusiasm.com, Soundcloud, RSS, Apple Podcasts/iTunes, Spotify, YouTube, or wherever you get your podcasts. You can also download an mp3 via the Soundcloud page for offline listening.
To receive an email whenever a new episode drops, sign up for the Lingthusiasm mailing list.
You can help keep Lingthusiasm ad-free, get access to bonus content, and more perks by supporting us on Patreon.
Lingthusiasm is on Bluesky, Twitter, Instagram, Facebook, Mastodon, and Tumblr. Email us at contact [at] lingthusiasm [dot] com
Gretchen is on Bluesky as @GretchenMcC and blogs at All Things Linguistic.
Lauren is on Bluesky as @superlinguo and blogs at Superlinguo.
Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our production editor is Sarah Dopierala, our production assistant is Martha Tsutsui Billins, and our editorial assistant is Jon Kruk. Our music is 'Ancient City' by The Triangles.
This episode of Lingthusiasm is made available under a Creative Commons Attribution Non-Commercial Share Alike license (CC BY-NC-SA 4.0).
#linguistics#language#lingthusiasm#podcast#podcasts#episodes#episode 98#Emily M Bender#interview#machine language learning#language learning#ai#artificial intelligence#SoundCloud
Text
AI hasn't improved in 18 months. It's likely that this is it. There is currently no evidence the capabilities of ChatGPT will ever improve. It's time for AI companies to put up or shut up.
I'm just reiterating this excellent post from Ed Zitron, but it hasn't left my head since I read it and I want to share it. I'm also taking some talking points from Ed's other posts. So basically:
We keep hearing AI is going to get better and better, but these promises seem to be coming from a mix of companies engaging in wild speculation and lying.
ChatGPT, the industry-leading large language model, has not materially improved in 18 months. For something that claims to be getting exponentially better, it sure is the same shit.
Hallucinations appear to be an inherent aspect of the technology. Since it's based on statistics and the AI doesn't know anything, it can never know what is true. How could I possibly trust it to get any real work done if I can't rely on its output? If I have to fact-check everything it says, I might as well do the work myself.
For "real" AI that does know what is true to exist, it would require us to discover new concepts in psychology, math, and computing, which OpenAI is not working on, and seemingly no other AI companies are either.
OpenAI has already seemingly slurped up all the data from the open web. ChatGPT-5 would take 5x more training data than ChatGPT-4 to train. Where is this data coming from, exactly?
Since improvement appears to have ground to a halt, what if this is it? What if ChatGPT-4 is as good as LLMs can ever be? What use is it?
As Jim Covello, a leading semiconductor analyst at Goldman Sachs, said (on page 10, and that's big finance, so you know they only care about money): if tech companies are spending a trillion dollars to build up the infrastructure to support AI, what trillion-dollar problem is it meant to solve? AI companies have a unique talent for burning venture capital, and it's unclear if OpenAI will be able to survive more than a few years unless everyone suddenly adopts it all at once. (Hey, didn't crypto and the metaverse also require spontaneous mass adoption to make sense?)
There is no problem that current AI is a solution to. Consumer tech is basically solved; normal people don't need more tech than a laptop and a smartphone. Big tech has run out of innovations, and they are desperately looking for the next thing to sell. It happened with the metaverse and it's happening again.
In summary:
AI hasn't materially improved since the launch of ChatGPT-4, which wasn't that big of an upgrade over 3.
There is currently no technological roadmap for AI to become better than it is. (As Jim Covello said in the Goldman Sachs report, the evolution of smartphones was openly planned years ahead of time.) The current problems are inherent to the current technology, and nobody has indicated there is any way to solve them in the pipeline. We have likely reached the limits of what LLMs can do, and they still can't do much.
Don't believe AI companies when they say things are going to improve from where they are now until they provide evidence. It's time for the AI shills to put up or shut up.
Text
"As a Deaf man, Adam Munder has long been advocating for communication rights in a world that chiefly caters to hearing people.Â
The Intel software engineer and his wife â who is also Deaf â are often unable to use American Sign Language in daily interactions, instead defaulting to texting on a smartphone or passing a pen and paper back and forth with service workers, teachers, and lawyers.Â
It can make simple tasks, like ordering coffee, more complicated than it should be.Â
But there are life events that hold greater weight than a cup of coffee.Â
Recently, Munder and his wife took their daughter in for a doctorâs appointment â and no interpreter was available.Â
To their surprise, their doctor said: âItâs alright, weâll just have your daughter interpret for you!â ...
That day at the doctorâs office came at the heels of a thousand frustrating interactions and miscommunications â and Munder is not isolated in his experience.
âWhere I live in Arizona, there are more than 1.1 million individuals with a hearing loss,â Munder said, âand only about 400 licensed interpreters.â
In addition to being hard to find, interpreters are expensive. And texting and writing arenât always practical options â they leave out the emotion, detail, and nuance of a spoken conversation.Â
ASL is a rich, complex language with its own grammar and culture; a subtle change in speed, direction, facial expression, or gesture can completely change the meaning and tone of a sign.Â
'Writing back and forth on paper and pen or using a smartphone to text is not equivalent to American Sign Language,' Munder emphasized. 'The details and nuance that make us human are lost in both our personal and business conversations.'
His solution? An AI-powered platform called Omnibridge.
'My team has established this bridge between the Deaf world and the hearing world, bringing these worlds together without forcing one to adapt to the other,' Munder said.
Trained on thousands of signs, Omnibridge is engineered to transcribe spoken English and interpret sign language on screen in seconds...
'Our dream is that the technology will be available to everyone, everywhere,' Munder said. 'I feel like three to four years from now, we're going to have an app on a phone. Our team has already started working on a cloud-based product, and we're hoping that will be an easy switch from cloud to mobile to an app.' ...
At its heart, Omnibridge is a testament to the positive capabilities of artificial intelligence. "
-via GoodGoodGood, October 25, 2024. More info below the cut!
To test an alpha version of his invention, Munder welcomed TED associate Hasiba Haq on stage.
'I want to show you how this could have changed my interaction at the doctor appointment, had this been available,' Munder said.
He went on to explain that the software would generate a bi-directional conversation, in which Munder's signs would appear as blue text and spoken word would appear in gray.
At first, there was a brief hiccup on the TED stage. Haq, who was standing in as the doctor's office receptionist, spoke, but the screen remained blank.
'I don't believe this; this is the first time that AI has ever failed,' Munder joked, getting a big laugh from the crowd. 'Thanks for your patience.'
After a quick reboot, they rolled with the punches and tried again.
Haq asked: 'Hi, how's it going?'
Her words popped up in blue.
Munder signed in reply: 'I am good.'
His response popped up in gray.
Back and forth, they recreated the scene from the doctor's office. But this time Munder retained his autonomy, and no one suggested a 7-year-old should play interpreter.
Munder's TED debut and tech demonstration didn't happen overnight; the engineer has been working on Omnibridge for over a decade.
'It takes a lot to build something like this,' Munder told Good Good Good in an exclusive interview, communicating with our team in ASL. 'It couldn't just be one or two people. It takes a large team, a lot of resources, millions and millions of dollars to work on a project like this.'
After five years of pitching and research, Intel handpicked Munder's team for a specialty training program. It was through that backing that Omnibridge began to truly take shape...
'Our dream is that the technology will be available to everyone, everywhere,' Munder said. 'I feel like three to four years from now, we're going to have an app on a phone. Our team has already started working on a cloud-based product, and we're hoping that will be an easy switch from cloud to mobile to an app.'
In order to achieve that dream – of transposing their technology to a smartphone – Munder and his team have to play a bit of a waiting game. Today, their platform necessitates building the technology on a PC, with an AI engine.
'A lot of things don't have those AI PC types of chips,' Munder explained. 'But as the technology evolves, we expect that smartphones will start to include AI engines. They'll start to include the capability in processing within smartphones. It will take time for the technology to catch up to it, and it probably won't need the power that we're requiring right now on a PC.'
At its heart, Omnibridge is a testament to the positive capabilities of artificial intelligence.
But it is more than a transcription service; it allows people to have face-to-face conversations with each other. There's a world of difference between passing around a phone or pen and paper and looking someone in the eyes when you speak to them.
It also allows Deaf people to speak ASL directly, without doing the mental gymnastics of translating their words into English.
'For me, English is my second language,' Munder told Good Good Good. 'So when I write in English, I have to think: How am I going to adjust the words? How am I going to write it just right so somebody can understand me? It takes me some time and effort, and it's hard for me to express myself actually in doing that. This technology allows someone to be able to express themselves in their native language.'
Ultimately, Munder said that Omnibridge is about 'bringing humanity back' to these conversations.
'We're changing the world through the power of AI, not just revolutionizing technology, but enhancing that human connection,' Munder said at the end of his TED Talk.
'It's two languages,' he concluded, 'signed and spoken, in one seamless conversation.'"
-via GoodGoodGood, October 25, 2024
#ai#pro ai#deaf#asl#disability#translation#disabled#hard of hearing#hearing impairment#sign language#american sign language#languages#tech news#language#communication#good news#hope#machine learning
Text
How plausible sentence generators are changing the bullshit wars
This Friday (September 8) at 10hPT/17hUK, I'm livestreaming "How To Dismantle the Internet" with Intelligence Squared.
On September 12 at 7pm, I'll be at Toronto's Another Story Bookshop with my new book The Internet Con: How to Seize the Means of Computation.
In my latest Locus Magazine column, "Plausible Sentence Generators," I describe how I unwittingly came to use, and even be impressed by, an AI chatbot, and what this means for a specialized, highly salient form of writing, namely, "bullshit":
https://locusmag.com/2023/09/commentary-by-cory-doctorow-plausible-sentence-generators/
Here's what happened: I got stranded at JFK due to heavy weather and an air-traffic control tower fire that locked down every westbound flight on the east coast. The American Airlines agent told me to try going standby the next morning, and advised that if I booked a hotel and saved my taxi receipts, I would get reimbursed when I got home to LA.
But when I got home, the airline's reps told me they would absolutely not reimburse me, that this was their policy, and they didn't care that their representative had promised they'd make me whole. This was so frustrating that I decided to take the airline to small claims court: I'm no lawyer, but I know that a contract takes place when an offer is made and accepted, and so I had a contract, and AA was violating it, and stiffing me for over $400.
The problem was that I didn't know anything about filing a small claim. I've been ripped off by lots of large American businesses, but none had pissed me off enough to sue – until American broke its contract with me.
So I googled it. I found a website that gave step-by-step instructions, starting with sending a "final demand" letter to the airline's business office. They offered to help me write the letter, and so I clicked and I typed and I wrote a pretty stern legal letter.
Now, I'm not a lawyer, but I have worked for a campaigning law-firm for over 20 years, and I've spent the same amount of time writing about the sins of the rich and powerful. I've seen a lot of threats, both those received by our clients and sent to me.
I've been threatened by everyone from Gwyneth Paltrow to Ralph Lauren to the Sacklers. I've been threatened by lawyers representing the billionaire who owned NSO Group, the notorious cyber arms-dealer. I even got a series of vicious, baseless threats from lawyers representing LAX's private terminal.
So I know a thing or two about writing a legal threat! I gave it a good effort and then submitted the form, and got a message asking me to wait for a minute or two. A couple minutes later, the form returned a new version of my letter, expanded and augmented. Now, my letter was a little scary – but this version was bowel-looseningly terrifying.
I had unwittingly used a chatbot. The website had fed my letter to a Large Language Model, likely ChatGPT, with a prompt like, "Make this into an aggressive, bullying legal threat." The chatbot obliged.
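Mechanically, there's not much to a site like that. Here's a hedged sketch of the kind of wrapper it might run; the endpoint, credential, prompt wording, and response shape are all placeholder assumptions, not anything confirmed in this account.

```python
import requests

# Hypothetical sketch of a "letter escalation" wrapper around an LLM API.
LLM_API_URL = "https://example.com/v1/complete"  # placeholder endpoint
API_KEY = "sk-..."                               # placeholder credential

def escalate_letter(user_letter: str) -> str:
    """Wrap the user's draft in a prompt and return the model's rewrite."""
    response = requests.post(
        LLM_API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            # Assumed prompt, per the guess above about what the site did.
            "prompt": "Rewrite this as an aggressive, bullying legal threat:\n\n"
                      + user_letter,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["text"]  # assumed response shape
```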
I don't think much of LLMs. After you get past the initial party trick of getting something like, "instructions for removing a grilled-cheese sandwich from a VCR in the style of the King James Bible," the novelty wears thin:
https://www.emergentmind.com/posts/write-a-biblical-verse-in-the-style-of-the-king-james
Yes, science fiction magazines are inundated with LLM-written short stories, but the problem there isn't merely the overwhelming quantity of machine-generated stories – it's also that they suck. They're bad stories:
https://www.npr.org/2023/02/24/1159286436/ai-chatbot-chatgpt-magazine-clarkesworld-artificial-intelligence
LLMs generate naturalistic prose. This is an impressive technical feat, and the details are genuinely fascinating. This series by Ben Levinstein is a must-read peek under the hood:
https://benlevinstein.substack.com/p/how-to-think-about-large-language
But "naturalistic prose" isn't necessarily good prose. A lot of naturalistic language is awful. In particular, legal documents are fucking terrible. Lawyers affect a stilted, stylized language that is both officious and obfuscated.
The LLM I accidentally used to rewrite my legal threat transmuted my own prose into something that reads like it was written by a $600/hour paralegal working for a $1500/hour partner at a white-shoe law firm. As such, it sends a signal: "The person who commissioned this letter is so angry at you that they are willing to spend $600 to get you to cough up the $400 you owe them. Moreover, they are so well-resourced that they can afford to pursue this claim beyond any rational economic basis."
Let's be clear here: these kinds of lawyer letters aren't good writing; they're a highly specific form of bad writing. The point of this letter isn't to parse the text, it's to send a signal. If the letter was well-written, it wouldn't send the right signal. For the letter to work, it has to read like it was written by someone whose prose-sense was irreparably damaged by a legal education.
Here's the thing: the fact that an LLM can manufacture this once-expensive signal for free means that the signal's meaning will shortly change, forever. Once companies realize that this kind of letter can be generated on demand, it will cease to mean, "You are dealing with a furious, vindictive rich person." It will come to mean, "You are dealing with someone who knows how to type 'generate legal threat' into a search box."
Legal threat letters are in a class of language formally called "bullshit":
https://press.princeton.edu/books/hardcover/9780691122946/on-bullshit
LLMs may not be good at generating science fiction short stories, but they're excellent at generating bullshit. For example, a university prof friend of mine admits that they and all their colleagues are now writing grad student recommendation letters by feeding a few bullet points to an LLM, which inflates them with bullshit, adding puffery to swell those bullet points into lengthy paragraphs.
Naturally, the next stage is that profs on the receiving end of these recommendation letters will ask another LLM to summarize them by reducing them to a few bullet points. This is next-level bullshit: a few easily-grasped points are turned into a florid sheet of nonsense, which is then reconverted into a few bullet-points again, though these may only be tangentially related to the original.
What comes next? The reference letter becomes a useless signal. It goes from being a thing that a prof has to really believe in you to produce, whose mere existence is thus significant, to a thing that can be produced with the click of a button, and then it signifies nothing.
We've been through this before. It used to be that sending a letter to your legislative representative meant a lot. Then, automated internet forms produced by activists like me made it far easier to send those letters and lawmakers stopped taking them so seriously. So we created automatic dialers to let you phone your lawmakers, this being another once-powerful signal. Lowering the cost of making the phone call inevitably made the phone call mean less.
Today, we are in a war over signals. The actors and writers who've trudged through the heat-dome up and down the sidewalks in front of the studios in my neighborhood are sending a very powerful signal. The fact that they're fighting to prevent their industry from being enshittified by plausible sentence generators that can produce bullshit on demand makes their fight especially important.
Chatbots are the nuclear weapons of the bullshit wars. Want to generate 2,000 words of nonsense about "the first time I ate an egg," to run overtop of an omelet recipe you're hoping to make the number one Google result? ChatGPT has you covered. Want to generate fake complaints or fake positive reviews? The Stochastic Parrot will produce 'em all day long.
As I wrote for Locus: "None of this prose is good, none of it is really socially useful, but there's demand for it. Ironically, the more bullshit there is, the more bullshit filters there are, and this requires still more bullshit to overcome it."
Meanwhile, AA still hasn't answered my letter, and to be honest, I'm so sick of bullshit I can't be bothered to sue them anymore. I suppose that's what they were counting on.
If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
https://pluralistic.net/2023/09/07/govern-yourself-accordingly/#robolawyers
Image: Cryteria (modified) https://commons.wikimedia.org/wiki/File:HAL9000.svg
CC BY 3.0
https://creativecommons.org/licenses/by/3.0/deed.en
#pluralistic#chatbots#plausible sentence generators#robot lawyers#robolawyers#ai#ml#machine learning#artificial intelligence#stochastic parrots#bullshit#bullshit generators#the bullshit wars#llms#large language models#writing#Ben Levinstein
Text
I suppose the thin silver lining to the discoverability of online resources going to shit because of SEO exploitation is that all the folks who responded to reasonable questions with snarky "let me Google that for you" links – links which now lead to nothing but AI-generated gibberish – look like real assholes now.
Text
Гей, соколи
You won't believe what Google Translate did to the Ukrainian title of this 19th-century song, "Гей, соколи" ("Hey, Falcons" – it's very, very popular in Poland too, as "Hej sokoły"; it's one of these things we share)
I won't even comment on other parts of the translation which is... er... but from now on I'm going to call this song "Gay Falcons", and I snorted tea through my nose at work.
Text
#COPILOT#TECHNOLOGY#AI#ARTIFICIAL INTELLIGENCE#MACHINE LEARNING#LARGE LANGUAGE MODEL#LARGE LANGUAGE MODELS#LLM#LLMS#MICROSOFT#TERRA FIRMA#COMPUTER#COMPUTERS#CODE#COMPUTER CODE#EARTH#PLANET EARTH
Text
AUTOMATIC CLAPPING XBOX TERMINATOR GENISYS
#automatic#clapping#automatic clapping#xbox#xbox terminator#terminator#terminator genisys#taylor swift#genisys#automatic clapping xbox#automatic clapping xbox terminator#xbox terminator genisys#emilia clarke#arnold schwarzenegger#chris pine#star trek#star wars#star trek 2009#facebook#facebook llama#facebook llama large language model machine learning and artificial intelligence#artificial intelligence#machine learning#llama.meta#robot#robots#boston dynamics#boston dynamics atlas#boston dynamics spot#data
Text
A phonetic alphabet for sperm whales proposed by Daniela Rus, Antonio Torralba and Jacob Andreas.
The open-access study, published in Nature Communications under the title 'Contextual and combinatorial structure in sperm whale vocalisations', analyses sperm whale vocalisations and proposes a phonetic 'alphabet' for them.
So cool. Dolphins next please.
MIT News also did an article on this.
#whale#sperm whale#cetacean#language#animal language#animal vocalisation#click language#clicks#inspiration#conlang#language creation#spec bio#speculative biology#speculative evolution#speculative worldbuilding#speculative fiction#inspo#machine learning#ai#communication#phonetic#alphabet#so cool#cool shit#nature communications
Link
MW: I don't think there's any evidence that large machine learning models – which rely on huge amounts of surveillance data and the concentrated computational infrastructure that only a handful of corporations control – have the spark of consciousness.
We can still unplug the servers, the data centers can flood as the climate encroaches, we can run out of the water to cool the data centers, the surveillance pipelines can melt as the climate becomes more erratic and less hospitable.
I think we need to dig into what is happening here, which is that, when faced with a system that presents itself as a listening, eager interlocutor that's hearing us and responding to us, we seem to fall into a kind of trance in relation to these systems, and almost counterfactually engage in some kind of wish fulfillment: thinking that they're human, and there's someone there listening to us. It's like when you're a kid, and you're telling ghost stories, something with a lot of emotional weight, and suddenly everybody is terrified and reacting to it. And it becomes hard to disbelieve.
FC: What you said just now – the idea that we fall into a kind of trance – what I'm hearing you say is that's distracting us from actual threats like climate change or harms to marginalized people.
MW: Yeah, I think it's distracting us from what's real on the ground and much harder to solve than war-game hypotheticals about a thing that is largely kind of made up. And particularly, it's distracting us from the fact that these are technologies controlled by a handful of corporations who will ultimately make the decisions about what technologies are made, what they do, and who they serve. And if we follow these corporations' interests, we have a pretty good sense of who will use it, how it will be used, and where we can resist to prevent the actual harms that are occurring today and likely to occur.
Text
Transcript Episode 98: Helping computers decode sentences - Interview with Emily M. Bender
This is a transcript for Lingthusiasm episode 'Helping computers decode sentences - Interview with Emily M. Bender'. It's been lightly edited for readability. Listen to the episode here or wherever you get your podcasts. Links to studies mentioned and further reading can be found on the episode show notes page.
[Music]
Lauren: Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics! I'm Lauren Gawne. Today, we're getting enthusiastic about computers and linguistics with Professor Emily M. Bender.
But first, November is our traditional anniversary month! This year, we're celebrating eight years of Lingthusiasm. Thank you for sharing your enthusiasm for linguistics with us. We're also running a Lingthusiasm listener survey for the third and final time. As part of our anniversary celebrations, we're running the survey as a way to learn more about our listeners, get your suggestions for topics, and to run some linguistics experiments. If you did the survey in a previous year, there're new questions, so you can totally participate again this year. There's also a spot for asking us your linguistics advice questions, since our first linguistics advice bonus episode was so popular.
You can hear about the results of the previous surveys in two bonus episodes, which we'll link to in the show notes. We'll have the results from this year's survey in an episode for you next year. To do the survey or read more details, go to bit.ly/lingthusiasmsurvey24 – that's bit.ly/lingthusiasmsurvey24 (the numbers 2 and 4) – before December 15 anywhere on Earth. This project has ethics board approval from La Trobe University, and we're already including results from previous surveys into some academic papers. You, too, could be part of science if you do the survey.
Our most recent bonus episode was a linguistics travelogue. We discuss Gretchen's recent trip to Europe where she saw cool language museums, and what she did to prepare for encountering several different languages on the way, as well as planning our fantasy linguistic excursion to Martha's Vineyard. Go to patreon.com/lingthusiasm to hear this and many more bonus episodes and to help keep the show running ad-free.
Also, very exciting news from Patreon, which is that they're finally adding the ability to buy Patreon memberships as a gift for someone else. If you'd be excited to receive a Patreon membership to Lingthusiasm as a gift, we'll have a link in the show notes for you to forward to your friends and/or family with a little wink wink, nudge nudge. We also have lots of Lingthusiasm merch that makes a great gift for the linguistics enthusiast in your life.
[Music]
Lauren: Today, I am delighted to be joined by Emily M. Bender, who is a professor at the University of Washington in the Department of Linguistics. She is the director of the Computational Linguistics Laboratory there. Emily's research and teaching expertise is in multilingual grammar engineering and societal impacts of language technologies. She runs the live-streaming podcast Mystery AI Hype Theater 3000 with sociologist Dr. Alex Hanna. Welcome to the show, Emily!
Emily: I am so enthusiastic to be on Lingthusiasm.
Lauren: We are so delighted to have you here today. Before we ask you about some of your current work with computational linguistics, how did you get into linguistics?
Emily: It was a while ago. Back when I was in high school, we didn't have things like the Lingthusiasm podcast – or podcasts for that matter – to spread the word about what linguistics was. I actually hadn't heard about linguistics until I got to university. Someone gave me the excellent advice to get the course catalogue ahead of time – it was a physical book in those days – and just flip through it and circle anything that looked interesting. There was this one class called 'An Introduction to Language.' In my second term, I was looking for a class that would fulfil some kind of requirements, and it did, and I took it. Let me tell you, I was hooked on the first day. Even though the first day was actually about the bee dance and other animal communication, I just fell in love with it immediately. I think, honestly, I had always been a linguist. I loved studying languages. My ideal undergraduate course of study would've been, like, take the first year of all the languages I could.
Lauren: That would be an amazing degree. Just like, 'I have a bachelor's in introductory language.'
Emily: Yeah, I mean, speaking now as a university educator, I think there's some things missing from that, but as a linguist, how much fun would that be. I didn't know there was a way to study how languages work without studying all the languages. When I found it, I was just thrilled.
Lauren: Excellent. I think that's such a typical experience of a lot of people who get to university, and they're intrigued by something that's like, 'How can it be an intro to language when I've learnt a bunch of languages?' And then you discover there's linguistics, which brings you into the whole systematic nature of things.
Emily: Absolutely. My other favourite story to tell about this is I have a memory of being 11 or 12 and daydreaming and trying to figure out what the difference was between a consonant and a vowel.
Lauren: Amazing.
Emily: Because we were taught the alphabet. There's five vowels and sometimes Y, and the other ones are consonants. What's the difference? My regret with this story is that I didn't record what it was that I came up with. I have no idea if I was anywhere near the right track. But I don't think that your average non-linguist does things like that.
Lauren: That's extremely proto-linguist behaviour. I love it. I'm sad we don't have 11-year-old Emily's figuring-out of the IPA from first principles.
Emily: Emily who definitely went on to be a syntax / semantics side linguist and not a phonetics / phonology side linguist.
Lauren: How did you become a syntax-semantics linguist? How did you get into your research topic of interest?
Emily: In undergrad, it was definitely the syntax class that I connected with the most. I got to study Construction Grammar with Chuck Fillmore and Paul Kay at UC Berkeley, which was amazing, and sort of was aware at the time that at Stanford there was work going on on two other frameworks called Lexical-Functional Grammar and Head-Driven Phrase-Structure Grammar. These are different ways of building up representations of language. I went to grad school at Stanford with the idea that I was going to create generalised Bay Area grammar and bring together everything that was best about each of the frameworks. They are similar in spirit. They're sometimes described as 'cousins.' Then I got to Stanford, and I took a class with Joan Bresnan on Lexical-Functional Grammar and a class with Ivan Sag on Head-Driven Phrase-Structure Grammar. I realised that it's actually really valuable to have different toolkits because they help you focus on different aspects of the grammars of languages. Merging them all together really wasn't gonna be a valuable thing to do.
Lauren: It's good that you could see what each of them was bringing to it – that we have syntax, and there's structure, but different ways of explaining it give different perspectives on things.
Emily: Exactly, and lead linguists to want to go explore different things about different languages. If you're working with Lexical-Functional Grammar, then languages that do radical things with their word order, like some of the languages of Australia, are particularly interesting, and languages that put a lot of information into the morphology – so the parts of the words – are really interesting. If you're doing Head-Driven Phrase-Structure Grammar, then it's things like getting deep into the idiosyncrasies of particular languages – the idioms and the sub-patterns – and making them work together with the major patterns is a big focus of HPSG. You're just gonna work on different problems using the different frameworks.
Lauren: I love that. An incredibly annoying undergraduate proto-linguist behaviour I still remember in my syntax class – because you learn to draw syntax trees. One of my fellow students and I were like, 'Trees are fine, but we need to keep extending them down because they only go as far as words,' and there's all this stuff happening in the morphology. We thought we were very clever for having this very clever thought. We were very lucky that our syntax professor was Rachel Nordlinger, who is another person who works with Lexical-Functional Grammar, which, as you said, is really interested in morphology. You could tell she was just like, 'You guys are gonna be so happy when we get to advanced syntax, but just hold on. We're just doing trees for now.' That's how I got introduced to different forms of syntax helping answer different questions. It's like, 'Oh, this is one that accounts for the things that are happening inside words as well.' It's really cool.
Emily: One of the things about both LFG and HPSG is that they're associated with these long-term computational projects where people aren't just working out the grammars of languages with pen and paper but actually codifying them in rules that both people and computers can deal with. I got involved with the HPSG project like that as a graduate student at Stanford, and then later on while – my first job, actually – that's not true. My first job out of grad school was teaching for a year at UC Berkeley, but then I had a year after that where I was working in industry at a startup called 'YY Technologies' that was using a large-scale grammar of English to create automated customer service response. You've got an email coming in, and the idea is that we parse the email, get some representation of what's being asked, look up in a database what an appropriate answer would be, and then send that answer back. The goal was to do it on the easy cases so that the harder cases that the automated system couldn't handle would get passed through to a representative. The startup was doing that for English, and they wanted to expand to Japanese. I had been working on the English grammar, actually, as a graduate student at Stanford because it's an open source grammar, and I speak Japanese, and so I got to do this job where it was literally my job to build a grammar of Japanese on a computer. It was so cool. That was a fantastic job. In the course of that year, there was a project starting up in Europe that was interested in building more of these grammars for more languages. I picked up the task of saying, 'How can we abstract out of this big grammar for English,' which at that point was about seven years old, still under development. It is quite a bit older now, quite a bit bigger.
Lauren: Amazing.
Emily: 'How can we take what we've learned about doing this for English and make it available for people to build grammars more quickly of other languages?' I took that English grammar and held it up next to the Japanese grammar I was working on and basically just stripped out everything that the Japanese made look English-specific and said, 'Okay, here's a starter kit. This is the start of the grammar matrix that you can use to build a new grammar.' That's the beginning of that project. I have since been developing that – we can talk more about what 'developing it' means – together with students, now, for 23 years. It's a really long-standing project.
Lauren: Amazing. That is – in terms of linguistics research projects and, especially, computational linguistics projects – a really long time. It speaks to the fact that computers don't process language the same way we do. A human by the age of 23 is fully functional at a language by itself and can be sharing that language with other people, but for a computer, you're finding more and more – I assume at this point it's really specific rules or factors or edge cases.
Emily: For the English grammar that I was describing, yes, it's basically that. The grammar matrix grows when people add facilities to it for handling new things that happen across languages. For example, in some languages, you have a situation where, instead of having just one verb to say something like 'bathe,' it requires two words together. You might have a verb like 'take' that doesn't mean very much on its own and then the noun 'bath,' and 'take a bath' means the same thing as 'bathe.' This phenomenon, which is called 'light verb constructions,' shows up in many different languages around the world in slightly different ways. When the student working on this is done with her master's thesis, you'll be able to go to the grammar matrix website and enter in a description of light verb constructions in a language and have a grammar come out that can handle them.
Lauren: So excellent. And not something, if we were only working in English, that we would think about, but light verbs show up across different language families and across the grammars of languages that you want to build computational resources for, so it makes sense to add this kind of functionality.
Emily: Exactly. And light verbs do happen in English, but they happen in different ways and more intensively in other languages. You can kind of ignore them in English and get pretty far, but in a language like Bardi, for example, in Australia, you aren't gonna be able to do very much if you don't handle the light verbs.
Lauren: And now, hopefully at the end of this MA, weâll be able to.
Emily: Yes, exactly.
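[A toy sketch of the analysis Emily describes – not the Grammar Matrix itself, just an illustration with a hypothetical three-entry lexicon: a light verb contributes little meaning of its own, so the verb-plus-noun pair maps to a single semantic predicate.]

```python
# Toy illustration (not the Grammar Matrix): treating "take a bath"
# as one predicate, the way a light-verb-construction analysis does.
LIGHT_VERB_LEXICON = {
    ("take", "bath"): "bathe",
    ("take", "walk"): "walk",
    ("give", "hug"): "hug",
}

def predicate_of(verb: str, noun: str) -> str:
    """Return the single semantic predicate for a verb + noun pair."""
    # A light verb means little on its own; the noun carries the event,
    # so the pair collapses into one predicate.
    return LIGHT_VERB_LEXICON.get((verb, noun), f"{verb}({noun})")

print(predicate_of("take", "bath"))    # -> bathe
print(predicate_of("scrub", "floor"))  # ordinary verb: scrub(floor)
```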
Lauren: Why is it useful to have resources and grammars that can be used for computers for languages like Bardi or, I mean, even large languages like Japanese?
Emily: Why would you want to build a grammar like this? Sometimes, it's because you want to build a practical application where you can say, 'Okay, I'm gonna take in this Japanese string, and I'm going to check it for grammatical errors,' or 'I'm going to come up with a very precise representation of what it means that I can then use to do better question answering,' or things like that. But sometimes, what you're really interested in is just what's going on in that language. The cool thing about building grammars in a computer is that your analysis of light verb constructions has to work together with your analysis of coordination and your analysis of negation and your analysis of adverbs because they aren't separate things, they're all part of one grammar.
Lauren: And so, if we can make computers understand it, it's a good way of validating that we have understood it and that we've described the phenomenon sufficiently.
Emily: And on top of that, if you have a collection of texts in the language, and you've got your grammar that you've built, and you wanna find what you haven't yet understood about the language, you try running that text through your grammar and find all of the places where the grammar can't process the sentence. That's indicative of something new to look into.
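[A minimal sketch of that 'error mining' idea, using NLTK's toy CFG machinery rather than the rich HPSG grammars the real work uses; the tiny grammar and corpus here are invented for illustration.]

```python
# Error mining in miniature: parse a corpus with a hand-written grammar
# and report the sentences the grammar cannot yet handle.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP | V
    Det -> 'the' | 'a'
    N  -> 'dog' | 'bath'
    V  -> 'takes' | 'barks'
""")
parser = nltk.ChartParser(grammar)

corpus = [
    "the dog barks",
    "the dog takes a bath",
    "a dog bathes",  # 'bathes' is missing from the toy grammar
]

for sentence in corpus:
    tokens = sentence.split()
    try:
        parses = list(parser.parse(tokens))
    except ValueError:  # a token the grammar does not cover at all
        parses = []
    if not parses:
        print("no parse -- something to investigate:", sentence)
```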
Lauren: It's thanks to this kind of computational linguistics that all those blue squiggles turn up on my word processing, and I don't make major syntactic mess ups while I'm writing.
Emily: That's actually an interesting case. Historically, yes, the blue squiggles came from grammar engineering. I believe they are now done with the large language models. We can talk about that some if you want.
Lauren: Okay, sure. But it was that kind of grammar engineering that led to those initial developments in spell checkers and those kind of things.
Emily: Yes, exactly.
Lauren: Amazing. Getting computers to understand human language has been part of the interest of computational scientists since the early days of 20th-century computing. I feel like a question that keeps popping up when you read the history of this is like, 'And then someone figured something out, and they figured we'd solve language in five years.' Why haven't we solved getting computers to understand language yet?
Emily: I think part of it is that getting computers to understand language is a very imprecise goal, and it is one where, if you really want the computer to behave the same way that a person would behave if they heard something and understood it, then you need way more than linguistics. You need something – and I really hate the term 'artificial intelligence' – but you basically need to solve all of the problems that building artificial intelligence – if that were a worthy goal – would require solving. You can ask much narrower questions and build useful language technology – so grammar checkers, spell checkers – that is computers processing natural languages to good effect. Machine translation – it's not the case that the computer has understood and then is giving you a rendition in the output language. Machine translation is just 'Well, we're gonna take this string of characters and turn it into that string of characters because, according to all of the data that was used to develop the system, those patterns relate to each other.'
Lauren: I think it's also easier to understand from a linguistic perspective that when people say, 'solve language,' they have this idea of language as a single, unified thing, but so far, we've only been talking about written things and the issues that are around syntax and meaning. But dealing with understanding or processing written language versus processing voice going in versus creating voice – they're all different skills. They require different linguistic and computational skills to do well. Solving language involves solving, actually, hundreds and thousands of tiny different problems.
Emily: Many, many different problems, and they're problems that, as you say, involve different skills. So, are you dealing with sound files? If you actually wanted to process something more like what a person is doing, do you have video going on? Are you capturing the gesture and figuring out what shades of meaning the gesture is adding?
Lauren: Nodding vigorously here.
Emily: I know I don't need to tell you that. [Laughs] But also pragmatics, right – we can get to a pretty clear representation for English at least of the 'Who did what to whom?' in a sentence – the bare bones meaning in the form of semantics. But if we want to get to 'Okay, but what did the person mean by saying that? How does that fit in with what we've been discussing so far and the best understanding possible of what the person is trying to do with those words?' that's a whole other set of problems – that's called 'pragmatics' – that is well beyond anything that's going on right now. There's tiny little forays into computational pragmatics, but if you really want to understand language – a language, right, most of this work happens in English. We have a pretty good idea about how languages vary in their syntax. Variation at the level of semantics, less well studied. Variation in pragmatics, even less so. If we were going to solve language, we need to say which language.
Lauren: Which raises a very important point. As you've said, most of this work happens in English. In terms of computational linguistics, there's been the sense that people are very pleased that we've now got maybe a few hundred languages that we have pretty good models for, but there's still thousands of languages that we don't have any good computational models for. What is required to make that happen? If you had a very large budget and a great many computational linguists to train at your disposal, what's the first thing you would need to start doing?
Emily: The very first thing that I would start doing, I think, is engaging with communities and seeing which communities actually want computational work done on their languages. And then my ideal use of those resources would be to find the communities that want to do that, find the people in those communities who want to be computational linguists, and train them up rather than what's usually a much more extractive, 'We're gonna grab your data and build something' kind of a thing. And then it becomes a question of 'Okay, well, what do you want computers to be able to do with your language?' – a question to the community. Do you want to be able to translate in and out of, maybe, English or French or some other world or colonial language? Do you want a spell checker? Do you want a grammar checker? Do you want a dialogue partner for people who are learning the language? Do you want a dictionary that makes it easier to look up words? If your language is the kind of language that has a whole bunch of prefixes, just alphabetical order of the words, you know, isn't gonna be very helpful. What's needed? And then it depends – do you want automatic transcription? Do you want text-to-speech? Then depending on what the community is trying to build, you have different data requirements. If you wanna build a dictionary like that, that's a question of sitting down and writing the rules of morphology for the language and collecting a big lexicon. If you want text-to-speech, you need lots and lots of recordings that have been transcribed in the language. If you want machine translation, you need lots and lots of parallel text between that language and the language you're translating into.
Lauren: And so, a lot of that will use the same computational grammar models but will have slightly different takes on what those models are and will need different data to help those models do their job.
Emily: In some cases, the same models, in some cases, different. I think if we're talking speech processing, automatic transcription, or speech-to-text, we're definitely in machine learning territory, and so that's one kind of model. Machine translation can be done in a model of the grammar mapped to semantics form, or it can be done with machine learning. The spell checker, especially if you're dealing with a language that doesn't have enormous amounts of texts to start with, you definitely want to do that in a someone-writes-down-the-rules kind of a fashion. That's a kind of grammar engineering, but it's distinct from the kind that I do with syntax.
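[A hedged sketch of that someone-writes-down-the-rules approach: a spell checker as a hand-written lexicon of stems plus affix rules, for an invented prefixing language – every form below is made up for illustration.]

```python
# Rule-based spell checking in miniature: a word is well-formed if it is
# a known stem wrapped in a licensed prefix and suffix.
STEMS = {"kala", "pesu"}       # hypothetical stems
PREFIXES = {"", "ma", "nawa"}  # hand-written prefix rules
SUFFIXES = {"", "ti"}          # hand-written suffix rules

def is_word(form: str) -> bool:
    """Accept any stem wrapped in a licensed prefix/suffix pair."""
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if form.startswith(pre) and form.endswith(suf):
                stem = form[len(pre):len(form) - len(suf)] if suf else form[len(pre):]
                if stem in STEMS:
                    return True
    return False

print(is_word("nawakalati"))  # True: nawa- + kala + -ti
print(is_word("kalax"))       # False: flagged as a possible typo
```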
Lauren: And so, it just starts to unpack how complicated this idea of 'computers do language' is, because they're doing lots of different things, and they need lots of different data. Obviously, we say 'data' as though it's some kind of objective, general pot of things, but when we say 'data,' we mean maybe people's recordings, maybe people's stories, maybe knowledge and language that they don't want people outside of their community to have. That creates different imperatives around whether these models are gonna be a way forward or useful for people.
Emily: And at the moment, we don't have very many great models for collecting data and then handling it respectfully. There are some great models, and then there's a lot of energy behind not doing that. The best example that I like to point to is the work of Te Hiku Media in Aotearoa (New Zealand). This is an organisation that grew out of a radio project for Te Reo Māori. They were at a community level collecting transcriptions of radio shows in Te Reo Māori, which is the Indigenous language of Aotearoa (New Zealand). Forgive my pronunciation; I'm trying my best. They had been approached over the years many, many times by big tech saying, 'Give us that data. We'd like to buy that data,' and they said, 'No, this belongs to the community.' They have developed something called the 'Kaitiakitanga License,' which is a way that works for them of granting access to the data and keeping data sovereignty – basically keeping community control of the data. There are ways of thinking about this, but it really requires strength and community against the interests of big tech that takes a very extractivist view of data.
Lauren: It's good that there are some models that are being developed, normalising this as one possible way of going forward. As you've said, you've spent a lot of time working to build a grammar matrix for lots of different languages. This goes against a general trend of focusing on technologies in major languages where there're clear commercial and large-audience imperatives. Part of this work has been making visible the fact that English is very much a default language in the computational linguistics space. Can you give us an introduction to the way that you started going about making the English-centric nature of computational linguistics more visible?
Emily: I think that this really came to a head in 2019 when I was getting very fed up with people writing about English as if it weren't language. They would say, 'Here's an algorithm for doing machine reading comprehension,' or 'Here's an algorithm for doing spell checking,' or whatever it is; if it was English, they wouldn't name the language. It seems like, 'Well, that's a general solution,' and then anybody working on any other language would have to say, 'Well, here's a system for doing spell checking in Bardi,' or 'Here's a system for doing spell checking in Swahili,' or whatever it is. Those papers tended to get read as, 'Well, that's only for Bardi,' or 'That's only for Swahili,' where the English ones – because English was treated as default – were taken as general. I made a pest of myself at a conference in 2019 – the conference is called 'NAACL' – where I basically just, after every talk where people didn't mention the name of the language, went to the microphone, introduced myself, and said, 'Excuse me, what language was this on?' which is a ridiculous question, right, because it's obvious that it's English. It's sort of face-threatening. It's impolite because it's 'Why are you asking this question?' but it's also embarrassing for the asker, like, 'Why would you ask this silly question?' But I was just making a point. Somewhere along the line, people dubbed that the 'Bender Rule': you have to name the language that you're working on, especially if it's English.
Lauren: I really appreciate your persistence, and I appreciate the people who codified it into the Bender Rule, because now it's actually less threatening for me to say, 'I'm just gonna invoke the Bender Rule and just check: was this just on English?' You've given us a very clear model where we can all very politely make pests of ourselves to remind people that solving something for English or improving a process for English doesn't automatically translate to that working for other languages as well.
Emily: Exactly. And I like to think that, basically, by lending my name to it, I'm allowing people to ask that question while blaming it on me.
Lauren: Great. Thank you very much. I do blame it on you all the time in the nicest possible way.
Emily: Excellent.
Lauren: This seems to be part of a larger process youâve been working on. Obviously, thereâs people working on computational processes for English, and youâre trying to be very much a linguist at them, but it seems like you also are spending a lot of time, especially in terms of ethical use of computational processes, trying to explain linguistics to computer scientists as well. How is that work going? Are computer scientists receptive to what linguistics has to offer?
Emily: Computer scientists are a large and diverse group in terms of their attitudes. They are an unfortunately un-diverse group in other ways. Itâs an area of research and development that has a lot of money in it right now. Thereâs always new people coming in, and so it feels like no matter how much teaching of linguistics I do, there is still just as many people who donât know about it as there ever were because new people are coming in. That said, I think itâs going well. I have written two books that I call, informally, âThe 100 Thingsâ books because they started off as tutorials at these computational linguistics conferences with the title, â100 Things You Always Wanted to Know About Linguistics But Were Afraid to Askâ and then subtitle, âFor Fear of Being Told 1,000 More.â [Laughter]
Lauren: I mean, itâs not a mischaracterisation of linguists, thatâs for sure.
Emily: Weâre gonna keep linguisting at you, right. In both cases, the first one is about morphology and syntax. I basically just wrote down, literally, 100 things that I wish that people working in natural language processing in general knew about how language works because they tend to see language as just strings of words without structure. Worse than that, they tend to see language as directly being the information theyâre interested in. I used to have really confusing conversations with colleagues in computer science here â people who were interested in gathering information from large collections of texts, like the web (this is a process called âinformation extractionâ) â and when I finally realised that we were focusing on different things â I was interested in the language, and they were interested in the information that was expressed in the language â the conversations started making sense. I came up with a metaphor to help myself, which is, if you live somewhere rainy, can you picture youâve got a rain-splattered window. You can focus on the raindrops, or you can focus on the scene through the window distorted by the raindrops. Language and its structures are the raindrops, which have an effect on what it is that you can see through the window, but it is very easy to look right through them and imagine youâre just seeing the information of the world outside. When I realised that, as a computational linguist, Iâm interested in the raindrops, but some of these people working in computer language processing are just staring straight through them at the stuff outside, it helped me communicate a lot better.
Lauren: I feel like I've had a lot of conversations with computational scientists where they're like, "Ah, we did a big semantic analysis of –" so there's a process you can apply where a whole bunch of algorithms run over, say, a pile of Reddit threads, and it says, "80% of people in this chat hate chocolate ice cream." I'd always be like, "Okay, but did you account for the person who's like, 'Oh my god, I hate how delicious this ice cream is'?" And they're just like, "Ah… well, no, because 'hate' was negative and 'delicious' was positive, so this person probably came out in the wash." I'm like, "No, this is a person who extremely likes this ice cream," and it's also a very idiomatic, informal kind of English. I certainly wouldn't write that in a professional reference for someone – "I hate how amazing this person is. You should hire them." As a linguist, I'm really interested in these nuanced, novel edge cases, and the computational scientists are like, "Oh, we just hope we get enough data that they disappear in the noise."
Emily: And the words are the data. The words are the meaning. There's no separation there. There's no structure to the raindrops. "If I have the words, I have the meaning" seems to be the attitude.
Lauren: Well, it's great that you're doing the work of slowly walking them back from that assumption.
Emily: We're trying. Oh, one other thing about these books: the first one is morphology and syntax, the second one is semantics and pragmatics. In both of them – the second is co-authored with Alex Lascarides – I have the concept index and the index for languages. Every time we have an example sentence, it shows up as an entry in the index for languages. There's an index entry for English. Even though it indexes almost every single page in the book, it's in there, because English is a language.
Lauren: There's this thing called the "Bender Rule." I don't know if you've heard of it, but I'm really glad that you're following its principles. A lot of the work you've been doing is with a type of computational linguistics where you are building rules to process language and create useful computational outputs, but there are other models for how people can use language computationally.
Emily: I tend to do symbolic or rule-based computational linguistics. I'm really interested in "What are the rules of grammar for this language, or for this phenomenon across languages? How can I encode them so that I can get the machine to test them, but also, I can still read them?" But a lot of work in computational linguistics instead uses statistical models – building models that can represent patterns across large bodies of text.
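(For readers who want to see the shape of the rule-based approach: here is a minimal, invented sketch in Python using the NLTK toolkit. The toy grammar and sentences are made up for illustration and are not from Emily's actual grammar work.)

```python
# A toy rule-based grammar: every rule below is written by a human and can be
# read and debugged directly. (Illustrative sketch only; not Emily Bender's
# actual grammar engineering.)
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V | V NP
    Det -> 'the' | 'a'
    N  -> 'dog' | 'cat'
    V  -> 'barks' | 'sees'
""")
parser = nltk.ChartParser(grammar)

for sentence in ["the dog barks", "the dog sees a cat", "dog the barks"]:
    trees = list(parser.parse(sentence.split()))
    # No parse means no sequence of the rules above licenses this sentence,
    # and you can trace exactly which rule is missing or misapplied.
    print(sentence, "->", "grammatical" if trees else "no parse")
```

The point of the toy: when the last sentence fails to parse, you can point to the exact human-written rule responsible, which is the debuggability Emily describes.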
Lauren: Oh, so that's like predictive text on my mobile phone, where it's so used to reading all of the data that it has from other people's text messages and my text messages that sometimes it can just predict its way through a whole message for me.
Emily: Yes, exactly. And in fact, I don't know if this is so true anymore, but for a while, you could see that the models were all different on different phones. Remember we used to play that game where you typed in, "Sorry I'm late, I…" and then just picked the middle option over and over again, and people would get different, fun answers.
Lauren: Yes, and you'd get wildly different answers.
Emily: That reflects local statistics being gathered based on how you've been using that phone, versus a model that it may have started with that was based on something more generic. That is, yes, an example of statistical patterns. You also see these in automatic transcriptions – this is fun in the closed captioning on TV shows, live news or something where it wasn't done ahead of time. The system gets to the name of a person or a place which clearly wasn't represented in the training data, and ridiculous, funny things come out, because the system has to fall back on statistical patterns about what that word might have been. It reveals interesting things about the training data.
Lauren: We used to always put the show through a first pass on YouTube, where Lingthusiasm is also hosted, before Sarah Dopierala came in and transformed our lives by being an amazing transcriptionist. For years, YouTube would transcribe "Lingthusiasm" – a word which, in its defence as a computer, it had never encountered before – most often as "Link Susy I am." We still occasionally refer to "Link Susy I am." It was interesting when it finally had enough Lingthusiasm episodes with our manually updated transcripts that it clearly got the hang of it, but that was definitely a case where it needed to learn. We definitely have a much higher success rate of perfect, first-time transcripts with Sarah.
Emily: That pattern that you saw happening with YouTube, that change, shows you that Google was absolutely taking your data and using it to train their models. In the podcast that I run, Mystery AI Hype Theater 3000, we have some phrases that are uncommon, and we do use a first-pass auto-transcriber. For example, we refer to the so-called AI models as "Mathy Maths."
Lauren: "Mathy Maths," yeah.
Emily: That'll come out as, like, "Matthew Math."
Lauren: Oh, my good friend Matthew Math.
Emily: [Laughs] And the phrase "stochastic parrots" sometimes comes out as, like, "sarcastic parrots" or things like that.
Lauren: And you and Alex both have, I would say, relatively standard North American English accents, which is really important for these models, because so far we've just been talking about data where it's found, and, like, we're linguists working with it and processing it before the computer gets to it. But with a lot of these new statistical models, it's just taking what you give it. That means, as an Australian English speaker, I'm relatively okay, but it's not as good for me as it is for a Brit or an American. And then if you're a Singaporean English or Indian English speaker, even as a native English speaker, the models aren't trained with you in mind as the default user. It just gets more and more challenging.
Emily: Exactly. Some of that is a question of "What could the companies training these models easily get their hands on?" But some of it is also a question of "Who were they designing for in the first instance? Whose data did they think of as 'normal data' that they wanted to collect?"
Lauren: These are deliberate choices that are being made.
Emily: Absolutely.
Lauren: With these statistical models, how do they differ from the grammars that you've created?
Emily: In a rule-based grammar system, somebody is sitting down and actually writing all the rules. Then when you try a sentence and it doesn't work as expected, you can trace through "What rule was used that shouldn't have been?" "What rule did you expect to show up in that analysis that wasn't there?" and you can debug like that. With the statistical models, instead, you build a model that's the receptacle for the statistics. You gather a whole bunch of data, then you use this receptacle model to process the data item by item, have it output likely answers according to its current statistics, compare them to what's actually there, and update the statistics every time it's wrong. You do that over and over and over again, and it becomes more and more effective at closely modelling the patterns in the data – but you can't open it up and say, "Okay, this part is why it gives that output, and I want to change that." It's much more amorphous; "black box" is the terminology that gets used a lot.
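(A drastically simplified sketch of that predict-compare-update loop, on invented data. Real systems train neural networks over huge corpora, but the cycle Emily describes has the same shape.)

```python
# Error-driven training in miniature: output an answer from current
# statistics, compare it to what's actually there, and update the statistics
# every time it's wrong. (Toy perceptron on invented data; not any specific
# NLP system.)
import random

# Invented task: label tiny bags of words as positive (1) or negative (-1).
data = [({"great": 1, "movie": 1}, 1),
        ({"terrible": 1, "movie": 1}, -1),
        ({"great": 1, "acting": 1}, 1),
        ({"terrible": 1, "plot": 1}, -1)]

weights = {}  # the "receptacle for the statistics"

for epoch in range(10):
    random.shuffle(data)
    for features, label in data:
        score = sum(weights.get(f, 0.0) * v for f, v in features.items())
        prediction = 1 if score >= 0 else -1
        if prediction != label:            # compare to what's actually there
            for f, v in features.items():  # update every time it's wrong
                weights[f] = weights.get(f, 0.0) + label * v

# The result is a pile of numbers, not rules you can read and point to.
print(weights)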
Lauren: In 2020, we were really lucky to have Janelle Shane join us on the show and walk us through one of these generative statistical models from that era. She generated some Lingthusiasm transcripts based off the first 40 or so episodes of transcripts that we had. When it generated transcripts, the model had this real fixation on soup. It got the intro to Lingthusiasm right, because we say that 40 times across 40 episodes, and then it would be like, "Today, we're talking about soup." And we were like, "Janelle, what's with the soup?" and she's like, "I can't tell you. It's a black box in there" – literally referred to as hidden layers in the processing. So, because we don't know why it was fixated on soup, there are some great fake Lingthusiasm transcripts that we read – very soup-focused, and very focused on a couple of major pieces of fan fiction literature, classic fan-fiction-favourite IP, because it read a bunch of fan fiction as well. You can make some guesses about why it's talking about wizards a whole bunch, but you can't make many guesses about why it's talking about soup a whole bunch, and that makes it hard to debug that issue.
Emily: Hard to debug, yeah. But also, if you don't know the original training data – so it sounds like she took a model that had been trained on some collection of data –
Lauren: Yes, so that it could be coherent with only those 40 transcripts.
Emily: Exactly, yeah. But if you don't know what's in that training data, then you are even more poorly placed to figure out "Why soup?"
Lauren: And since we did that episode, I think the big thing that's changed is that the models are being given enough extra data that they're no longer fixated on soup, but they've also just become easier for everyday people to use. Part of why we were really grateful to have her on the show is that she walked us through the fact that she was still using a scripting language to ingest those transcripts and to generate the new fabricated text. It all looked very straightforward if you're a computer person, but you needed to be a person who's comfortable with scripting languages. That's no longer the case with these new chat-based interfaces. That's really changed the extent to which people interact with these models.
Emily: Yes, exactly. There's a few things that have changed. One is there's been some engineering that allowed companies to make models that could actually take advantage of very large data sets. There has been the collection of very large data sets in a not very consent-based fashion. Then there has been the establishment of these chat interfaces, as you say, where you can just go and poke at it and get something back. Honestly, the biggest thing that happened – the reason that all of a sudden everybody's talking about ChatGPT and so-called "AI" – was that OpenAI set up this interface where anybody could go poke at it, and then they had a million people sharing their favourite examples. It was this marketing win for OpenAI and a big loss for the rest of us.
Lauren: I think the sharing of examples is really important as well, because people don't talk very often about the human curation that goes into picking funny or coherent or relevant examples. We had to junk so many of those fake transcripts to find the handful that were funny enough to pretend-read and give a rendition of. When people are sharing their favourite things that come out of these machines, that's a level of human interaction with them that I think is often missing from the conversation. And making it very easy for people to generate a whole bunch of content, then pick their favourite and share it, has really normalised these large-language-model ways of playing with language.
Emily: Exactly. If you were someone who's not playing with it, or even if you are, most of the output you're going to see is other people sharing their favourites. You get a very distorted view of what it's doing.
Lauren: In terms of what it is doing – you know, we talked before about how, when a computer is doing translation between two languages, it's not that it's understanding; it's replacing one string of text with another string of text. With these generative models that create text that, on an initial read, reads like English, what are some of the limitations?
Emily: Just like with machine translation, it's not understanding. The chat interface encourages you to think that you are asking the chatbot a question and it is answering you. This isn't what's happening. You are inputting a string, and then the model is programmed to come up with a likely continuation of that string. But a lot of its training data is dialogues, and so something that takes the form of a question provokes, as a likely continuation, an answer. But it hasn't understood. It doesn't have a database that it's consulting. It doesn't have access to factual information. It's just coming out with a likely next string given what you put in. Any time it seems to make sense, it's because the person using it is the one making sense of it.
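(Here is that "likely continuation" idea in miniature, using nothing but word-pair counts over an invented training text. Real models use neural networks over subword tokens, but the string-in, likely-string-out framing is the same.)

```python
# "Likely continuation" in miniature: count which word follows which in some
# training text, then extend a prompt by sampling a statistically likely next
# word. (Invented training text; a toy stand-in for a real language model.)
import random
from collections import Counter, defaultdict

training_text = ("what is your name my name is a secret "
                 "what is the time the time is now").split()

follows = defaultdict(Counter)
for w1, w2 in zip(training_text, training_text[1:]):
    follows[w1][w2] += 1  # nothing but co-occurrence counts

def continuation(prompt, length=5):
    words = prompt.split()
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:
            break  # this word was never seen followed by anything
        nxt = random.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)

# Question-shaped input yields answer-shaped output, with no understanding.
print(continuation("what is"))
```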
Lauren: And because it's had enough input – it basically took in large chunks of the English-speaking internet – there's a statistical likelihood it's going to say something that is correct, but that is only a statistical chance. It doesn't actually have the ability to verify its own factual information.
Emily: Exactly. I really dislike this term, but people talk about "hallucinations" with these models to describe cases where they output something that is not factually correct.
Lauren: Okay, why is "hallucination" not an appropriate word for you?
Emily: There are two problems with it. One speaks to what you were just talking about, which is that if it says something that is factually correct, that is also just by chance. It's always doing the same thing; it's just that sometimes the output corresponds to something we take to be true and sometimes it doesn't. But also, if you think about the term "hallucination," it refers to perceiving things that aren't there. That suggests that these chatbots are perceiving things, which they very much aren't. That's why I don't like the term.
Lauren: Fair enough. It's a bit too human for what they're actually doing, which is a pretty cool party trick – but it is just a party trick. One thing I've really appreciated about your critique of these systems is that you situate the linguistic issues around the lack of actual understanding and real pragmatic capability, but you also talk about larger systemic issues: problems with the data, and problems with the amount of computer processing it takes to perform this party trick, which is an alarming combination. Can you talk about some of those, and maybe some of the other issues that you've seen crop up with these models?
Emily: It's so vexed. So, one place to start is a paper that I wrote with six other people in late 2020 called "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" – the parrot emoji is part of the title.
Lauren: Excellent.
Emily: This paper became famous in large part because five of the co-authors were at Google, and Google decided, after approving it for submission to a conference, that, in fact, it should be either retracted or have their names taken off of it, and, ultimately, three of the authors took their names off, and two others got fired over it.
Lauren: Right, okay. That is big impact for a conference paper.
Emily: In the aftermath of that, the paper's impact was enhanced by the fact that the first author to get fired, Dr. Timnit Gebru, was masterful at taking the ensuing media attention and using it to shine a light on the mistreatment of Black women in tech. She did an amazing job. Dr. Margaret Mitchell was the other one who got fired. It took a couple more months in her case.
Lauren: Oh, you mean her name is not "Shmargaret Shmitchell"? [Laughter] That was a pseudonym?
Emily: That was a pseudonym, yeah. Who would've thought?
Lauren: I can't believe it.
Emily: We wrote that paper because Dr. Gebru came to me in a Twitter DM in September of 2020 saying, "Hey, has anyone written about the problems with these large language models and what we should be considering?" because she was a research scientist in AI ethics at Google. It was literally her job to research this stuff and write about it. She had seen people around her pushing for ever bigger language models. This is 2020, so the 2020 language models are small compared to the ones that we have now. Doing her job, she said, "Hey, we should be looking into what to look out for down this path." I wrote back saying, "I don't know of any such papers, but off the top of my head, here are the issues that I would expect to find in one, based on independent papers" – looking at things one by one in the literature. That was things like environmental impact; like the fact that they pick up biases and systems of oppression from the training data; like the fact that if you have a system that can output plausible-looking synthetic text that nobody is accountable for, that can cause various problems down the road when people believe it to be a real text. Then a beat or so later, I said, "Hey, this looks like a paper outline. Do you wanna write it?" That's how the paper came to be. There are two really important things that we didn't realise at the time. One is the extent to which creating these systems relies on exploitative labour practices. That includes basically stealing everybody's text without consent, but also, in order to keep the systems from routinely outputting bigoted garbage, there's this extra layer of so-called training where poorly paid workers, working long hours without psychological support, have to look at all the awful stuff and say, "That's bad. That's bad. This one's okay," and so on. This tends to be outsourced – there are, famously, workers in Kenya who had been doing this. We didn't know about that at the time, though some of the information was available, so we could have.
Lauren: And it keeps outputting highly bigoted, disgusting text because it's been trained on the internet, which, as we all know, is a bastion of enlightened and equal-opportunity conversation.
Emily: Yes. But even if you go with only, for example, scientific papers, which are supposed to not be awful – guess what? There's such a thing as scientific racism, and it is well embedded in the scientific literature. There was a large language model that Meta put together called "Galactica." It came out right before ChatGPT. It was billed as a way to access the world's scientific knowledge, which of course it isn't, because if you take a whole bunch of scientific text, chop it up, and turn it into papier-mâché, what you get out is not science but papier-mâché, right. But anyway, people were poking at this and very quickly got it to say racist and otherwise terrible things in the guise of being scientific. I think it was the linguist Rikker Dockum who asked it something about the stigmatisation of language varieties, and it came out with something about how African Americans don't have a language of their own.
Lauren: Oh. A thing that we don't even need to fact check because that is incorrect.
Emily: Anyway, you can certainly get to bigoted stuff starting with things less awful than the stuff that's out there on the internet – but also, these models are trained on what's out there on the internet. Labour exploitation was one thing that we missed. The other thing that we missed in the stochastic parrots paper was we had no idea that people were gonna get so excited about synthetic text. In the section where we actually introduce the term "stochastic parrot" to describe these machines that are outputting text with no understanding and no accountability, we thought we were going out on thin ice. Like, "People aren't really gonna do this." But now, it's all over the place, and everyone is trying to sell it to you as something you might pay for.
Lauren: Yes, in many ways it's a paper that was very prescient about a technology that has very quickly become normalised, which creates a compounding effect in terms of data, because now everyone's sharing the synthetic text that they're creating for fun, but people are also using it to populate webpages, and heaven knows a lot of the spam in my inbox is getting longer because it can just be generated with these machines and processes as well. The data these models were trained on used to be human-created; now, if you try to scrape the internet, there'd be all of this synthetic, machine-created language in it as well. It will just start training on its own output, which – I'm not a computational linguist, but that just sounds like it's not a great idea.
Emily: If you think about what it is that you want to use these for, then ultimately, data quality really, really matters – and, ideally, not only good data but well-documented data, so you can decide, "Hey, is this good for my use case?" The ability to use the web as a corpus to do linguistic studies is rapidly degrading. In fact, there's a computational linguist named Robyn Speer who used to maintain a project called "wordfreq," which counted frequencies of words in web text over time. She has discontinued it because she says, "There's too much synthetic garbage out there anymore. I can't actually do anything reliable here. So, this is done."
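(For readers who haven't seen it: answering questions like this was wordfreq's whole job. The calls below use the library's documented interface; the underlying data is now frozen rather than updated, for exactly the reasons described above.)

```python
# Querying wordfreq's (now frozen) word-frequency data.
from wordfreq import word_frequency, zipf_frequency

# Frequency as a fraction of all English words, and the same idea on the
# logarithmic Zipf scale (higher = more common).
print(word_frequency("the", "en"))
print(zipf_frequency("linguistics", "en"))
```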
Lauren: So, it's bad for computational linguistics. It's bad for linguistics. And just to be clear, with these models, there's no magic tweak that we can make to make them be factual.
Emily: No. Not at all. Because they're not representing facts; they're representing co-occurrences of words in text. Does this spelling happen a lot next to that spelling? Do they happen in the same places? Then they're likely to be output in the same places. That sometimes reflects things that happen in the world, because sometimes the training text is things that people said to describe the actual world – but if it outputs something factual, it's just by accident.
Lauren: So, your work on the stochastic parrots paper really set the tone for this conversation in linguistics. And you've been continuing to talk about the issues and challenges with these large language models and other kinds of generative models because, obviously, similar processes are used for image creation, and we've only really talked about the text-based stuff, and there's a whole bunch of things happening with audio and spoken language as well. But there'll be heaps more of that on Mystery AI Hype Theater 3000, and also in your book The AI Con, which is coming out in spring 2025.
Emily: Yes, I am super excited for this book. It was a delight to work with Dr. Alex Hanna, who is my co-host on Mystery AI Hype Theater 3000, to put together a book that is for popular audiences. One of the things that I think worked really well is that she's a sociologist and I'm a linguist, so we have different technical terms. We were able to catch each other – "I don't really know what that word means," so the general audience isn't gonna know what that word means either. Hopefully, it will be nice and accessible. The title is The AI Con, and the subtitle is "How to Fight Big Tech's Hype and Create the Future We Want." It'll be out in May of 2025.
Lauren: And it seems like, given the limitations of these big models, there's still lots of space for the kind of symbolic grammar-processing work that you do.
Emily: Yes, there's definitely space for symbolic grammar-based work, especially if you're interested in something that will get a correct answer if it gets an answer at all, and you're in a scenario where it's okay to say, "No possibility here. Let's send this on to a human," for example. But also, there's a lot of room for linguistics in designing better statistical natural language processing – in understanding what it is that the person is going to be doing with the computer, and how people relate to language, so that we can design systems that are not misleading but, in fact, are useful tools.
Lauren: If you could leave people knowing one thing about linguistics, what would it be?
Emily: In light of this conversation, the thing that I would want people to know is that linguistics is the area that lets us zoom in on language and pick apart the raindrops and understand their structure, so that we can then zoom back out and have a better idea of what's going on with language in the world.
Lauren: Thank you so much for joining us today, Emily.
Emily: It's been an absolute pleasure.
[Music]
Lauren: For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on all of the podcast platforms or lingthusiasm.com. You can get transcripts of every episode on lingthusiasm.com/transcripts. You can follow @lingthusiasm on all social media sites. You can get scarves with lots of linguistics patterns on them, including IPA, branching tree diagrams, bouba and kiki, and our favourite esoteric Unicode symbols, plus other Lingthusiasm merch – like our "Etymology isn't Destiny" t-shirts and Gavagai pin buttons – at lingthusiasm.com/merch.
My social media and blog is Superlinguo. Links to Gretchen's social media can be found at gretchenmcculloch.com. Her blog is AllThingsLinguistic.com. Her book about internet language is called Because Internet.
Lingthusiasm is able to keep existing thanks to the support of our patrons. If you want to get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just want to help keep the show running ad-free, go to patreon.com/lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include behind-the-scenes on the Tom Scott Language Files with Tom and team, linguistics travel, and also xenolinguistics and what alien languages might be like. If you can't afford to pledge, that's okay, too. We really appreciate it if you can recommend Lingthusiasm to anyone in your life who's curious about language.
Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, and our Editorial Assistant is Jon Kruk. Our music is "Ancient City" by The Triangles.
Emily: Stay lingthusiastic!
[Music]
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
#language#linguistics#lingthusiasm#podcast#transcripts#episode 98#Emily M Bender#interview#ai#artificial intelligence#machine language learning#machine learning
Text
There is no such thing as AI.
How to help the non-technical and less online people in your life navigate the latest techbro grift.
I've seen other people say stuff to this effect, but it's worth reiterating. Today in class, my professor was talking about a news article where a celebrity's likeness was used in an AI image without their permission. Then she mentioned a guest lecture about how AI is going to help finance professionals. Then I pointed out that those two things aren't really related.
The term AI is being used to obfuscate details about multiple semi-related technologies.
Traditionally in sci-fi, AI means artificial general intelligence, like Data from Star Trek or the Terminator. This, I shouldn't need to say, doesn't exist. Techbros use the term AI to trick investors into funding their projects. It's largely a grift.
What is the term AI being used to obfuscate?
If you want to help the less online and less tech-literate people in your life navigate the hype around AI, the best way to do it is to encourage them to change their language around AI topics.
By calling these technologies what they really are, and encouraging the people around us to know the real names, we can help lift the veil, kill the hype, and keep people safe from scams. Here are some starting points, which I am just pulling from Wikipedia. I'd highly encourage you to do your own research.
Machine learning (ML): an umbrella term for solving problems for which developing explicit algorithms by human programmers would be cost-prohibitive; instead, the problems are solved by helping machines "discover" their "own" algorithms, without being explicitly told what to do by any human-developed algorithm. (This is the basis of most technology people call AI; a minimal sketch of the idea follows this list.)
Language model (LM or LLM, for large language model): a probabilistic model of a natural language that can generate probabilities for a series of words, based on text corpora in one or more languages it was trained on. (This would be your ChatGPT.)
Generative adversarial network (GAN): a class of machine-learning framework, and a prominent approach to generative AI. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss. (This is the source of some AI images and deepfakes.)
Diffusion models: models that learn the probability distribution of a given dataset. In image generation, a neural network is trained to denoise images that have had Gaussian noise added, by learning to remove the noise. After training is complete, it can then be used for image generation by starting with a random-noise image and denoising it. (This is the more common technology behind AI images, including DALL-E and Stable Diffusion. I added this one to the post afterwards, as it was brought to my attention that it is now more common than GANs.)
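Here's the minimal sketch mentioned above: a tiny, invented example in plain Python of a machine "discovering" its own rule. No human writes "output = 2 * input"; a single parameter gets nudged on examples until that rule emerges.

```python
# "Machines discovering their own algorithms" in miniature.
# (Invented toy data; plain Python; not any real product's code.)
examples = [(1, 2), (2, 4), (3, 6), (4, 8)]  # inputs and desired outputs

w = 0.0    # the model's one learned parameter
lr = 0.01  # learning rate: how big each nudge is

for _ in range(1000):
    for x, y in examples:
        error = w * x - y
        w -= lr * error * x  # gradient step: adjust w to shrink the error

print(round(w, 3))  # ~2.0 -- a "rule" no programmer wrote explicitly
```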
I know these terms are more technical, but they are also more accurate, and they can easily be explained in a way non-technical people can understand. The grifters are using language to give this technology its power, so we can use language to take its power away and let people see it for what it really is.
Link
"A 27-year-old PhD scholar finally cracked the riddle which has defeated Sanskrit experts since the 5th Century BC – by decoding a rule taught by "the father of linguistics," Pāṇini.
The discovery makes it possible to "derive" any Sanskrit word – to construct millions of grammatically correct words, including "mantra" and "guru" – using Pāṇini's revered "language machine," which is widely considered to be one of the great intellectual achievements in history.
Leading Sanskrit scholars have described the discovery as "revolutionary" – and it now means that Pāṇini's grammar can be taught to computers for the first time...
Pāṇini's system – 4,000 rules detailed in his greatest work, the Aṣṭādhyāyī, which is thought to have been written around 500 BC – is meant to work like a machine. Feed in the base and suffix of a word and it should turn them into grammatically correct words and sentences through a step-by-step process.
However, until now, there had been a huge problem. Scientists say that, often, two or more of Pāṇini's rules are simultaneously applicable at the same step, leaving scholars to agonize over which one to choose...
Thought to have lived in a region in what is now north-west Pakistan and south-east Afghanistan, Pāṇini taught a "metarule" to help decide which rule should be applied in the event of a conflict...
Traditionally, scientists have interpreted Pāṇini's metarule as meaning: in the event of a conflict between two rules of equal strength, the rule that comes later in the grammar's serial order wins.
Rajpopat rejects this, arguing instead that, between rules applicable to the left and right sides of a word respectively, Pāṇini wanted us to choose the rule applicable to the right side. Employing this interpretation, Rajpopat found Pāṇini's language machine produced grammatically correct words with almost no exceptions...
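(A toy illustration of that "rightmost rule wins" reading, in Python. The rule format and the example word are invented for illustration; real Paninian derivation is far richer than this sketch.)

```python
# Rajpopat's reading of the metarule, in miniature: when two rules are
# applicable at the same derivation step, apply the one whose site is
# further to the RIGHT in the word. (Invented rule format and example.)
def derive_step(word, applicable_rules):
    # Each rule is (position_it_applies_at, function_that_rewrites_the_word).
    if not applicable_rules:
        return word
    pos, rewrite = max(applicable_rules, key=lambda rule: rule[0])  # rightmost wins
    return rewrite(word)

# Two invented rules in conflict, applying at positions 1 and 3 of a mock
# stem+suffix string:
rules = [(1, lambda w: w[:1] + "A" + w[2:]),  # would rewrite the left side
         (3, lambda w: w[:3] + "U" + w[4:])]  # further right, so it is chosen
print(derive_step("guru+su", rules))  # -> "gurU+su" in this toy setup
```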
"This discovery will revolutionize the study of Sanskrit at a time when interest in the language is on the rise."
Sanskrit is an ancient and classical Indo-European language from South Asia. It is the sacred language of Hinduism, but also the medium through which much of India's greatest science, philosophy, poetry, and other secular literature have been communicated for centuries.
While only spoken in India by an estimated 25,000 people today, Sanskrit has influenced many other languages and cultures around the world.
Rajpopat, who was born in Mumbai and learned Sanskrit in high school, explained, "Some of the most ancient wisdom of India has been produced in Sanskrit, and we still don't fully understand what our ancestors achieved.
"I hope this discovery will infuse students in India with confidence, pride, and hope that they too can achieve great things."
He said that a major implication of his discovery is that, now that we have the algorithm that runs Pāṇini's grammar, we could potentially teach this grammar to computers.
"Computer scientists working on natural language processing gave up on rule-based approaches over 50 years ago. So teaching computers how to combine the speaker's intention with Pāṇini's rule-based grammar to produce human speech would be a major milestone in the history of human interaction with machines, as well as in India's intellectual history."" -via Good News Network, 12/16/22
Text
Interestingly enough, I think calling large language models AI is doing too much to humanize them. Because of how sci-fi literature has built up AI as living beings with actual working thought processes deserving of the classification of person (Bicentennial Man, etc.), a lot of people want to view AI as entities. And corporations pushing AI can take advantage of your soft feelings toward it like that. But LLMs are nowhere close to that, and tbh I don't even feel the way they learn approaches it. Word-order guessing machines can logic their way to a regular-sounding sentence, but that's not anything approaching having a conversation with a person. Remembering what you said is just storing the information you are typing into it; it's not any kind of indication of existence. And yet, so many people online are acting like when my grandma was convinced Siri was actually a lady living in her phone. I think we need to start calling large language models "LLMs" and not giving the corps pushing them more of an in with the general public. It's marketing spin, stop falling for it.
#ai#llms#chatgpt#character ai#the fic ive seen written with it is also so sad and bland#even leaving the ethical qualms behind in the fact its trained off uncompensated work stolen off the internet and then used to make#commercial work outside the fic sphere#it also does a bad job#please read more quality stuff so you can recognize this#edit: in the og post I used the term language learning models instead of large language models because of the ways neural networks were#described to me in the past but large language models is the correct terminology so I edited the post#this has zero effect on the actual post messaging because large language models are indeed the same ones I was describing#advanced mad libs machines are not sentient and nothing about them approaches a mode of becoming sentient#stop talking to word calculators and absolutely never put them in a management situation
Text
I have to wonder how many people celebrating AI translation also complain about "broken English" and how obvious it is something was Google translated from another language without a fluent English speaker involved to properly clean up the translation/grammar.
Because I bet it's a lot.
I know why execs are all for it – AI is the new buzzword, and it lets them cut jobs, thus "save" money, and not have to worry about pesky labour laws when one employs humans – but everyone else?
There was some outcry when Crunchyroll fired many of their translators in favour of AI translation (with some people to "clean up the AI's work") but I can't help but think that was in part because it was Japanese-to-English and personally affected them. Same when Duolingo fired many of their translators in favour of LLM translation. Meanwhile companies are firing staff when it's English to another language and there's this idea that that's fine or not as big a deal because English is "easy" to translate and/or because people don't think of how it will impact people in non-English countries.
Also it doesn't affect native English speakers so it doesn't get much headway in the news cycle or online anyway because so much of the dominant media is from English-speaking countries and English-speakers dominate social media.
But different languages have different grammar structures that LLMs don't do, and I grew up on "jokes" about people speaking in "broken English" and mocking people who use the wrong word when it was clearly a literal translation but the meaning was obvious long before LLMs were a thing, too. In fact, the specific way a character spoke broken English has been a way to denote their native tongue for decades, usually in a racist way.
Then Google Translate came out, and "Google-translated English" became an insult for people and a criticism of companies, because it was clearly wonky to native speakers. Even now, LLMs – which are heavily trained on English compared to other languages – don't have natural output, so native English speakers can clock LLM-generated text if it's longer than a sentence or two.
But, for whatever reason, it's not seen as a problem when it goes the other way because fuck non-English readers or people who want to read in their native tongue I guess.
#and it's not like no people were doing translations so wonky translations were better than nothing#it's actual translators being fired for a subpar replacement#and anyone who keeps their job suddenly being responsible for cleaning up llm output rather than what they trained in#(which can take just as much time or longer than doing the translation by hand from scratch)#(if you want it done right anyway)#hell to this day i hear people complain about written translations of indigenous words and how they 'aren't english enough'#even though they're using the ipa and use a system white english people came up with in the first place#and you can easily look up the proper pronunciation and hear it spoken#but there's such a double-standard where it's expected that other languages cater to english/english speakers#but that grace and accommodation doesn't go the other way#and it's the failing of non-english speakers when an english translation is broken#you see it whenever monolingual english speakers travel to other countries and utterly refuse to learn the language#but if someone doesn't speak in unaccented (to them) english fluently in their home country the person 'isn't trying hard enough'#this is just the new version of that where non-english speakers are supposed to do more work and put up with subpar translations#even as a native english speaker/writer i get a (much) lesser version of this because i write with canadian spelling#and some people get pissed if their internet experience is disrupted by 'ou' instead of 'o' or '-re' instead of '-er'#because dialects and regional phrasing/spelling is a thing#human translators can (or should) be able to account for it but llms are not smart enough to do so#and that's not even getting into slang and how llms don't account for it#or how llms can put slurs into translations because it doesn't do nuance or context and doesn't know the language#if you ever complained about buying something from another country that came with machine-translated instructions#you should be pissed at companies cutting english-to-[language] staff in favour of glorified google translate#because the companies are effectively saying they're fine with non-native speakers getting a wonky/broken version
Text
Cetacean Translation Initiative
The Cetacean Translation Initiative (CETI) is a nonprofit team of researchers applying advanced machine learning to understand whale communication!
If you want to learn more about animal communication, check out my curated list of pop science books on Animal Communication & Cognition!