lady-inkyrius
The Girl Anachronism
25K posts
This place is not a place of honour. No highly esteemed deed is commemorated here. Nothing valued is here. Inkyrius or Inky or Kyrie are all fine. Girl with a dash of Void. Queer Trans Nerd. Mostly Harmless. Interests include programming, Minecraft, pixel art, conlangs, worldbuilding, maps, typography, etc. I'm over 18 for the people that need to know that.
lady-inkyrius · 19 minutes ago
Text
Linguists deal with two kinds of theories or models.
First, you have grammars. A grammar, in this sense, is a model of an individual natural language: what sorts of utterances occur in that language? When are they used and what do they mean? Even assembling this sort of model in full is a Herculean task, but we are fairly successful at modeling sub-systems of individual languages: what sounds occur in the language, and how may they be ordered and combined?—this is phonology. What strings of words occur in the language, and what strings don't, irrespective of what they mean?—this is syntax. Characterizing these things, for a particular language, is largely tractable. A grammar (a model of the utterances of a single language) is falsified if it predicts utterances that do not occur, or fails to predict utterances that do occur. These situations are called "overgeneration" and "undergeneration", respectively. One of the advantages linguistics has as a science is that we have both massive corpora of observational data (text that people have written, databases of recorded phone calls), and access to cheap and easy experimental data (you can ask people to say things in the target language—you have to be a bit careful about how you do this—and see if what they say accords with your model). We have to make some spherical cow type assumptions, we have to "ignore friction" sometimes (friction is most often what the Chomskyans call "performance error", which you do not have to be a Chomskyan to believe in, but I digress). In any case, this lets us build robust, useful, highly predictive, and falsifiable, although necessarily incomplete, models of individual natural languages. These are called descriptive grammars.
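(To make "overgeneration" and "undergeneration" concrete, here's a minimal sketch in Python. The three-sentence "corpus", the toy lexicon, and the grammar itself are all invented for illustration—real descriptive grammars are vastly richer—but the falsification logic is the same: the grammar is in trouble if it accepts unattested strings or rejects attested ones.)

# A "grammar" here is just a predicate over strings. It overgenerates if it
# accepts strings that don't occur in the (attested) corpus, and undergenerates
# if it rejects strings that do occur.

attested = {"the cat sleeps", "the dog sleeps", "the cat sees the dog"}

def toy_grammar(sentence: str) -> bool:
    """Accept 'the N V' and 'the N V the N' over a tiny lexicon."""
    nouns, verbs = {"cat", "dog"}, {"sleeps", "sees"}
    w = sentence.split()
    if len(w) == 3:
        return w[0] == "the" and w[1] in nouns and w[2] in verbs
    if len(w) == 5:
        return (w[0] == "the" and w[1] in nouns and w[2] in verbs
                and w[3] == "the" and w[4] in nouns)
    return False

candidates = attested | {"the cat sleeps the dog", "dog the sees"}
overgenerated = {s for s in candidates if toy_grammar(s) and s not in attested}
undergenerated = {s for s in attested if not toy_grammar(s)}
print(overgenerated)   # {'the cat sleeps the dog'} -> too permissive: it ignores transitivity
print(undergenerated)  # set() -> nothing attested is missed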
Descriptive grammars often have a strong formal component—Chomsky, for all his faults, recognized that both phonology and syntax could be well described by formal grammars in the sense of mathematics and computer science, and these tools have been tremendously productive since the 60s in producing good models of natural language. I believe Chomsky's program sensu stricto is a dead end, but the basic insight that human language can be thought about formally in this way has been extremely useful and has transformed the field for the better. Read any descriptive grammar, of a language from Europe or Papua or the Amazon, and you will see (in linguists' own idiosyncratic notation) a flurry of regexes and syntax trees (this is a bit unfair—the computer scientists stole syntax trees from us, also via Chomsky) and string rewrite rules and so on and so forth. Some of this preceded Chomsky but more than anyone else he gave it legs.
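(For a taste of what those rewrite rules look like in code: here's a stock phonology-textbook rule—final obstruent devoicing, roughly what German does—written as a regex substitution. The transcriptions and example words are just illustrative; a linguist would write the rule as something like "[-sonorant] → [-voice] / _ #".)

import re

# Final obstruent devoicing: b, d, g, v, z -> p, t, k, f, s at the end of a word.
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}

def final_devoicing(form: str) -> str:
    # (?=\s|$) is the word boundary ("_ #" in rule notation), matched without consuming it
    return re.sub(r"[bdgvz](?=\s|$)", lambda m: DEVOICE[m.group(0)], form)

print(final_devoicing("hund"))         # 'hunt'   (word-final /d/ surfaces as [t])
print(final_devoicing("hunde"))        # 'hunde'  (no change word-internally)
print(final_devoicing("tag und tag"))  # 'tak unt tak'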
Anyway, linguists are also interested in another kind of model, which confusingly enough we call simply a "theory". So you have "grammars", which are theories of individual natural languages, and you have "theories", which are theories of grammars. A linguistic theory is a model which predicts what sorts of grammar are possible for a human language to have. This generally comes in the form of making claims about
(1) the structure of the cognitive faculty for language, and its limitations
(2) the pathways by which language evolves over time, and the grammars that are therefore attractors and repellers in this dynamical system.
Both of these avenues of research have seen some limited success, but linguistics as a field is far worse at producing theories of this sort than it is at producing grammars.
Capital-G Generativism, Chomsky's program, is one such attempt to produce a theory of human language, and it has not worked very well at all. Chomsky's adherents will say it has worked very well—they are wrong and everybody else thinks they are very wrong, but Chomsky has more clout in linguistics than anyone else so they get to publish in serious journals and whatnot. For an analogy that will be familiar to physics people: Chomskyans are string theorists. And they have discovered some stuff! We know about wh-islands thanks to Generativism, and we probably would not have discovered them otherwise. Wh-islands are weird! (Roughly: certain embedded clauses block question words from moving out of them, which is why "What did you wonder whether John bought?" sounds so much worse than "What did you think John bought?") It's a good thing the Chomskyans found wh-islands, and a few other bits and pieces like that. But Generativism as a program has, I believe, hit a dead end and will not be recovering.
Right, Generativism is sort of, kind of attempting to do (1), poorly. There are other people attempting to do (1) more robustly, but I don't know much about it. It's probably important. For my own part I think (2) has a lot of promise, because we already have a fairly detailed understanding of how language changes over time, at least as regards phonology. Some people are already working on this sort of program, and there's a lot of work left to be done, but I do think it's promising.
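(A sketch of what I mean by treating language change as a dynamical system, in the same spirit as the earlier snippets—the sound changes and the proto-words below are invented, but the shape of the thing is real: a language state is, say, a lexicon, each sound change is a function from states to states, and chaining them gives you trajectories through grammar-space. Some configurations keep getting re-created by common changes; others, once lost, are hard to reach again.)

import re

def lenite_intervocalic(lexicon):
    # p, t, k -> b, d, g between vowels
    table = str.maketrans("ptk", "bdg")
    return [re.sub(r"(?<=[aeiou])[ptk](?=[aeiou])",
                   lambda m: m.group(0).translate(table), word)
            for word in lexicon]

def drop_final_vowel(lexicon):
    return [re.sub(r"[aeiou]$", "", word) for word in lexicon]

state = ["kapo", "lati", "puka"]
for change in (lenite_intervocalic, drop_final_vowel):
    state = change(state)
    print(state)
# ['kabo', 'ladi', 'puga']
# ['kab', 'lad', 'pug']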
Someone said to me, recently-ish, that the success of LLMs spells doom for descriptive linguistics. "Look, that model does better than any of your grammars of English at producing English sentences! You've been thoroughly outclassed!". But I don't think this is true at all. Linguists aren't confused about which English sentences are valid—many of us are native English speakers, and could simply tell you ourselves without the help of an LLM. We're confused about why. We're trying to distill the patterns of English grammar, known implicitly to every English speaker, into explicit rules that tell us something explanatory about how English works. An LLM is basically just another English speaker we can query for data, except worse, because instead of a human mind speaking a human language (our object of study) it's a simulacrum of such.
Uh, for another physics analogy: suppose someone came along with a black box, and this black box had within it (by magic) a database of every possible history of the universe. You input a world-state, and it returns a list of all the future histories that could follow on from this world state. If the universe is deterministic, there should only be one of them; if not maybe there are multiple. If the universe is probabilistic, suppose the machine also gives you a probability for each future history. If you input the state of a local patch of spacetime, the machine gives you all histories in which that local patch exists and how they evolve.
Now, given this machine, I've got a theory of everything for you. My theory is: whatever the machine says is going to happen at time t is what will happen at time t. Now, I don't doubt that that's a very useful thing! Most physicists would probably love to have this machine! But I do not think my theory of everything, despite being extremely predictive, is a very good one. Why? Because it doesn't tell you anything, it doesn't identify any patterns in the way the natural world works, it just says "ask the black box and then believe it". Well, sure. But then you might get curious and want to ask: are there patterns in the black box's answers? Are there human-comprehensible rules which seem to characterize its output? Can I figure out what those are? And then, presto, you're doing good old regular physics again, as if you didn't even have the black box. The black box is just a way to run experiments faster and cheaper, to get at what you really want to know.
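(In code, the distinction looks something like this—everything below is made up for the sake of the analogy. The "theory of everything" is just deferral to the oracle; doing physics with the oracle means querying it and then hunting for a compact rule that explains its answers.)

import random

def black_box(state: float) -> float:
    """The oracle: given a world-state (here just a number), return the next one.
    Internally it follows a simple law, but we pretend not to know that."""
    return 2.5 * state

def theory_of_everything(state: float) -> float:
    return black_box(state)  # "whatever the box says will happen is what will happen"

# Doing actual physics: query the box, then look for a pattern. Here we guess
# the law is linear through the origin and recover its one parameter by least squares.
states = [random.uniform(0.0, 10.0) for _ in range(100)]
pairs = [(s, black_box(s)) for s in states]
k = sum(s * y for s, y in pairs) / sum(s * s for s, _ in pairs)
print(f"recovered law: next_state ≈ {k:.2f} * state")  # ≈ 2.50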
General Relativity, even though it has singularities, and it's incompatible with Quantum Mechanics, is better as a theory of physics than my black box theory of everything, because it actually identifies patterns, it gives you some insight into how the natural world behaves, in a way that you, a human, can understand.
In linguistics, we're in a similar situation with LLMs, only LLMs are a lot worse than the black box I've described—they still mess up and give weird answers from time to time. And more importantly, we already have a linguistic black box, we have billions of them: they're called human native speakers, and you can find one in your local corner store or dry cleaner. Querying the black box and trying to find patterns is what linguistics already is, that's what linguists do, and having another, less accurate black box does very little for us.
Now, there is one advantage that LLMs have. You can do interpretability research on LLMs, and figure out how they are doing what they are doing. Linguists and ML researchers are kind of in a similar boat here. In linguistics, well, we already all know how to talk, we just don't know how we know how to talk. In ML, you have these models that are very successful, but you don't know why they work so well, how they're doing it. We have our own version of interpretability research, which is neuroscience and neurolinguistics. And ML researchers have interpretability research for LLMs, and it's very possible theirs progresses faster than ours! Now with the caveat that we can't expect LLMs to work just like the human brain, and we can't expect the internal grammar of a language inside an LLM to be identical to the one used implicitly by the human mind to produce native-speaker utterances, we still might get useful insights out of proper scrutiny of the innards of an LLM that speaks English very well. That's certainly possible!
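(For the curious: one common flavour of that interpretability work is "probing"—train a tiny classifier on a model's hidden activations and see whether some linguistic distinction is linearly recoverable from them. Here's a rough sketch, assuming the Hugging Face transformers library and scikit-learn; the noun-vs-verb task, the layer choice, and the toy sentences are all mine, purely for illustration.)

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

examples = [  # (sentence, index of the word of interest, label)
    ("the dog barks", 1, "noun"), ("the dog barks", 2, "verb"),
    ("a cat sleeps", 1, "noun"), ("a cat sleeps", 2, "verb"),
    ("my friend sings", 1, "noun"), ("my friend sings", 2, "verb"),
]

feats, labels = [], []
for sentence, word_idx, label in examples:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[6][0]  # a middle layer
    # crude alignment: take the activation of the first subword of the target word
    # (serious probing work aligns words to subword tokens much more carefully)
    feats.append(hidden[word_idx].numpy())
    labels.append(label)

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.score(feats, labels))  # how well a linear probe separates noun from verb here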
But just having the LLM, does that make the work of descriptive linguistics obsolete? No, obviously not. To say so completely misunderstands what we are trying to do.
26 notes
lady-inkyrius · 29 minutes ago
Photo
polypore
5K notes
lady-inkyrius · 2 hours ago
Text
Laios....
26K notes
lady-inkyrius · 5 hours ago
Text
i love her so much
23 notes
lady-inkyrius · 8 hours ago
Text
i know the normal amount of stuff about everything
180 notes
lady-inkyrius · 8 hours ago
Text
getting unzipped has to feel so good for the compressed file
9K notes
lady-inkyrius · 1 day ago
Text
5K notes
lady-inkyrius · 1 day ago
Text
6K notes
lady-inkyrius · 1 day ago
Text
Henri Rivière (1864-1951), Les Aspects de la nature (1897-1899)
2K notes
lady-inkyrius · 1 day ago
Text
the problem with dyson spheres is they only work during the daytime
2K notes
lady-inkyrius · 2 days ago
Text
ok so i'm very much not a story/character-generator. making up a guy and then thinking about that guy is not really a thing that i do. and relatedly making up stories and thinking about or writing down or whatever those stories is not a thing i do. i might be able to train to be able to do it? it feels pretty alien to my mind. i like reading stories! but generating stories doesnt feel like what my brain is designed to output. my brain outputs statements and questions.
story-generators seem very common on tumblr but id also suspect tumblr selects for them, so im not sure about the genpop. nonetheless im curious how common this is. i feel like it should be fairly obvious to one's self whether youre a story generator, but here's some prompts to consider: do you often daydream, about fictional people? have you written a piece of narrative fiction, or fanfiction? do you have an "original character" who you often think about? if the answer to any of these is yes, i would say you are a story/character-generator
236 notes
lady-inkyrius · 3 days ago
Text
Apparently at high altitudes it snows galena and bismuthinite on Venus.
80 notes
lady-inkyrius · 3 days ago
Text
[youtube embed]
Came across this yesterday and had to immediately listen through it like four times in a row varying my focus between the music, the translation, and the translator's notes. ni li epiku. ("this is epic," in toki pona.)
6 notes
lady-inkyrius · 3 days ago
Text
Little Ghost, what have you done?
2K notes
lady-inkyrius · 3 days ago
Text
OK so my shitpost R&D department was researching the viability of a jocular analogy between national language regulators, war rationing, and soviet bread lines. This isn't a viable product right now so you'll have to just kind of imagine that it's funny, but the idea is, like, people are running out of words because they offshored development and then a war footing devastated international trade, so now there aren't enough words to go around and the government is publishing all these posters encouraging people not to waste them. The government has stepped in to nationalize word production and distribution but because all the best words are going to the Posters on the war front, the public has to spend hours in line just to get a random selection of words that they can hardly use. People have to find a way to smuggle in illegal foreign words or rely on unsafe home-brewed vocabulary while repurposing all the new words for munitions and war strategy to talk about groceries and romance. Barter dominates, especially in the provinces, as people try to scrounge together a functional vocabulary to educate their children.
Anyway I'm dropping it because I realized that while this is hard to make into a good joke, it would actually be a fantastic strategy/puzzle game. Someone go make that!
107 notes
lady-inkyrius · 4 days ago
Text
soweli, grazing (tux paint, 2024)
777 notes
lady-inkyrius · 4 days ago
Text
the most fun a boy can have is taking the contrapositive. if taking the contrapositive is illegal then you can lock me up
286 notes