#i think openai is shady af but i am not convinced there is anything truly illegal about how chatgpt was created

aftokrator-official · 1 year ago

Screenshotting because it's not rebloggable anymore (omitting OP's name for that reason), but this post, and more specifically the replies to it, has been bothering me since I saw it a day or so ago, and I finally decided to come back and engage lmao. (Source link from OP)

OP's totally right. OpenAI is a garbage company with garbage business practices but this is not the way to do this, people. I'm glad y'all have so much faith in the legal system here, but I don't, and if this goes through it's not going to harm the entities you want it to harm. believe me.

(i am not against AI when used ethically but i think that is a moot point here bc i do not believe OpenAI is an ethical developer of AI tech. anyway)

Here's the "rebuttal" that has been irritating me so much I couldn't leave it alone:

THIS IS NOT REMOTELY TRUE.

First of all, unless I've missed something big, OpenAI has never disclosed the contents of the proprietary dataset they use to train their LLMs. We have no idea whether ChatGPT was trained on George RR Martin's books or not. Presumably, neither does George RR Martin. So all we have to go on is that ChatGPT "knows" characters and details from ASOIAF. Okay.

The problem is that ASOIAF is a massively popular series with some massively popular multimedia adaptations and spinoffs, and processing the text of the novels is far from the only way ChatGPT could have learned to produce those details.

Let's try a little experiment.

GPT-2 is an open-source model that OpenAI released back in 2019. It works more or less the same way as ChatGPT and its ilk, just on a vastly smaller scale. It's much, much more limited, but the underlying algorithms work on the same concepts. So, what would GPT-2 give us if we ask it for a summary of a hypothetical GRRM novel?

I typed up the first paragraph here, and GPT-2 gave me the rest. (GPT-2 isn't a chatbot, but works more like autocomplete, so I didn't prompt it directly.)
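For anyone who wants to try this kind of autocomplete-style prompting themselves, here is a minimal sketch. It assumes the Hugging Face `transformers` library and its public `gpt2` checkpoint, which is not necessarily the setup used for this post; the function name `continue_text` and the sampling settings are just illustrative choices.

```python
def continue_text(prompt, generator=None, max_new_tokens=150):
    """Ask a text-generation model to continue `prompt` and return only the new text.

    GPT-2 is not a chatbot: you hand it the start of a document and it
    predicts what comes next, so the "prompt" is simply an opening paragraph.
    """
    if generator is None:
        # Assumption: Hugging Face transformers is installed and can fetch
        # the public "gpt2" weights. Imported lazily so the function can
        # also be used with any other generator that has the same shape.
        from transformers import pipeline
        generator = pipeline("text-generation", model="gpt2")

    # Sampling is enabled so repeated runs give different continuations.
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    full_text = outputs[0]["generated_text"]

    # By default the pipeline returns the prompt plus the continuation;
    # strip the prompt off when it is present.
    if full_text.startswith(prompt):
        return full_text[len(prompt):]
    return full_text
```

Calling `continue_text("...your own opening paragraph...")` should return a different continuation each time, since the output is sampled rather than deterministic.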

Okay, I have a feeling that plot doesn't make much sense as a prequel, but hey, it's AI (and an elderly one at that), it's not going to be particularly good at this left to its own devices. And look, it DID pull out a few details specific to ASOIAF - Jon Snow, the Night's Watch, the Wildlings. So case closed, right? GPT-2 must have had ASOIAF novels in its training data too, just like its nasty little great-grandchild.

Except we know what was in GPT-2's dataset - it was trained on a 40GB corpus of text scraped from publicly available web pages, specifically pages linked from Reddit. We don't have all the exact texts that were used, but we DO have the top 1000 domains contained in the dataset. All of which is a hell of a lot more information than we have on ChatGPT.

Which websites are ranked #75 and #160 in the list of 1000 domains? Why, it's Fanfiction.net and AO3. Hmm, I wonder where it learned about the very popular fictional characters from George RR Martin's novels! (Certainly not just from fanfiction, either - sites like IMDb and Wikia were ranked much higher in the list of sources, and entertainment news and fan wiki articles would also contain a lot of text about ASOIAF/GoT.)
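If you want to check a ranked domain list like that yourself, the lookup is simple. This is a rough sketch that assumes the list has already been loaded as one domain per entry, in rank order; the helper name `domain_ranks` and the tiny sample list are made up for illustration and are NOT the real GPT-2 rankings.

```python
def domain_ranks(ranked_domains, wanted):
    """Return {domain: 1-based rank} for each wanted domain that appears in the list."""
    positions = {}
    for rank, domain in enumerate(ranked_domains, start=1):
        key = domain.strip().lower()
        # Keep the first (best) rank if a domain somehow appears twice.
        positions.setdefault(key, rank)
    return {d: positions[d.lower()] for d in wanted if d.lower() in positions}


# Placeholder data only: four entries standing in for the real 1000-domain list.
placeholder_list = [
    "example-news.com",
    "fanfiction.net",
    "example-wiki.org",
    "archiveofourown.org",
]

ranks = domain_ranks(placeholder_list, ["fanfiction.net", "archiveofourown.org"])
```

With the real list loaded in place of `placeholder_list`, the same call would show where the fanfiction archives actually sit.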

You can certainly argue that using these websites as training data is also unethical or should be illegal - but that's not what's being argued in this lawsuit. As far as I know, ChatGPT has never spat out a perfect recreation (or even a vaguely paraphrased recreation) of any of GRRM's writing, so the only evidence for violation of his copyright in this case is the generation of what is essentially a machine-created derivative work. Treating that as infringement is really, really worrying, even if you don't think the machine should be allowed to do it. I'm not a lawyer, just a fanfic writer and software developer, so I have no idea how legitimate the legal argument here is... but it's going down a road that is very dangerous for fandom, whether you believe it is or not.

#ai #discourse #ugh #not going into my thoughts on the ethics of ai training GENERALLY here but #i think openai is shady af but i am not convinced there is anything truly illegal about how chatgpt was created #or any way to codify that into law without hurting far more people than it protects #anyway. #going to go back to trying my darnedest to ignore the ai discourse now it stresses me tf out
