
walugus-grudenburg · 11 months ago


I'm hoping megacorps stop using shitty mass-gathered data for their ML algorithms. (Machine Learning, sometimes known as AI, which I will use for brevity even though it's a poor choice of words: while perfectly Artificial, it is functionally very different from Intelligence.) The current trend of unlabeled, zero-QA datasets is horrid and often causes severe stupidity (use Google Docs or similar and you'll see what I mean). It is extraordinarily expensive to get curated, quality-tested datasets that you own to train an AI on. But it not only solves 99% of the moral issues with AI (if you own what it's trained on, the "is it stealing" debate goes from a very subjective and contentious battle to pretty much vanishing entirely!), it also increases the quality to an incredible degree (though not necessarily a cost-effective one)! Now, I'm no machine learning scientist or businessperson, but surely at some point going that route is worth it to these companies just to get the courts off their backs, right? Sure, it's immensely expensive, but they're megacorps. They have the funds. They already spend so much on compute for these, surely they can afford some big data. (An additional benefit: since the data is better, it won't take as much of it, so less compute for the same quality. This helps decrease long-term costs some (though unfortunately not by as much as it costs to build the datasets), and it also helps the environment some by using less power.)

#(decided to turn that rb addition I made into a full blown post too) #(anyone who knows their stuff more than some hobbyist (me) feel free to correct any of this if it's wrong btw.) #I am firmly convinced that proper data that's legal is the next step in not just AI effectiveness but also making AI moral and legal #currently it kinda sucks at all three. but this would help a metric ton with the latter and at least a little with the former #notably the AIs that try a little bit of this approach (while still stealing the other 95% of the data) are by far the most effective #(those are GPTs. especially Chat) #AI #machine learning


walugus-grudenburg · 11 months ago


This is an example of why unregulated mass-fed (scraped or user-generated) data is BAD for making AIs! Big corporations, sweetie, you're poisoning them! These unlabeled, zero-QA datasets are horrid and create messes like these! It is extraordinarily expensive to get curated, quality-tested datasets that you own to train an AI on. But it not only solves 99% of the moral issues with AI (if you own what it's trained on, the "is it stealing" debate goes from a very subjective and contentious battle to vanishing entirely!), it also increases the quality to an incredible degree! Now, I'm no machine learning scientist or businessperson, but surely at some point going that route is worth it to these companies just to get the courts off their backs, right? Sure, it's immensely expensive, but they're megacorps. (An additional benefit: since the data is better, it won't take as much of it, so less compute for the same quality. This helps decrease long-term costs some (though unfortunately not by as much as it costs to build the datasets), and it also helps the environment some by using less power.)

googledocs you are getting awfully uppity for something that can’t differentiate between “its” and “it’s” correctly

#I am firmly convinced that proper data that's legal is the next step in not just AI effectiveness but also making AI moral and legal #currently it kinda sucks at all three. but this would help a metric ton #notably the AIs that try a little bit of this approach (while still stealing the other 95% of the data) are by far the most effective #(those are GPTs. especially Chat)
