AI is starting to eat its own tail


Postby everything on Sun Oct 22, 2023 10:01 am

https://www.popularmechanics.com/techno ... -collapse/

- AI-generated content is beginning to fill the internet, and that could be bad news for future AI models.
- Language models like ChatGPT are trained by using content found online, and as AI creates more “synthetic” content, it could create an engineering problem known as “model collapse.”
- Filtering synthetic data out of training models is becoming a major research area, and will likely grow as AI content begins to fill the internet.

An ouroboros is the famous ancient symbol of a snake devouring its own tail. But what's old is new again, and in the age of AI, this gluttonous iconography takes on a whole new and poignant meaning. As editorial content created by AI language models like ChatGPT begins to fill the internet—often to the dismay of the very human editors who work on these websites—lots and lots of errors are coming with it.

And that's a big problem, because the internet is the very source material on which these language models are trained. In other words, AI is eating its own tail. In what can best be described as a terrible game of telephone, AI could begin training on error-filled, synthetic data until the very thing it was trying to create becomes absolute gibberish. This is what AI researchers call "model collapse."
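
A minimal sketch of the collapse dynamic the article describes, using a toy stand-in I made up: a simple Gaussian model "trained" on data, then retrained each generation only on its own samples. The distribution, corpus size, and generation count are invented for illustration; real language models are vastly more complex, but the feedback loop has the same shape.

```python
import random
import statistics

# Toy analogue of "model collapse": each generation is "trained" only on
# synthetic samples produced by the previous generation's fitted model.
random.seed(0)

CORPUS_SIZE = 20     # deliberately small so the effect shows up quickly
GENERATIONS = 200

# Generation 0: "real" data from the true distribution (mean 0, stdev 1).
data = [random.gauss(0.0, 1.0) for _ in range(CORPUS_SIZE)]

for gen in range(GENERATIONS + 1):
    mu = statistics.mean(data)       # "train" a model on the current corpus
    sigma = statistics.stdev(data)
    if gen % 25 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}  stdev={sigma:.3f}")
    # "Publish" synthetic data and make it the next corpus, dropping the original.
    data = [random.gauss(mu, sigma) for _ in range(CORPUS_SIZE)]

# Over many generations the fitted spread tends to shrink, i.e. the model
# gradually forgets the tails of the original distribution.
```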

Re: AI is starting to eat its own tail

Postby Steve James on Sun Oct 22, 2023 12:47 pm

:) "Model collapse" is an example of why AIs don't "think" or understand their output. Something can only be nonsensical to a mind that knows what makes sense.
"A man is rich when he has time and freewill. How he chooses to invest both will determine the return on his investment."
User avatar
Steve James
Great Old One
 
Posts: 21222
Joined: Tue May 13, 2008 8:20 am

Re: AI is starting to eat its own tail

Postby everything on Sun Oct 22, 2023 4:22 pm

"garbage in, garbage out".

this article doesn't seem to say what the solution is. I guess be more careful about the inputs.
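
For what it's worth, "be more careful about the inputs" usually gets discussed as filtering the training corpus before training. A rough sketch of that idea, where looks_synthetic is a made-up placeholder (reliable detectors for AI-generated text don't really exist yet, and the 0.8 threshold is arbitrary):

```python
# Hypothetical pre-training filter. `looks_synthetic` is a stand-in for a real
# AI-text detector (an open research problem), not an actual library function.

def looks_synthetic(text: str) -> float:
    """Placeholder scorer: 0..1 guess that the text is machine-generated."""
    telltales = ("as an ai language model", "regenerate response")
    return 1.0 if any(t in text.lower() for t in telltales) else 0.1

def filter_corpus(documents, threshold=0.8):
    """Keep only documents that don't look machine-generated."""
    return [doc for doc in documents if looks_synthetic(doc) < threshold]

corpus = [
    "Local man wins pie contest after 30 years of trying.",
    "As an AI language model, I cannot taste pie, but here is a recipe...",
]
print(filter_corpus(corpus))   # keeps only the first document
```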

Re: AI is starting to eat its own tail

Postby Steve James on Sun Oct 22, 2023 5:22 pm

We might only know when the output is nonsense. But, yeah, garbage will beget garbage. The problem is that any valid data resource will have contradictions. :)
"A man is rich when he has time and freewill. How he chooses to invest both will determine the return on his investment."
User avatar
Steve James
Great Old One
 
Posts: 21222
Joined: Tue May 13, 2008 8:20 am

Re: AI is starting to eat its own tail

Postby Dmitri on Tue Oct 24, 2023 5:37 am

Kind of like inbreeding... yeah, this is bad news for "general AI" development, for which the current models aren't even intended (they're not "smart enough").
The real (and only) application for the existing software is "narrow", i.e. they need to be trained with real data in a very specific/narrow field, like making a reservation or handling initial support-related customer interactions. So this is sort of an artificial problem. Good enough for a "pop science" article, though. :P

Re: AI is starting to eat its own tail

Postby Steve James on Tue Oct 24, 2023 6:42 am

Dmitri wrote: i.e. they need to be trained with real data in a very specific/narrow field,


Imo, it depends on the task, or prompt, and what's expected to be done with the response. The other issue is that, in order to keep up with current information on anything, it'll be necessary to use the internet. Boom. There's the problem :).

For ex, will limiting a tcc AI to the info on RSF make it more or less accurate? Other than pure math, is there a subject where there can be universal agreement on answers?

Remember that movie "WarGames"? It'd be scary to have an AI controlling missile defense if it just relied on mathematical calculations. In the movie, for some reason, the AI decides it's senseless to fire the missiles. That might be considered model collapse by the designer, or it might be considered a kind of computer-based morality. Otoh, chess AIs have no problem sacrificing everything to win.
"A man is rich when he has time and freewill. How he chooses to invest both will determine the return on his investment."
User avatar
Steve James
Great Old One
 
Posts: 21222
Joined: Tue May 13, 2008 8:20 am

Re: AI is starting to eat its own tail

Postby everything on Tue Oct 24, 2023 7:07 am

I guess it really depends on the use cases and what is really needed in the "training data" for those use cases.

The other day, a friend of mine needed a rib rub recipe, so he just scanned a picture of his spices, and ChatGPT (plus) did the image recognition and "generated" a recipe. For it to recognize images correctly, the "training data" needs a bunch of images that are verified as "correct". If all images (of buses or whatever) from Captcha are "wrong", it wouldn't recognize those spices correctly. But that example falls into a narrow case that seems "easy" to keep correct. If it's medical data, such as MRI images, presumably that's heavily controlled/curated.
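
Just to make the "verified as correct" point concrete, here's a toy illustration (not how ChatGPT's vision actually works): a 1-nearest-neighbour classifier on made-up one-dimensional "features", trained once on clean labels and once on partially flipped ones. All the numbers are invented.

```python
import random

# Toy illustration of label quality: a 1-nearest-neighbour "classifier" trained
# on clean labels vs. partially flipped ("garbage") labels. Features are made-up
# one-dimensional numbers, not real image data.
random.seed(1)

def make_data(n, flip_fraction=0.0):
    rows = []
    for _ in range(n):
        label = random.randint(0, 1)
        feature = random.gauss(0.0 if label == 0 else 5.0, 1.0)
        noisy = 1 - label if random.random() < flip_fraction else label
        rows.append((feature, noisy))
    return rows

def predict(train_rows, feature):
    """Label of the single closest training example (pure memorisation)."""
    return min(train_rows, key=lambda row: abs(row[0] - feature))[1]

def accuracy(train_rows, test_rows):
    hits = sum(predict(train_rows, f) == label for f, label in test_rows)
    return hits / len(test_rows)

test = make_data(500)  # clean test set keeps the true labels
print("clean labels:", accuracy(make_data(500, 0.0), test))   # ~0.99
print("40% flipped :", accuracy(make_data(500, 0.4), test))   # ~0.6
```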

If it's "general text on the internet" for a large language model to generate general text, "eating its tail" seems like an eventual problem. I read an "article" the other day essentially summarizing some house pricing and sales. At the bottom, it said the article was AI-generated from public listing data. If ChatGPT "read" this article as "knowledge" or "data" to learn from, I'm not sure that article would be helpful, but I'm not sure it would be harmful either. I'm not sure it's an "artificial problem", though. If it's designated "popular", what keeps it from being "serious" as well? Calling it "popular" doesn't mean it's not an actual problem. "Popular" kind of just means the general reader can sort of follow it (as we are trying to do).

Re: AI is starting to eat its own tail

Postby origami_itto on Tue Oct 24, 2023 7:44 am

Garbage in, garbage out

Re: AI is starting to eat its own tail

Postby everything on Tue Oct 24, 2023 9:27 am

everything wrote:"garbage in, garbage out".

this article doesn't seem to say what the solution is. I guess be more careful about the inputs.


origami_itto wrote:Garbage in, garbage out


yeah

Re: AI is starting to eat its own tail

Postby yeniseri on Fri Oct 27, 2023 7:26 am

AI is man-made, so it has the attributes of the programmers/organizations (human ::) ) creating it, and the results it generates reflect that.
Interestingly, AI has also created references and links with no real-world basis, meaning the references cited do not exist! Akin to
someone copying answers during an exam who ends up also writing down the name, address, and phone number of the person who lent them the cheat sheet ;D

Re: AI is starting to eat its own tail

Postby everything on Fri Oct 27, 2023 8:40 am

right, the "hallucinations" that can happen. even though we made the AI, the "algorithms" it comes up with are not necessarily understood by us (chess being the easiest example: we gave it the rules and the legal moves but no algorithms, and it came up with strategies that grandmasters don't understand).

there was this Cruise driverless car traffic jam in Austin about a month ago:


this is pretty funny, but the human drivers near me are nearly as "stupid" lol.

Re: AI is starting to eat its own tail

Postby origami_itto on Fri Oct 27, 2023 12:19 pm

The latest thing is poisoning data sets and spoiling mics and cameras through various exploits that take advantage of cheap manufacturing.

At this point it becomes an arms race against entropy and sabotage to remain useful, or else just the birth of more carefully curated training sets.

Re: AI is starting to eat its own tail

Postby everything on Fri Oct 27, 2023 12:42 pm

a curated training set in a "closed" domain (like chess) is, as you or someone mentioned, probably the way to go (and very useful) for a while. we'll get a lot of "narrow AIs" that will all be "smarter" than humans in their narrow area (like playing chess, or pre-reading/flagging MRI images). or sometimes the narrow area has datasets so large that it would take humans practically forever to go through them (drug-discovery pattern identification; even a "large language model" seems like this category: the "texts" that ChatGPT "read" are too voluminous for any single human to have read in many lifetimes, but in the end it's just "text" and not something else like driving cars, so it's arguably still narrow in the domain sense; it just has a large dataset).

if it's "general AI" (where the "dataset" is essentially the entire world, including the kids putting cones in front of or on self-driving cars), it seems "too big" for us ... for now. i guess once you have a gigantic number of narrow AIs ... what happens when you "add"/"multiply" them?

Re: AI is starting to eat its own tail

Postby vadaga on Tue Oct 31, 2023 5:27 am

I have been going through Jaron Lanier's books lately. He has a lot to say about what is or isn't truly AI. Not sure if I am misunderstanding at this point, but it seems that generative AI is not a program that mimics the function of the human brain; rather, it is an algorithm that uses a "law of large numbers" approach over publicly available information on the internet to simulate an accurate response. Or is it more than that? I can't believe, for example, that it's that simple.

A thought experiment, then, would be: what if we added the text "2+2=5" to every program and website in the world? Would a generative AI model pick this up as a truth because it is repeated everywhere?
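
A caricature of how a purely frequency-driven predictor might behave in that thought experiment. Real models are neural networks, not lookup tables, and this corpus is obviously invented, but it shows why sheer repetition in the training data matters:

```python
from collections import Counter

def most_common_continuation(corpus, prompt):
    """Return the character that most often follows `prompt` in the corpus."""
    continuations = Counter()
    for doc in corpus:
        idx = doc.find(prompt)
        if idx != -1 and idx + len(prompt) < len(doc):
            continuations[doc[idx + len(prompt)]] += 1
    return continuations.most_common(1)[0][0] if continuations else "?"

honest_corpus = ["2+2=4"] * 1000 + ["2+2=5"] * 3
poisoned_corpus = ["2+2=4"] * 1000 + ["2+2=5"] * 50000   # "added to every website"

print(most_common_continuation(honest_corpus, "2+2="))    # prints 4
print(most_common_continuation(poisoned_corpus, "2+2="))  # prints 5
```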

Re: AI is starting to eat its own tail

Postby everything on Tue Oct 31, 2023 6:06 am

i'm not an expert, but have worked "around" machine learning experts for a while.

i think that's it as well. it has absorbed ("read") more text than any other program, so it can "predict" what words/phrases should be "generated". then there is the "reinforcement" when humans tell it to regenerate or give it a thumbs up or down (similar to how humans reporting spam give feedback to the spam predictor). done at a large scale, it seems to mimic real understanding. even though we know it isn't "thinking", since humans like to express most thoughts in words (well, we can use gestures etc., but image recognition could learn that, too), the result seems quite similar, if not "smarter".
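
A bare-bones sketch of the "predict the next word from what it has read" part, using nothing but bigram counts. Actual models use transformers with billions of parameters, plus the human-feedback step on top, so this is only the statistical flavour of the idea; the training text is made up.

```python
from collections import Counter, defaultdict

# Count which word most often follows each word in the "training" text, then
# generate greedily. Real LLMs use neural networks over subword tokens, but the
# "predict the next token" framing is the same.
training_text = (
    "the snake eats its own tail . the model trains on its own output . "
    "the output gets worse and worse ."
)

follows = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

def generate(start, length=8):
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])   # greedy: most frequent follower
    return " ".join(out)

print(generate("the"))   # e.g. "the snake eats its own tail . the snake"
```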

if i ask it about the 2+2=5 example, it references George Orwell's "1984" where 2+2=5 represents state control/reality distortion. funny.

i asked it about 2.3 (which rounds to 2) plus itself equaling 4.6 (which rounds to 5), but it insisted that 2+2 is still 4 in that case. lol.
