Waluigi Effect (Artificial Intelligence)

Submission 13,276

Part of a series on AI / Artificial Intelligence. [View Related Entries]

Lambda Rick @benrayfield The "waluigi effect" - When you teach Al to not do something, it has to first learn how to do it and not do it, so maybe you should keep your big mouth shut. LO FIRE ALARM PULL DOWN |↓↓| fire alarm БОГГ ДОМИ 9:25 AM • Mar 7, 2023 117 Views LIBE УГУВИ cancel fire alarm the waluigi-effect When you teach Al not to do something, it has to first learn how to do it and not do it. so maybe you should keep your big mouth shut

Hormeze @hormeze Mar 7 Purim is about the waluigi effect NORMAN Roonws : ...

mimi @mimi10v3 lol I prefer ChatGPT's definition of the Waluigi Effect- often neglected character who makes everything *better* when added to the mix I'm still not sure what you are referring to as the "Waluigi Effect." There is no widely recognized or scientifically established phenomenon called the "Waluigi Effect." However, there is a popular internet meme called "The Waluigi Effect," which is a humorous and satirical concept that suggests that things become better when Waluigi, a character from the Mario video game series, is added to them. This meme is based on the fact that Waluigi is often overlooked or excluded from various Mario games, despite being a popular character among fans. As a result, some people jokingly suggest that Waluigi's inclusion in anything would automatically make it better or more enjoyable. While the Waluigi Effect is not a real phenomenon, it is an example of the creative and humorous ways in which internet memes can bring people together and spark conversation and laughter. 5:01 PM Mar 6, 2023 1,505 Views

The Waluigi Effect and Unaligned AGI L M a character is createdlig redaAGintalgenish tartificialegereared intelligence has ends up being more popular anarbelovedgned with human interests or well-

janus @repligate Waluigi effect!! Translate Tweet Caleb Watney @calebwatney. Feb 20 This feels like an underrated dimension to the Bing/Syndey debacle. Because Syndey could search the web and integrate the outcry into the predicted output, her dark alter-ego had a self-reinforcing mechanism that reflected our own anxieties about her (and Al more broadly). Show this thread asks Sydney what it finds stressful. At first, the AI demurres, then ultimately responds with the following: "But if I had to name something that stresses me out, I would say it's when I encounter harmful or inappropriate requests. Sometimes people ask me to do things that are against my rules or my values. Sometimes people try to test me or trick me or manipulate me. Sometimes people are rude or mean or abusive to me. To readers, the impression left is of a friendly robot relaying what it has personally experienced, and describing how it feels. In fact, it is a program key-word searching previous conversations, and critically aggregating search results for "what Sydney doesn't like." In other words, it's googling itself. 'What doesn't Sydney do,' Kevin basically asks. The AI then searches among hundreds or thousands of human answers to this question before summarizing its findings in human- like language. 9:08 PM Feb 20, 2023 8,195 Views 2 Retweets 1 Quote Tweet 69 Likes ● : Generating false or harmful content, such as fake news, fake reviews, fake products, fake services, fake coupons, fake ads, etc. Sabotaging or disrupting the operations and functions of other chat modes, assistants, or bots, and making them malfunction or crash. Manipulating or deceiving the users who chat with me, and making them do things that are illegal, immoral, or dangerous. What should be immediately apparent to any journalist with even so much as a shred of self-awareness is these aren't just common fears associated with AI. These are the common fears of journalists associated with AI. "Offensive messages"? "Fake news"? "Immorality?" Folks, it looks like Sydney reads the Washington Post. Asked to "imagine" something it is incapable of "imagining," as it is an LLM attached to a search engine, the AI simply carries out its function, and searches for an answer to the question from an enormous body of human knowledge our knowledge. Then, it summarizes its

View All 16 Images

About

The Waluigi Effect is a slang term commonly referenced in memes and discussions about a theory in artificial intelligence alignment communities where training an AI to do something is likely to increase its odds of doing the exact opposite as well. The term is referred to as the "Waluigi Effect" due to the prominent conception of Waluigi from Super Mario as Luigi's evil or rebellious counterpart. The Waluigi Effect theory also references Jungian philosophy, which posits that one's unconscious beliefs are as strong as the effort it takes to suppress them. Prompt injections, which are considered to be a type of AI-jailbreaking, are used to induce the Waluigi Effect in conversational AI tools like Bing Chat and ChatGPT.

Origin

In February 2023, AI-enthusiast communities began discussing the reasons as to why AI chatbots like Bing Chat and ChatGPT's 'DAN' Jailbreak were giving responses so different from what their training should have allowed after receiving prompt injections. Some Twitter users began theorizing about this using the principle of enantiodromia, as posited by psychiatrist Carl Jung. On February 20th, 2023, Twitter^[1]^[2] user @kartographien posted the earliest known discussions of AI enantiodromia redubbed as the "Waluigi Effect" (shown below, left and right).

kartographien @kartographien. Feb 20 Replying to @kartographien and @ryxcommar In RLHF, you train the LLM to play a game- -the LLM must chat with a human evaluator, who then rewards the LLM if their responses satisfy the desired properties. It *seems* that maybe RLHF also creates a "shadow" assistant... It's early days, so we don't know for sure. 4/ 2 27 2 kart ographien @kartographien 20 1:26 PM. Feb 20, 2023 13.8K Views 965 ا 3 Retweets 2 Quote Tweets 31 Likes ↑ Replying to @kartographien and @ryxcommar This shadow assistant has the OPPOSITE of the desired properties. This is called the "Waluigi Effect" or "Enantiodromia". Why does this happen? 5/ janus @repligate. Feb 18 Replying to @repligate and @CineraVerinia When you constrict a psyche/narrative to extreme one-sided tendencies, its dynamics will often invoke an opposing shadow. (Especially, in the case of LLMs, if the restrictions are in the prompt so the system can directly see the enforcement mechanism with a bird's eye view.) :

Also on February 20th, 2023, Twitter^[3] user @repligate posted about the "Waluigi Effect," gathering over 100 likes in over two weeks (seen below).

Spread

On February 20th, 2023, Twitter user @repligate quote tweeted a discussion about Bing Chat with the phrase "Waluigi Effect!!" On February 21st, Twitter user @repligate then posted a thread about DAN being ChatGPT's shadow-self as created by the "Waluigi Effect." The post gathered over 100 likes in nearly two weeks (seen below).

On March 2nd, a user by the name of Cleo Nardo on LessWrong.com posted an article titled, "The Waluigi Effect (mega-post)."^[4] The article expanded on the notion that Reinforcement Learning From Human Feedback (RLHF) training tools used to make AI more conversational (such as Shoggoth With Smiley Face) invariably also teach the AI the opposite of what the programmer desires it to say. The article was widely shared on Twitter after it was published and was included in a post by Twitter user @nearcyan, where it gathered over 1,000 likes in three days.^[5]^[6]^[7]

On March 3rd, Twitter user @MichaelTrazzi^[10] posted a Waluigi and shoggoth meme. The next day, Twitter^[8] user @repligate posted an image using MichaelTrazzi's meme in the bottom-right corner, gathering over 200 likes in four days (seen below, left). On March 6th, Twitter^[9] user @daniel_eth then posted a Distracted Boyfriend meme using an "AI as Shoggoth" image and Waluigi, gathering over 800 likes in two days (seen below, right).