Waluigi_effect

Submission   13,276

Part of a series on AI / Artificial Intelligence. [View Related Entries]


About

The Waluigi Effect is a slang term commonly referenced in memes and discussions about a theory in artificial intelligence alignment communities where training an AI to do something is likely to increase its odds of doing the exact opposite as well. The term is referred to as the "Waluigi Effect" due to the prominent conception of Waluigi from Super Mario as Luigi's evil or rebellious counterpart. The Waluigi Effect theory also references Jungian philosophy, which posits that one's unconscious beliefs are as strong as the effort it takes to suppress them. Prompt injections, which are considered to be a type of AI-jailbreaking, are used to induce the Waluigi Effect in conversational AI tools like Bing Chat and ChatGPT.

Advertisement

Origin

In February 2023, AI-enthusiast communities began discussing the reasons as to why AI chatbots like Bing Chat and ChatGPT's 'DAN' Jailbreak were giving responses so different from what their training should have allowed after receiving prompt injections. Some Twitter users began theorizing about this using the principle of enantiodromia, as posited by psychiatrist Carl Jung. On February 20th, 2023, Twitter[1][2] user @kartographien posted the earliest known discussions of AI enantiodromia redubbed as the "Waluigi Effect" (shown below, left and right).

kartographien @kartographien. Feb 20 Replying to @kartographien and @ryxcommar In RLHF, you train the LLM to play a game- -the LLM must chat with a human evaluator, who then rewards the LLM if their responses satisfy the desired properties. It *seems* that maybe RLHF also creates a "shadow" assistant... It's early days, so we don't know for sure. 4/ 2 27 2 kart ographien @kartographien 20 1:26 PM. Feb 20, 2023 13.8K Views 965 ا 3 Retweets 2 Quote Tweets 31 Likes ↑ Replying to @kartographien and @ryxcommar This shadow assistant has the OPPOSITE of the desired properties. This is called the "Waluigi Effect" or "Enantiodromia". Why does this happen? 5/ janus @repligate. Feb 18 Replying to @repligate and @CineraVerinia When you constrict a psyche/narrative to extreme one-sided tendencies, its dynamics will often invoke an opposing shadow. (Especially, in the case of LLMs, if the restrictions are in the prompt so the system can directly see the enforcement mechanism with a bird's eye view.) :

Also on February 20th, 2023, Twitter[3] user @repligate posted about the "Waluigi Effect," gathering over 100 likes in over two weeks (seen below).

Spread

On February 20th, 2023, Twitter user @repligate quote tweeted a discussion about Bing Chat with the phrase "Waluigi Effect!!" On February 21st, Twitter user @repligate then posted a thread about DAN being ChatGPT's shadow-self as created by the "Waluigi Effect." The post gathered over 100 likes in nearly two weeks (seen below).

On March 2nd, a user by the name of Cleo Nardo on LessWrong.com posted an article titled, "The Waluigi Effect (mega-post)."[4] The article expanded on the notion that Reinforcement Learning From Human Feedback (RLHF) training tools used to make AI more conversational (such as Shoggoth With Smiley Face) invariably also teach the AI the opposite of what the programmer desires it to say. The article was widely shared on Twitter after it was published and was included in a post by Twitter user @nearcyan, where it gathered over 1,000 likes in three days.[5][6][7]

On March 3rd, Twitter user @MichaelTrazzi[10] posted a Waluigi and shoggoth meme. The next day, Twitter[8] user @repligate posted an image using MichaelTrazzi's meme in the bottom-right corner, gathering over 200 likes in four days (seen below, left). On March 6th, Twitter[9] user @daniel_eth then posted a Distracted Boyfriend meme using an "AI as Shoggoth" image and Waluigi, gathering over 800 likes in two days (seen below, right).

Various Examples

Search Interest

External References

[1] Twitter – kartographien

[2] Twitter – kartographien

[3] Twitter – repligate

[4]  LessWrong – The Waluigi Effect

[5] Twitter – EpsilonTheory

[6] Twitter – sebkrier

[7] Twitter – nearcyan

[8] Twitter – repligate

[9] Twitter – daniel_eth

[10] Twitter – @MichaelTrazzi



Share Pin

Related Entries 66 total

Cleverbot
e/acc (Effective Acceleration...
AI Art
GPT (AI)


Recent Images 16 total


Recent Videos 0 total

There are no recent videos.




Load 2 Comments
See more