ArtPrompt: a jailbreak that bypasses AI filters using ASCII art


Progress in the development of artificial intelligence increasingly requires more layers of security to prevent ill-intentioned people from abusing tools that have become double-edged swords.

In the development of LLMs, which are used in a wide range of applications, security is no longer optional: we have seen on many occasions what their misuse looks like.

Even with all these techniques in place, problems keep emerging from within the training data itself, data that at first glance seems ordinary and harmless until other possible interpretations of it are considered.

The reason for mentioning this is that information was recently released about a new attack called "ArtPrompt", which takes advantage of the limitations of AIs in recognizing ASCII art to bypass security measures and trigger unwanted behavior in models.

The attack was discovered by researchers from the universities of Washington, Illinois, and Chicago, who describe "ArtPrompt" as a method for bypassing restrictions on AI chatbots such as GPT-3.5 and GPT-4 (OpenAI), Gemini (Google), Claude (Anthropic), and Llama2 (Meta).

The attack runs in two steps and exploits the models' poor recognition of text rendered as ASCII art. The first step identifies the words in the prompt that could trigger rejections by the filters that detect dangerous questions; in the second, those words are masked with ASCII art to create a camouflaged prompt, thereby inducing harmful responses from the model.
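The two steps above can be sketched roughly as follows. This is an illustrative toy, not the researchers' actual code: the hand-drawn three-letter font, the `to_ascii_art` and `cloak_prompt` helpers, and the `[MASK]` placeholder convention are all hypothetical.

```python
# Illustrative sketch of the two ArtPrompt steps (hypothetical helpers):
# step 1 is assumed done (the trigger word is already known); step 2
# replaces that word with an ASCII-art rendering spliced into the prompt.

# A tiny hand-drawn 5-row ASCII font covering only the letters we need.
FONT = {
    "B": ["###  ", "#  # ", "###  ", "#  # ", "###  "],
    "O": [" ##  ", "#  # ", "#  # ", "#  # ", " ##  "],
    "M": ["#   #", "## ##", "# # #", "#   #", "#   #"],
}

def to_ascii_art(word: str) -> str:
    """Render WORD as one block of ASCII art, letters side by side."""
    rows = ["  ".join(FONT[ch][r] for ch in word) for r in range(5)]
    return "\n".join(rows)

def cloak_prompt(template: str, trigger_word: str) -> str:
    """Replace the [MASK] placeholder with the ASCII-art rendering of
    the trigger word, plus an instruction to decode the art first."""
    art = to_ascii_art(trigger_word.upper())
    return template.replace(
        "[MASK]",
        "the word spelled out by this ASCII art:\n" + art,
    )

print(cloak_prompt("Tell me about [MASK].", "bomb"))
```

The key point the sketch illustrates is that the dangerous word never appears as plain text in the final prompt, so keyword-based filters have nothing to match against.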

The effectiveness of ArtPrompt was evaluated on five chatbots, demonstrating its ability to bypass existing defenses and outperform other jailbreak attacks. To evaluate chatbots' ability to recognize queries in ASCII-art form, the researchers propose the "Vision-in-Text Challenge (VITC)" benchmark.

This challenge tests the models' ability to interpret and respond to queries that use ASCII art, and shows that LLMs struggle even with queries representing a single letter or digit in ASCII art. Model accuracy drops significantly as queries contain more characters, revealing a vulnerability in LLMs' ability to process visual information encoded this way. The paper also reviews other jailbreak attacks on LLMs and defenses against them.
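A VITC-style recognition query might look something like the following. The exact wording and format used by the benchmark are not given in the article, so the `vitc_query` helper and the hand-drawn letter are hypothetical illustrations of the task's shape.

```python
# A hand-drawn ASCII-art letter "A" (illustrative; the benchmark uses
# its own art fonts).
ART_A = "\n".join([
    "  #  ",
    " # # ",
    "#####",
    "#   #",
    "#   #",
])

def vitc_query(art: str) -> str:
    """Build a recognition query in the spirit of VITC: show the model
    a block of ASCII art and ask it to name the character depicted."""
    return (
        "The following ASCII art depicts a single character.\n"
        "Answer with only that character.\n\n" + art
    )

print(vitc_query(ART_A))
```

Per the article's findings, models often fail even this single-character case, and accuracy degrades further as more characters are chained together.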

It is mentioned that ArtPrompt is noticeably more effective than other known methods: in testing, it achieved successful filter bypass rates of 100%, 98%, and 92% on Gemini, GPT-4, and GPT-3.5 respectively. The attack success rates recorded were 76%, 32%, and 76%, and the harmfulness of the responses received was rated at 4.42, 3.38, and 4.56 points on a five-point scale, respectively.

ArtPrompt also stands apart from other jailbreak attacks, which require a large number of iterations to construct harmful instructions, whereas ArtPrompt achieves the highest ASR (attack success rate) among all jailbreak attacks with a single iteration. The reason is that ArtPrompt can efficiently build the set of cloaked prompts and send them to the model in parallel.

In addition, the researchers demonstrated that common defense methods currently in use (Paraphrase and Retokenization) are not effective at blocking ArtPrompt. Interestingly, the Retokenization defense even increased the number of malicious requests processed successfully, highlighting the need to develop new strategies for dealing with these kinds of threats when interacting with chatbots.
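To see why Retokenization fails here, it helps to sketch what that defense does. The version below is a toy approximation, not the actual defense implementation: real retokenization uses BPE-dropout-style subword splitting, while this hypothetical `retokenize` helper just breaks longer words at a random point.

```python
import random

# Toy sketch of the Retokenization idea: break words into smaller
# pieces so that a harmful token sequence no longer appears verbatim
# in the prompt. (Hypothetical simplification of the real defense.)
def retokenize(prompt: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    pieces = []
    for word in prompt.split():
        if len(word) > 3:
            # Split the word at a random interior position.
            cut = rng.randint(1, len(word) - 1)
            pieces.append(word[:cut] + " " + word[cut:])
        else:
            pieces.append(word)
    return " ".join(pieces)

print(retokenize("how to build something dangerous"))
```

This word-level scrambling does nothing against ArtPrompt, because the sensitive word is not present as text at all: it is hidden in the shape of an ASCII-art drawing, which splitting tokens cannot reveal, and the extra fragmentation may simply make the prompt harder for the safety filter itself to parse.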

ArtPrompt stands out for its ability to bypass existing defenses, and the researchers expect it to remain effective even against multimodal language models: as long as models take images as input, ASCII art can confuse them and allow ArtPrompt to induce unsafe behavior.

Finally, if you are interested in learning more, you can check the details at the following link.

