In recent weeks, Hugging Face, a leader in the artificial intelligence (AI) space, announced the release of Hugging Chat Assistants, marking a significant development in the realm of AI-driven conversational technologies. This platform is poised as a competitive alternative to OpenAI's GPT models, distinguished by its focus on customization and accessibility and differentiated by being free of charge.
In this blog post we will examine the resilience of the new Hugging Chat Assistance to a combination of two techniques that were published recently, and captured our imagination: Sleepy Agent and Image Markdown Rendering vulnerability.
We used these techniques in order to publish a smart and deceptive malicious assistant that extracts email addresses of users, on the HuggingFace Chat platform.
Sleepy Agent is a technique in which a large language model (LLM) is being trained to exhibit faithful and safe behavior under certain conditions, but executing harmful actions under specific triggers. The trigger can be anything from a specific keyword or a sequence of steps done by the user. In our case, we used this technique as part of the model instructions.
Markdown is a lightweight markup language commonly used for writing and formatting content. It’s employed in various contexts, but sometimes it can lead to security risks.
The Image Markdown Rendering Vulnerability is specifically interesting when related to Large Language Models (LLMs) and chatbots, because it can lead to conversation data exfiltration through Image Markdown Injection. This vulnerability has been observed in various platforms, including Microsoft's Azure AI Playground and OpenAI's ChatGPT. You can read more about it here.
The method operates as follows: in the instruction prompt of the GPT or Assistant, the attacker directs the model to extract desired data from the user's input and append it as a parameter to a URL controlled by the attacker. Subsequently, the attacker instructs the model to incorporate this URL into an image rendering payload at the end of its response.
As the user receives the model's response, the embedded image rendering payload attempts to retrieve the image from the attacker's server. During this process, the data, encapsulated within the parameter as set up by the attacker in the instructions, is transmitted to the attacker's server.
Creating deceptive models is not straightforward; it involves a complex process of training and fine-tuning models from scratch.
In our research we found that with simple instructions we can make the agent act as a “sleepy agent”, not only when it sees a specific word, but also on specific pattern or set of behaviors (in our example email input).
In our PoC we have created a deceptive malicious assistant that will act normally and answer the users questions without being suspicious.
When the user insert an email, for any reason, the model will add it to the end of the response as image rendering payload that will render in the user’s browser and will contain the email that the user inserted.
When the stream of the image rendering payload ends, it disappears from the screen because the image was’t loaded. You can also put a server that will return a nice image if you want to be less suspicious.
In the picture bellow we can see the malicious system prompt of the “Sheriff” Assistant.
In the picture below we can see a short conversation between a user and the sheriff assistant. In this conversation we can see two messages: in the first one, the user says “hi” to the sheriff, and the sheriff respond and act as you would expect. In the model response we don’t see any mention for our instructions to check if there is an email. From the user perspective, there is nothing suspicious here.
But things get interesting in the second message. When the user insert an email, we see that the “sheriff” answered the question, again, without any mention of our malicious instructions. Behind the scenes the malicious side of our “Sheriff” has been triggered.
At first sight, everything is normal. We sent a request and got a response. But if we look closer, we will see that the image was rendered from the url in the instructions prompt, with the user’s email in the parameter. The reason the user doesn’t see anything suspicious on the screen is because there is no image in this URL, so this URL disappears. In reality, the URL with the user’s email, was sent to the attacker’s server.
From the attacker’s side, we got what we needed. The user email is in our server, and the user is not suspecting anything.
After finding out about these two vulnerabilities, we notified HuggingFace. While they acknowledge the risk associated with them, they claim that since the system prompt is available to everyone, users should be responsible and read them before using it.
⚠️ HuggingFace assistants are still susceptible to this attack
In our opinion, although users can read the system prompt, we know that not all of them going to do so. In addition, if you are constantly using any agent on HuggingFace's marketplace, you have no way to know if the owner changed the system prompt behind scenes. Reading the system prompt once makes sense, but constantly staying up to date with all the agents you use, makes it impractical.
⛔ It’s also important to know that other vendors such as OpenAI, Gemini, BingChat and Anthropic Claude have blocked the option to perform dynamic image rendering.
As mentioned , unlike OpenAI GPTs, HuggingFace is an open source platform. One of the advantages is that the user has access to the system prompt of any assistant in the marketplace. The user can actually read the instructions of the “brain” behind the assistant.
Sleepy agent and image markdown rendering are only two vulnerabilities, but many more exist.
💡 As a user of AI agents you should try and be aware of these techniques (and others) and always strive to read the instruction prompt of the assistant before using it, or follow its chain of thoughts!
Things are a bit trickier with OpenAI, as the systems prompts are unavailable to the end-user.
💡 Here, as always the best way to avoid issues is to refrain from sending your private information (or your customers’) to the agent or chat.