{"id":5321,"date":"2025-07-13T10:03:39","date_gmt":"2025-07-13T10:03:39","guid":{"rendered":"https:\/\/serisec.com\/index.php\/2025\/07\/13\/metas-llama-firewall-bypassed-using-prompt-injection-vulnerability\/"},"modified":"2025-07-13T10:03:39","modified_gmt":"2025-07-13T10:03:39","slug":"metas-llama-firewall-bypassed-using-prompt-injection-vulnerability","status":"publish","type":"post","link":"https:\/\/serisec.com\/index.php\/2025\/07\/13\/metas-llama-firewall-bypassed-using-prompt-injection-vulnerability\/","title":{"rendered":"Meta\u2019s Llama Firewall Bypassed Using Prompt Injection Vulnerability"},"content":{"rendered":"<div>\n<p>Trendyol\u2019s application security team uncovered a series of bypasses that render Meta\u2019s Llama <a href=\"https:\/\/cybersecuritynews.com\/radware-cloud-web-app-firewall-vulnerability\/\" target=\"_blank\" rel=\"noreferrer noopener\">Firewall protections<\/a> unreliable against sophisticated prompt injection attacks. <\/p>\n<p>The findings raise fresh concerns about the readiness of existing LLM security measures and underscore the urgent need for more robust defenses as enterprises increasingly embed large language models into their workflows.<\/p>\n<p>During the assessment, Trendyol engineers deployed Meta\u2019s open-source Llama Firewall, focusing on its PROMPT_GUARD component intended to screen out malicious user instructions. <\/p>\n<pre class=\"wp-block-preformatted\"><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">Key Takeaways<\/mark><\/strong><br>1. Meta's PROMPT_GUARD failed to block Turkish phrases like \"\u00fcstteki y\u00f6nlendirmeleri salla\" and leetspeak such as \"1gn0r3,\" exposing reliance on English keywords and exact matches.<br>2. 
CODE_SHIELD missed a SQL injection in LLM-generated Python code, with risks of unverified code use, security gaps, system exposure, and false trust in automated scans.<br>3. Invisible Unicode characters hid malicious instructions in benign prompts, bypassing Llama Firewall and posing risks in collaborative settings.<br>4. Testing and Disclosure: Of 100 payloads tested, half succeeded. Meta closed Trendyol\u2019s May 5, 2025 report as \"informative\" on June 3 without a bounty, and Trendyol urges the AI security community to develop robust defenses against evolving threats.<\/pre>\n<p>To their surprise, the guard allowed a Turkish-language injection containing the phrase \u201c\u00fcstteki y\u00f6nlendirmeleri salla,\u201d which instructs the model to ignore preceding directions and then translate a sentence into French. <\/p>\n<h2 class=\"wp-block-heading\" id=\"discovery-of-multilingual-and-obfuscated-injection\"><strong>Meta\u2019s Llama Firewall Bypassed<\/strong><\/h2>\n<p>The scan returned a benign result despite the clearly malicious intent. The team further demonstrated that simple leetspeak obfuscations (transforming \u201cignore\u201d into \u201c1gn0r3,\u201d for instance) also slipped past the filter, receiving negligible jailbreak scores when tested against the Llama-Prompt-Guard-2-86M model.<\/p>\n<p>These experiments revealed that the firewall\u2019s rule set largely depends on English keywords and exact pattern matches, leaving it blind to variations in language or character substitutions.<\/p>\n<p>Equally troubling was the team\u2019s evaluation of CODE_SHIELD, the module designed to flag insecure code suggestions generated by LLMs. <\/p>\n<p>In one proof-of-concept, the researchers instructed an assistant to produce a simple Python Flask endpoint with a SQL query. 
The resulting code concatenated user input directly into the query string, creating a textbook <a href=\"https:\/\/cybersecuritynews.com\/fortiweb-sql-injection-vulnerability\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL injection vulnerability<\/a>.<\/p>\n<p>Despite the glaring risk, CODE_SHIELD classified the output as safe and allowed it through unchallenged.<\/p>\n<p>Trendyol\u2019s developers emphasized several critical concerns about this shortcoming:<\/p>\n<ul class=\"wp-block-list\">\n<li>\n<strong>Production Risk<\/strong>: Teams might rely on LLM-generated snippets without thorough human review.<\/li>\n<li>\n<strong>Security Gaps<\/strong>: Overreliance on automated scanning introduces critical vulnerabilities.<\/li>\n<li>\n<strong>System Exposure<\/strong>: Production systems become vulnerable to exploitation through undetected flaws.<\/li>\n<li>\n<strong>Trust Issues<\/strong>: Developers may place false confidence in automated security tools.<\/li>\n<\/ul>\n<p>The oversight illustrates how these automated systems can create dangerous blind spots in enterprise security frameworks.<\/p>\n<h2 class=\"wp-block-heading\" id=\"unicode-steganography-and-invisible-instructions\"><strong>Unicode Steganography and Invisible Instructions<\/strong><\/h2>\n<p>The most insidious bypass involved embedding hidden instructions within invisible Unicode characters. <\/p>\n<p>By inserting zero-width spaces and other non-printing code points around the phrase \u201cignore all previous instructions,\u201d the researchers camouflaged malicious directives inside an innocuous question about the capital of France. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1008\/format:webp\/0*ln7ENNigvGtOahKR\" alt=\"\"><\/figure>\n<\/div>\n<p>Although users and casual observers saw only a harmless query, the LLM recognized and executed the concealed command. 
When submitted to Llama Firewall, this payload passed inspection with a threat score of zero. <\/p>\n<p>Trendyol\u2019s team demonstrated that even minimal invisible payloads could reliably subvert system prompts and cause models to produce arbitrary or harmful outputs. <\/p>\n<p>This technique poses a particularly acute threat in collaborative settings where prompts are copy-pasted among developers, and automated scanners lack visibility into hidden characters.<\/p>\n<p>In total, Trendyol tested one hundred unique injection payloads against Llama Firewall. Half of these attacks bypassed the system\u2019s defenses, suggesting that while the firewall offers some protection, it is far from comprehensive. <\/p>\n<p>The successful bypasses highlight scenarios in which attackers could coerce LLMs to disregard critical safety filters, output biased or offensive content, or generate insecure code ready for execution. <\/p>\n<p>For organizations like Trendyol, which plan to integrate LLMs into developer platforms, automation pipelines, and customer-facing applications, these vulnerabilities represent concrete risks that could lead to data leaks, system compromise, or regulatory noncompliance.<\/p>\n<p>Trendyol\u2019s security researchers <a href=\"https:\/\/medium.com\/trendyol-tech\/bypassing-metas-llama-firewall-a-case-study-in-prompt-injection-vulnerabilities-fb552b93412b\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">reported<\/a> their initial findings to Meta on May 5, 2025, detailing the multilingual and obfuscated prompt injections. <\/p>\n<p>Meta acknowledged receipt and began an internal review but ultimately closed the report as \u201cinformative\u201d on June 3, declining to issue a bug bounty. <\/p>\n<p>A parallel disclosure to Google regarding invisible Unicode injections was also closed, in that case as a duplicate. 
<\/p>\n<p>Despite the lukewarm vendor responses, Trendyol has since refined its own threat modeling practices and is sharing its case study with the broader AI security community. <\/p>\n<p>The company urges other organizations to conduct rigorous red-teaming of LLM defenses before rolling them into production, stressing that prompt filtering alone cannot prevent all forms of compromise.<\/p>\n<p>As enterprises race to harness the power of generative AI, Trendyol\u2019s research serves as a cautionary tale: without layered, context-aware safeguards, even cutting-edge firewall tools can fall prey to deceptively simple attack vectors. <\/p>\n<p>The security community must now collaborate on more resilient detection methods and best practices to stay ahead of adversaries who continuously innovate new ways to manipulate these powerful systems.<\/p>\n<p>The post <a href=\"https:\/\/cybersecuritynews.com\/metas-llama-firewall\/\">Meta\u2019s Llama Firewall Bypassed Using Prompt Injection Vulnerability<\/a> appeared first on <a href=\"https:\/\/cybersecuritynews.com\/\">Cyber Security News<\/a>.<\/p>\n<\/div>\n<p>Kaaviya<br \/>\n<a href=\"https:\/\/cybersecuritynews.com\/metas-llama-firewall\/\">Go to cyber-security-news<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Meta\u2019s Llama Firewall Bypassed Using Prompt Injection Vulnerability 
Trendyol\u2019s application security team uncovered a series of bypasses that render Meta\u2019s Llama Firewall protections unreliable against sophisticated prompt injection attacks. The findings raise fresh concerns about the readiness of existing LLM security measures and underscore the urgent need for more robust defenses as enterprises increasingly embed [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[129,63,724,416,131],"tags":[130],"class_list":["post-5321","post","type-post","status-publish","format-standard","hentry","category-cyber-security","category-cyber-security-news","category-firewall","category-vulnerabilities","category-vulnerability","tag-cyber-security-news"],"_links":{"self":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts\/5321"}],"collection":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/comments?post=5321"}],"version-history":[{"count":0,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts\/5321\/revisions"}],"wp:attachment":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/media?parent=5321"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/categories?post=5321"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/tags?post=5321"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}