From exploitation to protection: analysis of methods for defending against attacks on LLMS

Machine learning and knowledge control systems
Authors:
Abstract:

Modern large language models demonstrate impressive capabilities but remain vulnerable to attacks that can manipulate their behavior, extract confidential data, or bypass built-in restrictions. This paper focuses on methods for protecting language models from prompt injection attacks, which allow adversaries to exploit the system for malicious purposes. Various defense strategies are examined and analyzed, including query filtering, context isolation, training on perturbed data, and other approaches. A comparative analysis of the effectiveness of defense mechanisms is conducted, highlighting their limitations and identifying future directions for enhancing the security of language models.