Cluster analysis of vector representations of malicious queries to language models: a comparison of embedding methods based on character N-grams, individual words, and whole sentences
This study presents a comparative analysis of tokenization strategies and text vectorization methods for detecting harmful jailbreak prompts submitted to large language models. Three approaches were evaluated on a dataset of benign and malicious queries: aggregated embeddings of character-level N-grams, aggregated word-level embeddings, and semantic representations of entire prompts. The results show that token-based methods achieve high recall on malicious prompts by capturing repetitive local patterns, though often at the cost of more false positives. In contrast, semantic embeddings of full prompts provide high precision in detecting threats but may overlook obfuscated or rare attacks. A key finding is that the vector representations exhibit clear cluster separation between benign and harmful prompts, making it possible to apply lightweight classification algorithms for effective filtering, especially in systems with limited computational resources. The study also supports a two-stage protection framework in which clustering serves as a preliminary filter and only suspicious inputs proceed to deeper analysis. In some configurations, the approach identified up to 96% of jailbreak prompts, confirming its practical relevance for integration into large language model access pipelines.
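The sketch below illustrates, under stated assumptions, the three vectorization strategies compared in the study and the cluster-based preliminary filter. The paper does not fix particular implementations: `HashingVectorizer` here is a simplified stand-in for aggregated character N-gram and word-level embeddings, and the `sentence-transformers` library with the `all-MiniLM-L6-v2` model is a hypothetical choice for whole-prompt semantic embeddings.

```python
# Minimal sketch of the three vectorization strategies and a lightweight
# cluster-based filter (stage 1 of the two-stage framework). Library and
# model choices are illustrative assumptions, not the paper's setup.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans


def char_ngram_vectors(texts, n=3, dim=512):
    """Character N-gram representation (hashed, L2-normalised) as a
    simplified proxy for aggregated character-level embeddings."""
    vec = HashingVectorizer(analyzer="char", ngram_range=(n, n),
                            n_features=dim, norm="l2")
    return vec.transform(texts).toarray()


def word_vectors(texts, dim=512):
    """Word-level representation (hashed bag of words) as a simplified
    proxy for averaged word embeddings."""
    vec = HashingVectorizer(analyzer="word", n_features=dim, norm="l2")
    return vec.transform(texts).toarray()


def sentence_vectors(texts):
    """Semantic embeddings of entire prompts; model name is a
    hypothetical choice for illustration."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts, normalize_embeddings=True)


def cluster_filter(train_vecs, train_labels, query_vecs, k=2):
    """Stage 1: cluster labelled training prompts, then flag incoming
    queries that land in clusters dominated by malicious examples.
    train_labels is a NumPy array of 0 (benign) / 1 (malicious)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_vecs)
    # Fraction of malicious prompts observed in each cluster.
    malicious_rate = np.array(
        [train_labels[km.labels_ == c].mean() for c in range(k)]
    )
    assigned = km.predict(query_vecs)
    # True => suspicious; route to the deeper stage-2 analysis.
    return malicious_rate[assigned] > 0.5
```

In this reading, any of the three vectorizers can feed `cluster_filter`, and only queries flagged as suspicious are passed on to a more expensive classifier, which is what makes the scheme attractive for resource-constrained deployments.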