9004

Problems of information security. Computer systems

Проблемы информационной безопасности. Компьютерные системы

2071-8217

10.48612/jisp/5ua4-umte-db1n

Cluster analysis of vector representations of malicious query language models: comparison of methods of obtaining embeddings based on character N-grams, individual words a nd whole sentences

Кластерный анализ векторных представлений вредоносных запросов к языковым моделям: сравнение методов получения эмбеддингов на основе символьных N-грамм, отдельных слов и целых предложений

0000-0002-7231-5728

Spirin

Andrey

spirin_aa@mirea.ru

0009-0004-0833-5574

Ikonnikov

Aleksandr

alx.ikona@gmail.com Matuhina

Ekaterina

matyuhina@mirea.ru

MIREA – Russian Technological University

30 09 2025

3 121 146

This study presents a comparative analysis of tokenization strategies and text vectorization methods for detecting harmful jailbreak prompts submitted to large language models. Using a dataset of both benign and malicious queries, three approaches were evaluated: aggregated embeddings of character-level N-grams, aggregated word-level embeddings, and semantic representations of entire prompts. The results show that token-based methods achieve a high recall of malicious prompts by capturing repetitive local patterns, though often at the cost of increased false positives. In contrast, semantic embeddings of full prompts provide high precision in detecting threats, but may overlook obfuscated or rare attacks. A key finding of this work is that vector representations demonstrate clear cluster separation between benign and harmful prompts, making it possible to apply lightweight classification algorithms for effective filtering, especially in systems with limited computational resources. The study also supports a two-stage protection framework, where clustering is used as a preliminary filter, and only suspicious inputs proceed to deeper analysis. In some configurations, the approach successfully identified up to 96 % of jailbreak prompts, confirming its practical relevance for integration into large language model access pipelines.

Language models text tokenization semantic embeddings clustering of vector representations filtering of malicious requests bypassing model restrictions