<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "https://jats.nlm.nih.gov/publishing/1.3/JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xml:lang="ru">
  <front xmlns:xlink="http://www.w3.org/1999/xlink">
    <journal-meta>
      <journal-id journal-id-type="elibrary">9004</journal-id>
      <journal-title-group>
        <journal-title>Problems of information security. Computer systems</journal-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Проблемы информационной безопасности. Компьютерные системы</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2071-8217</issn>
    </journal-meta>
    <article-meta xmlns:xlink="http://www.w3.org/1999/xlink">
      <article-id pub-id-type="publisher-id">11</article-id>
      <article-id pub-id-type="doi">10.48612/jisp/5ua4-umte-db1n</article-id>
      <title-group>
        <article-title>Cluster analysis of vector representations of malicious queries to language models: comparison of methods of obtaining embeddings based on character N-grams, individual words, and whole sentences</article-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Кластерный анализ векторных представлений вредоносных запросов к языковым моделям: сравнение методов получения эмбеддингов на основе символьных N-грамм, отдельных слов и целых предложений</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-7231-5728</contrib-id>
          <name>
            <surname>Spirin</surname>
            <given-names>Andrey</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
          <email>spirin_aa@mirea.ru</email>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0009-0004-0833-5574</contrib-id>
          <name>
            <surname>Ikonnikov</surname>
            <given-names>Aleksandr</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
          <email>alx.ikona@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Matuhina</surname>
            <given-names>Ekaterina</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
          <email>matyuhina@mirea.ru</email>
        </contrib>
      </contrib-group>
      <aff id="aff1">MIREA – Russian Technological University</aff>
      <pub-date publication-format="electronic" date-type="pub" iso-8601-date="2025-09-30">
        <day>30</day>
        <month>09</month>
        <year>2025</year>
      </pub-date>
      <issue>3</issue>
      <fpage>121</fpage>
      <lpage>146</lpage>
      <self-uri xmlns:xlink="http://www.w3.org/1999/xlink" content-type="pdf" xlink:href="https://jisp.spbstu.ru/userfiles/files/soderzhaniya/pib_3_5-6.pdf"/>
      <abstract xml:lang="en">
        <p>This study presents a comparative analysis of tokenization strategies and text vectorization methods for detecting harmful jailbreak prompts submitted to large language models. Using a dataset of both benign and malicious queries, three approaches were evaluated: aggregated embeddings of character-level N-grams, aggregated word-level embeddings, and semantic representations of entire prompts. The results show that token-based methods achieve a high recall of malicious prompts by capturing repetitive local patterns, though often at the cost of increased false positives. In contrast, semantic embeddings of full prompts provide high precision in detecting threats, but may overlook obfuscated or rare attacks. A key finding of this work is that vector representations demonstrate clear cluster separation between benign and harmful prompts, making it possible to apply lightweight classification algorithms for effective filtering, especially in systems with limited computational resources. The study also supports a two-stage protection framework, where clustering is used as a preliminary filter, and only suspicious inputs proceed to deeper analysis. In some configurations, the approach successfully identified up to 96 % of jailbreak prompts, confirming its practical relevance for integration into large language model access pipelines.</p>
      </abstract>
      <kwd-group xml:lang="en">
        <kwd>language models</kwd>
        <kwd>text tokenization</kwd>
        <kwd>semantic embeddings</kwd>
        <kwd>clustering of vector representations</kwd>
        <kwd>filtering of malicious requests</kwd>
        <kwd>bypassing model restrictions</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
