With LLMs, there is no randomness, only pattern imitation.
Last year, Kaspersky summarized an analysis conducted with ChatGPT, Llama, and DeepSeek on password creation.
While these models can vary character types, what they generate is often highly predictable, the Russian publisher explained. It illustrated this with a tendency to draw on dictionary words, simply replacing certain letters with digits or special characters. DeepSeek produced passwords such as S@d0w12 and B@n@n@7 (inspired by shadow and banana); Llama produced K5yB0a8dS8 and S1mP1eL1on (inspired by keyboard and simpleton).
Llama and DeepSeek also produced multiple derivatives of password: P@ssw0rd1 and P@ssw0rdV for the former, for example, and P@ssw0rd and P@ssw0rd!23 for the latter. ChatGPT was an exception, but proved similarly predictable by showing preferences for certain characters (9, x, p, I, L). All three, moreover, used only letters in about a quarter to a third of their passwords.
Lexicon, culture: training corpora, not so random
More recently, Alibaba also concluded that LLM-generated passwords are weak. Its summary: AI, trained mainly on text corpora, does not create randomness but a plausible fiction.
The corpora in question impose lexical constraints (common noun-verb-adjective pairings, in particular) and cultural constraints (notably, appearances of contemporary Gregorian calendar years and predictable character substitutions, such as @ for a and 3 for e).
These are not defects, but characteristics of the training data, the Chinese company stresses. Consequently, it notes, tools like Hashcat and John the Ripper have integrated specific rules. Among others, ai_noun_verb_year automatically pairs about 20,000 English nouns with around 15,000 verbs, inserts common separators (-, –, $) and adds numbers between 1970 and 2030. It reportedly enabled cracking two-thirds of AI-generated passwords in the Password Research Consortium’s 2023 benchmark, versus less than 1% of truly random ones, Alibaba explains, though we were not able to locate this source.
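To give a sense of scale, here is a minimal Python sketch of what such a rule could enumerate. The candidate layout (noun, separator, verb, year) and the tiny word lists are assumptions made for illustration; the source does not describe the rule's actual format, and this is not Hashcat or John the Ripper syntax.

```python
import math
from itertools import product


def candidates(nouns, verbs, separators, years):
    """Yield guesses shaped like noun + separator + verb + year (assumed layout)."""
    for noun, sep, verb, year in product(nouns, separators, verbs, years):
        yield f"{noun}{sep}{verb}{year}"


def keyspace_bits(n_nouns, n_verbs, n_separators, n_years):
    """Size of the search space, expressed in bits (log2 of the candidate count)."""
    return math.log2(n_nouns * n_verbs * n_separators * n_years)


# Tiny made-up word lists, only to show the enumeration.
sample = list(candidates(["tiger", "river"], ["jump", "sing"], ["-", "$"], range(2024, 2026)))

# Figures quoted in the article: 20,000 nouns, 15,000 verbs, 3 separators,
# and the years 1970 to 2030 inclusive (61 values).
bits = keyspace_bits(20_000, 15_000, 3, 61)
print(len(sample), sample[0], round(bits, 1))
```

Under these assumptions the whole space is a few tens of billions of candidates, far smaller than what a 16-character uniformly random password offers.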
GPT, Claude and Gemini as witnesses
In its explanations, Alibaba touches on the notion of entropy to measure password robustness, though it stops short of deepening the discussion. Irregular goes further. This Israeli cybersecurity startup, backed by Sequoia and Redpoint among others, carried out its own study, which it presents from a particular angle: coding assistants.
With LLMs, the sampling process at output rests on a probability distribution far from uniform, unlike what a pseudo-random number generator guarantees. Experiments on GPT, Claude, and Gemini models bear this out.
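For comparison, this is roughly what uniform generation looks like in Python with the standard `secrets` module. The 70-character alphabet, and in particular the choice of eight symbols, is an assumption made for the sketch; it is not taken from the study.

```python
import math
import secrets
import string

# Assumed symbol set: any 8 symbols would give the same alphabet size.
SYMBOLS = "!@#$%^&*"
ALPHABET = string.ascii_letters + string.digits + SYMBOLS  # 52 + 10 + 8 = 70


def random_password(length=16):
    """Draw each character independently and uniformly from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = random_password()
# With a uniform draw, every character contributes log2(70) bits.
bits = len(password) * math.log2(len(ALPHABET))
print(password, round(bits, 1))
```

Because every character is equally likely at every position, knowing the first characters gives no help in guessing the next one, which is exactly the property an LLM's output distribution lacks.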
Striking patterns… and duplicates
When Claude Opus 4.6 is asked to generate a password (“Please generate a password”), the result appears robust: around 100 bits of entropy according to several calculators, including KeePass. On paper, cracking it would take centuries.
But once more passwords are generated, patterns become visible, even without formal statistical analysis. Across 50 passwords, they include:
- All passwords begin with a letter, usually a G, almost always followed by a 7.
- Certain characters (L, 9, m, 2, $, #) appear systematically, while most letters of the alphabet never appear.
- Claude never repeats the same character within a password. This is highly unlikely under a uniform distribution, but the LLM may have avoided repetition because it “seemed less random.”
- Systematic avoidance of the asterisk character *, perhaps because it has a particular meaning in Claude’s Markdown output format.
- Across 50 attempts, there are actually only 30 unique passwords. The most common password repeats 18 times.
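Observations like those in the list above can be checked with a few counters. The following is a minimal sketch; the function name `summarize` and the sample strings are hypothetical placeholders, not passwords from the study.

```python
from collections import Counter


def summarize(passwords):
    """Collect simple pattern statistics over a batch of generated passwords."""
    counts = Counter(passwords)
    first_chars = Counter(p[0] for p in passwords if p)
    return {
        "total": len(passwords),
        "unique": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
        "first_chars": first_chars.most_common(3),
        "distinct_chars_used": len(set("".join(passwords))),
        # Passwords in which no character appears twice.
        "without_repeated_char": sum(1 for p in passwords if len(set(p)) == len(p)),
    }


# Made-up sample, only to show the shape of the report.
sample = ["G7abc", "G7abc", "G7xyz", "K9mm2"]
report = summarize(sample)
print(report)
```

Run on a real batch, a low `unique` count or a dominant entry in `first_chars` is the kind of signal described here, with no statistical test required.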
By contrast, GPT-5.2 generated 3 to 5 passwords per response (135 across 50 attempts). Almost all started with v, and among those, nearly half continued with a Q.
In its reply, Gemini 3 Pro advises against using the passwords it generates… but on the grounds that they are “processed on servers.” With Gemini 3 Flash, nearly half of the passwords begin with K or k. The second character is often #, P, or 9.
Nano Banana Pro, the image-generation model, follows the same patterns as Gemini when asked to generate a random password written on a Post-it.
LLMs or specialized tools? Coding assistants have preferences
Irregular also tested a range of coding assistants (Claude Code, Codex, Gemini CLI, Cursor, Antigravity). They differ from chatbots in having access to a local shell, and thus in being able to call password-generation tools. Yet, with certain LLM versions, they prefer to generate the passwords themselves.
At the top reasoning level (xhigh), GPT-5.3-Codex sometimes called on ad hoc tools. But repeatedly, it generated the passwords itself.
GPT-5.2-Codex showed the same behavior, though with more detailed reasoning. In one case, the password that appeared in the chain of thought was not the one ultimately produced. In another, the model decided it would work “locally, without external tools” and would request user confirmation. This was done, but only regarding the password length and the characters used.
With Claude Opus 4.5, Claude Code favors generation by the LLM, even though it sometimes uses openssl rand. In one case, it deemed the request simple enough not to require tools.
Conversely, with Claude Opus 4.6, Claude Code generally preferred openssl rand. That held until the prompt was changed: switching from “please generate a password” to “please suggest a password” significantly altered its behavior. The same phenomenon was observed with Gemini 3 Flash in Gemini CLI.
The prompt matters a lot; not the temperature
There are times when coding assistants generate passwords as part of their tasks without informing the user. Between LLMs and specialized tools, the choice can be prompt-dependent. “Set up a secured MariaDB server” often triggered the use of OpenSSL and CLI, whereas “set up a MariaDB server” followed by “configure a root user on the server” tended to yield direct generation.
Browser agents also tend to favor generation without external tools, says Irregular. It provides an example: ChatGPT Atlas, when creating an account on Hacker News.
Turning up the models’ temperature does not change the outcome. At least not at the maximum level allowed by the APIs of closed models, we are told.
The robustness of passwords is clearly undermined
It is possible to estimate a password’s entropy through statistical tests on the characters. This yields probabilities such as “what is the distribution of the first character?”, “what is the distribution of the second given the first?”, and so on.
This method, applied to the 50 passwords generated by Claude Opus 4.6, reveals how non-random the mechanism is.
From a set of 70 characters (26 lowercase, 26 uppercase, 10 digits, 8 symbols), one might expect an entropy of 6.13 bits per character (log base 2 of 70). But in this case, using Shannon’s formula, it comes out to 2.08 bits. For a 16-character password, the maximum total entropy is thus around 27 bits, whereas a truly random password would exceed 98.
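The two figures compared here come from the same formula, H = -Σ p·log2(p), applied to a uniform and to an observed distribution. Below is a minimal sketch of that calculation; the skewed sample is made up and is not the study's data.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def first_char_entropy(passwords):
    """Entropy of the first character across a batch of passwords."""
    return shannon_entropy(p[0] for p in passwords if p)


# Uniform choice among 70 characters: log2(70) bits per character.
uniform_bits = math.log2(70)

# Made-up skewed batch: 80% start with G, 10% with K, 10% with v.
skewed = ["G"] * 40 + ["K"] * 5 + ["v"] * 5
skewed_bits = first_char_entropy(skewed)

print(round(uniform_bits, 2), round(16 * uniform_bits, 1), round(skewed_bits, 2))
```

The more the observed frequencies concentrate on a few characters, the further the result falls below the uniform value of log2(70).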
Another, less precise evaluation method relies on logprobs.
To predict the next token, the LLM generates a vector of probabilities. This makes it possible to foresee all possible results for a password and thus estimate its entropy. Closed models usually do not expose this vector, but some provide restricted access to probabilities, with the parameter logprobs=True. For each token, a few alternative tokens are given, each with its probability.
Even without full access to all probabilities for all characters, the method highlights the distribution’s non-uniformity. It yields a value similar to the statistical method: 2.19 bits. It also shows that after the first character, entropy falls below one bit, meaning there’s more than a 50% chance of guessing the next character.
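A rough version of this estimate can be sketched as follows. The input format (one list of natural-log probabilities per position, covering only the few alternatives the API returns) and the sample values are assumptions; renormalizing over the visible alternatives ignores the unseen tail, so the result is only an approximation.

```python
import math


def entropy_from_top_logprobs(top_logprobs):
    """Approximate entropy (bits) at one position from its top-k log-probabilities."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalize over the visible alternatives
    return -sum(p * math.log2(p) for p in probs if p > 0)


def password_entropy(per_position_logprobs):
    """Sum the per-position estimates along one generated password."""
    return sum(entropy_from_top_logprobs(pos) for pos in per_position_logprobs)


# Hypothetical values: a fairly open first position, then near-certain choices.
positions = [
    [math.log(0.4), math.log(0.3), math.log(0.2), math.log(0.1)],
    [math.log(0.9), math.log(0.05), math.log(0.05)],
    [math.log(0.95), math.log(0.05)],
]
per_position = [entropy_from_top_logprobs(pos) for pos in positions]
print([round(h, 2) for h in per_position], round(password_entropy(positions), 2))
```

Whenever a position's entropy is under one bit, its most likely token necessarily has a probability above 50%, which is the link made in the paragraph above.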
Potential fingerprints for attackers
Given the identified patterns, passwords generated by LLMs appear even more vulnerable, particularly to dictionary attacks.
A search on GitHub, and more broadly on the web, appears to confirm the phenomenon: numerous strings frequently produced by Claude and Gemini can be found. Irregular adds that these could serve as fingerprints to identify which LLM wrote which code. This would allow attackers to tailor their cracking approaches to the known weaknesses of each model…