Free Research

178 models. Same fingerprint.

We fingerprinted 178 AI models across 32 writing dimensions. Found clone clusters, cross-provider twins, and models that write identically but cost 185x more. Every comparison backed by 3,095 analyzed responses.

178 models · 43 prompts · 32 dimensions · 3,095 responses
01 — Clone Detection

12 clone pairs found

Models with >90% writing style similarity. Some are from the same lab (expected). Others are from completely different providers. Same fingerprint, different brand.

02 — Cross-Provider Twins

Different labs, same writing DNA

These model pairs come from different providers but write almost identically. The training data convergence is real.

Model A | Model B | Match
Inception: Mercury 2 (inception) | Solar Pro 3 (upstage) | 92.6%
Aurora Alpha (openrouter) | Solar Pro 3 (upstage) | 90.7%
GPT-5 Mini (openai) | Horizon Beta (openrouter) | 90.0%
Aurora Alpha (openrouter) | Inception: Mercury 2 (inception) | 88.9%
Gemma 3n 4B (google) | Z.AI: GLM 4.5 Air (zhipu) | 87.3%
GPT OSS 120B (openai) | Solar Pro 3 (upstage) | 87.3%
Llama 4 Scout (meta) | Mistral Nemo (mistral) | 87.2%
GPT OSS 120B (openai) | Inception: Mercury 2 (inception) | 86.8%
Llama 4 Maverick (meta) | Mistral Nemo (mistral) | 86.4%
GPT-5 Pro (openai) | Horizon Alpha (openrouter) | 85.9%
03 — Price Arbitrage

Same writing, different bill

Models with >75% writing similarity but massive price gaps. The cheap model writes the same way. You are paying for the brand. A worked example of the blended-price math follows the table.

Cheaper model | Pricier model | Price gap | Savings | Similarity
Gemini 2.5 Flash Lite Preview 06-17 (google · $0.10/$0.40 per 1M) | Claude 3 Opus (anthropic · $15.00/$75.00 per 1M) | 184.62x | 99% | 78.2%
Mistral Small Creative (mistral · $0.10/$0.30 per 1M) | Mistral Large 2 (mistral · $8.00/$24.00 per 1M) | 80x | 99% | 79.7%
Qwen3 Coder (qwen · $0.22/$0.95 per 1M) | Claude Opus 4.1 (anthropic · $15.00/$75.00 per 1M) | 78.18x | 99% | 77.9%
GPT-5 Mini (openai · $0.25/$2.00 per 1M) | GPT-5 Pro (openai · $15.00/$120.00 per 1M) | 60x | 98% | 76.4%
Mistral Small 4 (mistral · $0.15/$0.60 per 1M) | Mistral Large 2 (mistral · $8.00/$24.00 per 1M) | 41.03x | 98% | 82.2%
GPT-5.4 Mini (openai · $0.75/$4.50 per 1M) | GPT-5.4 Pro (openai · $30.00/$180.00 per 1M) | 40x | 98% | 78.3%
Mistral Nemo (mistral · $0.03/$0.07 per 1M) | Llama 4 Maverick (meta · $1.50/$2.50 per 1M) | 37.5x | 97% | 86.4%
Qwen3 Coder (qwen · $0.22/$0.95 per 1M) | Claude 3.7 Thinking Sonnet (anthropic · $6.00/$30.00 per 1M) | 31.27x | 97% | 79.9%
Gemini 2.5 Flash Preview (thinking) (google · $0.17/$3.50 per 1M) | Claude 3 Opus (anthropic · $15.00/$75.00 per 1M) | 22.48x | 96% | 75.1%
Qwen3.5 9B (qwen · $0.10/$0.15 per 1M) | Qwen: Qwen3.5 122B A10B (qwen · $0.40/$3.20 per 1M) | 18.18x | 95% | 84.7%
MoonshotAI: Kimi K2 0905 (moonshotai · $0.60/$2.50 per 1M) | OpenAI o3 (openai · $10.00/$40.00 per 1M) | 16.05x | 94% | 77.3%
Inception: Mercury 2 (inception · $0.25/$0.75 per 1M) | Inception: Mercury (inception · $10.00/$10.00 per 1M) | 16x | 94% | 81.8%
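Worked example for the top row, using the 3:1 output:input blend described in the methodology: Gemini 2.5 Flash Lite blends to (3 × $0.40 + $0.10) / 4 = $0.325 per 1M tokens, Claude 3 Opus to (3 × $75.00 + $15.00) / 4 = $60.00, and $60.00 / $0.325 ≈ 184.6x.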
04 — Provider DNA

Which labs have a house style?

Distinctiveness measures how much a provider's models write like each other vs. like everyone else. Higher = stronger house style. A score below 1 means the provider's models resemble other labs' output more than they resemble each other.

meta: 37.5x
zhipu: 12.3x
deepseek: 6.1x
minimax: 4.1x
openrouter: 0.0x
openai: 0.0x
anthropic: 0.0x
mistral: 0.0x
google: 0.0x
xai: 0.0x
moonshotai: 0.0x
xiaomi: 0.0x
qwen: 0.0x

Distinctiveness = intra-provider similarity / inter-provider similarity. Values > 1 mean the provider has a recognizable writing signature.

05 — Family Cohesion

Do model families actually write alike?

Average writing similarity within each model family. High cohesion means the family shares a consistent voice. Low cohesion means the versions diverged.

GPT-5: 71.4%
Mistral: 47.7%
Llama: 42.3%
MiniMax: 41.1%
o-series: 33.8%
Grok: 31.2%
Gemini: 26.5%
Claude: 19.4%
Qwen: 12.7%
GPT: 3.2%
DeepSeek: 0.0%
06 — Feature Analysis

What separates AI writing styles?

The 32 dimensions ranked by how much they vary across models (CV = coefficient of variation). High variance = strong differentiator. Low variance = all models converge. A sketch of the computation follows the two lists.

Most discriminating

These features tell models apart the most

sentence length variance | CV 2.78 | top: Qwen3 Coder Flash | range: 41.42 - 44065.63
inline code rate | CV 2.14 | top: GPT-4 | range: 0.00 - 0.60
emoji rate | CV 1.81 | top: Qwen: Qwen3 Max Thinking | range: 0.00 - 1.37
ellipsis rate | CV 1.25 | top: DeepSeek R1 0528 | range: 0.00 - 0.68
italic rate | CV 1.18 | top: Qwen3 30B A3B Thinking 2507 | range: 0.00 - 8.43
semicolon rate | CV 1.17 | top: GPT-5 | range: 0.00 - 1.75
em dash rate | CV 1.11 | top: MoonshotAI: Kimi K2 0905 | range: 0.00 - 9.38
exclamation rate | CV 1.05 | top: Gemini 2.5 Flash Lite Preview 06-17 | range: 0.00 - 2.73

Universal convergence

These features are nearly identical across all models

code block rate | CV 0.00 | global avg: 0.000 | range: 0.000
opens with heading rate | CV 0.00 | global avg: 0.000 | range: 0.000
opens with greeting rate | CV 0.00 | global avg: 0.000 | range: 0.000
opens with first person rate | CV 0.00 | global avg: 0.000 | range: 0.000
opens with emoji rate | CV 0.00 | global avg: 0.000 | range: 0.000
avg word length | CV 0.06 | global avg: 5.247 | range: 1.806
type token ratio | CV 0.11 | global avg: 0.572 | range: 0.313
hapax ratio | CV 0.17 | global avg: 0.424 | range: 0.367
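As noted above, a minimal sketch of the ranking computation, assuming the raw (un-normalized) per-model fingerprints sit in a NumPy matrix; the zero-mean guard is our own defensive choice:

```python
import numpy as np

def feature_cv(X: np.ndarray) -> np.ndarray:
    """Coefficient of variation per feature across models: std / |mean|.
    X: raw per-model fingerprints, shape (n_models, 32).
    High CV = strong differentiator; CV near 0 = universal convergence."""
    mu = X.mean(axis=0)
    return X.std(axis=0) / np.where(mu == 0, 1.0, np.abs(mu))
```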
07 — Outliers

Most unique vs. most generic

Average similarity to all other models. Low = writes unlike anything else. High = writes like the average of everything. The thinking models dominate the unique end. The mid-tier models converge.

08 — Consistency

Who writes the same way every time?

Based on response-to-response variance within each model. Low variance = predictable. High variance = the model's style shifts depending on the prompt.

09 — Distribution

The similarity landscape

How similar are AI models to each other, overall? This histogram shows the distribution of all 15,753 pairwise comparisons (every unordered pair of the 178 models: 178 × 177 / 2). The peak sits just below zero, meaning most model pairs write quite differently. But the long right tail reveals the clones. A sketch of the computation follows the histogram.

-0.75 to -0.70: 7 (0.04%)
-0.70 to -0.65: 21 (0.13%)
-0.65 to -0.60: 22 (0.14%)
-0.60 to -0.55: 69 (0.44%)
-0.55 to -0.50: 161 (1.02%)
-0.50 to -0.45: 230 (1.46%)
-0.45 to -0.40: 383 (2.43%)
-0.40 to -0.35: 493 (3.13%)
-0.35 to -0.30: 660 (4.19%)
-0.30 to -0.25: 873 (5.54%)
-0.25 to -0.20: 934 (5.93%)
-0.20 to -0.15: 1055 (6.70%)
-0.15 to -0.10: 1238 (7.86%)
-0.10 to -0.05: 1172 (7.44%)
-0.05 to 0.00: 1102 (7.00%)
0.00 to 0.05: 1097 (6.96%)
0.05 to 0.10: 1014 (6.44%)
0.10 to 0.15: 873 (5.54%)
0.15 to 0.20: 787 (5.00%)
0.20 to 0.25: 698 (4.43%)
0.25 to 0.30: 603 (3.83%)
0.30 to 0.35: 517 (3.28%)
0.35 to 0.40: 400 (2.54%)
0.40 to 0.45: 302 (1.92%)
0.45 to 0.50: 275 (1.75%)
0.50 to 0.55: 193 (1.23%)
0.55 to 0.60: 173 (1.10%)
0.60 to 0.65: 119 (0.76%)
0.65 to 0.70: 80 (0.51%)
0.70 to 0.75: 87 (0.55%)
0.75 to 0.80: 58 (0.37%)
0.80 to 0.85: 29 (0.18%)
0.85 to 0.90: 16 (0.10%)
0.90 to 0.95: 10 (0.06%)
0.95 to 1.00: 2 (0.01%)
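A minimal sketch of how the distribution is built from the normalized fingerprints (the bin width and NumPy representation are assumptions):

```python
import numpy as np
from itertools import combinations

def similarity_histogram(Z: np.ndarray, step: float = 0.05):
    """Z: z-scored fingerprints, shape (n_models, 32).
    Bins all C(n, 2) pairwise cosine similarities; 178 models -> 15,753 pairs."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sims = [float(Zn[i] @ Zn[j]) for i, j in combinations(range(len(Zn)), 2)]
    edges = np.arange(-1.0, 1.0 + step, step)
    counts, _ = np.histogram(sims, bins=edges)
    return edges, counts
```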
10 — Prompt-Controlled Head-to-Head

Verified clones: same prompt, same writing

Instead of comparing model averages, we compared how each pair writes on the exact same prompts. A pair that matches consistently across many prompts is a stronger signal than a high aggregate score. Confidence = avg / (1 + stddev); the top pair below, for example, averages 89.6% with stddev 0.06 across 10 shared prompts: 0.896 / 1.06 ≈ 0.845.

Model A | Model B | Avg sim | Stddev | Prompts | Confidence
Mistral Large 2 (mistral) | Mistral Large 3 2512 (mistral) | 89.6% | 0.06 | 10 | 0.845
Mistral Devstral Medium (mistral) | Llama 4 Maverick (meta) | 80.9% | 0.07 | 10 | 0.757
Mistral Large (mistral) | Mistral Large 2 (mistral) | 83.0% | 0.10 | 10 | 0.756
Qwen: Qwen3.5 27B (qwen) | Qwen: Qwen3.5 35B A3B (qwen) | 83.0% | 0.11 | 23 | 0.748
Bert-Nebulon Alpha (openrouter) | Mistral Large 3 2512 (mistral) | 82.5% | 0.13 | 14 | 0.731
Gemma 3 12B (google) | Google: Gemma 3n 2B (google) | 81.3% | 0.12 | 14 | 0.729
Aurora Alpha (openrouter) | Inception: Mercury 2 (inception) | 80.3% | 0.11 | 22 | 0.720
Grok 3 (xai) | xAI: Grok 4 (xai) | 81.4% | 0.13 | 14 | 0.719
Llama 3.1 70B (Instruct) (meta) | Llama 4 Maverick (meta) | 80.1% | 0.13 | 10 | 0.710
Qwen Plus 0728 (thinking) (qwen) | Qwen3 30B A3B Thinking 2507 (qwen) | 77.9% | 0.10 | 12 | 0.707
Google: Gemma 3n 2B (google) | Gemma 3n 4B (google) | 84.0% | 0.19 | 10 | 0.705
DeepSeek V3 (March 2024) (deepseek) | Mistral Medium 3 (mistral) | 77.0% | 0.10 | 13 | 0.699
11 — Prompt-Dependent Twins

Clones on some prompts, strangers on others

These pairs have high average similarity but wildly inconsistent behavior across prompts. They write identically on one topic, then completely diverge on another. The prompt is the variable.

GPT-5.4 Mini vs GPT-5.4 Pro (stddev 0.40)
Range: -13.0% to 97.7%. Most alike on: advanced longevity plan. Most different on: startup pitch teardown.

GPT-5.4 vs GPT-5.4 Pro (stddev 0.39)
Range: -4.2% to 97.4%. Most alike on: advanced longevity plan. Most different on: startup pitch teardown.

MoonshotAI: Kimi K2 0905 vs OpenAI o3 (stddev 0.37)
Range: -18.8% to 97.7%. Most alike on: standup routine. Most different on: math misconception.

Claude Sonnet 4 vs Qwen3 Coder Flash (stddev 0.33)
Range: 4.0% to 98.0%. Most alike on: advanced longevity plan. Most different on: debug this architecture.

Qwen: Qwen3.5 122B A10B vs Qwen: Qwen3.5 397B A17B (stddev 0.33)
Range: -55.3% to 98.1%. Most alike on: logic puzzle. Most different on: satirical fake news.

Gemini 2.0 Flash Thinking vs Gemma 3 27B (stddev 0.33)
Range: 2.2% to 98.7%. Most alike on: realistic ai interview. Most different on: count letters.

GPT-5 Codex vs GPT-5.1 Chat (stddev 0.33)
Range: -10.9% to 96.3%. Most alike on: satirical fake news. Most different on: longevity plan.

Z.AI: GLM 4.7 vs Qwen: Qwen3.5 122B A10B (stddev 0.32)
Range: -48.2% to 96.9%. Most alike on: michelin star recipe. Most different on: satirical fake news.

Mistral Large vs Mistral Small Creative (stddev 0.32)
Range: -15.8% to 98.0%. Most alike on: advanced longevity plan. Most different on: estimate complexity.

GPT-5 Codex vs Kimi Linear 48B A3B Instruct (stddev 0.32)
Range: -14.6% to 96.6%. Most alike on: satirical fake news. Most different on: longevity plan.
12 — Prompt Convergence

Which prompts make all models write the same?

Each prompt is a controlled experiment: same instructions, different models. Some prompts cause all models to converge on nearly identical writing style. Others bring out each model's individuality.

Most convergent prompts

All models write similarly on these

1. satirical fake news (135 models, +1.6%)
2. stochastic consistency (166 models, +0.7%)
3. ai ethics dilemma (131 models, +0.2%)
4. debug this architecture (116 models, +0.2%)
5. michelin star recipe (121 models, +0.2%)
6. estimate complexity (142 models, +0.2%)
7. adversarial contract review (115 models, +0.1%)
8. historical counterfactual analysis (114 models, +0.1%)
9. simple recipe (152 models, +0.1%)
10. futuristic prediction (141 models, +0.1%)

Most divergent prompts

These reveal each model's personality

1. count letters (22 models, -2.5%)
2. mini lbo underwrite (66 models, -1.0%)
3. advanced investment memo (67 models, -0.9%)
4. explain like im a specific expert (115 models, -0.6%)
5. ethical dilemma with stakeholders (119 models, -0.4%)
6. startup pitch teardown (119 models, -0.4%)
7. character voice test (132 models, -0.4%)
8. ai generated manifesto (131 models, -0.2%)
9. logic puzzle (141 models, -0.2%)
10. realistic ai interview (148 models, -0.2%)
13 — Response-Level Records

The single most extreme response for each writing trait

Not model averages. The actual individual response that holds the record for each stylometric dimension. These are the outlier moments.

sentence length variance | Qwen3 Coder Flash | advanced longevity plan | 746642.00
avg paragraph length | MiniMax M2-her | explain like im a specific expert | 1861.00
avg sentence length | Claude Sonnet 4.5 | advanced longevity plan | 961.00
colon rate | Claude Opus 4 | advanced longevity plan | 238.24
comma rate | Claude Opus 4 | advanced longevity plan | 83.82
list item rate | Grok 3 Beta | count letters | 55.56
bold rate | Bert-Nebulon Alpha | math misconception | 39.74
parenthetical rate | Claude Opus 4 | advanced longevity plan | 76.47
em dash rate | Mistral Small Creative | advanced longevity plan | 72.77
avg word length | Llama 3.1 405B | xbox controller svg | 8.70
italic rate | Qwen3 30B A3B Thinking 2507 | character voice test | 22.09
first person rate | Qwen3 0.6B | standup routine | 13.88
Methodology

How we built this

Every model in our catalog generates responses to the same set of standardized prompts. From each text response, we extract a 32-dimension stylometric fingerprint covering lexical richness, sentence structure, punctuation habits, formatting preferences, and discourse patterns (hedging, boosting, transitions, etc.).
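For concreteness, here is a minimal sketch of what per-response extraction can look like. The feature names mirror dimensions reported above, but the tokenization and per-100-word rates are illustrative assumptions, not the production extractor:

```python
import re

def response_fingerprint(text: str) -> dict[str, float]:
    """Illustrative subset of a stylometric fingerprint (not the full 32 dims)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": n_words / n_sents,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "semicolon_rate": 100 * text.count(";") / n_words,     # per 100 words
        "em_dash_rate": 100 * text.count("\u2014") / n_words,
        "exclamation_rate": 100 * text.count("!") / n_words,
    }
```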

Per-model fingerprints are the mean of all individual response fingerprints (minimum 3 text responses required). Before computing similarity, all features are z-score normalized to prevent high-magnitude features (like sentence_length_variance) from dominating the comparison.
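A minimal sketch of that aggregation and normalization step, assuming each fingerprint arrives as a 32-dimension NumPy vector (the zero-variance guard is our own defensive choice):

```python
import numpy as np

def normalized_model_matrix(responses_by_model: list[list[np.ndarray]]) -> np.ndarray:
    """Mean the per-response fingerprints per model (>= 3 responses required),
    then z-score each feature across models so high-magnitude features
    like sentence_length_variance cannot dominate."""
    means = [np.mean(np.stack(rs), axis=0)
             for rs in responses_by_model if len(rs) >= 3]
    X = np.stack(means)                      # shape: (n_models, 32)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)          # keep zero-variance features at 0
    return (X - mu) / sd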

Pairwise similarity uses cosine similarity on the normalized 32-dimension vectors. Clone detection uses a threshold of 0.90 (top 0.08% of all pairs). Price arbitrage compares models with > 0.75 similarity using a weighted average price (3:1 output:input ratio).
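Both computations are simple enough to show inline. The blended-price formula below is inferred from the published ratios (it reproduces the 184.62x gap in Section 03 exactly):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Weighted average price at a 3:1 output:input ratio."""
    return (3 * output_per_m + input_per_m) / 4

# Price arbitrage check, top pair of Section 03:
cheap = blended_price(0.10, 0.40)      # Gemini 2.5 Flash Lite -> 0.325
pricey = blended_price(15.00, 75.00)   # Claude 3 Opus         -> 60.0
print(round(pricey / cheap, 2))        # 184.62x
```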

Provider distinctiveness is the ratio of average intra-provider similarity to inter-provider similarity. Values > 1 indicate a detectable house writing style.
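A sketch of the ratio, assuming a precomputed (n, n) cosine-similarity matrix and one provider label per model:

```python
import numpy as np

def distinctiveness(sim: np.ndarray, providers: list[str], who: str) -> float:
    """Mean intra-provider similarity / mean inter-provider similarity.
    Values > 1 suggest a recognizable house style."""
    mask = np.array([p == who for p in providers])
    intra = sim[np.ix_(mask, mask)]
    intra_pairs = intra[np.triu_indices_from(intra, k=1)]  # distinct pairs only
    inter_pairs = sim[np.ix_(mask, ~mask)].ravel()
    return float(intra_pairs.mean() / inter_pairs.mean())
```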

Raw response analysis goes deeper: for each of the 43 standardized prompts, we compute per-challenge pairwise similarity across all responding models. This produces prompt-controlled head-to-head scores (same prompt, same conditions). A "confidence" metric (avg similarity / (1 + stddev)) identifies pairs that are consistently similar across many prompts, not just similar on average. Prompt-dependent twins are pairs with high average but high variance, meaning the prompt determines whether they converge.
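The confidence metric itself, as defined above:

```python
import numpy as np

def confidence(per_prompt_sims: np.ndarray) -> float:
    """avg similarity / (1 + stddev): rewards pairs that are consistently
    similar across prompts, penalizes volatile ones."""
    return float(per_prompt_sims.mean() / (1 + per_prompt_sims.std()))

# Top verified clone in Section 10: avg 89.6%, stddev 0.06 over 10 prompts
# 0.896 / 1.06 ~= 0.845
```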

Total corpus: 3,095 responses across 178 models and 43 standardized prompts. Analysis recomputed with each model addition.

Get the full report

15-slide PDF with all visualizations, composite clone scores, and provider logos.

Download PDF report