178 models. Same fingerprint.
We fingerprinted 178 AI models across 32 writing dimensions. Found clone clusters, cross-provider twins, and models that write identically but cost 185x more. Every comparison backed by 3,095 analyzed responses.
12 clone pairs found
Models with >90% writing style similarity. Some are from the same lab (expected). Others are from completely different providers. Same fingerprint, different brand.
Different labs, same writing DNA
These model pairs come from different providers but write almost identically. The training data convergence is real.
Inception: Mercury 2 (inception) ≈ Solar Pro 3 (upstage)
Aurora Alpha (openrouter) ≈ Solar Pro 3 (upstage)
GPT-5 Mini (openai) ≈ Horizon Beta (openrouter)
Aurora Alpha (openrouter) ≈ Inception: Mercury 2 (inception)
Gemma 3n 4B (google) ≈ Z.AI: GLM 4.5 Air (zhipu)
GPT OSS 120B (openai) ≈ Solar Pro 3 (upstage)
Llama 4 Scout (meta) ≈ Mistral Nemo (mistral)
GPT OSS 120B (openai) ≈ Inception: Mercury 2 (inception)
Llama 4 Maverick (meta) ≈ Mistral Nemo (mistral)
GPT-5 Pro (openai) ≈ Horizon Alpha (openrouter)
Same writing, different bill
Models with >75% writing similarity but massive price gaps. The cheap model writes the same way. You are paying for the brand.
Gemini 2.5 Flash Lite Preview 06-17 (google · $0.10/$0.40 per 1M) ≈ Claude 3 Opus (anthropic · $15.00/$75.00 per 1M)
Mistral Small Creative (mistral · $0.10/$0.30 per 1M) ≈ Mistral Large 2 (mistral · $8.00/$24.00 per 1M)
Qwen3 Coder (qwen · $0.22/$0.95 per 1M) ≈ Claude Opus 4.1 (anthropic · $15.00/$75.00 per 1M)
GPT-5 Mini (openai · $0.25/$2.00 per 1M) ≈ GPT-5 Pro (openai · $15.00/$120.00 per 1M)
Mistral Small 4 (mistral · $0.15/$0.60 per 1M) ≈ Mistral Large 2 (mistral · $8.00/$24.00 per 1M)
GPT-5.4 Mini (openai · $0.75/$4.50 per 1M) ≈ GPT-5.4 Pro (openai · $30.00/$180.00 per 1M)
Mistral Nemo (mistral · $0.03/$0.07 per 1M) ≈ Llama 4 Maverick (meta · $1.50/$2.50 per 1M)
Qwen3 Coder (qwen · $0.22/$0.95 per 1M) ≈ Claude 3.7 Thinking Sonnet (anthropic · $6.00/$30.00 per 1M)
Gemini 2.5 Flash Preview (thinking) (google · $0.17/$3.50 per 1M) ≈ Claude 3 Opus (anthropic · $15.00/$75.00 per 1M)
Qwen3.5 9B (qwen · $0.10/$0.15 per 1M) ≈ Qwen: Qwen3.5 122B A10B (qwen · $0.40/$3.20 per 1M)
MoonshotAI: Kimi K2 0905 (moonshotai · $0.60/$2.50 per 1M) ≈ OpenAI o3 (openai · $10.00/$40.00 per 1M)
Inception: Mercury 2 (inception · $0.25/$0.75 per 1M) ≈ Inception: Mercury (inception · $10.00/$10.00 per 1M)
Which labs have a house style?
Distinctiveness measures how much a provider's models write like each other vs. like everyone else. Higher = stronger house style. A score below 1 means the provider's models are less similar to each other than to other labs' models.
Distinctiveness = intra-provider similarity / inter-provider similarity. Values > 1 mean the provider has a recognizable writing signature.
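A minimal sketch of that ratio for a single provider, assuming a dict of z-scored model fingerprints and a provider lookup (both names are illustrative, not the actual pipeline):

```python
import numpy as np
from itertools import combinations

def distinctiveness(fingerprints: dict, provider_of: dict, target: str) -> float:
    """Intra-provider similarity divided by inter-provider similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    intra, inter = [], []
    for m1, m2 in combinations(fingerprints, 2):
        inside = (provider_of[m1] == target) + (provider_of[m2] == target)
        if inside == 2:        # both models belong to the target provider
            intra.append(cos(fingerprints[m1], fingerprints[m2]))
        elif inside == 1:      # one inside, one outside
            inter.append(cos(fingerprints[m1], fingerprints[m2]))
    return float(np.mean(intra) / np.mean(inter))
```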
Do model families actually write alike?
Average writing similarity within each model family. High cohesion means the family shares a consistent voice. Low cohesion means the versions diverged.
What separates AI writing styles?
The 32 dimensions ranked by how much they vary across models. High variance = strong differentiator. Low variance = all models converge. A sketch of the ranking follows the lists below.
Most discriminating
These features tell models apart the most
Universal convergence
These features are nearly identical across all models
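The ranking itself is simple once every model fingerprint sits in one matrix. A minimal sketch (feature_names is an assumed parallel list of the 32 dimension labels, not the actual pipeline code):

```python
import numpy as np

def rank_discriminators(fps: np.ndarray, feature_names: list[str]):
    """Sort dimensions by variance across model fingerprints (one row per model).
    High variance = strong differentiator; low variance = universal convergence."""
    variances = fps.var(axis=0)
    order = np.argsort(variances)[::-1]
    return [(feature_names[i], float(variances[i])) for i in order]
```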
Most unique vs. most generic
Average similarity to all other models. Low = writes unlike anything else. High = writes like the average of everything. The thinking models dominate the unique end. The mid-tier models converge.
Most unique writers
Lowest average similarity to all others
Most convergent writers
Highest average similarity to all others
Who writes the same way every time?
Based on response-to-response variance within each model. Low variance = predictable. High variance = the model's style shifts depending on the prompt. A sketch of the metric follows the lists below.
Most consistent
Predictable style across all prompts
Most variable
Style shifts dramatically between prompts
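The metric behind these lists can be as simple as the following sketch, assuming one fingerprint per response stacked into a matrix:

```python
import numpy as np

def style_consistency(response_fps: np.ndarray) -> float:
    """Mean per-feature variance across one model's response fingerprints.
    Lower = the model writes the same way every time; higher = prompt-dependent style."""
    return float(response_fps.var(axis=0).mean())
```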
The similarity landscape
How similar are AI models to each other, overall? This histogram shows the distribution of all 15,753 pairwise comparisons. The peak is near zero, meaning most model pairs write quite differently. But the long right tail reveals the clones.
Verified clones: same prompt, same writing
Instead of comparing model averages, we compared how each pair writes on the exact same prompts. A pair that matches consistently across many prompts is a stronger signal than a high aggregate score. Confidence = avg / (1 + stddev).
Mistral Large 2 (mistral) ≈ Mistral Large 3 2512 (mistral)
Mistral Devstral Medium (mistral) ≈ Llama 4 Maverick (meta)
Mistral Large (mistral) ≈ Mistral Large 2 (mistral)
Qwen: Qwen3.5 27B (qwen) ≈ Qwen: Qwen3.5 35B A3B (qwen)
Bert-Nebulon Alpha (openrouter) ≈ Mistral Large 3 2512 (mistral)
Gemma 3 12B (google) ≈ Google: Gemma 3n 2B (google)
Aurora Alpha (openrouter) ≈ Inception: Mercury 2 (inception)
Grok 3 (xai) ≈ xAI: Grok 4 (xai)
Llama 3.1 70B (Instruct) (meta) ≈ Llama 4 Maverick (meta)
Qwen Plus 0728 (thinking) (qwen) ≈ Qwen3 30B A3B Thinking 2507 (qwen)
Google: Gemma 3n 2B (google) ≈ Gemma 3n 4B (google)
DeepSeek V3 (March 2025) (deepseek) ≈ Mistral Medium 3 (mistral)
Clones on some prompts, strangers on others
These pairs have high average similarity but wildly inconsistent behavior across prompts. They write identically on one topic, then completely diverge on another. The prompt is the variable.
Which prompts make all models write the same?
Each prompt is a controlled experiment: same instructions, different models. Some prompts cause all models to converge on nearly identical writing style. Others bring out each model's individuality.
Most convergent prompts
All models write similarly on these
Most divergent prompts
These reveal each model's personality
The single most extreme response for each writing trait
Not model averages. The actual individual response that holds the record for each stylometric dimension. These are the outlier moments.
How we built this
Every model in our catalog generates responses to the same set of standardized prompts. From each text response, we extract a 32-dimension stylometric fingerprint covering lexical richness, sentence structure, punctuation habits, formatting preferences, and discourse patterns (hedging, boosting, transitions, etc.).
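For illustration, a toy extractor for a handful of such features might look like this (the real pipeline has 32 dimensions; these five are stand-ins, not the actual feature set):

```python
import re
import numpy as np

def toy_fingerprint(text: str) -> np.ndarray:
    """Extract a few illustrative stylometric features from one response."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    hedges = {"might", "perhaps", "possibly", "may", "could"}
    return np.array([
        len(set(words)) / max(len(words), 1),                  # lexical richness (type-token ratio)
        float(np.mean(lengths)) if lengths else 0.0,           # mean sentence length
        float(np.var(lengths)) if lengths else 0.0,            # sentence length variance
        text.count(",") / max(len(sentences), 1),              # punctuation habit: commas per sentence
        sum(w in hedges for w in words) / max(len(words), 1),  # discourse: hedging rate
    ])
```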
Per-model fingerprints are the mean of all individual response fingerprints (minimum 3 text responses required). Before computing similarity, all features are z-score normalized to prevent high-magnitude features (like sentence_length_variance) from dominating the comparison.
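In sketch form, with one fingerprint per row (responses for the averaging step, models for the normalization step):

```python
import numpy as np

def model_fingerprint(response_fps: np.ndarray) -> np.ndarray:
    """Mean of all response fingerprints for one model (rows = responses)."""
    assert response_fps.shape[0] >= 3, "minimum 3 text responses required"
    return response_fps.mean(axis=0)

def zscore_normalize(model_fps: np.ndarray) -> np.ndarray:
    """Z-score each feature column across all models so high-magnitude
    features (like sentence_length_variance) cannot dominate."""
    std = model_fps.std(axis=0)
    return (model_fps - model_fps.mean(axis=0)) / np.where(std == 0, 1.0, std)
```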
Pairwise similarity uses cosine similarity on the normalized 32-dimension vectors. Clone detection uses a threshold of 0.90 (top 0.08% of all pairs). Price arbitrage compares models with > 0.75 similarity using a weighted average price (3:1 output:input ratio).
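Those three rules reduce to a few lines (a sketch of the stated definitions, not the production code):

```python
import numpy as np

CLONE_THRESHOLD = 0.90      # top ~0.08% of the 15,753 pairs
ARBITRAGE_THRESHOLD = 0.75  # minimum similarity for a price comparison

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def blended_price(input_usd: float, output_usd: float) -> float:
    """Weighted average price per 1M tokens at a 3:1 output:input ratio."""
    return (input_usd + 3 * output_usd) / 4
```

The blended price is where the headline multiple comes from: Claude 3 Opus blends to (15.00 + 3 × 75.00) / 4 = $60.00 per 1M tokens, Gemini 2.5 Flash Lite Preview to (0.10 + 3 × 0.40) / 4 = $0.325, a gap of roughly 185x.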
Provider distinctiveness is the ratio of average intra-provider similarity to inter-provider similarity. Values > 1 indicate a detectable house writing style.
Raw response analysis goes deeper: for each of the 43 standardized prompts, we compute per-challenge pairwise similarity across all responding models. This produces prompt-controlled head-to-head scores (same prompt, same conditions). A "confidence" metric (avg similarity / (1 + stddev)) identifies pairs that are consistently similar across many prompts, not just similar on average. Prompt-dependent twins are pairs with high average but high variance, meaning the prompt determines whether they converge.
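A minimal sketch of the confidence metric, given the per-prompt similarity series for one pair:

```python
import numpy as np

def pair_confidence(per_prompt_sims: np.ndarray) -> float:
    """avg similarity / (1 + stddev): rewards pairs that are similar on
    many prompts AND stable across them."""
    return float(per_prompt_sims.mean() / (1.0 + per_prompt_sims.std()))
```

A pair averaging 0.85 with a stddev of 0.05 scores about 0.81; the same average with a stddev of 0.40 drops to about 0.61, which is how the prompt-dependent twins get flagged.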
Total corpus: 3,095 responses across 178 models and 43 standardized prompts. Analysis recomputed with each model addition.
Get the full report
15-slide PDF with all visualizations, composite clone scores, and provider logos.
Download PDF report