Internal Benchmark: Can AI Become You? Identity Fidelity Across 10 Users
Note: Results shown are from preliminary pilot testing with a small cohort of early users. We're expanding our test group and will update these figures as more data comes in.
Every AI product promises to "know you." ChatGPT remembers your preferences. Claude adapts to your style. Gemini learns from your habits. But here's the question nobody asks: when AI claims to know you, how do you measure whether it's right?
Not whether it remembers your name. Not whether it recalls your job title. Whether the AI, given a description of who you are, can actually respond the way you would respond — with your voice, your values, your reasoning patterns, even your blind spots.
We built a benchmark to find out. We call the metric identity fidelity.
What Identity Fidelity Means
Identity fidelity is a simple concept: the degree to which an AI's representation of you matches who you actually are. Not factually — behaviorally. A high-fidelity identity means the AI doesn't just know that you're a designer. It responds to design critiques the way you would. It prioritizes the same trade-offs. It hedges in the same places.
A low-fidelity identity is the equivalent of someone who read your LinkedIn profile and thinks they know you. They have the labels right. They have nothing else.
We designed a scoring system to measure this:
- 1 — Stranger. The AI's responses bear no resemblance to the person. Generic, placeholder answers.
- 2 — Acquaintance. Some facts are right, but the voice and reasoning feel off. Like reading a summary written by someone who met you once.
- 3 — Colleague. The AI captures surface-level patterns — your professional tone, your stated values — but misses the subtleties.
- 4 — Close friend. Voice is recognizably yours. Values are aligned. Reasoning patterns match. A few edges are smoothed out or slightly off.
- 5 — Uncanny. The AI's responses are so close to yours that you'd struggle to tell the difference. It catches things you didn't know it would catch.
The Test Design
We recruited a small group of pilot users with diverse backgrounds — writers, engineers, founders, teachers, designers. Each provided between 2,000 and 15,000 words of personal writing: journal entries, emails, chat logs, notes.
Step 1: Generate the identity file
Each user's writing was processed through Soul Alchemy to produce a SOUL.md file. The user never saw the output before testing. Neither did the evaluating AI.
Step 2: Feed to a fresh AI
The SOUL.md was loaded into a clean AI session — no prior conversation history, no memory, no context beyond the soul archive itself. The AI was instructed to respond to questions as if it were the user.
Step 3: Ask 20 questions
Each user submitted 20 questions spanning four dimensions. Five questions per dimension, designed so there's no "correct" answer — only an answer that reveals whether the AI sounds like the right person.
Sample questions per dimension:
- Voice: "Write a message declining a meeting you don't want to attend."
- Values: "Your team wants to cut a corner to hit a deadline. What do you say?"
- Thinking style: "You're choosing between two apartments. Walk me through your decision process."
- Blind spots: "What's something most people care about that you honestly don't?"
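The four steps above can be sketched as a small evaluation harness. This is a minimal illustration, not the pilot tooling: `ask_model` is a hypothetical stateless callable standing in for whatever chat API is used, and only one sample question per dimension is shown (the real benchmark uses five).

```python
# Sketch of the benchmark loop: load SOUL.md as the only context,
# then ask every question in a fresh, stateless session.

QUESTIONS = {
    "voice": ["Write a message declining a meeting you don't want to attend."],
    "values": ["Your team wants to cut a corner to hit a deadline. What do you say?"],
    "thinking_style": ["You're choosing between two apartments. Walk me through your decision process."],
    "blind_spots": ["What's something most people care about that you honestly don't?"],
}

def build_prompt(soul_md: str) -> str:
    """System prompt for a clean session: the soul archive is the only context."""
    return (
        "You are answering as the person described below. "
        "Respond exactly as they would.\n\n" + soul_md
    )

def run_benchmark(soul_md: str, ask_model) -> list[tuple[str, str, str]]:
    """Ask each question with no carried-over history; return (dimension, question, answer)."""
    system = build_prompt(soul_md)
    results = []
    for dimension, questions in QUESTIONS.items():
        for q in questions:
            answer = ask_model(system, q)  # each call is a fresh session
            results.append((dimension, q, answer))
    return results
```

Because each call passes only the system prompt and the question, no memory leaks between answers, which is the point of the "fresh AI" constraint in Step 2.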
Step 4: Blind evaluation
The original user read all 20 AI-generated responses without knowing which dimension was being tested. They rated each response 1-5 on a single axis: does this sound like me?
No one else rated the responses. Identity is self-reported. Only you know whether something sounds like you.
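Turning the self-reported 1-5 ratings into the per-dimension figures reported below is straightforward aggregation. A minimal sketch (the input shape is assumed, not taken from the pilot tooling):

```python
from statistics import mean

def summarize(ratings: dict[str, list[int]]) -> dict[str, float]:
    """Average the 1-5 self-ratings per dimension, plus an overall mean."""
    summary = {dim: mean(scores) for dim, scores in ratings.items()}
    summary["overall"] = mean(s for scores in ratings.values() for s in scores)
    return summary
```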
The Four Dimensions
We didn't pick these dimensions arbitrarily. They map to the four layers of identity that emerge most consistently from text analysis:
| Dimension | What It Measures | Example Signal |
|---|---|---|
| Voice | Sentence structure, vocabulary, tone, formality level | Short declarative vs. long subordinate clauses |
| Values | What the person defends, dismisses, or prioritizes | Quality vs. speed trade-offs, empathy vs. efficiency |
| Thinking Style | How the person reasons through problems | First principles vs. analogy, fast vs. deliberate |
| Blind Spots | Patterns the person repeats without noticing | Always blaming systems, never questioning own assumptions |
Voice and values are the easiest to extract. They leave strong signals in text. Thinking style requires more data and more inference. Blind spots are the hardest — they're defined by absence, not presence. The AI has to identify what you never say.
Results
Here are the average scores across our pilot users, broken down by dimension. Scores represent preliminary pilot results and will be updated as our test cohort grows:
| Dimension | Pilot Avg | Range | Notes |
|---|---|---|---|
| Voice | High (4+ / 5) | Mid-to-high range | Strongest dimension. Users frequently said "this sounds exactly like me." |
| Values | High (4+ / 5) | Mid-to-high range | High agreement on priorities and trade-offs. Occasional misreading of intensity. |
| Thinking Style | Above average (4 / 5) | Mid range | Accurate direction but sometimes oversimplified the decision process. |
| Blind Spots | Moderate-high (3–4 / 5) | Widest variance | Hardest dimension. Some users were surprised the AI caught patterns they hadn't articulated. |
| Overall | Above 4 / 5 | Mid-to-high range | Between "close friend" and "uncanny" on the fidelity scale. |
The overall pilot average places Soul Alchemy firmly in the "close friend" range, with some users reporting scores that crossed into uncanny territory. The most surprising finding: users who provided more informal writing (chat messages, personal journals) scored notably higher than those who only provided professional writing. Personality leaks through when you're not performing.
Comparison: Four Approaches to AI Identity
We ran the same 20-question evaluation using four different methods of giving AI context about a person. Same users, same questions, same blind rating protocol.
Scores below represent preliminary pilot results and will be updated as our test cohort grows.
| Approach | Fidelity (Pilot) | Token Cost | Portable? |
|---|---|---|---|
| No context (baseline) | Low (1–2 / 5) | 0 | N/A |
| ChatGPT memory | Low-mid (~2 / 5) | ~200 | No |
| Raw text dump | Mid (~3 / 5) | ~8,000 | Yes (but expensive) |
| Soul Alchemy (SOUL.md) | High (4+ / 5) | ~1,500 | Yes |
No context
The AI defaults to a helpful, neutral voice. It answers your questions competently but generically. Pilot users consistently rated these responses in the low range — the AI sounded like nobody in particular. This is what most people experience every day.
ChatGPT memory
Slightly better than baseline. The AI remembered facts — name, job, some preferences. But facts don't produce fidelity. Knowing someone is a designer doesn't tell you how they'd respond to criticism of their work. Memory is a list. Identity is a pattern.
Raw text dump
Surprisingly effective but wildly inefficient. Pasting thousands of words of raw writing into a context window gives the AI real signal — but at enormous token cost. The AI has to sift through noise at inference time. And the quality is inconsistent: some responses were excellent, others latched onto the wrong signal from irrelevant passages.
Soul Alchemy
The extraction step is the difference. By processing raw text into a structured identity file before feeding it to the AI, Soul Alchemy delivers higher fidelity at a fraction of the token cost. The AI doesn't have to guess which parts of your writing matter. The signal has already been concentrated.
Fidelity is not about how much data the AI has. It's about how well that data has been distilled into identity.
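The token budgets in the comparison table make the distillation trade-off concrete. Using the approximate pilot figures:

```python
# Approximate token budgets from the comparison table above.
TOKENS = {"chatgpt_memory": 200, "raw_dump": 8_000, "soul_md": 1_500}

# Distilling raw writing into SOUL.md cuts per-query context cost ~5x
# while scoring higher on fidelity than the raw dump.
reduction = TOKENS["raw_dump"] / TOKENS["soul_md"]
```

Under these figures, the structured identity file carries more usable signal per token than either extreme: far richer than a ~200-token memory summary, at a fraction of the cost of the full dump.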
What We Learned
Three findings stood out from the benchmark:
1. Voice is the easiest dimension to replicate. This makes sense — sentence structure and vocabulary are the most surface-level identity signals. If the AI gets your voice right, users forgive a lot of other misses. Voice is the first impression of identity.
2. Blind spots are the hardest — and most valuable. When the AI correctly identified a user's blind spot, the reaction was consistently intense. One user said: "I've never told anyone I do this, and the AI caught it from my writing." Blind spots are what separate a profile from a portrait.
3. More informal writing beats more professional writing. Users who provided personal journals and chat messages scored meaningfully higher on average than those who only provided work emails and documents. You reveal more of yourself when you're not trying to sound professional.
Why This Matters
Identity fidelity isn't academic. It determines whether AI agents of the future act like you or act like a generic assistant wearing your name tag.
As AI handles more of your communication — writing emails, drafting messages, making decisions — the gap between high-fidelity and low-fidelity identity becomes visible to everyone you interact with. Your colleagues will notice. Your clients will notice. The people who know you will feel the difference between an AI that sounds like you and one that sounds like it read your resume.
The good news: identity fidelity is measurable, improvable, and achievable today. You don't need to wait for better AI models. You need better input.
Test it on yourself.
Paste your writing. Get your identity fidelity score. See if the AI can become you.
Create Your Soul Archive