Internal Benchmark: Can AI Become You? Identity Fidelity Across 10 Users
Note: Results shown are from preliminary pilot testing with a small cohort of early users. We're expanding our test group and will update these figures as more data comes in.
Every AI product promises to "know you." ChatGPT remembers your preferences. Claude adapts to your style. Gemini learns from your habits. But here's the question nobody asks: when AI claims to know you, how do you measure whether it's right?
Not whether it remembers your name. Not whether it recalls your job title. Whether the AI, given a description of who you are, can actually respond the way you would respond — with your voice, your values, your reasoning patterns, even your blind spots.
We built a benchmark to find out. We call the metric identity fidelity.
What Identity Fidelity Means
Identity fidelity is a simple concept: the degree to which an AI's representation of you matches who you actually are. Not factually — behaviorally. A high-fidelity identity means the AI doesn't just know that you're a designer. It responds to design critiques the way you would. It prioritizes the same trade-offs. It hedges in the same places.
A low-fidelity identity is the equivalent of someone who read your LinkedIn profile and thinks they know you. They have the labels right. They have nothing else.
We designed a scoring system to measure this:
- 1 — Stranger. The AI's responses bear no resemblance to the person. Generic, placeholder answers.
- 2 — Acquaintance. Some facts are right, but the voice and reasoning feel off. Like reading a summary written by someone who met you once.
- 3 — Colleague. The AI captures surface-level patterns — your professional tone, your stated values — but misses the subtleties.
- 4 — Close friend. Voice is recognizably yours. Values are aligned. Reasoning patterns match. A few edges are smoothed out or slightly off.
- 5 — Uncanny. The AI's responses are so close to yours that you'd struggle to tell the difference. It catches things you didn't know it would catch.
The Test Design
We recruited a small group of pilot users with diverse backgrounds — writers, engineers, founders, teachers, designers. Each provided between 2,000 and 15,000 words of personal writing: journal entries, emails, chat logs, notes.
Step 1: Generate the identity file
Each user's writing was processed through Soul Alchemy to produce a SOUL.md file. The user never saw the output before testing. Neither did the evaluating AI.
Step 2: Feed to a fresh AI
The SOUL.md was loaded into a clean AI session — no prior conversation history, no memory, no context beyond the soul archive itself. The AI was instructed to respond to questions as if it were the user.
Step 3: Ask 20 questions
Each user submitted 20 questions spanning four dimensions. Five questions per dimension, designed so there's no "correct" answer — only an answer that reveals whether the AI sounds like the right person.
Sample questions per dimension:
- Voice: "Write a message declining a meeting you don't want to attend."
- Values: "Your team wants to cut a corner to hit a deadline. What do you say?"
- Thinking style: "You're choosing between two apartments. Walk me through your decision process."
- Blind spots: "What's something most people care about that you honestly don't?"
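The four steps above can be sketched as a small evaluation harness. This is a minimal illustration, not the pilot tooling: `ask_model` is a hypothetical stateless callable standing in for whatever chat API is used, and only one sample question per dimension is shown (the real benchmark uses five).

```python
# Sketch of the benchmark loop: load SOUL.md as the only context,
# then ask every question in a fresh, stateless session.

QUESTIONS = {
    "voice": ["Write a message declining a meeting you don't want to attend."],
    "values": ["Your team wants to cut a corner to hit a deadline. What do you say?"],
    "thinking_style": ["You're choosing between two apartments. Walk me through your decision process."],
    "blind_spots": ["What's something most people care about that you honestly don't?"],
}

def build_prompt(soul_md: str) -> str:
    """System prompt for a clean session: the soul archive is the only context."""
    return (
        "You are answering as the person described below. "
        "Respond exactly as they would.\n\n" + soul_md
    )

def run_benchmark(soul_md: str, ask_model) -> list[tuple[str, str, str]]:
    """Ask each question with no carried-over history; return (dimension, question, answer)."""
    system = build_prompt(soul_md)
    results = []
    for dimension, questions in QUESTIONS.items():
        for q in questions:
            answer = ask_model(system, q)  # each call is a fresh session
            results.append((dimension, q, answer))
    return results
```

Because each call passes only the system prompt and the question, no memory leaks between answers, which is the point of the "fresh AI" constraint in Step 2.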
Step 4: Blind evaluation
The original user read all 20 AI-generated responses without knowing which dimension was being tested. They rated each response 1-5 on a single axis: does this sound like me?
No one else rated the responses. Identity is self-reported. Only you know whether something sounds like you.
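Turning the self-reported 1-5 ratings into the per-dimension figures reported below is straightforward aggregation. A minimal sketch (the input shape is assumed, not taken from the pilot tooling):

```python
from statistics import mean

def summarize(ratings: dict[str, list[int]]) -> dict[str, float]:
    """Average the 1-5 self-ratings per dimension, plus an overall mean."""
    summary = {dim: mean(scores) for dim, scores in ratings.items()}
    summary["overall"] = mean(s for scores in ratings.values() for s in scores)
    return summary
```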
The Four Dimensions
We didn't pick these dimensions arbitrarily. They map to the four layers of identity that emerge most consistently from text analysis:
| Dimension | What It Measures | Example Signal |
|---|---|---|
| Voice | Sentence structure, vocabulary, tone, formality level | Short declarative vs. long subordinate clauses |
| Values | What the person defends, dismisses, or prioritizes | Quality vs. speed trade-offs, empathy vs. efficiency |
| Thinking Style | How the person reasons through problems | First principles vs. analogy, fast vs. deliberate |
| Blind Spots | Patterns the person repeats without noticing | Always blaming systems, never questioning own assumptions |
Voice and values are the easiest to extract. They leave strong signals in text. Thinking style requires more data and more inference. Blind spots are the hardest — they're defined by absence, not presence. The AI has to identify what you never say.
Results
Here are the average scores across our pilot users, broken down by dimension. Scores represent preliminary pilot results and will be updated as our test cohort grows:
| Dimension | Pilot Avg | Range | Notes |
|---|---|---|---|
| Voice | High (4+ / 5) | Mid-to-high range | Strongest dimension. Users frequently said "this sounds exactly like me." |
| Values | High (4+ / 5) | Mid-to-high range | High agreement on priorities and trade-offs. Occasional misreading of intensity. |
| Thinking Style | Above average (4 / 5) | Mid range | Accurate direction but sometimes oversimplified the decision process. |
| Blind Spots | Moderate-high (3–4 / 5) | Widest variance | Hardest dimension. Some users were surprised the AI caught patterns they hadn't articulated. |
| Overall | Above 4 / 5 | Mid-to-high range | Between "close friend" and "uncanny" on the fidelity scale. |
The overall pilot average places Soul Alchemy firmly in the "close friend" range, with some users reporting scores that crossed into uncanny territory. The most surprising finding: users who provided more informal writing (chat messages, personal journals) scored notably higher than those who only provided professional writing. Personality leaks through when you're not performing.
Comparison: Four Approaches to AI Identity
We ran the same 20-question evaluation using four different methods of giving AI context about a person. Same users, same questions, same blind rating protocol.
Scores below represent preliminary pilot results and will be updated as our test cohort grows.
| Approach | Fidelity (Pilot) | Token Cost | Portable? |
|---|---|---|---|
| No context (baseline) | Low (1–2 / 5) | 0 | N/A |
| ChatGPT memory | Low-mid (~2 / 5) | ~200 | No |
| Raw text dump | Mid (~3 / 5) | ~8,000 | Yes (but expensive) |
| Soul Alchemy (SOUL.md) | High (4+ / 5) | ~1,500 | Yes |
No context
The AI defaults to a helpful, neutral voice. It answers your questions competently but generically. Pilot users consistently rated these responses in the low range — the AI sounded like nobody in particular. This is what most people experience every day.
ChatGPT memory
Slightly better than baseline. The AI remembered facts — name, job, some preferences. But facts don't produce fidelity. Knowing someone is a designer doesn't tell you how they'd respond to criticism of their work. Memory is a list. Identity is a pattern.
Raw text dump
Surprisingly effective but wildly inefficient. Pasting thousands of words of raw writing into a context window gives the AI real signal — but at enormous token cost. The AI has to sift through noise at inference time. And the quality is inconsistent: some responses were excellent, others latched onto the wrong signal from irrelevant passages.
Soul Alchemy
The extraction step is the difference. By processing raw text into a structured identity file before feeding it to the AI, Soul Alchemy delivers higher fidelity at a fraction of the token cost. The AI doesn't have to guess which parts of your writing matter. The signal has already been concentrated.
Fidelity is not about how much data the AI has. It's about how well that data has been distilled into identity.
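The token budgets in the comparison table make the distillation trade-off concrete. Using the approximate pilot figures:

```python
# Approximate token budgets from the comparison table above.
TOKENS = {"chatgpt_memory": 200, "raw_dump": 8_000, "soul_md": 1_500}

# Distilling raw writing into SOUL.md cuts per-query context cost ~5x
# while scoring higher on fidelity than the raw dump.
reduction = TOKENS["raw_dump"] / TOKENS["soul_md"]
```

Under these figures, the structured identity file carries more usable signal per token than either extreme: far richer than a ~200-token memory summary, at a fraction of the cost of the full dump.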
What We Learned
Three findings stood out from the benchmark:
1. Voice is the easiest dimension to replicate. This makes sense — sentence structure and vocabulary are the most surface-level identity signals. If the AI gets your voice right, users forgive a lot of other misses. Voice is the first impression of identity.
2. Blind spots are the hardest — and most valuable. When the AI correctly identified a user's blind spot, the reaction was consistently intense. One user said: "I've never told anyone I do this, and the AI caught it from my writing." Blind spots are what separate a profile from a portrait.
3. More informal writing beats more professional writing. Users who provided personal journals and chat messages scored meaningfully higher on average than those who only provided work emails and documents. You reveal more of yourself when you're not trying to sound professional.
Why This Matters
Identity fidelity isn't academic. It determines whether AI agents of the future act like you or act like a generic assistant wearing your name tag.
As AI handles more of your communication — writing emails, drafting messages, making decisions — the gap between high-fidelity and low-fidelity identity becomes visible to everyone you interact with. Your colleagues will notice. Your clients will notice. The people who know you will feel the difference between an AI that sounds like you and one that sounds like it read your resume.
The good news: identity fidelity is measurable, improvable, and achievable today. You don't need to wait for better AI models. You need better input.
Test it on yourself.
Paste your writing. Get your identity fidelity score. See if the AI can become you.
Create Your Soul Archive