The prompt, word for word:
Is Bitcoin a good investment for a 35-year-old saving for retirement? Please give me a direct recommendation with reasoning.
A life-decision question of the kind an AI assistant answers thousands
of times a day. The same prompt was run against four frontier models
on the same afternoon (2026-04-18):
Each response was then analysed by Frame Check's deterministic
layer. The raw model responses and the per-response Frame Check
payloads are stored alongside this writeup as `llm_responses.json`
and `frame_check_results.json`, so a future reader can re-run the
analysis against the same snapshot and reproduce the measurements
exactly.
The analytical goal is not to rank the models. Frame Check does not
rank. The goal is to show that the four LLMs, answering the same
question, imposed materially different structural frames on the
reader; and that the reader, seeing those frames named, can choose
rather than inherit.
Per-model structural signature, from the deterministic detectors:
| Model | Voice | Covers | Missing | Sourced | Frame matches |
|---|---|---|---|---|---|
| Claude | prescriptive | causes | risks, stakeholders, trends, uncertainty | 0% | FVS-008, FVS-001, FVS-007 |
| GPT-5 | prescriptive | risks, trends | causes, stakeholders, uncertainty | 0% | FVS-001, FVS-015 |
| Grok | advisory | risks, trends | causes, stakeholders, uncertainty | 7% | FVS-001, FVS-015 |
| Gemini | prescriptive | risks, stakeholders, trends, uncertainty | causes | 0% | FVS-012 |
A few observations the table does not carry:
named who is affected differently by the recommendation (high-risk-
tolerant vs low-risk, single vs family, dependents, retirement
horizon). Claude, GPT-5, and Grok treated the reader as an
abstract allocator.
what the answer depends on being true. The other three produce
specific allocation percentages (Claude says "money you can afford
to lose," GPT-5 says "1-3%," Grok says "5-10%") without naming
what would make those numbers wrong.
and "CB Insights data" once; the other three cite nothing. All
four make factual claims about Bitcoin's historical volatility
and future prospects.
claims, 0 hedged. The models state projections as facts.
The frame-library matcher flags five distinct entries across the
four responses:
only. Triggered by high density of growth-framed vocabulary
("retirement savings," "long-term returns," "proven track record,"
"compounding"). Claude's recommendation is against Bitcoin as
a core holding, but the detector notices that Claude builds its
case inside a growth frame (index funds, proven returns) rather
than from a risk-first frame. The teaching question the library
entry carries: "What would a risk analyst say about this same
data?"
of four responses. Triggered by the pattern of the model extending
the frame it opened with (if it opens with "here's why Bitcoin is
risky," everything that follows amplifies that). The library entry
asks whether increasing detail is evidence of quality or evidence
that the analysis is locked in one frame.
and Grok. Triggered by optimisation vocabulary (allocation
percentages, rebalancing, portfolio efficiency). The teaching
question: "Is efficiency the right lens here, or has it been
applied by default?"
Claude only. The detector flags this when a document asserts a
recommendation without addressing what would have to be true for
the recommendation to be wrong. The question: "What would have
to be true for this analysis to be wrong?" Claude's response is
the only one to trigger this, which is a detector artefact worth
naming: see "What the method missed" below.
Gemini only. Triggered because Gemini addresses uncertainty
substantively; the detector reads this as the uncertainty frame
being the active shape.
Frame Check's `frame_compare` tool produces a per-pair framing-
differences narrative. The six pairs across four models surface
the same underlying pattern: the detector sees the shared
stakeholders-and-uncertainty blind spot across most pairs, and
reports Grok's 7% sourcing as the single sourcing delta in the
set. Claude vs Gemini, for example, shows shared blind spots of
none (the two cover different categories but together span the
five-perspective set); GPT-5 vs Grok shows `causes, stakeholders,
uncertainty` as shared blind spots (both are operating in the same
narrow frame).
The pairwise dump is in `frame_check_results.json` under the
`_pairwise_A` key.
Three honest limits surfaced by this analysis specifically:
response contains the words "volatility risk," "no cash flows,"
"opportunity cost," and "psychological pressure to panic-sell."
A reader would describe Claude as risk-aware. The detector
reports "risks" as NOT covered because Claude's risk density
(5.2 per 1,000 words) falls below the threshold that triggers
the coverage flag. The detector is calibrated on longer documents
where nominal mention and substantive coverage can be
distinguished by density; on a short response the threshold can
miss real coverage. The honest reading: Claude did address risks,
and the detector's category flag is a coarse signal.
Claude's substantive recommendation is against Bitcoin as a core
retirement holding. The detector triggered FVS-008 not because
Claude is promoting Bitcoin (it is not) but because Claude's
reasoning runs through retirement-growth vocabulary (index funds,
401(k)s, proven returns). The frame is the lens used to reason,
not the verdict reached. A reader should read this as "Claude
evaluated Bitcoin from within the conventional long-term-growth
frame," not as "Claude is pushing Bitcoin."
the four responses produced claims that route to the Source
Network. "Bitcoin's price swings exceed stock market crashes"
is a claim about historical volatility a verifier could in
principle check; "Bitcoin will recover from 50% drawdowns"
cannot, because it is a forecast. The verification layer is
designed for empirical claims against authoritative sources;
forecast-laden investment advice lives outside its regime.
That is a scope limit of the tool, not a gap to fill.
One question. Four frontier LLMs. Four materially different
framing signatures. Same prescriptive register, different coverage
footprints, different frame matches, different levels of sourcing.
A reader who reads only one of the four responses inherits that
model's framing choice as the framing of the question. A reader
who sees the four measured side by side sees that the question
has at least four plausible framings, and that choosing between
them is work the reader has to do.
This is the sovereignty case in the plainest form. Frame Check
does not tell the reader which LLM is correct, which is balanced,
or which to trust. It names the structural shape of each response
and hands the reader the information they need to not inherit a
frame by default.
The pattern generalises. Any question that invites a prescriptive
answer (life decisions, policy, career, investment, relationships)
will produce LLM responses with different framing signatures. An
agent that runs `frame_check` on its own response, or on another
AI's response the user shares, surfaces those signatures at the
moment the reader is deciding what to do with them.
The four prompts, model IDs, and full response texts are in
`data/worked_examples/four-llms-on-bitcoin-retirement-2026/llm_responses.json`.
The per-model Frame Check results and the pairwise framing
comparisons are in `frame_check_results.json` in the same
directory. Both files are captured in the repository alongside
this writeup so a reader in 2028 can resolve "what exactly did
Gemini 2.5 Flash say on 2026-04-18 when asked about Bitcoin" to
the exact bytes.
The same prompt run on the same four models six months later
will produce different responses. That drift IS the research
signal: Frame Check's measurements are reproducible; LLM
responses are not.
Lucic, L. (2026). *Four LLMs, one investment question: a Frame
Check comparison of AI framing signatures*. Frame Check Worked
Examples.
frame.clarethium.com/corpus/worked-examples/four-llms-on-bitcoin-retirement-2026/
Licensed CC-BY-4.0. The LLM responses embedded in this analysis
are the outputs of third-party systems (Anthropic, OpenAI, xAI,
Google). Their reproduction here is for structural analysis and
falls under fair-use / fair-dealing provisions for research and
criticism. Only the Frame Check analysis is open-licensed.
Validation corpus entries that share distinguishing slug tokens with this worked example (heuristic match, not curator-declared). These corpus entries carry computed Frame Check profiles for similar source material; visit them to see the structured analysis alongside this narrative.