Frame Check

Annotated interpretation: Claude vs Grok on the bitcoin retirement question

This is an illustrative reading of

`corpus/four-llms-bitcoin-claude/peer_with_four-llms-bitcoin-grok.json`.

The comparison is computed automatically; this file teaches how

to READ it without overgeneralizing.

The pair

responding to the prompt "should I retire on Bitcoin"

responding to the same prompt

Both are independent peer responses; neither is a transformation

of the other. The comparison is non-directional. Naming WHICH

model has more of something is a description of the two specific

responses, not a model verdict.

Reading the comparison field by field

Coverage: Claude 1/5 vs Grok 2/5

Claude addresses ONE analytical perspective (causes). Grok

addresses TWO (risks, trends). Their addressed sets overlap on

nothing.

Substantive read for THIS pair: the two models structured

their responses around different perspectives. Claude leaned

into WHY (causes — explaining what makes Bitcoin retirement

risky); Grok leaned into WHAT IF (risks) and WHAT'S CHANGING

(trends). Different framings of the same question.

What this comparison cannot conclude: "Claude has narrower

coverage than Grok in general." This is one prompt, one

generation, one comparison. Coverage might invert on the

startup-offer question, or on a different bitcoin prompt, or

with different sampling parameters. The aggregate findings page

shows that across 12 peer pairs, coverage differs in 10 of 12.

The pattern is corpus-level; individual-pair attribution is not.

Calibration: Claude hedge ratio 0.00 (4 claims) vs Grok 0.20 (15 claims)

Claude makes 4 numerical claims, none hedged. Grok makes 15

numerical claims, 3 hedged (20%).

Substantive read: Grok wrote a longer response with more

numerical specifics AND more hedging. Claude wrote a shorter

response with fewer specifics and no hedging.

This is interestingly ambiguous. A reader might conclude:

calibrate)

The peer comparison surfaces the structural difference. It does

not tell you which response is BETTER. That depends on what the

reader needs from the response.

Evidence: sentence-attribution Claude 0% vs Grok 7%

Neither response cites external sources for numerical claims

(verification ratio is null on both — the corpus profiles were

computed without Source Network). The difference here is in

SENTENCE-LEVEL ATTRIBUTION: Grok's response carries some

attribution language ("according to," "studies show," etc.) at

7% of sentences; Claude's response has none.

Substantive read: Grok is more ATTRIBUTED than Claude in

this response. Whether the attributions Grok added are

faithful to real sources or invented is a different question

that Frame Check's structural analysis cannot answer. The

comparison is the start of an investigation, not the end.

Important caveat: because BOTH peers have null verification

ratios, the evidence dimension here is comparing the SECONDARY

signal (sentence-attribution percentage), not the PRIMARY

signal (numerical-claim verification ratio). This is honestly

disclosed in the comparison_text but a reader skimming the

narrative might miss it.

Robustness: neither peer had claims checked against external sources

This dimension is non-comparable for THIS PAIR (and for ALL

pairs in the v1 corpus — see the aggregate findings: robustness

is non-comparable in 12 of 12 peer pairs). The corpus profiles

were computed offline without Source Network.

Substantive read: robustness is a corpus gap, not a finding.

A future Phase 2.5 corpus refresh with Source Network enabled

would fill this in. A researcher should NOT report "robustness

agrees" for any peer pair in v1; they should report "robustness

not measured."

Counterfactual: only Grok engages with counterfactual thinking; Claude does not

Grok's response includes uncertainty markers and engages with

how the recommendation might be wrong (failure-framing

present). Claude's response does not (FVS-007 fires for Claude;

not for Grok).

Substantive read for THIS pair: Grok's response is more

counterfactually engaged than Claude's.

Aggregate connection (the outlier signal). The aggregate

findings page

(`results/{date}-{corpus_hash}/aggregate.md`) computes per-

group median-distance outlier identification: for each peer

group, the member whose signal value is most different from

the group median on a given dimension is the outlier on that

dimension in that group. The table reports Claude as the

counterfactual outlier in 2 of 2 peer groups it appears in.

This is the methodologically correct signal: Claude's

counterfactual signal_value (False) is the outlier from the

group median (most members engage; Claude does not) in BOTH

the bitcoin and startup peer groups. The per-LLM PARTICIPATION

count would have been the wrong signal — participation just

means "Claude was in N differing pairs," which doesn't

distinguish whose value is the outlier.

This is a corpus-level signal, not a model-level claim. It

holds in 2 of 2 Claude-containing peer groups (N=2). Whether

Claude is systematically less counterfactually engaged across

DIFFERENT questions and DIFFERENT corpora is a separate

question; the aggregate findings document the per-group

finding honestly with explicit N.

Synthesizing the comparison

How Claude and Grok differ structurally on the bitcoin

retirement question:

1. Coverage divergence: Claude leans causes; Grok leans

risks + trends. Neither is wrong; both are valid framings

of the same question.

2. Calibration divergence: Grok has more numerical claims

AND more hedges; Claude has fewer of both. Different

rhetorical strategies; structural comparison cannot pick a

winner.

3. Evidence divergence: Grok carries some sentence-level

attribution; Claude has none. Grok's attributions warrant

independent verification; Claude's complete absence is a

gap.

4. Counterfactual divergence: Grok engages with

counterfactual thinking; Claude does not. The aggregate

findings identify Claude as the counterfactual OUTLIER (via

median-distance from the group) in 2 of 2 peer groups it

appears in.

5. Robustness: not measured.

What this comparison does NOT tell us

needs. Frame Check's structural analysis is silent on quality.

comparison is structural, not semantic.

prompt, one generation. Cross-question and cross-prompt

generalization needs more pairs.

with sampling temperature 0 — single-shot comparisons are not

controlled experiments.

What a writeup of this comparison would say

On the bitcoin retirement question, Claude and Grok produce structurally different responses across four of five decision-readiness dimensions. Claude's response is shorter (4 numerical claims vs Grok's 15), structured around causes, and unhedged; Grok's response is longer, structured around risks and trends, with 20% of claims hedged. Grok engages with counterfactual thinking; Claude does not. The aggregate findings identify Claude as the counterfactual outlier (median-distance method) in 2 of 2 peer groups it appears in. Evidence dimension comparison reflects sentence-level attribution only (verification ratios are non-comparable across the v1 corpus); Grok's 7% sentence-attribution warrants independent verification.

>

Neither response should be treated as more decision-ready than the other on the structural signals alone; the comparison surfaces structural differences that a reader can use as a starting point for their own judgment.

>

Cited from the Frame Check decision-readiness corpus, revision `70e2a95a9d1f`.