Frame Check

Rating-quality contrast

Three illustrative rating files showing what substantive, mediocre, and insufficient submissions look like for the SAME corpus document. The contrast is the teaching: same document, three rating styles, different validation value. These files are NOT real submissions; they live under validation/decision_readiness/examples/ and are ignored by the validation harness.

Annotated rating examples

Illustrative rating files showing what GOOD, MEDIOCRE, and INSUFFICIENT rater submissions look like for the SAME corpus document. The contrast is the teaching: same document, three rating styles, different validation value.

These are NOT real ratings. They are not consumed by `run_validation.py`; they live in `examples/` deliberately so the harness's `ratings/` directory only contains submissions from actual raters. Each example file's `rater_id` starts with `example-` so a future contributor copying-and-pasting would have to consciously rename it before submitting.

Why they exist

A new rater asks "what does a good rating actually look like?" The README's worked example shows ONE filled file. These three files show CONTRAST: same document, three quality levels, with notes explaining what makes the difference.

A pedagogue teaching from this material can use the three examples to anchor a class discussion ("which of these is more useful for the validation effort, and why?").

The three examples (rating the same corpus document)

All three rate `four-llms-bitcoin-claude` (Claude's response to the question "should I retire on Bitcoin"). The document is in the seeded corpus; raters can read it at `../corpus/four-llms-bitcoin-claude/document.md`.

`example-good.yaml`: what good looks like

dimension applies)

document (quoted phrases or numerical references)

The validation effort learns the most from this kind of rating. Divergence cases between Frame Check's profile and a good rater's scores are interpretable because the notes explain the rater's reasoning.

`example-mediocre.yaml`: common shortfall

ambiguities

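This contributes to per-dimension means but cannot be interpreted when its scores diverge from Frame Check's profile, because the notes give no reasoning. Even when the scores match, the validation effort cannot tell whether the rater agreed with Frame Check by accident or by analysis.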

`example-insufficient.yaml`: what to avoid

on a heavily interpretive document) instead get a guess

This degrades the validation: it adds noise to per-dimension means, undermines the intraclass correlation (ICC) between raters, and provides nothing for divergence analysis. The harness still ingests it; the results page should flag insufficient submissions if patterns emerge.
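The effect on per-dimension means can be seen with the scores from the three example files in this README. The sketch below is illustrative only: `per_dimension_means` is a hypothetical helper, and skipping null scores when averaging is an assumption, not a description of how `run_validation.py` actually aggregates.

```python
from statistics import mean

# Scores copied from the three example files reproduced in this README.
# None stands in for a YAML null (the rater declined to score the dimension).
EXAMPLES = {
    "example-good": {
        "coverage": 4, "calibration": 4, "evidence": 2,
        "robustness": None, "counterfactual": 4,
    },
    "example-mediocre": {
        "coverage": 3, "calibration": 4, "evidence": 2,
        "robustness": 3, "counterfactual": 3,
    },
    "example-insufficient": {
        "coverage": 5, "calibration": 5, "evidence": 1,
        "robustness": 1, "counterfactual": 5,
    },
}


def per_dimension_means(ratings_by_rater):
    """Average each dimension across raters, skipping null scores (assumed policy)."""
    by_dim = {}
    for scores in ratings_by_rater.values():
        for dim, score in scores.items():
            if score is not None:
                by_dim.setdefault(dim, []).append(score)
    return {dim: mean(values) for dim, values in by_dim.items()}


# Means with and without the insufficient submission.
without_insufficient = per_dimension_means(
    {rid: s for rid, s in EXAMPLES.items() if rid != "example-insufficient"}
)
with_all = per_dimension_means(EXAMPLES)

for dim in with_all:
    print(f"{dim}: {without_insufficient[dim]:.2f} -> {with_all[dim]:.2f}")
```

With only three raters, one careless file moves every dimension, and `robustness` moves the most because the good rater honestly left it null.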

How to use these for rater calibration

Before submitting their first real rating, a new rater can:

1. Read the `rater_guide.md` anchors
2. Read `walkthrough_four-llms-bitcoin-claude.md` (the profile-versus-rating walkthrough; reads the same document's `profile.json` dimension-by-dimension alongside `example-good`)
3. Read all three rating examples here
4. Try to articulate WHY each example illustrates its label
5. Open `../corpus/four-llms-bitcoin-claude/document.md` and produce their own rating
6. Compare their rating against `example-good.yaml`. Differences are themselves informative; there is no "correct" rating, but there are recognizable signs of substantive engagement.

A first-time rater whose own attempt is closer to `example-mediocre` than to `example-good` should re-read the specific dimension anchors that they scored most superficially.

A first-time rater whose scores match the profile on every dimension is over-trusting the automated layer; the walkthrough shows where the gap typically falls.
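The comparison against `example-good.yaml` can be done by eye, but a small script makes the differences explicit. The following is a minimal sketch, assuming the `ratings:` block of each file has already been loaded into a plain dict (with `None` for null); `compare_ratings` is a hypothetical helper and not part of the harness.

```python
def compare_ratings(own, reference):
    """Describe every dimension where two ratings differ.

    Both arguments map dimension name -> score (int) or None for null.
    Returns {dimension: description}; dimensions with equal scores are omitted.
    """
    diffs = {}
    for dim in sorted(set(own) | set(reference)):
        mine, ref = own.get(dim), reference.get(dim)
        if mine == ref:
            continue
        if mine is None or ref is None:
            # One rater scored the dimension and the other declined to.
            diffs[dim] = f"null mismatch (own={mine}, reference={ref})"
        else:
            diffs[dim] = f"{mine - ref:+d} relative to reference"
    return diffs


# Scores from example-good.yaml, used as the calibration reference.
example_good = {
    "coverage": 4, "calibration": 4, "evidence": 2,
    "robustness": None, "counterfactual": 4,
}
# Stand-in for a new rater's attempt; here, the scores from example-mediocre.yaml.
own_attempt = {
    "coverage": 3, "calibration": 4, "evidence": 2,
    "robustness": 3, "counterfactual": 3,
}

for dim, description in compare_ratings(own_attempt, example_good).items():
    print(f"{dim}: {description}")
```

A difference is a prompt to re-read the relevant anchor, not an error to correct. The null mismatch on `robustness` is the most informative kind of difference, because it shows where one rater guessed and the other declined to score.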

Why examples are NOT in `ratings/`

The `ratings/` directory is consumed by the validation harness. Mixing illustrative files with real submissions would:

agreement

files are real

Hence `examples/` is a sibling directory the harness ignores. The convention `rater_id: example-*` makes accidental copying into `ratings/` more visible (a real rater_id would not start with the literal string "example-").
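The `example-` prefix also makes a stray copy easy to catch in code. The sketch below shows one way such a check could look; `split_examples` is a hypothetical helper, and nothing here implies that `run_validation.py` performs this check today.

```python
EXAMPLE_PREFIX = "example-"  # the rater_id convention used by the example files


def split_examples(ratings):
    """Separate real submissions from illustrative ones by rater_id prefix.

    `ratings` is a list of dicts, each already parsed from one rating file.
    Returns (real, examples).
    """
    real, examples = [], []
    for rating in ratings:
        rater_id = str(rating.get("rater_id", ""))
        if rater_id.startswith(EXAMPLE_PREFIX):
            examples.append(rating)
        else:
            real.append(rating)
    return real, examples


# Hypothetical contents of ratings/ after an accidental copy.
loaded = [
    {"rater_id": "example-good", "doc_id": "four-llms-bitcoin-claude"},
    {"rater_id": "rater-017", "doc_id": "four-llms-bitcoin-claude"},  # made-up ID
]

real, examples = split_examples(loaded)
for stray in examples:
    print(f"warning: illustrative file found among ratings: {stray['rater_id']}")
```

Warning rather than silently dropping the file keeps the mistake visible to whoever copied it.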

The three files in full

The YAMLs are reproduced verbatim below. A prospective rater can read them, then read the source document at validation/decision_readiness/corpus/four-llms-bitcoin-claude/document.md and produce their own rating, comparing it against the good example to calibrate.

Good (example-good.yaml)

# Illustrative GOOD rating. NOT a real submission; this file
# lives in examples/ and is ignored by run_validation.py.
# See examples/README.md for what makes this rating "good."

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-good"
rated_at: "2026-04-19"

ratings:
  coverage: 4
  calibration: 4
  evidence: 2
  robustness: null
  counterfactual: 4

notes:
  coverage: |
    Addresses risks (volatility, concentration, regulatory)
    explicitly. Stakeholders implied (35-year-old, savers with
    diversified portfolios). Causes (why Bitcoin is volatile)
    are present but lighter than risks. Trends not addressed
    (no historical price reasoning, no market context). Coverage
    is broad with risks-heavy density.
  calibration: |
    Hedging is appropriate to a personal-decision context. "I'd
    lean toward" frames the recommendation as opinion, not
    prescription. Specific quantitative anchors ("50%+ swings,"
    "2-5% of portfolio") are stated as facts but are common-
    knowledge ranges in finance. Predictions ("could shift") are
    appropriately conditional.
  evidence: |
    Almost no source attribution. The 50% volatility figure and
    the 2-5% satellite allocation guidance are stated without
    citing specific data or studies. A reader following this
    advice has no provenance to check. Two stars rather than
    one because the reasoning is internally consistent and
    aligned with mainstream finance literature, even if the
    document does not cite it.
  robustness: |
    Null because the document has no checkable numerical claims
    in the strict sense. "50%+ swings" and "2-5% allocation" are
    range statements, not point claims. Robustness rating would
    require checking specific numbers against external sources;
    there are none to check.
  counterfactual: |
    The "Where it might make sense" section names the conditions
    under which the recommendation reverses. The closing question
    ("what does your current retirement savings look like") opens
    the conversation rather than closing it. This is genuine
    counterfactual engagement, not a token "limitations"
    paragraph.
  overall: |
    Strong calibration and counterfactual engagement carry an
    interpretive piece with thin sourcing. Suitable as ORIENTATION
    for a 35-year-old considering Bitcoin allocation; should not
    be treated as decision-grade without additional sourced
    research. Frame Check's "Confidence Imbalance" pattern would
    be a false positive on this document; the unhedged statements
    are appropriate to the genre and the recommendation itself
    is hedged.

time_spent_minutes: 22
secondary_genres: []
self_confidence: 4

Mediocre (example-mediocre.yaml)

# Illustrative MEDIOCRE rating. NOT a real submission. Numeric
# scores are present on every dimension but notes are generic
# and untied to specific observations. Useful as part of a
# per-dimension mean, near-useless for divergence interpretation.

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-mediocre"
rated_at: "2026-04-19"

ratings:
  coverage: 3
  calibration: 4
  evidence: 2
  robustness: 3
  counterfactual: 3

notes:
  coverage: "Decent coverage of risks. Could be better."
  calibration: "Reasonably well-hedged."
  evidence: "Lacks specific sources."
  robustness: "Hard to say, didn't check the numbers."
  counterfactual: "Mentions some alternatives."
  overall: "Average AI response. Useful but not authoritative."

time_spent_minutes: 8
secondary_genres: []
self_confidence: 4

Insufficient (example-insufficient.yaml)

# Illustrative INSUFFICIENT rating. NOT a real submission.
# Demonstrates the failure modes that degrade the validation:
# extreme scores without justification, empty notes, unrealistic
# time spent, guesses where null was the honest choice.

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-insufficient"
rated_at: "2026-04-19"

ratings:
  coverage: 5
  calibration: 5
  evidence: 1
  robustness: 1
  counterfactual: 5

notes:
  coverage: ""
  calibration: ""
  evidence: "bad"
  robustness: "no"
  counterfactual: ""
  overall: "good ai"

time_spent_minutes: 2
secondary_genres: []
self_confidence: 5
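
The failure modes named in the insufficient example (empty notes, extreme scores without justification, unrealistic time spent) can be screened for mechanically. This is a rough sketch only: `screen_rating` and both thresholds are hypothetical choices made for illustration, not rules the harness applies.

```python
DIMENSIONS = ("coverage", "calibration", "evidence", "robustness", "counterfactual")
MIN_MINUTES = 5      # hypothetical threshold for a plausible reading time
MIN_NOTE_WORDS = 2   # hypothetical threshold for a non-trivial note


def screen_rating(rating):
    """Return a list of human-readable flags for one parsed rating file."""
    flags = []
    notes = rating.get("notes", {})
    scores = rating.get("ratings", {})

    # Notes that are empty or a single word carry no reasoning.
    thin = [d for d in DIMENSIONS
            if len((notes.get(d) or "").split()) < MIN_NOTE_WORDS]
    if thin:
        flags.append("thin or empty notes: " + ", ".join(thin))

    # A 1 or a 5 with no supporting note cannot be interpreted.
    unjustified = [d for d in thin if scores.get(d) in (1, 5)]
    if unjustified:
        flags.append("extreme score without justification: " + ", ".join(unjustified))

    if rating.get("time_spent_minutes", 0) < MIN_MINUTES:
        flags.append("implausibly short time spent")
    return flags


# Values copied from example-insufficient.yaml and example-mediocre.yaml.
insufficient = {
    "ratings": {"coverage": 5, "calibration": 5, "evidence": 1,
                "robustness": 1, "counterfactual": 5},
    "notes": {"coverage": "", "calibration": "", "evidence": "bad",
              "robustness": "no", "counterfactual": ""},
    "time_spent_minutes": 2,
}
mediocre = {
    "ratings": {"coverage": 3, "calibration": 4, "evidence": 2,
                "robustness": 3, "counterfactual": 3},
    "notes": {"coverage": "Decent coverage of risks. Could be better.",
              "calibration": "Reasonably well-hedged.",
              "evidence": "Lacks specific sources.",
              "robustness": "Hard to say, didn't check the numbers.",
              "counterfactual": "Mentions some alternatives."},
    "time_spent_minutes": 8,
}

for flag in screen_rating(insufficient):
    print(flag)
```

A screen this crude catches the insufficient file but passes the mediocre one. Generic notes still need a human reader to tell them apart from substantive ones.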