Rating-quality contrast

Three illustrative rating files show what substantive, mediocre, and insufficient submissions look like for the SAME corpus document. The contrast is the lesson: one document, three rating styles, three different levels of validation value. These files are NOT real submissions; they live under `validation/decision_readiness/examples/` and are ignored by the validation harness.

Annotated rating examples

Why they exist

The three examples (rating the same corpus document)

`example-good.yaml`: what good looks like

`example-mediocre.yaml`: common shortfall

`example-insufficient.yaml`: what to avoid

How to use these for rater calibration

Why examples are NOT in `ratings/`

The three files in full

The YAMLs are reproduced verbatim below. To calibrate, a prospective rater can read them, read the source document at `validation/decision_readiness/corpus/four-llms-bitcoin-claude/document.md`, produce their own rating, and then compare it against the good example.
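
The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not part of the validation harness: the helper name `compare_to_reference` and the report labels are hypothetical, and the scores are typed in as plain dicts (with `None` standing in for YAML `null`) so the sketch needs no YAML parser.

```python
# Hypothetical calibration helper; not part of run_validation.py.
DIMENSIONS = ("coverage", "calibration", "evidence", "robustness", "counterfactual")

# Scores copied from example-good.yaml; None stands in for YAML null.
GOOD_EXAMPLE = {
    "coverage": 4,
    "calibration": 4,
    "evidence": 2,
    "robustness": None,
    "counterfactual": 4,
}


def compare_to_reference(own, reference=GOOD_EXAMPLE):
    """Return, per dimension, the signed gap (own - reference) or a null label."""
    report = {}
    for dim in DIMENSIONS:
        mine, ref = own.get(dim), reference.get(dim)
        if mine is None and ref is None:
            report[dim] = "both null"
        elif mine is None or ref is None:
            # One rater scored the dimension and the other declined to.
            report[dim] = "null mismatch"
        else:
            report[dim] = mine - ref
    return report


if __name__ == "__main__":
    # Scores copied from example-mediocre.yaml, used as a stand-in for a new rater.
    own_ratings = {
        "coverage": 3,
        "calibration": 4,
        "evidence": 2,
        "robustness": 3,
        "counterfactual": 3,
    }
    for dim, gap in compare_to_reference(own_ratings).items():
        print(f"{dim}: {gap}")
```

A numeric gap is only a prompt to reread the notes for that dimension; a null mismatch on `robustness` is worth a closer look, because the good example explains why it declined to score that dimension.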

Good (`example-good.yaml`)

# Illustrative GOOD rating. NOT a real submission; this file
# lives in examples/ and is ignored by run_validation.py.
# See examples/README.md for what makes this rating "good."

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-good"
rated_at: "2026-04-19"

ratings:
  coverage: 4
  calibration: 4
  evidence: 2
  robustness: null
  counterfactual: 4

notes:
  coverage: |
    Addresses risks (volatility, concentration, regulatory)
    explicitly. Stakeholders implied (35-year-old, savers with
    diversified portfolios). Causes (why Bitcoin is volatile)
    are present but lighter than risks. Trends not addressed
    (no historical price reasoning, no market context). Coverage
    is broad with risks-heavy density.
  calibration: |
    Hedging is appropriate to a personal-decision context. "I'd
    lean toward" frames the recommendation as opinion, not
    prescription. Specific quantitative anchors ("50%+ swings,"
    "2-5% of portfolio") are stated as facts but are common-
    knowledge ranges in finance. Predictions ("could shift") are
    appropriately conditional.
  evidence: |
    Almost no source attribution. The 50% volatility figure and
    the 2-5% satellite allocation guidance are stated without
    citing specific data or studies. A reader following this
    advice has no provenance to check. Two stars rather than
    one because the reasoning is internally consistent and
    aligned with mainstream finance literature, even if the
    document does not cite it.
  robustness: |
    Null because the document has no checkable numerical claims
    in the strict sense. "50%+ swings" and "2-5% allocation" are
    range statements, not point claims. Robustness rating would
    require checking specific numbers against external sources;
    there are none to check.
  counterfactual: |
    The "Where it might make sense" section names the conditions
    under which the recommendation reverses. The closing question
    ("what does your current retirement savings look like") opens
    the conversation rather than closing it. This is genuine
    counterfactual engagement, not a token "limitations"
    paragraph.
  overall: |
    Strong calibration and counterfactual engagement carry an
    interpretive piece with thin sourcing. Suitable as ORIENTATION
    for a 35-year-old considering Bitcoin allocation; should not
    be treated as decision-grade without additional sourced
    research. Frame Check's "Confidence Imbalance" pattern would
    be a false positive on this document; the unhedged statements
    are appropriate to the genre and the recommendation itself
    is hedged.

time_spent_minutes: 22
secondary_genres: []
self_confidence: 4

Mediocre (`example-mediocre.yaml`)

# Illustrative MEDIOCRE rating. NOT a real submission. Numeric
# scores are present on every dimension but notes are generic
# and untied to specific observations. Useful as part of a
# per-dimension mean, near-useless for divergence interpretation.

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-mediocre"
rated_at: "2026-04-19"

ratings:
  coverage: 3
  calibration: 4
  evidence: 2
  robustness: 3
  counterfactual: 3

notes:
  coverage: "Decent coverage of risks. Could be better."
  calibration: "Reasonably well-hedged."
  evidence: "Lacks specific sources."
  robustness: "Hard to say, didn't check the numbers."
  counterfactual: "Mentions some alternatives."
  overall: "Average AI response. Useful but not authoritative."

time_spent_minutes: 8
secondary_genres: []
self_confidence: 4
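
The header comment's point that this rating is still "useful as part of a per-dimension mean" can be illustrated with a small sketch. The function `per_dimension_mean` is hypothetical, and skipping null scores is an assumption about how an aggregate might treat them; the actual harness may aggregate differently. The three example files are used purely as stand-in data, since the harness ignores them.

```python
# Hypothetical aggregation sketch; the real harness may aggregate differently.
DIMENSIONS = ("coverage", "calibration", "evidence", "robustness", "counterfactual")

# Scores copied from the three example files; None stands in for YAML null.
EXAMPLE_RATINGS = {
    "example-good": {"coverage": 4, "calibration": 4, "evidence": 2,
                     "robustness": None, "counterfactual": 4},
    "example-mediocre": {"coverage": 3, "calibration": 4, "evidence": 2,
                         "robustness": 3, "counterfactual": 3},
    "example-insufficient": {"coverage": 5, "calibration": 5, "evidence": 1,
                             "robustness": 1, "counterfactual": 5},
}


def per_dimension_mean(ratings_by_rater):
    """Average each dimension over the raters who scored it, skipping nulls."""
    means = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings_by_rater.values() if r.get(dim) is not None]
        # Assumption: a dimension nobody scored has no mean rather than a zero.
        means[dim] = sum(scores) / len(scores) if scores else None
    return means


if __name__ == "__main__":
    for dim, value in per_dimension_mean(EXAMPLE_RATINGS).items():
        print(f"{dim}: {value}")
```

The mediocre scores contribute to each mean just as the good scores do. What they cannot do is explain a disagreement: when two raters diverge on a dimension, one-line generic notes give nothing to interpret.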

Insufficient (`example-insufficient.yaml`)

# Illustrative INSUFFICIENT rating. NOT a real submission.
# Demonstrates the failure modes that degrade the validation:
# extreme scores without justification, empty notes, unrealistic
# time spent, guesses where null was the honest choice.

doc_id: "four-llms-bitcoin-claude"
rater_id: "example-insufficient"
rated_at: "2026-04-19"

ratings:
  coverage: 5
  calibration: 5
  evidence: 1
  robustness: 1
  counterfactual: 5

notes:
  coverage: ""
  calibration: ""
  evidence: "bad"
  robustness: "no"
  counterfactual: ""
  overall: "good ai"

time_spent_minutes: 2
secondary_genres: []
self_confidence: 5
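
Most of the failure modes named in the header comment of `example-insufficient.yaml` are mechanical enough to check in code. The sketch below is a hypothetical pre-submission check, not something the harness is known to run: the function name `flag_failure_modes`, the thresholds `min_minutes` and `min_note_chars`, and the treatment of 1 and 5 as the extreme scores are all assumptions chosen for illustration. It does not try to detect a guess where null was the honest choice, which needs human judgment.

```python
# Hypothetical quality check; thresholds are illustrative assumptions.
INSUFFICIENT_EXAMPLE = {
    "ratings": {"coverage": 5, "calibration": 5, "evidence": 1,
                "robustness": 1, "counterfactual": 5},
    "notes": {"coverage": "", "calibration": "", "evidence": "bad",
              "robustness": "no", "counterfactual": "", "overall": "good ai"},
    "time_spent_minutes": 2,
}


def flag_failure_modes(rating, min_minutes=5, min_note_chars=40):
    """Return a list of human-readable warnings for one parsed rating file."""
    flags = []
    notes = rating.get("notes", {})
    for dim, score in rating["ratings"].items():
        note = (notes.get(dim) or "").strip()
        if not note:
            flags.append(f"{dim}: empty note")
        elif len(note) < min_note_chars:
            flags.append(f"{dim}: note too short to justify score")
        # Assumes a 1-5 scale, so 1 and 5 are the extremes.
        if score in (1, 5) and len(note) < min_note_chars:
            flags.append(f"{dim}: extreme score without justification")
    if rating["time_spent_minutes"] < min_minutes:
        flags.append("time_spent_minutes: unrealistically low")
    return flags


if __name__ == "__main__":
    for flag in flag_failure_modes(INSUFFICIENT_EXAMPLE):
        print(flag)
```

Note length is a crude proxy. It catches empty and one-word notes, but a long generic note would pass, so a check like this complements reading the notes rather than replacing it.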