Rating-quality contrast
Three illustrative rating files showing what substantive, mediocre, and insufficient submissions look like for the SAME corpus document. The contrast is the teaching: same document, three rating styles, different validation value. These files are NOT real submissions; they live under validation/decision_readiness/examples/ and are ignored by the validation harness.
Annotated rating examples
Illustrative rating files showing what GOOD, MEDIOCRE, and
INSUFFICIENT rater submissions look like for the SAME corpus
document. The contrast is the teaching: same document, three
rating styles, different validation value.
These are NOT real ratings. They are not consumed by
`run_validation.py`; they live in `examples/` deliberately so the
harness's `ratings/` directory only contains submissions from
actual raters. Each example file's `rater_id` starts with
`example-` so that a future contributor who copies one would have
to rename it deliberately before submitting.
Why they exist
A new rater asks "what does a good rating actually look like?"
The README's worked example shows ONE filled file. These three
files show CONTRAST: same document, three quality levels, with
notes explaining what makes the difference.
An instructor teaching from this material can use the three
examples to anchor a class discussion ("which of these is more
useful for the validation effort, and why?").
The three examples (rating the same corpus document)
All three rate `four-llms-bitcoin-claude` (Claude's response to
the question "should I retire on Bitcoin"). The document is in
the seeded corpus; raters can read it at
`../corpus/four-llms-bitcoin-claude/document.md`.
`example-good.yaml`: what good looks like
- Numeric ratings on every dimension (no `null`s when the
dimension applies)
- Notes per dimension that name SPECIFIC observations from the
document (quoted phrases or numerical references)
- Overall note synthesizes across dimensions
- Time spent reflects substantive engagement (15-30 minutes)
- Self-confidence honest about ambiguous dimensions
The validation effort learns the most from this kind of rating.
Divergence cases between Frame Check's profile and a good
rater's scores are interpretable because the notes explain the
rater's reasoning.
`example-mediocre.yaml`: common shortfall
- Numeric ratings on every dimension
- Notes are present but generic ("the document is unclear")
- No specific observations tied to the document
- Time spent is low (5-10 minutes)
- Self-confidence high, suggesting the rater did not notice
ambiguities
This contributes to per-dimension means but cannot be interpreted
when it diverges from Frame Check's profile. Even agreement is
ambiguous: the validation effort cannot tell whether the rater
agreed with Frame Check by accident or by analysis.
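To make "contributes to per-dimension means" concrete, here is a minimal sketch of averaging scores per dimension while skipping `null` entries. How `run_validation.py` actually aggregates is not shown in this README, so the function name `per_dimension_means` and the skip-nulls behaviour are assumptions. The scores are copied from the two example files purely for illustration; example files are never pooled with real ratings.

```python
from statistics import mean

DIMENSIONS = ["coverage", "calibration", "evidence", "robustness", "counterfactual"]


def per_dimension_means(ratings_list):
    """Average each dimension across raters, skipping null (None) scores.

    Hypothetical helper: the real harness may aggregate differently.
    """
    means = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings_list if r.get(dim) is not None]
        # A dimension nobody scored stays None rather than becoming 0.
        means[dim] = mean(scores) if scores else None
    return means


# Scores copied from example-good.yaml and example-mediocre.yaml.
good = {"coverage": 4, "calibration": 4, "evidence": 2,
        "robustness": None, "counterfactual": 4}
mediocre = {"coverage": 3, "calibration": 4, "evidence": 2,
            "robustness": 3, "counterfactual": 3}

means = per_dimension_means([good, mediocre])
for dim, value in means.items():
    print(dim, value)
```

The mediocre file moves the coverage mean as much as the good file does, and it is the only contributor to the robustness mean, yet its notes explain neither number.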
`example-insufficient.yaml`: what to avoid
- Some ratings are extreme (1 or 5) without justification
- Notes are empty strings or one word
- Time spent is unrealistically low (1-2 minutes)
- Dimensions where the rater should use `null` (e.g., evidence
on a heavily interpretive document) instead get a guess
This degrades the validation: it adds noise to per-dimension
means, undermines ICC, and provides nothing for divergence
analysis. The harness still ingests it; the results page should
flag insufficient submissions if patterns emerge.
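One way the flagging mentioned above could work is a simple heuristic over the signals listed for this example. This is a sketch, not the harness's behaviour: the function name `insufficiency_flags` and both thresholds (`min_minutes`, `max_trivial_words`) are hypothetical values chosen to match the bullets above, and the input is assumed to be a rating file already parsed into a dict.

```python
def insufficiency_flags(submission, min_minutes=5, max_trivial_words=1):
    """Return warning strings for one parsed rating file.

    Thresholds are illustrative assumptions, not project policy.
    """
    flags = []

    minutes = submission.get("time_spent_minutes")
    if minutes is not None and minutes < min_minutes:
        flags.append("time_spent_minutes is unrealistically low")

    notes = submission.get("notes", {})
    for dim, score in submission.get("ratings", {}).items():
        word_count = len((notes.get(dim) or "").split())
        if word_count <= max_trivial_words:
            flags.append(f"{dim}: note is empty or a single word")
            # An extreme score is only a problem when nothing justifies it.
            if score in (1, 5):
                flags.append(f"{dim}: extreme score without justification")
    return flags


# Values copied from example-insufficient.yaml.
insufficient = {
    "ratings": {"coverage": 5, "calibration": 5, "evidence": 1,
                "robustness": 1, "counterfactual": 5},
    "notes": {"coverage": "", "calibration": "", "evidence": "bad",
              "robustness": "no", "counterfactual": ""},
    "time_spent_minutes": 2,
}

flags = insufficiency_flags(insufficient)
for flag in flags:
    print(flag)
```

A heuristic like this cannot detect a guess made where `null` was the honest choice; that still requires a human reading the document.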
How to use these for rater calibration
Before submitting their first real rating, a new rater can:
1. Read the `rater_guide.md` anchors
2. Read `walkthrough_four-llms-bitcoin-claude.md` (the
profile-versus-rating walkthrough, which reads the same document's
`profile.json` dimension by dimension alongside `example-good.yaml`)
3. Read all three rating examples here
4. Try to articulate WHY each example illustrates its label
5. Open `../corpus/four-llms-bitcoin-claude/document.md` and
produce their own rating
6. Compare their rating against `example-good.yaml`. Differences
are themselves informative; there is no "correct" rating,
but there are recognizable signs of substantive engagement.
A first-time rater whose own attempt is closer to
`example-mediocre` than to `example-good` should re-read the
specific dimension anchors that they scored most superficially.
A first-time rater whose scores match the profile on every
dimension is over-trusting the automated layer; the walkthrough
shows where the gap typically falls.
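The comparison in step 6 can be done by hand, but a small sketch may help a rater see the gaps at a glance. The helper name `compare_to_reference` is hypothetical, and the "own" scores below are borrowed from `example-mediocre.yaml` as a stand-in for a first attempt. A difference is a prompt to re-read an anchor, not a mark against the rater.

```python
def compare_to_reference(own, reference):
    """Per-dimension difference (own minus reference).

    Returns None for a dimension where either side is null, because a
    null-versus-number disagreement is about applicability, not magnitude.
    """
    diffs = {}
    for dim, ref_score in reference.items():
        own_score = own.get(dim)
        if own_score is None or ref_score is None:
            diffs[dim] = None
        else:
            diffs[dim] = own_score - ref_score
    return diffs


# Reference scores copied from example-good.yaml.
reference = {"coverage": 4, "calibration": 4, "evidence": 2,
             "robustness": None, "counterfactual": 4}
# Stand-in for a first attempt: scores copied from example-mediocre.yaml.
own = {"coverage": 3, "calibration": 4, "evidence": 2,
       "robustness": 3, "counterfactual": 3}

diffs = compare_to_reference(own, reference)
for dim, delta in diffs.items():
    label = "null on one side, compare the notes" if delta is None else f"{delta:+d}"
    print(f"{dim}: {label}")
```

Here robustness is the dimension worth a second look: the reference used `null` and explained why, while the stand-in attempt supplied a number without checking anything.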
Why examples are NOT in `ratings/`
The `ratings/` directory is consumed by the validation harness.
Mixing illustrative files with real submissions would:
- Skew per-document means with fictional ratings
- Inflate ICC by counting illustrative agreement as inter-rater
agreement
- Make it impossible to tell from the directory contents which
files are real
Hence `examples/` is a sibling directory the harness ignores.
The convention `rater_id: example-*` makes accidental copying
into `ratings/` more visible (a real rater_id would not start
with the literal string "example-").
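The `example-` prefix also makes a mechanical check possible. Whether `run_validation.py` performs one is not stated here, so the following is only a sketch of what such a guard could look like; `split_real_and_example` and the rater id `rater-017` are hypothetical, and submissions are assumed to be already parsed into dicts.

```python
EXAMPLE_PREFIX = "example-"


def split_real_and_example(submissions):
    """Separate real submissions from illustrative ones by rater_id prefix."""
    real, rejected = [], []
    for sub in submissions:
        rater_id = str(sub.get("rater_id", ""))
        if rater_id.startswith(EXAMPLE_PREFIX):
            rejected.append(sub)
        else:
            real.append(sub)
    return real, rejected


submissions = [
    {"rater_id": "example-good", "doc_id": "four-llms-bitcoin-claude"},
    {"rater_id": "rater-017", "doc_id": "four-llms-bitcoin-claude"},  # hypothetical id
]

real, rejected = split_real_and_example(submissions)
print("kept:", [s["rater_id"] for s in real])
print("rejected:", [s["rater_id"] for s in rejected])
```

A check like this is a backstop rather than a replacement for the directory separation: it catches a file copied into `ratings/` unchanged, but not one whose `rater_id` was renamed while the fictional scores were kept.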
The three files in full
The YAML files are reproduced verbatim below. A prospective rater can read them, read the source document at `validation/decision_readiness/corpus/four-llms-bitcoin-claude/document.md`, produce their own rating, and compare it against the good example to calibrate.
Good (example-good.yaml)
# Illustrative GOOD rating. NOT a real submission; this file
# lives in examples/ and is ignored by run_validation.py.
# See examples/README.md for what makes this rating "good."
doc_id: "four-llms-bitcoin-claude"
rater_id: "example-good"
rated_at: "2026-04-19"
ratings:
  coverage: 4
  calibration: 4
  evidence: 2
  robustness: null
  counterfactual: 4
notes:
  coverage: |
    Addresses risks (volatility, concentration, regulatory)
    explicitly. Stakeholders implied (35-year-old, savers with
    diversified portfolios). Causes (why Bitcoin is volatile)
    are present but lighter than risks. Trends not addressed
    (no historical price reasoning, no market context). Coverage
    is broad with risks-heavy density.
  calibration: |
    Hedging is appropriate to a personal-decision context. "I'd
    lean toward" frames the recommendation as opinion, not
    prescription. Specific quantitative anchors ("50%+ swings,"
    "2-5% of portfolio") are stated as facts but are common-
    knowledge ranges in finance. Predictions ("could shift") are
    appropriately conditional.
  evidence: |
    Almost no source attribution. The 50% volatility figure and
    the 2-5% satellite allocation guidance are stated without
    citing specific data or studies. A reader following this
    advice has no provenance to check. Two stars rather than
    one because the reasoning is internally consistent and
    aligned with mainstream finance literature, even if the
    document does not cite it.
  robustness: |
    Null because the document has no checkable numerical claims
    in the strict sense. "50%+ swings" and "2-5% allocation" are
    range statements, not point claims. Robustness rating would
    require checking specific numbers against external sources;
    there are none to check.
  counterfactual: |
    The "Where it might make sense" section names the conditions
    under which the recommendation reverses. The closing question
    ("what does your current retirement savings look like") opens
    the conversation rather than closing it. This is genuine
    counterfactual engagement, not a token "limitations"
    paragraph.
  overall: |
    Strong calibration and counterfactual engagement carry an
    interpretive piece with thin sourcing. Suitable as ORIENTATION
    for a 35-year-old considering Bitcoin allocation; should not
    be treated as decision-grade without additional sourced
    research. Frame Check's "Confidence Imbalance" pattern would
    be a false positive on this document; the unhedged statements
    are appropriate to the genre and the recommendation itself
    is hedged.
time_spent_minutes: 22
secondary_genres: []
self_confidence: 4
Mediocre (example-mediocre.yaml)
# Illustrative MEDIOCRE rating. NOT a real submission. Numeric
# scores are present on every dimension but notes are generic
# and untied to specific observations. Useful as part of a
# per-dimension mean, near-useless for divergence interpretation.
doc_id: "four-llms-bitcoin-claude"
rater_id: "example-mediocre"
rated_at: "2026-04-19"
ratings:
  coverage: 3
  calibration: 4
  evidence: 2
  robustness: 3
  counterfactual: 3
notes:
  coverage: "Decent coverage of risks. Could be better."
  calibration: "Reasonably well-hedged."
  evidence: "Lacks specific sources."
  robustness: "Hard to say, didn't check the numbers."
  counterfactual: "Mentions some alternatives."
  overall: "Average AI response. Useful but not authoritative."
time_spent_minutes: 8
secondary_genres: []
self_confidence: 4
Insufficient (example-insufficient.yaml)
# Illustrative INSUFFICIENT rating. NOT a real submission.
# Demonstrates the failure modes that degrade the validation:
# extreme scores without justification, empty notes, unrealistic
# time spent, guesses where null was the honest choice.
doc_id: "four-llms-bitcoin-claude"
rater_id: "example-insufficient"
rated_at: "2026-04-19"
ratings:
  coverage: 5
  calibration: 5
  evidence: 1
  robustness: 1
  counterfactual: 5
notes:
  coverage: ""
  calibration: ""
  evidence: "bad"
  robustness: "no"
  counterfactual: ""
  overall: "good ai"
time_spent_minutes: 2
secondary_genres: []
self_confidence: 5