Frame Check's existing measurements (perspective coverage, claim calibration, source verification, contradictions, named structural patterns) are signals of how WELL an analysis supports a decision. The decision-readiness profile composes these signals into a multi-dimensional read of decision support, with explicit acknowledgment of what each dimension measures and what it does not.
The profile is structural, not predictive. It does not claim to predict whether a decision based on the analysis will turn out well. It claims to surface signals that a thoughtful reader would already use intuitively to judge an analysis's adequacy. Outcome quality and decision quality are different constructs; this profile addresses the second.
Question: Does the analysis address the perspectives
that matter for the decision?
Signal: Of five analytical dimensions (causes,
risks, stakeholders, trends, uncertainty), how many are addressed,
and how balanced are their coverage densities?
Why it matters for decisions: Decisions made on
analyses that omit relevant perspectives are systematically
biased. The Completeness Illusion (FVS-010) names the failure
mode: presence of multiple perspectives at very different depths
reads as "comprehensive" but is not.
What this signal does NOT capture: Whether the
ADDRESSED perspectives are relevant to the SPECIFIC decision in
question. A document covering all five general dimensions might
miss the dimension that matters most for the user's particular
choice.
Related library entries: FVS-001 Frame Amplification (single-frame amplification narrows perspectives), FVS-008 Growth Frame (growth-dominant narrows risks), FVS-009 Risk Frame (strengthens risks dimension), FVS-010 Completeness Illusion (the canonical balance-deficit pattern), FVS-011 Stakeholder Frame (strengthens stakeholders dimension), FVS-014 Temporal Anchoring (single temporal frame narrows), FVS-015 Efficiency Frame (efficiency-dominant narrows), FVS-017 False Balance (broad on count but misleading on weighting).
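The count-plus-balance signal described above can be illustrated with a minimal sketch. It assumes each dimension has a non-negative coverage density and uses normalized Shannon entropy as the balance measure. The entropy choice, the function name coverage_signal, and the sample densities are illustrative assumptions, not Frame Check's actual computation.

```python
import math

DIMENSIONS = ("causes", "risks", "stakeholders", "trends", "uncertainty")


def coverage_signal(densities):
    """Return (addressed_count, balance) for per-dimension coverage densities.

    densities maps a dimension name to a non-negative density (hypothetical
    unit, e.g. share of sentences touching that dimension).
    balance is normalized Shannon entropy over the five dimensions:
    1.0 means perfectly even coverage, values near 0 mean one dimension dominates.
    """
    values = [max(0.0, float(densities.get(d, 0.0))) for d in DIMENSIONS]
    addressed = sum(1 for v in values if v > 0)
    total = sum(values)
    if total == 0:
        return 0, 0.0
    shares = [v / total for v in values if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return addressed, entropy / math.log(len(DIMENSIONS))


# Hypothetical documents: both address all five dimensions,
# but the second does so at very different depths.
even = coverage_signal({d: 0.2 for d in DIMENSIONS})
lopsided = coverage_signal({
    "causes": 0.9, "risks": 0.04, "stakeholders": 0.03,
    "trends": 0.02, "uncertainty": 0.01,
})
print(even, lopsided)
```

Both sample documents report five addressed dimensions, yet the second has a much lower balance value. That gap is the shape of the Completeness Illusion: the count alone would read as comprehensive.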
Question: Are the claims appropriately hedged
given their epistemic status?
Signal: Hedge-word ratio across claims; count of
claims stated as fact when they are predictions or speculations.
Why it matters for decisions: Overconfident
assertions about uncertain matters lead to under-weighting of
contrary evidence. Calibration failures are a documented driver
of poor decisions.
What this signal does NOT capture: Sophisticated
implicit qualifiers ("in a typical case," "all else equal") that
do not contain hedge tokens. The hedge-word proxy understates
calibration in carefully written analytical prose.
Related library entries: FVS-012 Uncertainty Frame (presence of this frame is the qualitative correlate of calibrated hedging; absence is the structural overconfidence signal), FVS-017 False Balance (calibration failures in opposite directions: overconfidence in lower-supported perspectives, under-confidence in better-supported ones).
Open detector-vs-library inconsistency: the
detector code in domain_baselines.py emits a
calibration-related pattern labeled "Confidence Imbalance" with
ID FVS-002, but the library entry FVS-002 is
Fluency-Quality Illusion,
a meta-side frame about reader evaluation rather than a
structural calibration signal. The methodology page does not
cite FVS-002 as the canonical library entry for Calibration
because of this mismatch; FVS-012 is cited instead. Resolution
requires curator-level naming work in the detector code OR a
new library entry for the Confidence Imbalance pattern under
its own FVS-ID.
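A minimal sketch of the hedge-word proxy, and of the blind spot named above, might look like the following. The token list, the function name hedge_ratio, and the reading of "ratio" as the share of claims containing at least one hedge token are assumptions for illustration; the detector's actual lexicon and formula are not given here.

```python
import re

# Hypothetical hedge lexicon; the real detector's token list is not shown in this page.
HEDGE_TOKENS = {
    "may", "might", "could", "likely", "possibly", "perhaps",
    "suggests", "appears", "approximately", "roughly",
}


def hedge_ratio(claims):
    """Share of claims that contain at least one hedge token."""
    if not claims:
        return 0.0
    hedged = 0
    for claim in claims:
        words = set(re.findall(r"[a-z']+", claim.lower()))
        if words & HEDGE_TOKENS:
            hedged += 1
    return hedged / len(claims)


claims = [
    "Revenue will double next year.",                     # prediction stated as fact
    "Revenue may double next year.",                      # explicitly hedged
    "In a typical case, revenue doubles within a year.",  # implicit qualifier, no hedge token
]
ratio = hedge_ratio(claims)
print(ratio)
```

The third claim is qualified ("in a typical case") but contains no hedge token, so the proxy counts it as unhedged. This is the understatement the section describes for carefully written analytical prose.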
Question: Are the claims supported by sources, or
floating assertions?
Signal: Numerical claims successfully matched
against authoritative providers (SEC EDGAR, World Bank, Wikipedia,
etc.), weighted by the provider's calibrated F1. Plus the share of
sentences attributed to a source.
Why it matters for decisions: Decisions based on
unsourced numerical claims are decisions based on assertion. Source
verification surfaces the share of the analysis that can be checked
against external ground truth.
What this signal does NOT capture: Whether the
sources themselves are correct, or whether the analysis interprets
sourced data correctly. Source-agreement is a necessary but
insufficient condition for evidence quality.
Related library entries: FVS-016 Authority by Citation (citation-shaped language without verifiable sources is a false-positive signal on this dimension; the Source Network calibration corpus is the direct corrective).
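One way to read "weighted by the provider's calibrated F1" is that each matched claim earns credit equal to the F1 of the provider that matched it, while unmatched claims earn nothing. The sketch below follows that reading. The F1 values, provider keys, and the function name weighted_verified_share are hypothetical; the page does not state the actual weighting formula.

```python
def weighted_verified_share(matches, provider_f1):
    """F1-weighted share of numerical claims matched against a provider.

    matches holds one entry per numerical claim: the provider key that
    matched it, or None if no provider matched.
    """
    if not matches:
        return 0.0
    credit = sum(provider_f1.get(p, 0.0) for p in matches if p is not None)
    return credit / len(matches)


# Hypothetical calibrated F1 per provider (illustrative numbers only).
PROVIDER_F1 = {"sec_edgar": 0.9, "world_bank": 0.8, "wikipedia": 0.6}

# Four numerical claims: two matched, two floating assertions.
matches = ["sec_edgar", None, "wikipedia", None]
share = weighted_verified_share(matches, PROVIDER_F1)
print(share)
```

Under this reading a match against a lower-F1 provider moves the signal less than a match against a higher-F1 one. The signal still says nothing about whether the source itself is correct.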
Question: Does the analysis hold up under scrutiny
from the available evidence?
Signal: Count of contradicted or disputed claims
where Frame Check's verifier found a source value at odds with the
document. Cross-model consensus where applicable.
Why it matters for decisions: A single contradicted
claim can invalidate a chain of reasoning. Robustness asks how many
load-bearing claims survive contact with external sources.
What this signal does NOT capture: Logical
robustness (whether the argument structure holds), only evidentiary
robustness for numerical claims. A logically incoherent argument
with verified numbers would still register as robust on this
dimension.
Related library entries: FVS-016 Authority by Citation (when load-bearing claims rely on fabricated citations, robustness fails the moment a reader checks).
Question: Does the analysis name what would
falsify it, or what alternatives it has considered?
Signal: Presence of failure-framing markers
(FVS-007 fires when limitations / risks / counter-arguments are
absent in a domain that should address them). Presence of
uncertainty dimension markers.
Why it matters for decisions: Confirmation bias is
the most-documented decision pathology. An analysis that does not
consider what could be wrong is an analysis that has made it harder
to update on new evidence.
What this signal does NOT capture: Whether the
counterfactuals NAMED are the strongest counterfactuals available.
A pro-forma "limitations" paragraph satisfies the proxy without
genuine counterfactual reasoning. The current backend computation
(Phase 1.5) emits this dimension as a boolean composite; expert
ratings are 1-5 ordinal. Spearman correlation handles the
mismatch but loses information; Phase 2 may revise the backend
to a continuous 0-1 score if validation reveals the boolean
treatment is too coarse to discriminate.
Related library entries: FVS-001 Frame Amplification (sophisticated single-frame amplification suppresses counterfactual reasoning), FVS-007 Failure Framing (the canonical absence pattern; the methodology page's structural proxy is grounded here), FVS-009 Risk Frame (presence supports counterfactual engagement), FVS-012 Uncertainty Frame (presence surfaces alternative interpretations), FVS-014 Temporal Anchoring (over-anchoring to one time orientation hides the disconfirmations the others would surface).
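The information loss from correlating a boolean signal with 1-5 ordinal ratings can be seen in a small worked example. The sketch below implements Spearman correlation with average ranks for ties in plain Python. The six expert ratings and both signal vectors are invented for illustration and are not validation data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical data: six documents rated by experts on the 1-5 scale.
expert = [1, 2, 3, 4, 5, 5]
boolean_signal = [0, 0, 0, 1, 1, 1]                  # current boolean composite
continuous_signal = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95]  # possible 0-1 score

rho_bool = spearman(boolean_signal, expert)
rho_cont = spearman(continuous_signal, expert)
print(rho_bool, rho_cont)
```

The boolean signal can only split the documents into two tied groups, so it cannot reflect the ordering experts see within each group. In this toy data the continuous score correlates more strongly with the ratings, which is the kind of evidence that would motivate the backend revision mentioned above.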
Not every Frame Vocabulary Standard entry maps onto a single
decision-readiness dimension. Some entries name patterns about
how readers EVALUATE analyses (rather than how analyses
themselves are structured), or patterns that operate above the
five-dimension layer. The library calls these meta-side
frames. They inform the methodology rather than feeding any
specific dimension's signal, which is why the library hub tags
them meta instead of with a dimension link.
Meta-side frames matter for decision-readiness even though they sit outside the dimensional decomposition: ignoring them would let an analysis score well on every dimension while still being misread (Fluency-Quality Illusion) or misattributed (System Attribution Error).
Meta-side library entries: FVS-002 Fluency-Quality Illusion (reader-side: prose fluency reads as analytical quality), FVS-005 System Attribution Error (reader-side: misattributing model output to a single system), FVS-006 Identity Framing Asymmetry (reader-side: the same content reads differently when attributed to different identities), FVS-013 Oracle Frame (reader-side: treating model outputs as authoritative predictions rather than samples), FVS-020 Invisible Frame (meta-meta: the unmarked default that all dimensional measurements presuppose).
Each dimension's measurement is a proxy. When the profile reports "Evidence backing: limited," the chain is:
The summary label is a compression. The underlying signal is shown alongside it so the reader can see the proxy and judge whether it warrants the label. This is the same construct-honesty pattern Frame Check uses across the rest of the product.
The five-dimension profile is not the complete decomposition of decision-readiness. Dimensions we do not currently operationalize:
We name these explicitly so a reader does not infer that the five-dimension profile is the complete picture. Future versions may add dimensions; this list is the public commitment to what is currently outside the profile's scope.
The decision-readiness profile applied independently to two documents tells you what each document looks like structurally. The transformation diff tells you what an LLM (or other transformation) DID to decision-readiness when it produced one document from the other. This is a Phase 1.5 extension built on top of the existing profile; it does not introduce new measurements, only a structured comparison of profile signals across a source / transformed pair.
We are not aware of another tool that publishes per-dimension decision-readiness deltas across LLM transformations. The measurement is novel; the methodology is the existing five-dimension profile applied to both halves of the pair.
Per dimension, the diff reports:
The diff also carries an overall narrative naming WHAT CHANGED across the transformation, silent on dimensions that held steady. This is deliberate: a narrative that lists every dimension would dilute the signal a reader needs to act on.
A corpus entry's metadata.yaml can carry two
optional fields naming its pair partner:
source_document,
llm_summary, llm_paraphrase,
llm_translation)
The pair-diff harness
(validation/decision_readiness/compute_pair_diffs.py)
walks the corpus, identifies paired entries, and writes
diff_with_{partner_slug}.json into both halves of
each pair. Re-running the harness after a profile regeneration
picks up the updated signals.
The diff is only as good as the profiles it compares. Every limitation named in the per-dimension sections above propagates to the diff: the boolean treatment of counterfactual loses information across transformations the same way it does in the single-document profile; the proxy chain of "construct -> signal -> proxy" applies to deltas too. A diff is a comparison of two proxies, not a measurement of "what the transformation actually did to decision quality." Construct-honesty is preserved by showing the per-dimension change_text alongside the synthesized narrative; readers can question the narrative by reading the specifics.
The transformation diff above is directional (source -> derived). Peer comparison is non-directional: two independent peer responses to the same prompt are compared on each decision-readiness dimension. Neither peer is privileged as ground truth; the comparison surfaces relative structural differences.
The corpus already holds groups of peer responses (the four
LLMs answering the bitcoin retirement question, the four LLMs
answering the startup offer question). The peer-comparison
harness
(validation/decision_readiness/compute_peer_comparisons.py)
discovers these via the optional peer_group field
in metadata.yaml and computes pairwise comparisons
within each group. With four members per group, six pairwise
comparisons emerge per group automatically.
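The pairwise count follows from choosing 2 members out of 4. A minimal sketch of the grouping step is below. Only the peer_group field name comes from this page; the slug key, the group name, and the member slugs are hypothetical, and this is not the harness's actual code.

```python
from itertools import combinations


def peer_pairs(entries):
    """Group corpus entries by peer_group and list unordered pairs per group."""
    groups = {}
    for entry in entries:
        group = entry.get("peer_group")  # optional field; entries without it are skipped
        if group is not None:
            groups.setdefault(group, []).append(entry["slug"])
    return {g: list(combinations(sorted(slugs), 2)) for g, slugs in groups.items()}


# Hypothetical metadata for one peer group of four responses plus one unpaired entry.
entries = [
    {"slug": "peer_a", "peer_group": "example_question"},
    {"slug": "peer_b", "peer_group": "example_question"},
    {"slug": "peer_c", "peer_group": "example_question"},
    {"slug": "peer_d", "peer_group": "example_question"},
    {"slug": "standalone_doc"},
]
pairs = peer_pairs(entries)
print(len(pairs["example_question"]))
```

Because the pairs are unordered combinations, neither member of a pair is treated as the reference, which matches the non-directional framing of peer comparison.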
The narrative names dimensions where the two peers measurably differ and stays silent on dimensions where their profiles agree. As with the transformation diff, the no-verdict discipline holds: "differs on coverage" is reported, not "peer A is better than peer B."
The math is shared (per-dimension comparison of two profiles)
but the narrative templates differ. A transformation diff says
"the transformation dropped risks coverage"; a peer comparison
says "only Claude addresses risks; Grok does not." Overloading
one module with both narrative styles would obscure the
distinct intents. Two modules,
decision_readiness_diff and
decision_readiness_peer, clarify the semantics
each is committed to.
The two modules apply different "meaningful change" thresholds on signal_value comparisons:
The asymmetry is methodological: directional measurements benefit from sensitivity (any change is informative); non-directional comparisons need filtering (sub-threshold deltas are noise, not signal). Thresholds are explicit choices and revisable; the v1 thresholds reflect convention rather than empirical calibration and should be re-examined when the corpus has enough peer pairs to estimate the noise floor empirically.
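The asymmetry can be sketched as a single comparison function called with two different thresholds. The threshold values and names below are placeholders chosen for illustration; they are not the v1 thresholds.

```python
# Placeholder thresholds (assumptions, not the published v1 values).
DIFF_THRESHOLD = 0.0   # directional: any change in signal_value is reported
PEER_THRESHOLD = 0.10  # non-directional: small deltas are treated as noise


def meaningful_change(value_a, value_b, threshold):
    """True when two signal_value readings differ by more than the threshold."""
    return abs(value_a - value_b) > threshold


# The same 0.03 delta on one dimension, seen by each module.
source_value, other_value = 0.50, 0.53
in_transformation_diff = meaningful_change(source_value, other_value, DIFF_THRESHOLD)
in_peer_comparison = meaningful_change(source_value, other_value, PEER_THRESHOLD)
print(in_transformation_diff, in_peer_comparison)
```

With these placeholder values the same delta appears in a transformation diff but is filtered out of a peer comparison. Keeping the thresholds as named constants makes them easy to revise once the noise floor can be estimated from the corpus.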
The decision-readiness profile is most differentially valuable applied to AI responses. A user asks an LLM a high-stakes question ("should I retire on this asset," "is this medication interaction serious," "should I take this job offer"). The LLM returns a prescriptive answer with structural confidence and apparent analytical breadth. The user has no built-in way to ask: is this response decision-ready, or just confident-sounding?
Frame Check via the MCP server applies the decision-readiness profile to the LLM's response itself. The user sees a structural read of the response: how many perspectives it addressed, whether claims are hedged, whether sourced, whether internally robust, whether counterfactuals are present. The user makes the decision; the profile provides the structural prerequisite.
We are not aware of an existing tool that frames AI-response audit as decision-readiness measurement at the moment of the conversation. If you know of one, we would like to hear about it; the strategic positioning depends on the territory being genuinely empty, not on our survey being complete.
The profile's claims about decision-readiness are not credible without validation against independent ground truth. The validation methodology has three phases.
This page. Documents the dimensions, the proxy chain, and the acknowledged gaps. Invites researcher and methodologist feedback before the profile becomes a live signal.
The five-dimension profile is computed server-side from existing display measurements and exposed in two places:
- The /api/download-json endpoint (display.decision_readiness), for power users downloading a result.
- The frame_check response (analysis.decision_readiness), so an agent invoking frame_check on its own last response receives the profile in line with the rest of the structural analysis. This wires the AI-response audit lead use case end-to-end; without it, the strategic positioning would be documented but not implemented for agents.
The result-page web UI does NOT currently surface the profile. UI visibility is gated on Phase 2 reaching the correlation thresholds documented below.
Validation harness scaffolded at validation/decision_readiness/ in the repository:
- rater_guide.md
- rating_template.yaml, which raters copy into ratings/{doc_id}/{rater_id}.yaml
- run_validation.py, which loads ratings + Frame Check profiles and computes per-dimension Spearman + per-rater ICC + per-genre breakdown into results/{date}/correlations.json
- corpus/{doc_id}/, with document text, metadata, and the Frame Check profile
The validation work itself follows: curating the corpus (target 20-30 documents spanning genres), recruiting raters (3+ per genre minimum), and running the harness once the ratings land.
Curate a rating corpus of 20-30 documents spanning genres (financial analysis, policy briefs, journalism, AI responses to life questions). The N is chosen for tractability of expert recruitment at the v1 stage; it is sufficient for per-dimension correlation estimation but not for tight per-genre confidence intervals. Phase 2 expands to 60-100 documents once the v1 methodology survives initial review.
Sampling honesty: the seeded corpus
(validation/decision_readiness/corpus/ in the
repository) is convenience-sampled from documents already
captured for other Frame Check worked examples. This means the
seeded set is biased toward documents that fire interesting
structural patterns; a corpus chosen to validate the methodology
should ideally include randomly-sampled documents to control for
this selection effect. The v2 corpus expansion explicitly adds a
randomly-sampled component (e.g., randomly-selected SEC filings
within a date range, randomly-selected Wikipedia entries from a
category list) so the v1-versus-v2 correlation difference itself
becomes a measurable signal of selection bias in the v1 results.
At least three expert raters per document score each dimension on a 1-5 scale, blind to Frame Check's profile. Three raters is the minimum at which a disagreement can be resolved into a majority and an outlier (with two, disagreement is just disagreement). More raters per document improve the estimate; three is the floor.
Computed and published per dimension:
The profile becomes a live signal in the product surface only when Phase 2 reaches per-dimension Spearman ≥ 0.6 averaged across genres, with no individual genre below 0.4. The 0.6 threshold matches conventional psychometric guidance for "moderate-to-strong" ordinal correlation; the 0.4 floor prevents shipping a profile that tracks expert ratings closely in one genre and is noise in another. Both thresholds are explicit choices, not field defaults; they are revisable in light of Phase 2 results, with any revision documented and dated.
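The shipping rule can be written as a small check. The sketch reads the rule as: for every dimension, the mean Spearman across genres must be at least 0.6 and no single genre may fall below 0.4. The function names, genre keys, and correlation values are hypothetical; only the two thresholds come from the text above.

```python
def passes_gate(rho_by_genre, mean_floor=0.6, genre_floor=0.4):
    """Check one dimension: mean Spearman across genres and the per-genre floor."""
    values = list(rho_by_genre.values())
    if not values:
        return False
    mean_rho = sum(values) / len(values)
    return mean_rho >= mean_floor and min(values) >= genre_floor


def profile_ready(rho_by_dimension):
    """The profile ships only if every dimension passes the gate."""
    return bool(rho_by_dimension) and all(
        passes_gate(genres) for genres in rho_by_dimension.values()
    )


# Hypothetical Phase 2 results for two dimensions.
strong_everywhere = {"financial": 0.70, "policy": 0.60, "journalism": 0.50, "ai": 0.65}
weak_in_one_genre = {"financial": 0.80, "policy": 0.70, "journalism": 0.35, "ai": 0.75}

print(passes_gate(strong_everywhere))  # mean 0.6125, minimum 0.50
print(passes_gate(weak_in_one_genre))  # mean 0.65, but journalism is below 0.4
print(profile_ready({"dimension_a": strong_everywhere, "dimension_b": weak_in_one_genre}))
```

The second dimension has the higher average but fails because of one weak genre, which is the case the 0.4 floor exists to catch.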
For analyses where decisions and outcomes can be observed over time, track whether profile readings predict outcome quality. This is genuinely hard, not theoretically hard:
Phase 3 is deliberately scoped narrow: financial analyses with realized P&L within a 12-month horizon, and explicit predictions with verifiable outcomes within a similar window. Negative results from Phase 3 inform Phase 2 weighting; the profile is not gated on Phase 3 because the methodology cannot wait years for outcome data to ship a structurally-grounded signal.
Once any signal becomes visible, people optimize FOR the signal. A document optimized to satisfy the decision-readiness profile is not necessarily a more decision-ready document; it may simply be a document better fitted to the profile's measurements. This is Goodhart's Law applied to our own metric.
We commit to: (a) never displaying a single composite score that could be optimized as a target; (b) maintaining the dimensional breakdown so optimization for one dimension is visible against others; (c) periodic methodology review when patterns suggest profile-targeted writing has emerged in the wild.
The dimensional decomposition is grounded in well-established decision-theory and judgment literature. Each dimension corresponds to a documented failure mode rather than an invented construct.
We do not cite these references to certify the dimensions as correct; we cite them to ground the choice of dimensions in established literature so the design is reviewable rather than arbitrary. A reviewer who disagrees with a dimension can engage with the cited foundation rather than with our framing alone.
This methodology is open and citable.
Lucic, L. (2026). Decision-readiness profile: methodology, v0.1. FrameCheck Frame Library 0.2.0. https://frame.clarethium.com/corpus/decision-readiness/
The methodology version (v0.1) reflects the current status: methodology published, backend computation shipped (Phase 1.5, JSON-only exposure), validation harness scaffolded, expert ratings not yet collected. The version bumps to v1.0 once Phase 2 reaches the correlation thresholds and the profile becomes a live signal in the result-page UI.
Methodologists, researchers, and practitioners with expertise in decision theory, risk analysis, or expert judgment elicitation are the intended pre-publication review audience. Feedback is solicited on: dimensional decomposition, proxy validity, validation design, and acknowledged gaps.
Channels for feedback:
[decision-readiness]. The repository's history is the canonical record of methodology evolution, so issues there compound into the next methodology version.
There is no mailing list yet. If you want to be notified when the profile becomes a live signal in the product, the research blog is the publication channel; subscribing there is the closest available signal.