Decision-readiness profile: methodology

A profile, not a score. Five structural dimensions of decision support derived from existing Frame Check measurements. Lead use case: AI-response audit at the moment of the conversation (MCP integration). Status: methodology published, profile pending validation. Library version: 0.2.0. Licensed CC-BY-4.0.

What the methodology produces in practice on the current corpus: corpus aggregate findings. The aggregate carries cross-question outlier consistency findings, fired patterns per LLM, and per-dimension divergence rates across the seeded validation corpus. Same data exposed via MCP at frame-check://aggregate/latest.

Reviewers wanted. Phase 2 expert validation needs raters across genres. Without external raters the profile stays experimental indefinitely. How to participate →

What this is

Frame Check's existing measurements (perspective coverage, claim calibration, source verification, contradictions, named structural patterns) are signals of how WELL an analysis supports a decision. The decision-readiness profile composes these signals into a multi-dimensional read of decision support, with explicit acknowledgment of what each dimension measures and what it does not.

The profile is structural, not predictive. It does not claim to predict whether a decision based on the analysis will turn out well. It claims to surface signals that a thoughtful reader would already use intuitively to judge an analysis's adequacy. Outcome quality and decision quality are different constructs; this profile addresses the second.

What this is not

Not a single score. A composite "decision-quality score" combining diverse signals into one number is false precision. The dimensions matter independently; the user does the integration.
Not a verdict. The profile does not certify that an analysis is "decision-grade" or otherwise authorized for acting on. The judgment is the reader's.
Not domain-blind. A grant proposal SHOULD be heavily hedged. A historical claim SHOULD NOT. The profile annotates its dimensional readings with genre context when available; thresholds are not universal.
Not a substitute for outcome tracking. Decisions should still be logged and reviewed against outcomes. The profile addresses ex-ante reasoning quality, not ex-post outcome quality.

The five dimensions

1. Coverage of perspectives

Question: Does the analysis address the perspectives that matter for the decision?
Signal: Of five analytical dimensions (causes, risks, stakeholders, trends, uncertainty), how many are addressed, and how balanced are their coverage densities?
Why it matters for decisions: Decisions made on analyses that omit relevant perspectives are systematically biased. The Completeness Illusion (FVS-010) names the failure mode: presence of multiple perspectives at very different depths reads as "comprehensive" but is not.
What this signal does NOT capture: Whether the ADDRESSED perspectives are relevant to the SPECIFIC decision in question. A document covering all five general dimensions might miss the dimension that matters most for the user's particular choice.

Related library entries: FVS-001 Frame Amplification (single-frame amplification narrows perspectives), FVS-008 Growth Frame (growth-dominant narrows risks), FVS-009 Risk Frame (strengthens risks dimension), FVS-010 Completeness Illusion (the canonical balance-deficit pattern), FVS-011 Stakeholder Frame (strengthens stakeholders dimension), FVS-014 Temporal Anchoring (single temporal frame narrows), FVS-015 Efficiency Frame (efficiency-dominant narrows), FVS-017 False Balance (broad on count but misleading on weighting).
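
The count-plus-balance signal described above can be sketched as follows. This is a minimal illustration, not Frame Check's computation: the function name, the per-dimension density inputs, and the min/max balance ratio are all assumptions introduced here.

```python
DIMENSIONS = ("causes", "risks", "stakeholders", "trends", "uncertainty")


def coverage_signal(densities):
    """Return (addressed_count, balance) for a dict of per-dimension densities.

    `densities` maps dimension name -> non-negative density (hypothetical unit,
    e.g. markers per sentence). Balance is min/max density across all five
    dimensions: 0.0 when any dimension is missing, 1.0 when all are equal.
    """
    values = [max(0.0, float(densities.get(d, 0.0))) for d in DIMENSIONS]
    addressed = sum(1 for v in values if v > 0)
    top = max(values)
    balance = min(values) / top if top > 0 else 0.0
    return addressed, balance


# All five dimensions present, but at very different depths (made-up numbers).
lopsided = {"causes": 0.40, "risks": 0.02, "stakeholders": 0.05,
            "trends": 0.30, "uncertainty": 0.02}
count, balance = coverage_signal(lopsided)
print(count, round(balance, 2))
```

A document like the one in the example addresses every dimension yet has a very low balance value, which is the shape the Completeness Illusion entry describes.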

2. Claim calibration

Question: Are the claims appropriately hedged given their epistemic status?
Signal: Hedge-word ratio across claims; count of claims stated as fact when they are predictions or speculations.
Why it matters for decisions: Overconfident assertions about uncertain matters lead to under-weighting of contrary evidence. Calibration failures are a documented driver of poor decisions.
What this signal does NOT capture: Sophisticated implicit qualifiers ("in a typical case," "all else equal") that do not contain hedge tokens. The hedge-word proxy understates calibration in carefully written analytical prose.

Related library entries: FVS-012 Uncertainty Frame (presence of this frame is the qualitative correlate of calibrated hedging; absence is the structural overconfidence signal), FVS-017 False Balance (calibration failures in opposite directions: overconfidence in lower-supported perspectives, under-confidence in better-supported ones).
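
A minimal sketch of a hedge-word proxy, to make the stated blind spot concrete. The token list is hypothetical, and treating the ratio as "share of claims containing at least one hedge token" is an assumption; the detector's actual lexicon and formula are not specified on this page.

```python
import re

# Hypothetical hedge lexicon; the real detector's token list is not given here.
HEDGE_TOKENS = {"may", "might", "could", "likely", "possibly", "perhaps",
                "suggests", "appears", "approximately", "roughly"}


def hedge_ratio(claims):
    """Share of claims containing at least one hedge token (0.0 for no claims)."""
    if not claims:
        return 0.0
    hedged = 0
    for claim in claims:
        words = set(re.findall(r"[a-z']+", claim.lower()))
        if words & HEDGE_TOKENS:
            hedged += 1
    return hedged / len(claims)


claims = [
    "Revenue will double next year.",                   # prediction stated as fact
    "Revenue may double next year.",                    # explicit hedge token
    "All else equal, margins hold in a typical case.",  # implicit qualifier, missed
]
ratio = hedge_ratio(claims)
print(round(ratio, 2))
```

The third claim is carefully qualified but contains no hedge token, so a token-based proxy counts it as unhedged. That is the understatement noted above.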

Open detector-vs-library inconsistency: the detector code in domain_baselines.py emits a calibration-related pattern labeled "Confidence Imbalance" with ID FVS-002, but the library entry FVS-002 is Fluency-Quality Illusion, a meta-side frame about reader evaluation rather than a structural calibration signal. The methodology page does not cite FVS-002 as the canonical library entry for Calibration because of this mismatch; FVS-012 is cited instead. Resolution requires curator-level naming work in the detector code OR a new library entry for the Confidence Imbalance pattern under its own FVS-ID.

3. Evidence backing

Question: Are the claims supported by sources, or floating assertions?
Signal: Numerical claims successfully matched against authoritative providers (SEC EDGAR, World Bank, Wikipedia, etc.), weighted by the provider's calibrated F1. Plus the share of sentences attributed to a source.
Why it matters for decisions: Decisions based on unsourced numerical claims are decisions based on assertion. Source verification surfaces the share of the analysis that can be checked against external ground truth.
What this signal does NOT capture: Whether the sources themselves are correct, or whether the analysis interprets sourced data correctly. Source-agreement is a necessary but insufficient condition for evidence quality.

Related library entries: FVS-016 Authority by Citation (citation-shaped language without verifiable sources is a false-positive signal on this dimension; the Source Network calibration corpus is the direct corrective).
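
One way the two evidence signals could be computed, as a sketch under stated assumptions: each matched numerical claim contributes its provider's F1 as a weight, unmatched claims contribute zero, and the attributed-sentence share is reported separately. The function name, provider keys, and F1 values are placeholders, not measured calibrations.

```python
def evidence_signals(numeric_claims, provider_f1, attributed_sentences, total_sentences):
    """Return (f1_weighted_verified_share, attributed_share).

    `numeric_claims` lists the matching provider per claim, or None if unmatched.
    `provider_f1` maps provider name -> calibrated F1 in [0, 1] (values assumed).
    """
    if numeric_claims:
        weight = sum(provider_f1.get(p, 0.0) for p in numeric_claims if p is not None)
        verified_share = weight / len(numeric_claims)
    else:
        verified_share = 0.0
    attributed_share = attributed_sentences / total_sentences if total_sentences else 0.0
    return verified_share, attributed_share


# Placeholder F1 values for illustration only.
f1 = {"sec_edgar": 0.9, "world_bank": 0.8}
claims = ["sec_edgar", "world_bank", None, None]
verified, attributed = evidence_signals(claims, f1, attributed_sentences=6,
                                        total_sentences=20)
print(round(verified, 3), round(attributed, 2))
```

Weighting by F1 means a match from a less reliable provider moves the signal less than a match from a more reliable one. Neither number says anything about whether the source itself is correct.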

4. Robustness

Question: Does the analysis hold up under scrutiny from the available evidence?
Signal: Count of contradicted or disputed claims where Frame Check's verifier found a source value at odds with the document. Cross-model consensus where applicable.
Why it matters for decisions: A single contradicted claim can invalidate a chain of reasoning. Robustness asks how many load-bearing claims survive contact with external sources.
What this signal does NOT capture: Logical robustness (whether the argument structure holds), only evidentiary robustness for numerical claims. A logically incoherent argument with verified numbers would still register as robust on this dimension.

Related library entries: FVS-016 Authority by Citation (when load-bearing claims rely on fabricated citations, robustness fails the moment a reader checks).

5. Counterfactual thinking

Question: Does the analysis name what would falsify it, or what alternatives it has considered?
Signal: Presence of failure-framing markers (FVS-007 fires when limitations / risks / counter-arguments are absent in a domain that should address them). Presence of uncertainty dimension markers.
Why it matters for decisions: Confirmation bias is the most-documented decision pathology. An analysis that does not consider what could be wrong is an analysis that has made it harder to update on new evidence.
What this signal does NOT capture: Whether the counterfactuals NAMED are the strongest counterfactuals available. A pro-forma "limitations" paragraph satisfies the proxy without genuine counterfactual reasoning. The current backend computation (Phase 1.5) emits this dimension as a boolean composite; expert ratings are 1-5 ordinal. Spearman correlation handles the mismatch but loses information; Phase 2 may revise the backend to a continuous 0-1 score if validation reveals the boolean treatment is too coarse to discriminate.

Related library entries: FVS-001 Frame Amplification (sophisticated single-frame amplification suppresses counterfactual reasoning), FVS-007 Failure Framing (the canonical absence pattern; the methodology page's structural proxy is grounded here), FVS-009 Risk Frame (presence supports counterfactual engagement), FVS-012 Uncertainty Frame (presence surfaces alternative interpretations), FVS-014 Temporal Anchoring (over-anchoring to one time orientation hides the disconfirmations the others would surface).

Meta-side frames in the library

Not every Frame Vocabulary Standard entry maps onto a single decision-readiness dimension. Some entries name patterns about how readers EVALUATE analyses (rather than how analyses themselves are structured), or patterns that operate above the five-dimension layer. The library calls these meta-side frames. They inform the methodology rather than feeding any specific dimension's signal, which is why the library hub tags them meta instead of with a dimension link.

Meta-side frames matter for decision-readiness even though they sit outside the dimensional decomposition: ignoring them would let an analysis score well on every dimension while still being misread (Fluency-Quality Illusion) or misattributed (System Attribution Error).

Meta-side library entries: FVS-002 Fluency-Quality Illusion (reader-side: prose fluency reads as analytical quality), FVS-005 System Attribution Error (reader-side: misattributing model output to a single system), FVS-006 Identity Framing Asymmetry (reader-side: the same content reads differently when attributed to different identities), FVS-013 Oracle Frame (reader-side: treating model outputs as authoritative predictions rather than samples), FVS-020 Invisible Frame (meta-meta: the unmarked default that all dimensional measurements presuppose).

The proxy chain we are honest about

Each dimension's measurement is a proxy. When the profile reports "Evidence backing: limited," the chain is:

The CONSTRUCT is "evidence backing"
The SIGNAL is "share of numerical claims source-verified, F1-weighted"
The SIGNAL is itself a proxy for "share of claims correct" (verification measures source-agreement, not truth)
And the universe of claims tested is restricted to NUMERICAL claims (qualitative claims are not verified)

The summary label is a compression. The underlying signal is shown alongside it so the reader can see the proxy and judge whether it warrants the label. This is the same construct-honesty pattern Frame Check uses across the rest of the product.
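
A small sketch of what "label shown alongside signal" could look like as a data structure. The class and field names are hypothetical and the numeric value is invented; only the construct, signal description, and "limited" label come from the chain above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionReading:
    construct: str       # what the reader cares about
    signal: str          # what is actually computed
    signal_value: float  # the number behind the label
    label: str           # compressed summary shown to the reader
    scope_note: str      # what the signal cannot see

    def render(self):
        """Show the label together with the proxy it compresses."""
        return (f"{self.construct}: {self.label} "
                f"({self.signal} = {self.signal_value:.2f}; {self.scope_note})")


reading = DimensionReading(
    construct="Evidence backing",
    signal="share of numerical claims source-verified, F1-weighted",
    signal_value=0.21,  # made-up value for illustration
    label="limited",
    scope_note="numerical claims only; source-agreement, not truth",
)
print(reading.render())
```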

What we DO NOT measure (acknowledged missing dimensions)

The five-dimension profile is not the complete decomposition of decision-readiness. Dimensions we do not currently operationalize:

Stakes calibration. Is the depth and rigor of the analysis appropriate to the magnitude of the decision? A casual essay should not be judged by the standards of a regulatory filing, and vice versa.
Recency. Are the inputs current? An analysis based on year-old data may not support a present decision.
Conflict of interest. Who wrote it; what is their stake in the conclusion? Detection requires meta-information the document text alone does not carry.
Replicability. Would different sources or different analysts reach the same conclusion?
Reasoning structure. Beyond claim-level calibration, is the argument structure logically valid?

We name these explicitly so a reader does not infer that the five-dimension profile is the complete picture. Future versions may add dimensions; this list is the public commitment to what is currently outside the profile's scope.

Transformation diff: measuring what an LLM does to decision-readiness

The decision-readiness profile applied independently to two documents tells you what each document looks like structurally. The transformation diff tells you what an LLM (or other transformation) DID to decision-readiness when it produced one document from the other. This is a Phase 1.5 extension built on top of the existing profile; it does not introduce new measurements, only a structured comparison of profile signals across a source / transformed pair.

We are not aware of another tool that publishes per-dimension decision-readiness deltas across LLM transformations. The measurement is novel; the methodology is the existing five-dimension profile applied to both halves of the pair.

What the diff measures

Per dimension, the diff reports:

source_value and transformed_value: the dimension's signal_value on each side of the pair
value_delta: the numeric change (transformed minus source). Positive deltas mean the transformation increased the signal value; whether that is favorable depends on the dimension
change_text: a plain-language sentence describing the change, including qualitative shifts (e.g., "Confidence Imbalance pattern emerged in transformation", "transformation dropped risks coverage")
moved: a boolean derived from the combination of value_delta and qualitative criteria; the synthesis narrative lists only dimensions where moved is true

The diff also carries an overall narrative naming WHAT CHANGED across the transformation, silent on dimensions that held steady. This is deliberate: a narrative that lists every dimension would dilute the signal a reader needs to act on.
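
The per-dimension fields and the silent-on-steady narrative can be sketched as below. The class, the `qualitative_shift` flag, and the exact rule for `moved` are assumptions; the page states only that `moved` combines value_delta with qualitative criteria.

```python
from dataclasses import dataclass


@dataclass
class DimensionDiff:
    dimension: str
    source_value: float
    transformed_value: float
    change_text: str
    qualitative_shift: bool = False  # e.g. a named pattern emerged (assumed flag)

    @property
    def value_delta(self):
        # Transformed minus source, as the field list above defines it.
        return self.transformed_value - self.source_value

    @property
    def moved(self):
        # Assumed rule: any numeric movement or any qualitative shift counts.
        return self.value_delta != 0 or self.qualitative_shift


def narrative(diffs):
    """List only the dimensions that moved; stay silent on the rest."""
    return [f"{d.dimension}: {d.change_text}" for d in diffs if d.moved]


diffs = [
    DimensionDiff("coverage", 5, 4, "transformation dropped risks coverage"),
    DimensionDiff("evidence_backing", 0.5, 0.5, "no change"),
]
lines = narrative(diffs)
print(lines)
```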

How pairs are declared

A corpus entry's metadata.yaml can carry two optional fields naming its pair partner:

paired_with: the slug of the partner entry
transformation_kind: what the transformation was (e.g., source_document, llm_summary, llm_paraphrase, llm_translation)

The pair-diff harness (validation/decision_readiness/compute_pair_diffs.py) walks the corpus, identifies paired entries, and writes diff_with_{partner_slug}.json into both halves of each pair. Re-running the harness after a profile regeneration picks up the updated signals.
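
A sketch of the pair-discovery step, assuming each entry's metadata.yaml has already been loaded into a dict keyed by slug. The slugs and helper names are hypothetical; only the paired_with and transformation_kind fields and the diff_with_{partner_slug}.json naming come from the description above.

```python
def find_pairs(corpus):
    """Return sorted, de-duplicated (slug, partner_slug) tuples.

    `corpus` maps entry slug -> metadata dict as it might be loaded from
    metadata.yaml. Declarations naming a partner that is not in the corpus
    are skipped.
    """
    pairs = set()
    for slug, meta in corpus.items():
        partner = meta.get("paired_with")
        if partner and partner in corpus:
            pairs.add(tuple(sorted((slug, partner))))
    return sorted(pairs)


def diff_filenames(pair):
    """Each half of a pair receives a file named after its partner."""
    a, b = pair
    return {a: f"diff_with_{b}.json", b: f"diff_with_{a}.json"}


# Made-up slugs for illustration.
corpus = {
    "report-original": {"paired_with": "report-summary",
                        "transformation_kind": "source_document"},
    "report-summary": {"paired_with": "report-original",
                       "transformation_kind": "llm_summary"},
    "standalone-essay": {},
}
pairs = find_pairs(corpus)
print(pairs, diff_filenames(pairs[0]))
```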

Limits the diff inherits from the profile

The diff is only as good as the profiles it compares. Every limitation named in the per-dimension sections above propagates to the diff: the boolean treatment of counterfactual loses information across transformations the same way it does in the single-document profile; the proxy chain of "construct -> signal -> proxy" applies to deltas too. A diff is a comparison of two proxies, not a measurement of "what the transformation actually did to decision quality." Construct-honesty is preserved by showing the per-dimension change_text alongside the synthesized narrative; readers can question the narrative by reading the specifics.

Peer comparison: cross-LLM decision-readiness on the same question

The transformation diff above is directional (source -> derived). Peer comparison is non-directional: two independent peer responses to the same prompt are compared on each decision-readiness dimension. Neither peer is privileged as ground truth; the comparison surfaces relative structural differences.

The corpus already holds groups of peer responses (the four LLMs answering the bitcoin retirement question, the four LLMs answering the startup offer question). The peer-comparison harness (validation/decision_readiness/compute_peer_comparisons.py) discovers these via the optional peer_group field in metadata.yaml and computes pairwise comparisons within each group. With four members per group, six pairwise comparisons emerge per group automatically.
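
The grouping and pairing step can be sketched with the standard library. The slugs and group name below are invented, and the metadata is assumed to be already loaded into dicts; only the peer_group field name comes from the text.

```python
from collections import defaultdict
from itertools import combinations


def peer_pairs(corpus):
    """Group entries by their optional peer_group field and pair them within each group."""
    groups = defaultdict(list)
    for slug, meta in corpus.items():
        group = meta.get("peer_group")
        if group:
            groups[group].append(slug)
    return {g: list(combinations(sorted(members), 2)) for g, members in groups.items()}


# Slugs and group name are made up for illustration.
corpus = {f"bitcoin-retirement-{m}": {"peer_group": "bitcoin-retirement"}
          for m in ("model-a", "model-b", "model-c", "model-d")}
pairs = peer_pairs(corpus)
print(len(pairs["bitcoin-retirement"]))
```

Four members yield C(4, 2) = 6 unordered pairs, matching the count stated above.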

What peer comparison answers

Which peer addresses more analytical perspectives on this specific question?
Do the peers' calibration patterns differ structurally (hedge-ratio gap, presence-or-absence of Confidence Imbalance)?
Does only one peer engage with counterfactual thinking?
Are the differences concentrated on one dimension or distributed across many?

The narrative names the dimensions where the two peers measurably differ and is silent on dimensions where their profiles agree. As with the transformation diff, the no-verdict discipline holds: "differs on coverage" is reported, not "peer A is better than peer B."

Why the diff and the peer comparison are separate modules

The math is shared (per-dimension comparison of two profiles) but the narrative templates differ. A transformation diff says "the transformation dropped risks coverage"; a peer comparison says "only Claude addresses risks; Grok does not." Overloading one module with both narrative styles would obscure the distinct intents. Two modules, decision_readiness_diff and decision_readiness_peer, clarify the semantics each is committed to.

Threshold asymmetry between diff and peer

The two modules apply different "meaningful change" thresholds on signal_value comparisons:

Transformation diff surfaces ANY measurable movement (any nonzero delta, in either direction). Source-to-derived is directional; even small movements name what the transformation did. A summary that hedge-shifts by 3 percentage points has done something we want to see.
Peer comparison applies meaningfulness thresholds (currently 5 percentage points for sentence-attribution and hedge ratio) to filter sampling noise. Two independent peers may differ by 1-2 percentage points just from prompt-sensitivity or seed variance; calling that "differs" would inflate the divergence count without adding signal.

The asymmetry is methodological: directional measurements benefit from sensitivity (any change is informative); non-directional comparisons need filtering (sub-threshold deltas are noise, not signal). Thresholds are explicit choices and revisable; the v1 thresholds reflect convention rather than empirical calibration and should be re-examined when the corpus has enough peer pairs to estimate the noise floor empirically.
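
The asymmetry can be shown with two small predicates. This is a sketch: the function names are hypothetical, values are expressed as fractions (0.05 = 5 percentage points), and whether the peer threshold is inclusive is an assumption.

```python
PEER_THRESHOLD = 0.05  # 5 percentage points, the v1 value named above


def diff_moved(source_value, transformed_value):
    """Directional: any nonzero delta counts as movement."""
    return transformed_value != source_value


def peer_differs(value_a, value_b, threshold=PEER_THRESHOLD):
    """Non-directional: only gaps at or above the threshold count (>= is an assumption)."""
    return abs(value_a - value_b) >= threshold


# The same 3-point hedge-ratio gap, read by each module.
print(diff_moved(0.20, 0.17), peer_differs(0.20, 0.17))
```

The same 3-point gap is surfaced by the transformation diff and filtered out by the peer comparison.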

The lead use case: AI-response audit

The decision-readiness profile adds the most value when applied to AI responses. A user asks an LLM a high-stakes question ("should I retire on this asset," "is this medication interaction serious," "should I take this job offer"). The LLM returns a prescriptive answer with structural confidence and apparent analytical breadth. The user has no built-in way to ask: is this response decision-ready, or just confident-sounding?

Frame Check via the MCP server applies the decision-readiness profile to the LLM's response itself. The user sees a structural read of the response: how many perspectives it addressed, whether claims are hedged, whether sourced, whether internally robust, whether counterfactuals are present. The user makes the decision; the profile provides the structural prerequisite.

We are not aware of an existing tool that frames AI-response audit as decision-readiness measurement at the moment of the conversation. If you know of one, we would like to hear about it; the strategic positioning depends on the territory being genuinely empty, not on our survey being complete.

Validation methodology

The profile's claims about decision-readiness are not credible without validation against independent ground truth. The validation methodology has three phases.

Phase 1 (done): methodology published

This page. Documents the dimensions, the proxy chain, and the acknowledged gaps. Invites researcher and methodologist feedback before the profile becomes a live signal.

Phase 1.5 (done): backend computation, JSON + MCP exposure

The five-dimension profile is computed server-side from existing display measurements and exposed in two places:

The /api/download-json endpoint (display.decision_readiness) for power users downloading a result.
The MCP server response payload (analysis.decision_readiness), so an agent invoking frame_check on its own last response receives the profile in line with the rest of the structural analysis. This wires the AI-response audit lead use case end-to-end; without it, the strategic positioning would be documented but not implemented for agents.

The result-page web UI does NOT currently surface the profile. UI visibility is gated on Phase 2 reaching the correlation thresholds documented below.

Phase 2 (in progress): expert validation against rated corpus

Validation harness scaffolded at validation/decision_readiness/ in the repository:

Rater guide with per-dimension operational definitions and 1-5 anchor descriptions (rater_guide.md)
Rating template (rating_template.yaml) that raters copy into ratings/{doc_id}/{rater_id}.yaml
Correlation harness (run_validation.py) that loads ratings + Frame Check profiles and computes per-dimension Spearman + per-rater ICC + per-genre breakdown into results/{date}/correlations.json
Corpus directory for rated documents (corpus/{doc_id}/ with document text, metadata, and the Frame Check profile)

The validation work itself follows: curating the corpus (target 20-30 documents spanning genres), recruiting raters (3+ per genre minimum), and running the harness once the ratings land.

Browse the documents Phase 2 validation runs against: validation corpus. Each entry has its computed profile, diff/peer artifacts, and (for some pairs) an annotated reading walking the artifact field by field. Surfacing the corpus on the web channel makes aggregate findings verifiable against the actual documents being analyzed.

For pedagogical readings of actual diff and peer artifacts in the corpus, plus the rating-quality contrast prospective raters use to calibrate: decision-readiness examples. Three sub-collections (transformation-pair diffs, peer comparisons, rating contrast) for three audiences.

Phase 2: detailed methodology

Curate a rating corpus of 20-30 documents spanning genres (financial analysis, policy briefs, journalism, AI responses to life questions). The N is chosen for tractability of expert recruitment at the v1 stage; it is sufficient for per-dimension correlation estimation but not for tight per-genre confidence intervals. Phase 2 expands to 60-100 documents once the v1 methodology survives initial review.

Sampling honesty: the seeded corpus (validation/decision_readiness/corpus/ in the repository) is convenience-sampled from documents already captured for other Frame Check worked examples. This means the seeded set is biased toward documents that fire interesting structural patterns; a corpus chosen to validate the methodology should ideally include randomly-sampled documents to control for this selection effect. The v2 corpus expansion explicitly adds a randomly-sampled component (e.g., randomly-selected SEC filings within a date range, randomly-selected Wikipedia entries from a category list) so the v1-versus-v2 correlation difference itself becomes a measurable signal of selection bias in the v1 results.

At least three expert raters per document score each dimension on a 1-5 scale, blind to Frame Check's profile. Three raters is the minimum at which disagreement becomes interpretable: with two, a disagreement cannot be attributed to either rater; with three, a majority can be distinguished from an outlier. More raters per document improves the estimate; three is the floor.

Computed and published per dimension:

Spearman rank correlation between expert mean rating and Frame Check's signal. Spearman (not Pearson) because the dimensional signals are not assumed linear in expert judgment, and rank correlation is more robust to outlier documents.
Inter-rater reliability as intra-class correlation (ICC) across raters. ICC (not Cohen's kappa, which applies to categorical agreement) because ratings are 1-5 ordinal/interval.
Per-genre correlation breakdown because decision-readiness is genre-specific. A profile that correlates well in financial analysis but poorly in policy briefs is a profile that ships with a genre caveat, not a profile that ships without one.
Divergence cases: documents where the profile and expert consensus diverge sharply. These are the most informative for methodology revision.
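
For readers who want to see the per-dimension statistic concretely, here is a dependency-free sketch of Spearman rank correlation with average ranks for ties. It is not the code in run_validation.py, and the example data is invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Made-up data: mean expert rating per document vs. a profile signal.
expert_mean = [1.7, 2.3, 3.0, 4.3, 4.7]
signal = [0.10, 0.35, 0.30, 0.80, 0.90]
rho = spearman(expert_mean, signal)
print(round(rho, 2))
```

With one adjacent pair of documents swapped in rank, rho is 0.9 for this data. A boolean signal such as the current counterfactual composite produces many tied ranks, which is how rank correlation accommodates the scale mismatch while discarding within-tie ordering.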

The profile becomes a live signal in the product surface only when Phase 2 reaches per-dimension Spearman ≥ 0.6 averaged across genres, with no individual genre below 0.4. The 0.6 threshold matches conventional psychometric guidance for "moderate-to-strong" ordinal correlation; the 0.4 floor prevents shipping a profile that is highly predictive in one genre and noise in another. Both thresholds are explicit choices, not field defaults; they are revisable in light of Phase 2 results, with any revision documented and dated.
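
The two-part gate can be written as a single predicate per dimension. A sketch, assuming an unweighted mean across genres; the function name, genre keys, and correlation values are invented for illustration.

```python
def passes_gate(rho_by_genre, mean_threshold=0.6, genre_floor=0.4):
    """True when mean rho across genres >= 0.6 and no single genre falls below 0.4."""
    if not rho_by_genre:
        return False
    values = list(rho_by_genre.values())
    return sum(values) / len(values) >= mean_threshold and min(values) >= genre_floor


# Made-up correlations for one dimension.
strong_everywhere = {"financial": 0.70, "policy": 0.65, "journalism": 0.60}
noise_in_one = {"financial": 0.90, "policy": 0.85, "journalism": 0.20}
print(passes_gate(strong_everywhere), passes_gate(noise_in_one))
```

Both examples have the same mean, but the second fails because one genre sits below the floor. That is the case the 0.4 floor exists to catch.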

Phase 3 (longer-term): outcome tracking

For analyses where decisions and outcomes can be observed over time, track whether profile readings predict outcome quality. This is hard in practice, not merely in theory:

Outcome attribution: a decision's outcome depends on the decision AND on factors outside the decision-maker's control. Disentangling reasoning quality from luck requires either many samples (rare for high-stakes decisions) or controlled matched-pair designs (rarer still).
Selection effects: which documents we choose to track shapes which outcomes we measure. Tracking only documents that produced shareable outcomes biases the sample toward high-attention decisions.
Base rates: in some domains, even high-decision-readiness analyses produce bad outcomes at a rate set by the domain itself. Without comparison against random or low-readiness baselines, the readiness signal could correlate with outcome without being causally informative.
Long horizons: many decisions take years to resolve. Phase 3 starts with short-horizon predictions (quarterly earnings calls, policy forecasts within 12 months) because the feedback loop closes within the methodology's publication cycle.

Phase 3 is deliberately scoped narrowly: financial analyses with realized P&L within a 12-month horizon, and explicit predictions with verifiable outcomes within a similar window. Negative results from Phase 3 inform Phase 2 weighting; the profile is not gated on Phase 3 because the methodology cannot wait years for outcome data to ship a structurally-grounded signal.

Anti-Goodhart commitment

Once any signal becomes visible, people optimize FOR the signal. A document optimized to satisfy the decision-readiness profile is not necessarily a more decision-ready document; it may simply be a document better fitted to the profile's measurements. This is Goodhart's Law applied to our own metric.

We commit to: (a) never displaying a single composite score that could be optimized as a target; (b) maintaining the dimensional breakdown so optimization for one dimension is visible against others; (c) periodic methodology review when patterns suggest profile-targeted writing has emerged in the wild.

Foundational references

The dimensional decomposition is grounded in well-established decision-theory and judgment literature. Each dimension corresponds to a documented failure mode rather than an invented construct.

Calibration and overconfidence: Lichtenstein, Fischhoff, & Phillips (1982). Calibration of probabilities: The state of the art to 1980. In Kahneman, Slovic, & Tversky (Eds.), Judgment under uncertainty: Heuristics and biases. Cambridge University Press. The canonical reference for the calibration-failure literature; underwrites the Claim Calibration dimension.
Overconfidence and decision pathology: Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. Synthesizes the calibration literature for general audiences; underwrites the framing of overconfidence as a decision pathology.
Confirmation bias: Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175-220. Foundational survey of confirmation-bias research; underwrites the Counterfactual Thinking dimension.
Coverage of relevant considerations: Janis, I. L., & Mann, L. (1977). Decision making: A psychological analysis of conflict, choice, and commitment. Free Press. Names "vigilant decision making" as the deliberate canvassing of alternatives, considerations, and consequences; underwrites the Coverage dimension.
Source verification and evidence quality: the rationalist epistemology underlying scientific peer review. The closest single referent is the broader literature on evidence hierarchies in evidence-based medicine (e.g., Sackett, Straus, Richardson, Rosenberg, & Haynes, 2000, Evidence-based medicine: How to practice and teach EBM), which underwrites the Evidence Backing dimension.
Robustness and falsifiability: Popper, K. (1959). The logic of scientific discovery. Routledge. The criterion that a claim should name what would falsify it; underwrites the Robustness dimension when extended beyond formal scientific claims to argumentation.

We do not cite these references to certify the dimensions as correct; we cite them to ground the choice of dimensions in established literature so the design is reviewable rather than arbitrary. A reviewer who disagrees with a dimension can engage with the cited foundation rather than with our framing alone.

Citation

This methodology is open and citable.

Lucic, L. (2026). Decision-readiness profile: methodology, v0.1. FrameCheck Frame Library 0.2.0. https://frame.clarethium.com/corpus/decision-readiness/

The methodology version (v0.1) reflects the current status: methodology published, backend computation shipped (Phase 1.5, JSON and MCP exposure), validation harness scaffolded, expert ratings not yet collected. The version bumps to v1.0 once Phase 2 reaches the correlation thresholds and the profile becomes a live signal in the result-page UI.

Feedback and following the work

Methodologists, researchers, and practitioners with expertise in decision theory, risk analysis, or expert judgment elicitation are the intended pre-publication review audience. Feedback is solicited on: dimensional decomposition, proxy validity, validation design, and acknowledged gaps.

Channels for feedback:

Substantive review: open an issue at the GitHub repository with the topic prefixed [decision-readiness]. The repository's history is the canonical record of methodology evolution, so issues there compound into the next methodology version.
Direct correspondence: the research blog has the author's contact details and publishes longer-form methodology essays.
Citing this page: the page version (v0.1) is stable. Subsequent revisions bump the version and preserve the v0.1 URL via redirect; existing citations do not break.

There is no mailing list yet. If you want to be notified when the profile becomes a live signal in the product, the research blog is the publication channel; subscribing there is the closest available signal.