Stats Check
Run the agent_docs/statistics.md checklist over a Results section: flag bare p-values, missing N/effect-size/CI, causal overclaim on observational data, multiple-comparison issues, and numbers that disagree across text, tables, and abstract.
Core Rule
Every statistical claim must report effect size + uncertainty (a CI), N, the named test, and whether its assumptions hold. "Significant" means statistically significant — never "large" or "important". Association is not causation: correlational/observational designs license "associated with / predicts", never "causes / drives / improves". And no p-hacking / HARKing / selective reporting — the analysis you report is the analysis you pre-specified, with exploratory results labeled exploratory. A bare p-value, a missing N, a causal verb on an ablation, or a number that disagrees between the abstract and Table 2 are the failures this skill exists to catch.
This is the deterministic, numbers-and-design half of verification, operationalizing
agent_docs/statistics.md. It flags and proposes fixes — it does not change a reported
quantity (a Protected Claim per CLAUDE.md); recomputing a number to reconcile text and
table is fine, altering what was measured or which test was run is not.
When to Use
Invoke with /stats-check when:
- A Results section with statistics is "done" and you want it Reviewer-2-proof.
- A revision added or changed a number, test, or comparison.
- Before submission, on Results + the abstract + every table — the consistency check needs all three open at once.
- The prompt-router flagged the task [Statistics] or [Causation].
Scope it: /stats-check sections/results.tex. Run the whole Results section with its tables
and the abstract in view — half the findings are cross-location number mismatches.
Process
Phase 1: Load the Checklist and the Numbers
- Read agent_docs/statistics.md, the reporting checklist this skill runs. It is the source of truth; this skill is its executor.
- Read the target Results .tex, plus every table/figure it summarizes and the abstract. Numbers must agree across all three (statistics.md → text ↔ tables ↔ abstract); you cannot check consistency with only one open.
- Read MANUSCRIPT_MAP.md: the Data & reproducibility line (was the design an RCT, an ablation, or an observational comparison? This decides what causal language is licensed) and Claims that need extra care.
- For ML/agents work, read agent_docs/field/ai-ml.md: variance over seeds/rollouts, matched-budget comparison, and partial-vs-full success are field-specific statistics expectations.
If a number's test or assumptions are not stated and you cannot tell which was run, that is a finding — do not infer the test. Which test was run is a Protected Claim; ask the author.
Phase 2: Find Every Statistical Claim
Walk the section and extract each quantitative comparison or inferential statistic: every p-value, CI, effect size, mean difference, correlation, regression coefficient, percentage, ratio, and "significant"/"more"/"better" comparison. Record for each: the point estimate, the uncertainty (CI/SE/SD, or none), the N, the named test (or none), the verb (causes / is associated with / predicts), and the scope (which harness, which task set, all vs subset).
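The per-claim record described above can be sketched as a small data structure. This is a minimal illustration, not part of the skill itself: the class name `StatClaim`, its field names, and the `missing_fields` helper are all hypothetical, and `None` is assumed to mean "not stated in the manuscript".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatClaim:
    """One extracted claim (hypothetical record); None means 'not stated'."""
    locator: str                          # where the claim sits, e.g. "res ¶3"
    quote: str                            # the claim exactly as written
    point_estimate: Optional[str] = None  # kept as text so nothing gets rounded
    uncertainty: Optional[str] = None     # CI / SE / SD, or None
    n: Optional[int] = None               # N for this comparison, not the study total
    test: Optional[str] = None            # the named test, or None
    verb: Optional[str] = None            # causes / is associated with / predicts
    scope: Optional[str] = None           # harness, task set, all vs subset

    def missing_fields(self) -> List[str]:
        """Names of required reporting elements that were not found."""
        required = ("point_estimate", "uncertainty", "n", "test")
        return [name for name in required if getattr(self, name) is None]


claim = StatClaim(
    locator="res ¶3",
    quote="tool-call accuracy was significantly higher (p = 0.03)",
    verb="was higher",
)
print(claim.missing_fields())  # ['point_estimate', 'uncertainty', 'n', 'test']
```

Leaving unknown fields as `None` rather than guessing keeps the record consistent with the rule that a missing test or N is a finding, not something to infer.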
Phase 3: Check Each Claim Against the Checklist
Run every claim through agent_docs/statistics.md:
- Effect size + uncertainty, not just p. A bare "p < 0.05" with no magnitude and no CI is incomplete — the reportable triplet is point estimate · interval · test (with N).
- "Significant" used correctly. Flag "significant" deployed as a synonym for large / important / meaningful. Magnitude needs a number, stated separately.
- N, test, assumptions stated. N for that comparison (not the study total); the test named; assumptions addressed (normality/variance for a t-test; linearity for Pearson r) and what was done if violated.
- Multiple comparisons. A family of tests needs a correction (Bonferroni/Holm/BH-FDR) or a justification; all tests run must be disclosed, not just the winners. A subgroup "significant" after many slices is a hypothesis, not a finding.
- Causation vs association. Check the verb against the design (Phase 1). A causal verb ("causes / drives / improves / the effect of") on a correlational comparison or an uncontrolled ablation is an OVERCLAIM. An RCT licenses causation within its population; a natural experiment does so only conditionally, with the identifying assumption stated.
- p-hacking / HARKing signals. Outcomes in Methods that vanish in Results (or appear in Results unannounced); p-values clustered just under 0.05; "trending toward significance"; post-hoc results framed as a-priori hypotheses; undisclosed exclusions or optional stopping.
- Significant digits & units. False precision ("94.732%" off a ±1% estimate); the last digit should be the first uncertain one; units present and consistent; percentage points ≠ percent (a 70→88 rise is 18 pp, ~26% relative).
- Cross-location consistency. Recompute: every number matches across text, tables, figures, and abstract; N is the same everywhere; totals add up; derived numbers (a difference, ratio, percent change) recompute from the reported inputs.
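To see why a family of tests needs a correction, here is a minimal sketch of the Benjamini-Hochberg (BH-FDR) adjustment named in the multiple-comparisons item. The function name `bh_adjust` and the eight p-values are hypothetical, chosen only to illustrate the arithmetic. The checklist may prefer Bonferroni or Holm for a given family.

```python
from typing import List


def bh_adjust(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for 8 subgroup tests (illustrative only).
raw = [0.004, 0.03, 0.04, 0.20, 0.35, 0.50, 0.70, 0.90]
adjusted = bh_adjust(raw)
raw_winners = [i for i, p in enumerate(raw) if p < 0.05]
survivors = [i for i, p in enumerate(adjusted) if p < 0.05]
print(raw_winners, survivors)  # [0, 1, 2] [0]
```

In this made-up family, three subsets look significant before correction and only one survives it. This is why reporting the winners without disclosing all the tests run is treated as a finding.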
Phase 4: Report Findings and Fixes
Produce the findings table (claim → issue → fix), severity-ordered. The fix says what the author must do. Do not edit reported quantities here — flag the mismatch, propose the calibration. Reconciling a text number to a table (a clerical correction) can be proposed; changing what was measured or which test was run is a Protected Claim needing sign-off.
Output Format
# Stats Check — sections/results.tex
## Summary
- Statistical claims examined: 12
- Clean: 6
- Issues: 6 (2 causal overclaim · 1 bare-p · 1 missing-N · 1 mult-comparison · 1 cross-location mismatch)
## Findings (claim → issue → fix)
| # | Locator | Claim (quoted) | Issue | Fix (author action) |
|---|---|---|---|---|
| 1 | res ¶2 | "the gate caused an 18-pp gain in tool-call accuracy" | Causal verb on an uncontrolled ablation (MANUSCRIPT_MAP: not an RCT) | Soften to "was associated with an 18-pp gain"; or justify a causal design |
| 2 | res ¶3 | "tool-call accuracy was significantly higher (p = 0.03)" | Bare p; no effect size, no CI | Add point estimate + 95% CI + named test + N: "X pp (95% CI a–b; two-sample t-test, n = …)" |
| 3 | res ¶3 | "the gate reduced hallucinated tool calls (p = 0.02)" | N not stated for this comparison | State N for the comparison |
| 4 | res ¶5 | "the gate helped on 3 of 8 task subsets" | 8 subgroup tests, no correction, winners only | Report all 8; apply BH-FDR or justify; label exploratory if post-hoc |
| 5 | res ¶2 vs Table 2 | "21% → 6%" in text; Table 2 shows "21% → 8%" | Cross-location number mismatch | Reconcile to the real value; fix everywhere at once (Protected — confirm which is correct) |
| 6 | res ¶4 | "a significant improvement" | "significant" as a synonym for large | Give the magnitude with a number; reserve "significant" for the test result |
## Causation flags (design vs verb)
- Design per MANUSCRIPT_MAP: ablation comparison (not randomized at the unit of inference).
- Licensed: "associated with", "predicts", "co-occurs with". NOT: "causes", "drives", "the effect of".
- Finding 1 violates this — highest priority.
## Cross-location consistency
- Hallucinated-call rate: text 6% vs Table 2 8% — MISMATCH (finding 5).
- N: text "512 tasks" vs Methods "512 held-out" — consistent.
- 18-pp gain recomputes from Table 2 (39% → 57%) — OK.
## Notes for the author
- Findings 1, 5 are Protected (causal claim / changed number) — need your sign-off, not a silent edit.
- Confirm which test produced each p-value; I did not infer any (that would change what was reported).
End with the tally (clean / issues) and a one-line worst-risk. Never report a Results section "clean" while a bare p-value, a causal overclaim, or a cross-location mismatch stands.
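The cross-location checks in the example report, and the percentage-point rule from Phase 3, are plain arithmetic, so they can be recomputed rather than eyeballed. The sketch below uses hypothetical helper names (`pp_change`, `relative_change_pct`, `disagrees`). The figures are the ones already quoted in this document.

```python
from typing import Dict


def pp_change(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Relative change, as a percent of the starting value."""
    return 100.0 * (after_pct - before_pct) / before_pct


def disagrees(values_by_location: Dict[str, float], tol: float = 1e-9) -> bool:
    """True if the same quantity is reported with different values."""
    values = list(values_by_location.values())
    return max(values) - min(values) > tol


print(pp_change(70, 88))                          # 18 (percentage points)
print(round(relative_change_pct(70, 88), 1))      # 25.7 (percent, relative)
print(pp_change(39, 57))                          # 18, matches the reported gain
print(disagrees({"text": 6.0, "Table 2": 8.0}))   # True: flag it, do not pick a winner
```

A `True` from `disagrees` only says the two locations differ. Deciding which value is correct remains the author's call.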
Pairs With
- agent_docs/statistics.md: the checklist this skill executes; read it first, it is the authority for every rule above.
- integrity-reviewer agent: escalate when the signals look like patterned selective-reporting/HARKing across the whole manuscript, not isolated reporting gaps; it scans breadth, /stats-check verifies each number.
- /claim-check: run alongside; claim-check verifies verbs against cited sources, stats-check verifies numbers against the data and the design. Together they cover claims + quantities.
- citation-gate.sh (PostToolUse): orthogonal (it checks \cite resolution), but run it so the Results section is structurally clean before the numerical pass.
Common Rationalizations
| Rationalization | Reality |
|---|---|
| "p < 0.05 is the result; the effect size is obvious from the means" | A bare p hides magnitude and precision. Report the estimate + CI explicitly; do not make the reader reconstruct it. |
| "It's a significant improvement" | "significant" ≠ large. State the magnitude with a number; keep "significant" for the test. |
| "The gate improved accuracy" (from an ablation) | "improved" smuggles causation. An uncontrolled ablation licenses "was associated with". Match the verb to the design. |
| "I only report the comparisons that worked" | Reporting winners only is selective reporting. Disclose every test run; correct for the family. |
| "The abstract says 92%, the table says 89%, close enough" | A cross-location mismatch is a hard finding. The same quantity reads the same everywhere — recompute, do not eyeball. |
| "I'll just change the number to match" | Reconciling a clerical typo is fine; changing what was measured is a Protected Claim. Flag it, get sign-off. |
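The first rationalization in the table is easiest to retire by showing what "estimate + CI" looks like when written out. The sketch below computes a difference in two proportions with a Wald 95% interval. It assumes two independent groups and a normal approximation, which may not match the manuscript's design (a paired comparison on the same tasks needs a different interval). The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def diff_in_proportions_ci(x1: int, n1: int, x2: int, n2: int,
                           z: float = 1.96) -> Tuple[float, float, float]:
    """Difference p2 - p1 with a Wald interval (independent groups assumed)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Hypothetical success counts, for illustration only.
diff, lo, hi = diff_in_proportions_ci(x1=200, n1=512, x2=292, n2=512)
print(f"{100 * diff:.1f} pp (95% CI {100 * lo:.1f} to {100 * hi:.1f}; n = 512 per arm)")
```

The printed line has the shape the findings table asks for: point estimate, interval, and N together. The named test still has to come from the author.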
Notes
- This skill never invents a number, a test, or an assumption. A missing value is [VALUE — verify]; an unknown test is a question for the author, per the cardinal rule in CLAUDE.md.
- Causation-vs-association and selective-reporting judgments are Reasoner-tier (CLAUDE.md → Model Selection); the cross-location number-matching is deterministic: recompute, do not estimate (CLAUDE.md → Model vs Code).
- A recurring statistics slip (the author keeps correcting "significant", or bare p-values) is a rule: log under tasks/reviews/, applies_to: [statistics], promote to ## Top Rules if it recurs (CLAUDE.md → Self-Improvement Loop).