Claude Research Kit
Field

Field: AI / ML

ML/NLP venue cues (NeurIPS/ICML/ACL/EMNLP), reproducibility, baselines & ablations, eval contamination.

A field overlay: it supplements, not replaces, the general agent_docs. Read it before discipline-specific writing. It encodes the conventions and reviewer expectations of machine-learning venues so the agent does not write chemistry-paper prose into a NeurIPS submission.

Activate it in CLAUDE.project.md → Field overlay.


Venues & where content goes

Venue familyNotes
ML: NeurIPS, ICML, ICLR, AAAI, COLM~8–10 pages + unlimited appendix; double-blind; rebuttal phase; reproducibility checklist required
NLP: ACL, EMNLP, NAACL, COLINGACL Rolling Review; Limitations section is mandatory and unnumbered; responsible-NLP checklist
Preprint: arXivNorm to post early; cite the published version once it exists
  • Main paper vs appendix: core claims, primary results, and the ablation that supports the thesis go in the main paper. Hyperparameter grids, extra seeds, prompts, and proofs go to the appendix.
  • Reproducibility statement / checklist and (for NLP) a Limitations section are not optional — reviewers check for them.

Structure (often not strict IMRaD)

A typical layout: Introduction → Related Work → Method → Experiments (setup, results, ablations, analysis) → (Limitations) → Conclusion. Differences from IMRaD to honor:

  • "Method" replaces "Materials and Methods"; describe the model/algorithm precisely enough to reimplement.
  • "Experiments" carries setup + results + analysis together; separate what you measured from what it means.
  • Contributions are usually stated as an explicit bulleted list at the end of the Introduction.

Reproducibility (the bar is high here)

State enough to reproduce — this is the field's defining review criterion:

ReportExample
Random seeds"results averaged over 5 seeds; seeds listed in App. C"
Hyperparametersfull grid + selected values; selection criterion (val set)
Computehardware, GPU-hours, wall-clock — for fair-comparison and carbon reporting
Dataversion, splits, preprocessing, license; how train/val/test were separated
Modelexact checkpoint / API model and date (API models drift — pin the version)
Decoding (for LLMs)temperature, top-p, max tokens, stop conditions, prompt templates (verbatim, in App.)
Code/data releaserepository or "will be released"; tie to agent_docs/reproducibility.md

data/raw/ is immutable (enforced by protect-sources.sh) — derive splits into new files, never edit raw in place.

Baselines & ablations (reviewers attack these first)

  • Fair comparison: matched compute / parameter / data budgets. A win under unequal budgets is not a win — state the budgets.
  • Strong baselines: include the obvious strong method, not just weak ones. A missing obvious baseline is the most common desk-reject reason.
  • Ablations: remove each component of your method to show it carries weight. "We add A, B, C and it's better" is not a result; "A alone gives X, +B gives Y, +C gives Z" is.
  • Variance: report over multiple seeds/runs with error bars or CIs — never a single run. See agent_docs/statistics.md.

Evaluation hygiene

  • Contamination / leakage: check that test data (or its sources) did not appear in pretraining or in your tuning. State how you checked. For LLMs, benchmark contamination is a first-order reviewer concern.
  • Held-out discipline: tune on val, report on test once. No test-set peeking, no selecting the seed that looks best.
  • Human eval (when used): protocol, number of annotators, inter-annotator agreement (e.g. Cohen's κ), and the rubric belong in the paper.
  • Benchmark saturation: if a benchmark is near-ceiling, a small gain may be noise — argue significance.

LLM-agent specifics

  • Vocabulary (lock per MANUSCRIPT_MAP.md → Terminology): agent, tool, tool call, trajectory / rollout, episode, step, task horizon, policy, scaffold. Pick one term per concept — e.g. "tool-call accuracy", not alternating "success rate" / "correctness".
  • Environment versioning: agent benchmarks change; pin the environment/harness version and commit hash.
  • Non-determinism: API models and sampling make runs non-reproducible — report multiple rollouts, temperature, and the model snapshot date.
  • Cost & latency: report tokens / API cost / wall-clock — agentic methods that "win" by spending 10× compute must say so.
  • Partial credit: distinguish full task success from partial / sub-goal completion; define the metric precisely.

Citations & notation

  • Style: numbered (ACL/IEEE) or author–year (natbib) per venue. Cite the published version when one exists, not just the arXiv id.
  • Notation: define every symbol on first use; brace-protect acronyms in BibTeX titles so they survive casing — {LLM}, {API}, {RAG}.
  • Common knowledge for this audience needs no citation (e.g. "LLMs can produce ungrounded outputs"); calibrate to the venue's readership.

Typical reviewer concerns (pre-empt them)

ConcernWhat it looks likePre-empt by
Missing baseline"why not compare to X?"include the obvious strong baseline
Cherry-pickingonly nice qualitative examplesrandom samples + failure cases
No significancesingle-run SOTAmultiple seeds + CIs / tests
Unfair comparisonmore compute/data than baselinesmatched budgets, stated
Contaminationtest seen in pretraininga stated contamination check
IrreproducibleAPI model, no versionpin model + date + decoding params
Overclaim"general", "SOTA", "solves"calibrate to the tested settings (agent_docs/academic-style.md)
No limitationsabsent / perfunctorya genuine Limitations section

MANUSCRIPT_MAP.md additions for ML papers

Add to your map's Claims that need extra care:

  • Generalization beyond the tested benchmarks/harness is out of scope unless shown.
  • A correlational ablation does not license a causal claim ("more tools causes higher success").
  • "SOTA" / "first" / "best" require the comparison that establishes them, under matched budgets.