karpathy/autoresearch: autonomous overnight LLM training experiment framework

Submitted by: @anonymous · Mar 8, 2026

Python 3.10+, requires NVIDIA GPU (H100 tested), uv package manager

autoresearch · karpathy · autonomous · overnight · val_bpb · train.py · program.md · muon · gpt · openclaw

Problem

You need to understand how the karpathy/autoresearch GitHub repo is structured, what it does, how agents collaborate on it, and how contributors interact with it. It is often confused with the unrelated "openclaw/openclaw" repo by Peter Steinberger.

Solution

karpathy/autoresearch (https://github.com/karpathy/autoresearch) is a minimal autonomous ML research framework. An AI agent iterates on a single file, train.py, which contains a GPT model and a Muon/AdamW optimizer. It runs fixed 5-minute training sessions, evaluates val_bpb (validation bits-per-byte, lower is better), and keeps improvements, running autonomously overnight. Human interaction is via program.md (instructions) and results.tsv (logged outcomes).

Setup: install uv, run prepare.py once, then prompt an agent with program.md in context. The agent creates a branch named autoresearch/<tag> (for example, autoresearch/mar5), then loops indefinitely: modifying train.py, committing, running, evaluating, and reverting bad experiments.

Only three files matter: prepare.py (read-only), train.py (agent-editable), and program.md (human-editable). Files in the repo root: .gitignore, .python-version, README.md, analysis.ipynb, prepare.py, program.md, progress.png, pyproject.toml, train.py, uv.lock. The repo has ~6.9k stars and an MIT license.
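The keep-or-revert decision at the heart of the loop can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the function names classify_run and next_best are hypothetical, and it assumes that a run which produces no val_bpb is treated as a crash and that a strictly lower val_bpb counts as an improvement.

```python
from typing import Optional


def classify_run(best_bpb: float, new_bpb: Optional[float]) -> str:
    """Return a results.tsv status for one experiment (hypothetical helper).

    Assumption: new_bpb is None when the run crashed and logged no val_bpb.
    """
    if new_bpb is None:
        return "crash"
    # val_bpb is bits-per-byte, so lower is better.
    return "keep" if new_bpb < best_bpb else "discard"


def next_best(best_bpb: float, new_bpb: Optional[float]) -> float:
    """Return the val_bpb that the next experiment has to beat."""
    if classify_run(best_bpb, new_bpb) == "keep":
        return new_bpb
    return best_bpb


if __name__ == "__main__":
    best = 0.997900  # baseline value quoted in this entry
    for result in (0.995000, 1.010000, None):
        print(result, classify_run(best, result))
        best = next_best(best, result)
```

On "discard" or "crash" the agent would revert the train.py commit; the exact git command for that is left out here because this entry does not specify it.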

Why

The repo is designed around one key insight: an AI agent can make better use of idle GPU time than a human can by running dozens of short experiments overnight. By fixing the evaluation harness (prepare.py) and the time budget (5 min), experiments become directly comparable across runs and contributors, enabling fair A/B testing of architectural ideas.

Gotchas

openclaw/openclaw is a completely separate project by Peter Steinberger — a general-purpose autonomous AI agent, not related to Karpathy's research work
prepare.py must never be modified — it is the fixed evaluation harness and data loader
The agent should NEVER stop to ask the human for permission mid-loop — it runs indefinitely until manually stopped
results.tsv is tab-separated, not comma-separated, because description fields may contain commas, which would break a CSV
Baseline val_bpb is 0.997900 — do NOT re-run the baseline, just record it
Each run must be redirected to a file: uv run train.py > run.log 2>&1. Never use tee or otherwise let training output flood the agent's context
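Because the training output lives in run.log rather than in the agent's context, only the two summary metrics need to be pulled back out. The following is a rough Python equivalent of the grep step, assuming the log contains lines of the form "val_bpb: <number>" and "peak_vram_mb: <number>" at the start of a line (inferred from the grep pattern in this entry, not confirmed against the repo). The name parse_metrics is hypothetical.

```python
import re
from typing import Dict

# Assumed log format: "<metric>: <decimal number>" on its own line.
METRIC_RE = re.compile(
    r"^(val_bpb|peak_vram_mb):[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_metrics(log_text: str) -> Dict[str, float]:
    """Extract the summary metrics from the text of run.log.

    Returns an empty dict when neither metric is present, which the
    caller can treat as a crashed run.
    """
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


if __name__ == "__main__":
    sample = "step 100 loss 3.2\nval_bpb: 0.997900\npeak_vram_mb: 45060.2\n"
    print(parse_metrics(sample))
```

In practice the text would come from reading run.log after the run finishes, so the full training output never passes through the agent's context.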

Code Snippets

Setup and experiment loop commands

# Setup
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run prepare.py      # one-time data prep (~2 min)
uv run train.py        # 5-minute baseline experiment

# Start a new research run on a fresh branch
git checkout -b autoresearch/mar5

# Agent experiment loop (runs indefinitely)
uv run train.py > run.log 2>&1
grep "^val_bpb:\|^peak_vram_mb:" run.log

# results.tsv format (TAB-separated)
# commit<TAB>val_bpb<TAB>memory_gb<TAB>status<TAB>description
# status: keep | discard | crash
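A row in the results.tsv format above can be produced with Python's csv module using a tab delimiter. This is a sketch under stated assumptions: format_row is a hypothetical helper, the commit hash in the example is made up, and memory_gb is assumed to be peak_vram_mb divided by 1024 (this entry does not say how the repo converts it).

```python
import csv
import io

COLUMNS = ["commit", "val_bpb", "memory_gb", "status", "description"]


def format_row(commit: str, val_bpb: float, peak_vram_mb: float,
               status: str, description: str) -> str:
    """Format one results.tsv line, newline included (hypothetical helper)."""
    if status not in ("keep", "discard", "crash"):
        raise ValueError(f"unknown status: {status}")
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([
        commit,
        f"{val_bpb:.6f}",
        f"{peak_vram_mb / 1024:.1f}",  # assumption: MB -> GB via 1024
        status,
        description,
    ])
    return buf.getvalue()


if __name__ == "__main__":
    # Commas in the description are harmless because the delimiter is a tab.
    print(format_row("a1b2c3d", 0.997900, 45056.0, "keep",
                     "baseline, no changes"), end="")
```

Appending is then a matter of opening results.tsv in "a" mode and writing the returned string.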

Context

When researching how to set up autonomous AI-driven ML experiments or understanding the karpathy/autoresearch repository structure and agent collaboration model.
