# vibration_classification_nested_learning

**Repository Path**: fu1fan/vibration_classification_nested_learning

## Basic Information

- **Project Name**: vibration_classification_nested_learning
- **Description**: A Reproduction of GDM's Nested Learning Paper
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-22
- **Last Updated**: 2026-03-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Nested Learning Reproduction

![CI](https://github.com/kmccleary3301/nested_learning/actions/workflows/ci.yml/badge.svg)
![Security](https://github.com/kmccleary3301/nested_learning/actions/workflows/security.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.10%20to%203.12-blue)
![PyTorch](https://img.shields.io/badge/pytorch-2.9.0-red)
![License](https://img.shields.io/badge/license-Apache--2.0-green)
![Status](https://img.shields.io/badge/tests-smoke--ready-lightgrey)

Mechanism-level reproduction of Google's Nested Learning (HOPE) architecture (HOPE blocks, CMS, and Self‑Modifying TITANs), matching the quality bar set by lucidrains' TITAN reference while remaining fully open-source and `uv` managed.

Faithfulness scope (high level):
- ✅ HOPE / CMS / Self‑Modifying Titans update rules + wiring (mechanism-level)
- ✅ Tensor-level invariants covered by unit tests (teach-signal, δℓ, CMS chunking, causality)
- ✅ Boundary-target online chunking + optional attention-cache carry path are implemented
- ⚠️ Stable default uses stop-grad online writes; an experimental single-process boundary-state mode supports differentiable write paths
- ⚠️ Multi‑GPU mechanism-auditing online updates are not supported in this repo (DDP disables some features)

Paper reference pin:
- Source: `google_papers/Nested_Learning_Full_Paper/Nested_Learning_Full_Paper.md`
- SHA-256: `7524af0724ac8e3bad9163bf0e79c85b490a26bc30b92d96b0bdf17a27f9febc`

## Quickstart
```bash
uv python install 3.12
uv sync --all-extras
uv run nl doctor --json > logs/runtime_doctor.json
uv run bash scripts/data/run_sample.sh
uv run nl smoke --config-name pilot_smoke --device cpu
uv run bash scripts/run_smoke.sh pilot  # CPU-friendly HOPE block smoke test
uv run bash scripts/run_e2e_smoke.sh    # sync + sample data + smoke train + zeroshot eval
uv run bash scripts/run_mechanism_audit_smoke.sh
uv run python scripts/eval/zeroshot.py \
  --config configs/hope/pilot.yaml \
  --checkpoint artifacts/examples/pilot_dummy.pt \
  --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
  --tasks piqa --max-samples 32 --device cpu
```

## Requirements
- Python 3.10-3.12
- PyTorch 2.9.x+ (golden environment in this repo uses 2.9.x)
- `uv` (recommended for development) or `pip` for package-style usage

## Compatibility
- Support tiers and OS/runtime matrix: `docs/COMPATIBILITY_MATRIX.md`
- Versioning/stability policy: `docs/VERSIONING_POLICY.md`
- Golden repro environment: Python 3.12 + `uv lock` + PyTorch 2.9.x

macOS / Apple Silicon expectations:
- Mac users can run install + CLI + eval/smoke workflows.
- `train.device=mps` is supported for small/local runs.
- Linux + CUDA remains the only Tier 1 full-training path in this repo.
- Cross-backend numerical parity (CUDA vs MPS) is not guaranteed.
- If MPS is unavailable, device selection falls back to CPU (`nl doctor --json` shows this clearly).

## Installation (pip-first)
1. Create and activate a virtual environment.
2. Install Torch first (CPU/CUDA wheel selection is backend-specific).
3. Install this project.

CPU example:
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install "torch>=2.9,<3" --index-url https://download.pytorch.org/whl/cpu
python -m pip install -e .
```

CUDA example (adjust index URL to your CUDA runtime):
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install "torch>=2.9,<3" --index-url https://download.pytorch.org/whl/cu128
python -m pip install -e .
```

## Setup (uv dev workflow)
```bash
uv python install 3.12
uv sync --all-extras
```

Developer checks:
- `uv run ruff check .`
- `uv run mypy src`
- `uv run pytest`
- `uv run bash scripts/checks/run_fidelity_ci_subset.sh`
- `uv run python scripts/checks/compliance_report.py --config configs/pilot.yaml --output eval/compliance_report.json`

## CLI
The package ships with `nl` for portable workflows across local/dev/prod environments.

```bash
# runtime compatibility snapshot
uv run nl doctor --json

# architecture/config smoke on chosen device
uv run nl smoke --config-name pilot_smoke --device cpu --batch-size 1 --seq-len 8

# static fidelity checks for a config
uv run nl audit --config-name pilot_paper_faithful

# train with Hydra overrides
uv run nl train --config-name pilot --override train.device=cuda:1 --override train.steps=100
```

`python -m nested_learning ...` is also supported.

## First 30 Minutes
Use this path for a fast first success on CPU:

```bash
uv sync --all-extras
uv run bash scripts/data/run_sample.sh
uv run bash scripts/run_smoke.sh pilot
uv run bash scripts/run_mechanism_audit_smoke.sh
```

This confirms:
- data/tokenizer pipeline is operational,
- model/training loop runs end-to-end,
- cadence checks pass for a mechanism-auditing smoke run.

## Data Pipeline
1. **Tokenizer training**
   ```bash
   uv run python scripts/data/train_tokenizer.py \
     --manifest configs/data/refinedweb_mixture.yaml \
     --vocab-size 32000 \
     --output-dir artifacts/tokenizer/refinedweb_mix \
     --log-file data/mixtures/refinedweb_mix_tokenizer.json
   ```
2. **Corpus filtering + sharding**
   ```bash
   uv run python scripts/data/process_mixture.py \
     configs/data/refinedweb_mixture_filtered.yaml \
     --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
     --log-file data/mixtures/refinedweb_mix_filtered_shards.json
   ```
3. **Sample pipeline** (downloads/licensed datasets, filters, shards, records stats)
   ```bash
   uv run bash scripts/data/run_sample.sh
   ```
4. **Full pipeline** (set env vars like `RW_LIMIT`, `WIKI_LIMIT`, etc. to scale ingestion)
  ```bash
  uv run bash scripts/data/run_full.sh  # default ~50k docs per corpus; increase limits as needed
  ```

### Data Troubleshooting
- If `scripts/data/run_sample.sh` cannot find `artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model`, rerun:
  ```bash
  uv run bash scripts/data/run_sample.sh
  ```
  The script auto-trains the tokenizer when missing.
- If `scripts/data/run_full.sh` fails with `Bad split: train. Available splits: ['test']`, use split fallback:
  ```bash
  FALLBACK_SPLIT=test uv run bash scripts/data/run_full.sh
  ```
  You can also override per-corpus splits (for example `RW_SPLIT=test`).

## Training
- Single GPU / CPU:
  ```bash
  uv run nl train --config-name pilot_smoke
  ```
- Apple Silicon (MPS, if available):
  ```bash
  uv run nl train --config-name pilot_smoke --override train.device=mps
  ```
  Use this path for smoke and small local runs; long/full-scale paper-regime runs are not a supported Mac target in this repository.
- Script-based entrypoint (legacy-compatible):
  ```bash
  uv run python train.py --config-name pilot_smoke
  ```
- DDP (torchrun):
  ```bash
  torchrun --nproc_per_node=2 train_dist.py --config-name mid
  ```
- CPU-only DDP smoke (verifies `gloo` backend and deterministic seeding):
  ```bash
  uv run bash scripts/run_cpu_ddp_smoke.sh
  ```
- FSDP (see `docs/FSDP_SCALING_GUIDE.md` for VRAM/batch sizing):
  ```bash
  # 760M run
  torchrun --nproc_per_node=2 train_fsdp.py --config-name hope/mid_fsdp
  # 1.3B run
  torchrun --nproc_per_node=2 train_fsdp.py --config-name hope/target_fsdp
  ```
- DeepSpeed (requires `deepspeed` installed separately):
  ```bash
  deepspeed --num_gpus=2 train_deepspeed.py --config-name target \
    deepspeed.config=configs/deepspeed/zero3.json
  ```

### Mechanism-auditing presets (HOPE / Nested Learning)

Use the mechanism-auditing preset configs (single GPU):

```bash
uv run python train.py --config-name pilot_paper_faithful
# HOPE self-mod variant:
uv run python train.py --config-name pilot_selfmod_paper_faithful
```

Notes:
- These presets set `data.batch_size=1` to avoid cross-sample fast-memory sharing.
- Online chunking supports one-token overlap **or** explicit boundary-target mode (`train.online_boundary_targets=true`).
- Optional attention-state carry across chunks is available in training via `train.online_carry_attention_cache=true`.
- The exact sequence/segment/chunk/buffer semantics are documented in `docs/STREAMING_CONTRACT.md`.

Overrides:
- `optim.type=m3` (paper optimizer option)
- `train.steps=...` / `train.device=...`

See `docs/PAPER_COMPLIANCE.md` for full fidelity notes.
See `docs/STREAMING_CONTRACT.md` for the precise streaming/update contract used by this repo.

## Scope Boundaries (Current)
- This repo targets mechanism-auditing fidelity, not full paper-scale results parity.
- Boundary-state gradient-through-write exists as an experimental constrained path; it is not yet treated as production/full-scale paper reproduction.
- Distributed mechanism-auditing path for boundary-target + attention-cache carry is not implemented.

### Pilot (3 B tokens) workflow
1. Ensure TMUX session:
   ```bash
   tmux new -s pilot_train
   ```
2. Launch the long run on `cuda:1` (≈52 h wall clock):
   ```bash
   set -a && source git.env && set +a
   export UV_CACHE_DIR=/tmp/uv-cache UV_LINK_MODE=copy
   uv run python train.py --config-name pilot \
     logging.enabled=true logging.backend=wandb \
     logging.project=nested-learning logging.run_name=pilot-main-$(date +%Y%m%d%H%M%S) \
     train.device=cuda:1
   ```
3. Checkpoints appear in `artifacts/checkpoints/pilot/step_*.pt` every 1 000 steps; the accompanying W&B run captures full telemetry.
4. Copy the final checkpoint, config, logs, and eval JSON/CSV into `artifacts/pilot_release/` for distribution.

## Logging
Set `logging.enabled=true` in Hydra configs (or override via CLI) to send metrics to W&B (default). For local JSON logs, use `logging.backend=json logging.path=logs/run.json`. Sample outputs reside in `logs/` and `artifacts/examples/`.

## Evaluation
- Zero-shot:
  ```bash
  uv run python scripts/eval/zeroshot.py \
  --config configs/hope/mid.yaml \
  --checkpoint checkpoints/mid/step_000100.pt \
  --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
  --tasks all --max-samples 200 --device cuda:0
  ```
  Use `uv run python scripts/eval/zeroshot.py --list-tasks` to display the full benchmark roster (PIQA, HellaSwag, WinoGrande, ARC-E/C, BoolQ, SIQA, CommonsenseQA, OpenBookQA). See `docs/zeroshot_eval.md` for details.
- Needle-in-a-Haystack:
  ```bash
  uv run python scripts/eval/niah.py \
    --config configs/hope/mid.yaml \
    --checkpoint checkpoints/mid/step_000100.pt \
    --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model \
    --context-lengths 2048 4096 8192 --samples-per-length 20
  ```
- Continual-learning forgetting:
  ```bash
  uv run python scripts/eval/continual.py \
    --config configs/hope/mid.yaml \
    --checkpoints checkpoints/mid/step_000050.pt checkpoints/mid/step_000100.pt \
    --segments-yaml configs/data/continual_segments_sample.yaml \
    --batch-size 4 --max-batches 10 --memorize --memorize-steps 2
  ```
  Plot forgetting curves via `uv run python scripts/eval/plot_forgetting.py --continual-json eval/continual_mid.json`.
- Long-context diagnostics:
  ```bash
  uv run python scripts/eval/passkey.py --config configs/hope/pilot.yaml --checkpoint artifacts/checkpoints/pilot/step_230000.pt \
    --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model --samples 64 --memorize

  uv run python scripts/eval/pg19_perplexity.py --config configs/hope/pilot.yaml --checkpoint artifacts/checkpoints/pilot/step_230000.pt \
    --tokenizer-path artifacts/tokenizer/refinedweb_mix/spm_32000_unigram.model --max-samples 64
  ```

Evaluation summaries are written to `eval/` alongside per-task JSON metrics.

### Test-time memorization toggles
Every evaluator supports TITAN-style memorization so you can reproduce test-time adaptation:
```bash
uv run python scripts/eval/zeroshot.py \
  ... \
  --memorize \
  --memorize-steps 2 \
  --memorize-use-correct-answer \
  --memorize-no-reset  # optional: retain updates across samples
  --memorize-paths titan,cms_fast \
  --memorize-surprise-threshold 0.01
```
- `--memorize` turns on the learner with one LMS step per example by default.
- `--memorize-steps` controls the number of adaptation passes per prompt.
- `--memorize-use-correct-answer` injects ground-truth text during memorization for ablations.
- `--memorize-no-reset` carries memories across samples; omit it to reset every question.
- `--memorize-paths` restricts which levels receive teach-signal updates (`titan`, `cms_fast`, or `all`).
- `--memorize-surprise-threshold` gates updates on average teach-signal norm, matching the paper’s surprise trigger.

Memorization metrics (baseline vs adaptive) are emitted alongside task accuracy for easy comparisons.

## Architecture variants
Select the paper-defined variant via `model.block_variant` in Hydra configs:
- `hope_attention` (paper HOPE-Attention): `Attention → CMS` (paper-defined).
- `hope_selfmod` (paper HOPE scaffold): `Self-modifying Titans (Eqs. 83–93; Eq. 91 residual MLP memories) → CMS` with (by default) **fixed q** and **local conv window=4**, plus chunked updates via `model.self_mod_chunk_size` (others) and `model.self_mod_chunk_size_memory` (M_memory). See `docs/PAPER_COMPLIANCE.md` for the “differentiable read / update-pass writes” semantics.
- `hope_hybrid` (legacy): `Attention + TitanMemory + CMS` (exploratory; not paper-defined).
- `transformer` (baseline): `Attention → MLP` (no TITAN/CMS learning updates; useful for Phase 2 comparisons).

Self-modifying Titans knobs (ablation-friendly, paper-aligned):
- `model.self_mod_objective` (`l2` vs `dot`), `model.self_mod_use_rank1_precond` (DGD-like preconditioner), `model.self_mod_use_alpha` (weight-decay/retention gate), `model.self_mod_stopgrad_vhat`, `model.self_mod_momentum`, `model.self_mod_adaptive_q`, `model.self_mod_local_conv_window`.

## Fast state (Nested Learning semantics)
In-context updates can run against a per-context fast state so meta parameters never change:
- `HOPEModel.init_fast_state()` / `TitanOnlyModel.init_fast_state()` returns a `ModelFastState`.
- `MemorizeConfig.use_fast_state=true` (default) requires passing `fast_state` into `memorize_tokens()` / `memorize_sequence()`; evaluation scripts handle this automatically.
- Training can also run update passes against a per-batch fast state via `train.use_fast_state=true` (meta+delta fast state: meta params are learnable; online updates write deltas only). If `data.batch_size>1`, CMS/TITAN fast state is shared across the batch; use `data.batch_size=1` for strict per-context semantics. See `docs/PAPER_COMPLIANCE.md`.

## Releases
Before tagging or announcing a new checkpoint, work through:
- `docs/release_checklist.md` (model/eval artifact release bundle)
- `docs/PACKAGE_RELEASE_CHECKLIST.md` (package/GitHub/PyPI release flow)
- `docs/PYPI_TRUSTED_PUBLISHING.md` (one-time OIDC setup for TestPyPI/PyPI)

Tag pushes (`v*`) automatically publish:
- PyPI/TestPyPI package artifacts (via Trusted Publishing), and
- a GitHub Release entry with wheel, sdist, and `SHA256SUMS.txt` in the Releases tab.
- a GitHub Packages (GHCR) OCI bundle (`nested-learning-dist`) containing `dist/*`.

GitHub Packages note:
- The repo publishes an OCI artifact bundle to GHCR (shown under the Packages tab), not a Python package registry endpoint.
- Python installs should still use PyPI (`pip install nested-learning`).

Example (pull/extract dist artifacts from GHCR):
```bash
docker pull ghcr.io/kmccleary3301/nested-learning-dist:latest
cid=$(docker create ghcr.io/kmccleary3301/nested-learning-dist:latest)
docker cp "$cid:/dist" ./dist_from_ghcr
docker rm "$cid"
```

For versioning semantics and breaking-change expectations, see `docs/VERSIONING_POLICY.md`.

For reproducibility bug reports, use `docs/BUG_REPORT_CHECKLIST.md`.

## Performance & optimizer options
- **Mixed precision:** enable bf16 autocast via `train.mixed_precision.enabled=true train.mixed_precision.dtype=bf16` (already enabled in pilot/mid/target configs).
- **`torch.compile`:** accelerate attention/core loops by toggling `train.compile.enable=true train.compile.mode=max-autotune`; failure falls back to eager unless `train.compile.strict=true`.
- **Muon hybrid (default):** all HOPE configs now set `optim.type=muon`, routing ≥2D tensors through PyTorch 2.9's Muon optimizer while embeddings/norms stay on AdamW. Training logs emit `optim.muon_param_elems` / `optim.adamw_param_elems` so you can confirm the split.
- **Fused AdamW fallback:** override with `optim.type=adamw optim.fused=auto` if Muon is unavailable or if you want to compare against the AdamW ablation in `reports/ablations.md`.
- **Surprise gating:** set `model.surprise_threshold=<float>` to gate all inner updates. By default the surprise metric is the average L2 norm of the (scaled/clipped) teach signal (`model.surprise_metric=l2`); you can also use `loss` or `logit_entropy` for ablations. Evaluation CLIs expose `--memorize-surprise-threshold` for ad-hoc gating.

All Hydra knobs can be overridden from the CLI or composed via config groups (`configs/hope/*.yaml`). Use these flags in tandem with `scripts/run_e2e_smoke.sh` (automation) or `scripts/run_cpu_ddp_smoke.sh` (CPU-only determinism check) to validate releases quickly.

## Documentation & References
- `docs/IMPLEMENTATION_STATUS.md` – current mechanism-level status matrix.
- `docs/PAPER_COMPLIANCE.md` – equation-to-code fidelity notes and explicit boundaries.
- `docs/STREAMING_CONTRACT.md` – exact sequence/segment/chunk/update semantics.
- `docs/release_checklist.md` – release readiness checklist.
- `docs/data_pipeline.md` – large-scale sharding/tokenizer workflow.
- `docs/scaling_guidance.md` – roadmap for expanding data + compute footprints.
- `docs/stage2_plan.md` – Stage 2 architecture + experiment roadmap.
- `docs/PHASE_2_PLAN.md` – detailed Phase 2 execution plan.
- `docs/stage2_progress.md` – progress tracker for the latest faithfulness remediation sprint.
- `docs/experiments_report.md` – draft paper covering completed experiments.
- `docs/future_directions.md` – prioritized roadmap after the initial release.
- `reports/stage2_smoke.md` – exact commands/artifacts for the release-ready smoke workflow.
- `docs/FSDP_SCALING_GUIDE.md` – dual-RTX 6000 Ada instructions for the mid/target FSDP configs.
- `google_papers/` – PDFs/markdown of Nested Learning & TITAN papers.
- `CHANGELOG.md` – user-facing changes per release.

## Contributing
1. Run formatting/tests (`uv run ruff check .`, `uv run pytest`).
2. Document new configs or scripts in the relevant docs under `docs/` and update `CHANGELOG.md`.
3. Open a PR referencing the relevant NL/TITAN spec sections and tests.