# Parameter Golf Research Tracks

Priority order is dictated by the challenge rules:

1. stay under the `16,000,000` byte artifact cap
2. stay within the `10 minute / 8xH100` training budget for record attempts
3. optimize post-roundtrip `val_bpb`, not pre-quant loss

## Integrated now

- Post-compression-aware training:
  - sampled int8 reconstruction regularizer
  - optional ternary-weight regularizer
  - optional outlier suppression penalty
- Weight sharing / recurrence:
  - shared-block transformer via `NUM_UNIQUE_BLOCKS`
- Sparse attention:
  - optional sliding-window attention via `WINDOW_SIZE`
- Factorized embeddings:
  - optional `EMBED_DIM < MODEL_DIM`
- Hybrid eval-time compute:
  - optional recent-token cache bias during validation / roundtrip eval
- Local proxy iteration:
  - capped validation
  - optional skip of expensive final roundtrip eval
  - proxy sweep launcher
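
The sampled int8 reconstruction regularizer above can be sketched as a quantize-dequantize penalty. This is an illustrative PyTorch sketch under assumed symmetric per-tensor scaling; the function name, sampling scheme, and details are assumptions, not the trainer's actual implementation:

```python
import torch

def int8_reconstruction_penalty(weights, sample_frac=0.25, generator=None):
    """Sampled int8 reconstruction regularizer (sketch).

    Penalizes the gap between each sampled weight tensor and its
    symmetric per-tensor int8 quantize->dequantize reconstruction,
    nudging training toward weights that survive export.
    """
    penalty = weights[0].new_zeros(())
    for w in weights:
        # Sample a fraction of tensors per step to keep the cost low.
        if torch.rand((), generator=generator) > sample_frac:
            continue
        scale = w.detach().abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127)
        recon = (q * scale).detach()  # no gradient through round()
        penalty = penalty + torch.mean((w - recon) ** 2)
    return penalty

# usage sketch:
# loss = task_loss + COMPRESSION_REG_WEIGHT * int8_reconstruction_penalty(params)
```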

## Current knobs

- `NUM_UNIQUE_BLOCKS`
- `WINDOW_SIZE`
- `EMBED_DIM`
- `COMPRESSION_REG_WEIGHT`
- `COMPRESSION_GRID_REG_WEIGHT`
- `COMPRESSION_SCALE_REG_WEIGHT`
- `COMPRESSION_RANK1_REG_WEIGHT`
- `TERNARY_REG_WEIGHT`
- `OUTLIER_REG_WEIGHT`
- `EVAL_CACHE_MIX_WEIGHT`
- `EVAL_BIGRAM_MIX_WEIGHT`
- `EVAL_CACHE_SIZE`
- `FINAL_ROUNDTRIP_EVAL`
- `ROUNDTRIP_VAL_MAX_TOKENS`

## Local proxy reference point

All local comparisons below use the same quick 3090 proxy envelope:

- `MAX_WALLCLOCK_SECONDS=180`
- `TRAIN_BATCH_TOKENS=32768`
- `VAL_MAX_TOKENS=1048576`
- `FINAL_ROUNDTRIP_EVAL=0`
- baseline architecture:
  - `NUM_LAYERS=12`
  - `NUM_UNIQUE_BLOCKS=12`
  - `MODEL_DIM=384`
  - `EMBED_DIM=0`
  - `NUM_HEADS=6`
  - `NUM_KV_HEADS=3`
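
Collected into one launch fragment for convenience (the env-var names come from the lists above; the `train.py` entry point is a placeholder for the actual trainer):

```shell
# Quick 3090 proxy envelope -- placeholder entry point, real knob names.
export MAX_WALLCLOCK_SECONDS=180
export TRAIN_BATCH_TOKENS=32768
export VAL_MAX_TOKENS=1048576
export FINAL_ROUNDTRIP_EVAL=0
export NUM_LAYERS=12 NUM_UNIQUE_BLOCKS=12
export MODEL_DIM=384 EMBED_DIM=0
export NUM_HEADS=6 NUM_KV_HEADS=3
python train.py
```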

## Roundtrip proxy track

Use this when ranking experiments on a more faithful local objective:

- keep the same baseline architecture unless explicitly testing architecture
- enable `FINAL_ROUNDTRIP_EVAL=1`
- keep `ROUNDTRIP_VAL_MAX_TOKENS` capped so the run stays practical on a 3090
- treat this as the local approximation to the actual challenge metric
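
For reference, the conversion from mean token loss to bits-per-byte, assuming `val_loss` is mean cross-entropy in nats per token (the helper name is illustrative):

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """bpb = (loss / ln 2) * (tokens / bytes).

    First convert nats to bits, then re-normalize from per-token to
    per-byte, so tokenizers with different compression ratios (e.g.
    SP1024 vs SP4096) stay comparable on the same text.
    """
    return (mean_loss_nats / math.log(2)) * (total_tokens / total_bytes)

# e.g. a loss of ln(2) nats with 1 token per byte is exactly 1.0 bpb
```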

## Latest findings

- Quick local baseline:
  - run: `baseline3090_20260318_170251`
  - result: `val_bpb=2.0916`, `val_loss=3.4910`
  - total artifact: `6,831,983` bytes
  - interpretation: current local number to beat
- Hybrid eval sidecar, recent-token + bigram continuation bias:
  - run: `sidecar3090_20260318_172524`
  - knobs: `EVAL_CACHE_MIX_WEIGHT=0.03`, `EVAL_BIGRAM_MIX_WEIGHT=0.05`, `EVAL_CACHE_SIZE=16`
  - result: `val_bpb=2.0970`, `val_loss=3.5000`
  - total artifact: `6,810,819` bytes
  - delta vs baseline: `+0.0054 bpb` worse, `21,164` bytes smaller
  - interpretation: close enough to keep around for later tuning, not good enough to become the default path
- Compression-aware baseline, reconstruction regularization `0.01`:
  - run: `compress3090_20260318_174132`
  - result: `val_bpb=2.0943`, `val_loss=3.4954`
  - total artifact: `6,812,935` bytes
  - delta vs baseline: `+0.0027 bpb` worse, `19,048` bytes smaller
  - interpretation: strongest experimental branch so far
- Compression-aware baseline, reconstruction regularization `0.005`:
  - run: `compress3090_half_20260318_1750`
  - result: `val_bpb=2.0928`, `val_loss=3.4930`
  - total artifact: `6,829,073` bytes
  - delta vs baseline: `+0.0012 bpb` worse, `2,910` bytes smaller
  - interpretation: best pre-roundtrip proxy result outside the plain baseline
- Matched roundtrip-proxy baseline:
  - run: `baselinert3090_20260318_181344`
  - exact final roundtrip result: `val_bpb=2.11089617`, `val_loss=3.56464830`
  - total artifact: `6,705,058` bytes
- Matched roundtrip-proxy compression baseline:
  - run: `compressrt3090_20260318_175828`
  - knobs: `COMPRESSION_REG_WEIGHT=0.005`
  - exact final roundtrip result: `val_bpb=2.06085837`, `val_loss=3.48014999`
  - total artifact: `6,839,798` bytes
  - delta vs matched roundtrip baseline: `-0.05003780 bpb`, about `2.37%` better
  - interpretation: compression-aware training is now the leading local research branch when measured on a more faithful objective
- Sparse-attention probe on the winning compression setup:
  - run: `compressrt_sparse512_20260318_1842`
  - knobs: `WINDOW_SIZE=512`, `COMPRESSION_REG_WEIGHT=0.005`
  - exact final roundtrip result: `val_bpb=2.07004634`, `val_loss=3.49566562`
  - delta vs best compression baseline: `+0.00918797 bpb` worse
  - interpretation: not good enough to displace the dense compression-aware path; sparse attention stays experimental for later
- Focused QAT roundtrip sweep around the winning compression point:
  - sweep: `qatrtsweep_20260318_1906`
  - best result in sweep:
    - run: `qatrtsweep_20260318_1906_w0045_o0000`
    - knobs: `COMPRESSION_REG_WEIGHT=0.0045`, `OUTLIER_REG_WEIGHT=0.0`
    - exact final roundtrip result: `val_bpb=2.06804196`, `val_loss=3.49228084`
    - total artifact: `6,814,995` bytes
  - interpretation:
    - tiny outlier regularization did not help on this local roundtrip track
    - none of the focused QAT sweep runs beat the standing best dense compression-aware run at `2.06085837`
    - the dense compression-aware baseline remains the current best local result
- Recurrent/shared-block roundtrip sweep:
  - sweep: `recurtsweep_20260318_1925`
  - tested:
    - `16 layers / 8 unique / embed 0` -> `2.25452146`
    - `18 layers / 6 unique / embed 0` -> `2.28804085`
    - `16 layers / 8 unique / embed 256` -> `2.28260194`
    - `18 layers / 6 unique / embed 256` -> `2.34886036`
  - interpretation:
    - this branch cuts artifact size aggressively, but quality collapses on the current local roundtrip track
    - none of these shapes are close to the dense compression-aware baseline
    - shared-block recurrence stays interesting for the 16 MB objective, but this first pass is not competitive enough to prioritize locally
- Roundtrip sidecar revisit on top of the winning dense compression setup:
  - sweep: `sidecarrtsweep_20260318_1942`
  - best usable result in sweep:
    - run: `sidecarrtsweep_20260318_1942_c0020_b0030_s8`
    - knobs: `EVAL_CACHE_MIX_WEIGHT=0.02`, `EVAL_BIGRAM_MIX_WEIGHT=0.03`, `EVAL_CACHE_SIZE=8`
    - exact final roundtrip result: `val_bpb=2.06132482`, `val_loss=3.48093767`
    - total artifact: `6,864,315` bytes
    - delta vs best dense compression baseline: `+0.00046645 bpb` worse
  - sweep reliability notes:
    - `c0015_b0020_s8` and `c0020_b0020_s8` stopped before a usable roundtrip result was written
    - `c0020_b0020_s16` reached artifact export but never wrote `final_int8_zlib_roundtrip_exact`
  - interpretation:
    - the sidecar branch is the closest secondary idea so far
    - it still did not beat the plain dense compression-aware winner
    - keep it parked as a late-stage add-on, not the current pivot
- Conservative ternary / low-bit sweep on top of the winning dense compression setup:
  - sweep: `ternaryrtsweep_20260318_201412`
  - tested:
    - `TERNARY_REG_WEIGHT=0.0005` -> `2.07311732`
    - `TERNARY_REG_WEIGHT=0.0010` -> `2.07009530`
    - `TERNARY_REG_WEIGHT=0.0020` -> `2.07025558`
    - `TERNARY_REG_WEIGHT=0.0035` -> `2.08786263`
    - `TERNARY_REG_WEIGHT=0.0050` -> `2.07821685`
  - interpretation:
    - native low-bit pressure in this form clearly hurts the local roundtrip metric
    - very small ternary weights degrade less, but still do not approach the current leader
    - do not prioritize ternary shaping again until a stronger baseline exists or the training formulation changes
- Quantization residual-budget sweep on top of the winning dense compression setup:
  - sweep: `residualrtsweep_20260318_203241`
  - tested:
    - `residual_rank=0, residual_budget=0` -> baseline export control for this sweep
    - `residual_rank=1, residual_budget=65536` -> `2.08312093`
    - `residual_rank=1, residual_budget=262144` -> `2.08187280`
    - `residual_rank=1, residual_budget=524288` -> `2.08285302`
    - `residual_rank=1, residual_budget=1048576` -> `2.07731235`
  - interpretation:
    - spending more bytes on rank-1 residual export corrections did not improve the local roundtrip metric
    - the export-side residual mechanism is not currently a better lever than the plain dense compression-aware setup
    - quantization-budget tuning should be deprioritized for now
- Refined sidecar micro-sweep around the prior near-win:
  - sweep: `sidecarrefine_20260318_205219`
  - completed exact results:
    - `cache=0.018, bigram=0.030, size=8` -> `2.08080110`
    - `cache=0.020, bigram=0.028, size=8` -> `2.07489103`
    - `cache=0.020, bigram=0.030, size=8` rerun -> `2.08947255`
    - `cache=0.020, bigram=0.032, size=8` -> `2.07840275`
  - incomplete run:
    - `cache=0.022, bigram=0.030, size=8` reached artifact export but did not write `final_int8_zlib_roundtrip_exact`
  - interpretation:
    - the earlier `2.06132482` sidecar near-win did not reproduce
    - the sidecar branch now looks unstable on the local roundtrip track
    - measuring repeatability is more important than additional sidecar micro-tuning right now
- Corrected wallclock repeatability sweep:
  - sweep: `repeatrtsweepfix_20260318_215301`
  - dense compression-aware runs:
    - `base_a` -> `2.06761597`
    - `base_b` -> `2.07369637`
    - `base_c` -> `2.08956232`
  - sidecar near-win reruns:
    - `side_a` -> `2.05608381`
    - `side_b` -> `2.09377262`
    - `side_c` -> `2.07285932`
  - interpretation:
    - both branches show too much spread on the local `180s` wallclock track
    - the best sidecar rerun did beat the standing leader, but the worst sidecar rerun was much worse
    - the dominant local noise source now looks methodological, not architectural
    - the next step should be a fixed-step local roundtrip track, not more wallclock micro-sweeps
- Fixed-step roundtrip sweep:
  - sweep: `fixedsteprtsweep_20260318_221632`
  - dense compression-aware runs:
    - `base_a` -> `2.04299145`
    - `base_b` -> `2.04299145`
  - sidecar near-win reruns:
    - `side_a` -> `2.04300345`
    - `side_b` -> `2.04300345`
  - interpretation:
    - once wallclock variance is removed, the sidecar branch is effectively identical to the dense baseline
    - the dense compression-aware branch remains the cleanest local control
    - future local search should use fixed-step comparison first, then wallclock only as a secondary sanity check
- Export-aware fixed-step compression probe:
  - sweep: `exportaware_fixedstep_20260318_223456`
  - completed result:
    - `g010_r000` -> `2.04288777`
    - knobs: `COMPRESSION_REG_WEIGHT=0.005`, `COMPRESSION_GRID_REG_WEIGHT=0.10`, `COMPRESSION_RANK1_REG_WEIGHT=0.0`
    - total artifact: `6,663,470` bytes
    - delta vs fixed-step dense control: `-0.00010368 bpb` better
  - execution note:
    - the broader coarse sweep was aborted after the first positive signal to avoid spending more 3090 time on low-probability points
  - interpretation:
    - export-aware grid alignment is the first post-fixed-step change that improved the dense compression-aware control
    - the gain is small, but it is deterministic and points in the right direction
    - the next compression-native pivot should stay inside export-aware regularization, not revisit sidecar or architectural branches
- Scale-aware fixed-step compression sweep:
  - sweep: `scaleaware_fixedstep_20260318_224233`
  - completed results:
    - `g010_s0010` -> `2.04313626`
    - `g010_s0025` -> `2.04358127`
  - interpretation:
    - adding explicit adjacent-scale smoothing made the roundtripped result slightly worse at both tested weights
    - this version of scale-aware pressure does not improve on the grid-aligned winner
    - the next best move is to refine the grid-alignment weight itself, not add more compression-native terms yet
- Grid-refinement fixed-step sweep:
  - sweep: `gridrefine_fixedstep_20260318_225110`
  - completed results:
    - `g0080` -> `2.04396986`
    - `g0120` -> `2.04350611`
  - interpretation:
    - both nearby grid weights regressed versus the `0.10` winner
    - `COMPRESSION_GRID_REG_WEIGHT=0.10` currently looks like a real local optimum on the fixed-step track
    - the next compression-aware pivot should keep `grid=0.10` fixed and test only very small outlier pressure around it
- Tiny outlier sweep on top of the grid-aligned winner:
  - sweep: `gridoutlier_fixedstep_20260318_225946`
  - completed results:
    - `o00010` -> `2.04373218`
    - `o00025` -> `2.04372289`
  - interpretation:
    - even very small outlier pressure still regresses
    - outlier suppression should stay parked unless it becomes tensor-targeted
- Dense iso-byte frontier sweep:
  - sweep: `isobyte_fixedstep_20260318_234805`
  - completed results:
    - `b10` -> `2.02814871` at `9,683,932` bytes
    - `b12` -> `2.05262920` at `11,334,608` bytes
    - `b14` -> `2.03768242` at `13,094,288` bytes
    - `b155` -> `2.00290272` at `13,741,308` bytes
  - interpretation:
    - dense scaling dominates the small-model micro-ideas by a wide margin
    - the current best result no longer sits in the `6.66 MB` regime; it is the larger dense `b155` run
    - the frontier is not monotonic in size alone, so geometry still matters, but the main lesson is clear: negative results gathered while under-spending the byte budget were misleading
    - the next step should stay on the dense high-cap frontier and compare width-vs-depth near the byte ceiling
- High-cap dense frontier:
  - recovered / rerun results:
    - `w608_l12` -> `2.00551677` at `14,371,393` bytes
    - `w624_l12` -> `2.01128088` at `15,024,114` bytes
    - `d576_l14` -> `1.99806297` at `15,222,128` bytes
    - `w640_l12` -> `2.00505534` at `15,658,993` bytes
  - interpretation:
    - depth beat width at roughly the same byte spend in this near-cap regime
    - the first sub-`2.0` local fixed-step result came from the deeper dense model, not the wider one
    - near the byte cap, width is not obviously the best place to spend additional budget
- Tokenizer sanity check on the current best dense recipe:
  - matched local subset controls built from the same `120k` selected-doc prefix
  - SP1024 subset control:
    - `sp1024subsetbest_20260319_020125` -> `1.99806297`
    - total artifact: `15,222,128` bytes
    - dataset stats: `149,659,022` total tokens on the subset
  - SP4096 subset swap on the same trainer:
    - `sp4096best_20260319_015500` -> `1.89591231`
    - total artifact: `16,627,470` bytes
    - dataset stats: `109,783,049` total tokens on the same subset
  - interpretation:
    - moving from SP1024 to SP4096 on the same local subset improved exact roundtrip BPB by `0.10215066`, about `5.11%`
    - the same subset needed about `26.64%` fewer tokens with SP4096, which matches the expected compression benefit
    - the merged `14x576` SP4096 run broke the `16,000,000` byte cap by `627,470` bytes, so it is a strong signal but not yet a submission-shape replacement
    - tokenizer work is no longer purely deferred; it is now a real frontier lever, but it must be co-optimized with model size to stay under the cap
- Iso-byte SP4096 dense sweep:
  - sweep: `sp4096isobyte_fixedstep_20260319_022236`
  - completed results:
    - `l15_d544` -> `1.90194008` at `16,090,675` bytes
    - `l14_d560` -> `1.89329916` at `15,869,071` bytes
    - `l12_d608` -> `1.89424125` at `15,844,603` bytes
  - interpretation:
    - SP4096 is now a cap-compliant win, not just an over-budget curiosity
    - the best cap-legal SP4096 point beat the SP1024 `14x576` control by `0.10476381` bpb, about `5.24%`
    - `l14_d560` is the current best local result overall
    - `l12_d608` is slightly worse on fixed-step BPB but notably faster per step, so it remains a plausible wallclock-oriented backup shape
    - the first deeper SP4096 point (`l15_d544`) pushed just over the cap, which suggests the next useful local refinement is a slightly narrower deeper sweep
- SP4096 frontier refinement:
  - sweep: `sp4096frontier_fixedstep_20260319_024311`
  - completed results:
    - `l15_d528` -> `1.90758423` at `15,313,341` bytes
    - `l16_d512` -> `1.90408017` at `15,413,246` bytes
    - `l15_d540_h10` -> `1.90686721` at `15,890,169` bytes
  - interpretation:
    - both deeper/narrower fallback shapes stayed under the cap but regressed versus `l14_d560`
    - the higher-byte `15x540` compromise also stayed under the cap and still regressed versus `l14_d560`
    - depth is not obviously dead, but this local SP4096 frontier no longer looks like it will be beaten by small shape tweaks alone
    - the next best branch should move off pure shape refinement and onto export-side improvements targeted at the current `l14_d560` control
- Sliding-window eval on top of the SP4096 leader:
  - run: `sp4096slide_20260319_140111`
  - exact final roundtrip result: `1.93999302`
  - total artifact: `15,874,851` bytes
  - delta vs plain SP4096 `l14_d560` leader: `+0.04669386 bpb` worse
  - interpretation:
    - sliding-window eval regressed on this branch
    - it moved the same `14x560` SP4096 recipe from `1.89329916` to `1.93999302`, about `2.47%` higher bpb
    - unlike the earlier mixed-quant combo, this kept the proven 8-bit export path intact, so the regression is attributable to eval alone
    - sliding-window eval should stay experimental, not part of the control path, unless a different stride/sequence-length setup changes the result
- Export-side MLP permutation on the SP4096 leader:
  - run: `sp4096mlpperm_20260319_150137`
  - same-checkpoint baseline export:
    - `1.89329916` at `15,874,851` bytes
  - MLP-hidden-unit permuted export:
    - `1.89331845` at `15,853,440` bytes
  - interpretation:
    - this exact symmetry-preserving MLP permutation was almost quality-neutral but slightly worse on BPB
    - it did reduce artifact size by `21,411` bytes, so the ordering idea is not useless, but this heuristic is not yet a score winner
    - export-side permutation should stay open as a compression trick, but not as the next primary optimization lane
- Tensor sensitivity mapping on the SP4096 leader checkpoint:
  - run: `sp4096mlpperm_20260319_150137.sensitivity`
  - baseline export:
    - `1.89329916` at `15,874,851` bytes
  - strongest measured single-tensor protections in the first top-8 scan:
    - `tok_emb.weight` -> `1.89269052` at `17,723,296` bytes
    - `blocks.0.mlp.proj.weight` -> `1.89225616` at `16,842,104` bytes
    - `blocks.1.mlp.proj.weight` -> `1.89284998` at `16,835,628` bytes
  - interpretation:
    - `tok_emb.weight` and early `mlp.proj.weight` tensors are the most quantization-sensitive tensors seen so far
    - `mlp.fc.weight` rows are much less worth protecting than the matching `mlp.proj.weight`
    - full fp16 protection is far too byte-expensive for the gains measured here
    - the next heterogeneous export work should target partial / low-rank / selective protection of projection weights, not blanket keep-float exceptions
- Targeted residual allocation on the SP4096 leader checkpoint:
  - run: `sp4096mlpperm_20260319_150137.targetedresid`
  - baseline export in this harness:
    - `1.89331212` at `15,808,930` bytes
  - targeted variants:
    - `resid_all_mlp_proj` -> `1.89312262` at `15,855,678` bytes
    - `resid_early_mlp_proj` -> `1.89309741` at `15,822,543` bytes
    - `resid_all_attn_proj` -> `1.89330189` at `15,841,186` bytes
    - `resid_early_combo` -> `1.89308554` at `15,831,628` bytes
  - interpretation:
    - concentrating the same rank-1 residual budget on early projection tensors is better than the default global allocation
    - the best variant so far is the early combo of `mlp.proj` + `attn.proj`, improving BPB by `0.00022658` for only `22,698` extra bytes
    - this is still a tiny gain, but it is a cleaner positive result than the first MLP permutation heuristic
    - the next practical step is to fold targeted residual allocation into the trainer/export path and re-run the SP4096 leader with it enabled
- End-to-end targeted residual rerun:
  - run: `sp4096targeted_20260319_155429`
  - config:
    - same SP4096 `14x560` dense leader
    - `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
  - exact final roundtrip result: `1.89309428`
  - total artifact: `15,875,631` bytes
  - delta vs prior SP4096 leader: `-0.00020488 bpb` better, `+6,560` bytes
  - interpretation:
    - the checkpoint-level targeted residual gain survives end to end
    - this is the first export-side change after the SP4096 pivot that produced a real end-to-end improvement on the trusted local track
    - the gain is small, but it is clean and comes at a tiny artifact cost
- End-to-end targeted residual budget refinement:
  - runs:
    - `sp4096targeted_b131072_20260319_1620`
    - `sp4096targeted_b98304_20260319_1633`
  - completed results:
    - `budget=131072` -> `1.89308887` at `15,940,119` bytes
    - `budget=98304` -> `1.89306630` at `15,907,581` bytes
  - interpretation:
    - the first targeted-residual win was real and continues to improve with a modest budget increase
    - `98304` is currently the best end-to-end targeted-residual budget under the cap
    - pushing the budget to `131072` still helps relative to the old `65536` point, but not as much as the tighter `98304` setting
- Training-side compression regularizer refinement on the SP4096 targeted-residual leader:
  - run: `sp4096compw0045_20260319_1650`
  - config:
    - same SP4096 `14x560` control
    - `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
    - `INT8_RESIDUAL_BUDGET_BYTES=98304`
    - `COMPRESSION_REG_WEIGHT=0.0045`
  - exact final roundtrip result: `1.89258040`
  - total artifact: `15,906,874` bytes
  - delta vs prior `budget=98304` leader: `-0.00048590 bpb` better, `-707` bytes
  - interpretation:
    - this is the first materially larger gain after the SP4096 pivot in a while
    - the old `COMPRESSION_REG_WEIGHT=0.005` point was not locally optimal for the stronger SP4096 + targeted-residual regime
    - the next sensible move is to bracket downward around `0.0045` before returning to export-only micro-optimization
- Compression-weight bracket follow-up:
  - runs:
    - `sp4096compw0040_20260319_1655`
    - `sp4096compw00425_20260319_1700`
    - `sp4096compw00475_20260319_1710`
  - completed results:
    - `0.0040` -> `1.89271463` at `15,907,465` bytes
    - `0.00425` -> `1.89293788` at `15,906,920` bytes
    - `0.00475` -> `1.89302752` at `15,907,662` bytes
  - interpretation:
    - both lower-side checks and the upper-side `0.00475` check are worse than the `0.0045` leader
    - the bracket is effectively closed around `COMPRESSION_REG_WEIGHT=0.0045` on the trusted local SP4096 track
    - further local search should stop treating compression-weight micro-sweeps as the primary frontier lever
- Targeted residual + MLP permutation combo on the SP4096 leader checkpoint:
  - sweep:
    - `sp4096perm_targeted_b65536_20260319_161528`
    - `sp4096perm_targeted_b131072_20260319_161603`
    - `sp4096perm_targeted_b196608_20260319_161639`
    - `sp4096perm_targeted_b262144_20260319_161715`
  - best under-cap result:
    - `budget=131072`, targeted residual baseline export -> `1.89308887` at `15,940,119` bytes
    - `budget=131072`, targeted residual + MLP permutation -> `1.89307739` at `15,917,517` bytes
  - interpretation:
    - the MLP permutation heuristic becomes useful once paired with the stronger targeted residual allocation
    - at `131072` residual-budget bytes, permutation both improves BPB and saves `22,602` bytes versus the non-permuted targeted export
    - around the tighter end-to-end winner, permutation is no longer clearly helpful:
      - at `98304`, it slightly regressed BPB
      - at `114688`, it was nearly neutral but still not clearly superior to the plain `98304` targeted export
    - larger targeted budgets (`196608`, `262144`) do not improve enough to justify the extra bytes
    - this is now the best checkpoint-only export setting seen so far for the SP4096 `14x560` control
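
The targeted-residual exports above store a low-rank correction of the int8 quantization error for selected tensors. A minimal NumPy sketch of the rank-1 version, assuming symmetric per-tensor scaling; the real export path's budget accounting and tensor selection are more involved:

```python
import numpy as np

def int8_with_rank1_residual(w):
    """Symmetric per-tensor int8 quantization plus a rank-1 residual
    correction of the quantization error (sketch of the idea behind
    INT8_TARGETED_RESIDUAL_MODE; the actual export may differ)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    err = w - q.astype(np.float32) * scale
    # Best rank-1 approximation of the error (Eckart-Young theorem).
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    u1 = (u[:, :1] * s[0]).astype(np.float16)  # store u*s and v in fp16
    v1 = vt[:1, :].astype(np.float16)
    return q, scale, u1, v1

def dequantize(q, scale, u1, v1):
    return (q.astype(np.float32) * scale
            + u1.astype(np.float32) @ v1.astype(np.float32))
```

The residual costs roughly `2 * (rows + cols)` fp16 bytes per protected tensor, which is why concentrating the budget on a few sensitive projection tensors beats spreading it globally.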

## Current leader

- `sp4096compw0045_20260319_1650`
- dense attention, no sidecar, no recurrence, no factorized embedding
- `VOCAB_SIZE=4096`, tied embeddings
- `COMPRESSION_REG_WEIGHT=0.0045`
- `COMPRESSION_GRID_REG_WEIGHT=0.10`
- `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
- `INT8_RESIDUAL_BUDGET_BYTES=98304`
- fixed-step exact final roundtrip result: `val_bpb=1.89258040`
- total artifact: `15,906,874` bytes
- best wallclock-track reference remains `compressrt3090_20260318_175828` at `2.06085837`

## Recovered H100 findings

- Runpod grant session used `8x NVIDIA H100 80GB HBM3` on pod `awrxpa5ou3rqcx`
- recovered artifacts and notes are saved in:
  - [runpod_h100_session_20260320_recovery.md](/C:/Users/Justice/Desktop/FunProject/logs/runpod_h100_session_20260320_recovery.md)
  - [h100send_sp1024_d576_l14_20260320_013957.recovered.txt](/C:/Users/Justice/Desktop/FunProject/logs/h100send_sp1024_d576_l14_20260320_013957.recovered.txt)
  - [h100send_sp1024_9x512_compgrid_20260320_015442.partial.txt](/C:/Users/Justice/Desktop/FunProject/logs/h100send_sp1024_9x512_compgrid_20260320_015442.partial.txt)
- Recovered complete H100 run:
  - `h100send_sp1024_d576_l14_20260320_013957`
  - full published `fineweb10B_sp1024`
  - `14x576`, dense attention, `COMPRESSION_REG_WEIGHT=0.005`, `COMPRESSION_GRID_REG_WEIGHT=0.10`
  - timed stop: `step 6821/20000` at `600012ms`, `step_avg 87.97ms`
  - pre-quant metric at stop: `val_bpb 1.1939`
  - exact post-roundtrip metric: `final_int8_zlib_roundtrip_exact val_bpb 1.19816494`
  - artifact: `30,904,580` bytes
- Interpretation of the recovered complete H100 run:
  - the training-side compression/export-aware branch transferred strongly on real `8xH100`
  - score was leaderboard-competitive on BPB alone
  - artifact size was massively over the `16,000,000`-byte cap, so the run is invalid as a submission
- Recovered partial H100 run:
  - `h100send_sp1024_9x512_compgrid_20260320_015442`
  - full published `fineweb10B_sp1024`
  - baseline-sized `9x512` shape with `COMPRESSION_REG_WEIGHT=0.005`, `COMPRESSION_GRID_REG_WEIGHT=0.10`
  - healthy through `step 1000/20000`
  - throughput stabilized around `44.5ms/step`, which is essentially baseline-class
  - final metric was lost when Runpod stopped and removed the pod for low balance
- H100 session takeaway:
  - the project is not failing on quality
  - it is failing on cap discipline when the model is allowed to grow too large
  - the first real submission-shaped H100 rerun target is now the smaller `9x512` compression-aware branch, not another oversized near-cap dense branch

## Recovered 1x H100 findings

- Follow-up Runpod session used `1x H100 SXM`
- all remote outputs from this session were copied back locally before teardown:
  - [h1001x_sp1024_9x512_compgrid_20260320_030259](/C:/Users/Justice/Desktop/FunProject/logs/remote/h1001x_sp1024_9x512_compgrid_20260320_030259)
  - [h1001x_sp1024_9x512_plain_20260320_031944](/C:/Users/Justice/Desktop/FunProject/logs/remote/h1001x_sp1024_9x512_plain_20260320_031944)
  - [h1001x_sp1024_9x512_targeted_20260320_033606](/C:/Users/Justice/Desktop/FunProject/logs/remote/h1001x_sp1024_9x512_targeted_20260320_033606)
- Published `fineweb10B_sp1024` download on `1x H100` took about `2m 22s`
- Complete legal full-data runs on published `SP1024`:
  - `h1001x_sp1024_9x512_compgrid_20260320_030259`
    - `9x512`, dense attention
    - `COMPRESSION_REG_WEIGHT=0.005`
    - `COMPRESSION_GRID_REG_WEIGHT=0.10`
    - `INT8_RESIDUAL_RANK=1`
    - `INT8_RESIDUAL_BUDGET_BYTES=65536`
    - stop: `step 1849/20000` at `600103ms`, `step_avg 324.56ms`
    - exact post-roundtrip metric: `final_int8_zlib_roundtrip_exact val_bpb 1.31756308`
    - artifact: `14,847,999` bytes
  - `h1001x_sp1024_9x512_plain_20260320_031944`
    - same `9x512` shape, no compression/grid regularization
    - `INT8_RESIDUAL_RANK=1`
    - `INT8_RESIDUAL_BUDGET_BYTES=65536`
    - stop: `step 1871/20000` at `600162ms`, `step_avg 320.77ms`
    - exact post-roundtrip metric: `final_int8_zlib_roundtrip_exact val_bpb 1.31693524`
    - artifact: `14,877,547` bytes
  - `h1001x_sp1024_9x512_targeted_20260320_033606`
    - same `9x512` shape, no compression/grid regularization
    - `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
    - `INT8_RESIDUAL_BUDGET_BYTES=98304`
    - stop: `step 1871/20000` at `600337ms`, `step_avg 320.69ms`
    - exact post-roundtrip metric: `final_int8_zlib_roundtrip_exact val_bpb 1.31661720`
    - artifact: `14,912,837` bytes
- 1x H100 takeaway:
  - all three runs stayed safely under the `16,000,000`-byte cap
  - the plain `9x512` full-data control slightly beat the compression/grid variant
  - targeted residual allocation then slightly beat the plain control end to end
  - current best recovered legal full-data H100 run is:
    - `h1001x_sp1024_9x512_targeted_20260320_033606`
    - `final_int8_zlib_roundtrip_exact val_bpb 1.31661720`
    - `14,912,837` bytes
  - this is not leaderboard-competitive, but it is a real legal full-data H100 datapoint and a stronger remote control than the previously lost partial run
  - the local SP4096 / compression-grid story does not transfer mechanically:
    - on full-data `SP1024 1xH100`, compression/grid lost to the plain control
    - export-side targeted residual allocation still transferred as a small positive

## Regime correction

- The trusted local dense control is now in the near-cap regime, not the old `6.66 MB` regime.
- That is why the dense iso-byte and high-cap frontier sweeps changed the project direction so much.
- Many earlier negative results were gathered in an under-byte-spent regime and should not be treated as globally final.
- The local SP4096 winner was built on a retokenized `120k` selected-doc subset, not a published full challenge export.
- That makes it a strong research signal, but not a direct H100 submission path until a full published SP4096 export exists or is rebuilt with enough compute/time to make that legitimate.
- The trustworthy questions now are:
  - how should the remaining byte budget be spent near the cap?
  - which export-aware or tokenizer-aware changes still help once the dense control is already strong?
  - how should the remaining cap headroom be spent inside the stronger SP4096 regime?

## Immediate next step when credits return

- Freeze the local SP4096 branch as the research leader, not the immediate deployment target
- Keep `COMPRESSION_GRID_REG_WEIGHT=0.10`
- Keep the local SP4096 export settings fixed at:
  - `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
  - `INT8_RESIDUAL_BUDGET_BYTES=98304`
  - `COMPRESSION_REG_WEIGHT=0.0045`
- Treat the recovered H100 `14x576` result as proof of quality transfer, but not as a legal shape
- Treat the recovered `1x H100` `9x512 + targeted residual` run as the current legal full-data remote control
- First rerun target on the next H100 session:
  - full-data `SP1024`
  - `9x512`
  - `INT8_TARGETED_RESIDUAL_MODE=early_proj_combo`
  - `INT8_RESIDUAL_BUDGET_BYTES=98304`
  - only reintroduce compression/grid if a new full-data result justifies it
- Only return to larger H100 shapes after a cap-legal run is in hand
- Continue ranking ideas by `final_int8_zlib_roundtrip_exact val_bpb`, but separate:
  - local subset research winners
  - full-data H100 submission-shaped candidates

## Non-negotiable remote-run rule

- Every paid remote run must save results locally before the pod is allowed to die.
- Minimum required pullback for every H100 / paid Runpod session:
  - the full remote text log
  - the compressed artifact file (`final_model.int8.ptz`) if one was produced
  - any auxiliary metadata needed to interpret the run (`submission.json`, driver log, launch script, exact env/config)
- This applies even when the run is obviously bad, over-cap, or incomplete.
- If balance is low, results must be copied down before starting another run.
- We do not spend money again on remote compute unless the recovery path is part of the run procedure.
- The failure from the first grant session is now a process lesson:
  - one run produced a strong but over-cap H100 result
  - a second run was promising but got cut off
  - because the pod died before a local pullback, the final artifact and full second-run log were lost
  - that cannot happen again
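
A dry-run sketch of the minimum pullback; the host, remote path, and log file name are placeholders (only `final_model.int8.ptz` and `submission.json` are named in the rule above). Review the printed commands, then execute them:

```shell
# Dry-run pullback sketch -- placeholder pod coordinates.
POD="root@pod.example"
REMOTE_OUT="/workspace/out"
LOCAL_DIR="./logs/remote"
for f in run.log final_model.int8.ptz submission.json; do
  echo "scp $POD:$REMOTE_OUT/$f $LOCAL_DIR/"
done
```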

## Next experiments

- SP4096 frontier refinement:
  - parked after `15x528`, `16x512`, and `15x540` all lost to `14x560`
  - revisit only if a different head geometry or export path makes deeper shapes more attractive
- H100 rerun target:
  - highest-priority remote follow-up is now the legal `SP1024 9x512 + targeted residual` branch
  - use the plain `SP1024 9x512` run as the remote control
  - do not spend new H100 credits on the over-cap `14x576` branch unless the export path changes drastically
- Export-side symmetry-aware permutation:
  - initial MLP-only pass gave a tiny size win but slightly worse BPB
  - the same heuristic becomes promising once combined with targeted residual allocation
  - do not prioritize broader permutation heuristics until the `budget=131072` targeted export is confirmed end to end
- Tensor sensitivity mapping / heterogeneous export allocation:
  - first pass complete on the top quantized tensors of the SP4096 leader
  - targeted residual allocation on early projection tensors is the first promising sub-cap result
  - `INT8_RESIDUAL_BUDGET_BYTES=98304` is now the best confirmed end-to-end budget under the cap
  - avoid full keep-float unless the byte cost can be offset elsewhere
- Export-aware compression regularizer:
  - continue aligning sampled training-time regularization with the actual export path
  - hold `COMPRESSION_GRID_REG_WEIGHT=0.10` fixed unless new evidence suggests otherwise
- Scale-aware compression regularizer:
  - parked for now after the first two weights regressed
  - revisit only if a different formulation of scale entropy or scale clustering becomes compelling
- Fixed-step compression sweeps:
  - keep using the fixed-step roundtrip track as the local ranking metric
  - only move promising compression-native changes back onto the 180s wallclock track later
- Sidecar branch is parked:
  - fixed-step results say it is not moving the needle in a reliable way
  - do not spend more 3090 time on sidecar micro-tuning for now
- Other previously tested branches remain parked:
  - export-side residual-budget tuning did not help
  - sparse attention did not help
  - shared-block recurrence did not help
- Low-bit shaping remains parked:
  - revisit only if the training objective changes materially or H100 results suggest a different regime

## Medium-term work

- Dense winner + sidecar + low-bit combined into one trainer once the individual branches are measured cleanly
- Global/shared codebook quantization across layers
- Basis-generated per-layer weights or hypernetwork-style weight generation
- Test-time adaptation with strict reset semantics
- Token-adaptive recurrent depth / halting policy
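
Global/shared codebook quantization could start as plain scalar k-means over all layers' weights. A toy sketch of the idea only; a real scheme would also need per-tensor scales and entropy coding of the indices to realize any byte savings:

```python
import numpy as np

def fit_shared_codebook(tensors, k=256, iters=25, seed=0):
    """Fit one k-entry scalar codebook across all layers with plain
    1-D k-means (toy sketch of global/shared codebook quantization)."""
    x = np.concatenate([t.ravel() for t in tensors]).astype(np.float32)
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        # Assign every weight to its nearest center, then re-center.
        idx = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            sel = x[idx == j]
            if sel.size:
                centers[j] = sel.mean()
    return np.sort(centers)

def encode(t, centers):
    # Each weight is replaced by the index of its nearest codebook entry.
    return np.abs(t[..., None] - centers).argmin(axis=-1).astype(np.uint8)
```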

## Deferred until the model is stronger

- full tokenizer redesign beyond the SP1024 vs SP4096 sanity branch
- aggressive code-size golf
- heavy hyperparameter brute force
- additional H100 spend without enough balance to finish and preserve logs
