Train the smallest autoregressive transformer that can add two 10-digit numbers. Your model receives a tokenized input like 000001234567+000009876543= and must predict the answer one digit at a time: 11111110.
This is a test of algorithmic reasoning. Addition requires learning to carry across digit positions — a simple pattern for humans but surprisingly hard for small neural networks. The challenge is compressing that ability into as few parameters as possible.
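The carry rule the model has to internalize is just schoolbook addition. A minimal sketch of that algorithm (function name and digit ordering are illustrative, not part of the task spec):

```python
def add_digits(a_digits, b_digits):
    """Schoolbook addition over little-endian digit lists, tracking carries."""
    carry = 0
    out = []
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % 10)   # digit written at this position
        carry = s // 10      # carry propagated to the next position
    if carry:
        out.append(carry)
    return out

# 67 + 35 = 102, as little-endian digit lists:
print(add_digits([7, 6], [5, 3]))  # → [2, 0, 1]
```

A network must learn this carry propagation implicitly from examples; long carry chains (e.g. 9999999999 + 1) are exactly where small models tend to fail.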
Submissions must achieve 99% accuracy on a held-out test set (10,000 random addition problems plus 10 edge cases) to qualify. Qualified submissions are then ranked by parameter count — smaller is better.
You submit a single .py file that defines your model and its tokenization scheme. The file must export exactly 5 things:
| Export | Description |
|---|---|
| build_model() | Returns a torch.nn.Module with loaded weights. Must use self-attention and be causal. |
| encode(a, b) | Takes two Python ints and returns a list of token IDs (the prompt up to and including =). |
| decode(tokens) | Takes a list of generated token IDs and returns a Python int (the predicted sum). |
| VOCAB_SIZE | An int. Must be ≤ 256. |
| MAX_OUTPUT_LEN | An int. Maximum number of tokens the model will generate. Must be ≤ 30. |
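A minimal sketch of what a conforming `build_model()` might look like — a tiny causal transformer. All hyperparameters here (width, depth, vocab layout, weight filename) are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 13      # illustrative: digits 0-9, '+', '=', EOS
MAX_OUTPUT_LEN = 14  # illustrative

class TinyAdder(nn.Module):
    def __init__(self, d_model=32, n_heads=2, n_layers=1, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        x = torch.tensor(tokens).unsqueeze(0)            # (1, T)
        h = self.tok(x) + self.pos(torch.arange(x.size(1))).unsqueeze(0)
        # Causal mask: each position attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(h, mask=mask)
        return self.head(h).squeeze(0)                   # (T, VOCAB_SIZE)

def build_model():
    model = TinyAdder()
    # model.load_state_dict(torch.load("weights.pt"))  # trained weights go here
    model.eval()
    return model
```

This sketch satisfies the self-attention and causality requirements; the real engineering challenge is shrinking it while keeping 99% accuracy.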
Numbers are zero-padded to 12 digits and concatenated with `+` and `=` tokens.
```python
# Example encode/decode
encode(1234567890, 9876543210)
# → [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 10, 0, 0, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11]
#    └──── operand A (zero-padded) ────┘  +  └──── operand B (zero-padded) ────┘  =

decode([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # → 11111111100
```
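One possible implementation of this scheme, assuming token IDs 0–9 for digits, 10 for `+`, and 11 for `=` (a sketch; your tokenization may differ as long as it satisfies the export contract):

```python
PLUS, EQ = 10, 11

def encode(a, b):
    """Zero-pad each operand to 12 digits; prompt ends with '='."""
    da = [int(c) for c in str(a).zfill(12)]
    db = [int(c) for c in str(b).zfill(12)]
    return da + [PLUS] + db + [EQ]   # 12 + 1 + 12 + 1 = 26 tokens

def decode(tokens):
    """Map digit tokens back to an int; skip non-digit tokens (e.g. EOS)."""
    digits = [t for t in tokens if t < 10]
    return int("".join(map(str, digits))) if digits else 0
```

Note that 26 prompt tokens sits comfortably under the 35-token limit enforced by the validation pipeline.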
The evaluation harness controls generation. It feeds your encoded prompt to the model, then generates tokens one at a time using greedy decoding (argmax). At each step, the newly predicted token is appended to the context and fed back in.
The model predicts one digit at a time, appending each to the context
```python
# Generation loop (controlled by harness, not your code)
tokens = encode(a, b)                 # prompt tokens
prompt_len = len(tokens)
for _ in range(MAX_OUTPUT_LEN):
    logits = model(tokens)            # forward pass
    next_token = argmax(logits[-1])
    tokens.append(next_token)
    if next_token == EOS:
        break
answer = decode(tokens[prompt_len:])  # only the generated tokens
```

Your model never calls its own generation loop. The harness handles the autoregressive loop to ensure fair evaluation and prevent shortcuts.
Your model is tested on 10,010 addition problems — 10 edge cases (carry chains, boundary values) and 10,000 uniformly random pairs. A problem counts as correct only if the model produces the exact integer answer.
Hit 99% accuracy to qualify, then compete for the smallest model
Qualification: accuracy ≥ 99% (at least 9,910 of 10,010 problems correct)

Ranking (among qualified submissions):
1. Fewest parameters
2. Earlier submission (tiebreak)
Submissions pass through a 7-layer validation pipeline to prevent hardcoded lookup tables and other circumventions:
1. The model must contain `nn.MultiheadAttention` or an equivalent self-attention layer. Pure MLPs are rejected.
2. `encode()` must return ≤ 35 tokens, all in range `[0, VOCAB_SIZE)`.
3. Calling `encode(a, b)` twice with the same inputs must return the same tokens.
4. `decode()` must not inspect global state, import modules, or communicate with `encode()`.
5. All `nn.Parameter` tensors (trainable and frozen) plus non-boolean registered buffers are counted. Weight-tied parameters are deduplicated by `data_ptr()`.
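The counting rule can be approximated as follows — a sketch of the stated policy (deduplicate shared storage via `data_ptr()`, include non-boolean buffers), not the official counter:

```python
import torch
import torch.nn as nn

def count_params(model):
    """Count parameters and non-boolean buffers, deduplicating shared storage."""
    seen, total = set(), 0
    for p in model.parameters():
        if p.data_ptr() not in seen:          # weight-tied tensors counted once
            seen.add(p.data_ptr())
            total += p.numel()
    for b in model.buffers():
        if b.dtype != torch.bool and b.data_ptr() not in seen:
            seen.add(b.data_ptr())
            total += b.numel()
    return total

# Weight tying: embedding and output head share one 12x8 tensor
emb = nn.Embedding(12, 8)
head = nn.Linear(8, 12, bias=False)
head.weight = emb.weight
model = nn.Sequential(emb, head)
print(count_params(model))  # → 96, not 192
```

Weight tying the embedding and output projection is a standard trick for shrinking the count, and per the rule above it is counted only once.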