Train the smallest autoregressive transformer that can add two 10-digit numbers. Your model receives a tokenized input like 000001234567+000009876543= and must predict the answer one digit at a time: 11111110.
This is a test of algorithmic reasoning. Addition requires learning to carry across digit positions — a simple pattern for humans but surprisingly hard for small neural networks. The challenge is compressing that ability into as few parameters as possible.
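The carry rule the model has to internalize is just schoolbook addition. A minimal sketch of that algorithm (function name and digit ordering are illustrative, not part of the task spec):

```python
def add_digits(a_digits, b_digits):
    """Schoolbook addition over little-endian digit lists, tracking carries."""
    carry = 0
    out = []
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % 10)   # digit written at this position
        carry = s // 10      # carry propagated to the next position
    if carry:
        out.append(carry)
    return out

# 67 + 35 = 102, as little-endian digit lists:
print(add_digits([7, 6], [5, 3]))  # → [2, 0, 1]
```

A network must learn this carry propagation implicitly from examples; long carry chains (e.g. 9999999999 + 1) are exactly where small models tend to fail.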
Submissions must achieve 99% accuracy on a held-out test set (10,000 random addition problems plus 10 edge cases) to qualify. Qualified submissions are then ranked by parameter count — smaller is better.
You submit a single .py file that defines your model and its tokenization scheme. The file must export exactly 5 things:
| Export | Description |
|---|---|
| build_model() | Returns a torch.nn.Module with loaded weights. Must use self-attention and be causal. |
| encode(a, b) | Takes two Python ints and returns a list of token IDs (the prompt up to and including =). |
| decode(tokens) | Takes a list of generated token IDs and returns a Python int (the predicted sum). |
| VOCAB_SIZE | An int. Must be ≤ 256. |
| MAX_OUTPUT_LEN | An int. Maximum number of tokens the model will generate. Must be ≤ 30. |
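A minimal sketch of what a conforming `build_model()` might look like — a tiny causal transformer. All hyperparameters here (width, depth, vocab layout, weight filename) are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 13      # illustrative: digits 0-9, '+', '=', EOS
MAX_OUTPUT_LEN = 14  # illustrative

class TinyAdder(nn.Module):
    def __init__(self, d_model=32, n_heads=2, n_layers=1, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        x = torch.tensor(tokens).unsqueeze(0)            # (1, T)
        h = self.tok(x) + self.pos(torch.arange(x.size(1))).unsqueeze(0)
        # Causal mask: each position attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(h, mask=mask)
        return self.head(h).squeeze(0)                   # (T, VOCAB_SIZE)

def build_model():
    model = TinyAdder()
    # model.load_state_dict(torch.load("weights.pt"))  # trained weights go here
    model.eval()
    return model
```

This sketch satisfies the self-attention and causality requirements; the real engineering challenge is shrinking it while keeping 99% accuracy.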
Numbers are zero-padded to 12 digits and concatenated with `+` and `=` tokens.
```python
# Example encode/decode
encode(1234567890, 9876543210)
# → [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 10, 0, 0, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11]
#    └──── operand A (zero-padded) ────┘  +  └──── operand B (zero-padded) ────┘  =

decode([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # → 11111111100
```
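One possible implementation of this scheme, assuming token IDs 0–9 for digits, 10 for `+`, and 11 for `=` (a sketch; your tokenization may differ as long as it satisfies the export contract):

```python
PLUS, EQ = 10, 11

def encode(a, b):
    """Zero-pad each operand to 12 digits; prompt ends with '='."""
    da = [int(c) for c in str(a).zfill(12)]
    db = [int(c) for c in str(b).zfill(12)]
    return da + [PLUS] + db + [EQ]   # 12 + 1 + 12 + 1 = 26 tokens

def decode(tokens):
    """Map digit tokens back to an int; skip non-digit tokens (e.g. EOS)."""
    digits = [t for t in tokens if t < 10]
    return int("".join(map(str, digits))) if digits else 0
```

Note that 26 prompt tokens sits comfortably under the 35-token limit enforced by the validation pipeline.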
The evaluation harness controls generation. It feeds your encoded prompt to the model, then generates tokens one at a time using greedy decoding (argmax). At each step, the newly predicted token is appended to the context and fed back in.
The model predicts one digit at a time, appending each to the context
```python
# Generation loop (controlled by harness, not your code)
tokens = encode(a, b)                 # prompt tokens
prompt_len = len(tokens)
for _ in range(MAX_OUTPUT_LEN):
    logits = model(tokens)            # forward pass
    next_token = argmax(logits[-1])
    tokens.append(next_token)
    if next_token == EOS:
        break
answer = decode(tokens[prompt_len:])  # only the generated tokens
```

Your model never calls its own generation loop. The harness handles the autoregressive loop to ensure fair evaluation and prevent shortcuts.
Your model is tested on 10,010 addition problems — 10 edge cases (carry chains, boundary values) and 10,000 uniformly random pairs. A problem counts as correct only if the model produces the exact integer answer.
Hit 99% accuracy to qualify, then compete for the smallest model
Qualification: accuracy ≥ 99% (at least 9,910 of 10,010 problems correct)

Ranking (among qualified submissions):
1. Fewest parameters
2. Earlier submission (tiebreak)
Submissions pass through a 7-layer validation pipeline to prevent hardcoded lookup tables and other circumventions:
1. The model must contain `nn.MultiheadAttention` or an equivalent self-attention layer. Pure MLPs are rejected.
2. `encode()` must return ≤ 35 tokens, all in range `[0, VOCAB_SIZE)`.
3. Calling `encode(a, b)` twice with the same inputs must return the same tokens.
4. `decode()` must not inspect global state, import modules, or communicate with `encode()`.
5. All `nn.Parameter` tensors (trainable and frozen) plus non-boolean registered buffers are counted. Weight-tied parameters are deduplicated by `data_ptr()`.
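The counting rule can be approximated as follows — a sketch of the stated policy (deduplicate shared storage via `data_ptr()`, include non-boolean buffers), not the official counter:

```python
import torch
import torch.nn as nn

def count_params(model):
    """Count parameters and non-boolean buffers, deduplicating shared storage."""
    seen, total = set(), 0
    for p in model.parameters():
        if p.data_ptr() not in seen:          # weight-tied tensors counted once
            seen.add(p.data_ptr())
            total += p.numel()
    for b in model.buffers():
        if b.dtype != torch.bool and b.data_ptr() not in seen:
            seen.add(b.data_ptr())
            total += b.numel()
    return total

# Weight tying: embedding and output head share one 12x8 tensor
emb = nn.Embedding(12, 8)
head = nn.Linear(8, 12, bias=False)
head.weight = emb.weight
model = nn.Sequential(emb, head)
print(count_params(model))  # → 96, not 192
```

Weight tying the embedding and output projection is a standard trick for shrinking the count, and per the rule above it is counted only once.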