OpenAI opened a public ML competition on March 18th that asks participants to train the best possible language model under a hard constraint: the entire artifact, weights and code combined, must fit in 16 megabytes. Training gets 10 minutes on eight H100 GPUs. That's it.
The competition is called Parameter Golf, and it is the first round of what OpenAI is branding the Model Craft Challenge. There will be more rounds testing different ML skills. But this one zeroes in on pretraining efficiency, and the framing is unusually honest about its dual purpose: find clever engineers, then hire them.
Code golf for the LLM era
The setup is borrowed directly from the NanoGPT Speedrunning community, which has spent the past couple of years competing to train a 124M-parameter model to a target loss as fast as possible on 8xH100s. Parameter Golf inverts the problem. Instead of optimizing for speed at a fixed model size, you're optimizing for compression quality at a fixed artifact size. The metric is bits per byte on a FineWeb validation set, which is a tokenizer-agnostic measure of how well your model predicts web text.
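Bits per byte is the model's cross-entropy re-expressed against the raw text size, which is what makes it tokenizer-agnostic: a fatter vocabulary lowers loss per token but each token covers more bytes, and the ratio washes that out. A minimal sketch of the conversion, with the per-token figures invented purely for illustration:

```python
import math

def bits_per_byte(loss_nats: float, n_bytes: float) -> float:
    """Convert cross-entropy (in nats, over some span of text) to
    bits per byte of that span's raw UTF-8 encoding."""
    return loss_nats / math.log(2) / n_bytes

# Hypothetical numbers: a model averaging 2.55 nats per token on text
# that averages 3 bytes per token lands near 1.23 bits per byte,
# i.e. in the neighborhood of the baseline score.
print(bits_per_byte(2.55, 3.0))
```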
The 16MB cap is decimal, not binary (16,000,000 bytes, not 16 MiB). And it covers everything: model weights, code, tokenizer. No network calls during evaluation. The artifact has to be fully self-contained.
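To make the cap concrete, here's the back-of-the-envelope arithmetic on how many weights fit at common storage precisions. This is a rough sketch: it ignores the bytes that code and tokenizer consume, which also count against the cap.

```python
# Weight budgets under the decimal 16 MB artifact cap.
CAP_BYTES = 16_000_000

for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2),
                              ("int8", 1), ("4-bit", 0.5)]:
    max_params = CAP_BYTES / bytes_per_param
    print(f"{name}: ~{max_params / 1e6:.0f}M params max")
```

At fp16 you get roughly 8M parameters; at 4 bits per weight, roughly 32M. That spread is the whole case for quantization-aware approaches here.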
What makes this interesting is what's left unconstrained. Architecture, layer count, tokenizer design, test-time compute tricks, quantization schemes: all fair game. The GitHub repo explicitly lists depth recurrence, aggressive parameter tying, low-rank training, QAT, bitnets, and novel tokenizers as directions they expect participants to explore. If you've read about neural scaling laws, this is essentially L(N) optimization: lowest loss at a fixed parameter count, with data unconstrained and compute only loosely constrained by the 10-minute clock.
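Depth recurrence is a good illustration of why the artifact cap and model capacity aren't the same thing: you can reuse one block's weights at every layer, paying for depth in compute rather than bytes. A toy numpy sketch of the idea (a real block would be attention plus an MLP with normalization, not a single matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0.0, 0.02, size=(d, d))  # ONE block's weights, stored once

def block(h, W):
    # Toy residual update standing in for a full transformer block
    return h + np.tanh(h @ W)

h = rng.normal(size=(1, d))
for _ in range(8):  # 8 layers of effective depth, 1 layer of parameters
    h = block(h, W)
```

The artifact stores a single d-by-d matrix, but the forward pass is eight blocks deep.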
The baseline isn't great
The current leaderboard has exactly one entry. OpenAI's naive baseline scores 1.2244 bits per byte (lower is better) using a 9-layer transformer with 512-dimensional embeddings, a 1024-token vocabulary, tied embeddings, and 4 KV heads. A second entry in the "non-record" track, a 4-hour training run by OpenAI researcher Will DePue, scores 1.2074. So even with 24x the compute budget, the improvement is modest.
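For a sense of scale, here's a rough parameter count for a transformer with the baseline's stated dimensions. The head dimension, query-head count, and MLP width are my assumptions, not published numbers; the point is the accounting, not the exact total.

```python
def transformer_params(d=512, n_layers=9, vocab=1024,
                       n_kv_heads=4,               # stated by OpenAI
                       head_dim=64, n_q_heads=8,   # ASSUMED
                       ffn_mult=4):                # ASSUMED
    q_dim = n_q_heads * head_dim
    kv_dim = n_kv_heads * head_dim
    attn = d * q_dim + 2 * d * kv_dim + q_dim * d  # Wq, Wk, Wv, Wo (GQA)
    mlp = 2 * d * (ffn_mult * d)                   # up- and down-projection
    embed = vocab * d                              # tied: counted once
    return n_layers * (attn + mlp) + embed

print(transformer_params())  # ~26.5M under these assumptions
```

Notably, if these assumed widths were right, fp16 storage alone (~53 MB) would blow the cap, so the real baseline must use slimmer blocks, fewer-than-16-bit weights, or both.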
That gap (or lack of one) tells you something. The binding constraint here is really the 16MB artifact, not the 10-minute clock. You can throw more H100-hours at training, but if your model architecture doesn't compress well, the extra compute doesn't buy much. Submissions have to beat the existing record by at least 0.005 nats with p < 0.01 statistical significance, which is a nice touch borrowed from the NanoGPT Speedrun rules.
What they're actually after
OpenAI chief research officer Mark Chen told Inc. that the core question is whether applicants can come up with creative solutions in a sandbox setting. He framed it as testing the kind of problem-solving OpenAI researchers do daily in pretraining, comparing it to building "an efficient rocket ship." Chen has been saying for a while that he'd rather hire inherently creative people and teach them ML than the other way around.
The company plans to hire a small cohort of early-career researchers in June, targeting undergrads, recent graduates, and Olympiad-type competitors. They point to DePue as a model for the kind of background they're looking for: he dropped out of college in 2022 after selling a company he'd co-founded in high school, learned ML largely from Andrej Karpathy's YouTube lectures, and now runs his own research team at OpenAI.
It's a recruiting play, and OpenAI isn't hiding it. The participant form collects contact info and backgrounds. The competition page says exceptional participants may "stand out to OpenAI researchers and recruiters." The subtext is not subtle.
The talent war context
This arrives during an intense period of AI researcher poaching. Meta has reportedly offered compensation packages worth up to $300 million to lure top researchers away from OpenAI. Google DeepMind, Anthropic, and xAI are all competing for the same small pool. Running a public competition to surface non-traditional talent is a cheaper, broader net. Whether a 16MB model challenge actually identifies the people who'd thrive doing frontier pretraining research at scale is... an open question. But it's a more interesting filter than a LeetCode interview.
The compute grant question
OpenAI is distributing up to $1 million in compute credits through RunPod, their infrastructure partner for this challenge. Grants are available on request throughout the competition period, which runs through April 30th. RunPod has a prebuilt template that drops you into an environment with the repo and dependencies ready to go.
An 8xH100 box on RunPod costs around $20/hour. If you're iterating seriously, that adds up fast, and the FAQ acknowledges this tension without fully resolving it. You're allowed to tune hyperparameters across many runs (that's fine), but OpenAI reserves the right to disqualify submissions that feel like they're sneaking in external compute, like brute-forcing seeds. The line between extensive hyperparameter search and unfair external compute is, by their own admission, blurry.
Does this matter beyond recruiting?
Maybe. The techniques that win here (extreme model compression, efficient tokenization, creative use of quantization) do have real applications: running useful models on edge devices, phones, and embedded systems requires exactly this kind of engineering. And the open submission format means the winning approaches will be public. All submissions go through GitHub PRs with full training logs, code, and writeups.
But I'd temper expectations about scientific breakthroughs. Optimizing a model to predict web text in 16MB is a constrained engineering puzzle. It rewards ingenuity, sure, but whether the skills transfer to frontier pretraining at 100B+ parameters is unclear. Chen's rocket ship analogy is apt in one sense: both involve efficiency under constraints. Whether building a bottle rocket teaches you to build a Saturn V is the part he leaves out.
The competition closes April 30th. Anyone 18 or older in OpenAI API-supported countries can participate. The Discord channels (#parameter-golf-discussions) are already active. First real submissions should start appearing within days.