A team of researchers has released talkie-1930-13b, a 13-billion-parameter language model trained on 260 billion tokens of English-language text published before December 31, 1930. The project, led by Nick Levine, David Duvenaud, and Alec Radford, went public this month with model weights on Hugging Face and code on GitHub.
A 1930 knowledge cutoff
Books, newspapers, scientific journals, patents, case law: everything in the corpus predates the US public-domain cliff at the end of 1930. The authors chose that cutoff because it is where copyright lapses, not for any deeper reason.
The point of building an LM that doesn't know about Roosevelt isn't nostalgia. According to the project post, vintage models are interesting because they're contamination-free by construction. Modern LLMs trained on web scrapes have likely seen most benchmark questions, along with answer keys floating around in forums and Wikipedia edits. A 1930-cutoff model can't have memorized the answer to a question about the moon landing, which makes it useful for probing whether models can reason toward conclusions or just retrieve them.
Roosevelt slipped through anyway
The trickier part is keeping the corpus actually clean. An earlier 7B prototype could name New Deal legislation by year, including bills passed after 1933. The 13B version is "additionally aware of some details related to World War II and the immediate postwar order," the authors write, citing the United Nations and the division of Germany as examples. So: a 1930 newspaper, with a 1945 footnote glued to the back.
The team built a document-level n-gram anachronism classifier to filter the training data. It's imperfect and they say so. Editorial introductions, footnotes added decades later, and modern documents with broken date metadata all leak through. New classifiers are in development for the next version.
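The classifier itself isn't in the post, but the basic idea is easy to sketch: score each document by how many of its n-grams land on a blocklist of terms that shouldn't exist before 1931, and drop anything above a threshold. Everything in the sketch below, the seed list, the scoring, the cutoff, is assumed for illustration rather than taken from the project's code.

```python
# Minimal sketch of a document-level n-gram anachronism filter.
# The term list and threshold are illustrative, not the project's actual values.
import re

# Hypothetical seed list: terms and phrases that should not occur in pre-1931 text.
ANACHRONISTIC_NGRAMS = {
    "radar", "new deal", "united nations", "cold war",
    "world war ii", "nuclear weapon",
}

def ngrams(tokens, n):
    """Yield all contiguous n-grams from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def anachronism_score(document: str, max_n: int = 3) -> float:
    """Fraction of n-grams (n = 1..max_n) that match the blocklist."""
    tokens = re.findall(r"[a-z']+", document.lower())
    hits = total = 0
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            total += 1
            if " ".join(gram) in ANACHRONISTIC_NGRAMS:
                hits += 1
    return hits / total if total else 0.0

def keep_document(document: str, threshold: float = 1e-4) -> bool:
    """Drop a document whose anachronism score exceeds the (made-up) threshold."""
    return anachronism_score(document) <= threshold
```

A list-based score like this is cheap enough to run over hundreds of billions of tokens, which presumably matters here; it also makes the leakage the authors describe unsurprising, since a short modern footnote stapled to a long nineteenth-century book barely moves a document-level score.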
OCR is the bigger drag
Buried further down is a more interesting finding: training on text transcribed by conventional OCR delivers only 30% of the learning efficiency of training on human-transcribed text. Regex cleaning of the OCR output lifts that to 70%, which still leaves a sizeable gap, and a sizeable chunk of compute eaten by garbled inputs.
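The post doesn't spell out the regex rules, but the usual suspects for scanned pre-1930 print are end-of-line hyphenation, hard line breaks mid-paragraph, and stray non-text characters. A rough, assumed sketch of that kind of cleanup:

```python
# Illustrative OCR cleanup; the project's actual rules aren't published.
import re

def clean_ocr(text: str) -> str:
    # Rejoin words hyphenated across line breaks: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse single line breaks inside paragraphs to spaces,
    # keeping blank lines as paragraph boundaries.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Strip bytes outside printable ASCII (typical scanner noise here;
    # a real pipeline would be more careful with legitimate accents).
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    # Normalize runs of spaces and of blank lines.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```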
"Modern VLM-based systems have higher accuracy, but we have found they are prone to hallucinate modern facts into our corpus, poisoning the exercise," the authors write. So no easy fix from off-the-shelf modern OCR. They're now building a vintage-specific OCR system to retranscribe the corpus.
How does it actually score?
Against a modern twin (same architecture, same FLOPs, but trained on FineWeb), talkie underperforms on standard benchmarks. Filtering out questions about post-1930 events roughly halves the gap but doesn't close it. The authors attribute the rest to OCR noise and the corpus simply not covering modern subject matter. Language understanding and numeracy hold up reasonably well.
Coding is rougher. Given HumanEval problems with in-context examples, talkie's correct answers are essentially one-liners or single-character edits to the demonstration code. The authors highlight a case where the model implemented the decoding function for a rotation cipher by swapping an addition for a subtraction in the encoding example. That it works at all is encouraging; in absolute terms, it's modest.
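For readers without HumanEval in their head, the case described looks roughly like the reconstruction below. The function names and shift value are illustrative, not the verbatim benchmark item.

```python
# The encoder is given in context; the model's "solution" is the same
# expression with the addition turned into a subtraction.
def encode_shift(s: str, shift: int = 5) -> str:
    """Rotate each lowercase letter forward by `shift` positions."""
    return "".join(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")) for ch in s)

def decode_shift(s: str, shift: int = 5) -> str:
    """One-character edit of the encoder: '+ shift' becomes '- shift'."""
    return "".join(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")) for ch in s)

assert decode_shift(encode_shift("vintage")) == "vintage"
```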
What's next
The team plans a GPT-3-level vintage model by summer 2026, with a GPT-3.5 / original ChatGPT-level model possible if they can scale the historical corpus past one trillion tokens. The project is backed by funding and compute from Coefficient Giving and Anthropic. The post-trained chat checkpoint is available now.




