Google researchers have published a method called Sequential Attention that makes neural networks smaller and faster by systematically identifying which components actually matter. The blog post dropped February 4th, summarizing work from two papers that span several years of development on what the team frames as a subset selection problem.
The basic insight is almost obvious in retrospect. Modern neural networks use far more parameters than they need. The hard part is figuring out which ones to cut without breaking the model.
The Underlying Math Problem
Feature selection, the task of choosing which input variables a model should pay attention to, is NP-hard. That's computer science shorthand for "no known efficient algorithm is guaranteed to find the optimal answer." As models get larger, exhaustively checking every possible subset becomes computationally infeasible.
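To put rough numbers on that (my illustration, not the papers'): even a modest selection problem explodes combinatorially.

```python
from math import comb

# Picking just 20 features out of 1,000 already gives an astronomical number
# of candidate subsets; evaluating each one is out of the question.
print(comb(1_000, 20))   # roughly 3.4e41 subsets
```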
Previous approaches to this problem fall into two camps. Filter methods score features individually, which is fast but misses interactions. A feature might look useless alone but become essential when combined with others. Wrapper methods evaluate features in context but require retraining the model for every candidate subset, which doesn't scale.
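Here's a quick toy of the interaction problem, using an XOR-style dataset of my own construction: each feature is uncorrelated with the label on its own, yet the pair determines it exactly, so any per-feature filter score would throw both away.

```python
import numpy as np

# Each feature looks useless in isolation, but together they fully determine y.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 10_000)
x2 = rng.integers(0, 2, 10_000)
y = x1 ^ x2

print(np.corrcoef(x1, y)[0, 1])     # ~0: a filter method scores this as noise
print(np.corrcoef(x2, y)[0, 1])     # ~0: same here
print((y == (x1 ^ x2)).mean())      # 1.0: jointly, the two features are perfect
```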
Sequential Attention sidesteps this by making selection decisions during training rather than before or after. The algorithm maintains a running set of selected components and uses attention weights to identify the next most valuable addition. It's greedy, meaning it makes locally optimal choices at each step rather than searching for a global optimum. But greedy algorithms have known mathematical guarantees for certain problem classes (maximizing submodular functions is the classic example), and the researchers show this adaptation inherits those properties.
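Stripped of the neural-network machinery, the greedy pattern being adapted is short enough to sketch. Treat this as a generic skeleton rather than the paper's algorithm: in wrapper-style selection, `score` would mean training and evaluating a model on each candidate subset, which is exactly the cost Sequential Attention avoids.

```python
# Generic greedy subset selection: repeatedly add whichever remaining item
# improves the score the most, and never revisit earlier choices.
def greedy_select(items, k, score):
    selected = []
    for _ in range(k):
        remaining = [x for x in items if x not in selected]
        best = max(remaining, key=lambda x: score(selected + [x]))
        selected.append(best)          # locally optimal; fixed from here on
    return selected
```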
How It Works in Practice
The original paper from 2022 established the core technique for feature selection. During training, the algorithm maintains attention scores for all remaining unselected features. At each selection step, the highest-scoring feature gets permanently added to the subset. Then it recalculates, now considering what that addition means for the marginal value of everything else.
The clever part is doing this within a single training run. Traditional greedy selection would require training a new model for every feature you consider adding, which multiplies computational cost by orders of magnitude. By integrating selection into the training loop, Sequential Attention achieves comparable results with minimal overhead.
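Here's a minimal sketch of that idea in PyTorch, under my own assumptions about the details: a toy regression model, selected features passed through at full weight, candidates scaled by a softmax over learned attention logits, and the strongest candidate locked in after a few epochs. The function name, model, and hyperparameters are mine, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sequential_attention_select(X, y, k, epochs_per_round=3, lr=1e-2):
    """Greedily pick k features inside a single training trajectory (sketch)."""
    d = X.shape[1]
    selected = []
    model = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
    logits = torch.zeros(d, requires_grad=True)               # attention logits
    opt = torch.optim.Adam(list(model.parameters()) + [logits], lr=lr)

    for _ in range(k):
        candidates = torch.tensor([i for i in range(d) if i not in selected])
        for _ in range(epochs_per_round):
            opt.zero_grad()
            # Selected features pass through at full weight; candidates are
            # scaled by a softmax over their attention logits.
            mask = torch.zeros(d)
            if selected:
                mask[selected] = 1.0
            attn = mask.scatter(0, candidates, F.softmax(logits[candidates], dim=0))
            loss = F.mse_loss(model(X * attn).squeeze(-1), y)
            loss.backward()
            opt.step()
        # Permanently add the candidate the attention weights favor most.
        selected.append(candidates[torch.argmax(logits[candidates])].item())
    return selected
```

In practice you would retrain a fresh model on just the selected columns afterward; the point is that the choices emerge from one training run instead of one run per candidate subset.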
For linear regression, the researchers prove their approach is mathematically equivalent to Orthogonal Matching Pursuit, a classical algorithm with decades of theoretical backing. That equivalence doesn't hold for deep networks, but it provides some justification for why the approach should work.
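For reference, Orthogonal Matching Pursuit itself fits in a few lines; this is the standard textbook version with my own variable names.

```python
import numpy as np

def omp(X, y, k):
    """Classical OMP: greedily pick k columns of X, refitting after each pick."""
    X = np.asarray(X, dtype=float)
    selected, residual = [], np.asarray(y, dtype=float)
    for _ in range(k):
        # Score every column by its correlation with the current residual.
        scores = np.abs(X.T @ residual)
        if selected:
            scores[selected] = -np.inf                # never re-pick a column
        selected.append(int(np.argmax(scores)))
        # The "orthogonal" part: refit least squares on everything chosen so far.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected
```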
Block Sparsification Results
The second paper, SequentialAttention++ from early 2024, extends the technique to structured pruning. Instead of selecting individual features, it identifies entire blocks of weights that can be removed while preserving model accuracy.
Block sparsity matters because it plays nice with hardware. Removing scattered individual weights doesn't necessarily speed up inference, since GPUs and TPUs prefer dense matrix operations. But zeroing out entire blocks translates directly to computational savings.
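A toy illustration of the difference (mine, with an arbitrary block size and sparsity target, not the paper's criterion for choosing blocks): pruning at block granularity means entire tiles of the weight matrix become exactly zero, which a dense kernel can skip wholesale.

```python
import numpy as np

def block_prune(W, block=4, keep_frac=0.5):
    """Zero out the weakest (block x block) tiles of W, keeping keep_frac of them."""
    rows, cols = W.shape
    tiles = W.reshape(rows // block, block, cols // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))      # one score per tile
    cutoff = np.quantile(norms, 1 - keep_frac)
    mask = (norms >= cutoff)[:, None, :, None]          # keep only the strongest tiles
    return (tiles * mask).reshape(rows, cols)

W_pruned = block_prune(np.random.randn(8, 8))           # whole 4x4 tiles are now zero
```

Scattered zeros give a kernel nothing comparable to exploit, which is why unstructured pruning often shrinks a model on disk without making it faster.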
The team tested on ImageNet classification with ResNet50, which remains a standard benchmark for pruning research despite being somewhat dated. They also evaluated on Criteo, a click-through rate prediction dataset with massive feature counts where sparse models have obvious commercial value.
I couldn't find specific accuracy numbers in the blog post, which mostly gestures at "state of the art" without quantification. The papers presumably have the detailed benchmarks, but the public-facing summary is light on specifics. That's not unusual for a blog post, though it makes independent evaluation difficult.
What's Actually Novel Here
The contribution isn't any single technique but their combination. Differentiable pruning methods (using learned importance scores) and combinatorial optimization (greedy search over discrete choices) had developed separately. Sequential Attention bridges them by using attention weights as the importance signal that guides combinatorial selection.
The theoretical contribution in the second paper is showing that many existing differentiable pruning techniques can be understood as nonconvex regularization. The researchers prove that for a certain class of regularizers, the global optimum is unique and group-sparse, meaning it zeroes out entire blocks of weights rather than scattered individual parameters.
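For intuition, group-sparse regularization generically looks like the following (the textbook form, not necessarily the exact family the paper analyzes):

$$
\min_{w}\; L(w) + \lambda \sum_{g \in \mathcal{G}} \rho\!\left(\lVert w_g \rVert_2\right)
$$

where each group $g$ is a block of weights. Taking $\rho(t)=t$ gives the convex group lasso, while concave choices such as $\rho(t)=t^{p}$ with $0<p<1$ are nonconvex and push whole blocks $w_g$ to exactly zero rather than merely shrinking them.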
Whether this matters for practitioners depends on your use case. If you're compressing models for edge deployment, structured sparsity is valuable. If you're doing feature engineering for recommender systems with thousands of input signals, the feature selection results are more relevant.
The Broader Context
Google has been pushing on efficient ML for years now. Their 2022 research roundup mentioned Sequential Attention alongside other techniques like Treeformer (which uses decision trees to identify relevant attention keys) and work on sparse MLP layers. The throughline is finding ways to maintain accuracy while reducing computation.
The timing is interesting. As transformer models have exploded in size, the gap between what's possible in research labs and what's practical to deploy keeps widening. Techniques that were academic curiosities five years ago now have real commercial pressure behind them.
The blog post mentions future directions including LLM pruning and automated feature engineering for recommender systems. The LLM application seems like the more ambitious goal. Pruning attention heads and transformer blocks without degrading performance is an active research area, and it's not clear whether Sequential Attention's assumptions hold for models trained on next-token prediction.
The recommender system work sounds more immediately practical. Large embedding models with heterogeneous features are exactly the setting where subset selection provides value, and Google has obvious internal use cases.




