Machine Learning

Google's New Approach to AI Efficiency Focuses on What to Keep, Not What to Build

Sequential Attention treats model optimization as a mathematical subset selection problem, with state-of-the-art results on feature selection and block pruning.

Oliver Senti, Senior AI Editor
February 5, 2026 · 5 min read
[Image: visualization of neural network pruning, with selected components highlighted in orange and pruned elements fading to grey]

Google researchers have published a method called Sequential Attention that makes neural networks smaller and faster by systematically identifying which components actually matter. The blog post dropped February 4th, summarizing work from two papers that span several years of development on what the team frames as a subset selection problem.

The basic insight is almost obvious in retrospect. Modern neural networks use far more parameters than they need. The hard part is figuring out which ones to cut without breaking the model.

The Underlying Math Problem

Feature selection, the task of choosing which input variables a model should pay attention to, is NP-hard. That's computer science shorthand for "there's no known efficient algorithm that guarantees the optimal answer, and researchers strongly suspect none exists." As models get larger, exhaustively checking every possible subset becomes computationally impossible.

Previous approaches to this problem fall into two camps. Filter methods score features individually, which is fast but misses interactions. A feature might look useless alone but become essential when combined with others. Wrapper methods evaluate features in context but require retraining the model for every candidate subset, which doesn't scale.

Sequential Attention sidesteps this by making selection decisions during training rather than before or after. The algorithm maintains a running set of selected components and uses attention weights to identify the next most valuable addition. It's greedy, meaning it makes locally optimal choices at each step rather than searching for a global optimum. But greedy algorithms have known mathematical guarantees for certain problem classes, and the researchers show this adaptation inherits those properties.
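To make the framing concrete, here is what generic greedy subset selection looks like, stripped of any neural network machinery. The `score` function and the names are illustrative, not taken from the papers; in the naive version, scoring a candidate set means training and evaluating a model from scratch, which is exactly the cost Sequential Attention is designed to avoid.

```python
def greedy_select(candidates, score, k):
    """Pick k items one at a time, always adding whichever remaining
    candidate improves the score of the current set the most."""
    selected = []
    for _ in range(k):
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: score(selected + [c]),
        )
        selected.append(best)
    return selected
```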

How It Works in Practice

The original paper from 2022 established the core technique for feature selection. At each selection step, the algorithm calculates attention scores for all remaining unselected features. The highest-scoring feature gets permanently added to the subset. Then it recalculates, now considering what that addition means for the marginal value of everything else.

The clever part is doing this within a single training run. Traditional greedy selection would require training a new model for every feature you consider adding, which multiplies computational cost by orders of magnitude. By integrating selection into the training loop, Sequential Attention achieves comparable results with minimal overhead.
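Here's a rough sketch of how that might look for a toy model, assuming PyTorch and a softmax over learnable logits as the attention mechanism. The class name, the fixed schedule for when a feature gets locked in, and the choice to give selected features full weight are my simplifications for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialAttentionSelector(nn.Module):
    """Toy selector: learnable logits produce softmax attention over the
    not-yet-selected features; already-selected features pass through at full weight."""
    def __init__(self, num_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("selected", torch.zeros(num_features, dtype=torch.bool))

    def forward(self, x):
        # Softmax only over the remaining (unselected) features.
        masked_logits = self.logits.masked_fill(self.selected, float("-inf"))
        attn = F.softmax(masked_logits, dim=0)
        # Selected features get weight 1; unselected ones are scaled by attention.
        weights = torch.where(self.selected, torch.ones_like(attn), attn)
        return x * weights

    def select_next(self):
        # Permanently add the highest-attention unselected feature.
        masked_logits = self.logits.masked_fill(self.selected, float("-inf"))
        self.selected[masked_logits.argmax()] = True

# Toy training loop on synthetic data, purely illustrative.
torch.manual_seed(0)
n, d, k = 512, 20, 5
X = torch.randn(n, d)
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(n, 1)  # only 3 features matter

selector = SequentialAttentionSelector(d)
model = nn.Linear(d, 1)
opt = torch.optim.Adam(list(selector.parameters()) + list(model.parameters()), lr=1e-2)

steps_per_selection = 200
for i in range(k * steps_per_selection):
    loss = F.mse_loss(model(selector(X)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (i + 1) % steps_per_selection == 0:
        selector.select_next()

print("selected features:", selector.selected.nonzero().flatten().tolist())
```

The key property is that the whole thing is one training run: the attention logits are ordinary parameters updated by the same optimizer as the model, so ranking the remaining features costs essentially nothing extra.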

For linear regression, the researchers prove their approach is mathematically equivalent to Orthogonal Matching Pursuit, a classical algorithm with decades of theoretical backing. That equivalence doesn't hold for deep networks, but it provides some justification for why the approach should work.
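For readers who want the reference point, Orthogonal Matching Pursuit itself is short enough to write out. This is the textbook algorithm in NumPy, not code from either paper:

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily add the column of X most
    correlated with the current residual, then refit on all chosen columns."""
    residual = y.copy()
    support = []
    coef = None
    for _ in range(k):
        correlations = np.abs(X.T @ residual)
        correlations[support] = 0.0  # never re-pick an already-chosen column
        support.append(int(np.argmax(correlations)))
        # Least-squares refit on the selected columns only.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    return support, coef
```

The structural similarity to greedy selection is the point: pick the most promising item, refit, repeat.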

Block Sparsification Results

The second paper, SequentialAttention++ from early 2024, extends the technique to structured pruning. Instead of selecting individual features, it identifies entire blocks of weights that can be removed while preserving model accuracy.

Block sparsity matters because it plays nice with hardware. Removing scattered individual weights doesn't necessarily speed up inference, since GPUs and TPUs prefer dense matrix operations. But zeroing out entire blocks translates directly to computational savings.
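A toy NumPy example shows why the distinction matters. The 4×4 block size and the masks are invented for illustration; real structured-sparsity gains come from kernels and compilers that exploit the block layout.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

# Unstructured sparsity: scattered zeros. Half the weights are gone,
# but the matrix shape is unchanged, so a dense matmul does the same work.
unstructured = W * (rng.random(W.shape) > 0.5)

# Block sparsity: zero out whole 4x4 blocks, leaving a block-diagonal matrix.
block_mask = np.kron(np.eye(2), np.ones((4, 4)))
structured = W * block_mask

# The surviving blocks can be multiplied as smaller dense matrices,
# which is where the actual speedup comes from.
x = rng.standard_normal(8)
full = x @ structured
blocked = np.concatenate([x[:4] @ structured[:4, :4],
                          x[4:] @ structured[4:, 4:]])
assert np.allclose(full, blocked)  # same answer, half the multiply work
```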

The team tested on ImageNet classification with ResNet50, which remains a standard benchmark for pruning research despite being somewhat dated. They also evaluated on Criteo, a click-through rate prediction dataset with massive feature counts where sparse models have obvious commercial value.

I couldn't find specific accuracy numbers in the blog post, which mostly gestures at "state of the art" without quantification. The papers presumably have the detailed benchmarks, but the public-facing summary is light on specifics. That's not unusual for a blog post, though it makes independent evaluation difficult.

What's Actually Novel Here

The contribution isn't any single technique but their combination. Differentiable pruning methods (using learned importance scores) and combinatorial optimization (greedy search over discrete choices) had developed separately. Sequential Attention bridges them by using attention weights as the importance signal that guides combinatorial selection.

The theoretical contribution in the second paper is showing that many existing differentiable pruning techniques can be understood as nonconvex regularization. The researchers prove that for a certain class of regularizers, the global optimum is unique and group-sparse, meaning it naturally identifies structures rather than isolated parameters.
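To give a sense of what "group-sparse" means here (my illustration; the exact regularizer class is spelled out in the paper), the classic group penalty sums a norm over whole groups of weights, and nonconvex variants raise that norm to a power below one:

```latex
\Omega(w) = \sum_{g \in \mathcal{G}} \lVert w_g \rVert_2
\quad \text{(group lasso, convex)}
\qquad
\Omega_p(w) = \sum_{g \in \mathcal{G}} \lVert w_g \rVert_2^{\,p},\ 0 < p < 1
\quad \text{(nonconvex variant)}
```

Penalties of this shape drive entire groups of weights to exactly zero at once rather than scattering zeros across the network, which is what makes the resulting sparsity hardware-friendly.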

Whether this matters for practitioners depends on your use case. If you're compressing models for edge deployment, structured sparsity is valuable. If you're doing feature engineering for recommender systems with thousands of input signals, the feature selection results are more relevant.

The Broader Context

Google has been pushing on efficient ML for years now. Their 2022 research roundup mentioned Sequential Attention alongside other techniques like Treeformer (which uses decision trees to identify relevant attention keys) and work on sparse MLP layers. The throughline is finding ways to maintain accuracy while reducing computation.

The timing is interesting. As transformer models have exploded in size, the gap between what's possible in research labs and what's practical to deploy keeps widening. Techniques that were academic curiosities five years ago now have real commercial pressure behind them.

The blog post mentions future directions including LLM pruning and automated feature engineering for recommender systems. The LLM application seems like the more ambitious goal. Pruning attention heads and transformer blocks without degrading performance is an active research area, and it's not clear whether Sequential Attention's assumptions hold for models trained on next-token prediction.

The recommender system work sounds more immediately practical. Large embedding models with heterogeneous features are exactly the setting where subset selection provides value, and Google has obvious internal use cases.

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
