Researchers from Penn State and multiple institutions have published what they call the first comprehensive survey examining small language models in the age of GPT-4 and Llama-3.1 405B. The 87-page paper, now in ACM Transactions on Intelligent Systems and Technology, arrives at an interesting moment: SLMs are downloaded more frequently than larger models in the Hugging Face community.
That's not a typo. The models everyone talks about at conferences aren't necessarily the ones shipping in production.
What counts as "small" anyway?
The survey's authors tackle the definitional chaos head-on. Some studies define SLMs as models with fewer than 1 billion parameters that fit within 6GB of memory on mobile devices; others extend the ceiling to 10 billion. The paper proposes a functional definition instead: an SLM is defined by its capability to perform specialized tasks and its suitability for resource-constrained settings, with boundaries set by the minimal size at which emergent abilities appear and the maximum size sustainable under those constraints.
In practice, roughly 7 billion parameters has emerged as the upper threshold, and many practitioners treat 7B or fewer as the working definition of an SLM. Models like Microsoft's Phi-3 (3.8B parameters), Google's Gemma, Meta's Llama 3.2 (1B and 3B variants), and Alibaba's Qwen2 family cluster in this range.
The performance gap isn't as dramatic as you might expect. Phi-3-mini, at just 3.8 billion parameters, scores 68.8 on MMLU and 76.7 on HellaSwag, overtaking not only Google's Gemma 7B but also Mistral 7B.
The real problem with massive models
The survey documents five specific limitations driving the SLM resurgence. Privacy concerns top the list: cloud APIs mean your data leaves your infrastructure. Latency matters too, particularly for real-time applications. In the survey's benchmarks, Llama 2 7B takes approximately 84 seconds to process 100 tokens on a smartphone with a Snapdragon 685 processor.
Fine-tuning costs compound the issue. General-purpose LLMs are powerful, but many real-world applications need only specific abilities and domain knowledge; deploying a general-purpose model for those tasks wastes resources.
Healthcare and legal applications get specific attention: LLMs often underperform in these domains because they lack sufficient domain-specific knowledge, which is exactly where specialized models step in. The survey notes that domain-specific SLMs typically emerge either through fine-tuning on supervised domain data generated by larger models or through continual training on domain corpora.
On-device AI is already here
Meta's Llama 3.2 models (1B and 3B) demonstrate the practical reality: local processing delivers immediate responses and protects privacy by keeping sensitive data, such as patient health information, business records, personal messages, and calendar details, on the device.
The edge deployment story is more nuanced than the marketing suggests. Recent phones with 12-16GB of RAM or more can run models in the 1-3B range, and even 7B models with Q4 or Q5 quantization. Devices with 6GB of RAM can handle Q5-quantized Gemma and smaller models; 4GB is challenging, but the 1.1B TinyLlama still runs.
Quantization techniques prove essential. Quantized versions of popular models like Mistral 7B can achieve memory reductions from over 10GB to approximately 1.5GB, significantly lowering latency and enabling practical use on devices such as the Jetson Nano.
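To see why quantization moves the needle this much, here is a back-of-envelope memory estimate for the weights alone. The bits-per-weight figures and the 10% runtime overhead are illustrative assumptions, not measurements; real footprints also depend on the KV cache and the specific quantization scheme (Q4, Q5, etc.):

```python
def model_memory_gb(n_params_b: float, bits_per_weight: float,
                    overhead: float = 1.1) -> float:
    """Rough resident memory in GB for the weights alone.

    overhead is an assumed 10% allowance for runtime buffers.
    """
    bytes_total = n_params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 2**30

# A 7B model in fp16 vs. ~4.5 effective bits per weight (Q4-style):
print(f"fp16: {model_memory_gb(7, 16):.1f} GB")   # far beyond a 6GB phone
print(f"Q4:   {model_memory_gb(7, 4.5):.1f} GB")  # feasible on recent devices
```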
The collaboration frameworks nobody talks about
The survey's most technically interesting sections cover how SLMs and LLMs work together. Several patterns emerge:
Speculative decoding uses an SLM to generate draft tokens that an LLM then verifies in parallel. Distributed speculative decoding accelerates collaborative inference between SLM and LLM by executing the SLM locally and offloading verification to an edge server. The approach preserves output fidelity while reducing overall latency.
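A minimal sketch of that draft-and-verify loop, with toy deterministic functions standing in for both models. The stand-ins and the mod-5 "vocabulary" are purely illustrative; real verification compares model probabilities in one batched forward pass:

```python
def draft_model(prefix, k):
    """Toy SLM: cheaply proposes k draft tokens."""
    return [t % 5 for t in range(len(prefix), len(prefix) + k)]

def target_model(prefix, proposals):
    """Toy LLM: scores all k proposals (plus one bonus position) in parallel."""
    return [(len(prefix) + i) % 5 for i in range(len(proposals) + 1)]

def speculative_step(prefix, k=4):
    proposals = draft_model(prefix, k)          # cheap serial drafting
    verified = target_model(prefix, proposals)  # one expensive parallel check
    accepted = []
    for p, v in zip(proposals, verified):
        if p == v:
            accepted.append(p)                  # draft token confirmed
        else:
            accepted.append(v)                  # first mismatch: keep the
            break                               # target's token, drop the rest
    else:
        accepted.append(verified[-1])           # all drafts accepted: bonus token
    return prefix + accepted
```

Output fidelity is preserved because every emitted token is either confirmed or replaced by the target model; the speedup comes from verifying many draft positions in a single pass.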
Cascade routing starts with an SLM and escalates to an LLM only when necessary. Selective knowledge distillation replaces repeated LLM usage with distilled SLMs and increases information per token via selective signals and sampling.
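Cascade routing reduces to a confidence check. In this sketch the model calls and the length-based confidence signal are hypothetical placeholders; real routers often use the mean token log-probability or a trained verifier instead:

```python
def slm_answer(query):
    """Toy SLM call returning (answer, confidence)."""
    confidence = 0.9 if len(query) < 40 else 0.3  # placeholder signal
    return f"slm:{query}", confidence

def llm_answer(query):
    """Toy LLM call: assumed slower and costlier."""
    return f"llm:{query}"

def cascade(query, threshold=0.7):
    answer, confidence = slm_answer(query)
    if confidence >= threshold:
        return answer              # cheap path: the SLM was confident enough
    return llm_answer(query)       # escalate only when necessary
```

The threshold is the whole tuning knob: raise it and more traffic escalates to the LLM, lower it and you trade accuracy for cost.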
Proxy tuning employs SLMs to produce reusable updates or control signals for efficient LLM adaptation. The motivation is cost: fine-tuning is expensive on two fronts. Tuning an SLM with LLM distillation signals racks up FLOPs or API expenses, while directly fine-tuning the LLM itself is costly in FLOPs, memory, and time.
The trustworthiness applications caught my attention. SLMs offer lightweight, adaptable external safety control. Recent collaborations adopt two forms: safety-guided decoding, where SLMs adjust LLM logits during generation, and guardian-generator, where SLMs filter inputs or outputs.
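The guardian-generator form amounts to screening both sides of the main model. In this sketch a toy blocklist stands in for a real SLM safety classifier; all names are hypothetical:

```python
BLOCKLIST = {"build a bomb", "steal credentials"}  # toy stand-in for a classifier

def guardian(text):
    """Pretend SLM safety check: True means the text looks safe."""
    return not any(phrase in text.lower() for phrase in BLOCKLIST)

def generator(prompt):
    """Pretend LLM generator."""
    return f"response to: {prompt}"

def guarded_generate(prompt, refusal="Sorry, I can't help with that."):
    if not guardian(prompt):       # filter inputs
        return refusal
    output = generator(prompt)
    if not guardian(output):       # filter outputs too
        return refusal
    return output
```

The appeal is modularity: the small guardian can be retrained or swapped without touching the large generator at all.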
The trustworthiness problem
The survey devotes substantial attention to SLM reliability, particularly hallucination and privacy. For all their strong performance, SLMs carry real credibility risks, from fabricated outputs to privacy breaches.
Hallucination rates in medical contexts remain concerning. One study found a 1.47% hallucination rate and a 3.45% omission rate in clinical note generation, which sounds low until you consider the consequences. The MedHallu benchmark evaluation revealed that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with binary hallucination detection, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations.
The multilingual trustworthiness picture is worse. Proprietary models lead overall, with low hallucinations, high honesty, neutrality, and jailbreak resistance, though privacy remains a weakness. Large open-weight models show strong factuality and robustness but mixed safety and privacy. Small open-weight models underperform and can be brittle across languages.
Techniques that actually work
The survey catalogs enhancement methods across several categories. Training from scratch benefits from specific architectural choices: commonly used techniques for general-purpose SLMs include grouped-query attention (GQA), gated FFNs, SiLU activation functions, RMS normalization (RMSNorm), deep-and-thin model architectures, and embedding optimizations.
Knowledge distillation remains the primary transfer method, alongside two other compression techniques. Pruning removes less critical parameters, shrinking the model while aiming to preserve performance. Knowledge distillation trains a smaller student model to reproduce a large teacher model's outputs. Quantization lowers parameter precision, significantly cutting memory and computation needs with minimal impact on accuracy.
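The distillation objective itself is compact. This pure-Python sketch matches the student to the teacher's temperature-softened token distribution; the logits and temperature are illustrative, and production code would use framework tensors rather than lists:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, [2.9, 1.1, 0.1]))  # student close: small loss
print(distillation_loss(teacher, [0.1, 3.0, 1.0]))  # student far: large loss
```

The temperature softens both distributions so the student also learns from the teacher's relative preferences among wrong answers, not just its top pick.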
Selective distillation shows promise. Recent work on token-weighted approaches reframes distillation from loss engineering to selective supervision, focusing teacher guidance where it matters most rather than applying uniform supervision across all tokens.
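One way to sketch that token-weighted idea: weight each token's distillation term by how far the student already is from the teacher, so supervision concentrates on the hardest tokens. The disagreement-based weighting below is an illustrative assumption, not the specific scheme from the cited work:

```python
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def selective_distill_loss(teacher_seq, student_seq):
    """Per-token KD terms, reweighted so divergent tokens dominate."""
    terms = [kl(softmax(t), softmax(s))
             for t, s in zip(teacher_seq, student_seq)]
    total = sum(terms) or 1.0       # the weights are the terms themselves
    return sum((term / total) * term for term in terms)
```

Compared with a uniform average, this self-weighting up-ranks tokens where the student disagrees most, which is the "selective supervision" framing in a nutshell.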
What comes next
The survey identifies several research gaps. Current cloud-edge SLM-LLM collaborations face challenges in achieving secure peer-to-peer exchange, effective cold-start learning, formal privacy guarantees, and scalable personalization under strict data protection.
The benchmarking situation needs work. Current evaluations emphasize open-ended performance but neglect structured and system-level metrics. Future benchmarks should assess structured reasoning, modular cooperation, and efficiency indicators such as throughput, latency, and cost-performance.
Gartner forecasts that organizations will use small, task-specific models three times more than general-purpose LLMs by 2027. Whether or not that projection holds, the Hugging Face download data already tells the story: SLMs are gaining serious traction as alternatives to LLMs.
The 87-page survey is available through ACM Transactions on Intelligent Systems and Technology. The authors maintain an updated GitHub repository tracking new SLM developments.