Evaluating Small Language Models: A Deep Dive

Tanisha

Introduction
Small Language Models (SLMs) are gaining prominence in natural language processing thanks to their efficient architectures and versatility. They support use cases ranging from chatbots to specialized coding assistants and domain-specific reasoning engines. This analysis evaluates 21 open-source SLMs, providing empirical data to guide implementation decisions.
Understanding Small Language Models
SLMs offer a lightweight alternative to traditional large language models (LLMs), excelling in resource-constrained environments such as edge computing and on-device AI applications. They prioritize efficiency while maintaining robust performance, benefiting developers, researchers, and commercial enterprises alike.
Methodology and Evaluation Framework
Our evaluation covers seven key performance metrics:
- Average Score: Aggregates overall performance across the six domain metrics below (see the sketch after this list).
- Creative Writing: Measures natural language generation and coherence.
- Science & Technology: Evaluates technical accuracy and scientific reasoning.
- Social Sciences: Analyzes contextual understanding in historical and philosophical domains.
- Business & Economics: Assesses financial literacy and market analysis capabilities.
- Health & Well-being: Measures knowledge in the medical and wellness fields.
- Environment & Climate: Evaluates understanding of environmental science.
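As a quick illustration of how the metrics combine, the Average Score can be read as the mean of the six domain scores. The snippet below is a minimal sketch under that assumption; the scores and the 0–10 scale are invented for illustration and are not results from this evaluation.

```python
from statistics import mean

# Hypothetical per-category scores on an assumed 0-10 scale (illustration only).
category_scores = {
    "creative_writing": 7.8,
    "science_technology": 8.1,
    "social_sciences": 7.4,
    "business_economics": 7.0,
    "health_wellbeing": 7.6,
    "environment_climate": 7.2,
}

# Average Score: unweighted mean across the six domain metrics.
average_score = mean(category_scores.values())
print(f"Average Score: {average_score:.2f}")  # -> Average Score: 7.52
```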
The methodology combines quantitative and qualitative measures across these seven metrics to give a robust picture of each model's efficiency and effectiveness in practical applications.
The 21 models were selected on criteria including architecture, parameter count, and the availability of training data, and they span both industry and academic releases, enabling a thorough comparative analysis. While the chosen SLMs share broadly similar architectures, they differ in hyperparameters and training datasets, which leads to varied performance across tasks[1].
The evaluation is conducted in three stages:
- Data Preparation: A synthetic dataset is generated from a text corpus with the assistance of an LLM and is then used to evaluate the SLMs across the defined metrics.
- Automated and Human-Like Evaluation: Automated metrics are combined with qualitative assessments from human evaluators to give a holistic view of each model's strengths and weaknesses; this dual approach improves the reliability of the outcomes[2].
- Feedback Generation: Results from both evaluation tracks are compiled into a report that summarizes each model's performance and offers actionable feedback for improvement.
By employing this structured methodology, we aim to support informed decision-making and foster trust in the capabilities of small language models across applications.
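The article does not publish its evaluation code, so the following is only a minimal sketch of how the automated stage could be wired up. It assumes each model is wrapped in an object exposing a `generate(prompt)` method and each metric has a scoring function; all of these names are illustrative, not part of the original benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalCase:
    category: str   # one of the seven metrics, e.g. "Creative Writing"
    prompt: str     # an item from the synthetic, LLM-generated dataset


def evaluate_model(model, cases: List[EvalCase],
                   scorers: Dict[str, Callable[[str, str], float]]) -> Dict[str, float]:
    """Run every case through the model and average the scores per category."""
    per_category: Dict[str, List[float]] = {}
    for case in cases:
        response = model.generate(case.prompt)                  # assumed model interface
        score = scorers[case.category](case.prompt, response)   # assumed metric function
        per_category.setdefault(case.category, []).append(score)

    report = {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}
    report["Average Score"] = sum(report.values()) / len(report)
    return report
```

The human-like stage would complement the `scorers` functions with LLM- or human-assigned ratings, and the resulting report feeds the feedback stage described above.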
Comparative Analysis Table
| Model Name | Parameters | Open Source? | Features |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ✅ Yes | Optimized for general-purpose tasks. |
| Falcon3 | 3B | ✅ Yes | Strong technical understanding, excels in STEM-related queries. |
| Llama-3.2 | 1B | ✅ Yes | Well-balanced, performs decently across diverse domains. |
| Qwen2 | 0.5B | ✅ Yes | Compact and efficient, optimized for structured tasks. |
| Salamandra | 2B | ✅ Yes | Provides reasonable performance but lacks depth. |
| SmolLM2 | 1.7B | ✅ Yes | Limited capability, lacks coherence and depth. |
| Gemma-2B-IT | 2B | ✅ Yes | Performs moderately well but lacks advanced reasoning. |
| microsoft/Phi-3.5 | 3.5B | ✅ Yes | Best in creativity, reasoning, and structured responses. |
| genSILMA-Kashif | 2B | ✅ Yes | Strong comprehension and structured problem-solving ability. |
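All of the models in the table are released as open weights, most of them through the Hugging Face Hub. The snippet below is a generic loading sketch using the transformers library; the repository ID is an assumption based on the table's naming and should be replaced with the exact ID published by each model's authors, and some models may require additional loading arguments.

```python
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repository ID; substitute the exact name of the model you want to try.
model_id = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # small models usually fit on a single GPU in bf16
    device_map="auto",
)

prompt = "Explain the trade-offs of small language models in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```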
Graph Visualization
[Chart comparing the evaluated models' scores across the seven metrics; see the insights below.]
Key Insights
- Top Performers: microsoft/Phi-3.5-mini-instruct and genSILMA-Kashif-2B-Instruct-v1.0 consistently achieve the highest scores across all categories, demonstrating strong performance in creative writing, science & technology, social sciences, business, health, and environmental topics. These models are well suited for general-purpose tasks and complex reasoning.
- Moderate Performers: Falcon3-3B-Instruct, Llama-3.2-1B-Instruct, and Salamandra-2B-Instruct perform reasonably well, showing balanced capabilities across different domains. Falcon3-3B-Instruct particularly excels in Science & Technology, while Llama-3.2 maintains consistent performance across all categories.
- Limited Utility: DeepSeek-R1-Distill-Qwen-1.5B, Qwen2-0.5B, and Gemma-2B-IT exhibit lower performance compared to other models, though they still achieve moderate scores in some categories. SmolLM2-1.7B-Instruct performs significantly worse than the others, struggling across all domains.
Implementation Recommendations
- For AI-Centric Applications: Phi-3.5-mini-instruct offers the best overall performance.
- For Balanced Deployments: genSILMA-Kashif-2B-Instruct provides an ideal mix of efficiency and capability.
- For Technical Use Cases: Phi-3.5-mini-instruct is the preferred choice for scientific computing.
- For Lightweight Deployments: Falcon3-1B and Llama-3.2-1B are viable options.
- Not Recommended: SmolLM2-1.7B-Instruct underperforms significantly.
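For convenience, the recommendations above can be condensed into a small lookup. This is purely an illustrative helper; the scenario keys are invented here, while the model names come from the list above.

```python
# Illustrative mapping of the deployment scenarios above to the recommended models.
RECOMMENDED_MODELS = {
    "ai_centric": ["Phi-3.5-mini-instruct"],
    "balanced": ["genSILMA-Kashif-2B-Instruct"],
    "technical": ["Phi-3.5-mini-instruct"],
    "lightweight": ["Falcon3-1B", "Llama-3.2-1B"],
}

def recommend(use_case: str) -> list[str]:
    """Return the model(s) suggested for a given deployment scenario."""
    if use_case not in RECOMMENDED_MODELS:
        raise ValueError(f"Unknown use case {use_case!r}; "
                         f"choose from {sorted(RECOMMENDED_MODELS)}")
    return RECOMMENDED_MODELS[use_case]

print(recommend("lightweight"))  # ['Falcon3-1B', 'Llama-3.2-1B']
```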
Conclusion
SLMs offer an efficient approach to natural language processing, with Phi-3.5-mini-instruct emerging as the top choice in this evaluation. Future work should explore further optimizations such as fine-tuning and improved inference speed.
References
- OpenChat Documentation
- MicroLlama Technical Specifications
- NVIDIA CUDA Toolkit Documentation
- PyTorch Implementation Guide
- SGLang Technical Documentation
- FlashInfer Research Paper
- Mistral AI Documentation
- Microsoft Phi3.5 Technical Overview
- DeepSeek LLM Documentation
Contributing to the Project
This research is open-source and welcomes contributions. Visit our GitHub repository to engage in development and evaluation efforts.
How to Contribute:
- Fork the repository.
- Create a feature branch.
- Implement your changes.
- Submit a pull request with documentation.
- Engage in the review process.
For detailed contribution guidelines, refer to the repository’s CONTRIBUTING.md file.