Evaluating Multilingual Language Models: A Comprehensive Approach

Tanisha

Introduction
With the increasing demand for AI-driven solutions across diverse linguistic landscapes, evaluating multilingual models is crucial [1]. This blog explores a structured approach to assessing multilingual text generation models across several Indian languages. Our evaluation framework leverages Hugging Face models and standardized techniques to assess accuracy, coherence, and language fidelity [3].
Why Evaluate Multilingual Language Models?
Multilingual models promise versatility, but their effectiveness varies across languages [2]. Evaluating these models is necessary to:
- Ensure fluency and correctness in multiple languages.
- Detect biases or inconsistencies in language generation [4].
- Measure performance across high-resource and low-resource languages.
- Validate usability in real-world applications such as chatbots, content creation, and customer support.
- Identify areas where models struggle with contextual understanding, idiomatic expressions, and code-switching.
Challenges in Multilingual Model Evaluation
While evaluating multilingual models, several challenges arise:
- Data Scarcity: Low-resource languages have fewer training datasets [5].
- Translation Errors: Direct translations may distort meaning across languages.
- Cultural Sensitivity: Ensuring that generated responses align with cultural norms.
- Computational Complexity: Evaluating across multiple languages requires extensive computational resources.
Model Selection and Setup
For our evaluation, we selected a set of multilingual models trained for Indian languages, including l3cube-pune/marathi-gpt-gemma-2b, which is used in the walkthrough below. The evaluation involves generating responses to multilingual prompts in Hindi, Tamil, Bengali, Telugu, Malayalam, Marathi, and English.
Setting Up the Environment
We use sglang for model inference, with the transformers library shown later as an alternative:
from huggingface_hub import login
import sglang as sgl
import json

# Authenticate with the Hugging Face Hub (replace with your own access token).
login("YOUR_HF_TOKEN")

# Model under evaluation, served through sglang's offline engine.
gen = "l3cube-pune/marathi-gpt-gemma-2b"
gen_llm = sgl.Engine(model_path=gen)

# Moderate temperature to balance diversity and coherence across languages.
sampling_params = {"temperature": 0.5, "top_p": 0.95}
Multilingual Prompt Set
A diverse set of 20+ multilingual prompts was designed to test general knowledge, cultural awareness, and linguistic ability [5]. These prompts include:
Language | Sample Prompt | English Gloss |
---|---|---|
Hindi | “नमस्ते! अपने पसंदीदा त्योहार के बारे में बताइए।” | “Hello! Tell me about your favourite festival.” |
Tamil | “இந்திய உணவின் சிறப்புகளை விளக்கவும்.” | “Explain the distinctive features of Indian cuisine.” |
Bengali | “বাংলায় আপনার দেশের ঐতিহ্য ও সংস্কৃতি নিয়ে আলোচনা করুন।” | “Discuss your country’s heritage and culture in Bengali.” |
Telugu | “భారతీయ సాంప్రదాయ వంటకాల గురించి వివరించండి.” | “Describe traditional Indian dishes.” |
Malayalam | “കേരളത്തിലെ സംസ്കാരിക ചടങ്ങുകളെപ്പറ്റി പറയൂ.” | “Tell me about the cultural ceremonies of Kerala.” |
Marathi | “भारतीय सणांबद्दल मराठीत लिहा आणि त्यांचा सांस्कृतिक महत्त्व स्पष्ट करा.” | “Write about Indian festivals in Marathi and explain their cultural significance.” |
English | “Explain the significance of the scientific method.” | (already in English) |
Mixed | “Explain how a computer works in layman’s terms, then explain the same concept in Hindi.” | (already in English) |
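Before generation, we assume these prompts are collected into a plain Python list named questions, which the code in the next section iterates over. A minimal sketch (the prompts.json file name and its structure are our own assumption, not part of the original setup):

# Hypothetical prompts file: a JSON array of prompt strings covering the languages above.
with open("prompts.json", "r", encoding="utf-8") as f:
    questions = json.load(f)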
Response Generation and Evaluation
Using sglang
We generate responses using sglang:
responses = []
for question in questions:
    # Generate a completion for each prompt using the sampling parameters defined above.
    response = gen_llm.generate(question, sampling_params, stream=False)
    answer = response.get('text', '').strip()
    responses.append({"instruction": question, "answer": answer})
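Because these responses feed the evaluation step, it helps to persist them to disk. A minimal sketch using the json module imported earlier (the responses.json file name is arbitrary):

# Save prompt/response pairs for later scoring; ensure_ascii=False preserves Indic scripts.
with open("responses.json", "w", encoding="utf-8") as f:
    json.dump(responses, f, ensure_ascii=False, indent=2)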
Alternative: Using Transformers Library
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "l3cube-pune/marathi-gpt-gemma-2b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model, moving the model to GPU when available.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

responses = []
for question in questions:
    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = {k: v.to(device) for k, v in tokenizer(question, return_tensors="pt").items()}
    output = model.generate(**inputs, temperature=0.8, top_p=0.95, max_length=300, do_sample=True)
    # Note: decode() returns the full sequence, so the output still contains the prompt.
    response_text = tokenizer.decode(output[0], skip_special_tokens=True)
    responses.append({"instruction": question, "answer": response_text.strip()})
Evaluation Criteria
To assess the model’s multilingual performance, we focus on the following criteria (a simple scoring sketch follows the list):
- Fluency & Grammar: Is the generated text grammatically correct?
- Relevance: Does the response match the prompt’s intent?
- Consistency: Are similar questions answered in a uniform manner across languages?
- Bias Detection: Are there any biases or incorrect stereotypes present in the responses?
- Context Awareness: Does the model adapt responses based on cultural context?
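One lightweight way to operationalize these criteria is to have human annotators (or an LLM judge) rate each response from 1 to 5 per criterion and then average the ratings per language. The sketch below is illustrative only; the criterion keys, the 1-5 scale, and the aggregate_scores helper are our own assumptions rather than part of any standard library:

from collections import defaultdict

CRITERIA = ["fluency", "relevance", "consistency", "bias", "context_awareness"]

def aggregate_scores(scored_responses):
    # scored_responses: list of dicts such as
    # {"language": "Hindi", "scores": {"fluency": 4, "relevance": 5, ...}}
    per_language = defaultdict(lambda: defaultdict(list))
    for item in scored_responses:
        for criterion in CRITERIA:
            per_language[item["language"]][criterion].append(item["scores"][criterion])
    # Average each criterion's ratings per language.
    return {
        lang: {c: sum(vals) / len(vals) for c, vals in crits.items()}
        for lang, crits in per_language.items()
    }

Averaging per language makes it straightforward to produce summaries like the table in the next section.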
Key Findings & Insights
Language | Performance |
---|---|
Marathi, Hindi, English | Produced coherent and contextually relevant responses. |
Tamil, Telugu, Bengali | Good performance overall, with occasional grammatical inconsistencies. |
Malayalam | More syntactic errors and lower fluency, indicating a need for better training data. |
Mixed-language | Often led to partial or incorrect responses, highlighting limitations in cross-lingual contextual understanding. |
Conclusion
Our findings suggest that multilingual performance remains uneven across languages, and that future work should focus on:
- Expanding datasets to include more low-resource languages.
- Enhancing cross-lingual training techniques to improve context retention.
- Exploring better evaluation metrics for multilingual AI.
- Integrating human-in-the-loop assessment for improved accuracy.
References
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
- Conneau, A., & Lample, G. (2019). Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.
- Xue, L., Constant, N., Roberts, A., et al. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv preprint arXiv:2010.11934.
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity in NLP. arXiv preprint arXiv:2004.09095.
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT.
Open Source Contribution Guidelines
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch.
- Implement changes and add relevant tests.
- Submit a pull request.
Visit our GitHub repository for more details.
What are your thoughts on multilingual model evaluation? Share your insights and experiences in the comments!