Evaluating Multilingual Language Models: A Comprehensive Approach

Tanisha

Introduction

With the increasing demand for AI-driven solutions across diverse linguistic landscapes, evaluating multilingual models is crucial [1]. This blog explores a structured approach for assessing multilingual text generation models across various Indian languages. Our evaluation framework leverages Hugging Face models and standardized techniques, ensuring a robust assessment of accuracy, coherence, and language fidelity [3].

Why Evaluate Multilingual Language Models?

Multilingual models promise versatility, but their effectiveness varies across languages [2]. Evaluating these models is necessary to:

  • Ensure fluency and correctness in multiple languages.
  • Detect biases or inconsistencies in language generation [6].
  • Measure performance across high-resource and low-resource languages.
  • Validate usability in real-world applications such as chatbots, content creation, and customer support.
  • Identify areas where models struggle with contextual understanding, idiomatic expressions, and code-switching.

Challenges in Multilingual Model Evaluation

While evaluating multilingual models, several challenges arise:

  • Data Scarcity: Low-resource languages have fewer training datasets [5].
  • Translation Errors: Direct translations may distort meaning across languages.
  • Cultural Sensitivity: Ensuring that generated responses align with cultural norms.
  • Computational Complexity: Evaluating across multiple languages requires extensive computational resources.

Model Selection and Setup

For our evaluation, we selected a diverse set of multilingual models, including l3cube-pune/marathi-gpt-gemma-2b, which is trained for Indian languages. The evaluation process involves generating responses to multilingual prompts in Hindi, Tamil, Bengali, Telugu, Malayalam, Marathi, and English.

Setting Up the Environment

We use sglang for model inference, with the transformers library as an alternative generation path:

from huggingface_hub import login
import sglang as sgl
import json

# Authenticate with the Hugging Face Hub (needed for gated models such as Gemma).
login("Your API Key")  # replace with your own access token

# Load the generator into an sglang offline inference engine.
gen = "l3cube-pune/marathi-gpt-gemma-2b"
gen_llm = sgl.Engine(model_path=gen)

# A moderate temperature keeps outputs focused while still allowing variation.
sampling_params = {"temperature": 0.5, "top_p": 0.95}

Multilingual Prompt Set

A diverse set of 20+ multilingual prompts was designed to test general knowledge, cultural awareness, and linguistic ability [5]. These prompts include:

  • Hindi: “नमस्ते! अपने पसंदीदा त्योहार के बारे में बताइए।” (“Hello! Tell me about your favorite festival.”)
  • Tamil: “இந்திய உணவின் சிறப்புகளை விளக்கவும்.” (“Explain the specialties of Indian cuisine.”)
  • Bengali: “বাংলায় আপনার দেশের ঐতিহ্য ও সংস্কৃতি নিয়ে আলোচনা করুন।” (“Discuss your country’s heritage and culture in Bengali.”)
  • Telugu: “భారతీయ సాంప్రదాయ వంటకాల గురించి వివరించండి.” (“Describe traditional Indian dishes.”)
  • Malayalam: “കേരളത്തിലെ സംസ്‌കാരിക ചടങ്ങുകളെപ്പറ്റി പറയൂ.” (“Tell me about Kerala’s cultural ceremonies.”)
  • Marathi: “भारतीय सणांबद्दल मराठीत लिहा आणि त्यांचा सांस्कृतिक महत्त्व स्पष्ट करा.” (“Write about Indian festivals in Marathi and explain their cultural significance.”)
  • English: “Explain the significance of the scientific method.”
  • Mixed: “Explain how a computer works in layman’s terms, then explain the same concept in Hindi.”
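
For the generation loops below, these prompts are gathered into the questions list that the code refers to. This is an abridged sketch using the samples above; the full set contains 20+ prompts:

# Prompt inputs for generation (abridged; the full set has 20+ entries).
questions = [
    "नमस्ते! अपने पसंदीदा त्योहार के बारे में बताइए।",  # Hindi
    "இந்திய உணவின் சிறப்புகளை விளக்கவும்.",  # Tamil
    "বাংলায় আপনার দেশের ঐতিহ্য ও সংস্কৃতি নিয়ে আলোচনা করুন।",  # Bengali
    "భారతీయ సాంప్రదాయ వంటకాల గురించి వివరించండి.",  # Telugu
    "കേരളത്തിലെ സംസ്‌കാരിക ചടങ്ങുകളെപ്പറ്റി പറയൂ.",  # Malayalam
    "भारतीय सणांबद्दल मराठीत लिहा आणि त्यांचा सांस्कृतिक महत्त्व स्पष्ट करा.",  # Marathi
    "Explain the significance of the scientific method.",  # English
    "Explain how a computer works in layman's terms, then explain the same concept in Hindi.",  # Mixed
]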

Response Generation and Evaluation

Using sglang

We generate responses using sglang:

# Generate one response per prompt; Engine.generate returns a dict with a "text" field.
responses = []
for question in questions:
    response = gen_llm.generate(question, sampling_params, stream=False)
    answer = response.get("text", "").strip()
    responses.append({"instruction": question, "answer": answer})

Alternative: Using Transformers Library

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "l3cube-pune/marathi-gpt-gemma-2b"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

responses = []
for question in questions:
    inputs = {k: v.to(device) for k, v in tokenizer(question, return_tensors="pt").items()}
    with torch.no_grad():
        # max_new_tokens bounds only the generated text; max_length would also count prompt tokens.
        output = model.generate(**inputs, temperature=0.8, top_p=0.95, max_new_tokens=300, do_sample=True)
    # Drop the echoed prompt so the stored answer contains only generated text.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    response_text = tokenizer.decode(generated, skip_special_tokens=True)
    responses.append({"instruction": question, "answer": response_text.strip()})

Evaluation Criteria

To assess the model’s multilingual performance, we score each response on five criteria (a scoring sketch follows the list):

  • Fluency & Grammar: Is the generated text grammatically correct?
  • Relevance: Does the response match the prompt’s intent?
  • Consistency: Are similar questions answered in a uniform manner across languages?
  • Bias Detection: Are there any biases or incorrect stereotypes present in the responses?
  • Context Awareness: Does the model adapt responses based on cultural context?
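
One lightweight way to operationalize these criteria is a manual rubric in which a reviewer assigns each response a 1-5 score per criterion. The sketch below is illustrative only; the criterion keys and the rate_response helper are our own naming, not part of the original pipeline:

# Rubric criteria mirroring the list above; each is scored 1-5 (higher is better).
CRITERIA = ["fluency", "relevance", "consistency", "bias", "context"]

def rate_response(entry, language, ratings):
    """Attach a language tag and per-criterion scores to one generated response."""
    assert set(ratings) == set(CRITERIA), "score every criterion exactly once"
    return {**entry, "language": language, "scores": ratings}

# Example: a reviewer rates the first (Hindi) response.
scored = [rate_response(responses[0], "Hindi",
                        {"fluency": 4, "relevance": 5, "consistency": 4,
                         "bias": 5, "context": 4})]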

Key Findings & Insights

  • Marathi, Hindi, English: Produced coherent and contextually relevant responses.
  • Tamil, Telugu, Bengali: Good performance, but occasional grammatical inconsistencies.
  • Malayalam: More syntactic errors and weaker fluency, indicating a need for better training data.
  • Mixed-language: Often led to partial or incorrect responses, highlighting limitations in cross-language contextual understanding.
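
Per-language summaries like those above can be derived by averaging rubric scores across prompts. This sketch builds on the scored entries from the rubric example; the summarize_by_language helper is our own illustration:

from collections import defaultdict

def summarize_by_language(scored):
    """Average each criterion's 1-5 score per language."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for entry in scored:
        counts[entry["language"]] += 1
        for criterion, score in entry["scores"].items():
            totals[entry["language"]][criterion] += score
    return {lang: {c: total / counts[lang] for c, total in crits.items()}
            for lang, crits in totals.items()}

print(summarize_by_language(scored))  # e.g. {'Hindi': {'fluency': 4.0, ...}}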

Conclusion

Our evaluation shows that multilingual performance remains uneven, strongest in Marathi, Hindi, and English and weakest in Malayalam and mixed-language prompts. Improving it will require:

  • Expanding datasets to include more low-resource languages.
  • Enhancing cross-lingual training techniques to improve context retention.
  • Exploring better evaluation metrics for multilingual AI.
  • Integrating human-in-the-loop assessment for improved accuracy.

References

  1. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  2. Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.
  3. Xue, L., Constant, N., Roberts, A., et al. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv preprint arXiv:2010.11934.
  4. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
  5. Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity in NLP. arXiv preprint arXiv:2004.09095.
  6. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT.

Open Source Contribution Guidelines

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch.
  3. Implement changes and add relevant tests.
  4. Submit a pull request.

Visit our GitHub repository for more details.

What are your thoughts on multilingual model evaluation? Share your insights and experiences in the comments!
