BuildAI - Advancing AI Through Research

Introduction

Text-to-Speech (TTS) technology has evolved significantly, enabling applications in accessibility, content creation, and human-computer interaction. While English and other widely spoken languages have seen major advances in TTS, Indic languages still face challenges due to linguistic diversity, complex phonetics, and the limited availability of high-quality datasets.

Indic TTS models aim to bridge this gap by providing natural and intelligible speech synthesis for languages like Hindi, Marathi, Punjabi, and Gujarati. This post evaluates several Indic TTS models on both qualitative and quantitative metrics so that their capabilities can be compared side by side.

Why Evaluate TTS for Indic Languages?

Unlike English, Indic languages exhibit complex phonetic structures, tonal variations, and script-based peculiarities, making TTS evaluation crucial. Some key reasons for evaluating Indic TTS models include:

Linguistic Diversity: India has 22 constitutionally scheduled languages and many more regional languages, each with distinct pronunciation rules and dialectal variations.
Speech Naturalness: A well-designed TTS model should generate speech that sounds human-like and engaging.
Pronunciation Accuracy: Proper handling of consonant clusters, nasal sounds, and aspirated phonemes is essential.
Computational Efficiency: Evaluating the resource requirements of models helps in selecting suitable ones for real-time applications.
Real-World Applicability: TTS models should be tested for their effectiveness in accessibility solutions, education, and content generation.

Evaluation Metrics for TTS Models

To ensure a fair comparison, both subjective and objective evaluation methods were used. The key metrics include:

Subjective Metrics

Mean Opinion Score (MOS): A widely used rating scale (1-5) where human evaluators assess speech quality based on naturalness and intelligibility.
A/B Testing: Participants compare synthesized speech samples from different models and choose the more natural-sounding one.
User Feedback: Insights gathered from native speakers regarding pronunciation, tone, and fluency.
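
As a rough illustration of how the subjective scores above can be aggregated, the sketch below computes a MOS with a 95% confidence interval and an A/B preference rate. The function names and the sample ratings are hypothetical and are not taken from this evaluation; the interval uses a normal approximation, which is only a reasonable assumption when there are enough raters.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a 95% interval for 1-5 ratings.

    Normal approximation; with very few raters a t-distribution
    would be the more careful choice.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def preference_rate(votes, model):
    """Fraction of A/B comparisons won by `model`.

    `votes` is a list holding the name of the preferred model per trial.
    """
    if not votes:
        raise ValueError("no votes")
    return votes.count(model) / len(votes)


# Hypothetical listener ratings, not data from this post.
ratings = [4, 5, 4, 4, 5, 3, 4, 5]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")  # MOS = 4.25 +/- 0.49

# Hypothetical A/B votes between two anonymized systems.
votes = ["A", "B", "A", "A"]
print(preference_rate(votes, "A"))  # 0.75
```

Reporting the interval alongside the mean makes it easier to judge whether a gap such as 4.2 versus 4.1 is meaningful for a given number of listeners.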

Objective Metrics

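Word Error Rate (WER%): Measures how accurately the synthesized speech can be transcribed back into text, typically by running it through a speech recognizer and comparing the transcript with the input text. WER is the number of substituted, deleted, and inserted words divided by the number of words in the reference. A lower WER indicates better pronunciation and clarity.
Perceptual Evaluation of Speech Quality (PESQ Score): Estimates perceived speech quality by comparing synthesized speech against a reference human recording using a perceptual model. Raw PESQ scores (ITU-T P.862) range from about -0.5 to 4.5, so the two-digit values reported in this post are on a rescaled range.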
Computational Efficiency: Evaluates the processing power required to generate speech, impacting real-time usability.
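
To make two of the objective metrics concrete, here is a minimal sketch of WER as a word-level edit distance, plus a real-time factor (RTF) helper as one common way to express computational efficiency. Both function names are illustrative, and RTF is an assumption about how efficiency could be quantified, since this post does not say which measure was used. The hypothesis text would normally come from a speech recognizer run on the synthesized audio; which recognizer to use is left open here.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative Hindi sentence; the transcript drops the final word.
reference = "मैं कल दिल्ली जा रहा हूँ"
hypothesis = "मैं कल दिल्ली जा रहा"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}%")  # WER = 16.67%
print(real_time_factor(0.5, 2.0))  # 0.25
```

Splitting on whitespace is a simplification. For Indic scripts it usually helps to apply the same Unicode and punctuation normalization to both strings before scoring, so that spelling variants of the same word are not counted as errors.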

Tabulated Results: Indic TTS Model Comparison

Model (Language)	MOS (1-5)	WER (%)	PESQ Score
ai4bharat/indic-parler-tts (All)	4.5	5.2	89
facebook/mms-tts-hin (Hindi)	4.2	7.1	85
facebook/mms-tts-guj (Gujarati)	4.1	7.5	84
ai4bharat/vits_rasa_13 (Hindi)	4.0	8.5	82
ai4bharat/vits_rasa_13 (Marathi)	3.9	8.7	80
ai4bharat/vits_rasa_13 (Punjabi)	3.8	9.0	78
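
For readers who want to work with these numbers programmatically, the sketch below loads the table into Python and ranks the models. The values are copied from the table above; the `Result` type and its field names are illustrative choices, not part of any published script.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    language: str
    mos: float
    wer: float   # percent
    pesq: float  # scale as reported in the table


# Values copied from the results table.
RESULTS = [
    Result("ai4bharat/indic-parler-tts", "All", 4.5, 5.2, 89),
    Result("facebook/mms-tts-hin", "Hindi", 4.2, 7.1, 85),
    Result("facebook/mms-tts-guj", "Gujarati", 4.1, 7.5, 84),
    Result("ai4bharat/vits_rasa_13", "Hindi", 4.0, 8.5, 82),
    Result("ai4bharat/vits_rasa_13", "Marathi", 3.9, 8.7, 80),
    Result("ai4bharat/vits_rasa_13", "Punjabi", 3.8, 9.0, 78),
]

best_mos = max(RESULTS, key=lambda r: r.mos)
by_wer = sorted(RESULTS, key=lambda r: r.wer)  # lower WER is better

print(f"Highest MOS: {best_mos.model} ({best_mos.mos})")
for r in by_wer:
    print(f"{r.wer:4.1f}%  {r.model} ({r.language})")
```

Sorting by WER gives the same order as sorting by MOS or PESQ in this table, so the three metrics agree on the ranking.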

Output Audio Samples

Below are audio samples generated by different Indic TTS models. Listening to these outputs allows for a practical assessment of speech quality, pronunciation accuracy, and naturalness.

ai4bharat/indic-parler-tts:

ai4bharat/vits_rasa_13 (Hindi):

ai4bharat/vits_rasa_13 (Marathi):

ai4bharat/vits_rasa_13 (Punjabi):

Graphical Visualization

The evaluation results are easier to compare visually. The chart below plots the MOS, WER (%), and PESQ score for each Indic TTS model.

Figure: Comparison of Indic TTS Models (radar plot)
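
Because MOS, WER, and PESQ live on different scales and WER is better when lower, the metrics need a common axis before they can share one plot. The sketch below shows one possible approach, min-max scaling with WER inverted so that 1.0 is always the best model in the table. How the original figure was scaled is not stated, so treat this as an assumption rather than a reproduction of it.

```python
# Values copied from the results table: (MOS, WER %, PESQ score).
SCORES = {
    "indic-parler-tts (All)": (4.5, 5.2, 89),
    "mms-tts-hin (Hindi)": (4.2, 7.1, 85),
    "mms-tts-guj (Gujarati)": (4.1, 7.5, 84),
    "vits_rasa_13 (Hindi)": (4.0, 8.5, 82),
    "vits_rasa_13 (Marathi)": (3.9, 8.7, 80),
    "vits_rasa_13 (Punjabi)": (3.8, 9.0, 78),
}
# Direction of each metric: MOS and PESQ are better high, WER is better low.
HIGHER_IS_BETTER = (True, False, True)


def normalize(scores, higher_is_better):
    """Min-max scale each metric to [0, 1], where 1.0 is the best observed value."""
    columns = list(zip(*scores.values()))
    scaled = {}
    for name, row in scores.items():
        values = []
        for value, column, better_high in zip(row, columns, higher_is_better):
            lo, hi = min(column), max(column)
            if hi == lo:
                values.append(1.0)  # no spread: every model ties on this metric
            elif better_high:
                values.append((value - lo) / (hi - lo))
            else:
                values.append((hi - value) / (hi - lo))
        scaled[name] = values
    return scaled


normalized = normalize(SCORES, HIGHER_IS_BETTER)
for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

Min-max scaling exaggerates small gaps when the models are close, so the raw table should still be read alongside any plot built this way.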

Key Insights

The ai4bharat/indic-parler-tts model performed the best overall, achieving the highest MOS (4.5), the lowest WER (5.2%), and the highest PESQ score (89).
The facebook/mms-tts-hin model performed well for Hindi (MOS 4.2), with a higher WER (7.1%) than ai4bharat/indic-parler-tts but a lower WER than the ai4bharat/vits_rasa_13 Hindi model (8.5%).
The ai4bharat/vits_rasa_13 models showed consistent performance across Hindi, Marathi, and Punjabi (MOS 3.8 to 4.0) but had the highest WER values in the comparison (8.5% to 9.0%), indicating challenges in pronunciation accuracy.

Conclusion

Evaluating TTS models for Indic languages provides crucial insights into their strengths and limitations. While ai4bharat/indic-parler-tts stands out in overall performance, the facebook/mms-tts models offer competitive quality for Hindi and Gujarati. However, models still struggle with complex phonetic variations, which shows up as a higher WER in some cases, most visibly for the vits_rasa_13 models.

Future improvements in Indic TTS could focus on:

Better prosody modeling to capture regional accents and intonations.
Data augmentation using diverse speakers to improve robustness.
Optimization for low-resource devices to enhance real-time usability.

As the demand for high-quality Indic TTS grows, continuous benchmarking and refinements will be key to ensuring seamless, natural, and intelligible speech synthesis.

Stay tuned for more updates on AI-driven language technology and model evaluations!

Open-Source Contribution

For detailed information, implementation guides, and benchmarking scripts, visit our public GitHub repository:

GitHub Repository: Text-to-Speech on GitHub

This repository includes:

Evaluation scripts for different models.
Pre-trained models for quick testing.
Guidelines for usage.
Steps to contribute and improve the evaluation framework.

We encourage developers and researchers to fork the repository, experiment with different architectures, and contribute improvements by submitting pull requests.

Evaluating Text-to-Speech (TTS) Models for Indic Languages