Evaluation of Audio Classification Models

Himani

Introduction
Audio classification is a core task in machine learning that enables systems to recognize and categorize different types of audio signals. It has widespread applications, including speech recognition, emotion detection, language identification, and sound event detection. In this blog, we evaluate six audio classification models to understand their capabilities and performance across these domains [1].
Why is Audio Classification Important?
Audio classification plays a significant role in several real-world applications:
- Gender Classification: Used in voice assistants, customer support, and accessibility tools [2].
- Emotion Classification: Helps in mental health monitoring, human-computer interaction, and customer sentiment analysis [3].
- Language Detection: Essential for multilingual applications, automated transcription services, and content moderation [4].
- Sound Detection: Used in security surveillance, wildlife monitoring, and smart home automation [5].
Evaluated Models
We have selected six models for evaluation, each excelling in a specific aspect of audio classification:
| Model | Feature | Architecture | Use Cases |
|---|---|---|---|
| Audeering/wav2vec2 | Gender Classification | Wav2Vec2 | Voice Assistants, Customer Support [1] |
| FunASR/emotion2vec | Emotion Recognition | Custom Deep Learning | Sentiment Analysis, Human-Computer Interaction [3] |
| Superb/wav2vec2 | Emotion Recognition | Wav2Vec2 | Sentiment Analysis, Human-Computer Interaction [2] |
| MIT/ast-finetuned-audioset | Sound Detection | Audio Spectrogram Transformer (AST) | Smart Home Automation [4] |
| Facebook-MMS | Language Identification | Transformer | Voice Assistants, Automatic Transcription [5] |
| OpenAI-Whisper | Language Identification | Transformer | Voice Assistants, Automatic Transcription [6] |
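Most of these checkpoints can be tried out with a few lines of code. The sketch below uses the Hugging Face `transformers` audio-classification pipeline; the audio path is a placeholder, and note that some checkpoints (e.g. Audeering's age/gender model, which uses a custom regression head) may need model-specific loading rather than this generic pipeline.

```python
def classify_audio(audio_path: str, model_id: str):
    """Classify one audio file with a Hugging Face audio-classification pipeline.

    Returns a list of {"label": ..., "score": ...} dicts, best score first.
    """
    # Deferred import so this helper can be defined without transformers installed.
    from transformers import pipeline

    clf = pipeline("audio-classification", model=model_id)
    return clf(audio_path)
```

For example, `classify_audio("sample.wav", "superb/wav2vec2-base-superb-er")` would return the emotion labels ranked by score for that clip.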
Comparative Analysis
| Model | Processing Time (s) | Accuracy | Model Size (MB) | License |
|---|---|---|---|---|
| Audeering/wav2vec2 | 79.3144 | High | 120 | Apache 2.0 |
| FunASR/emotion2vec | 20.0636 | Medium | 85 | MIT |
| Superb/wav2vec2 | 5.0744 | High | 98 | Apache 2.0 |
| MIT/ast-finetuned-audioset | 13.2893 | Medium | 150 | Apache 2.0 |
| Facebook-MMS | 119.6986 | High | 200 | CC-BY-SA |
| OpenAI-Whisper | 102.6909 | Very High | 155 | OpenAI |
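Processing times like those above can be measured with a simple wall-clock harness. The sketch below is not the benchmarking script we used; it just illustrates the idea, with the classifier passed in as any callable.

```python
import time


def benchmark(fn, *args, repeats: int = 3) -> float:
    """Time a callable over several runs and return the mean wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Averaging over a few repeats smooths out one-off costs such as disk caching; for a fair comparison, model loading should be kept outside the timed callable (or measured separately), since download and initialization can dominate a single run.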
Check Out Our Results
Audeering/wav2vec2
Input Audio
Output - Predicted Gender: Male
FunASR/emotion2vec and Superb/wav2vec2
Input Audio
Output - FunASR - angry
Output - Superb - ang
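Notice that the two emotion models name the same prediction differently ("angry" vs. the abbreviated "ang"). To compare models side by side, it helps to map outputs onto a shared vocabulary. The mapping below is our own assumption based on the four SUPERB ER classes; it is not defined by either model card.

```python
# Assumed mapping from SUPERB's abbreviated emotion labels to full names.
EMOTION_LABELS = {
    "neu": "neutral",
    "hap": "happy",
    "ang": "angry",
    "sad": "sad",
}


def normalize_emotion(label: str) -> str:
    """Map an abbreviated emotion label to its full name; pass unknowns through."""
    return EMOTION_LABELS.get(label.lower(), label.lower())
```

With this helper, both models' predictions for the clip above normalize to "angry", making agreement easy to check programmatically.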
MIT/ast-finetuned-audioset
Input Audio
Output - Animal
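The AST pipeline actually returns a ranked list of AudioSet labels with confidence scores; "Animal" above is the top-scoring entry. A small helper can filter that list to the confident labels (the threshold value here is our choice, not part of the model):

```python
def top_labels(predictions, threshold: float = 0.1):
    """Keep predictions whose score clears the threshold, best first.

    `predictions` is a list of {"label": str, "score": float} dicts,
    the format returned by the transformers audio-classification pipeline.
    """
    kept = [p for p in predictions if p["score"] >= threshold]
    return sorted(kept, key=lambda p: p["score"], reverse=True)
```

This matters for sound detection in particular, since AudioSet labels are hierarchical and a clip often triggers several related labels (e.g. "Animal" alongside more specific ones like "Dog").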
Facebook-MMS and OpenAI-Whisper
Input Audio
Output - Facebook/MMS - hin
Output - OpenAI/Whisper - hi
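The two outputs above are the same answer in different notations: MMS reports ISO 639-3 codes ("hin") while Whisper reports ISO 639-1 ("hi"), and both mean Hindi. A small mapping (only a few illustrative languages here; extend as needed) lets the two models' predictions be compared directly:

```python
# Partial ISO 639-3 -> ISO 639-1 mapping, enough to compare the two models' outputs.
ISO_639_3_TO_1 = {
    "hin": "hi",  # Hindi
    "eng": "en",  # English
    "fra": "fr",  # French
    "deu": "de",  # German
}


def same_language(mms_code: str, whisper_code: str) -> bool:
    """True when an MMS ISO 639-3 code and a Whisper ISO 639-1 code agree."""
    return ISO_639_3_TO_1.get(mms_code) == whisper_code
```

For production use, a library that covers the full ISO 639 registry would be preferable to a hand-maintained dictionary.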
Conclusion
Audio classification models have diverse applications and are evolving rapidly with advancements in deep learning. By understanding the differences in architecture, training data, and use cases, developers can choose the most suitable model for their specific application [7]. Future improvements in multimodal learning and larger datasets will continue to push the boundaries of audio classification performance.
References
1. Audeering/wav2vec2 - https://huggingface.co/audeering/wav2vec2-large-robust-24-ft-age-gender
2. Superb/wav2vec2 - https://huggingface.co/superb/wav2vec2-base-superb-er
3. FunASR/emotion2vec - https://huggingface.co/emotion2vec/emotion2vec_plus_large
4. MIT/ast-finetuned-audioset - https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
5. Facebook-MMS - https://huggingface.co/facebook/mms-lid-256
6. OpenAI-Whisper - https://huggingface.co/openai/whisper-small
7. Google AudioSet - https://research.google.com/audioset/
Open-Source Contribution
For detailed information, implementation guides, and benchmarking scripts, visit our public GitHub repository:
GitHub Repository: Evaluation of Audio Classification Models on GitHub
Our public repository allows contributions from the community. Feel free to:
- Fork the repository
- Submit pull requests for improvements or bug fixes
- Report issues and suggest enhancements