
Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1600+ Languages

by Josh
November 11, 2025
in AI, Analytics and Automation

How do you build a single speech recognition system that can understand thousands of languages, including many that never had working ASR (automatic speech recognition) models before? Meta AI has released Omnilingual ASR, an open-source speech recognition suite that scales to more than 1,600 languages and can be extended to unseen languages with only a few speech-text examples, without retraining the model.

Data and language coverage

The supervised training data comes from a combined corpus called AllASR, which contains 120,710 hours of labeled speech paired with transcripts across 1,690 languages. The corpus merges several sources: open-source datasets, internal and licensed corpora, partner-created data, and a commissioned collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech across 348 languages, collected through field work with local organizations and speakers in regions such as Africa and South Asia. Prompts are open-ended, so speakers produce natural monologues in their own language instead of reading fixed sentences, which yields more realistic acoustic and lexical variation.

https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/

For self-supervised pre-training, the wav2vec 2.0 encoders are trained on a large unlabeled speech corpus: 3.84M hours of speech with language identification across 1,239 languages, plus another 460K hours without language identification, for a total of roughly 4.3M hours of unlabeled audio. This is still significantly smaller than the 12M hours used by Google's USM, which makes the reported results more interesting from a data-efficiency perspective.


Model family

Omnilingual ASR exposes three model families that all share the same wav2vec 2.0 speech encoder backbone:

  1. SSL encoders (OmniASR W2V)
    Self-supervised wav2vec 2.0 encoders at four sizes:
    • omniASR_W2V_300M with 317,390,592 parameters
    • omniASR_W2V_1B with 965,514,752 parameters
    • omniASR_W2V_3B with 3,064,124,672 parameters
    • omniASR_W2V_7B with 6,488,487,168 parameters
    These models are trained with the standard wav2vec 2.0 contrastive objective. After training, the quantizer is discarded and the encoder is used as a speech representation backbone.
  2. CTC (connectionist temporal classification) ASR models
    CTC models add a simple linear layer on top of the encoder and train end-to-end with a character-level CTC loss. The released CTC models range from 325,494,996 to 6,504,786,132 parameters and reach real-time factors as low as 0.001 for the 300M model on an A100, measured on 30-second audio with batch size 1.
  3. LLM ASR models
    LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a language-model-style Transformer that operates on character-level tokens plus special tokens such as <BOS> and <EOS>. Training uses standard next-token prediction on sequences of the form gs(x), gt(<BOS>), gt(y), gt(<EOS>), where gs is the speech encoder and gt is the text embedding matrix. The LLM ASR family ranges from about 1.63B parameters for omniASR_LLM_300M to 7,801,041,536 parameters for omniASR_LLM_7B. A separate omniASR_LLM_7B_ZS checkpoint with 7,810,900,608 parameters is used for zero-shot ASR.
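The LLM ASR input construction described above can be sketched as follows. This is a minimal illustration with made-up dimensions and random stand-in embeddings, not the actual model code: `g_t` here is a toy character embedding lookup, and the real encoder produces one vector per audio frame rather than five.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16          # hypothetical shared embedding width
n_speech_frames = 5   # frames produced by the speech encoder g_s (stand-in)
vocab = {"<BOS>": 0, "<EOS>": 1, "h": 2, "i": 3}

# g_s(x): speech encoder output, one vector per frame (random stand-in).
speech_emb = rng.normal(size=(n_speech_frames, d_model))

# g_t: text embedding matrix over character-level tokens.
text_emb_matrix = rng.normal(size=(len(vocab), d_model))

def g_t(token: str) -> np.ndarray:
    return text_emb_matrix[vocab[token]]

# Decoder input sequence: g_s(x), g_t(<BOS>), g_t(y), g_t(<EOS>)
target = ["h", "i"]
decoder_input = np.vstack(
    [speech_emb, g_t("<BOS>")] + [g_t(c) for c in target] + [g_t("<EOS>")]
)

print(decoder_input.shape)  # (5 + 1 + 2 + 1, 16) -> (9, 16)
```

Next-token prediction loss is then applied only over the text portion of this sequence.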

All LLM ASR models support optional language conditioning. Languages are represented as {language_code}_{script}, such as eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified script. A learned embedding for the language-script identifier is injected into the decoder input. During training, the language ID token is sometimes dropped, so the model can also operate without explicit language tags at inference.
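The optional conditioning with random tag dropping can be sketched like this. The table, drop probability, and function name are illustrative assumptions, not values from the release:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical embedding width

# Hypothetical learned embedding table keyed by {language_code}_{script}.
lang_table = {
    "eng_Latn": rng.normal(size=d_model),
    "cmn_Hans": rng.normal(size=d_model),
}

def language_prefix(lang_id, p_drop=0.3, training=True):
    """Return the optional language embedding to prepend to the decoder input.

    During training the tag is dropped with probability p_drop, so the model
    also learns to transcribe without an explicit language tag."""
    if lang_id is None or (training and rng.random() < p_drop):
        return np.zeros((0, d_model))  # no conditioning token
    return lang_table[lang_id][None, :]  # one conditioning row

print(language_prefix("eng_Latn", training=False).shape)  # (1, 16)
print(language_prefix(None).shape)                        # (0, 16)
```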

Zero shot ASR with context examples and SONAR

The supervised models cover more than 1,600 languages, but many languages still have no transcribed ASR data at all. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero-shot mode trained with context examples.

During training for the zero-shot variant, the decoder consumes N + 1 speech-text pairs from the same language. The first N pairs act as context and the final pair is the target. All pairs are embedded with the speech encoder and text embedding matrix, then concatenated into a single decoder input sequence. The loss is still next-token prediction on the target transcription. This teaches the decoder to infer the mapping from speech to text in a given language from a small prompt of in-language examples.

At inference, the omniASR_LLM_7B_ZS model can receive a few speech-text examples from any language, including languages not seen in training, and then transcribe new utterances in that language without updating any weights. This is in-context learning for ASR.
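The prompt assembly for the zero-shot mode can be sketched as below. Frame counts, text lengths, and the stand-in embedding functions are all hypothetical; the point is the layout of N context pairs followed by the target audio:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hypothetical embedding width

def embed_speech(n_frames: int) -> np.ndarray:
    # Stand-in for the wav2vec 2.0 encoder g_s.
    return rng.normal(size=(n_frames, d_model))

def embed_text(n_chars: int) -> np.ndarray:
    # Stand-in for character embeddings g_t, plus <BOS>/<EOS>.
    return rng.normal(size=(n_chars + 2, d_model))

# N = 3 in-language context pairs, then the target utterance to transcribe.
context_pairs = [(embed_speech(4), embed_text(6)) for _ in range(3)]
target_speech = embed_speech(5)

parts = []
for speech, text in context_pairs:
    parts += [speech, text]       # each pair: audio frames then transcript
parts.append(target_speech)       # target audio; decoder generates its text
prompt = np.vstack(parts)

print(prompt.shape)  # 3 * (4 + 8) + 5 = 41 rows of width 8
```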

The system includes an example retrieval mechanism based on SONAR, a multilingual, multimodal encoder that projects audio and text into a shared embedding space. The target audio is embedded once, then a nearest-neighbor search over a database of speech-text pairs selects the most relevant examples to include in the context window. This SONAR-based selection improves zero-shot performance compared with random example selection or simple text similarity.
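The retrieval step amounts to a cosine-similarity nearest-neighbor search. A minimal sketch, using random vectors in place of real SONAR embeddings and an assumed embedding width:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hypothetical SONAR embedding width

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Precomputed audio embeddings for a database of speech-text pairs.
db_embeddings = normalize(rng.normal(size=(1000, d)))
# The target utterance is embedded once.
target_embedding = normalize(rng.normal(size=d))

# Cosine similarity reduces to a dot product of unit vectors;
# the top-k neighbors become the in-context examples.
k = 4
scores = db_embeddings @ target_embedding
topk = np.argsort(-scores)[:k]
print(topk.shape)  # indices of the k most relevant context examples
```

A production system would use an approximate nearest-neighbor index instead of a brute-force dot product, but the selection criterion is the same.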


Quality and benchmarks

The omniASR_LLM_7B model achieves a character error rate (CER) below 10 percent for 78 percent of the more than 1,600 supported languages.

The research team reports that on multilingual benchmarks such as FLEURS-102, the 7B LLM ASR model outperforms the 7B CTC model and also surpasses Google USM variants in average character error rate, despite using about 4.3M unlabeled hours instead of 12M and a simpler pre-training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM-style decoder is an effective path to high-coverage multilingual ASR.
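For reference, character error rate is Levenshtein edit distance between reference and hypothesis characters, divided by the reference length. A self-contained implementation (not the evaluation code used in the paper):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # One-row dynamic-programming Levenshtein distance.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return dp[len(h)] / len(r)

print(cer("omnilingual", "omnilingual"))  # 0.0
print(round(cer("hello world", "helo world"), 3))  # one deletion over 11 chars
```

A CER below 0.10 means fewer than one character in ten is inserted, deleted, or substituted relative to the reference transcript.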

Key Takeaways

  1. Omnilingual ASR provides open-source ASR coverage for more than 1,600 languages and can generalize to more than 5,400 languages through zero-shot in-context learning.
  2. The models are built on large-scale wav2vec 2.0 encoders pre-trained on about 4.3M hours of unlabeled audio spanning 1,239 identified languages, plus additional speech without language identification.
  3. The suite includes wav2vec 2.0 encoders, CTC ASR, LLM ASR, and a dedicated zero-shot LLM ASR model, with encoder sizes from 300M to 7B parameters and LLM ASR models up to about 7.8B parameters.
  4. The 7B LLM ASR model achieves a character error rate below 10 percent on 78 percent of the more than 1,600 supported languages, competitive with or better than prior multilingual systems in low-resource settings.

Omnilingual ASR is a significant systems-level contribution because it treats multilingual ASR as an extensible framework rather than a fixed language list. It combines a 7B wav2vec 2.0 encoder, CTC and LLM ASR decoders, and a zero-shot LLM ASR model that can adapt to new languages from a few in-context examples, achieves a character error rate below 10 percent on 78 percent of more than 1,600 supported languages, and releases everything under Apache 2.0 and CC BY 4.0. Overall, this launch establishes Omnilingual ASR as the most extensible open-source speech recognition system currently available.


Check out the Paper, Repo, and Technical details.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
