In the field of artificial intelligence, multilingual speech recognition and translation have become important tools for promoting global communication. However, developing models that can accurately transcribe and translate multiple languages poses significant challenges. These challenges include handling linguistic nuances across languages, maintaining high accuracy, ensuring low latency, and deploying models efficiently across a range of devices.
To address these challenges, NVIDIA AI has open-sourced two models: Canary 1B Flash and Canary 180M Flash. These models are designed for multilingual speech recognition and translation, supporting languages such as English, German, French, and Spanish. They are released under the permissive CC-BY-4.0 license, making them available for commercial use and encouraging innovation within the AI community.
Technically, both models use an encoder-decoder architecture. The encoder is based on FastConformer, which efficiently processes audio features, while the Transformer decoder handles text generation. Task-specific tokens, including tokens for the target language, the task (transcription or translation), timestamps, and punctuation and capitalization (PnC), guide the model's output.
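As a rough illustration of how such control tokens condition a decoder, the sketch below assembles a prompt prefix from task settings. The exact token strings and ordering are assumptions for illustration, not the official Canary vocabulary.

```python
# Illustrative sketch (NOT the official NeMo/Canary API): assembling
# task-specific control tokens into a decoder prompt prefix.
# The token strings below are assumed for illustration.

def build_prompt(source_lang: str, task: str, target_lang: str,
                 timestamps: bool = False, pnc: bool = True) -> str:
    """Compose the control-token prefix that conditions the decoder."""
    tokens = [
        f"<|{source_lang}|>",                      # language of the input audio
        f"<|{task}|>",                             # "transcribe" or "translate"
        f"<|{target_lang}|>",                      # language of the output text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
        "<|pnc|>" if pnc else "<|nopnc|>",         # punctuation & capitalization
    ]
    return "".join(tokens)

# English audio, translated into German, without timestamps, with PnC:
prompt = build_prompt("en", "translate", "de", timestamps=False, pnc=True)
print(prompt)  # <|en|><|translate|><|de|><|notimestamps|><|pnc|>
```

Because the task is encoded entirely in the prompt, the same weights can switch between transcription and translation at inference time simply by changing these tokens.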
Performance metrics show that the Canary 1B Flash model achieves inference speeds of over 1000 RTFx on the Open ASR Leaderboard datasets, enabling real-time processing. On English Automatic Speech Recognition (ASR), it achieved a word error rate (WER) of 1.48% on the LibriSpeech Clean dataset and 2.87% on LibriSpeech Other. For multilingual ASR, the model achieved WERs of 4.36% for German, 2.69% for Spanish, and 4.47% for French on the MLS test set. In Automatic Speech Translation (AST), the model showed strong performance on the FLEURS test set, with BLEU scores of 32.27 for English to German, 22.6 for English to Spanish, and 41.22 for English to French.
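WER, the metric behind these ASR numbers, is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
```

In practice, transcripts are normalized (lowercasing, punctuation removal) before scoring; libraries such as jiwer handle this, but the core computation is the edit distance above.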

The smaller Canary 180M Flash model also delivers impressive results, with inference speeds of over 1200 RTFx. On English ASR, it achieved a WER of 1.87% on LibriSpeech Clean and 3.83% on LibriSpeech Other. For multilingual ASR, the model recorded WERs of 4.81% for German, 3.17% for Spanish, and 4.75% for French on the MLS test set. On the AST task, it achieved BLEU scores of 28.18 for English to German, 20.47 for English to Spanish, and 36.66 for English to French on the FLEURS test set.
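The RTFx figures quoted for both models are an inverse real-time factor: seconds of audio processed per second of wall-clock time. The arithmetic is simple but worth making concrete:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed
    per second of wall-clock compute time."""
    return audio_seconds / processing_seconds

# At 1200 RTFx, an hour of audio takes about 3 seconds to process:
print(3600 / 1200)        # 3.0 seconds of compute
print(rtfx(3600, 3.0))    # 1200.0
```

Any value above 1.0 means faster than real time; at four orders of magnitude below the quoted speeds, even streaming use cases leave ample headroom.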
Both models support word-level and segment-level timestamps, enhancing their utility in applications that require precise alignment between audio and text. Their compact size makes them suitable for on-device deployment, allowing offline processing and reducing dependency on cloud services. Furthermore, their robustness leads to fewer hallucinations during translation tasks, ensuring more reliable output. The open-source release under the CC-BY-4.0 license encourages commercial use and further development by the community.
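To make the two timestamp granularities concrete, the sketch below shows one plausible shape for such output and how word spans can be grouped under their segment. The dictionary layout is an assumption for illustration, not the models' actual output format.

```python
# Illustrative sketch (assumed output shape, NOT the official API):
# segment-level and word-level timestamps for audio-text alignment.

segments = [
    {"start": 0.00, "end": 2.40, "text": "Hello and welcome."},
    {"start": 2.40, "end": 4.10, "text": "Let's get started."},
]
words = [
    {"start": 0.00, "end": 0.55, "word": "Hello"},
    {"start": 0.55, "end": 0.80, "word": "and"},
    {"start": 0.80, "end": 2.40, "word": "welcome."},
    {"start": 2.40, "end": 2.95, "word": "Let's"},
]

def words_in_segment(words, seg):
    """Select the words whose start times fall inside a segment's range."""
    return [w["word"] for w in words if seg["start"] <= w["start"] < seg["end"]]

print(words_in_segment(words, segments[0]))  # ['Hello', 'and', 'welcome.']
```

Word-level spans enable features like karaoke-style highlighting or precise audio search, while segment-level spans are enough for subtitle cues.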
In summary, NVIDIA’s open-sourcing of the Canary 1B Flash and Canary 180M Flash models represents a significant advance in multilingual speech recognition and translation. Their high accuracy, real-time processing capabilities, and suitability for on-device deployment address many of the existing challenges in the field. By releasing these models, NVIDIA not only demonstrates its commitment to advancing AI research but also empowers developers and organizations to build more inclusive and efficient communication tools.
Check out the Canary 1B Flash and Canary 180M Flash models. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform noted for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, demonstrating its popularity among readers.