NVIDIA and Mozilla recently announced the release of Mozilla Common Voice 7.0, which contains more than 13,000 hours of crowdsourced voice data and adds 16 languages. Compared with the previous update, the volume of speech material in the collection has grown by almost 50%.
The number of supported languages has increased from 60 to 76; languages supported for the first time include Belarusian, Kazakh, Uzbek, Bulgarian, Armenian, Azerbaijani, and Bashkir.
For those unfamiliar with Common Voice, it is the largest open voice dataset in the world and is designed to democratize voice technology. It is used by researchers, academics, and developers worldwide.
Contributors mobilize their own communities to donate voice data to MCV's public database, which anyone can use to train voice-enabled technology. As part of NVIDIA's collaboration with Mozilla Common Voice, models trained on this and other public datasets are available for free through an open source toolkit called NVIDIA NeMo.
The project aims to organize a joint effort to accumulate a database of voice samples that reflects the full variety of voices and ways of speaking. The accumulated database, with recordings of different pronunciations of phrases typical of human speech, can be used without restriction in machine learning systems and research projects.
According to the author of the Vosk continuous speech recognition library, the shortcomings of the Common Voice dataset are the one-sidedness of the voice material (a predominance of men in their 20s and 30s and a lack of recordings from women, children, and the elderly), the limited vocabulary variability (the same phrases are repeated), and the distribution of recordings in MP3, a format prone to distortion.
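The demographic imbalance described above is something a user of the dataset could check for themselves. The following is a minimal sketch, assuming the clip metadata is a tab-separated file with `age` and `gender` columns; the column names, the sample rows, and the helper `tally_demographics` are illustrative assumptions, not taken from the announcement.

```python
import csv
import io
from collections import Counter


def tally_demographics(tsv_file):
    """Count clips per (gender, age) pair; blank fields are counted as 'unknown'."""
    counts = Counter()
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        gender = (row.get("gender") or "").strip() or "unknown"
        age = (row.get("age") or "").strip() or "unknown"
        counts[(gender, age)] += 1
    return counts


# Tiny made-up sample standing in for a real metadata file (assumed layout).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tage\tgender\n"
    "a1\tclip1.mp3\tHello there\ttwenties\tmale\n"
    "a2\tclip2.mp3\tHello there\ttwenties\tmale\n"
    "a3\tclip3.mp3\tGood morning\tthirties\tmale\n"
    "a4\tclip4.mp3\tGood morning\tsixties\tfemale\n"
    "a5\tclip5.mp3\tHello there\t\t\n"
)

demographics = tally_demographics(io.StringIO(SAMPLE_TSV))
for (gender, age), n in demographics.most_common():
    print(f"{gender:8s} {age:10s} {n}")
```

For a real release, the `io.StringIO` object would be replaced with an opened metadata file; a heavily skewed tally would correspond to the one-sidedness criticized above.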
About the new version of Common Voice 7.0
For this release, more than 75 thousand people took part in preparing the English materials, dictating 2,637 hours of validated speech (up from 66 thousand participants and 1,686 hours in the previous release).
As mentioned at the beginning, this version adds 16 new languages to the Common Voice dataset for a total of 76. The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).
The languages with the largest relative growth are Thai (almost 20 times, from 12 hours to 250 hours), Luganda (9 times, from 8 hours to 80 hours), Esperanto (more than 7 times, from 100 hours to 840 hours), and Tamil (more than 8 times, from 24 hours to 220 hours). Notably, Kinyarwanda ranks second in accumulated data, with 2,260 hours collected. The dataset now features over 182,000 unique voices, a 25% growth in the contributor community in just six months.
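The growth figures quoted above can be recomputed from the hour counts as a quick sanity check. This is a small sketch that uses only numbers from the article; the helper names are illustrative. Because the published hour counts appear to be rounded, the computed multiples come out somewhat higher than the quoted ones.

```python
# Hours before and after the 7.0 release, as quoted in the article.
hours = {
    "Thai": (12, 250),
    "Luganda": (8, 80),
    "Esperanto": (100, 840),
    "Tamil": (24, 220),
}


def growth_factor(before, after):
    """How many times larger the new value is than the old one."""
    return after / before


def percent_increase(before, after):
    """Relative increase expressed in percent."""
    return (after - before) / before * 100


factors = {lang: growth_factor(b, a) for lang, (b, a) in hours.items()}
ranked = sorted(factors.items(), key=lambda item: item[1], reverse=True)

for lang, factor in ranked:
    print(f"{lang:10s} x{factor:.1f}")

# English validated hours: 1,686 in the previous release, 2,637 in this one.
english_hours_pct = percent_increase(1686, 2637)
print(f"English validated hours grew by {english_hours_pct:.1f}%")
```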
As part of its participation in the project, NVIDIA has also prepared ready-to-use trained models for machine learning systems based on the collected data (compatible with PyTorch). The models are distributed as part of the free and open NVIDIA NeMo toolkit, which is already used, for example, in the automated voice services of MTS and Sberbank.
The models target speech recognition, speech synthesis, and natural language processing, and they can be useful to researchers designing voice dialogue systems, transcription platforms, and automated call centers. Unlike previously available projects, the published models are not limited to English recognition and cover a variety of languages, accents, and forms of speech.
Finally, if you are interested in learning more, you can check the details at the following link.