Silero, a neural network speech synthesis system

A few days ago, a new public version of Silero Text-to-Speech, a neural network speech synthesis system, was announced. The main goal of the project is to create a modern, high-quality speech synthesis system that is not inferior to the commercial solutions of large corporations and is available to everyone without expensive server equipment.

The models are distributed under the GNU AGPL license, but the company that develops the project does not disclose how the models are trained. To run them, you can use PyTorch and frameworks that support the ONNX format.

Currently, Silero has models for English, Spanish, German, Russian, French, Ukrainian, Tatar, Uzbek, and Bashkir, among others.

Voice synthesis in Silero is based on deeply modified modern neural network algorithms and digital signal processing methods.

The developers note that the main problem with modern neural network solutions for speech synthesis is that they are often available only as part of paid cloud services, while public products have high hardware requirements, are of lower quality, or are not finished, ready-to-use products. For example, running one of the popular new end-to-end synthesis architectures, VITS, in synthesis mode (i.e., not for model training) requires video cards with more than 16 gigabytes of VRAM.

Contrary to the current trend, Silero's solutions run successfully even on a single x86 thread of an Intel processor with AVX2 instructions. On 4 processor threads, Silero can synthesize 30-60 seconds of audio per second in 8 kHz mode, 15-20 seconds in 24 kHz mode, and about 10 seconds in 48 kHz mode.
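
To put the throughput figures quoted above in perspective, the short sketch below converts them into the wall-clock time needed for a given amount of audio. It is only arithmetic on the numbers from the article; the function names and the table are illustrative and are not part of Silero.

```python
# Throughput quoted in the article for 4 CPU threads, expressed as
# (low, high) seconds of audio produced per second of wall-clock time.
QUOTED_THROUGHPUT = {
    8_000: (30.0, 60.0),
    24_000: (15.0, 20.0),
    48_000: (10.0, 10.0),
}


def time_to_synthesize(audio_seconds, sample_rate):
    """Return (best, worst) wall-clock seconds for the given audio length."""
    low, high = QUOTED_THROUGHPUT[sample_rate]
    return audio_seconds / high, audio_seconds / low


def real_time_factor(audio_seconds, wall_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


if __name__ == "__main__":
    for rate in QUOTED_THROUGHPUT:
        best, worst = time_to_synthesize(60.0, rate)
        print(f"{rate} Hz: one minute of audio in {best:.1f}-{worst:.1f} s")
```

By these figures, one minute of 24 kHz audio would take roughly 3 to 4 seconds on 4 threads.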

Main novelties of the new version of Silero

In this new version, the model size has been halved to 50 megabytes and the models have become 10 times faster. For example, in 24 kHz mode they can synthesize up to 20 seconds of audio per second on 4 processor threads.

In addition, the models know how to pause, can accept full paragraphs of text as input, support SSML tags, and package all speech options for a language into a single model.
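
As a rough illustration of paragraph input with pauses expressed in SSML, the sketch below wraps plain-text paragraphs in standard W3C SSML elements (`speak`, `p`, `break`). The article does not say which SSML subset Silero accepts or how the markup is passed to the model, so the helper `paragraphs_to_ssml` and the choice of tags are assumptions made for the example.

```python
from xml.sax.saxutils import escape


def paragraphs_to_ssml(paragraphs, pause_ms=500):
    """Wrap paragraphs in SSML, inserting a pause between consecutive ones.

    Hypothetical helper: it only builds a string and does not call Silero.
    """
    parts = ["<speak>"]
    for index, text in enumerate(paragraphs):
        if index > 0:
            # Standard SSML pause element; the duration is given in milliseconds.
            parts.append(f'<break time="{int(pause_ms)}ms"/>')
        # Escape &, < and > so the text cannot break the markup.
        parts.append(f"<p>{escape(text)}</p>")
    parts.append("</speak>")
    return "".join(parts)


if __name__ == "__main__":
    print(paragraphs_to_ssml(["First paragraph.", "Second paragraph."]))
```

Escaping the input matters because stray `&` or `<` characters in ordinary text would otherwise make the markup invalid.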

It is also highlighted that synthesis works at three sample rates to choose from: 8, 24, and 48 kilohertz. The “childhood problems” of instability and omitted words have been solved, and flags have been added to control the automatic placement of stress marks and of the letter “ё”.

On the other hand, some systemic problems inherent to Silero synthesis are also mentioned:

Unlike more traditional synthesis solutions like RHVoice, Silero's synthesis lacks SAPI integration, easy-to-install clients, and Windows and Android integrations.
The speed, while unprecedented for such a solution, may not be enough for high-quality on-the-fly synthesis on weak processors.
The automatic stress placement does not handle homographs and still makes errors, but this will be fixed in future releases.
The current version of the synthesis doesn't work on processors without AVX2 instructions (or you need to specifically change the PyTorch configuration), because one of the modules inside the model is quantized.
The current version of the synthesis essentially has a single dependency, PyTorch.
The libtorch build available for mobile platforms is much more cumbersome than the ONNX runtime, but an ONNX version of the model is not provided yet.
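
Since the AVX2 requirement listed above decides whether the current models run at all, it can be worth checking a machine before installing anything. The sketch below is one way to do that on Linux by reading the `flags` line of `/proc/cpuinfo`. It is not part of Silero, the function names are illustrative, and other operating systems expose CPU features differently.

```python
from pathlib import Path


def has_cpu_flag(cpuinfo_text, flag="avx2"):
    """Return True if `flag` appears in the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flags are whitespace-separated; compare whole tokens only.
            return flag in value.split()
    return False


def host_supports_avx2(path="/proc/cpuinfo"):
    """Return True or False on Linux, or None if the file cannot be read."""
    try:
        return has_cpu_flag(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("AVX2 available:", host_supports_avx2())
```

A result of `None` only means the check could not be performed, not that AVX2 is missing.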

Finally, it is mentioned that the next version will be released in the near future with the following changes:

The synthesis speed will increase by another 2 to 4 times.
Synthesis models for CIS languages (Kalmyk, Tatar, Uzbek, and Ukrainian) will be updated.
Models for European languages will be added.
Models for Indian languages will be added.
Models for English will be added.

If you are interested in knowing more about it, you can check the details in the following link.

LinuxAdictos
