Google has just released an impressive new research paper on audio generation, titled “SoundStorm: Efficient Parallel Audio Generation” — here is what you need to know. This groundbreaking model, developed by Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi, introduces a highly efficient, non-autoregressive approach to audio generation. By leveraging bidirectional attention and confidence-based parallel decoding, SoundStorm not only produces audio of exceptional quality but also runs two orders of magnitude faster than previous methods.
Here Are the Official Links Related to Google SoundStorm:
Here Are Demo Examples from Google:
Step 1: Provide an original voice recording as input (a short clip used to clone your voice).
Step 2: Provide text — essentially a script. You will get this text spoken back in your own cloned voice.
Example text:
Something really funny happened to me this morning. | Oh wow, what? | Well, uh I woke up as usual. | Uhhuh | Went downstairs to have uh breakfast. | Yeah | Started eating. Then uh 10 minutes later I realized it was the middle of the night. | Oh no way, that’s so funny! |
Step 3: The output in your cloned voice (text to audio, but in your own voice).
Example output:
You can see how amazing this is: only a 3–4 second audio clip is needed as input, and from just those few seconds SoundStorm can clone a person’s voice. It is genuinely mind-blowing that such a short sample is enough.
Google is a very large company that consistently surprises us with amazing technology, and here are some of the notable features of SoundStorm.
Note: these features are described purely on the basis of the research paper; they are not final features of an actual product.
SoundStorm revolutionizes the field of audio generation by addressing the challenge of generating long audio token sequences. Traditional autoregressive models face computational limitations as the sequence length increases, making it difficult to generate high-quality audio efficiently. However, SoundStorm adopts a unique approach by incorporating an architecture tailored to the hierarchical structure of audio tokens, coupled with a parallel, non-autoregressive, confidence-based decoding scheme.
By utilizing a bidirectional attention-based Conformer, SoundStorm efficiently predicts masked audio tokens based on the semantic tokens of AudioLM. This intelligent model allows for rapid generation of audio with the same quality as autoregressive models while significantly improving consistency in voice and acoustic conditions. Astonishingly, SoundStorm can generate 30 seconds of audio in a mere 0.5 seconds on a TPU-v4, showcasing its exceptional speed and efficiency.
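To make the decoding idea concrete, here is a minimal, illustrative sketch of confidence-based parallel decoding in the MaskGIT style that SoundStorm builds on. Everything here is an assumption for illustration — the random “model” logits stand in for the Conformer, and the cosine unmasking schedule and parameter names are not taken from Google’s actual implementation:

```python
import numpy as np

def parallel_decode(seq_len, vocab_size, num_iters=16, seed=0):
    """Sketch of confidence-based parallel decoding (MaskGIT-style).

    All positions start masked. Each iteration, the 'model' scores every
    masked position in parallel, and we commit only the most confident
    predictions, unmasking progressively more positions per round.
    """
    rng = np.random.default_rng(seed)
    MASK = -1
    tokens = np.full(seq_len, MASK)

    for it in range(num_iters):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Stand-in for the bidirectional Conformer: random logits.
        logits = rng.normal(size=(masked.size, vocab_size))
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        preds = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Cosine schedule: the fraction left masked shrinks each step.
        frac_masked = np.cos(np.pi / 2 * (it + 1) / num_iters)
        n_commit = max(1, masked.size - int(frac_masked * seq_len))
        order = np.argsort(-conf)[:n_commit]
        tokens[masked[order]] = preds[order]

    # Commit any stragglers after the final iteration.
    tokens[tokens == MASK] = 0
    return tokens
```

Because every masked position is scored in one forward pass, the number of model calls is the (small, fixed) number of iterations rather than the sequence length — which is where the two-orders-of-magnitude speedup over token-by-token autoregressive decoding comes from.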
One remarkable application of SoundStorm is its ability to synthesize high-quality and natural dialogues. When combined with the text-to-semantic modeling stage of SPEAR-TTS, SoundStorm enables the generation of dialogues controlled by spoken content, speaker voices, and speaker turns. By providing a transcript annotated with speaker turns and a short voice prompt, SoundStorm can effortlessly synthesize dialogue segments of 30 seconds, delivering an impressive runtime of only 2 seconds on a single TPU-v4. It’s important to note that the speakers and text used in this synthesis have not been encountered during the training process.
SoundStorm’s versatility shines through its capability to generate audio conditioned on the semantic tokens of AudioLM. It can produce audio samples with or without 3-second voice prompts. In the unprompted case, SoundStorm samples different speakers while maintaining high consistency in the speaker’s voice when prompted. This remarkable ability, combined with its exceptional speed, sets SoundStorm apart from previous models such as AudioLM’s acoustic generator.
When comparing SoundStorm with existing baselines, it surpasses them in several aspects. In the prompted case, SoundStorm’s audio generations exhibit higher acoustic consistency and better preservation of the speaker’s voice compared to AudioLM. Furthermore, SoundStorm outperforms the RVQ level-wise greedy decoding method with the same model, resulting in audio of superior quality.
As with any technological advancement, it is essential to consider the broader impact of SoundStorm. While SoundStorm empowers researchers and audio enthusiasts to explore new horizons in audio generation, it is crucial to acknowledge that the generated samples may reflect the biases present in the training data, such as accents and voice characteristics. Therefore, responsible AI principles call for thorough analysis and addressing the limitations of the training data. SoundStorm’s ability to mimic voices also raises concerns about potential misuse, including biometric identification bypass and impersonation. Google researchers have taken measures to ensure the detectability of SoundStorm’s generated audio and remain committed to implementing safeguards against misuse.
Speech Intelligibility and Audio Quality: Subjective evaluation experiments have shown that SoundStorm outperforms AudioLM in terms of speech intelligibility and acoustic consistency. By transcribing the generated audio using an automatic speech recognition (ASR) system, it was found that SoundStorm achieved lower word error rates (WER) and character error rates (CER) compared to AudioLM. Moreover, the perceived audio quality, as estimated by a mean opinion score (MOS) estimator, was found to be on par with AudioLM’s acoustic generator, which has been previously shown to match the quality of ground-truth audio.
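For context, the word error rate used in such evaluations is simply a normalized edit distance between the ASR transcript and the reference text. A small self-contained sketch (the metric in general, not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER: Levenshtein distance over words (substitutions + insertions
    + deletions), divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Character error rate (CER) works the same way, just over characters instead of words.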
Voice Preservation and Acoustic Consistency: SoundStorm also excels in preserving the speaker identity of the prompt. By computing cosine similarity between speaker embeddings extracted from the prompt and the generated audio, it was observed that SoundStorm significantly outperformed the AudioLM baseline. Furthermore, when measuring the acoustic consistency drift over time in long audio generations, SoundStorm maintained a high level of consistency with the prompt, while AudioLM exhibited more significant drift.
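The voice-preservation metric itself is straightforward: a cosine similarity between two embedding vectors, where the embeddings come from a pretrained speaker-verification model (not shown here — the toy vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors
    (1.0 = same direction, 0.0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical prompt and generated-audio speaker embeddings:
prompt_emb = np.array([0.2, 0.9, 0.4])
generated_emb = np.array([0.25, 0.85, 0.45])
print(round(cosine_similarity(prompt_emb, generated_emb), 3))
```

A value close to 1.0 means the generated audio’s speaker embedding points in nearly the same direction as the prompt’s, i.e. the voice was preserved.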
Efficiency and Runtimes: One of the most remarkable features of SoundStorm is its efficiency in parallel audio generation. Compared to AudioLM’s acoustic generator, SoundStorm can generate audio two orders of magnitude faster, achieving a real-time factor of 0.017 on a single TPU-v4. The runtime measurements demonstrate that by combining the semantic generation stage of AudioLM with SoundStream, 30 seconds of speech continuation can be generated within just 2 seconds.
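A real-time factor is just generation time divided by the duration of the audio produced (values below 1.0 mean faster than real time). Plugging in the figures quoted above:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Wall-clock generation time divided by duration of audio produced."""
    return generation_seconds / audio_seconds

# 30 seconds of audio generated in 0.5 seconds on a TPU-v4:
rtf = real_time_factor(0.5, 30.0)
print(round(rtf, 3))  # 0.017, matching the reported figure
```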
Optimizing Decoding Steps: Experiments were conducted to optimize the number of decoding iterations in the first RVQ level of SoundStorm. It was found that using 16 iterations resulted in improved audio quality compared to level-wise greedy decoding. However, increasing the number of iterations did not yield further improvements. Additionally, increasing the number of iterations for RVQ levels 2-12 did not significantly impact audio quality.
Conclusion:
The launch of SoundStorm marks a significant milestone in audio generation technology. Its efficiency, combined with remarkable audio quality and consistency, opens up new possibilities in speech synthesis, text-to-speech systems, and music generation. As researchers continue to explore advancements in this field, we anticipate exciting applications and enhanced user experiences. SoundStorm is set to revolutionize audio generation, offering a faster and more efficient approach that will shape the future of creative audio applications.