Abstract.
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency, due to limited natural-language understanding and data scarcity. Additionally, the 2D spatial structures widely used in T2A work lead to unsatisfactory audio quality when generating variable-length samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-An-Audio 2, a latent-diffusion-based T2A method that builds on the success of Make-An-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency: first, we use pre-trained large language models (LLMs) to parse the text into structured `<event& order>` pairs for better temporal information capture; second, we introduce a temporal encoder for these structured captions, alongside the original text encoder, to aid semantic alignment; finally, we adopt a 1D-convolution VAE with a feed-forward Transformer-based diffusion backbone for variable-length generation, and use LLMs to construct complex audio-text data for augmentation.
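As a concrete illustration of the parsing step, here is a minimal Python sketch of how a caption could be turned into the `<event& order>` format used throughout this page. The prompt wording and the `query_llm` callable are illustrative placeholders, not the exact prompt or client used by Make-An-Audio 2.

```python
# Hypothetical sketch of the LLM parsing step; the prompt text and the
# `query_llm` callable are placeholders, not Make-An-Audio 2's actual prompt.
PARSE_PROMPT = (
    "Decompose the audio caption into sound events and their temporal order.\n"
    "Use the format <event& order>, joining pairs with '@', where order is\n"
    "one of 'start', 'mid', 'end', or 'all'.\n"
    "Caption: {caption}\n"
    "Structured caption:"
)

def parse_caption(caption: str, query_llm) -> str:
    """Return e.g. '<wind blowing& all>@<people speaking& mid>@<thunder& end>'."""
    return query_llm(PARSE_PROMPT.format(caption=caption)).strip()
```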
High-level overview of Make-An-Audio 2. Modules marked with a lock are frozen when training the T2A model. The Text Encoder takes the original natural-language caption as input, while the Temporal Encoder takes the LLM-parsed structured caption; structured captions are parsed before training begins. Training proceeds in two stages: in the first stage we train the Audio VAE; in the second stage we train the T2A diffusion module while freezing the parameters of both the Audio VAE and the Text Encoder.
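The two-stage schedule can be summarized in a small, runnable PyTorch sketch. Every module below is a toy stand-in (a linear layer or embedding) for the real Audio VAE, Text Encoder, Temporal Encoder, and diffusion denoiser; only the freezing pattern is the point, not the losses or shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_vae = nn.Linear(80, 16)              # stand-in for the Audio VAE encoder
text_encoder = nn.Embedding(1000, 16)      # stand-in frozen Text Encoder
temporal_encoder = nn.Embedding(1000, 16)  # Temporal Encoder (trainable)
denoiser = nn.Linear(16, 16)               # stand-in diffusion denoiser

# Stage 1: train the Audio VAE alone (loss shown schematically).
vae_opt = torch.optim.Adam(audio_vae.parameters())
mel = torch.randn(4, 80)                   # fake mel-spectrogram batch
vae_opt.zero_grad()
F.mse_loss(audio_vae(mel), torch.zeros(4, 16)).backward()
vae_opt.step()

# Stage 2: freeze the "locked" modules; train denoiser + Temporal Encoder.
for frozen in (audio_vae, text_encoder):
    frozen.requires_grad_(False)
t2a_opt = torch.optim.Adam(
    list(denoiser.parameters()) + list(temporal_encoder.parameters()))
tokens = torch.randint(0, 1000, (4,))      # fake token ids for both captions
with torch.no_grad():
    z = audio_vae(mel)                     # frozen latent target
    c_text = text_encoder(tokens)          # frozen natural-language condition
c_temp = temporal_encoder(tokens)          # trainable structured-caption condition
t2a_opt.zero_grad()
F.mse_loss(denoiser(z + c_text + c_temp), z).backward()
t2a_opt.step()
```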
Below we show original natural-language captions together with the corresponding structured captions used by Make-An-Audio 2, and compare the audio generated by Make-An-Audio 2 against prior T2A systems.
[Audio comparison table: for each input caption, the ground-truth recording and samples generated by Make-An-Audio 2, Make-An-Audio, AudioLDM, and TANGO.]
Trained on variable-length data, and with a 1D-convolution VAE and a feed-forward Transformer-based diffusion backbone, Make-An-Audio 2 can generate variable-length audio without a drop in quality (a toy example of the 1D design follows the audio table).
[Audio comparison table for variable-length generation: for each input caption, samples generated by Make-An-Audio 2, Make-An-Audio, AudioLDM, and TANGO.]
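To see why the 1D design accommodates variable length, the toy PyTorch example below (with illustrative channel counts and strides) encodes mel-spectrograms of different durations with a Conv1d stack: the latent sequence simply scales with the input, and a feed-forward Transformer denoiser can attend over it without any fixed 2D canvas.

```python
import torch
import torch.nn as nn

# Toy 1D "VAE encoder": time is the sequence axis, mel bins are channels.
encoder = nn.Sequential(
    nn.Conv1d(80, 128, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv1d(128, 32, kernel_size=3, stride=2, padding=1),  # overall stride 4
)

for frames in (400, 1000):             # e.g. a short clip vs. a long one
    mel = torch.randn(1, 80, frames)   # (batch, mel bins, time frames)
    z = encoder(mel)
    print(mel.shape, "->", z.shape)    # latent length grows with input length
```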
Because natural language is ambiguous, the time period in which a sound event occurs may not be clearly described; by editing the order tokens in the structured input, we can exert more precise temporal control. The table below shows each original caption with two structured variants (a small helper sketch follows the table).
| Original Input | Structured Input (variant 1) | Structured Input (variant 2) |
|---|---|---|
| Wind blowing followed by people speaking then a loud burst of thunder | `<wind blowing& all>@<people speaking& mid>@<thunder& end>` | `<wind blowing& start>@<people speaking& mid>@<thunder& end>` |
| A train running on railroad tracks followed by a train horn blowing and steam hissing | `<train running on railroad tracks& all>@<train horn blowing& end>@<steam hissing& end>` | `<train running on railroad tracks& all>@<train horn blowing& mid>@<steam hissing& end>` |
| Winds and ocean waves crashing while a chime instrument briefly plays a melody | `<winds& all>@<ocean waves crashing& all>@<chime instrument melody& all>` | `<winds& all>@<ocean waves crashing& all>@<chime instrument melody& mid>` |
| Constant faint humming and a few light knocks | `<constant faint humming& all>@<a few light knocks& end>` | `<constant faint humming& all>@<a few light knocks& start>` |
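A hypothetical helper (not part of the released code) makes the editing operation above explicit: temporal control amounts to swapping an order token in the structured caption.

```python
def structured_caption(events):
    """Build a structured caption from (event, order) pairs,
    where order is one of 'start', 'mid', 'end', or 'all'."""
    return "@".join(f"<{event}& {order}>" for event, order in events)

print(structured_caption([("wind blowing", "all"),
                          ("people speaking", "mid"),
                          ("thunder", "end")]))
# -> <wind blowing& all>@<people speaking& mid>@<thunder& end>
```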
When the LLM parses the original natural-language input, adjectives or quantifiers may be dropped, and the structured input's format is sometimes malformed. Dual text encoders avoid this information loss and remain robust in such situations (see the sketch after the table).
| Original Input | Malformed Structured Input |
|---|---|
| A strong torrent of rain is audible outside of a window | `Sound of strong torrent of rain outside window & all` |
| A motorcycle revving by quickly twice | `<motorcycle revving & all>@<quickly twice & end>` |
| A car moves quickly and is followed by someone walking and other cars | `<car engine revving & start>@<car tires screeching & mid>@<footsteps running & mid>@<other car engines & mid to end>` |
| A metallic swirling and scraping that gets louder and more irregular | `<metallic swirling and scraping & all, getting louder and more irregular>@` |
| A gusting wind with waves crashing in the background from time to time | `<gusting wind & all>@<waves crashing & random intervals>` |
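Here is a minimal sketch of the dual-encoder idea, with toy embeddings standing in for the real text and temporal encoders: both condition sequences are concatenated, so information that is lost or garbled in the structured caption can still reach the denoiser through the natural-language branch.

```python
import torch
import torch.nn as nn

text_encoder = nn.Embedding(1000, 16)    # stand-in for the frozen Text Encoder
struct_encoder = nn.Embedding(1000, 16)  # stand-in for the Temporal Encoder

text_tokens = torch.randint(0, 1000, (1, 12))   # tokenized original caption
struct_tokens = torch.randint(0, 1000, (1, 8))  # tokenized structured caption

c_text = text_encoder(text_tokens)                # (1, 12, 16)
c_struct = struct_encoder(struct_tokens)          # (1, 8, 16)
condition = torch.cat([c_text, c_struct], dim=1)  # (1, 20, 16) cross-attention context
```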
We believe that our work on temporally enhanced T2A generation can serve as a stepping stone for future work on semantically aligned and temporally consistent audio generation, and that our LLM-based approach to constructing complex audio compositions and augmenting data can likewise inspire future work.
At the same time, we acknowledge that Make-An-Audio 2 may have unintended consequences, such as displacing workers in related fields like sound engineering and radio hosting. There are also potential ethical concerns around non-consensual voice cloning and the creation of fake media.