Make-An-Audio 2

Temporal-Enhanced Text-to-Audio Generation

Jiawei Huang^1,2,*, Yi Ren^2,*, Rongjie Huang^1, Dongchao Yang^3, Zhenhui Ye^1,2, Chen Zhang^2, Jinglin Liu^2,*, Xiang Yin^2, Zejun Ma^2, Zhou Zhao^1

^1Zhejiang University, ^2ByteDance, ^3The Chinese University of Hong Kong

^*Equal Contribution

Abstract. Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, the 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-An-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-An-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable-length generation and enhance temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.

Make-An-Audio 2 Overview

High-level overview of Make-An-Audio 2. Modules marked with a lock are frozen when training the T2A model. The Text Encoder takes the original natural language text as input, and the Temporal Encoder takes the LLM-parsed structured caption as its input. The structured inputs are parsed before the training process. The training process is separated into two stages: in the first stage we train the Audio VAE, and in the second stage we train the T2A diffusion module while freezing the parameters of both the Audio VAE and the Text Encoder.

Text-to-Audio generation
Variable-length audio generation
Precise Temporal Control
Comparison between dual text encoders and only structured text encoder
Broader Impact

Text-to-Audio generation

We show the original natural language caption and the corresponding structured caption used by Make-An-Audio 2, and we compare the audio generated by Make-An-Audio 2 with that of prior T2A works.

Input	Ground-truth	Make-An-Audio 2	Make-An-Audio	AudioLDM	TANGO

Variable-length audio generation

Trained on variable-length data and built on a 1D-convolution VAE and a feed-forward Transformer-based diffusion backbone, Make-An-Audio 2 can generate variable-length audio without a drop in performance.
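One reason a 1D-convolutional encoder suits variable-length audio is that a convolution along the time axis accepts any number of frames, and the latent length simply scales with the input length. The sketch below works through that arithmetic using the standard 1D convolution output-length formula. The layer settings (kernel 4, stride 2, padding 1, two layers) and the function names are hypothetical, chosen only for illustration; they are not taken from Make-An-Audio 2.

```python
def conv1d_output_length(num_frames: int, kernel_size: int = 4,
                         stride: int = 2, padding: int = 1) -> int:
    """Standard output length of a 1D convolution applied along the time axis."""
    return (num_frames + 2 * padding - kernel_size) // stride + 1


def latent_length(num_frames: int, num_layers: int = 2) -> int:
    """Time length after stacking several strided 1D convolutions.

    The layer count and hyperparameters are illustrative assumptions.
    """
    for _ in range(num_layers):
        num_frames = conv1d_output_length(num_frames)
    return num_frames


# Inputs of different durations map to latents of proportional length,
# so no fixed-size crop or pad of the spectrogram is needed.
for frames in (500, 624, 1000):
    print(frames, "->", latent_length(frames))
```

With these assumed settings each layer halves the number of frames, so the latent sequence is a quarter of the input length for every duration shown.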

Input	Make-An-Audio 2	Make-An-Audio	AudioLDM	TANGO

Precise Temporal Control

Because natural language is ambiguous, the time period in which a sound event occurs may not be clearly described. We can provide more precise temporal control by modifying the order in the structured input.

Original Input	Structured Input	Modified Structured Input
Wind blowing followed by people speaking then a loud burst of thunder	<wind blowing& all>@<people speaking& mid>@<thunder& end>	<wind blowing& start>@<people speaking& mid>@<thunder& end>
A train running on railroad tracks followed by a train horn blowing and steam hissing	<train running on railroad tracks& all>@<train horn blowing& end>@<steam hissing& end>	<train running on railroad tracks& all>@<train horn blowing& mid>@<steam hissing& end>
Winds and ocean waves crashing while a chime instrument briefly plays a melody	<winds& all>@<ocean waves crashing& all>@<chime instrument melody& all>	<winds& all>@<ocean waves crashing& all>@<chime instrument melody& mid>
Constant faint humming and a few light knocks	<constant faint humming & all>@<a few light knocks & end>	<constant faint humming & all>@<a few light knocks & start>
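The structured captions in the table above can be edited programmatically. The following is a minimal sketch, assuming the format is a list of `<event& order>` pairs joined by `@`, as the examples suggest. The function names and the regular expression are hypothetical helpers written for this page, not part of the Make-An-Audio 2 code.

```python
import re
from typing import List, Tuple

# One "<event & order>" pair; spacing around "&" varies in the examples.
PAIR_RE = re.compile(r"<\s*([^<>&]+?)\s*&\s*([^<>&]+?)\s*>")


def parse_structured_caption(caption: str) -> List[Tuple[str, str]]:
    """Split a structured caption into (event, order) tuples."""
    pairs = []
    for chunk in caption.split("@"):
        match = PAIR_RE.fullmatch(chunk.strip())
        if match is None:
            raise ValueError(f"malformed pair: {chunk!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


def format_structured_caption(pairs: List[Tuple[str, str]]) -> str:
    """Join (event, order) tuples back into the structured format."""
    return "@".join(f"<{event}& {order}>" for event, order in pairs)


def set_order(caption: str, event: str, order: str) -> str:
    """Return a copy of the caption with one event's order tag replaced."""
    pairs = [(e, order if e == event else o)
             for e, o in parse_structured_caption(caption)]
    return format_structured_caption(pairs)


original = "<wind blowing& all>@<people speaking& mid>@<thunder& end>"
print(set_order(original, "wind blowing", "start"))
```

Applied to the first row, changing the order of "wind blowing" from `all` to `start` reproduces the modified structured input shown in the table.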

Comparison between dual text encoders and only structured text encoder

When the LLM parses the original natural language input, some adjectives or quantifiers may be lost, and sometimes the format of the structured input is incorrect. Dual text encoders avoid this information loss and are more robust in these situations.

Original Input	Wrongly Structured Input	Dual text encoders	Only structured text encoder
A strong torrent of rain is audible outside of a window	<strong>Sound of strong torrent of rain outside window & all</strong>
A motorcycle revving by quickly twice	<motorcycle revving & all>@<quickly twice & end>
A car moves quickly and is followed by someone walking and other cars	<car engine revving & start>@<car tires screeching & mid>@<footsteps running & mid>@<other car engines & mid to end>
A metallic swirling and scraping that gets louder and more irregular	<metallic swirling and scraping & all, getting louder and more irregular>@
A gusting wind with waves crashing in the background from time to time	<gusting wind & all>@<waves crashing & random intervals>
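To make "incorrect format" concrete, the sketch below flags syntactic problems in a structured caption. It assumes the only valid order tags are the four that appear in the well-formed examples on this page (`start`, `mid`, `end`, `all`); that tag set, the regular expression, and the function name are illustrative assumptions. This check is not described as part of Make-An-Audio 2, which instead relies on the dual text encoders for robustness.

```python
import re
from typing import List

# Assumption: the order tags seen in well-formed examples on this page.
ALLOWED_ORDERS = {"start", "mid", "end", "all"}

# One "<event & order>" pair; spacing around "&" varies in the examples.
PAIR_RE = re.compile(r"<\s*([^<>&]+?)\s*&\s*([^<>&]+?)\s*>")


def format_problems(structured: str) -> List[str]:
    """Return a list of syntactic problems; an empty list means well-formed."""
    problems = []
    for chunk in structured.split("@"):
        chunk = chunk.strip()
        match = PAIR_RE.fullmatch(chunk)
        if match is None:
            problems.append(f"not an <event & order> pair: {chunk!r}")
        elif match.group(2) not in ALLOWED_ORDERS:
            problems.append(f"unknown order tag: {match.group(2)!r}")
    return problems


examples = [
    "<wind blowing& all>@<people speaking& mid>@<thunder& end>",
    "<gusting wind & all>@<waves crashing & random intervals>",
    "<metallic swirling and scraping & all, getting louder and more irregular>@",
]
for caption in examples:
    print(caption, "->", format_problems(caption))
```

A check like this only catches format errors. It cannot detect a dropped adjective or quantifier, which is the kind of loss the original-text encoder is meant to cover.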

Broader Impact

We believe that our T2A work on temporal enhancement can serve as an important stepping stone for future work on generating semantically aligned and temporally consistent audio. Our approach of constructing complex audio and augmenting the data with LLMs may also provide inspiration for future work.
At the same time, we acknowledge that Make-An-Audio 2 may lead to unintended consequences, such as increased unemployment for individuals in related fields like sound engineering and radio hosting. Furthermore, there are potential ethical concerns regarding non-consensual voice cloning and the creation of fake media.