TTS | An overview of speech synthesis: fundamentals and a brief introduction

yliu277, published 2024-03-31 from Hubei


Text-to-Speech (usually abbreviated as TTS) is a technology that converts text into audio.

This article covers the following:

- A brief history of speech synthesis
- Text analysis in speech synthesis
- Types of acoustic model
- Vocoders in speech synthesis
- End-to-end speech synthesis

1. History

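The first "talking machine" was probably built in the late 18th century (reportedly invented by a Hungarian scientist). Computer-aided synthesis originated in the mid-20th century, and a variety of techniques have been in use for roughly 50 years. The older techniques can be classified as follows.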

1) Articulatory synthesis: a technique that simulates the human lips, tongue, and vocal organs.

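2) Formant synthesis: the human voice can be regarded as sound produced by filtering certain sounds in the vocal organs. This is the so-called source-filter model: various filters are applied to a basic sound (for example a single pitch) to make it sound like a human voice (this is referred to as additive synthesis).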

3) Concatenative synthesis: a data-driven approach. As a simple example, you can record the spoken digits 0 to 9 and read out a telephone number by concatenating those recordings. The result, however, is not very natural or fluent.

4) Statistical parametric speech synthesis (SPSS): an approach that builds an acoustic model, estimates the model parameters, and uses them to generate audio. It can be divided roughly into three parts.

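First comes "text analysis", which converts the input text into linguistic features; then the "acoustic model", which converts linguistic features into acoustic features; and finally the vocoder, which converts acoustic features into audio. The most widely used acoustic model in this field was the hidden Markov model (HMM). With HMMs it became possible to produce better acoustic features than before. However, most of the generated audio still sounded mechanical, like a robot voice.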

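5) Neural TTS: as the deep learning era began in the 2010s, several new neural-network-based models were developed. These gradually replaced HMMs in the "acoustic model" part and steadily improved the quality of generated speech. In a sense this can be seen as an evolution of SPSS, but as model performance improved, development moved toward simplifying the three components above. In the figure below, for example, development proceeds from top (0) to bottom (4).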

The models available today fall roughly into three types:

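- Acoustic model: takes characters (text) or phonemes (units of pronunciation) as input and produces some acoustic feature. Nowadays the acoustic feature is, in most cases, the mel-spectrogram.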

- Vocoder: takes a mel-spectrogram (or a similar spectrogram) as input and generates the actual audio.

- Fully end-to-end TTS model: takes characters or phonemes as input and generates audio directly.

2. Text analysis

Text analysis converts character text into linguistic features. The following problems have to be considered:

1) Text normalization: converting abbreviations and numbers into their spoken form, for example turning 1989 into "一九八九" ("one nine eight nine").

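2) Word segmentation: this is essential in character-based languages such as Chinese. For example, it decides from context whether "包" should be treated as a word on its own or as part of a word such as "書包" (schoolbag) or "包子" (steamed bun).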

3) Part-of-speech tagging: identifying verbs, nouns, prepositions, and so on.

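4) Prosody prediction: prosody is the word for the subtle feel of which parts of a sentence are stressed, how the length of each part varies, and how the intonation changes. Without it, the output really does sound like "a robot talking". Languages differ considerably in this respect (English, for instance, is stress-based), but the difference is one of degree: every language has its own prosody. If this prosody can be predicted from the text, it certainly helps. For example, a "?" at the end of a sentence naturally produces a rising tone.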

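5) Grapheme-to-phoneme (G2P): words with the same spelling are often pronounced differently. For example, the word "resume" is sometimes read as "rizju:m" and sometimes as "rezjumei", so the context of the whole text has to be examined. The grapheme-to-phoneme step, which converts a word such as "speech" into phonetic symbols such as "s p iy ch", is therefore often treated as a priority.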
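Two of the steps listed above, text normalization and G2P, can be sketched with a few lines of Python. This is only a toy illustration: the function names, the digit-by-digit reading rule, and the one-entry lexicon are assumptions made for the example, not part of any system discussed here. Real front ends handle dates, currencies, abbreviations, and context-dependent pronunciations.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Toy pronunciation lexicon (hypothetical entry, ARPAbet-like symbols).
TOY_LEXICON = {"speech": ["s", "p", "iy", "ch"]}


def normalize_digits(text):
    """Spell out every run of ASCII digits one digit at a time."""
    def spell(match):
        return " ".join(DIGIT_WORDS[int(ch)] for ch in match.group())
    return re.sub(r"[0-9]+", spell, text)


def to_phonemes(word):
    """Look a word up in the toy lexicon; fall back to its letters."""
    key = word.lower()
    return TOY_LEXICON.get(key, list(key))


print(normalize_digits("born in 1989"))
print(to_phonemes("speech"))
```

A dictionary lookup like this cannot tell the two readings of "resume" apart, which is why G2P models that look at the surrounding context are studied separately.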

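In the SPSS era, these parts were added and developed one by one to improve the quality of the generated audio. In neural TTS they have been simplified considerably, but some are still definitely needed. For example, 1) text normalization and 5) G2P are generally applied as preprocessing before the input reaches the model. Papers that claim to accept both characters and phonemes as input often add that "in practice, the results are better with phoneme input". Even so, things are much simpler than before, so most neural TTS systems do not treat text analysis as a separate stage and regard it as simple preprocessing. G2P in particular has been the subject of several studies, for example for English [Chae18], Chinese [Park20], and Korean [Kim21d].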

3. Acoustic models

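The acoustic model is the part that generates acoustic features, either directly from characters or phonemes or from the linguistic features produced by text analysis. As mentioned above, HMMs (hidden Markov models) dominated acoustic modeling in the SPSS era, and neural networks later took their place. For example, [Zen13] and [Qian14] showed that replacing the HMM with a DNN works better. RNN-type models, however, are arguably better suited to time series such as speech, so [Fan14] and [Zen15] used models such as LSTM to improve performance. Despite using neural networks, these models still take linguistic features as input and output features such as MCC (mel-cepstral coefficients), BAP (band aperiodicity), LSP (line spectral pairs), LinS (linear spectrogram), and F0 (fundamental frequency). They can therefore be regarded as improved SPSS models.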

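DeepVoice [Arık17a], announced while Andrew Ng was at Baidu Research, is in fact also closer to an SPSS model. It consists of several parts, such as a G2P module, a module that finds phoneme boundaries, a module that predicts phoneme duration, and a module that predicts F0, each built from various neural network models. DeepVoice 2 [Arık17b], released afterwards, can be seen as a performance upgrade and multi-speaker version of the first, with a similar overall structure.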

3.1. Seq2seq-based acoustic models

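Around 2014-15, seq2seq models with attention became the trend in machine translation. Since there are many similarities between letters and sounds, the same approach can also be applied to speech. Based on this idea, Google developed Tacotron [Wang17] (so named because the authors like tacos). By adding the CBHG module to the RNN that underlies seq2seq, a proper neural TTS finally appeared that could take characters as input and directly output acoustic features, breaking away from the earlier SPSS pipeline. This seq2seq model remained the basis of TTS models for a long time afterwards.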

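At Baidu, DeepVoice 3 [Ping18] abandoned the earlier design and adopted seq2seq with attention, although DeepVoice's tradition of building on CNNs remained. The DeepVoice name was retired after version 3, and the line of work continued in ClariNet [Ping19] and ParaNet [Peng20]. ParaNet in particular introduced several techniques to speed up the seq2seq model.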

Google's Tacotron kept the basic seq2seq form while developing in several directions. The first version is somewhat dated, but from Tacotron 2 [Shen18] onward the mel-spectrogram became the default intermediate representation. [Wang18] learned style tokens that define a speaking style and added them to Tacotron to create a TTS system with controllable style. Another Google paper published at the same time, [Skerry-Ryan18], proposed a model that changes the prosody of the generated audio by adding a component to Tacotron that learns a prosody embedding. DCTTS [Tachibana18] replaced the RNN part of Tacotron with a deep CNN and showed a large gain in speed. That model has since been improved into Fast DCTTS, a fast model of significantly reduced size [Kang21].

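DurIAN [Yu20] replaced the attention part of Tacotron 2 with an alignment model, which reduced errors. Non-Attentive Tacotron [Shen20] did something similar, replacing the attention part of Tacotron 2 with a duration predictor to make the model more robust. FCL-TACO2 [Wang21] proposed a semi-autoregressive (SAR) method in which each phoneme is generated autoregressively (AR) while the sequence as a whole is generated non-autoregressively (NAR), improving speed while maintaining quality. Distillation is also used to reduce the model size. The proposed model is based on Tacotron 2 but is reported to be 17-18 times faster.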

3.2. Transformer-based acoustic models

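With the arrival of the Transformer in 2017, attention models in NLP evolved into Transformers, and Transformer-based models began to appear in TTS as well. TransformerTTS [Li19a] can be seen as the starting point: it keeps most of Tacotron 2 as is and only replaces the RNN part with a Transformer. This allows parallel processing and makes it possible to capture longer-range dependencies.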

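The FastSpeech [Ren19a] family is the representative example of Transformer-based TTS. By using a feed-forward Transformer, it can generate mel-spectrograms at very high speed. For reference, the mel-spectrogram is obtained by transforming the result of an FFT according to the characteristics of human hearing; it is a fairly old method but is still in use. One of its advantages is that it can be represented with a small number of dimensions (typically 80).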
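To make the "characteristics of human hearing" concrete, the sketch below computes where the centers of 80 mel bands would fall in Hz. It assumes the commonly used HTK-style mel formula and an upper limit of 8000 Hz; both are assumptions for illustration (different toolkits use slightly different mel definitions), and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale (assumed here; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies()
# Bands are narrow at low frequencies and wide at high frequencies.
print(round(centers[1] - centers[0], 1), round(centers[-1] - centers[-2], 1))
```

Because the bands widen with frequency, a few dozen of them can summarize an FFT with hundreds of bins, which is where the small dimensionality comes from.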

In TTS it is very important to match the input text to the frames of the mel-spectrogram, that is, to work out exactly how many frames each character or phoneme spans. Attention is too flexible for this: that flexibility may be useful in NLP, but in speech it is a disadvantage (words are repeated or skipped). FastSpeech therefore drops attention and uses a module that predicts lengths accurately (the length regulator). Later, FastSpeech 2 [Ren21a] simplified the network structure further and additionally used more varied information such as pitch, duration, and energy as input. FastPitch [Łańcucki21] improved the results further by adding detailed pitch information to FastSpeech. LightSpeech [Luo21] used NAS (neural architecture search) to optimize the structure of the already fast FastSpeech, making it 6.5 times faster.
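The core of a length regulator is simple enough to show directly: each phoneme-level vector is repeated for as many frames as its predicted duration. The following is a minimal sketch under the assumption that durations are already given as integer frame counts; the function name is hypothetical, and in a real model the items are hidden-state vectors and the durations come from a learned predictor.

```python
def length_regulate(phoneme_states, durations):
    """Expand a phoneme-level sequence to frame level.

    phoneme_states: one item per phoneme (any object, e.g. a hidden vector).
    durations: number of mel frames each phoneme should span.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        # Repeat the state once per frame; a duration of 0 drops the phoneme.
        frames.extend([state] * n_frames)
    return frames


print(length_regulate(["s", "p", "iy", "ch"], [3, 1, 4, 2]))
```

Since every phoneme is emitted exactly as many times as its duration says, nothing can be skipped or repeated by accident, which is the robustness advantage over free-form attention.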

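MultiSpeech [Chen20] also introduced various techniques to address the shortcomings of the Transformer, and then used the resulting model to train FastSpeech, producing a further improved FastSpeech model. The TransformerTTS authors subsequently proposed a further improved Transformer TTS model, RobuTrans [Li20], which uses length-based hard attention. AlignTTS [Zeng20] likewise introduced a method that computes the alignment with a separate network instead of attention. JDI-T [Lim20] from Kakao introduced a simpler Transformer-based architecture with an improved attention mechanism. NCSOFT proposed using Transformers hierarchically in the text encoder and audio encoder by stacking them in multiple layers [Bae21]; restricting the attention range and using multi-level pitch embeddings also helped performance.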

3.3. Flow-based acoustic models

Flow, a newer generative method that began to be applied in the image domain around 2014, has also been applied to acoustic models. Flowtron [Valle20a] can be seen as an improved Tacotron that generates mel-spectrograms by applying IAF (inverse autoregressive flow). Flow-TTS [Miao20] used a non-autoregressive flow to build a faster model. Its successor, EfficientTTS [Miao21], generalized the model further and improved the alignment part.

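Glow-TTS [Kim20] from Kakao also uses flow to generate mel-spectrograms. Glow-TTS finds the match between text and mel frames with classical dynamic programming, and showed that this approach can also produce efficient and accurate alignments. This method (Monotonic Alignment Search) was later used in other studies.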
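The dynamic-programming search described above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a precomputed matrix `log_lik[i][j]` that holds the log-likelihood of mel frame `j` under text token `i`; the function name and the toy matrix are hypothetical, and a production implementation would be vectorized and differ in detail. The alignment is constrained to start at the first token, end at the last, and either stay on the same token or advance by one at each frame.

```python
def monotonic_alignment_search(log_lik):
    """Return, for each mel frame, the index of the text token it aligns to."""
    n_tok, n_frm = len(log_lik), len(log_lik[0])
    if n_tok > n_frm:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    # q[i][j]: best total log-likelihood of any path ending at token i, frame j.
    q = [[neg_inf] * n_frm for _ in range(n_tok)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frm):
        for i in range(min(j + 1, n_tok)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frm
    i = n_tok - 1
    for j in range(n_frm - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment


toy = [[0.0, 0.0, -5.0, -5.0],   # token 0 explains frames 0-1 well
       [-5.0, -5.0, 0.0, 0.0]]   # token 1 explains frames 2-3 well
print(monotonic_alignment_search(toy))
```

Counting how many frames map to each token then yields the per-token durations that a duration predictor can be trained on.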

3.4. VAE-based acoustic models

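The variational autoencoder (VAE), another generative modeling framework, introduced in 2013, has also been used in TTS. As its name suggests, GMVAE-Tacotron [Hsu19] from Google uses a VAE to model and control various latent attributes of speech. VAE-TTS [Zhang19a], which appeared at about the same time, does something similar by adding a VAE-modeled style component to Tacotron 2. BVAE-TTS [Lee21a] introduced a model that uses a bidirectional VAE to generate mels quickly with a small number of parameters. Parallel Tacotron [Elias21a], an extension of the Tacotron family, also introduced a VAE to speed up training and generation.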

3.5. GAN-based acoustic models

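Generative adversarial networks (GANs), proposed in 2014, have also been applied to acoustic models. In [Guo19], Tacotron 2 serves as the generator and the GAN is used as a way to produce better mels. [Ma19] uses adversarial training so that the Tacotron generator learns speech style at the same time. Multi-SpectroGAN [Lee21b] also learns latent representations of several styles adversarially, using FastSpeech 2 as the generator. GANSpeech [Yang21b] likewise trains FastSpeech 1/2 as the generator in a GAN setup; adaptively adjusting the scale of the feature-matching loss helps performance.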

3.6. Diffusion-based acoustic models

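TTS systems based on diffusion models, which have recently attracted a great deal of attention, have been proposed one after another. Diff-TTS [Jeong21] improved output quality further by using a diffusion model for the mel generation part. Grad-TTS [Popov21] does something similar by replacing the decoder with a diffusion model, but uses Glow-TTS for the rest of the architecture. PriorGrad [Lee22a] builds the prior distribution from data statistics, which makes modeling more efficient; the paper presents an acoustic-model example that uses per-phoneme statistics. Tencent's DiffGAN-TTS [Liu22a] also uses a diffusion decoder, trained adversarially. This greatly reduces the number of steps needed at inference and therefore the generation time.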

3.7. Other acoustic models

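The techniques introduced above do not have to be used in isolation; they can be combined. The FastSpeech authors' own analysis found that a VAE captures long-range information such as prosody well even at a small size but gives slightly worse quality, whereas a flow preserves detail well but needs a large model to reach good quality. PortaSpeech therefore proposes a model that contains all three elements: Transformer, VAE, and flow.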

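VoiceLoop [Taigman18] proposed a model that stores and processes speech information in a buffer modeled on human working memory, called the phonological loop. It is an early model that takes multiple speakers into account, and it was later used as the backbone network in other work, including Facebook's own [Nachmani18] as well as [Akuzawa18] and [deKorte20].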

DeviceTTS [Huang21] is a model built from deep feedforward sequential memory network (DFSMN) units. This is a feed-forward network with memory blocks: small but efficient, and able to retain long-term dependencies without recurrence. On this basis the paper proposes a TTS model that is practical on ordinary mobile devices.

4. Vocoders

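The vocoder is the component that takes the acoustic features produced by the acoustic model and converts them into a waveform. Vocoders were of course needed in the SPSS era too; those used at the time include STRAIGHT [Kawahara06] and WORLD [Morise16].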

4.1. Autoregressive vocoders

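Neural vocoders start with WaveNet [Oord16], which introduced dilated convolution layers, important for generating long audio sequences, and produced high-quality audio with an autoregressive approach in which each new audio sample is generated from the previously generated samples, one at a time. WaveNet can in fact serve as acoustic model and vocoder in one, taking linguistic features as input and generating audio. Since then, however, it has become common to generate a mel-spectrogram with a more sophisticated acoustic model and then generate audio from it with a WaveNet-based vocoder.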
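Why dilation matters for long audio can be seen from a short calculation of the receptive field, that is, how many input samples can influence one output sample. The sketch below assumes kernel size 2 and dilations that double from 1 to 512 within a stack; these specific values and the function name are assumptions for illustration.

```python
def receptive_field(dilations, kernel_size=2):
    """Input samples that can influence one output of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


one_stack = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512 (assumed)
print(receptive_field(one_stack))              # 1024 samples from 10 layers
print(receptive_field(one_stack * 3))          # 3070 samples from 30 layers
print(receptive_field([1] * 10))               # 11 samples without dilation
```

With doubling dilations the receptive field grows exponentially with depth, whereas ten undilated layers see only 11 samples. At 16 kHz, 1024 samples is still only 64 ms, which is why several stacks are typically repeated.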

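Tacotron [Wang17] generated a linear spectrogram and converted it to a waveform with the Griffin-Lim algorithm [Griffin84]. Because that algorithm dates back about 40 years, the resulting audio was not very satisfactory even though the overall network structure was very good. DeepVoice [Arık17a] used a WaveNet vocoder from the start. The DeepVoice 2 paper [Arık17b] notably went beyond the authors' own model and also improved another company's model, Tacotron, by attaching a WaveNet vocoder to it (in that form it performed better than DeepVoice 2 for a single speaker). From version 2 [Shen18] onward, Tacotron uses WaveNet as its default vocoder.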

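SampleRNN [Mehri17] is another autoregressive model, which generates samples one at a time with an RNN. Autoregressive models of this kind are very slow at generating audio, because each sample has to be built from the preceding ones. Many later studies therefore proposed models with faster generation.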

FFTNet [Jin18] observed that the shape of WaveNet's dilated convolutions resembles that of the FFT and proposed a technique that speeds up generation. WaveRNN [Kalchbrenner18] used a range of techniques (custom GPU kernels, pruning, scaling, and so on) to accelerate WaveNet-style generation. WaveRNN has since evolved into universal neural vocoders of various forms. In [Lorenzo-Trueba19], WaveRNN was trained on data from 74 speakers in 17 languages to create the RNN_MS (multi-speaker) model, which was shown to produce good quality even for speakers and conditions not present in the data. [Paul20a] proposed SC_WaveRNN (Speaker Conditional WaveRNN), which is additionally conditioned on a speaker embedding during training. This model was also shown to work for speakers and conditions not in the data.

Apple's TTS [Achanta21] also uses WaveRNN as its vocoder, with a variety of code optimizations and parameter settings on both the server and the mobile side so that it can run on mobile devices.

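Processing the audio signal as several subbands, that is, shorter downsampled versions, has been applied to many models, because the subbands can be computed quickly in parallel and each can be processed differently. For WaveNet, for example, [Okamoto18a] proposed a subband WaveNet that splits the signal into subbands with a filter bank, and [Rabiee18] proposed a method using wavelets. [Okamoto18b] proposed a subband version of FFTNet. DurIAN [Yu19], a paper mainly concerned with the acoustic model, also proposed a subband version of WaveRNN.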

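Many of the vocoders introduced later use non-autoregressive methods to overcome the slow generation of autoregressive ones, in other words methods that generate later samples without looking at earlier ones (usually described as parallel). A wide variety of non-autoregressive methods have been proposed, but a recent paper showing that the autoregressive approach is not dead is Chunked Autoregressive GAN (CARGAN) [Morrison22]. It shows that many non-autoregressive vocoders suffer from pitch errors and that the problem can be solved with an autoregressive approach. Speed is of course an issue, but by showing that the computation can be split into chunks, the paper introduces a method that significantly reduces generation time and memory use.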

4.2. Flow-based vocoders

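Normalizing-flow-based techniques fall into two broad categories. The first is autoregressive transforms. The representative example, IAF (inverse autoregressive flow), generates very quickly at the cost of long training time, so it can be used to generate audio quickly. Slow training is a problem, however, so Parallel WaveNet [Oord18] first builds an autoregressive WaveNet and then trains a similar non-autoregressive IAF model from it. This is known as a teacher-student model, or distillation. ClariNet [Ping19] later proposed a simpler and more stable training procedure along the same lines. Once the IAF model has been trained successfully, audio can be generated quickly, but the training procedure is complex and computationally expensive.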

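The other flow technique is the bipartite transform, which uses layers called affine coupling layers to speed up both training and generation. Two vocoders using this approach were proposed at about the same time, WaveGlow [Prenger19] and FloWaveNet [Kim19]. The two papers stem from nearly the same idea and differ only in structural details, including how channels are mixed. The bipartite transform has the advantage of simplicity, but the drawback that several layers must be stacked to obtain a model equivalent to IAF, so the parameter count is relatively large.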
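An affine coupling layer is easy to write down, and doing so shows why both directions are cheap. In the sketch below the input is split in half; the first half passes through unchanged and also determines a scale and shift for the second half. The `_scale_shift` function is a hand-written stand-in for the neural network used in real models, and all names here are hypothetical.

```python
import math


def _scale_shift(xa):
    """Stand-in for the conditioning network: returns (log_scale, shift)."""
    return [math.tanh(v) for v in xa], [0.5 * v for v in xa]


def coupling_forward(x):
    """Transform x (even length); return the output and the log-determinant."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = _scale_shift(xa)
    yb = [b * math.exp(ls) + sh for b, ls, sh in zip(xb, log_s, t)]
    return xa + yb, sum(log_s)


def coupling_inverse(y):
    """Exactly undo coupling_forward using the untouched first half."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = _scale_shift(ya)
    xb = [(b - sh) * math.exp(-ls) for b, ls, sh in zip(yb, log_s, t)]
    return ya + xb


x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
print(y, log_det, coupling_inverse(y))
```

Neither direction needs a sequential loop over samples, and the log-determinant is just a sum, so training and generation are both parallel. The price is that half of the values are left unchanged by each layer, so many layers with channel mixing in between are needed.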

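WaveFlow [Ping20] subsequently offered a unified view of several audio generation methods. It explains not only flow methods such as WaveGlow and FloWaveNet but also WaveNet generation as instances of a generalized model, and proposes a model that computes faster than these. SqueezeWave [Zhai20] proposed a model that is orders of magnitude faster (with a slight loss in quality) by removing inefficiencies in WaveGlow and using depthwise separable convolutions. WG-WaveNet [Hsu20] also builds on WaveGlow, using weight sharing to shrink the model considerably and adding a small WaveNet filter to improve audio quality, so that 44.1 kHz audio can be generated faster than real time on a CPU.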

4.3. GAN-based vocoders

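Generative adversarial networks (GANs), widely used in the image domain, were applied successfully to audio generation only after a long delay (4-5 years). WaveGAN [Donahue19] can be cited as the first major result. It carried over the architecture developed for images to audio, so although it produced audio of a certain quality, something still seemed to be lacking.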

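Starting with GAN-TTS [Binkowski20], researchers began to think about how to make the model better suited to audio, that is, how to build a discriminator that captures the characteristics of waveforms well. GAN-TTS uses several random windows (random window discriminators) to take more varied features into account, while MelGAN [Kumar19] looks at the audio at several scales (multi-scale discriminator). HiFi-GAN [Kong20] from Kakao proposed a way to take yet more audio characteristics into account, namely periodicity (multi-period discriminator). VocGAN [Yang20a] also uses discriminators at multiple resolutions. In [Gritsenko20], the difference between the generated and the real distribution is defined as a generalized energy distance (GED), and training proceeds in the direction that minimizes it. These elaborate discriminators greatly improved the quality of generated audio in various ways. [You21] analyzed this further and pointed to the importance of multi-resolution discriminators. In Fre-GAN [Kim21b], both the generator and the discriminator are connected in a multi-resolution fashion; using the discrete wavelet transform (DWT) also helps.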
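The idea behind a period-based discriminator can be illustrated by the reshaping step alone: the 1-D waveform is folded into rows of a given period, so that each column collects samples spaced exactly one period apart. The sketch below shows only this folding; the function name is hypothetical, zero padding is used for simplicity (an assumption, implementations may pad differently), and the 2-D convolutions that would then be applied are omitted.

```python
def fold_by_period(samples, period):
    """Fold a 1-D signal into rows of length `period`, zero-padding the tail."""
    pad = (-len(samples)) % period
    padded = list(samples) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


rows = fold_by_period([0, 1, 2, 3, 4, 5, 6], period=3)
print(rows)                         # rows of 3 consecutive samples
print([row[0] for row in rows])     # column 0: every 3rd sample
```

Running the same folding with several different periods gives each sub-discriminator a different view of the periodic structure of the signal.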

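On the generator side, many models use the combination of dilated and transposed convolutions proposed in MelGAN. With slight variations, Parallel WaveGAN [Yamamoto20] additionally takes Gaussian noise as input, and VocGAN generates waveforms at several scales. HiFi-GAN uses a generator with multiple receptive fields. [Yamamoto19] also proposed training an IAF generator with a GAN approach.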

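Parallel WaveGAN [Yamamoto20], mentioned above, is a model proposed by Naver/Line that generates audio at very high speed using a non-autoregressive WaveNet generator. [Wu20] proposed a version more robust to pitch by adding pitch-dependent dilated convolutions. [Song21] then proposed a further improved Parallel WaveGAN that reduces perceptually sensitive errors by applying a perceptual masking filter. [Wang21] proposed a way to build Parallel WaveGAN (and MelGAN) models with fewer local artifacts by applying Pointwise Relativistic LSGAN, an improved least-squares GAN, to audio. In LVCNet [Zeng21], a generator built from convolution layers that vary with the conditioning, called location-variable convolution, is placed in Parallel WaveGAN and trained to give a faster (4x) generator with very little difference in quality.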

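MelGAN has also been improved in several ways. Multi-Band MelGAN [Yang21a] enlarges the receptive field of the original MelGAN, adds the multi-resolution STFT loss (proposed in Parallel WaveGAN), and computes in multiple bands (proposed in DurIAN), yielding a faster and more stable model. Universal MelGAN [Jang20], a multi-speaker version, was also proposed; it too uses multi-resolution discriminators to generate audio with more detail. The idea was carried on in the follow-up work UnivNet [Jang21] and improved further, for example by also using a multi-period discriminator. In these studies audio quality was improved as well by using mels with more bands (80 -> 100).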

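Seoul National University and NVIDIA introduced a new vocoder called BigVGAN [Lee22b]. It is intended as a universal vocoder that copes with varied recording conditions, unseen languages, and so on. Technically, it uses the snake function to give the HiFi-GAN generator a periodic inductive bias and adds a low-pass filter to reduce the side effects this causes. The model size was also increased substantially (to about 112M parameters), and training at that scale succeeded.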

4.4. Diffusion-based vocoders

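Diffusion models, arguably the most recent class of generative model, were applied to vocoders early on. DiffWave [Kong21] and WaveGrad [Chen21a], which share a similar idea, were both presented at ICLR 2021. Both use a diffusion model for audio generation, but DiffWave resembles WaveNet while WaveGrad is based on GAN-TTS. They also handle the iterations differently, so the two papers make interesting reading side by side. PriorGrad [Lee22a], introduced in the acoustic model section above, also presents a vocoder as an example; there the prior is computed from the energy of the mel-spectrogram.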

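The advantage of diffusion is that it can learn complex data distributions and produce high-quality results; its biggest drawback is the comparatively long generation time. In addition, because the method itself works by removing noise, running it for too long has the drawback that much of the noise actually present in the original audio (unvoiced sounds and the like) also disappears. FastDiff [Huang22] applied the idea of LVCNet [Zeng21] to diffusion models and proposed time-aware location-variable convolution. This makes diffusion more robust, and the generation time can be reduced further with a noise schedule predictor.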
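The cost of diffusion comes from its step-by-step structure, which a small sketch of the forward (noising) process makes visible. The code below follows the standard DDPM formulation, in which a clean signal can be noised to any step in closed form; the linear schedule, its end points, the step count of 50, and the function names are assumptions chosen for illustration rather than the settings of any particular vocoder.

```python
import math
import random


def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.05):
    """Noise variances beta_1..beta_T, linearly spaced (assumed schedule)."""
    if n_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta): how much signal survives at each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * rng.gauss(0.0, 1.0) for x in x0]


ab = alpha_bars(linear_beta_schedule(50))
rng = random.Random(0)
print(ab[0], ab[-1])                      # signal fraction at first and last step
print(diffuse([0.5, -0.5, 0.25], ab[-1], rng))
```

Generation runs this process in reverse, with one neural network evaluation per step, so a 50-step schedule costs 50 network passes per utterance. Methods that shorten or learn the schedule attack exactly this count.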

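BDDM [Lam22] from Tencent also proposed a method that greatly reduces generation time. It uses different networks for the forward and reverse processes of diffusion (forward: schedule network; reverse: score network) and proposes a new theoretical objective for them. The paper shows that audio can be generated in as few as three steps. At that speed diffusion also becomes usable in practice. While most earlier work used DDPM-style modeling, diffusion models can also be expressed as stochastic differential equations (SDEs). ItoWave [Wu22b] shows an example of generating audio with SDE-style modeling.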

4.5. Source-filter-based vocoders

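At the start of this article, in the history of TTS, formant synthesis was mentioned briefly. It models the human voice as a basic sound source (sine tones and the like) that is filtered by the structure of the mouth and so transformed into the sound we hear. The most important part of this approach is how the filter is built. In the deep learning era the natural question was whether performance would improve if this filter were modeled with a neural network. The neural source-filter method [Wang19a] uses F0 (pitch) information to create a basic sine source and trains a filter built from dilated convolutions to produce high-quality sound. It is not autoregressive, so it is fast. [Wang19b] then extended and restructured it as a harmonic-plus-noise model to improve performance. DDSP [Engel20] proposed a way to create a variety of sounds with neural networks and several DSP components, using additive synthesis for the harmonics and a linear time-varying filter for the noise.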
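The "source" half of such a model can be sketched without any neural network: given an F0 value per sample, a sine wave is produced by accumulating phase. This is a simplified illustration under stated assumptions. The function name, the amplitude, and the choice to output silence for unvoiced samples (F0 = 0) are mine; source-filter vocoders typically use noise in unvoiced regions and add harmonics, and the learned filter that shapes the source is not shown.

```python
import math


def sine_excitation(f0_per_sample, sample_rate=16000, amplitude=0.1):
    """Generate a sine source whose instantaneous frequency follows F0."""
    phase = 0.0
    out = []
    for f0 in f0_per_sample:
        if f0 > 0.0:
            # Advance the phase by one sample's worth of the current pitch.
            phase += 2.0 * math.pi * f0 / sample_rate
            out.append(amplitude * math.sin(phase))
        else:
            out.append(0.0)  # unvoiced: silent in this simplified sketch
    return out


# 10 ms at 200 Hz followed by 5 ms unvoiced, at 16 kHz.
f0_track = [200.0] * 160 + [0.0] * 80
source = sine_excitation(f0_track)
print(len(source), max(source), min(source))
```

Accumulating phase rather than computing sin(2*pi*f0*t) directly keeps the waveform continuous when F0 changes from sample to sample.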

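Another approach splits speech into the part related to the formants and the remainder (called the residual, the excitation, and so on) and handles them separately. This is also a method with a long history. Linear prediction (LP) has mainly been used for the formants, and a variety of models for the excitation. GlotNet [Juvela18], proposed in the neural network era, models the (glottal) excitation with a WaveNet. GELP [Juvela19] later extended this to a parallel form using GAN training.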
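The split into a linear-prediction part and an excitation can be shown with two short functions: analysis subtracts the prediction from each sample to obtain the residual, and synthesis adds the prediction back. The sketch assumes the LP coefficients are already known (estimating them is a separate step not shown), and the function names are hypothetical.

```python
def lp_residual(signal, coeffs):
    """Excitation e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    residual = []
    for n, x in enumerate(signal):
        pred = sum(a * signal[n - 1 - k]
                   for k, a in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x - pred)
    return residual


def lp_synthesize(residual, coeffs):
    """Rebuild x[n] = e[n] + sum_k a[k] * x[n-1-k] from the excitation."""
    out = []
    for n, e in enumerate(residual):
        pred = sum(a * out[n - 1 - k]
                   for k, a in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(e + pred)
    return out


signal = [1.0, 0.5, 0.25, 0.125]
excitation = lp_residual(signal, [0.5])
print(excitation)                        # energy concentrated in the first sample
print(lp_synthesize(excitation, [0.5]))  # original signal recovered
```

When the predictor fits well, the residual is much simpler than the waveform, which is why it is the excitation rather than the raw signal that the neural network is asked to model.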

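ExcitNet [Song19] from Naver and Yonsei University can be seen as a model with a similar idea. In its extension, LP-WaveNet [Hwang20a], the source and filter are trained jointly and a more sophisticated model is used. [Song20] introduced the modeling-by-generation (MbG) concept, in which information generated by the acoustic model is used in the vocoder to improve performance. In the neural homomorphic vocoder [Liu20b], the harmonics use a linear time-varying (LTV) filtered impulse train and the noise uses LTV-filtered noise. [Yoneyama21] proposed a model that uses Parallel WaveGAN as the vocoder and unifies several of the source-filter models above. Parallel WaveGAN itself has been extended continually by its original authors (Naver and others): in [Hwang21b] the generator was extended to a harmonic-plus-noise model and a subband version was added, and [Yamamoto21] proposed several techniques to improve the discriminator, among them splitting it into separate discriminators for voiced (harmonic) and unvoiced (noise) speech.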

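LPCNet [Valin19] can be regarded as the most widely used model to follow this source-filter approach. It adds linear prediction to WaveRNN, and it too has since been improved in many ways. Bunched LPCNet [Vipperla20] makes LPCNet more efficient using techniques introduced in the original WaveRNN. Gaussian LPCNet [Popov20a] also improves efficiency by allowing several samples to be predicted at once. [Kanagawa20] improves efficiency in another direction by using tensor decomposition to shrink the components inside WaveRNN. iLPCNet [Hwang20b] proposed a model that outperforms the existing LPCNet by using a continuous mixture density network. [Popov20b] proposed finding the points in the speech where LPCNet generation can be cut (for example pauses or unvoiced sounds), splitting there, processing the pieces in parallel, and joining them with cross-fades to speed up generation. LPCNet has also been extended to subband versions: subband LPCNet was first introduced in FeatherWave [Tian20], and [Cui20] proposed an improved subband LPCNet that takes the correlation between subbands into account. More recently the LPCNet author (who appears to have moved from Mozilla/Google to Amazon) released an improved version [Valin22] that uses a tree structure to reduce computation during sampling and proposes 8-bit quantized weights. These are all ways of using the cache effectively and exploiting the improved parallel computing capability of recent hardware.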

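Vocoder development is moving from high-quality but slow AR (autoregressive) methods toward fast NAR (non-autoregressive) methods. Thanks to several advanced generative techniques, NAR is gradually reaching the level of AR. In TTS-BY-TTS [Hwang21a], for example, a large amount of data was generated with an AR method and used to train a NAR model, with good results. Using all of the generated data can hurt, however. TTS-BY-TTS2 [Song22] therefore proposed using RankSVM to pick out the synthetic audio most similar to the original audio and training only on that data.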

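DelightfulTTS [Liu21], the TTS system used by Microsoft, has some structural modifications of its own, such as the use of conformers, and is characterized in particular by generating 48 kHz final audio (most TTS systems usually generate 16 kHz audio). To do this, the mel-spectrogram is generated at 16 kHz, but the final audio is generated at 48 kHz with an in-house HiFiNet vocoder.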

5. Fully end-to-end TTS

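This section introduces models that learn the acoustic model and the vocoder together and produce waveform audio directly from input text or phonemes. Doing everything in one go is in principle preferable: there is no need to split training into stages, and fewer stages mean fewer errors. There is also no need for acoustic features such as the mel-spectrogram. The mel is a good representation, but it is chosen arbitrarily by people (and so is suboptimal), and phase information is lost. The reason such models were not developed from the outset is that doing everything at once is hard.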

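For example, 5 seconds of speech corresponds to roughly 20 units as input text and roughly 100 as phonemes, whereas the waveform is 80,000 samples (at a 16 kHz sampling rate). Matching the two directly (text -> audio samples) is therefore a hard problem, and it is simpler to proceed in two steps through an intermediate-resolution representation such as the mel. As the technology has matured, however, some models trained in this fully end-to-end manner have appeared. Note that many acoustic-model papers also use the term end-to-end; there it usually means either that the text analysis part has been absorbed into the model or that the model can generate audio once a vocoder is attached to it.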
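The size mismatch described above is easy to quantify. The sketch below reproduces the 80,000-sample figure and adds the number of mel frames for comparison; the hop length of 256 samples is an assumption for illustration (it is not stated in this article), and the function name is hypothetical.

```python
import math


def sequence_lengths(seconds, n_phonemes, sample_rate=16000, hop_length=256):
    """Compare phoneme, mel-frame, and sample counts for one utterance."""
    n_samples = int(seconds * sample_rate)
    n_frames = math.ceil(n_samples / hop_length)  # hop_length is an assumed value
    return {
        "samples": n_samples,
        "mel_frames": n_frames,
        "samples_per_phoneme": n_samples / n_phonemes,
        "frames_per_phoneme": n_frames / n_phonemes,
    }


print(sequence_lengths(seconds=5, n_phonemes=100))
```

Under these assumptions each phoneme has to be stretched over 800 waveform samples but only about 3 mel frames, which illustrates why going through the mel makes the alignment problem so much more tractable.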

Perhaps the first in this area was Char2Wav [Sotelo17]. This paper, from the team of the well-known Professor Yoshua Bengio at the University of Montreal, attaches the team's own SampleRNN [Mehri17] vocoder to a seq2seq acoustic model and trains the whole in one go. The main contribution of ClariNet [Ping19] is actually a more efficient vocoder of the WaveNet -> IAF kind, but since the team (Baidu) already had an acoustic model (DeepVoice 3), the paper also shows how to attach the newly built vocoder to it and train the two together to create an end-to-end model.

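FastSpeech 2 [Ren21a] is likewise mainly about a good acoustic model, but the paper also introduces a fully end-to-end model called FastSpeech 2s. It attaches a WaveNet vocoder to the FastSpeech 2 model and, to overcome the difficulty of training, makes use of a previously built mel encoder. EATS [Donahue21] uses GAN-TTS [Binkowski20], built by the same team (Google), as the vocoder, creates a new acoustic model, and trains the two together. Training in one go is difficult, however, so an intermediate-resolution representation is created and used. Wave-Tacotron [Weiss21] attaches a vocoder to Tacotron and trains everything at once. It uses a flow-based vocoder, following Kingma's formulation, which allows a faster model without a significant drop in quality.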

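EfficientTTS [Miao21], introduced in the acoustic model section above, also presents a model (EFTS-Wav) that is trained end to end by replacing the decoder with MelGAN. This model was also shown to speed up audio generation significantly while still performing well. The Kakao team developed an acoustic model called Glow-TTS [Kim20] and a vocoder called HiFi-GAN [Kong20], and the two can be put together to create an end-to-end model. The result is VITS [Kim21a], which connects the two parts with a VAE and trains the whole adversarially, giving a model with good speed and quality.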

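Yonsei University and Naver also introduced LiteTTS [Nguyen21] in 2021, an efficient fully end-to-end TTS. It uses a feed-forward Transformer and a lightweight version of the HiFi-GAN structure. In particular, a domain transfer encoder is used to learn text information related to the prosody embedding. Tencent and Zhejiang University, which proposed the FastDiff vocoder [Huang22], also introduced FastDiff-TTS, a fully end-to-end model that combines it with FastSpeech 2. Kakao also introduced JETS, which trains FastSpeech 2 and HiFi-GAN jointly [Lim22]. Microsoft, while upgrading the existing DelightfulTTS to version 2, also introduced a fully end-to-end method [Liu22b], which uses a VQ audio encoder as the intermediate representation.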

References

[1] Neural Text-to-Speech (TTS)

[2] 1906.10859.pdf

Reference

[Griffin84] D.Griffin, J.Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[Kawahara06] H.Kawahara. Straight, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds. Acoustical science and technology, 27(6):349–353, 2006.
[Zen13] H.Zen, A.Senior, M.Schuster. Statistical parametric speech synthesis using deep neural networks. ICASSP 2013.
[Fan14] Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K Soong. TTS synthesis with bidirectional lstm based recurrent neural networks. Fifteenth annual conference of the international speech communication association, 2014.
[Qian14] Y. Qian, Y.-C. Fan, W.-P. Hu, F. K. Soong. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. ICASSP 2014.
[Zen15] H.Zen, Hasim Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. ICASSP 2015.
[Morise16] M.Morise, F.Yokomori, K.Ozawa. World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.
[Oord16] A.van den Oord, S.Dieleman, H.Zen, K.Simonyan, O.Vinyals, A.Graves, N.Kalchbrenner, A.Senior, K.Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[Arık17a] S.Ö.Arık, M.Chrzanowski, A.Coates, G.Diamos, A.Gibiansky, Y.Kang, X.Li, J.Miller, J.Raiman, S.Sengupta, M.Shoeybi. Deep Voice: Real-time neural text-to-speech. ICML 2017.
[Arık17b] S.Ö.Arık, G.Diamos, A.Gibiansky, J.Miller, K.Peng, W.Ping, J.Raiman, Y.Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. NeurIPS 2017.
[Lee17] Y.Lee, A.Rabiee, S.-Y.Lee. Emotional end-to-end neural speech synthesizer. arXiv preprint arXiv:1711.05447, 2017.
[Mehri17] S.Mehri, K.Kumar, I.Gulrajani, R.Kumar, S.Jain, J.Sotelo, A.Courville, Y.Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. ICLR 2017.
[Ming17] H.Ming, Y.Lu, Z.Zhang, M.Dong. A light-weight method of building an LSTM-RNN-based bilingual TTS system. International Conference on Asian Language Processing 2017.
[Sotelo17] J.Sotelo, S.Mehri, K.Kumar, J.F.Santos, K.Kastner, A.Courville, Y.Bengio. Char2wav: End-to-end speech synthesis. ICLR workshop 2017.
[Tjandra17] A.Tjandra, S.Sakti, S.Nakamura. Listening while speaking: Speech chain by deep learning. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017.
[Wang17] Y.Wang, RJ Skerry-Ryan, D.Stanton, Y.Wu, R.Weiss, N.Jaitly, Z.Yang, Y.Xiao, Z.Chen, S.Bengio, Q.Le, Y.Agiomyrgiannakis, R.Clark, R.A.Saurous. Tacotron: Towards end-to-end speech synthesis. Interspeech 2017.
[Adigwe18] A.Adigwe, N.Tits, K.El Haddad, S.Ostadabbas, T.Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018.
[Akuzawa18] K.Akuzawa, Y.Iwasawa, Y.Matsuo. Expressive speech synthesis via modeling expressions with variational autoencoder. Interspeech 2018.
[Arık18] S.Ö.Arık, J.Chen, K.Peng, W.Ping, Y.Zhou. Neural voice cloning with a few samples. NeurIPS 2018.
[Chae18] M.-J.Chae, K.Park, J.Bang, S.Suh, J.Park, N.Kim, L.Park. Convolutional sequence to sequence model with non-sequential greedy decoding for grapheme to phoneme conversion. ICASSP 2018.
[Guo18] W.Guo, H.Yang, Z.Gan. A dnn-based mandarin-tibetan cross-lingual speech synthesis. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018.
[Kalchbrenner18] N.Kalchbrenner, E.Elsen, K.Simonyan, S.Noury, N.Casagrande, E.Lockhart, F.Stimberg, A.van den Oord, S.Dieleman, K.Kavukcuoglu. Efficient neural audio synthesis. ICML 2018.
[Jia18] Y.Jia, Y.Zhang, R.J.Weiss, Q.Wang, J.Shen, F.Ren, Z.Chen, P.Nguyen, R.Pang, I.L.Moreno, Y.Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. NeurIPS 2018.
[Jin18] Z.Jin, A.Finkelstein, G.J.Mysore, J.Lu. FFTNet: A real-time speaker-dependent neural vocoder. ICASSP 2018.
[Juvela18] L.Juvela, V.Tsiaras, B.Bollepalli, M.Airaksinen, J.Yamagishi, P. Alku. Speaker-independent raw waveform model for glottal excitation. Interspeech 2018.
[Nachmani18] E.Nachmani, A.Polyak, Y.Taigman, L.Wolf. Fitting new speakers based on a short untranscribed sample. ICML 2018.
[Okamoto18a] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder covering entire audible frequency range with limited acoustic features. ICASSP 2018.
[Okamoto18b] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai. Improving FFT-Net vocoder with noise shaping and subband approaches. IEEE Spoken Language Technology Workshop (SLT) 2018.
[Oord18] A.van den Oord, Y.Li, I.Babuschkin, K.Simonyan, O.Vinyals, K.Kavukcuoglu, G.van den Driessche, E.Lockhart, L.C.Cobo, F.Stimberg et al., Parallel WaveNet: Fast high-fidelity speech synthesis. ICML 2018.
[Ping18] W.Ping, K.Peng, A.Gibiansky, S.Ö.Arık, A.Kannan, S.Narang, J.Raiman, J.Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. ICLR 2018.
[Shen18] J.Shen, R.Pang, R.J.Weiss, M.Schuster, N.Jaitly, Z.Yang, Z.Chen, Y.Zhang, Y.Wang, RJ S.Ryan, R.A.Saurous, Y.Agiomyrgiannakis, Y.Wu. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. ICASSP 2018.
[Skerry-Ryan18] R.J.Skerry-Ryan, E.Battenberg, Y.Xiao, Y.Wang, D.Stanton, J.Shor, R.Weiss, R.Clark, R.A.Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. ICML 2018.
[Tachibana18] H.Tachibana, K.Uenoyama, S.Aihara. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. ICASSP 2018.
[Taigman18] Y.Taigman, L.Wolf, A.Polyak, E.Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. ICLR 2018.
[Tjandra18] A.Tjandra, S.Sakti, S.Nakamura. Machine speech chain with one-shot speaker adaptation. Interspeech 2018.
[Wang18] Y.Wang, D.Stanton, Y.Zhang, R.J.Skerry-Ryan, E.Battenberg, J.Shor, Y.Xiao, Y.Jia, F.Ren, R.A.Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. ICML 2018.
[Bollepalli19] B.Bollepalli, L.Juvela, P.Alku et al. Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system. Interspeech 2019.
[Chen19a] Y.-J.Chen, T.Tu, C.-c.Yeh, H.-Y.Lee. End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning. Interspeech 2019.
[Chen19b] Y.Chen, Y.Assael, B.Shillingford, D.Budden, S.Reed, H.Zen, Q.Wang, L.C.Cobo, A.Trask, B.Laurie, C.Gulcehre, A.van den Oord, O.Vinyals, N.de Freitas. Sample efficient adaptive text-to-speech. ICLR 2019.
[Chen19c] M.Chen, M.Chen, S.Liang, J.Ma, L.Chen, S.Wang, J.Xiao. Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. Interspeech 2019.
[Chung19] Y.-A.Chung, Y.Wang, W.-N.Hsu, Y.Zhang, R.J.Skerry-Ryan. Semi-supervised training for improving data efficiency in end-to-end speech synthesis. ICASSP 2019.
[Donahue19] C.Donahue, J.McAuley, M.Puckette. Adversarial audio synthesis. ICLR 2019.
[Fang19] W.Fang, Y.-A.Chung, J.Glass. Towards transfer learning for end-to-end speech synthesis from deep pre-trained language models. arXiv preprint arXiv:1906.07307, 2019.
[Guo19] H.Guo, F.K.Soong, L.He, L.Xie. A new GAN-based end-to-end tts training algorithm. Interspeech 2019.
[Gururani19] S.Gururani, K.Gupta, D.Shah, Z.Shakeri, J.Pinto. Prosody transfer in neural text to speech using global pitch and loudness features. arXiv preprint arXiv:1911.09645, 2019.
[Habib19] R.Habib, S.Mariooryad, M.Shannon, E.Battenberg, R.J.Skerry-Ryan, D.Stanton, D.Kao, T.Bagby. Semi-supervised generative modeling for controllable speech synthesis. ICLR 2019.
[Hayashi19] T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu. Pre-trained text embeddings for enhanced text-to-speech synthesis. Interspeech 2019.
[Hsu19] W.-N.Hsu, Y.Zhang, R.J.Weiss, H.Zen, Y.Wu, Y.Wang, Y.Cao, Y.Jia, Z.Chen, J.Shen, P.Nguyen, R.Pang. Hierarchical generative modeling for controllable speech synthesis. ICLR 2019.
[Jia19] Y.Jia, R.J.Weiss, F.Biadsy, W.Macherey, M.Johnson, Z.Chen, Y.Wu. Direct speech-to-speech translation with a sequence-to-sequence model. Interspeech 2019.
[Juvela19] L.Juvela, B.Bollepalli, J.Yamagishi, P.Alku. Gelp: Gan-excited linear prediction for speech synthesis from mel-spectrogram. Interspeech 2019.
[Kim19] S.Kim, S.Lee, J.Song, J.Kim, S.Yoon. FloWaveNet: A Generative flow for raw audio. ICML 2019.
[Kenter19] T.Kenter, V.Wan, C.-A.Chan, R.Clark, J.Vit. Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. ICML 2019.
[Klimkov19] V.Klimkov, S.Ronanki, J.Rohnke, T.Drugman. Fine-grained robust prosody transfer for single-speaker neural text-to-speech. Interspeech 2019.
[Kons19] Z.Kons, S.Shechtman, A.Sorin, C.Rabinovitz, R.Hoory. High quality, lightweight and adaptable TTS using LPCNet. Interspeech 2019.
[Kwon19] O.Kwon, E.Song, J.-M.Kim, H.-G.Kang. Effective parameter estimation methods for an excitnet model in generative text-to-speech systems. arXiv preprint arXiv:1905.08486, 2019.
[Kumar19] K.Kumar, R.Kumar, T.de Boissiere, L.Gestin, W.Z.Teoh, J.Sotelo, A.de Brebisson, Y.Bengio, A. Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. NeurIPS 2019.
[Lee19] Y.Lee, T.Kim. Robust and fine-grained prosody control of end-to-end speech synthesis. ICASSP 2019.
[Li19a] N.Li, S.Liu, Y.Liu, S.Zhao, M.Liu, M.Zhou. Neural speech synthesis with transformer network. AAAI 2019.
[Li19b] B. Li, Y. Zhang, T. Sainath, Y. Wu, W. Chan. Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes. ICASSP, 2019.
[Lorenzo-Trueba19] J.Lorenzo-Trueba, T.Drugman, J.Latorre, T.Merritt, B.Putrycz, R.Barra-Chicote, A.Moinet, V.Aggarwal. Towards achieving robust universal neural vocoding. Interspeech 2019.
[Ma19] S.Ma, D.Mcduff, Y.Song. Neural TTS stylization with adversarial and collaborative games. ICLR 2019.
[Ming19] H. Ming, L. He, H. Guo, and F. Soong. Feature reinforcement with word embedding and parsing information in neural TTS. arXiv preprint arXiv:1901.00707, 2019.
[Nachmani19] E.Nachmani, L.Wolf. Unsupervised polyglot text to speech. ICASSP 2019.
[Ping19] W.Ping, K.Peng, J.Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. ICLR 2019.
[Prenger19] R.Prenger, R.Valle, B.Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. ICASSP 2019.
[Ren19a] Y.Ren, Y.Ruan, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.Y.Liu. FastSpeech: Fast, robust and controllable text to speech. NeurIPS 2019.
[Ren19b] Y.Ren, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.-Y.Liu. Almost unsupervised text to speech and automatic speech recognition. ICML 2019.
[Song19] E.Song, K.Byun, H.-G.Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. EUSIPCO, 2019.
[Tits19a] N.Tits, K.E.Haddad, T.Dutoit. Exploring transfer learning for low resource emotional TTS. SAI Intelligent Systems Conference. Springer 2019.
[Tits19b] N.Tits, F.Wang, K.E.Haddad, V.Pagel, T.Dutoit. Visualization and interpretation of latent spaces for controlling expressive speech synthesis through audio analysis. arXiv preprint arXiv:1903.11570, 2019.
[Tjandra19] A.Tjandra, B.Sisman, M.Zhang, S.Sakti, H.Li, S.Nakamura. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019. Interspeech 2019.
[Valin19] J.-M.Valin, J.Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. ICASSP 2019.
[Wang19a] X.Wang, S.Takaki, J.Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. ICASSP 2019.
[Wang19b] X.Wang, S.Takaki, J.Yamagishi. Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis. ISCA Speech Synthesis Workshop 2019.
[Yamamoto19] R.Yamamoto, E.Song, J.-M.Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. Interspeech 2019.
[Yang19] B.Yang, J.Zhong, S.Liu. Pre-trained text representations for improving front-end text processing in Mandarin text-to-speech synthesis. Interspeech 2019.
[Zhang19a] Y.-J.Zhang, S.Pan, L.He, Z.-H.Ling. Learning latent representations for style control and transfer in end-to-end speech synthesis. ICASSP 2019.
[Zhang19b] M.Zhang, X.Wang, F.Fang, H.Li, J.Yamagishi. Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet. Interspeech 2019.
[Zhang19c] W.Zhang, H.Yang, X.Bu, L.Wang. Deep learning for mandarin-tibetan cross-lingual speech synthesis. IEEE Access 2019.
[Zhang19d] Y.Zhang, R.J.Weiss, H.Zen, Y.Wu, Z.Chen, R.J.Skerry-Ryan, Y.Jia, A.Rosenberg, B.Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. Interspeech 2019.
[Azizah20] K.Azizah, M.Adriani, W.Jatmiko. Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages. IEEE Access 2020.
[Bae20] J.-S.Bae, H.Bae, Y.-S.Joo, J.Lee, G.-H.Lee, H.-Y.Cho. Speaking speed control of end-to-end speech synthesis using sentence-level conditioning. Interspeech 2020.
[Binkowski20] M.Binkowski, J.Donahue, S.Dieleman, A.Clark, E.Elsen, N.Casagrande, L.C.Cobo, K.Simonyan. High fidelity speech synthesis with adversarial networks. ICLR 2020.
[Chen20] M.Chen, X.Tan, Y.Ren, J.Xu, H.Sun, S.Zhao, T.Qin. MultiSpeech: Multi-speaker text to speech with transformer. Interspeech 2020.
[Choi20] S.Choi, S.Han, D.Kim, S.Ha. Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding. Interspeech 2020.
[Cooper20a] E.Cooper, C.-I.Lai, Y.Yasuda, F.Fang, X.Wang, N.Chen, J.Yamagishi. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. ICASSP 2020.
[Cooper20b] E.Cooper, C.-I.Lai, Y.Yasuda, J.Yamagishi. Can speaker augmentation improve multi-speaker end-to-end TTS? Interspeech 2020.
[Cui20] Y.Cui, X.Wang, L.He, F.K.Soong. An efficient subband linear prediction for lpcnet-based neural synthesis. Interspeech 2020.
[deKorte20] M.de Korte, J.Kim, E.Klabbers. Efficient neural speech synthesis for low-resource languages through multilingual modeling. Interspeech 2020.
[Engel20] J.Engel, L.Hantrakul, C.Gu, A.Roberts. DDSP: Differentiable digital signal processing. ICLR 2020.
[Gritsenko20] A.Gritsenko, T.Salimans, R.van den Berg, J.Snoek, N.Kalchbrenner. A spectral energy distance for parallel speech synthesis. NeurIPS 2020.
[Hemati20] H.Hemati, D.Borth. Using IPA-based tacotron for data efficient cross-lingual speaker adaptation and pronunciation enhancement. arXiv preprint arXiv:2011.06392, 2020.
[Himawan20] I.Himawan, S.Aryal, I.Ouyang, S.Kang, P.Lanchantin, S.King. Speaker adaptation of a multilingual acoustic model for cross-language synthesis. ICASSP 2020.
[Hsu20] P.-C.Hsu, H.-Y.Lee. WG-WaveNet: Real-time high-fidelity speech synthesis without GPU. Interspeech 2020.
[Hwang20a] M.-J.Hwang, F.Soong, E.Song, X.Wang, H.Kang, H.-G.Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2020.
[Hwang20b] M.-J.Hwang, E.Song, R.Yamamoto, F.Soong, H.-G.Kang. Improving LPCNet-based text-to-speech with linear prediction-structured mixture density network. ICASSP 2020.
[Jang20] W.Jang, D.Lim, J.Yoon. Universal MelGAN: A robust neural vocoder for high-fidelity waveform generation in multiple domains. arXiv preprint arXiv:2011.09631, 2020.
[Kanagawa20] H.Kanagawa, Y.Ijima. Lightweight LPCNet-based neural vocoder with tensor decomposition. Interspeech 2020.
[Kenter20] T.Kenter, M.K.Sharma, R.Clark. Improving prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model. Interspeech 2020.
[Kim20] J.Kim, S.Kim, J.Kong, S.Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. NeurIPS 2020.
[Kong20] J.Kong, J.Kim, J.Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. NeurIPS 2020.
[Li20] N.Li, Y.Liu, Y.Wu, S.Liu, S.Zhao, M.Liu. RobuTrans: A robust transformer-based text-to-speech model. AAAI 2020.
[Lim20] D.Lim, W.Jang, G.O, H.Park, B.Kim, J.Yoon. JDI-T: Jointly trained duration informed transformer for text-to-speech without explicit alignment. Interspeech 2020.
[Liu20a] A.H.Liu, T.Tu, H.-y.Lee, L.-s.Lee. Towards unsupervised speech recognition and synthesis with quantized speech representation learning. ICASSP 2020.
[Liu20b] Z.Liu, K.Chen, K.Yu. Neural homomorphic vocoder. Interspeech 2020.
[Luong20] H.-T.Luong, J.Yamagishi. NAUTILUS: a versatile voice cloning system. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020.
[Maiti20] S.Maiti, E.Marchi, A.Conkie. Generating multilingual voices using speaker space translation based on bilingual speaker data. ICASSP 2020.
[Miao20] C.Miao, S.Liang, M.Chen, J.Ma, S.Wang, J.Xiao. Flow-TTS: A non-autoregressive network for text to speech based on flow. ICASSP 2020.
[Morrison20] M.Morrison, Z.Jin, J.Salamon, N.J.Bryan, G.J.Mysore. Controllable neural prosody synthesis. Interspeech 2020.
[Moss20] H.B.Moss, V.Aggarwal, N.Prateek, J.González, R.Barra-Chicote. BOFFIN TTS: Few-shot speaker adaptation by bayesian optimization. ICASSP 2020.
[Nekvinda20] T.Nekvinda, O.Dušek. One model, many languages: Meta-learning for multilingual text-to-speech. Interspeech 2020.
[Park20] K.Park, S.Lee. G2PM: A neural grapheme-to-phoneme conversion package for mandarin chinese based on a new open benchmark dataset. Interspeech 2020.
[Paul20a] D.Paul, Y.Pantazis, Y.Stylianou. Speaker Conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions. Interspeech 2020.
[Paul20b] D.Paul, M.P.V.Shifas, Y.Pantazis, Y.Stylianou. Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion. Interspeech 2020.
[Peng20] K.Peng, W.Ping, Z.Song, K.Zhao. Non-autoregressive neural text-to-speech. ICML 2020.
[Ping20] W.Ping, K.Peng, K.Zhao, Z.Song. WaveFlow: A compact flow-based model for raw audio. ICML 2020.
[Popov20a] V.Popov, M.Kudinov, T.Sadekova. Gaussian LPCNet for multisample speech synthesis. ICASSP 2020.
[Popov20b] V.Popov, S.Kamenev, M.Kudinov, S.Repyevsky, T.Sadekova, V.Bushaev, V.Kryzhanovskiy, D.Parkhomenko. Fast and lightweight on-device tts with Tacotron2 and LPCNet. Interspeech 2020.
[Shen20] J.Shen, Y.Jia, M.Chrzanowski, Y.Zhang, I.Elias, H.Zen, Y.Wu. Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling. arXiv preprint arXiv:2010.04301, 2020.
[Song20] E.Song, M.-J.Hwang, R.Yamamoto, J.-S.Kim, O.Kwon, J.-M.Kim. Neural text-to-speech with a modeling-by-generation excitation vocoder. Interspeech 2020.
[Staib20] M.Staib, T.H.Teh, A.Torresquintero, D.S.R.Mohan, L.Foglianti, R.Lenain, J.Gao. Phonological features for 0-shot multilingual speech synthesis. Interspeech 2020.
[Sun20a] G.Sun, Y.Zhang, R.J.Weiss, Y.Cao, H.Zen, A.Rosenberg, B.Ramabhadran, Y.Wu. Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior. ICASSP 2020.
[Sun20b] G.Sun, Y.Zhang, R.J.Weiss, Y.Cao, H.Zen, Y.Wu. Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis. ICASSP 2020.
[Tian20] Q.Tian, Z.Zhang, L.Heng, L.Chen, S.Liu. FeatherWave: An efficient high-fidelity neural vocoder with multiband linear prediction. Interspeech 2020.
[Tu20] T.Tu, Y.-J.Chen, A.H.Liu, H.-y.Lee. Semi-supervised learning for multi-speaker text-to-speech synthesis using discrete speech representation. Interspeech 2020.
[Um20] S.-Y.Um, S.Oh, K.Byun, I.Jang, C.H.Ahn, H.-G.Kang. Emotional speech synthesis with rich and granularized control. ICASSP 2020.
[Valle20a] R.Valle, K.Shih, R.Prenger, B.Catanzaro. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957, 2020.
[Valle20b] R.Valle, J.Li, R.Prenger, B.Catanzaro. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. ICASSP 2020.
[Vipperla20] R.Vipperla, S.Park, K.Choo, S.Ishtiaq, K.Min, S.Bhattacharya, A.Mehrotra, A.G.C.P.Ramos, N.D.Lane. Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems. Interspeech 2020.
[Wu20] Y.-C.Wu, T.Hayashi, T.Okamoto, H.Kawai, T.Toda. Quasi-periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation. Interspeech 2020.
[Xiao20] Y.Xiao, L.He, H.Ming, F.K.Soong. Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS. ICASSP 2020.
[Xu20] J.Xu, X.Tan, Y.Ren, T.Qin, J.Li, S.Zhao, T.-Y.Liu. LRSpeech: Extremely low-resource speech synthesis and recognition. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020.
[Yamamoto20] R.Yamamoto, E.Song, J.-M.Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. ICASSP 2020.
[Yang20a] J.Yang, J.Lee, Y.Kim, H.-Y.Cho, I.Kim. VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network. Interspeech 2020.
[Yang20b] J.Yang, L.He. Towards universal text-to-speech. Interspeech 2020.
[Yu20] C.Yu, H.Lu, N.Hu, M.Yu, C.Weng, K.Xu, P.Liu, D.Tuo, S.Kang, G.Lei, D.Su, D.Yu. DurIAN: Duration informed attention network for speech synthesis. Interspeech 2020.
[Zhang20a] H.Zhang, Y.Lin. Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages. Interspeech 2020.
[Zhang20b] Z.Zhang, Q.Tian, H.Lu, L.-H.Chen, S.Liu. AdaDurIAN: Few-shot adaptation for neural text-to-speech with durian. arXiv preprint arXiv:2005.05642, 2020.
[Zhai20] B.Zhai, T.Gao, F.Xue, D.Rothchild, B.Wu, J.E.Gonzalez, K.Keutzer. SqueezeWave: Extremely lightweight vocoders for on-device speech synthesis. arXiv preprint arXiv:2001.05685, 2020.
[Zhao20] S.Zhao, T.H.Nguyen, H.Wang, B.Ma. Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. Interspeech 2020.
[Zeng20] Z.Zeng, J.Wang, N.Cheng, T.Xia, J.Xiao. AlignTTS: Efficient feed-forward text-to-speech system without explicit alignment. ICASSP 2020.
[Zhou20] X.Zhou, X.Tian, G.Lee, R.K.Das, H.Li. End-to-end code-switching TTS with cross-lingual language model. ICASSP 2020.
[Achanta21] S.Achanta, A.Antony, L.Golipour, J.Li, T.Raitio, R.Rasipuram, F.Rossi, J.Shi, J.Upadhyay, D.Winarsky, H.Zhang. On-device neural speech synthesis. IEEE Workshop on Automatic Speech Recognition and Understanding 2021.
[Bak21] T.Bak, J.-S.Bae, H.Bae, Y.-I.Kim, H.-Y.Cho. FastPitchFormant: Source-filter based decomposed modeling for speech synthesis. Interspeech 2021.
[Bae21] J.-S.Bae, T.-J.Bak, Y.-S.Joo, H.-Y.Cho. Hierarchical context-aware transformers for non-autoregressive text to speech. Interspeech 2021.
[Casanova21] E.Casanova, C.Shulby, E.Gölge, N.M.Müller, F.S.de Oliveira, A.C.Junior, A.d.Soares, S.M.Aluisio, M.A.Ponti. SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model. Interspeech 2021.
[Chen21a] N.Chen, Y.Zhang, H.Zen, R.J.Weiss, M.Norouzi, W.Chan. WaveGrad: Estimating gradients for waveform generation. ICLR 2021.
[Chen21b] M.Chen, X.Tan, B.Li, Y.Liu, T.Qin, S.Zhao, T.-Y.Liu. AdaSpeech: Adaptive text to speech for custom voice. ICLR 2021.
[Chien21] C.-M.Chien, J.-H.Lin, C.-y.Huang, P.-c.Hsu, H.-y.Lee. Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech. ICASSP 2021.
[Christidou21] M.Christidou, A.Vioni, N.Ellinas, G.Vamvoukakis, K.Markopoulos, P.Kakoulidis, J.S.Sung, H.Park, A.Chalamandaris, P.Tsiakoulis. Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control. SPECOM 2021.
[Donahue21] J.Donahue, S.Dieleman, M.Binkowski, E.Elsen, K.Simonyan. End-to-end adversarial text-to-speech. ICLR 2021.
[Du21] C.Du, K.Yu. Rich prosody diversity modelling with phone-level mixture density network. Interspeech 2021.
[Elias21a] I.Elias, H.Zen, J.Shen, Y.Zhang, Y.Jia, R.Weiss, Y.Wu. Parallel Tacotron: Non-autoregressive and controllable TTS. ICASSP 2021.
[Elias21b] I.Elias, H.Zen, J.Shen, Y.Zhang, Y.Jia, R.J.Skerry-Ryan, Y.Wu. Parallel Tacotron 2: A non-autoregressive neural tts model with differentiable duration modeling. Interspeech 2021.
[Hu21] Q.Hu, T.Bleisch, P.Petkov, T.Raitio, E.Marchi, V.Lakshminarasimhan. Whispered and lombard neural speech synthesis. IEEE Spoken Language Technology Workshop (SLT) 2021.
[Huang21] Z.Huang, H.Li, M.Lei. DeviceTTS: A small-footprint, fast, stable network for on-device text-to-speech. ICASSP 2021.
[Huybrechts21] G.Huybrechts, T.Merritt, G.Comini, B.Perz, R.Shah, J.Lorenzo-Trueba. Low-resource expressive text-to-speech using data augmentation. ICASSP 2021.
[Hwang21a] M.-J.Hwang, R.Yamamoto, E.Song, J.-M.Kim. TTS-by-TTS: Tts-driven data augmentation for fast and high-quality speech synthesis. ICASSP 2021.
[Hwang21b] M.-J.Hwang, R.Yamamoto, E.Song, J.-M.Kim. High-fidelity Parallel WaveGAN with multi-band harmonic-plus-noise model. Interspeech 2021.
[Jang21] W.Jang, D.Lim, J.Yoon, B.Kim, J.Kim. UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. Interspeech 2021.
[Jeong21] M.Jeong, H.Kim, S.J.Cheon, B.J.Choi, N.S.Kim. Diff-TTS: A Denoising diffusion model for text-to-speech. Interspeech 2021.
[Jia21] Y.Jia, H.Zen, J.Shen, Y.Zhang, Y.Wu. PnG BERT: Augmented bert on phonemes and graphemes for neural TTS. arXiv preprint arXiv:2103.15060, 2021.
[Kang21] M.Kang, J.Lee, S.Kim, I.Kim. Fast DCTTS: Efficient deep convolutional text-to-speech. ICASSP 2021.
[Kim21a] J.Kim, J.Kong, J.Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. ICML 2021.
[Kim21b] J.-H.Kim, S.-H.Lee, J.-H.Lee, S.-W.Lee. Fre-GAN: Adversarial frequency-consistent audio synthesis. Interspeech 2021.
[Kim21c] M.Kim, S.J.Cheon, B.J.Choi, J.J.Kim, N.S.Kim. Expressive text-to-speech using style tag. Interspeech 2021.
[Kim21d] H.-Y.Kim, J.-H.Kim, J.-M.Kim. NN-KOG2P: A novel grapheme-to-phoneme model for Korean language. ICASSP 2021.
[Kong21] Z.Kong, W.Ping, J.Huang, K.Zhao, B.Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. ICLR 2021.
[Łańcucki21] A.Łańcucki. FastPitch: Parallel text-to-speech with pitch prediction. ICASSP 2021.
[Lee21a] Y.Lee, J.Shin, K.Jung. Bidirectional variational inference for non-autoregressive text-to-speech. ICLR 2021.
[Lee21b] S.-H.Lee, H.-W.Yoon, H.-R.Noh, J.-H.Kim, S.-W.Lee. Multi-SpectroGAN: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. AAAI 2021.
[Lee21c] K.Lee, K.Park, D.Kim. Styler: Style modeling with rapidity and robustness via speech decomposition for expressive and controllable neural text to speech. Interspeech 2021.
[Li21a] T.Li, S.Yang, L.Xue, L.Xie. Controllable emotion transfer for end-to-end speech synthesis. International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021.
[Li21b] X.Li, C.Song, J.Li, Z.Wu, J.Jia, H.Meng. Towards multiscale style control for expressive speech synthesis. Interspeech 2021.
[Liu21] Y.Liu, Z.Xu, G.Wang, K.Chen, B.Li, X.Tan, J.Li, L.He, S.Zhao. DelightfulTTS: The Microsoft speech synthesis system for Blizzard challenge 2021. arXiv preprint arXiv:2110.12612, 2021.
[Luo21] R.Luo, X.Tan, R.Wang, T.Qin, J.Li, S.Zhao, E.Chen, T.-Y.Liu. LightSpeech: Lightweight and fast text to speech with neural architecture search. ICASSP 2021.
[Miao21] C.Miao, S.Liang, Z.Liu, M.Chen, J.Ma, S.Wang, J.Xiao. EfficientTTS: An efficient and high-quality text-to-speech architecture. ICML 2021.
[Min21] D.Min, D.B.Lee, E.Yang, S.J.Hwang. Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation. ICML 2021.
[Morisson21] M.Morrison, Z.Jin, N.J.Bryan, J.-P.Caceres, B.Pardo. Neural pitch-shifting and time-stretching with controllable LPCNet. arXiv preprint arXiv:2110.02360, 2021.
[Nguyen21] H.-K.Nguyen, K.Jeong, S.Um, M.-J.Hwang, E.Song, H.-G.Kang. LiteTTS: A lightweight mel-spectrogram-free text-to-wave synthesizer based on generative adversarial networks. Interspeech 2021.
[Pan21] S.Pan, L.He. Cross-speaker style transfer with prosody bottleneck in neural speech synthesis. Interspeech 2021.
[Popov21] V.Popov, I.Vovk, V.Gogoryan, T.Sadekova, M.Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. ICML 2021.
[Ren21a] Y.Ren, C.Hu, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.-Y.Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. ICLR 2021.
[Ren21b] Y.Ren, J.Liu, Z.Zhao. PortaSpeech: Portable and high-quality generative text-to-speech. NeurIPS 2021.
[Sivaprasad21] S.Sivaprasad, S.Kosgi, V.Gandhi. Emotional prosody control for speech generation. Interspeech 2021.
[Song21] E.Song, R.Yamamoto, M.-J.Hwang, J.-S.Kim, O.Kwon, J.-M.Kim. Improved Parallel WaveGAN vocoder with perceptually weighted spectrogram loss. IEEE Spoken Language Technology Workshop (SLT) 2021.
[Tan21] X.Tan, T.Qin, F.Soong, T.-Y.Liu. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561v3, 2021.
[Wang21] D.Wang, L.Deng, Y.Zhang, N.Zheng, Y.T.Yeung, X.Chen, X.Liu, H.Meng. FCL-Taco2: Towards fast, controllable and lightweight text-to-speech synthesis. ICASSP 2021.
[Weiss21] R.J.Weiss, R.J.Skerry-Ryan, E.Battenberg, S.Mariooryad, D.P.Kingma. Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis. ICASSP 2021.
[Xu21] G.Xu, W.Song, Z.Zhang, C.Zhang, X.He, B.Zhou. Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis. ICASSP 2021.
[Yamamoto21] R.Yamamoto, E.Song, M.-J.Hwang, J.-M.Kim. Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators. ICASSP 2021.
[Yan21a] Y.Yan, X.Tan, B.Li, T.Qin, S.Zhao, Y.Shen, T.-Y.Liu. AdaSpeech 2: Adaptive text to speech with untranscribed data. ICASSP 2021.
[Yan21b] Y.Yan, X.Tan, B.Li, G.Zhang, T.Qin, S.Zhao, Y.Shen, W.-Q.Zhang, T.-Y.Liu. AdaSpeech 3: Adaptive text to speech for spontaneous style. Interspeech 2021.
[Yang21a] G.Yang, S.Yang, K.Liu, P.Fang, W.Chen, L.Xie. Multi-Band MelGAN: Faster waveform generation for high-quality text-to-speech. IEEE Spoken Language Technology Workshop (SLT) 2021.
[Yang21b] J.Yang, J.-S.Bae, T.Bak, Y.Kim, H.-Y.Cho. GANSpeech: Adversarial training for high-fidelity multi-speaker speech synthesis. Interspeech 2021.
[Yoneyama21] R.Yoneyama, Y.-C.Wu, T.Toda. Unified source-filter GAN: Unified source-filter network based on factorization of quasi-periodic Parallel WaveGAN. Interspeech 2021.
[You21] J.You, D.Kim, G.Nam, G.Hwang, G.Chae. GAN Vocoder: Multi-resolution discriminator is all you need. Interspeech 2021.
[Yue21] F.Yue, Y.Deng, L.He, T.Ko. Exploring machine speech chain for domain adaptation and few-shot speaker adaptation. arXiv preprint arXiv:2104.03815, 2021.
[Zaidi21] J.Zaidi, H.Seute, B.van Niekerk, M.-A.Carbonneau. Daft-Exprt: Cross-speaker prosody transfer on any text for expressive speech synthesis. arXiv preprint arXiv:2108.02271, 2021.
[Zhang21a] C.Zhang, X.Tan, Y.Ren, T.Qin, K.Zhang, T.-Y.Liu. UWSpeech: Speech to speech translation for unwritten languages. AAAI 2021.
[Zhang21b] G.Zhang, Y.Qin, D.Tan, T.Lee. Applying the information bottleneck principle to prosodic representation learning. arXiv preprint arXiv:2108.02821, 2021.
[Zeng21] Z.Zeng, J.Wang, N.Cheng, J.Xiao. LVCNet: Efficient condition-dependent modeling network for waveform generation. ICASSP 2021.
[Bae22] J.-S.Bae, J.Yang, T.-J.Bak, Y.-S.Joo. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech. Interspeech 2022.
[Cho22] H.Cho, W.Jung, J.Lee, S.H.Woo. SANE-TTS: Stable and natural end-to-end multilingual text-to-speech. Interspeech 2022.
[Comini22] G.Comini, G.Huybrechts, M.S.Ribeiro, A.Gabrys, J.Lorenzo-Trueba. Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation. Interspeech 2022.
[Dai22] Z.Dai, J.Yu, Y.Wang, N.Chen, Y.Bian, G.Li, D.Cai, D.Yu. Automatic prosody annotation with pre-trained text-speech model. Interspeech 2022.
[Hsu22] P.-C.Hsu, D.-R.Liu, A.T.Liu, H.-y.Lee. Parallel synthesis for autoregressive speech generation. arXiv preprint arXiv:2204.11806, 2022.
[Huang22a] R.Huang, M.W.Y.Lam, J.Wang, D.Su, D.Yu, Y.Ren, Z.Zhao. FastDiff: A fast conditional diffusion model for high-quality speech synthesis. International Joint Conference on Artificial Intelligence 2022.
[Huang22b] R.Huang, Y.Ren, J.Liu, C.Cui, Z.Zhao. GenerSpeech: Towards style transfer for generalizable out-of-domain TTS synthesis. arXiv preprint arXiv:2205.07211, 2022.
[Kharitonov22] E.Kharitonov, A.Lee, A.Polyak, Y.Adi, J.Copet, K.Lakhotia, T.-A.Nguyen, M.Riviere, A.Mohamed, E.Dupoux, W.-N.Hsu. Text-free prosody-aware generative spoken language modeling. Annual Meeting of the Association for Computational Linguistics (ACL) 2022.
[Kim22a] H.Kim, S.Kim, S.Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. ICML 2022.
[Kim22b] S.Kim, H.Kim, S.Yoon. Guided-TTS 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv preprint arXiv:2205.15370, 2022.
[Koch22] J.Koch, F.Lux, N.Schauffler, T.Bernhart, F.Dieterle, J.Kuhn, S.Richter, G.Viehhauser, N.T.Vu. PoeticTTS: Controllable poetry reading for literary studies. Interspeech 2022.
[Lam22] M.W.Y.Lam, J.Wang, D.Su, D.Yu. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. ICLR 2022.
[Lee22a] S.-G.Lee, H.Kim, C.Shin, X.Tan, C.Liu, Q.Meng, T.Qin, W.Chen, S.Yoon, T.-Y.Liu. PriorGrad: Improving conditional denoising diffusion models with data-driven adaptive prior. ICLR 2022.
[Lee22b] S.-G.Lee, W.Ping, B.Ginsburg, B.Catanzaro, S.Yoon. BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658, 2022.
[Lei22] Y.Lei, S.Yang, X.Wang. MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, 2022.
[Li22a] Y.A.Li, C.Han, N.Mesgarani. StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis. arXiv preprint arXiv:2205.15439, 2022.
[Li22b] T.Li, X.Wang, Q.Xie, Z.Wang, M.Jiang, L.Xie. Cross-speaker emotion transfer based on prosody compensation for end-to-end speech synthesis. arXiv preprint arXiv:2207.01198, 2022.
[Li22c] X.Li, C.Song, X.Wei, Z.Wu, J.Jia, H.Meng. Towards cross-speaker reading style transfer on audiobook dataset. Interspeech 2022.
[Lian22] J.Lian, C.Zhang, G.K.Anumanchipalli, D.Yu. UTTS: Unsupervised TTS with conditional disentangled sequential variational auto-encoder. arXiv preprint arXiv:2206.02512, 2022.
[Lim22] D.Lim, S.Jung, E.Kim. JETS: Jointly training FastSpeech2 and HiFi-GAN for end-to-end text-to-speech. Interspeech 2022.
[Liu22a] S.Liu, D.Su, D.Yu. DiffGAN-TTS: High-fidelity and efficient text-to-speech with denoising diffusion GANs. arXiv preprint arXiv:2201.11972, 2022.
[Liu22b] Y.Liu, R.Xue, L.He, X.Tan, S.Zhao. DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders. Interspeech 2022.
[Lu22] Z.Lu, M.He, R.Zhang, C.Gong. A post auto-regressive GAN vocoder focused on spectrum fracture. arXiv preprint arXiv:2204.06086, 2022.
[Lux22] F.Lux, J.Koch, N.T.Vu. Prosody cloning in zero-shot multispeaker text-to-speech. arXiv preprint arXiv:2206.12229, 2022.
[Mehta22] S.Mehta, E.Szekely, J.Beskow, G.E.Henter. Neural HMMs are all you need (for high-quality attention-free TTS). ICASSP 2022.
[Mitsui22] K.Mitsui, T.Zhao, K.Sawada, Y.Hono, Y.Nankaku, K.Tokuda. End-to-end text-to-speech based on latent representation of speaking styles using spontaneous dialogue. Interspeech 2022.
[Morrison22] M.Morrison, R.Kumar, K.Kumar, P.Seetharaman, A.Courville, Y.Bengio. Chunked autoregressive GAN for conditional waveform synthesis. ICLR 2022.
[Nishimura22] Y.Nishimura, Y.Saito, S.Takamichi, K.Tachibana, H.Saruwatari. Acoustic modeling for end-to-end empathetic dialogue speech synthesis using linguistic and prosodic contexts of dialogue history. Interspeech 2022.
[Raitio22] T.Raitio, J.Li, S.Seshadri. Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS. ICASSP 2022.
[Ren22] Y.Ren, M.Lei, Z.Huang, S.Zhang, Q.Chen, Z.Yan, Z.Zhao. ProsoSpeech: Enhancing prosody with quantized vector pre-training in TTS. ICASSP 2022.
[Ribeiro22] M.S.Ribeiro, J.Roth, G.Comini, G.Huybrechts, A.Gabrys, J.Lorenzo-Trueba. Cross-speaker style transfer for text-to-speech using data augmentation. ICASSP 2022.
[Saeki22] T.Saeki, K.Tachibana, R.Yamamoto. DRSpeech: Degradation-robust text-to-speech synthesis with frame-level and utterance-level acoustic representation learning. Interspeech 2022.
[Shin22] Y.Shin, Y.Lee, S.Jo, Y.Hwang, T.Kim. Text-driven emotional style control and cross-speaker style transfer in neural TTS. Interspeech 2022.
[Song22] E.Song, R.Yamamoto, O.Kwon, C.-H.Song, M.-J.Hwang, S.Oh, H.-W.Yoon, J.-S.Kim, J.-M.Kim. TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking Support Vector Machine with variational autoencoder. Interspeech 2022.
[Tan22] X.Tan, J.Chen, H.Liu, J.Cong, C.Zhang, Y.Liu, X.Wang, Y.Leng, Y.Yi, L.He, F.Soong, T.Qin, S.Zhao, T.-Y.Liu. NaturalSpeech: End-to-end text to speech synthesis with human-level quality. arXiv preprint arXiv:2205.04421, 2022.
[Terashima22] R.Terashima, R.Yamamoto, E.Song, Y.Shirahata, H.-W.Yoon, J.-M.Kim, K.Tachibana. Cross-speaker emotion transfer for low-resource text-to-speech using non-parallel voice conversion with pitch-shift data augmentation. Interspeech 2022.
[Valin22] J.-M.Valin, U.Isik, P.Smaragdis, A.Krishnaswamy. Neural speech synthesis on a shoestring: Improving the efficiency of LPCNET. ICASSP 2022.
[Wang22] Y.Wang, Y.Xie, K.Zhao, H.Wang, Q.Zhang. Unsupervised quantized prosody representation for controllable speech synthesis. IEEE International Conference on Multimedia and Expo (ICME) 2022.
[Wu22a] Y.Wu, X.Tan, B.Li, L.He, S.Zhao, R.Song, T.Qin, T.-Y.Liu. AdaSpeech 4: Adaptive text to speech in zero-shot scenarios. arXiv preprint arXiv:2204.00436, 2022.
[Wu22b] S.Wu, Z.Shi. ItoWave: Ito stochastic differential equation is all you need for wave generation. ICASSP 2022.
[Xie22] Q.Xie, T.Li, X.Wang, Z.Wang, L.Xie, G.Yu, G.Wan. Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios. ICASSP 2022.
[Yang22] J.Yang, L.He. Cross-lingual TTS using multi-task learning and speaker classifier joint training. arXiv preprint arXiv:2201.08124, 2022.
[Ye22] Z.Ye, Z.Zhao, Y.Ren, F.Wu. SyntaSpeech: Syntax-aware generative adversarial text-to-speech. International Joint Conference on Artificial Intelligence 2022.
[Yoon22] H.-W.Yoon, O.Kwon, H.Lee, R.Yamamoto, E.Song, J.-M.Kim, M.-J.Hwang. Language model-based emotion prediction methods for emotional speech synthesis systems. Interspeech 2022.
[Zhang22] G.Zhang, Y.Qin, W.Zhang, J.Wu, M.Li, Y.Gai, F.Jiang, T.Lee. iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre. arXiv preprint arXiv:2206.14866, 2022.
