Models
=======
This section provides a brief overview of TTS models that NeMo's TTS collection currently supports.

* **Model Recipes** can be accessed through `examples/tts/*.py <https://github.com/NVIDIA/NeMo/tree/stable/examples/tts>`_.
* **Configuration Files** can be found in `examples/tts/conf/ <https://github.com/NVIDIA/NeMo/tree/stable/examples/tts/conf>`_. For detailed information about TTS configuration files and how they
  should be structured, refer to :doc:`./configs`.
* **Pretrained Model Checkpoints** are available for synthesizing speech out of the box or for fine-tuning on
  your custom datasets. See :doc:`./checkpoints` for instructions on how to use these pretrained models.


Mel-Spectrogram Generators
--------------------------

.. _FastPitch_model:

FastPitch
~~~~~~~~~
FastPitch is a fully parallel text-to-speech synthesis model based on FastSpeech and conditioned on fundamental frequency contours. The model predicts pitch contours during inference. Altering these predictions can make the generated speech more expressive, better matched to the semantics of the utterance, and ultimately more engaging to the listener. Uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the overall quality of synthesized speech, making it comparable to the state of the art. This conditioning introduces no overhead, and FastPitch retains the favorable, fully parallel Transformer architecture, with a real-time factor of over 900x for mel-spectrogram synthesis of a typical utterance.

The architecture of FastPitch is shown below. It is based on FastSpeech and consists of two feed-forward Transformer (FFTr) stacks. The first FFTr operates at the resolution of the input tokens, and the second at the resolution of the output frames. Please refer to :cite:`tts-models-lancucki2021fastpitch` for details.

    .. image:: images/fastpitch_model.png
        :align: center
        :alt: fastpitch model
        :scale: 30%
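
As a rough illustration of the two resolutions described above, the following minimal sketch shifts a token-level pitch contour uniformly and then expands it to frame resolution. It is not NeMo's implementation. ``shift_pitch`` and ``regulate_length`` are hypothetical helper names, pitch is assumed to be given in Hz with ``0.0`` marking unvoiced tokens (the real model may represent pitch differently, for example as normalized values), and the expansion assumes that, as in FastSpeech, each token is repeated according to a per-token duration.

.. code-block:: python

    def shift_pitch(pitch_per_token, semitones):
        """Uniformly shift a pitch contour given in Hz by a number of semitones."""
        factor = 2.0 ** (semitones / 12.0)
        # Unvoiced tokens (marked with 0.0) are left untouched.
        return [p * factor if p > 0.0 else 0.0 for p in pitch_per_token]


    def regulate_length(token_values, durations):
        """Repeat each token-level value for its number of output frames."""
        frames = []
        for value, n_frames in zip(token_values, durations):
            frames.extend([value] * n_frames)
        return frames


    # Hypothetical example values: one pitch value and one duration per input token.
    token_pitch = [110.0, 0.0, 220.0]
    durations = [2, 1, 3]

    raised = shift_pitch(token_pitch, semitones=12)  # one octave up
    frame_pitch = regulate_length(raised, durations)
    print(frame_pitch)

The point of the sketch is the order of operations: the pitch is edited once per token, and the edit then carries over to every output frame that token spans.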

SSL FastPitch
~~~~~~~~~~~~~
This **experimental** version of FastPitch takes in content and speaker embeddings generated by an SSL Disentangler and generates mel-spectrograms, with the goal that voice characteristics are taken from the speaker embedding while the content of speech is determined by the content embedding. Voice conversion can be done using this model by swapping the speaker embedding input to that of a target speaker, while keeping the content embedding the same. More details to come.


End-to-End LLM-based TTS
------------------------

MagpieTTS
~~~~~~~~~
MagpieTTS is an encoder-decoder transformer TTS model that operates on discrete audio tokens from a neural audio codec. It uses monotonic alignment (CTC loss and attention priors) to reduce hallucinations and supports voice cloning via audio or text context conditioning. For architecture, training, inference, and preference optimization (DPO/GRPO), see :doc:`Magpie-TTS documentation <magpietts>`.


Vocoders
--------

HiFiGAN
~~~~~~~
HiFi-GAN is a vocoder model that efficiently synthesizes raw waveform audio from intermediate mel-spectrograms. It consists of one generator and two discriminators (multi-scale and multi-period). The generator and discriminators are trained adversarially, with two additional losses that improve training stability and model performance.

The generator is a fully convolutional neural network that takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform. Every transposed convolution is followed by a multi-receptive field fusion (MRF) module. The architecture of the generator is shown below in (a).

The multi-period discriminator (MPD) is a mixture of sub-discriminators, each of which accepts only equally spaced samples of the input audio. The sub-discriminators are designed to capture different implicit structures by looking at different parts of the input audio. Because MPD accepts only disjoint samples, the multi-scale discriminator (MSD) is added to evaluate the audio sequence consecutively. MSD is a mixture of three sub-discriminators operating on different input scales (raw audio, x2 average-pooled audio, and x4 average-pooled audio).

HiFi-GAN is reported to achieve both higher computational efficiency and higher sample quality than the best publicly available autoregressive or flow-based models, such as WaveNet and WaveGlow. Please refer to :cite:`tts-models-kong2020hifi` for details.

    .. figure:: images/hifigan_g_model.png
        :alt: hifigan_g model
        :scale: 30%

(a) Generator

    .. figure:: images/hifigan_d_model.png
        :alt: hifigan_d model
        :scale: 30%

(b) Discriminators
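
To make the discriminator inputs described above more concrete, the following minimal sketch shows how one waveform could be split into equally spaced sample groups for MPD and average-pooled for MSD. It is a simplified illustration, not the HiFi-GAN or NeMo implementation. ``split_by_period`` and ``average_pool`` are hypothetical helper names, zero padding and non-overlapping pooling windows are assumptions made for brevity, and real implementations may pad and pool differently.

.. code-block:: python

    def split_by_period(audio, period):
        """Group a 1-D waveform into `period` sequences of equally spaced samples."""
        # Pad so the length is a multiple of the period (zero padding is an assumption).
        pad = (period - len(audio) % period) % period
        padded = list(audio) + [0.0] * pad
        # Sequence k holds samples k, k + period, k + 2 * period, ...
        return [padded[offset::period] for offset in range(period)]


    def average_pool(audio, factor):
        """Downsample a waveform by averaging non-overlapping windows of `factor` samples."""
        return [
            sum(audio[i:i + factor]) / factor
            for i in range(0, len(audio) - factor + 1, factor)
        ]


    # Hypothetical toy waveform.
    audio = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

    # MPD view: each sub-discriminator would see samples spaced `period` apart.
    period_views = split_by_period(audio, period=3)

    # MSD view: raw audio plus x2 and x4 average-pooled versions.
    scales = [audio, average_pool(audio, 2), average_pool(audio, 4)]
    print(period_views)
    print(scales)

The two views are complementary: the period split exposes structure between samples that are far apart, while the pooled scales keep consecutive samples together at progressively coarser resolution.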



Codecs
------

Audio Codec
~~~~~~~~~~~

The NeMo Audio Codec model is a non-autoregressive convolutional encoder-quantizer-decoder model for coding or tokenization of raw audio signals or mel-spectrogram features.
The NeMo Audio Codec model supports residual vector quantizer (RVQ) :cite:`tts-models-zeghidour2022soundstream` and finite scalar quantizer (FSQ) :cite:`tts-models-mentzer2023finite` for quantization of the encoder output.
This model is trained end-to-end using generative loss, discriminative loss, and reconstruction loss, similar to other neural audio codecs such as SoundStream :cite:`tts-models-zeghidour2022soundstream` and EnCodec :cite:`tts-models-defossez2022encodec`.
For further information, refer to the ``Audio Codec Training`` tutorial in the TTS tutorial section.

    .. image:: images/audiocodec_model.png
        :align: center
        :alt: audiocodec model
        :scale: 35%
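
The two quantizer types mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the NeMo Audio Codec code. The function names ``rvq_encode``, ``rvq_decode``, and ``fsq_quantize`` are hypothetical, the codebooks are tiny hand-written examples rather than learned ones, and the FSQ sketch handles only odd numbers of levels with a ``tanh`` bound.

.. code-block:: python

    import math


    def nearest_index(codebook, vector):
        """Return the index of the codebook entry closest to `vector` (squared L2)."""
        def sq_dist(entry):
            return sum((e - v) ** 2 for e, v in zip(entry, vector))
        return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


    def rvq_encode(vector, codebooks):
        """Residual VQ: each stage quantizes what the previous stages left over."""
        residual = list(vector)
        indices = []
        for codebook in codebooks:
            idx = nearest_index(codebook, residual)
            indices.append(idx)
            residual = [r - c for r, c in zip(residual, codebook[idx])]
        return indices


    def rvq_decode(indices, codebooks):
        """Reconstruct the vector as the sum of the selected codebook entries."""
        out = [0.0] * len(codebooks[0][0])
        for idx, codebook in zip(indices, codebooks):
            out = [o + c for o, c in zip(out, codebook[idx])]
        return out


    def fsq_quantize(vector, levels):
        """Finite scalar quantization: bound each dimension, then round to an integer level.

        Assumes every entry of `levels` is odd, so the levels are symmetric around zero.
        """
        codes = []
        for x, n_levels in zip(vector, levels):
            half = (n_levels - 1) / 2
            codes.append(round(math.tanh(x) * half))
        return codes


    # Hypothetical two-stage codebooks: a coarse one followed by a fine one.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    indices = rvq_encode([1.2, 0.8], codebooks)
    print(indices, rvq_decode(indices, codebooks))
    print(fsq_quantize([0.0, 5.0, -5.0], levels=[5, 5, 5]))

In the RVQ sketch, each additional codebook refines the approximation left by the earlier ones, while the FSQ sketch needs no codebook lookup because each dimension is rounded independently.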


References
----------

.. bibliography:: tts_all.bib
    :style: plain
    :labelprefix: TTS-MODELS
    :keyprefix: tts-models-
