I want to use a pre-trained VCTK multi-speaker model as a base for a fine-tuned single-speaker model #2606
-
Hello community, I am new to the wonderful world of TTS. My case is as follows:
My initial idea was to use a model trained on the LJSpeech dataset, but my target speaker is male, and I don't know whether a female-voiced model can be fine-tuned into a male voice. I followed the tutorial in the blog and it went well, but unfortunately my results were not that good.

My second idea was to use a model trained on the VCTK dataset, which has multiple speakers. I synthesized test sentences for all the speakers and chose the speaker whose voice is closest to my target speaker's voice. Unfortunately, I couldn't wrap my head around how to load a multi-speaker model, keep only one speaker, and then fine-tune it on the new data. The final model should contain only one speaker.

So my question is: how do I, if it is possible at all, convert a multi-speaker model into a single-speaker model, keeping only one speaker? Or alternatively, how do I use a multi-speaker model as a base for fine-tuning a final single-speaker model? The first approach is also fine, but I need confirmation that a female-voiced model can be fine-tuned into a male voice. All help is greatly appreciated, thanks!
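The "keep only one speaker" idea can at least be sketched against the speaker-embedding file that Coqui's multi-speaker recipes use. This is a minimal sketch, not confirmed against Coqui internals: it assumes `speakers.pth` is a torch-saved dict mapping utterance IDs to entries carrying a speaker `name` and an embedding; the file name, key names, and structure here are assumptions you should verify against your own file.

```python
# Hedged sketch: filter a multi-speaker embedding mapping down to one speaker.
# Assumption (verify against your own speakers.pth): the file is a dict of
# {utterance_id: {"name": speaker_id, "embedding": [...]}}.

def keep_one_speaker(mapping, speaker_id):
    """Return a copy of `mapping` containing only entries for `speaker_id`."""
    return {
        utt: entry
        for utt, entry in mapping.items()
        if entry.get("name") == speaker_id
    }

# Against the real file, usage would look roughly like:
#   import torch
#   speakers = torch.load("speakers.pth")
#   torch.save(keep_one_speaker(speakers, "p225"), "speakers_single.pth")

# Tiny illustration with a fake mapping ("p225"/"p226" are VCTK speaker IDs):
fake = {
    "p225_001": {"name": "p225", "embedding": [0.1, 0.2]},
    "p226_001": {"name": "p226", "embedding": [0.3, 0.4]},
}
single = keep_one_speaker(fake, "p225")
print(sorted(single))  # only the p225 utterance remains
```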
-
I still don't know if the model is multi-speaker or how to make it single-speaker... but after 20k steps the results are already pretty amazing. This is really cool.
I ended up using `recipes/vctk/yourtts/train_yourtts.py` with the following modifications:

- `../../dataset/speakers.pth` for all my voice samples
- `text_cleaner` to `"english_cleaners"`, since the custom dataset is in English
- `use_weighted_sampler = False`
- `~/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth`
- `batch_size = 20`, because that's the maximum for my hardware

This will train a Vits model that has only one speaker. You can synthesize voice like this:

```
tts --text "Hello, this is the voice I generated. Pretty cool, huh?" --model_path=best_model.pth --config_path=config.json --…
```
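The config changes listed above can be collected in one place for reference. This is a sketch only: the field names mirror Coqui's `VitsConfig`, but whether the recipe exposes each one exactly this way is an assumption to check against `train_yourtts.py` itself.

```python
# Hedged sketch: the fine-tuning overrides from the list above, as plain values.
# Field names follow Coqui's VitsConfig; verify them against the recipe script.
overrides = {
    "text_cleaner": "english_cleaners",  # the custom dataset is in English
    "use_weighted_sampler": False,       # no speaker balancing for one speaker
    "batch_size": 20,                    # hardware maximum in this setup
}

for key, value in overrides.items():
    print(f"{key} = {value!r}")
```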