I want to use a pre-trained VCTK multi-speaker model as a base for a fine-tuned single-speaker model #2606
-
Hello community, I am new to the wonderful world of TTS. My case is as follows:
My initial idea was to use a model trained on the LJSpeech dataset, but my target speaker is male, and I don't know whether a female-voiced model can be fine-tuned into a male voice. I followed the tutorial in the blog and it went well, but unfortunately my results were not that good.

My second idea was to use a model trained on the VCTK dataset, which has multiple speakers. I synthesized test sentences for all the speakers and chose the speaker whose voice is closest to my target speaker's voice. Unfortunately, I couldn't wrap my head around how to load a multi-speaker model, keep only one speaker, and then fine-tune it on the new data. The final model should contain only one speaker.

So my question is: how do I, if it is possible at all, convert a multi-speaker model into a single-speaker model, keeping only one speaker? Or alternatively, how do I use a multi-speaker model as a base for fine-tuning a final single-speaker model? The first approach is also fine, but I need confirmation that a female-voiced model can be fine-tuned into a male voice. All help is greatly appreciated, thanks!
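The "keep only one speaker" idea can at least be sketched against the speaker-embedding file that Coqui's multi-speaker recipes use. This is a minimal sketch, not confirmed against Coqui internals: it assumes `speakers.pth` is a torch-saved dict mapping utterance IDs to entries carrying a speaker `name` and an embedding; the file name, key names, and structure here are assumptions you should verify against your own file.

```python
# Hedged sketch: filter a multi-speaker embedding mapping down to one speaker.
# Assumption (verify against your own speakers.pth): the file is a dict of
# {utterance_id: {"name": speaker_id, "embedding": [...]}}.

def keep_one_speaker(mapping, speaker_id):
    """Return a copy of `mapping` containing only entries for `speaker_id`."""
    return {
        utt: entry
        for utt, entry in mapping.items()
        if entry.get("name") == speaker_id
    }

# Against the real file, usage would look roughly like:
#   import torch
#   speakers = torch.load("speakers.pth")
#   torch.save(keep_one_speaker(speakers, "p225"), "speakers_single.pth")

# Tiny illustration with a fake mapping ("p225"/"p226" are VCTK speaker IDs):
fake = {
    "p225_001": {"name": "p225", "embedding": [0.1, 0.2]},
    "p226_001": {"name": "p226", "embedding": [0.3, 0.4]},
}
single = keep_one_speaker(fake, "p225")
print(sorted(single))  # only the p225 utterance remains
```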
-
I still don't know if the model is multi-speaker or how to make it single-speaker... but after 20k steps the results are already pretty amazing. This is really cool.
I ended up using `recipes/vctk/yourtts/train_yourtts.py` with the following modifications:

- `../../dataset/speakers.pth` for all my voice samples
- `text_cleaner` to `"english_cleaners"`, since the custom dataset is in English
- `use_weighted_sampler = False`
- `~/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth`
- `batch_size = 20`, because that's the maximum for my hardware

This will train a Vits model that has only one speaker. You can synthesize voice like this:

```
tts --text "Hello, this is the voice I generated. Pretty cool, huh?" --model_path=best_model.pth --config_path=config.json --…
```
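The config changes listed above can be collected in one place for reference. This is a sketch only: the field names mirror Coqui's `VitsConfig`, but whether the recipe exposes each one exactly this way is an assumption to check against `train_yourtts.py` itself.

```python
# Hedged sketch: the fine-tuning overrides from the list above, as plain values.
# Field names follow Coqui's VitsConfig; verify them against the recipe script.
overrides = {
    "text_cleaner": "english_cleaners",  # the custom dataset is in English
    "use_weighted_sampler": False,       # no speaker balancing for one speaker
    "batch_size": 20,                    # hardware maximum in this setup
}

for key, value in overrides.items():
    print(f"{key} = {value!r}")
```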