Commit b5aad6a2 authored by Jean-Marc Valin

Better model building instructions

parent 55513e81
datasets.txt:
-The following datasets can be used to train a language-independent LPCNet model.
-A good choice is to include all the data from these datasets, except for
-hi_fi_tts for which only a small subset is recommended (since it's very large
-but has few speakers). Note that this data typically needs to be resampled
-before it can be used.
+The following datasets can be used to train a language-independent FARGAN model
+and a Deep REDundancy (DRED) model. Note that this data typically needs to be
+resampled before it can be used.
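As an illustration of that resampling step (a sketch using sox, not part of datasets.txt; process_speech.sh further below automates it in bulk):
```
# Resample a single file to 16 kHz mono.
sox input.wav -r 16000 -c 1 input_16k.wav
```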
https://www.openslr.org/resources/30/si_lk.tar.gz
https://www.openslr.org/resources/32/af_za.tar.gz
@@ -61,7 +59,6 @@ https://www.openslr.org/resources/83/welsh_english_female.zip
https://www.openslr.org/resources/83/welsh_english_male.zip
https://www.openslr.org/resources/86/yo_ng_female.zip
https://www.openslr.org/resources/86/yo_ng_male.zip
-https://www.openslr.org/resources/109/hi_fi_tts_v0.tar.gz
The corresponding citations for all these datasets are:
@@ -164,10 +161,3 @@ The corresponding citations for all these datasets are:
doi = {10.21437/Interspeech.2020-1096},
url = {http://dx.doi.org/10.21437/Interspeech.2020-1096},
}
-@article{bakhturina2021hi,
-  title={{Hi-Fi Multi-Speaker English TTS Dataset}},
-  author={Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
-  journal={arXiv preprint arXiv:2104.01497},
-  year={2021}
-}
README:
@@ -8,24 +8,31 @@ skip straight to the Inference section.
## Data preparation
First, fetch all the data from the datasets.txt file using:
```
./download_datasets.sh
```
Then concatenate and resample the data into a single 16-kHz file:
```
./process_speech.sh
```
The script will produce a speech file called all_speech.pcm in raw 16-bit PCM format.
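A quick sanity check on the concatenated file (a sketch; note that process_speech.sh as shown below actually redirects its output to ../all_speech.sw, which is the same raw format under a different name, so adjust the filename to match):
```
# Interpret the raw output as signed 16-bit, 16 kHz mono and print
# duration and level statistics without writing a new file.
sox -t sw -r 16000 -c 1 all_speech.pcm -n stat
```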
For data preparation you need to build Opus as detailed in the top-level README.
You will need to use the --enable-dred configure option.
The build will produce an executable named "dump_data".
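For a fresh checkout, the usual autotools sequence is sketched below; the top-level README remains the authoritative reference:
```
# Build Opus with DRED enabled; this also builds the dump_data tool.
./autogen.sh
./configure --enable-dred
make -j4
```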
To prepare the training data, run:
```
-./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
+./dump_data -train all_speech.pcm all_features.f32 /dev/null
```
-Where the in_speech.pcm speech file is a raw 16-bit PCM file sampled at 16 kHz.
-The speech data used for training the model can be found at:
-https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
+The out_speech.pcm file isn't needed for DRED, but it is needed to train
+the FARGAN vocoder (see dnn/torch/fargan/ for details).
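The same extraction works for any recording, as in this sketch (file names are placeholders; keep a real third argument instead of /dev/null if you also want the speech output for FARGAN training):
```
# Convert an arbitrary input to raw 16-bit, 16 kHz mono, then extract features.
sox my_recording.wav -t sw -r 16000 -c 1 my_recording.pcm
./dump_data -train my_recording.pcm my_features.f32 my_speech.pcm
```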
## Training
To perform training, run the following command:
```
-python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
+python ./train_rdovae.py --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 all_features.f32 output_dir
```
The final model will be in output_dir/checkpoints/checkpoint_400.pth.
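To confirm that the run produced a loadable checkpoint, a one-liner like the following can help (an assumption-laden sketch: it presumes the .pth file is a regular torch pickle):
```
# Print the top-level keys stored in the final checkpoint.
python -c "import torch; ck = torch.load('output_dir/checkpoints/checkpoint_400.pth', map_location='cpu'); print(list(ck) if isinstance(ck, dict) else type(ck))"
```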
download_datasets.sh:
#!/bin/sh
# Fetch every dataset URL listed in datasets.txt into ./datasets
mkdir datasets
cd datasets
for i in `grep https ../../../datasets.txt`
do
wget $i
done
process_speech.sh:
#!/bin/sh
# Concatenate and resample every extracted wav file into a single raw
# 16-bit, 16 kHz, mono stream (the unzip step is left commented out).
cd datasets
#parallel -j +2 'unzip -n {}' ::: *.zip
find . -name "*.wav" | parallel -k -j 20 'sox --no-dither {} -t sw -r 16000 -c 1 -' > ../all_speech.sw
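Note that process_speech.sh leaves archive extraction commented out, so the downloaded .zip and .tar.gz files need to be unpacked before the find/sox pass can see any .wav files. A hypothetical helper (not part of the commit):
```
#!/bin/sh
# Extract every downloaded archive in place inside ./datasets.
cd datasets
for f in *.zip; do unzip -n "$f"; done
for f in *.tar.gz; do tar xzf "$f"; done
```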