Trying to join LLMs and speech models in a multi-task way. Inspired by GPT-4o and the details in the blog post from LAION.
I think most of the following need to be ticked off on the route to a full multi-modal model:
- Text-to-speech direction works in principle, although quality is not very high at the moment and there are some issues that will hopefully get resolved with some iteration or scaling; see the `examples` directory for what the current model can (and can't) do.
- Warm starting text-to-speech training from a pretrained text LLM (it will have to be a small one to fit on my 4090 dev box)
- Text-to-speech finetuning on more expressive and higher quality datasets (Expresso?)
- Speech-to-text direction
- Join text-to-speech and speech-to-text in a single multi-task model
- Interleaving like SpiRit-LM
There will probably be some more things to add along the way.
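For the interleaving step, here is a minimal sketch of what SpiRit-LM-style interleaving might look like: spans of text tokens and speech tokens alternate in one training sequence, each introduced by a modality marker. The marker names, token ids, and span contents below are made up for illustration, not taken from SpiRit-LM or this repo.

```python
# Sketch of SpiRit-LM-style interleaving: alternate spans of text and speech
# tokens in one flat sequence, marking each span with a modality token.
# The "<text>"/"<speech>" markers and all token values are illustrative.
TEXT, SPEECH = "<text>", "<speech>"

def interleave(spans):
    """spans: list of (modality, tokens) pairs -> one flat training sequence."""
    seq = []
    for modality, tokens in spans:
        seq.append(modality)   # modality marker tells the model what follows
        seq.extend(tokens)
    return seq

seq = interleave([
    (TEXT, ["the", "cat"]),
    (SPEECH, [101, 7, 52]),    # audio codec token ids (made up)
    (TEXT, ["sat", "down"]),
])
print(seq)  # ['<text>', 'the', 'cat', '<speech>', 101, 7, 52, '<text>', 'sat', 'down']
```

The model is then trained with the usual next-token objective over the mixed sequence, so it learns to continue in either modality.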
- First training run on the MLS English dataset with a small model (slightly cherry-picked example):

temperature.1-top_k.64-1717255980.mp4
- Quick and dirty finetuning from the MLS English model on the Expresso dataset for 2000 steps (slightly cherry-picked example):
temperature.1-top_k.64-1717255877.mp4
```shell
python3.11 -m venv env
source env/bin/activate
python -m pip install -e .[dev]
```
Currently this is hardcoded for text-to-speech only.
```shell
python scripts/preprocess-mls-eng.py  # downloads and processes the MLS English dataset from HuggingFace
# wget -nc https://dl.fbaipublicfiles.com/textless_nlp/expresso/data/expresso.tar
# tar -xf expresso.tar
```
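SNAC produces hierarchical codes at several temporal resolutions, so they have to be flattened into a single token stream before an LLM can model them autoregressively. Below is a rough sketch of one common flattening for a 3-level codec with a 1:2:4 rate ratio between levels (each coarse frame carries 1 + 2 + 4 = 7 tokens). The ordering here is just an illustration of the idea, not necessarily what the preprocessing script actually emits.

```python
def flatten_snac_codes(l1, l2, l3):
    """Flatten 3 hierarchical codebook levels (rates 1:2:4) into one stream,
    frame by frame: 1 coarse + 2 medium + 4 fine tokens per coarse frame."""
    assert len(l2) == 2 * len(l1) and len(l3) == 4 * len(l1)
    flat = []
    for i in range(len(l1)):
        flat.append(l1[i])                    # coarse token for this frame
        flat.extend(l2[2 * i : 2 * i + 2])    # the 2 medium tokens under it
        flat.extend(l3[4 * i : 4 * i + 4])    # the 4 fine tokens under it
    return flat

# Two coarse frames -> 2 * 7 = 14 tokens in the flat stream.
codes = flatten_snac_codes([0, 1], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27])
print(len(codes))  # 14
```

Decoding inverts this grouping to recover the per-level code tensors before handing them back to the codec.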
Train a small ~200M model on the MLS English dataset:
```shell
python -m llmspeech.train configs/small.yaml
```
A simple demo for TTS:
```shell
python app.py
```
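The sample filenames above encode the sampling settings (temperature 1, top-k 64). For reference, here is a generic sketch of temperature plus top-k sampling in plain Python; it is not the repo's actual implementation, which would operate on torch logits.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0):
    """Keep the k highest logits, softmax at the given temperature, sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                              # subtract max for numerical stability
    probs = [math.exp(x - m) for x in scaled]
    r, acc = random.random() * sum(probs), 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]

logits = [0.1, 3.2, -1.0, 2.5]
token = sample_top_k(logits, k=2, temperature=1.0)  # always index 1 or 3
```

Lower temperatures sharpen the distribution toward the argmax; smaller k restricts sampling to fewer candidates.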
- Add kv-caching 😬
- Streaming inference with SNAC decoder
- Add DDP - just training on single 4090s currently
- `torch.compile` not completely working (CUDA Graphs aren't being used for some reason); I need to take a deeper look
- Add prompting to try and force a specific speaker
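On the kv-caching item: the idea is to store each layer's keys and values from previous decoding steps, so generating the next token only attends the new query against the cache instead of recomputing the whole prefix. A toy single-head sketch in plain Python (shapes and API made up for illustration; a real version would live inside the model's attention module and use torch tensors):

```python
import math

class KVCache:
    """Toy single-head KV cache: append one step's key/value vectors,
    then attend the new query against everything cached so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        # Scaled dot-product scores of the new query against all cached keys.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # Weighted sum of cached values -> attention output for this step.
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]

cache = KVCache()
out1 = cache.attend([1.0, 0.0], [1.0, 0.0], [5.0, 0.0])  # cache holds 1 step
out2 = cache.attend([1.0, 0.0], [0.0, 1.0], [0.0, 7.0])  # cache holds 2 steps
print(len(cache.keys))  # 2
```

This turns per-token generation cost from quadratic in the prefix length to linear, at the price of memory for the cached keys and values.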
- Andrej Karpathy for nanoGPT amongst all the other great things he's done
- Hubert Siuzdak for the awesome SNAC codec
- Julien Blanchon for creating the snac_llm_parler_tts dataset and saving me a lot of compute time (and effort)