Trying to join LLMs and speech models in a multi-task way. Inspired by GPT-4o and the details in the blog post from LAION.
I think most of the following need to be ticked off on the route to a full multi-modal model:
- Text-to-speech direction works in principle, although quality is not very high at the moment and there are some issues that will hopefully get resolved with some iteration or scaling; see the `examples` directory for what the current model can (and can't) do.
- Warm starting text-to-speech training from a pretrained text LLM (it will have to be a small one to fit on my 4090 dev box)
- Text-to-speech finetuning on more expressive and higher quality datasets (Expresso?)
- Speech-to-text direction
- Join text-to-speech and speech-to-text in a single multi-task model
- Interleaving like SpiRit-LM
There will probably be some more things to add along the way.
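For the interleaving step, here is a minimal sketch of what SpiRit-LM-style interleaving might look like: spans of text tokens and speech tokens alternate in one training sequence, each introduced by a modality marker. The marker names, token ids, and span contents below are made up for illustration, not taken from SpiRit-LM or this repo.

```python
# Sketch of SpiRit-LM-style interleaving: alternate spans of text and speech
# tokens in one flat sequence, marking each span with a modality token.
# The "<text>"/"<speech>" markers and all token values are illustrative.
TEXT, SPEECH = "<text>", "<speech>"

def interleave(spans):
    """spans: list of (modality, tokens) pairs -> one flat training sequence."""
    seq = []
    for modality, tokens in spans:
        seq.append(modality)   # modality marker tells the model what follows
        seq.extend(tokens)
    return seq

seq = interleave([
    (TEXT, ["the", "cat"]),
    (SPEECH, [101, 7, 52]),    # audio codec token ids (made up)
    (TEXT, ["sat", "down"]),
])
print(seq)  # ['<text>', 'the', 'cat', '<speech>', 101, 7, 52, '<text>', 'sat', 'down']
```

The model is then trained with the usual next-token objective over the mixed sequence, so it learns to continue in either modality.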
- First training run on the MLS English dataset with a small model (slightly cherry-picked example):

temperature.1-top_k.64-1717255980.mp4
- Quick and dirty finetuning from the MLS English model on the Expresso dataset for 2000 steps (slightly cherry-picked example):
temperature.1-top_k.64-1717255877.mp4
```shell
python3.11 -m venv env
source env/bin/activate
python -m pip install -e .[dev]
```
Currently this is hardcoded for text-to-speech only.
```shell
python scripts/preprocess-mls-eng.py  # downloads and processes the MLS English dataset from HuggingFace
# wget -nc https://dl.fbaipublicfiles.com/textless_nlp/expresso/data/expresso.tar
# tar -xf expresso.tar
```
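SNAC produces hierarchical codes at several temporal resolutions, so they have to be flattened into a single token stream before an LLM can model them autoregressively. Below is a rough sketch of one common flattening for a 3-level codec with a 1:2:4 rate ratio between levels (each coarse frame carries 1 + 2 + 4 = 7 tokens). The ordering here is just an illustration of the idea, not necessarily what the preprocessing script actually emits.

```python
def flatten_snac_codes(l1, l2, l3):
    """Flatten 3 hierarchical codebook levels (rates 1:2:4) into one stream,
    frame by frame: 1 coarse + 2 medium + 4 fine tokens per coarse frame."""
    assert len(l2) == 2 * len(l1) and len(l3) == 4 * len(l1)
    flat = []
    for i in range(len(l1)):
        flat.append(l1[i])                    # coarse token for this frame
        flat.extend(l2[2 * i : 2 * i + 2])    # the 2 medium tokens under it
        flat.extend(l3[4 * i : 4 * i + 4])    # the 4 fine tokens under it
    return flat

# Two coarse frames -> 2 * 7 = 14 tokens in the flat stream.
codes = flatten_snac_codes([0, 1], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27])
print(len(codes))  # 14
```

Decoding inverts this grouping to recover the per-level code tensors before handing them back to the codec.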
Train a small ~200M model on the MLS English dataset:
```shell
python -m llmspeech.train configs/small.yaml
```
A simple demo for TTS:
```shell
python app.py
```
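The sample filenames above encode the sampling settings (temperature 1, top-k 64). For reference, here is a generic sketch of temperature plus top-k sampling in plain Python; it is not the repo's actual implementation, which would operate on torch logits.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0):
    """Keep the k highest logits, softmax at the given temperature, sample one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                              # subtract max for numerical stability
    probs = [math.exp(x - m) for x in scaled]
    r, acc = random.random() * sum(probs), 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]

logits = [0.1, 3.2, -1.0, 2.5]
token = sample_top_k(logits, k=2, temperature=1.0)  # always index 1 or 3
```

Lower temperatures sharpen the distribution toward the argmax; smaller k restricts sampling to fewer candidates.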
- Add kv-caching 😬
- Streaming inference with SNAC decoder
- Add DDP - just training on single 4090s currently
- `torch.compile` not completely working (CUDA Graphs aren't being used for some reason); I need to take a deeper look
- Add prompting to try and force a specific speaker
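On the kv-caching item: the idea is to store each layer's keys and values from previous decoding steps, so generating the next token only attends the new query against the cache instead of recomputing the whole prefix. A toy single-head sketch in plain Python (shapes and API made up for illustration; a real version would live inside the model's attention module and use torch tensors):

```python
import math

class KVCache:
    """Toy single-head KV cache: append one step's key/value vectors,
    then attend the new query against everything cached so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        # Scaled dot-product scores of the new query against all cached keys.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # Weighted sum of cached values -> attention output for this step.
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]

cache = KVCache()
out1 = cache.attend([1.0, 0.0], [1.0, 0.0], [5.0, 0.0])  # cache holds 1 step
out2 = cache.attend([1.0, 0.0], [0.0, 1.0], [0.0, 7.0])  # cache holds 2 steps
print(len(cache.keys))  # 2
```

This turns per-token generation cost from quadratic in the prefix length to linear, at the price of memory for the cached keys and values.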
- Andrej Karpathy for nanoGPT amongst all the other great things he's done
- Hubert Siuzdak for the awesome SNAC codec
- Julien Blanchon for creating the snac_llm_parler_tts dataset and saving me a lot of compute time (and effort)