Trying to build an all in one speech-text language model - a bit like GPT-4o

jamesparsloe/llm.speech

llm.speech

Trying to join LLMs and speech models in a multi-task way. Inspired by GPT-4o and the details in the blog post from LAION.

Progress

I think most of the following need to be ticked off on the route to a full multi-modal model:

  • Text-to-speech direction works in principle, although the quality is not very high yet and there are some issues that will hopefully be resolved with some iteration or scaling. See the examples directory for what the current model can (and can't) do.
  • Warm starting text-to-speech training from a pretrained text LLM (will have to be a small one to fit on my 4090 dev box)
  • Text-to-speech finetuning on more expressive and higher quality datasets (Expresso?)
  • Speech-to-text direction
  • Join text-to-speech and speech-to-text in a multi-task setup
  • Interleaving like SpiRit-LM

There will probably be some more things to add along the way.
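The interleaving item in the list above can be pictured with a small sketch. This is not code from this repository: the `[TEXT]` and `[SPEECH]` markers, the `interleave` function, and the per-word alignment of speech tokens are all assumptions made for illustration. The idea is to emit a single token stream that switches modality at word boundaries.

```python
import random

# Hypothetical modality markers; the actual special tokens are an assumption.
TEXT, SPEECH = "[TEXT]", "[SPEECH]"


def interleave(words, speech_units, p_switch=0.3, seed=0):
    """Build one token stream that switches modality at word boundaries.

    words:        list of text words
    speech_units: per-word lists of speech tokens, aligned with `words`
    p_switch:     probability of switching modality before each word
    """
    assert len(words) == len(speech_units)
    rng = random.Random(seed)
    modality = rng.choice([TEXT, SPEECH])
    out = [modality]
    for i, (word, units) in enumerate(zip(words, speech_units)):
        # Possibly switch modality, but only at a word boundary.
        if i > 0 and rng.random() < p_switch:
            modality = SPEECH if modality == TEXT else TEXT
            out.append(modality)
        if modality == TEXT:
            out.append(word)
        else:
            out.extend(units)
    return out


if __name__ == "__main__":
    words = ["hello", "there", "world"]
    units = [["<s1>", "<s2>"], ["<s3>"], ["<s4>", "<s5>"]]
    print(interleave(words, units, p_switch=0.5, seed=1))
```

Each word appears exactly once, either as text or as its speech tokens, so the model sees both modalities in one sequence.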

2024-05-31

  • First training run on the MLS Eng dataset with a small model (slightly cherry-picked example)
temperature.1-top_k.64-1717255980.mp4

2024-06-01

  • Quick and dirty finetuning from the MLS Eng model on the Expresso dataset for 2000 steps (slightly cherry picked)
temperature.1-top_k.64-1717255877.mp4
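The sample file names above suggest the clips were generated with a temperature of 1 and a top-k of 64. As a rough illustration of what those two settings do, here is a minimal sketch of temperature plus top-k sampling over a list of logits in plain Python. The `sample_top_k` function is hypothetical and is not the sampler used in this repository.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=64, rng=random):
    """Sample one token id from `logits` using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the k highest-scoring token ids.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = [logits[i] / temperature for i in top_ids]
    # Unnormalised softmax weights, subtracting the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0, 1.5]
    print(sample_top_k(logits, temperature=1.0, top_k=2, rng=random.Random(0)))
```

With `top_k=1` this reduces to greedy decoding; larger values trade determinism for variety.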

Getting Started

python3.11 -m venv env
source env/bin/activate
python -m pip install -e ".[dev]"

Training

Training is currently hardcoded for text-to-speech.

Datasets

python scripts/preprocess-mls-eng.py # downloads and preprocesses the MLS English dataset from HuggingFace
# wget -nc https://dl.fbaipublicfiles.com/textless_nlp/expresso/data/expresso.tar
# tar -xf expresso.tar

Train a small ~200M model on the MLS English dataset:

python -m llmspeech.train configs/small.yaml

Demo

A simple demo for TTS:

python app.py

TODO

  • Add kv-caching 😬
  • Streaming inference with SNAC decoder
  • Add DDP - just training on single 4090s currently
  • torch.compile is not completely working (CUDA Graphs aren't being used for some reason); I need to take a deeper look
  • Add prompting to try and force a specific speaker
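For the kv-caching item in the list above, the idea can be sketched in plain Python. This is a toy single-head example, not this repository's model code: `attend` and `KVCache` are hypothetical names, and in a real transformer the key and value would come from projecting the new token's hidden state. The cache stores keys and values from earlier positions, so each decoding step appends one entry instead of recomputing the whole prefix.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over the scores, subtracting the max for numerical stability.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class KVCache:
    """Keeps keys and values of past positions so each step only adds one."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new position, then attend over everything cached so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


if __name__ == "__main__":
    cache = KVCache()
    print(cache.step([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]))
    print(cache.step([0.0, 1.0], [0.2, 0.9], [3.0, 4.0]))
```

The output at each step matches full attention over the prefix; only the per-step work changes.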

Acknowledgements
