Welcome to the LLM-API project! This endeavor opens the door to the exciting world of Large Language Models (LLMs) by offering a versatile API that allows you to effortlessly run a variety of LLMs on different consumer hardware configurations. Whether you prefer to operate these powerful models within Docker containers or directly on your local machine, this solution is designed to adapt to your preferences.
With LLM-API, all you need to get started is a simple YAML configuration file. The app automatically downloads the model of your choice and runs it. Once started, the model is accessible through a unified API.
There is also a client that's reminiscent of OpenAI's approach, making it easy to harness the capabilities of your chosen LLM. You can find the Python client at llm-api-python.
In addition, a LangChain integration exists, further expanding the possible applications of LLM-API. You can explore this integration at langchain-llm-api.
Whether you're a developer, researcher, or enthusiast, the LLM-API project simplifies the use of Large Language Models, making their power and potential accessible to all.
LLM enthusiasts, developers, researchers, and creators are invited to join this growing community. Your contributions, ideas, and feedback are invaluable in shaping the future of LLM-API. Whether you want to collaborate on improving the core functionality, develop new integrations, or suggest enhancements, your expertise is highly appreciated.
- Llama-based models in different versions (Llama, Alpaca, Vicuna, Llama 2) on CPU using llama.cpp
- Llama & Llama 2 quantized models using GPTQ-for-LLaMa
- Generic Hugging Face pipeline, e.g. GPT-2, MPT
- Mistral 7b
- Several quantized models using AWQ
- OpenAI-like interface using llm-api-python
- RWKV-LM support
To run LLM-API on a local machine, you must have a functioning Docker engine. The following steps outline the process for running LLM-API:
- Create a Configuration File: Begin by creating a config.yaml file with the configurations as described below (use the examples in config.yaml.example).
models_dir: /models # dir inside the container
model_family: llama # also `gptq_llama` or `huggingface`
setup_params:
  key: value
model_params:
  key: value
setup_params and model_params are model-specific; see below for model-specific configs.
You can override any of the above-mentioned configs using environment variables prefixed with LLM_API_, for example: LLM_API_MODELS_DIR=/models
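As a rough illustration of the override rule, the sketch below merges LLM_API_-prefixed variables into a config dictionary. The function name `apply_env_overrides` and the lowercase key mapping are assumptions made for this example; how LLM-API itself resolves overrides (for instance for nested keys) is not described here.

```python
import os

ENV_PREFIX = "LLM_API_"


def apply_env_overrides(config, environ=None):
    """Return a copy of `config` with top-level keys overridden by LLM_API_* variables.

    Hypothetical helper: the prefix is stripped and the rest is lowercased,
    so LLM_API_MODELS_DIR maps to the `models_dir` key.
    """
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for name, value in environ.items():
        if name.startswith(ENV_PREFIX):
            key = name[len(ENV_PREFIX):].lower()
            merged[key] = value
    return merged


config = {"models_dir": "/tmp/models", "model_family": "llama"}
# A fake environment is passed in so the example does not depend on the real one.
overridden = apply_env_overrides(
    config, {"LLM_API_MODELS_DIR": "/models", "HOME": "/root"}
)
print(overridden["models_dir"])
```

Variables without the prefix are ignored, and the original dictionary is left unchanged.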
- Run LLM-API Using Docker: Execute the following command in your terminal:
docker run -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 --ulimit memlock=16000000000 1b5d/llm-api
This command launches a Docker container, mounts your local models directory and the configuration file, and maps port 8000 for API access.
Alternatively, you can use the provided docker-compose.yaml file within this repository and run the application using Docker Compose. To do so, execute the following command:
docker compose up
Upon the first run, LLM-API will download the model from Hugging Face, based on the configurations defined in the setup_params of your config.yaml file. It will then name the local model file accordingly. Subsequent runs will reference the same local model file and load it into memory.
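The download-once behavior can be pictured with a minimal sketch like the one below. It is not LLM-API's actual code: `ensure_model` and the `download` callback are hypothetical names, and the real local naming scheme is not specified in this document.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(models_dir: str, filename: str, download: Callable[[Path], None]) -> Path:
    """Return the local model path, invoking `download` only if the file is missing."""
    target = Path(models_dir) / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)  # hypothetical callback that writes the model file
    return target


# Demonstration with a stand-in downloader that records how often it runs.
calls = []


def fake_download(path: Path) -> None:
    calls.append(path)
    path.write_bytes(b"model weights placeholder")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_model(tmp, "ggml-model-q4_0.bin", fake_download)
    second = ensure_model(tmp, "ggml-model-q4_0.bin", fake_download)
    same_path = first == second

print(len(calls))
```

The second call finds the file already in place and skips the download, which mirrors the "subsequent runs" behavior described above.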
The LLM-API provides a standardized set of endpoints that are applicable across all Large Language Models (LLMs). These endpoints enable you to interact with the models effectively. Here are the primary endpoints:
- POST /generate
- Request Example:
{ "prompt": "What is the capital of France?", "params": { // Additional parameters... } }
- Description: Use this endpoint to generate text based on a given prompt. You can include additional parameters for fine-tuning and customization.
- POST /agenerate
- Request Example:
{ "prompt": "What is the capital of France?", "params": { // Additional parameters... } }
- Description: This endpoint is designed for asynchronous text generation. It allows you to initiate text generation tasks that can run in the background while your application continues to operate.
- POST /embeddings
- Request Example:
{ "text": "What is the capital of France?" }
- Description: Use this endpoint to obtain embeddings for a given text. This is valuable for various natural language processing tasks such as semantic similarity and text analysis.
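For readers who prefer Python over raw HTTP, here is a minimal standard-library sketch of calling these endpoints. It assumes the server listens on localhost:8000 (the port mapped in the Docker command) and that responses are JSON; the response schema is not documented here, so `call` simply returns the decoded body. The helper names are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: port mapped by `-p 8000:8000`


def build_request(endpoint, payload, base_url=BASE_URL):
    """Build a POST request with a JSON body for one of the LLM-API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, payload, base_url=BASE_URL, timeout=120):
    """Send the request and decode the response, assuming the server replies with JSON."""
    request = build_request(endpoint, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


generate_req = build_request(
    "generate", {"prompt": "What is the capital of France?", "params": {}}
)
embeddings_req = build_request(
    "embeddings", {"text": "What is the capital of France?"}
)

# With a running server, the requests could be sent like this:
# print(call("generate", {"prompt": "What is the capital of France?", "params": {}}))
```

Separating request construction from sending keeps the payload easy to inspect without a running container.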
Generally, models that can be loaded using transformers' AutoConfig, AutoModelForCausalLM and AutoTokenizer can run using the model_family: huggingface config. The following is an example (it runs one of the MPT models):
models_dir: /models
model_family: huggingface
setup_params:
  repo_id: <repo_id>
  tokenizer_repo_id: <repo_id>
  trust_remote_code: True
  config_params:
    init_device: cuda:0
    attn_config:
      attn_impl: triton
model_params:
  device_map: "cuda:0"
  trust_remote_code: True
  torch_dtype: torch.bfloat16
Leverage the flexibility of LLM-API by configuring various attributes using the following methods:
- Pass specific configuration attributes within config_params to fine-tune AutoConfig. These attributes allow you to tailor the behavior of your language model further.
- Modify the model's parameters directly by specifying them within model_params. These parameters are passed into the AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained initialization calls.
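The flow of these two parameter groups might look roughly like the sketch below. It is an illustration under assumptions, not LLM-API's implementation: the function name `load_huggingface` is hypothetical, and the three loader classes are passed in as arguments so the snippet runs without transformers installed.

```python
def load_huggingface(repo_id, tokenizer_repo_id, config_params, model_params,
                     auto_config, auto_model, auto_tokenizer):
    """Load a model and tokenizer, forwarding the two parameter groups.

    `auto_config`, `auto_model` and `auto_tokenizer` are expected to expose
    `from_pretrained`, as transformers' AutoConfig, AutoModelForCausalLM and
    AutoTokenizer do.
    """
    # config_params tailor the configuration object.
    config = auto_config.from_pretrained(repo_id, **config_params)
    # model_params go to both the model and the tokenizer initialization calls.
    model = auto_model.from_pretrained(repo_id, config=config, **model_params)
    tokenizer = auto_tokenizer.from_pretrained(tokenizer_repo_id, **model_params)
    return model, tokenizer


# With transformers installed, a call could look like this (downloads the model):
# from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
# model, tokenizer = load_huggingface(
#     "<repo_id>", "<repo_id>",
#     {"init_device": "cuda:0"},
#     {"device_map": "cuda:0", "trust_remote_code": True},
#     AutoConfig, AutoModelForCausalLM, AutoTokenizer,
# )
```

Values such as torch_dtype would still need to be converted from their YAML string form before being forwarded; that step is left out of this sketch.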
Here's an example of how you can use parameters in the generate (or agenerate) endpoints. You can pass any arguments accepted by transformers' GenerationConfig:
POST /generate
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "What is the capital of paris",
"params": {
"max_length": 25,
"max_new_tokens": 25,
"do_sample": true,
"top_k": 40,
"top_p": 0.95
}
}'
If you're looking to accelerate inference using a GPU, the 1b5d/llm-api:latest-gpu image is designed for this purpose. When running the Docker image using Compose, consider using the dedicated Compose file for GPU support:
docker compose -f docker-compose.gpu.yaml up
Note: currently only the linux/amd64 architecture is supported for GPU images.
Running Llama on a CPU is configured in a local config.yaml file. Below are the possible configurations:
models_dir: /models
model_family: llama
setup_params:
  repo_id: user/repo_id
  filename: ggml-model-q4_0.bin
model_params:
  n_ctx: 512
  n_parts: -1
  n_gpu_layers: 0
  seed: -1
  use_mmap: True
  n_threads: 8
  n_batch: 2048
  last_n_tokens_size: 64
  lora_base: null
  lora_path: null
  low_vram: False
  tensor_split: null
  rope_freq_base: 10000.0
  rope_freq_scale: 1.0
  verbose: True
Make sure to specify the repo_id and filename parameters so that they point to the Hugging Face repository where the desired model is hosted. The application will then handle the download for you.
Running in this mode can be done using the Docker image 1b5d/llm-api:latest. Several images are also available to support different BLAS backends:
- OpenBLAS: 1b5d/llm-api:latest-openblas
- cuBLAS: 1b5d/llm-api:latest-cublas
- CLBlast: 1b5d/llm-api:latest-clblast
- hipBLAS: 1b5d/llm-api:latest-hipblas
The following example demonstrates the various parameters that can be sent to the Llama generate and agenerate endpoints. Here suffix accepts null or a string, and logprobs accepts null or an integer:
POST /generate
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "What is the capital of paris",
"params": {
"suffix": null,
"max_tokens": 128,
"temperature": 0.8,
"top_p": 0.95,
"logprobs": null,
"echo": false,
"stop": ["\\Q"],
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"repeat_penalty": 1.1,
"top_k": 40
}
}'
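When building this request body from Python, it can help to let the json module produce the literals, since JSON requires lowercase true, false and null. The following sketch only serializes the payload; the parameter names are taken from the example above, and nothing here checks which of them the server accepts.

```python
import json

params = {
    "suffix": None,          # serialized as null
    "max_tokens": 128,
    "temperature": 0.8,
    "top_p": 0.95,
    "logprobs": None,        # serialized as null
    "echo": False,           # serialized as false
    "stop": ["\\Q"],         # a literal backslash followed by Q, mirroring the curl example
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.1,
    "top_k": 40,
}

body = json.dumps({"prompt": "What is the capital of paris", "params": params})
print(body)
```

The resulting string can be used as the --data argument of the curl command or as the body of any HTTP client request.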
AWQ quantization is supported using the AutoAWQ implementation. Below is an example config:
models_dir: /models
model_family: autoawq
setup_params:
  repo_id: <repo id>
  tokenizer_repo_id: <repo id>
  filename: <model file name>
model_params:
  trust_remote_code: False
  fuse_layers: False
  safetensors: True
  device_map: "cuda:0"
To run this model, the GPU-enabled Docker image 1b5d/llm-api:latest-gpu is needed:
docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
Or you can use the docker-compose.gpu.yaml file available in this repo:
docker compose -f docker-compose.gpu.yaml up
Important Note: Before running Llama or Llama 2 on GPU, make sure to install the NVIDIA Driver on your host machine. You can verify the NVIDIA environment by executing the following command:
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
You should see a table displaying the current NVIDIA driver version and related information, confirming the proper setup.
When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, 1b5d/llm-api:latest-gpu, as an alternative to the default image. You can run this mode using a separate Docker Compose file:
docker compose -f docker-compose.gpu.yaml up
Or by directly running the container:
docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
Important Note: The llm-api:x.x.x-gptq-llama-cuda and llm-api:x.x.x-gptq-llama-triton images have been deprecated. Please switch to the 1b5d/llm-api:latest-gpu image when GPU support is required.
Example config file:
models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: user/repo_id
  filename: <model.safetensors or model.pt>
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
Example request:
POST /generate
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "What is the capital of paris",
"params": {
"temp": 0.8,
"top_p": 0.95,
"min_length": 10,
"max_length": 50
}
}'
- llama.cpp for making it possible to run Llama models on CPU.
- llama-cpp-python for the Python bindings library for llama.cpp.
- GPTQ-for-LLaMa for providing a GPTQ quantization implementation for Llama-based models.
- AutoAWQ for providing an implementation for AWQ quantization.
- Hugging Face for the great ecosystem of tooling they provide.