# 2.2.6 Backend: mistral.rs
Handle: `mistralrs`
URL: http://localhost:33951
Blazingly fast LLM inference.
Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy integration via an OpenAI-compatible HTTP server and Python bindings.
```bash
# [Optional] Pull the mistralrs images
harbor pull mistralrs

# Start the service
harbor up mistralrs

# Verify service health and logs
harbor mistralrs health
harbor logs mistralrs -n 200
```
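Once the service is healthy, it serves an OpenAI-compatible API on the URL listed above. As a quick sanity check, you can query the model listing endpoint directly; this is a minimal sketch assuming the standard OpenAI-compatible `/v1/models` route and the default Harbor port shown above.

```bash
# Sketch: list the models exposed by the OpenAI-compatible API
# (assumes the standard /v1/models route on the default port)
curl http://localhost:33951/v1/models
```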
```bash
# Open HF Search to find compatible models
harbor hf find gemma2

# For "plain" models:
# Download the model to the global HF cache
harbor hf download IlyaGusev/gemma-2-2b-it-abliterated

# Set model/type/arch
harbor mistralrs model IlyaGusev/gemma-2-2b-it-abliterated
harbor mistralrs type plain
harbor mistralrs arch gemma2

# Gemma 2 doesn't support paged attention
harbor mistralrs args --no-paged-attn

# Launch mistralrs
# The running model will be available in the Web UI
harbor up mistralrs
```
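With the model configured and the service up, you can also talk to it directly over the OpenAI-compatible API. A minimal sketch, assuming the standard `/v1/chat/completions` route and that the server accepts the configured model ID in the `model` field:

```bash
# Sketch: chat completion against the OpenAI-compatible endpoint
# (the "model" value is assumed to match the configured model ID)
curl http://localhost:33951/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "IlyaGusev/gemma-2-2b-it-abliterated",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```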
mistral.rs supports an interesting technique called In-Situ Quantization (ISQ): model weights are quantized on the fly as they are loaded, reducing VRAM usage.
```bash
# Gemma 2 from the previous example:
# IlyaGusev/gemma-2-2b-it-abliterated
# nvidia-smi > 5584MiB

# Enable ISQ
harbor mistralrs isq Q2K
# Restart the service
harbor restart mistralrs
# nvidia-smi > 2094MiB

# Disable ISQ if not needed
harbor mistralrs isq ""
```
The difference increases for models with larger `Linear` layers. Note that ISQ will affect the performance of the model.
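To reproduce the measurement from the comments above on your own GPU, you can toggle ISQ and compare VRAM usage. A small sketch using only the commands from this section plus `nvidia-smi`:

```bash
# Baseline: ISQ disabled
harbor mistralrs isq ""
harbor restart mistralrs
# (give the model a moment to finish loading before measuring)
nvidia-smi --query-gpu=memory.used --format=csv

# With 2-bit in-situ quantization
harbor mistralrs isq Q2K
harbor restart mistralrs
nvidia-smi --query-gpu=memory.used --format=csv
```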
Harbor mounts the global `llama.cpp` cache into the `mistralrs` service as a `gguf` folder. You can download models in the same way as for `llama.cpp`.
```bash
# Set the model type to GGUF
harbor mistralrs type gguf

# - Turn ISQ off, as it's not supported for GGUF models
# - For GGUFs, the architecture is inferred from the file
harbor mistralrs isq ""
harbor mistralrs arch ""

# Example 1: llama.cpp cache
# [Optional] See which GGUFs were already downloaded for llama.cpp
# (`llamacpp.cache` is the folder Harbor also mounts into the mistralrs service)
ls $(eval echo "$(harbor config get llamacpp.cache)")

# Use a "folder" specifier to point to the model:
# "gguf"          - the mounted llama.cpp cache
# "-f Model.gguf" - the model file
harbor mistralrs model "gguf -f Phi-3-mini-4k-instruct-q4.gguf"

# Example 2: HF cache
# [Optional] Grab the folder where the model is located
harbor hf scan-cache

# Use a "folder" specifier to point to the model:
# "hf/full/path"  - the mounted HF cache; note that you need
#                   the full path to the folder containing the .gguf
# "-f Model.gguf" - the model file
harbor mistralrs model "hf/hub/models--microsoft--Phi-3-mini-4k-instruct-gguf/snapshots/999f761fe19e26cf1a339a5ec5f9f201301cbb83/ -f Phi-3-mini-4k-instruct-q4.gguf"

# When configured, launch
harbor up mistralrs
```
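To later switch back from GGUF to a "plain" HF model, reverse the settings. A sketch reusing the commands from the plain-model example above:

```bash
# Back to a plain safetensors model from the HF cache
harbor mistralrs type plain
harbor mistralrs model IlyaGusev/gemma-2-2b-it-abliterated
harbor mistralrs arch gemma2
harbor restart mistralrs
```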
Specify extra args via the Harbor CLI:
```bash
# See available options
harbor run mistralrs --help

# Get/Set the extra arguments
harbor mistralrs args --no-paged-attn
```
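Like the other per-service settings on this page, the value can be read back or cleared. A sketch, assuming that calling the alias with no value prints the current setting and that clearing with an empty string works the same way as for `isq` and `arch` above:

```bash
# Print the current extra arguments (alias with no value)
harbor mistralrs args

# Clear the extra arguments (assumed to mirror `isq ""` / `arch ""`)
harbor mistralrs args ""
```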