Poly

Poly is a versatile LLM serving back-end. What it offers:

High-performance, efficient and reliable serving of multiple local LLM models
Optional GPU acceleration through either CUDA or Metal
Configurable LLM completion tasks (prompts, recall, stop tokens, etc.)
Streaming completion responses through HTTP SSE, chat using WebSockets
Biased sampling of completion output using JSON schema
Memory retrieval using vector databases (either built-in file based, or external such as Qdrant)
Accepts and automatically chunks PDF and DOCX files for storage to memory
API secured using either static API keys or JWT tokens
Simple, single binary + config file server deployment, horizontally scalable

Nice extras:

A web client to easily test and fine-tune configuration
A single-binary cross platform desktop client for locally running models

Supported models include:

LLaMa and derivatives (Alpaca, Vicuna, Guanaco, etc.)
LLaMA2
RedPajamas
MPT
Orca-mini

Web client	Desktop app

Sample of a model+task+memory configuration:

[models.gpt2dutch]
model_path = "./data/gpt2-small-dutch-f16.bin"
architecture = "gpt2"
use_gpu = false

[memories.dutch_qdrant]
store = { qdrant = { url = "http://127.0.0.1:6333", collection = "nl" } }
dimensions = 768
embedding_model = "gpt2dutch"

[tasks.dutch_completion]
model = "gpt2dutch"
prelude = "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n"
prefix =  "\n### User:\n"
postfix = "\n### Response:"
memorization = { memory = "dutch_qdrant", retrieve = 2 }

See config.example.toml for more example configurations.

Custom samplers can be configured using a string-based description, see here. Any biaser (i.e. JSON biaser) is injected as first sampler in the chain.

Concepts

In Poly, models are LLM models that support basic text generation and embedding operations. Models can be run on the GPU and have specific context lengths, but are otherwise unconfigurable.

A task uses a model in a specific way (i.e. using specific prompts, stop tokens, sampling, et cetera. Tasks are highly configurable. A model may be shared by multiple tasks.

A memory is a database that stores chunks of text, and allows retrieval of such chunks using vector similarity (where each chunk has a vector calculated as an embedding from an LLM). Memories can be re-used between tasks.

classDiagram
    class Model {
        String name
        Boolean use_gpu
        Int context_length
    }
    class Task {
        String name
        String prompts
        Inference parameters
    }
    class Memory {
        String name
        Chunking parameters
    }
    class Chunk {
        String text
        Vector embedding
    }

    Task "*" --> "1" Model: Generation model
    Task "*" --> "1" Memory
    Memory "*"--> "1" Model: Embedding model
    Memory "1" --> "*" Chunk: Stored chunks

The API exposes models, tasks and memories by their name (which is unique within their category).

Tasks

A task configures the way user input is transformed before it is fed to an LLM and the way the LLM output is transformed before it is returned to the user, in order to perform a specific task. A task can be configured to use (optional) prelude, prefix and postfix prompts. The prelude is fed once to the model for each session. The prefix and postfix are applied to each user input (i.e. each chat message):

sequenceDiagram
    actor User
    Task->>LLM: Prelude
    loop
        User->>+Task: Prompt
		alt When recall is enabled
			Task->>LLM: Recalled memory items (based on user prompt)
		end
        Task->>+LLM: Prefix
        Task->>LLM: Prompt
        Task->>LLM: Postfix

        note over LLM: Generate until stop token occurs or token/context limit reached
        LLM->>-Task: Response
        Task->>-User: Completion
    end

When biasing is enabled, an optional bias prompt can be configured. When configured the model will be asked to generate a response (following the flow as shown above). This response is however not directly returned to the user. Instead, the bias prompt is then fed, after which the biaser is enabled (and the biased response is returned to the user).

sequenceDiagram
    actor User
    Task->>LLM: Prelude
    loop
        User->>+Task: Prompt
		alt When recall is enabled
			Task->>LLM: Recalled memory items (based on user prompt)
		end
        Task->>+LLM: Prefix
        Task->>LLM: Prompt
        Task->>LLM: Postfix

        note over LLM: Generate until stop token occurs or token/context limit reached
        Task->>LLM: Bias prompt
        note over LLM: Generate with biaser enabled

        LLM->>-Task: Response (biased)
        Task->>-User: Completion
    end

Architecture

Poly is divided into separate crates that can be used independently:

poly-server: Serve LLMs through HTTP and WebSocket APIs (provides llmd)
poly-backend: Back-end implementation of LLM tasks
poly-extract: Crate for extracting plaintext from various document types
poly-bias: Crate for biasing LLM output to e.g. JSON following a schema
poly-ui: Simple desktop UI for local LLMs

Applications that want to employ Poly's functionality should use the HTTP REST API exposed by poly-server. Rust applications looking to integrate Poly's capabilities could also depend on poly-backend directly.

flowchart TD

subgraph Poly
	PW[poly web-client]
	PS[poly-server]
	PB[poly-backend]
	PE[poly-extract]
	PU[poly-ui]
	PBs[poly-bias]

	PW-->|HTTP,WS,SSE|PS
	PB-->PBs
	PS<-.->PE

	PS-->PB
	PU-->PB
end

LLM[rustformers/llm]
PBs<-.->LLM
PB<-.->Qdrant

T[tokenizers]
GGML
Metal
PB-->LLM
LLM<-.->T
LLM-->GGML
GGML-->Metal
GGML-->CUDA

Authors

Tommy van der Vorst (vandervorst@dialogic.nl)

License

Poly is licensed under the Apache 2.0 license from Git revision f329d35 onwards (only). Kindly note that this license provides no entitlement to support whatsoever. If you have any specific licensing or support requirements, please contact the authors.

We will accept contributions only if accompanied with a written statement that the contributed code is also (at least) Apache 2.0 licensed.

Specific licenses apply to the following files:

./data/gpt2.bin: MIT, source here, license here.
./data/gpt2-small-dutch-f16.bin: Apache 2.0, source here, license here.

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
.github/workflows		.github/workflows
client		client
data		data
docs		docs
poly-backend		poly-backend
poly-bias		poly-bias
poly-extract		poly-extract
poly-server		poly-server
poly-ui		poly-ui
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.woodpecker.yml		.woodpecker.yml
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
config.example.toml		config.example.toml
cublas.Dockerfile		cublas.Dockerfile
rustfmt.toml		rustfmt.toml
test.ps1		test.ps1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Poly

Concepts

Tasks

Architecture

Authors

License

About

Releases

Packages

Languages

License

pixelspark/poly

Folders and files

Latest commit

History

Repository files navigation

Poly

Concepts

Tasks

Architecture

Authors

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages