Running Multiple LLM Models with llamacpp Router
Guide to setting up llamacpp router for managing multiple GGUF models including embeddings and LLMs
How to use llamacpp’s new router feature to load and manage multiple models efficiently.
A new feature in llamacpp allows you to set up a router that can handle multiple models, such as running embedding models alongside large language models (LLMs). This is particularly useful for applications that require both embedding generation and text generation capabilities.
Setup #
This guide does not cover the installation of llamacpp itself; please refer to the llamacpp GitHub repository. My preference is to install from source so that the latest features are available.
Using the Router #
Downloading models #
You can manually download GGUF models into a directory, but a quick way is to use llama-cli:

```bash
llama-cli -hf org/model:GGUF_format
# example
llama-cli -hf meta/llama-3:Q4_K_M
```

Or you can download the files yourself (for example with wget) and place them into $HOME/.cache/llamacpp or any specified path.
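As a rough sketch of the manual route, assuming a placeholder Hugging Face URL (substitute the actual GGUF file URL for your model):

```bash
# Create the cache directory and download a GGUF file into it
# (the URL below is a placeholder, not a real model link)
mkdir -p "$HOME/.cache/llamacpp"
wget -P "$HOME/.cache/llamacpp" \
  "https://huggingface.co/<org>/<model>-GGUF/resolve/main/<model>-Q4_K_M.gguf"
```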
Configuring embedding models #
As of writing, the router does not have a smart way to allocate models of different sizes; for example, you cannot specify that you want one 80B model and four 8B models.

Embedding models must be configured explicitly, just as you would when running:

```bash
llama-server -hf embedding-model --embedding
```

You can configure the ini file as follows:
```ini
[*]
sleep-idle-seconds = 60

[<embedding model name>]
embedding = true
load-on-startup = true
sleep-idle-seconds = -1
```

Note: I am still trying out the configuration for my own use.
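For reference, here is a sketch of what a combined preset with both an embedding model and an LLM might look like, using only the keys shown above; the section names are placeholders and the LLM entry is my own assumption rather than a documented example:

```ini
# Defaults applied to all models
[*]
sleep-idle-seconds = 60

# Embedding model: keep it loaded and never put it to sleep
[<embedding model name>]
embedding = true
load-on-startup = true
sleep-idle-seconds = -1

# Chat model (assumed section): loaded at startup, idles out after the default 60s
[<llm model name>]
load-on-startup = true
```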
Running the router #
```bash
llama-server --models-preset llama-config.ini --host 0.0.0.0
```

Security note: `--host 0.0.0.0` makes the server accessible anywhere on the network. I run llama-server on a dedicated machine and do development on a separate machine, so I need to expose it. Access is restricted at the network level via firewalls and VPN, so I don’t configure authentication at the application level.

For local-only access, use `--host 127.0.0.1` instead.
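To check that the router is up, you can hit llama-server's OpenAI-compatible endpoints. A sketch, assuming the default port 8080 and that the "model" field matches the names configured in the ini file:

```bash
# List the models known to the server
curl http://<server-ip>:8080/v1/models

# Embedding request against the embedding model
curl http://<server-ip>:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<embedding model name>", "input": "hello world"}'

# Chat completion against an LLM managed by the router
curl http://<server-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<llm model name>", "messages": [{"role": "user", "content": "Hello"}]}'
```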
For production deployment, llama-server supports built-in security options:
- `--api-key KEY` - API key for authentication (or set the `LLAMA_API_KEY` env var)
- `--api-key-file FNAME` - path to file containing API keys
- `--ssl-key-file FNAME` - PEM-encoded SSL private key
- `--ssl-cert-file FNAME` - PEM-encoded SSL certificate
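For example, a sketch combining the preset with the flags above (the key value and certificate paths are placeholders):

```bash
llama-server --models-preset llama-config.ini \
  --host 0.0.0.0 \
  --api-key "change-me" \
  --ssl-key-file /etc/llama/server.key \
  --ssl-cert-file /etc/llama/server.crt
```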
Alternatively, you can put llama-server behind a reverse proxy like nginx.
Benchmarking #
For benchmarking, I have written a tool here: https://github.com/wheynelau/llmperf-rs, which can come in handy for testing settings such as concurrency and ctx-size.