.NET Technology Show

Backstage pass to all things ServiceStack & .NET


Using llama.cpp to self-host Large Language Models in Production

A practical guide to self-hosting LLMs in production using llama.cpp's llama-server with Docker Compose and Systemd


This episode covers llama-server, a production-focused tool built on llama.cpp for self-hosting large language models, contrasting it with user-friendly local alternatives.
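
For context, running llama-server directly is a single command that serves a GGUF model over an OpenAI-compatible HTTP API. A minimal sketch, assuming a locally downloaded model (the path and tuning values are placeholders):

```sh
# Serve a GGUF model over llama-server's OpenAI-compatible HTTP API
# -c sets the context size; -ngl offloads layers to the GPU (if built with GPU support)
llama-server -m ./models/model.gguf --host 0.0.0.0 --port 8080 -c 4096 -ngl 99
```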

It details how to deploy llama-server with Docker, including GPU-accelerated configurations, and how to run it natively under Systemd for optimized performance; both approaches are sketched below.
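
For the Docker route, here is a minimal Docker Compose sketch with an NVIDIA GPU reservation; the image tag, model filename, and flags are assumptions to adjust for your setup (CPU-only deployments can use the plain `server` image tag):

```yaml
# docker-compose.yml — minimal sketch, not a hardened production config
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda  # CUDA build of llama-server
    volumes:
      - ./models:/models            # mount local GGUF models into the container
    ports:
      - "8080:8080"
    command: >
      -m /models/model.gguf
      --host 0.0.0.0 --port 8080
      -c 4096 -ngl 99
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia        # requires the NVIDIA Container Toolkit on the host
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
```

For the native route, a sketch of a Systemd unit; the binary path, model path, and service user are placeholders:

```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp llama-server
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
    -m /opt/models/model.gguf \
    --host 127.0.0.1 --port 8080 \
    -c 4096 -ngl 99
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

After saving the unit, `systemctl daemon-reload` followed by `systemctl enable --now llama-server` starts the service and keeps it running across reboots.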

The episode also introduces AI Server, an open-source managed gateway designed to streamline AI integrations by centralizing management of multiple LLM providers, including llama-server instances and cloud-based APIs.

It explains how to use AI Server by registering llama-server endpoints, creating API keys for each application, and using its typed APIs from a range of programming languages for synchronous, queued, and callback-based interactions, as in the C# sketch below.
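
As a rough illustration of the synchronous flow in C#, here is a sketch using ServiceStack's JsonApiClient; the base URL, API key, model name, and the OpenAiChatCompletion DTO shape are assumptions to verify against the current AI Server documentation:

```csharp
using ServiceStack;
using AiServer.ServiceModel; // assumed namespace for the AI Server request/response DTOs

// Base URL and API key are placeholders for your own AI Server instance
var client = new JsonApiClient("https://your-ai-server.example.com")
{
    BearerToken = Environment.GetEnvironmentVariable("AI_SERVER_API_KEY")
};

// Synchronous chat completion routed to a registered llama-server (or cloud) provider
var response = await client.PostAsync(new OpenAiChatCompletion
{
    Model = "llama3.1:8b",  // model name as registered with the provider
    Messages =
    [
        new() { Role = "user", Content = "What is the capital of France?" }
    ],
    MaxTokens = 50
});

Console.WriteLine(response.Choices[0].Message.Content);
```

The queued and callback-based variants described in the episode follow the same typed-DTO shape, with the request enqueued and the result either polled for or delivered to a callback URL.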