sie: a self-hosted inference cluster for agents that serves 85+ models through a single unified API

What it solves

SIE (Superlinked Inference Engine) removes the need to run a separate model server for each AI agent task. Instead of a patchwork of servers for embedding, reranking, OCR, and generation, it provides a single self-hosted, open-source cluster that serves more than 85 pre-configured models through one unified API.

How it works

SIE runs as a server (via Docker or a native install) that manages a library of quality-verified models from Hugging Face. Models are loaded on demand and evicted in LRU (Least Recently Used) order, so the server can keep several models available side by side without exhausting resources. A unified SDK for Python and TypeScript exposes functions such as encode, score, extract, and generate across different model architectures.
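
To make the on-demand loading and LRU eviction idea concrete, here is a minimal Python sketch of the general technique. It is not SIE's actual implementation: the class name LRUModelCache, the loader callback, and the count-based capacity are all assumptions made for illustration (a real server would more likely budget by GPU or host memory than by number of models).

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUModelCache:
    """Illustrative cache: loads models on first use, evicts the least recently used.

    Hypothetical helper for this write-up, not part of the SIE SDK.
    """

    def __init__(self, capacity: int, loader: Callable[[str], Any]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # assumed limit: number of resident models
        self.loader = loader          # called only on a cache miss
        self._models = OrderedDict()  # model name -> model, oldest first
        self.evicted = []             # names evicted so far, for inspection

    def get(self, name: str) -> Any:
        if name in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: load on demand, then evict if over capacity.
        model = self.loader(name)
        self._models[name] = model
        if len(self._models) > self.capacity:
            oldest, _ = self._models.popitem(last=False)
            self.evicted.append(oldest)
        return model

    def loaded(self) -> list:
        """Names of resident models, least recently used first."""
        return list(self._models)
```

With a capacity of 2, requesting models "a", "b", "a", and then "c" leaves "a" and "c" resident and evicts "b", because "b" was the least recently used when "c" arrived.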

Who it’s for

Developers building AI agents who want to self-host their inference stack in their own cloud (GKE, EKS) and avoid the complexity of deploying and maintaining individual model servers for every specialized task.

Highlights

  • Unified API: One interface for search/retrieval, document-to-markdown conversion, structured output, and agent loops.
  • Extensive Model Library: 85+ pre-configured models including Stella, SPLADE, Qwen3, and GLiNER.
  • Production-Ready Stack: Includes a load-balancing gateway, KEDA autoscaling, Grafana dashboards, and Terraform modules for GKE/EKS.
  • Broad Integration: Compatible with LangChain, LlamaIndex, Haystack, DSPy, CrewAI, and popular vector databases like Chroma, Qdrant, and Weaviate.
  • OpenAI Compatible: Offers a /v1/embeddings endpoint for easy migration.
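
As a rough illustration of what migrating to the /v1/embeddings endpoint could look like, the Python sketch below builds and parses requests in the standard OpenAI embeddings shape using only the standard library. The base URL, port, and model name are placeholders rather than documented SIE defaults, authentication is not handled, and the helper function names are hypothetical.

```python
import json
import urllib.request

# Placeholder address: adjust to wherever your server is actually listening.
BASE_URL = "http://localhost:8080"


def build_embeddings_request(texts, model, base_url=BASE_URL):
    """Build a POST request in the OpenAI embeddings format (model + input)."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(body):
    """Return the embedding vectors from an OpenAI-style response, in input order."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, model, base_url=BASE_URL):
    """Send the request and return one vector per input text (needs a running server)."""
    request = build_embeddings_request(texts, model, base_url)
    with urllib.request.urlopen(request) as response:
        return parse_embeddings_response(response.read())
```

A call such as embed(["hello world"], "your-model-name") would then return a list with one embedding per input, assuming a server is reachable at the given address and the model name matches one it serves.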
