OpenAI-compatible, self-hosted

Route every prompt to the cheapest model that is fast enough.

One endpoint over Groq, Gemini, Ollama and more, with streaming, semantic caching, per-key limits, and cost tracked to the cent.

$ pip install openai
change one base_url
routing engine: cheapest within target
POST /v1/chat/completions (stream: true)
target: latency ≤ 400ms, quality ≥ tier 2

model               latency   price     result
llama-3.1-8b        250ms     $0.08/M   routed
gemini-flash-lite   480ms     $0.30/M   480ms > 400ms
gemini-2.0-flash    610ms     $0.40/M   610ms > 400ms
llama-3.3-70b       560ms     $0.79/M   560ms > 400ms

routed 2,847 via llama, saved $18.42
routes across OpenAI, Groq, Gemini, Ollama, Anthropic, Mistral

A control plane for every model you call.

Drop in one base URL. Relay handles the routing, the streaming, the caching, the limits, and the accounting underneath.

Cost-aware routing

Every request is scored against your latency and quality target. The cheapest model that clears the bar wins, with the rest queued as automatic fallbacks.

llama-3.1-8b        250ms / $0.08
gemini-flash-lite   480ms / $0.30
llama-3.3-70b       560ms / $0.79
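The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not Relay's routing engine: the latency and price figures are the ones shown on this page, while the `tier` values, the `Model` class, and the `route` function are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    latency_ms: int        # latency figure shown on this page
    usd_per_mtok: float    # price per million tokens shown on this page
    tier: int              # quality tier: assumed values, for illustration only


CATALOG = [
    Model("llama-3.1-8b", 250, 0.08, 2),
    Model("gemini-flash-lite", 480, 0.30, 2),
    Model("gemini-2.0-flash", 610, 0.40, 3),
    Model("llama-3.3-70b", 560, 0.79, 3),
]


def route(models, max_latency_ms, min_tier):
    """Return (chosen, fallbacks): the cheapest eligible model, then the rest."""
    eligible = [m for m in models
                if m.latency_ms <= max_latency_ms and m.tier >= min_tier]
    ranked = sorted(eligible, key=lambda m: m.usd_per_mtok)
    if not ranked:
        return None, []
    return ranked[0], ranked[1:]


chosen, fallbacks = route(CATALOG, max_latency_ms=600, min_tier=2)
print(chosen.name, [m.name for m in fallbacks])
```

With a 600 ms target, three models clear the bar and the cheapest one is chosen; tightening the target to 400 ms leaves only llama-3.1-8b and no fallbacks.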

SSE streaming

Token-by-token over Server-Sent Events, with first-token timing recorded per request.

shipping every token as it lands
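As a rough sketch of what first-token timing involves, the function below reads OpenAI-style SSE lines (`data: {...}` events ending with `data: [DONE]`) and records the delay before the first non-empty content delta. The names `read_stream` and `chunk` are hypothetical, and the injectable `clock` is only there to make the timing testable; how Relay measures this internally is not described here.

```python
import json
import time


def read_stream(lines, clock=time.monotonic):
    """Collect streamed text and time-to-first-token from SSE lines."""
    start = clock()
    first_token_s = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":           # OpenAI-style end-of-stream sentinel
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        text = delta.get("content")
        if text:
            if first_token_s is None:
                first_token_s = clock() - start
            parts.append(text)
    return "".join(parts), first_token_s


def chunk(text):
    """Build one sample SSE event in the chat-completions chunk shape."""
    return "data: " + json.dumps({"choices": [{"delta": {"content": text}}]})


sample = [chunk("shipping "), "", chunk("every "), "", chunk("token"), "", "data: [DONE]"]
text, ttft = read_stream(sample)
print(text, ttft)
```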

Semantic cache

Local embeddings plus Redis vector search return a prior answer when prompts mean the same thing.

capital of France?
0.88 hit
France's capital city
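The lookup can be pictured as a nearest-neighbour search with a similarity cutoff. The sketch below keeps everything in memory and uses hand-written toy vectors; the real stack uses local embeddings and Redis vector search, so `SemanticCache`, the 0.85 threshold, and the vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.85):   # assumed cutoff, not Relay's setting
        self.threshold = threshold
        self.entries = []                 # list of (embedding, answer)

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return (answer, score) for the closest entry above threshold, else None."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is None:
            return None
        score = cosine(embedding, best[0])
        return (best[1], score) if score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0, 0.2], "Paris")       # toy embedding for "capital of France?"
print(cache.get([0.9, 0.1, 0.3]))         # toy embedding for "France's capital city"
print(cache.get([0.0, 1.0, 0.0]))         # unrelated prompt
```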

Per-key limits

Atomic requests-per-minute and tokens-per-minute limits. The token budget is reserved up front and reconciled once actual usage is known.

60 rpm, 100k tpm, 429 + Retry-After
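A minimal in-memory sketch of the reserve-then-reconcile idea, using a fixed one-minute window and a lock in place of whatever atomic store operation the gateway actually relies on. `KeyLimiter`, its methods, and the fixed-window choice are assumptions for illustration; a rejected reservation corresponds to a 429 response with `Retry-After` set to the returned number of seconds.

```python
import threading


class KeyLimiter:
    def __init__(self, rpm=60, tpm=100_000):
        self.rpm, self.tpm = rpm, tpm
        self.lock = threading.Lock()      # stands in for an atomic store operation
        self.windows = {}                 # key -> [window_minute, requests, tokens]

    def reserve(self, key, est_tokens, now):
        """Try to admit a request. Returns (allowed, retry_after_seconds)."""
        minute = int(now // 60)
        with self.lock:
            win = self.windows.get(key)
            if win is None or win[0] != minute:
                win = self.windows[key] = [minute, 0, 0]
            if win[1] + 1 > self.rpm or win[2] + est_tokens > self.tpm:
                return False, 60 - int(now % 60)   # seconds until the window resets
            win[1] += 1
            win[2] += est_tokens
            return True, 0

    def reconcile(self, key, est_tokens, actual_tokens):
        """Swap the reserved estimate for real usage (simplified: current window only)."""
        with self.lock:
            win = self.windows.get(key)
            if win is not None:
                win[2] = max(0, win[2] + actual_tokens - est_tokens)


limiter = KeyLimiter(rpm=60, tpm=100_000)
print(limiter.reserve("glw_example", est_tokens=1_200, now=0.0))
limiter.reconcile("glw_example", est_tokens=1_200, actual_tokens=900)
```

Reserving an estimate before the call keeps concurrent requests from overshooting the token budget together; reconciling afterwards returns any unused part of the estimate.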

Cost tracking

Every call priced from token usage and logged to Postgres, with cache savings attributed.
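The arithmetic behind per-call pricing is simple enough to sketch. The example assumes a single blended rate per million tokens, taken from the figures on this page; real providers usually price input and output tokens separately, and `price_call` is a hypothetical helper rather than Relay's accounting code.

```python
PRICES_USD_PER_MTOK = {       # assumed single blended rate per model
    "llama-3.1-8b": 0.08,
    "gemini-flash-lite": 0.30,
}


def price_call(model, prompt_tokens, completion_tokens, cache_hit=False):
    """Return (cost_usd, saved_usd) for one call.

    A cache hit costs nothing and records what the provider call
    would have cost as savings.
    """
    tokens = prompt_tokens + completion_tokens
    full = tokens / 1_000_000 * PRICES_USD_PER_MTOK[model]
    return (0.0, full) if cache_hit else (full, 0.0)


print(price_call("gemini-flash-lite", 600_000, 400_000))
print(price_call("gemini-flash-lite", 600_000, 400_000, cache_hit=True))
```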

Observability, not an afterthought

Structured logs, a Prometheus endpoint, and a dashboard that reads straight off the request log. The metrics are there the moment the gateway boots.

  • gateway_requests_total
  • gateway_request_latency_seconds
  • gateway_cost_usd_total
  • gateway_cache_hits_total
  • gateway_provider_errors_total
  • gateway_rate_limited_total
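For readers unfamiliar with what a Prometheus endpoint serves, here is a small stdlib-only sketch that renders one counter in the Prometheus text exposition format. In practice a client library such as `prometheus_client` would do this; the `Counter` class below and the `provider` label are assumptions made for illustration, and only the metric name comes from the list above.

```python
from collections import defaultdict


class Counter:
    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self.values = defaultdict(float)      # label tuple -> running total

    def inc(self, amount=1.0, **labels):
        self.values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        out = [f"# HELP {self.name} {self.help_text}",
               f"# TYPE {self.name} counter"]
        for labels, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = "{" + label_str + "}" if label_str else ""
            out.append(f"{self.name}{suffix} {value}")
        return "\n".join(out) + "\n"


requests_total = Counter("gateway_requests_total", "Requests handled by the gateway")
requests_total.inc(provider="groq")       # "provider" is a hypothetical label
requests_total.inc(provider="groq")
requests_total.inc(provider="gemini")
print(requests_total.render())
```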

The path of one request

Six steps from key to logged answer.

  1. Authenticate

    Hash the bearer key, check it is live.

  2. Rate limit

    Atomic rpm, then reserve tpm.

  3. Route

    Cheapest model meeting the target.

  4. Cache

    Semantic lookup before any provider.

  5. Stream

    Tee tokens to client and buffer.

  6. Account

    Price it, log it, store the answer.
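Two of the steps above, Authenticate and Stream, can be sketched briefly. The example assumes keys are stored as SHA-256 hashes in a lookup table and that the "tee" is a generator that copies each chunk into a buffer on its way to the client; `KEYS`, `hash_key`, `authenticate`, and `tee` are hypothetical names, and the hashing scheme Relay actually uses is not stated here.

```python
import hashlib


def hash_key(raw_key):
    """Hash a raw API key so that only the digest needs to be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


# Hypothetical key store: digest -> record. Raw keys are never kept.
KEYS = {hash_key("glw_example"): {"name": "demo", "active": True}}


def authenticate(authorization_header):
    """Return the key record for a live bearer key, else None."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    record = KEYS.get(hash_key(token.strip()))
    return record if record and record["active"] else None


def tee(chunks, buffer):
    """Yield each chunk to the client while keeping a copy for later steps."""
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk


print(authenticate("Bearer glw_example"))
buffered = []
for piece in tee(["ship", " it"], buffered):
    pass                                   # a real handler would write to the client here
print("".join(buffered))
```

Buffering during the stream is what lets the final step price the call and store the full answer without asking the provider again.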

ships with the gateway

Every request, costed and charted.

The dashboard is part of the product, not a hosted add-on. Run the stack and it reads straight off your request logs: spend, savings, provider mix, per-model latency, and the keys driving the bill.

localhost:5173 / analytics
1h / 24h / 7d / 30d
demo

Requests: 9.4K (5.1M tokens)
Spend: $2.64 ($1.22 saved)
Cache hit rate: 36.0% (3.4K hits)
Avg latency: 412 ms (1.2% errors)

Spend and savings

per hour

Provider mix

by requests
  • groq: 4.9K / $0.8978
  • gemini: 2.9K / $1.37
  • ollama: 1.1K / $0
  • anthropic: 471 / $0.3697

Latency by model

p50 / p95, ms

Top keys by spend

4 active
Key        Name            Requests  Spend
glw_7Qk2   prod-chat       4.3K      $1.21
glw_a1Bd   batch-enrich    2.6K      $0.8186
glw_Lp9c   internal-tools  1.6K      $0.4225
glw_Zx04   staging         848       $0.1848

The cache pays for itself.

Repeated and near-repeat prompts never reach a provider. Relay records what each cached hit would have cost, so the savings are a line item, not a guess.

saved, 48h: $0.00
of total spend: 32%
spend avoided, cents / hour

If you can call OpenAI, you already use it.

Point the SDK at your Relay host and pass model="auto". Run the whole stack locally with one command.

# clone, then
$ cp .env.example .env && make up
# postgres + redis + gateway, migrated and live
quickstart.py

from openai import OpenAI

client = OpenAI(
    base_url="https://relay.your-host.dev/v1",
    api_key="glw_*************",
)

# model="auto" lets Relay pick the route
res = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "ship it"}],
    extra_body={"routing": {"max_latency_ms": 600}},
)
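For clients that do not use the SDK, the same call can be built by hand. The sketch below only constructs the HTTP request with the standard library so the body shape is visible: the SDK's `extra_body` fields are merged into the top level of the JSON payload, so `routing` sits next to `model` and `messages`. `build_request` is a hypothetical helper and `glw_example` is a placeholder key.

```python
import json
import urllib.request


def build_request(base_url, api_key, prompt, max_latency_ms=600):
    """Build (but do not send) a chat completion request for Relay."""
    body = {
        "model": "auto",                                  # let Relay pick the route
        "messages": [{"role": "user", "content": prompt}],
        "routing": {"max_latency_ms": max_latency_ms},    # same field as extra_body
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = build_request("https://relay.your-host.dev/v1", "glw_example", "ship it")
print(req.get_method(), req.full_url)
# Sending it is one more call: urllib.request.urlopen(req)
```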