OpenAI-compatible, self-hosted

Route every prompt to the cheapest model that is fast enough.

One endpoint over Groq, Gemini, Ollama and more, with streaming, semantic caching, per-key limits, and cost tracked to the cent.


$ pip install openai
# change one base_url

routing engine: cheapest within target

POST /v1/chat/completions (stream: true)

latency ≤ 400ms, quality ≥ tier 2

Model	Latency	Price	Status
llama-3.1-8b	250ms	$0.08/M	routed
gemini-flash-lite	480ms	$0.30/M	480ms > 400ms
gemini-2.0-flash	610ms	$0.40/M	610ms > 400ms
llama-3.3-70b	560ms	$0.79/M	560ms > 400ms

routed 2,847 via llama, saved $18.42

routes across OpenAI, Groq, Gemini, Ollama, Anthropic, Mistral

A control plane for every model you call.

Drop in one base URL. Relay handles the routing, the streaming, the caching, the limits, and the accounting underneath.

Cost-aware routing

Every request is scored against your latency and quality target. The cheapest model that clears the bar wins, with the rest queued as automatic fallbacks.

llama-3.1-8b	250ms / $0.08
gemini-flash-lite	480ms / $0.30
llama-3.3-70b	560ms / $0.79
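A minimal sketch of the selection rule described above, not Relay's actual engine. `Candidate`, `plan_route`, and the `tier` values are hypothetical: the page does not say which tier each model sits in, so every model is assumed to be tier 2, with higher meaning stronger. Latencies and prices are the figures shown on this page, and the fallback ordering (remaining models by price) is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: int       # observed or estimated latency
    usd_per_mtok: float   # price per million tokens
    tier: int             # hypothetical quality tier, higher is stronger


CANDIDATES = [
    Candidate("llama-3.1-8b", 250, 0.08, 2),
    Candidate("gemini-flash-lite", 480, 0.30, 2),
    Candidate("gemini-2.0-flash", 610, 0.40, 2),
    Candidate("llama-3.3-70b", 560, 0.79, 2),
]


def plan_route(candidates, max_latency_ms, min_tier):
    """Return (chosen, fallbacks); chosen is None when nothing clears the bar."""
    by_price = sorted(candidates, key=lambda c: c.usd_per_mtok)
    eligible = [
        c for c in by_price
        if c.latency_ms <= max_latency_ms and c.tier >= min_tier
    ]
    if not eligible:
        return None, by_price
    chosen = eligible[0]
    # Assumed ordering: other eligible models first, then the rest, each by price.
    fallbacks = eligible[1:] + [c for c in by_price if c not in eligible]
    return chosen, fallbacks


chosen, fallbacks = plan_route(CANDIDATES, max_latency_ms=400, min_tier=2)
print(chosen.name, [c.name for c in fallbacks])
```

With the 400ms target only llama-3.1-8b clears the bar, which matches the route shown at the top of the page.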

SSE streaming

Token-by-token over Server-Sent Events, with first-token timing recorded per request.

shipping every token as it lands
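A rough sketch of what a client sees on the wire, assuming the gateway emits OpenAI-style chunks (`data: {...}` lines ending with `data: [DONE]`). `consume_sse` is a hypothetical helper, and the demo lines stand in for a real HTTP response body.

```python
import json
import time


def consume_sse(lines, clock=time.perf_counter):
    """Collect streamed text and the time to the first token, in seconds."""
    started = clock()
    first_token_s = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            if first_token_s is None:
                first_token_s = clock() - started  # first-token timing
            parts.append(text)
    return "".join(parts), first_token_s


demo_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "ship"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " it"}}]}',
    "",
    "data: [DONE]",
]
text, first_token_s = consume_sse(demo_stream)
print(text)
```

The role-only first chunk carries no content, so the timer stops on the second chunk rather than the first.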

Semantic cache

Local embeddings plus Redis vector search return a prior answer when prompts mean the same thing.

"capital of France?" matches "France's capital city": 0.88 hit
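The lookup can be pictured as a nearest-neighbour search with a similarity cutoff. This is an in-memory sketch, not the Redis vector search Relay uses: the three-dimensional vectors are toy embeddings, and the `0.85` threshold and the `lookup` helper are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(query_vec, entries, threshold=0.85):
    """Return (answer, score); answer is None when nothing is close enough."""
    best_answer, best_score = None, 0.0
    for vec, answer in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score >= threshold:
        return best_answer, best_score
    return None, best_score


# Toy embeddings: each stored prompt vector is paired with its cached answer.
entries = [
    ([1.0, 0.0, 0.0], "Paris"),
    ([0.0, 1.0, 0.0], "Berlin"),
]
hit = lookup([0.9, 0.4, 0.1], entries)   # close to the first entry
miss = lookup([0.0, 0.0, 1.0], entries)  # unrelated to both entries
print(hit, miss)
```

A lower threshold raises the hit rate at the risk of returning an answer to a different question, so the cutoff is worth tuning per workload.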

Per-key limits

Atomic requests-per-minute and tokens-per-minute limits, with tokens reserved up front and reconciled after the response.

60 rpm / 100k tpm / 429 + Retry-After
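One way to picture reserve-then-reconcile, as a single-process sketch. `KeyLimiter` is hypothetical, it uses a fixed one-minute window, and a lock stands in for whatever atomic primitive the gateway really uses across processes.

```python
import threading
import time


class KeyLimiter:
    """Fixed one-minute window per key; an in-memory stand-in for a shared store."""

    def __init__(self, rpm=60, tpm=100_000, clock=time.time):
        self.rpm, self.tpm, self.clock = rpm, tpm, clock
        self.lock = threading.Lock()  # makes check-and-update one atomic step
        self.windows = {}             # key -> [minute, requests, tokens]

    def _window(self, key, now):
        minute = int(now // 60)
        window = self.windows.get(key)
        if window is None or window[0] != minute:
            window = self.windows[key] = [minute, 0, 0]
        return window

    def reserve(self, key, estimated_tokens):
        """Return (allowed, retry_after_seconds); a denial maps to 429 + Retry-After."""
        with self.lock:
            now = self.clock()
            window = self._window(key, now)
            if window[1] + 1 > self.rpm or window[2] + estimated_tokens > self.tpm:
                return False, 60 - int(now % 60)
            window[1] += 1
            window[2] += estimated_tokens
            return True, 0

    def reconcile(self, key, estimated_tokens, actual_tokens):
        """Swap the reservation for real usage once the response is complete."""
        with self.lock:
            window = self._window(key, self.clock())
            window[2] = max(0, window[2] + actual_tokens - estimated_tokens)


limiter = KeyLimiter(rpm=60, tpm=1000, clock=lambda: 130.0)  # frozen clock for the demo
first = limiter.reserve("glw_demo", 600)   # fits inside 1000 tpm
second = limiter.reserve("glw_demo", 600)  # would reach 1200 tokens, so it is denied
limiter.reconcile("glw_demo", 600, 200)    # the first call really used 200 tokens
third = limiter.reserve("glw_demo", 600)   # 200 + 600 now fits
print(first, second, third)
```

Reserving an estimate before the call keeps concurrent requests from overshooting the budget together; reconciling afterwards hands back whatever the estimate overcounted.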

Cost tracking

Every call priced from token usage and logged to Postgres, with cache savings attributed.
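A sketch of the pricing arithmetic, assuming one blended per-million-token rate per model as quoted on this page. Real providers often price input and output tokens separately, and `price_call` and its row shape are hypothetical rather than Relay's schema.

```python
from decimal import Decimal

# Dollars per million tokens, taken from the figures on this page.
PRICE_PER_MTOK = {
    "llama-3.1-8b": Decimal("0.08"),
    "llama-3.3-70b": Decimal("0.79"),
}


def price_call(model, prompt_tokens, completion_tokens, cache_hit=False):
    """Build the row that would be logged for one request."""
    tokens = prompt_tokens + completion_tokens
    list_cost = PRICE_PER_MTOK[model] * tokens / Decimal(1_000_000)
    return {
        "model": model,
        "tokens": tokens,
        # A cache hit costs nothing; what it would have cost is booked as savings.
        "cost_usd": Decimal(0) if cache_hit else list_cost,
        "saved_usd": list_cost if cache_hit else Decimal(0),
    }


paid = price_call("llama-3.3-70b", prompt_tokens=1000, completion_tokens=500)
cached = price_call("llama-3.3-70b", 1000, 500, cache_hit=True)
print(paid["cost_usd"], cached["saved_usd"])
```

`Decimal` avoids the drift that floats pick up when thousands of sub-cent amounts are summed.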

Observability, not an afterthought

Structured logs, a Prometheus endpoint, and a dashboard that reads straight off the request log. The metrics are there the moment the gateway boots.

gateway_requests_total
gateway_request_latency_seconds
gateway_cost_usd_total
gateway_cache_hits_total
gateway_provider_errors_total
gateway_rate_limited_total
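For a feel of what a scrape of those counters might return, here is a hand-rolled sketch of the Prometheus text format. The metric names come from the list above; the `provider` label and the `Counters` class are assumptions, and a real service would more likely use an existing Prometheus client library than render lines itself.

```python
from collections import defaultdict


class Counters:
    """Tiny in-memory counter registry that renders Prometheus-style lines."""

    def __init__(self):
        self.values = defaultdict(float)

    def inc(self, name, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same series.
        self.values[(name, tuple(sorted(labels.items())))] += amount

    def render(self):
        lines = []
        for (name, labels), value in sorted(self.values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = "{" + label_text + "}" if label_text else ""
            lines.append(f"{name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


metrics = Counters()
metrics.inc("gateway_requests_total", provider="groq")
metrics.inc("gateway_requests_total", provider="groq")
metrics.inc("gateway_cache_hits_total")
metrics.inc("gateway_cost_usd_total", 0.0024, provider="groq")
print(metrics.render())
```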

The path of one request

Six steps from key to logged answer.

01
Authenticate
Hash the bearer key, check it is live.
02
Rate limit
Atomic rpm, then reserve tpm.
03
Route
Cheapest model meeting the target.
04
Cache
Semantic lookup before any provider.
05
Stream
Tee tokens to client and buffer.
06
Account
Price it, log it, store the answer.
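The six steps can be sketched end to end in one function. Everything here is a stand-in under stated assumptions: the key store, `fake_provider`, the exact-match cache (in place of the semantic lookup), the plain request counter (in place of atomic rpm/tpm), and the 401/429/503 status choices are all hypothetical.

```python
import hashlib

LIVE_KEY_HASHES = {hashlib.sha256(b"glw_demo_key").hexdigest()}  # hypothetical key store
MODELS = [("llama-3.1-8b", 250, 0.08), ("llama-3.3-70b", 560, 0.79)]  # name, ms, $/M tokens
cache, request_log, requests_this_minute = {}, [], {}


def fake_provider(model, prompt):
    """Stand-in for a real provider stream."""
    yield from ("echo:", " ", prompt)


def handle(bearer_key, prompt, max_latency_ms=400, rpm=60):
    # 01 Authenticate: compare hashes so raw keys never need to be stored.
    key_hash = hashlib.sha256(bearer_key.encode()).hexdigest()
    if key_hash not in LIVE_KEY_HASHES:
        return 401, ""
    # 02 Rate limit: a counter that never resets stands in for the real limiter.
    used = requests_this_minute.get(key_hash, 0)
    if used >= rpm:
        return 429, ""
    requests_this_minute[key_hash] = used + 1
    # 03 Route: cheapest model that meets the latency target.
    eligible = [m for m in MODELS if m[1] <= max_latency_ms]
    if not eligible:
        return 503, ""
    model, _, price = min(eligible, key=lambda m: m[2])
    # 04 Cache: look up before calling any provider.
    if prompt in cache:
        request_log.append({"model": model, "cache_hit": True, "cost_usd": 0.0})
        return 200, cache[prompt]
    # 05 Stream: buffer each token; a real gateway would also forward it to the client.
    buffer = []
    for token in fake_provider(model, prompt):
        buffer.append(token)
    answer = "".join(buffer)
    # 06 Account: price it, log it, store the answer.
    tokens = len(prompt.split()) + len(buffer)  # crude token count for the sketch
    request_log.append(
        {"model": model, "cache_hit": False, "cost_usd": tokens * price / 1_000_000}
    )
    cache[prompt] = answer
    return 200, answer
```

Calling `handle("glw_demo_key", "ship it")` twice takes the provider path once and the cache path once, so the second log row records a hit at zero cost.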

ships with the gateway

Every request, costed and charted.

The dashboard is part of the product, not a hosted add-on. Run the stack and it reads straight off your request logs: spend, savings, provider mix, per-model latency, and the keys driving the bill.

localhost:5173 / analytics


1h / 24h / 7d / 30d

demo

Requests

9.4K

5.1M tokens

Spend

$2.64

$1.22 saved

Cache hit rate

36.0%

3.4K hits

Avg latency

412 ms

1.2% errors

Spend and savings

per hour

Provider mix

by requests

Provider	Requests	Spend
groq	4.9K	$0.8978
gemini	2.9K	$1.37
ollama	1.1K	$0
anthropic	471	$0.3697

Latency by model

p50 / p95, ms

Top keys by spend

4 active

Key	Name	Requests	Spend
glw_7Qk2	prod-chat	4.3K	$1.21
glw_a1Bd	batch-enrich	2.6K	$0.8186
glw_Lp9c	internal-tools	1.6K	$0.4225
glw_Zx04	staging	848	$0.1848

The cache pays for itself.

Repeated and near-repeat prompts never reach a provider. Relay records what each cached hit would have cost, so the savings are a line item, not a guess.

saved, 48h

$0.00

of total spend

32%

spend avoided, cents / hour

If you can call OpenAI, you already use it.

Point the SDK at your Relay host and pass model="auto". Run the whole stack locally with one command.

# clone, then

$ cp .env.example .env && make up

# postgres + redis + gateway, migrated and live


quickstart.py

from openai import OpenAI

# Point the standard OpenAI SDK at your Relay host.
client = OpenAI(
    base_url="https://relay.your-host.dev/v1",
    api_key="glw_*************",
)

# model="auto" lets Relay pick the route
res = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "ship it"}],
    extra_body={"routing": {"max_latency_ms": 600}},
)
print(res.choices[0].message.content)