Route every prompt to the cheapest model that is fast enough.
One endpoint over Groq, Gemini, Ollama and more, with streaming, semantic caching, per-key limits, and cost tracked to the cent.
A control plane for every model you call.
Drop in one base URL. Relay handles the routing, the streaming, the caching, the limits, and the accounting underneath.
Cost-aware routing
Every request is scored against your latency and quality targets. The cheapest model that clears the bar wins, with the rest queued as automatic fallbacks.
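A minimal sketch of the selection rule described above, not Relay's actual implementation. The `ModelProfile` fields, the model names, and every number in `CATALOG` are hypothetical, chosen only to show the filter-then-sort shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # USD, hypothetical figure
    p95_latency_ms: float      # hypothetical measured latency
    quality: float             # hypothetical score between 0 and 1


def plan_route(models, max_latency_ms, min_quality):
    """Return the models that clear both targets, cheapest first.

    Index 0 is the primary choice; the remainder are ordered fallbacks.
    """
    eligible = [
        m for m in models
        if m.p95_latency_ms <= max_latency_ms and m.quality >= min_quality
    ]
    return sorted(eligible, key=lambda m: m.cost_per_1k_tokens)


# Hypothetical catalog, for illustration only.
CATALOG = [
    ModelProfile("local-small", 0.00, 900.0, 0.55),
    ModelProfile("hosted-fast", 0.05, 250.0, 0.70),
    ModelProfile("hosted-large", 0.60, 550.0, 0.90),
]

route = plan_route(CATALOG, max_latency_ms=600, min_quality=0.6)
print([m.name for m in route])  # ['hosted-fast', 'hosted-large']
```

The free local model loses here because it misses both targets; cost only breaks ties among models that already clear the bar.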
SSE streaming
Token-by-token over Server-Sent Events, with first-token timing recorded per request.
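As a rough illustration of streaming with first-token timing, the generator below frames tokens as Server-Sent Events and records when the first one arrives. The JSON payload shape (`delta`), the `[DONE]` sentinel, and the `first_token_ms` field name are assumptions for this sketch, not Relay's documented wire format.

```python
import json
import time
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], timings: dict) -> Iterator[str]:
    """Yield one SSE `data:` event per token and record time to first token."""
    start = time.monotonic()
    for token in tokens:
        if "first_token_ms" not in timings:
            # Elapsed time until the upstream produced its first token.
            timings["first_token_ms"] = (time.monotonic() - start) * 1000.0
        # JSON-encode the payload so newlines inside a token cannot break framing.
        yield f"data: {json.dumps({'delta': token})}\n\n"
    # Assumed end-of-stream sentinel.
    yield "data: [DONE]\n\n"


timings = {}
for event in sse_events(["ship", " it"], timings):
    print(event, end="")
```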
Semantic cache
Local embeddings plus Redis vector search return a prior answer when prompts mean the same thing.
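The lookup can be pictured as a nearest-neighbour search with a similarity cutoff. This sketch uses an in-memory list and plain cosine similarity as a stand-in for Redis vector search; the `SemanticCache` class, the 0.92 threshold, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92):  # hypothetical cutoff
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, embedding):
        """Return the closest stored answer if it is similar enough, else None."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.05]))  # near-identical meaning: hit
print(cache.get([0.0, 1.0]))    # unrelated prompt: None
```

A linear scan is fine for a sketch; a vector index does the same comparison without visiting every entry.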
Per-key limits
Atomic requests-per-minute and tokens-per-minute limits, reserved up front and reconciled after the response.
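Reserve-then-reconcile can be sketched with a fixed one-minute window. This version keeps counters in memory and uses a lock to make each check-and-increment atomic within one process; a shared store would be needed across processes. The `KeyLimiter` class and its method names are hypothetical.

```python
import threading
import time


class KeyLimiter:
    """Fixed-window rpm/tpm limiter for a single API key (in-memory sketch)."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm, self.tpm = rpm, tpm
        self._clock = clock
        self._lock = threading.Lock()
        self._window = None
        self._requests = 0
        self._tokens = 0

    def _roll(self):
        # Reset the counters when a new one-minute window starts.
        window = int(self._clock() // 60)
        if window != self._window:
            self._window, self._requests, self._tokens = window, 0, 0

    def reserve(self, estimated_tokens):
        """Check rpm, then reserve tpm; counters only change on success."""
        with self._lock:
            self._roll()
            if self._requests + 1 > self.rpm:
                return False
            if self._tokens + estimated_tokens > self.tpm:
                return False
            self._requests += 1
            self._tokens += estimated_tokens
            return True

    def reconcile(self, estimated_tokens, actual_tokens):
        """Replace the up-front estimate with real usage once the call ends.

        Simplification: if the window rolled over in between, the adjustment
        lands in the new window.
        """
        with self._lock:
            self._roll()
            self._tokens = max(0, self._tokens + actual_tokens - estimated_tokens)


limiter = KeyLimiter(rpm=60, tpm=10_000)
if limiter.reserve(estimated_tokens=800):
    limiter.reconcile(estimated_tokens=800, actual_tokens=350)
```

Reserving the estimate before the call means concurrent requests cannot all slip under the limit at once; reconciling afterwards returns the unused headroom.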
Cost tracking
Every call priced from token usage and logged to Postgres, with cache savings attributed.
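Pricing from token usage is simple arithmetic. The sketch below shows one way to attribute savings to a cache hit: price the call as if it had gone to the provider, then book that amount as saved instead of spent. The price table, the model name, and the row fields are hypothetical; a real row would be written to Postgres rather than returned as a dict.

```python
from decimal import Decimal

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"example-model": (Decimal("0.50"), Decimal("1.50"))}


def price_call(model, prompt_tokens, completion_tokens):
    """Cost in USD of one call, computed from its token counts."""
    input_price, output_price = PRICES[model]
    total = prompt_tokens * input_price + completion_tokens * output_price
    return total / Decimal(1_000_000)


def log_row(model, prompt_tokens, completion_tokens, cache_hit):
    """Build a request-log row; a cache hit costs nothing and records the saving."""
    would_cost = price_call(model, prompt_tokens, completion_tokens)
    return {
        "model": model,
        "cache_hit": cache_hit,
        "cost_usd": Decimal(0) if cache_hit else would_cost,
        "saved_usd": would_cost if cache_hit else Decimal(0),
    }


print(log_row("example-model", 1000, 2000, cache_hit=False)["cost_usd"])  # 0.0035
print(log_row("example-model", 1000, 2000, cache_hit=True)["saved_usd"])  # 0.0035
```

`Decimal` avoids floating-point drift when many sub-cent amounts are summed.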
Observability, not an afterthought
Structured logs, a Prometheus endpoint, and a dashboard that reads straight off the request log. The metrics are there the moment the gateway boots.
The path of one request
Six steps from key to logged answer.
- 01 Authenticate: Hash the bearer key, check it is live.
- 02 Rate limit: Atomic rpm, then reserve tpm.
- 03 Route: Cheapest model meeting the target.
- 04 Cache: Semantic lookup before any provider.
- 05 Stream: Tee tokens to client and buffer.
- 06 Account: Price it, log it, store the answer.
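The first step of the path above can be sketched as follows. Storing only a hash means a leaked key table does not expose usable keys. SHA-256, the dict-backed `key_store`, and the `active` flag are assumptions for illustration; the source does not say which hash or store Relay uses.

```python
import hashlib


def hash_key(raw_key: str) -> str:
    """Hex digest stored in place of the raw key (SHA-256 is an assumption)."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def authenticate(authorization_header: str, key_store: dict):
    """Return the key record for a live bearer key, or None.

    `key_store` is a hypothetical mapping of key hash to record.
    """
    scheme, _, raw_key = authorization_header.partition(" ")
    raw_key = raw_key.strip()
    if scheme.lower() != "bearer" or not raw_key:
        return None
    record = key_store.get(hash_key(raw_key))
    if record is None or not record.get("active"):
        return None
    return record


store = {hash_key("glw_example"): {"name": "demo", "active": True}}
print(authenticate("Bearer glw_example", store))  # {'name': 'demo', 'active': True}
print(authenticate("Bearer wrong-key", store))    # None
```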
Every request, costed and charted.
The dashboard is part of the product, not a hosted add-on. Run the stack and it reads straight off your request logs: spend, savings, provider mix, per-model latency, and the keys driving the bill.
Spend and savings (per hour)

Provider mix (by requests)
- groq: 4.9K / $0.8978
- gemini: 2.9K / $1.37
- ollama: 1.1K / $0
- anthropic: 471 / $0.3697

Latency by model (p50 / p95, ms)

Top keys by spend (4 active)

| Key | Name | Requests | Spend |
|---|---|---|---|
| glw_7Qk2 | prod-chat | 4.3K | $1.21 |
| glw_a1Bd | batch-enrich | 2.6K | $0.8186 |
| glw_Lp9c | internal-tools | 1.6K | $0.4225 |
| glw_Zx04 | staging | 848 | $0.1848 |
The cache pays for itself.
Repeated and near-repeat prompts never reach a provider. Relay records what each cached hit would have cost, so the savings are a line item, not a guess.
If you can call OpenAI, you already know how to use it.
Point the SDK at your Relay host and pass model="auto". Run the whole stack locally with one command.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://relay.your-host.dev/v1",
    api_key="glw_*************",
)

# model="auto" lets Relay pick the route
res = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "ship it"}],
    extra_body={"routing": {"max_latency_ms": 600}},
)
```