OpenAI-compatible endpoints. Point any OpenAI client or SDK at your FreeAI instance and it works.
Auth: /v1/* routes take Authorization: Bearer <fai_...> (keys are created in the Users tab or via POST /api/clients);
admin routes take X-Admin-Token or a JWT cookie. For the full reference, including
edge cases, see docs/API.md.
CLIENT ROUTES
Authorization: Bearer fai_…
04 / 12
POST
/v1/chat/completions
CLIENT
REQUEST BODY
{
"messages": [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
],
"model": "freeai-fast",
"strategy": "auto",
"preferred_provider": null,
"temperature": 0.7,
"max_tokens": 512,
"stream": false,
"fallback": true
}
OpenAI-compatible chat with multi-provider fallback.
strategy controls routing (auto, fastest, best_quality, coding, custom…).
preferred_provider boosts a specific provider; fallback: true tries the next candidate on failure.
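As a minimal sketch of the request body above, the following builds the call with Python's standard library. The base URL http://localhost:8000 and the key fai_example are placeholders, not values from this reference, and the helper name build_chat_request is hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your FreeAI instance
API_KEY = "fai_example"             # placeholder client key


def build_chat_request(messages, model="freeai-fast", **options):
    """Build (but do not send) a POST /v1/chat/completions request."""
    payload = {"messages": messages, "model": model, **options}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    [{"role": "user", "content": "Hello!"}],
    strategy="auto",
    fallback=True,
    stream=False,
)
# To send it: open req with urllib.request.urlopen and json.load the response.
```

Keeping the FreeAI-specific fields (strategy, preferred_provider, fallback) as plain keyword options means the same helper covers both vanilla OpenAI-style calls and routed ones.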
Virtual models: put one of these in "model" to pick a strategy directly
without setting strategy:
freeai-auto freeai-cheap
freeai-fast freeai-vision
freeai-quality freeai-long
freeai-code freeai-reasoning
The response echoes the virtual name in model and the upstream's real model in real_model.
Vision: multimodal blocks route to vision providers automatically:
{"role": "user", "content": [
{"type": "text", "text": "Describe this"},
{"type": "image_url", "image_url":
{"url": "data:image/png;base64,..."}}
]}
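A multimodal message like the one above could be assembled as follows. This is a sketch: the helper name image_message is hypothetical, and the four bytes passed in stand in for a real PNG file.

```python
import base64


def image_message(text, image_bytes, mime="image/png"):
    """Return a user message with one text block and one inline image block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# In practice image_bytes would come from open(path, "rb").read().
msg = image_message("Describe this", b"\x89PNG")
```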
Tool calling: full OpenAI histories round-trip. Assistant turns may use
content: null with tool_calls, and role: "tool" messages
carry their tool_call_id. The top-level fields tools, tool_choice,
response_format, seed, top_p, stop,
presence_penalty, frequency_penalty, logit_bias,
user, and n are accepted and forwarded to OpenAI-compatible providers.
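To illustrate the history shape described above, here is a sketch of a three-turn tool exchange plus a small client-side check that every tool result points at an earlier call. The tool name get_weather, the id call_1, and the checker itself are hypothetical; FreeAI is not stated to perform this validation.

```python
import json

history = [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": json.dumps({"city": "Paris"})},
    }]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": json.dumps({"temp_c": 18})},
]


def unmatched_tool_results(messages):
    """Return tool_call_ids of role "tool" messages with no earlier matching call."""
    seen, missing = set(), []
    for m in messages:
        if m["role"] == "assistant":
            seen.update(c["id"] for c in m.get("tool_calls") or [])
        elif m["role"] == "tool" and m["tool_call_id"] not in seen:
            missing.append(m["tool_call_id"])
    return missing
```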
Fallback chain: the response includes "fallback_chain": ["mistral", "groq"]
so you can tell whether a request needed failover. The fallback_position of each attempt
is logged to usage_events.
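A client might summarize routing from a response like this. The sketch assumes that fallback_chain lists providers in the order they were tried, so more than one entry means failover happened; that reading, the helper name routing_summary, and the sample response are assumptions rather than documented behavior.

```python
def routing_summary(response):
    """Extract the routing-related fields from a chat completion response."""
    chain = response.get("fallback_chain") or []
    return {
        "requested_model": response.get("model"),
        "real_model": response.get("real_model"),
        "providers_tried": chain,
        # Assumption: one entry per provider attempted, in order.
        "needed_failover": len(chain) > 1,
    }


# Hypothetical response fragment, trimmed to the fields used here.
sample_response = {
    "model": "freeai-fast",
    "real_model": "llama-3.3-70b-versatile",
    "fallback_chain": ["mistral", "groq"],
}
summary = routing_summary(sample_response)
```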
Streaming: set stream: true. The endpoint returns text/event-stream
with chunks that mirror OpenAI's format and add provider to every frame. The orchestrator
enforces a per-chunk idle timeout (default 45 s, from app_config.stream_idle_timeout_s);
if the upstream stalls, it falls back only when no bytes have been flushed yet.
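Consuming the stream could look like the sketch below, which parses server-sent event lines into JSON chunks. It assumes the stream ends with OpenAI's data: [DONE] sentinel, which this page does not state explicitly; the sample frames are hypothetical and trimmed to the fields used.

```python
import json


def iter_stream_chunks(lines):
    """Yield decoded JSON chunks from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumption: OpenAI-style end sentinel
            break
        yield json.loads(data)


sample = [
    'data: {"provider": "groq", "choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"provider": "groq", "choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_stream_chunks(sample)
)
```

Because provider is present on every frame, a client can read it from the first chunk without waiting for the stream to finish.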
POST
/v1/embeddings
CLIENT
REQUEST / RESPONSE
// Request:
{
"input": ["first doc", "second doc"],
"model": "mistral-embed",
"preferred_provider": "mistral",
"fallback": true
}
// Response:
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [...]},
{"object": "embedding", "index": 1, "embedding": [...]}
],
"model": "mistral-embed",
"provider": "mistral",
"usage": {"prompt_tokens": 14, "total_tokens": 14},
"fallback_position": 1
}
OpenAI-compatible embeddings with Mistral → Gemini fallback.
input accepts a string or a list; vectors are aligned with the list order.
⚠ Only native model names are valid: mistral-embed (1024-dim) or
text-embedding-004 (Gemini, 768-dim).
OpenAI names like text-embedding-3-small are passed verbatim and the
upstream returns 400. Omit model to use each provider's default safely.
Vectors from different models are not comparable; tag every stored vector
with its provider+model. For production RAG, pin
preferred_provider and set fallback: false so a silent provider switch
doesn't corrupt your index.
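The pinning and tagging advice above can be sketched as two small helpers. The function names and the stored row layout are hypothetical; the request and response fields are the ones shown for this route.

```python
def pinned_embedding_request(texts, provider, model):
    """Request body that pins one provider and disables fallback."""
    return {
        "input": list(texts),
        "model": model,
        "preferred_provider": provider,
        "fallback": False,
    }


def tag_vectors(texts, response):
    """Pair each input with its vector and the provider+model that produced it."""
    rows = sorted(response["data"], key=lambda d: d["index"])
    return [
        {
            "text": text,
            "embedding": row["embedding"],
            "provider": response["provider"],
            "model": response["model"],
        }
        for text, row in zip(texts, rows)
    ]
```

Storing provider and model next to each vector makes it possible to filter or re-embed later if the index ever mixes sources.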
POST
/v1/audio/transcriptions
CLIENT
MULTIPART / RESPONSE
// multipart/form-data
file: <audio file>
model: "whisper-1" // optional, ignored
language: "en" // optional (ISO 639-1)
// Response:
{
"text": "Hello world...",
"provider": "groq",
"model": "whisper-large-v3-turbo",
"latency_ms": 1230,
"fallback_position": 1
}
Audio transcription with fallback: Groq Whisper → Gemini.
Accepts mp3, wav, ogg, flac, webm, m4a, aac. Max 20 MB for Gemini inline.
Same auth as chat completions.
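A client-side pre-flight check for the constraints above might look like this sketch. It reads "20 MB" as 20 MiB, which is an assumption, and the helper name check_audio is hypothetical; the server remains the authority on what it accepts.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".mp3", ".wav", ".ogg", ".flac", ".webm", ".m4a", ".aac"}
GEMINI_INLINE_LIMIT = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB


def check_audio(filename, size_bytes):
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported format")
    if size_bytes > GEMINI_INLINE_LIMIT:
        problems.append("too large for the Gemini fallback")
    return problems
```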
RESPONSE
{
"object": "list",
"data": [
{"id": "freeai-auto", "object": "model", "owned_by": "freeai"},
{"id": "freeai-fast", "object": "model", "owned_by": "freeai"},
{"id": "freeai-quality", "object": "model", "owned_by": "freeai"},
{"id": "freeai-code", "object": "model", "owned_by": "freeai"},
...
]
}
OpenAI-compatible model list. Public — no auth required.
Returns the 8 virtual models (strategies) exposed for chat completions.
Intended for SDK model-picker UIs.
USER ROUTES
JWT cookie · multi-user scope
03 / 12
GET
/api/me/providers
USER
RESPONSE
[
{
"provider_name": "groq",
"has_key": true,
"key_preview": "gsk_***abc",
"enabled": true,
"rpm_limit": 30,
"rpd_limit": 14400,
"tpd_limit": 500000,
"weight": 1.0,
"tags": ["fast", "cheap"],
"default_model": "llama-3.3-70b-versatile",
"max_retries": null
}, ...
]
Per-user provider credentials and overrides. Every field except tags and provider_name
can override the catalog default.
max_retries overrides app_config.provider_max_retries for this user and provider;
null means the global value applies.
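The null rule can be sketched as a single resolution step. This illustrates the documented behavior and is not FreeAI's code; the function name and the example global of 3 are hypothetical, and treating 0 as a valid override distinct from null is an assumption.

```python
def effective_max_retries(user_override, global_default):
    """Use the per-user value unless it is None (JSON null)."""
    # Compare against None explicitly so an override of 0 is not discarded.
    return global_default if user_override is None else user_override
```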
PATCH
/api/me/providers/{name}
USER
REQUEST BODY
{
"api_key": "sk-...", // empty string to remove
"enabled": true,
"rpm_limit": 30,
"rpd_limit": 14400,
"tpd_limit": 500000,
"weight": 1.2,
"default_model": "llama-3.1-8b-instant",
"max_retries": 2 // override global retry budget
}
Upsert your credentials for a provider. All fields optional —
send only what you want to change. The raw key is encrypted at rest (Fernet).
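Partial updates could be serialized as in the sketch below. The field set is copied from the request body above; rejecting other fields on the client side is a choice made for this example, not documented server behavior, and the helper name is hypothetical.

```python
import json

# Fields listed in the PATCH request body above.
PATCHABLE = {
    "api_key", "enabled", "rpm_limit", "rpd_limit",
    "tpd_limit", "weight", "default_model", "max_retries",
}


def provider_patch_body(**changes):
    """Serialize a partial update, rejecting fields the route does not list."""
    unknown = set(changes) - PATCHABLE
    if unknown:
        raise ValueError(f"not patchable: {sorted(unknown)}")
    return json.dumps(changes)


raise_weight = provider_patch_body(weight=1.2)
remove_key = provider_patch_body(api_key="")  # empty string removes the key
```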
POST
/api/clients
USER
REQUEST / RESPONSE
// Request:
{ "name": "my-app", "rpm_limit": 60 }
// Response:
{
"name": "my-app",
"api_key": "fai_EzjQ2OcPp_...",
"key_hash": "a1b2c3...",
"rpm_limit": 60
}
Issue a client API key for /v1/*. The raw key is shown only once —
save it immediately. Clients use Authorization: Bearer fai_....
Keys are scoped to the issuing user; each user sees only their own.
ADMIN ROUTES
X-Admin-Token or JWT (admin role)
03 / 12
GET
/api/providers
ADMIN
RESPONSE
[{
"name": "groq",
"enabled": true,
"has_key": true,
"healthy": true,
"requests_today": 42,
"requests_this_minute": 3,
"rpm_limit": 30,
"rpd_limit": 14400,
"tpd_limit": 500000,
"tokens_today": 84200,
"weight": 1.0,
"last_latency_ms": 420,
"latency_ema_ms": 385.2,
"tags": ["fast", "cheap", "audio"],
"default_model": "llama-3.3-70b-versatile"
}]
Admin only. Live health and rate-limit status for every provider (for the current user).
latency_ema_ms is an exponential moving average of latency, so it is smoother than the
single most recent sample in last_latency_ms; the ranker uses it for scoring.
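For readers unfamiliar with the term, one exponential-moving-average step can be sketched as follows. The smoothing factor alpha=0.2 and the seeding rule are assumptions chosen for illustration; this page does not state the values FreeAI uses.

```python
def update_ema(previous, sample, alpha=0.2):
    """One EMA step: alpha * sample + (1 - alpha) * previous."""
    if previous is None:
        return float(sample)  # assumption: seed with the first observation
    return alpha * sample + (1 - alpha) * previous


ema = None
for latency_ms in (400, 500, 300):  # hypothetical latency samples
    ema = update_ema(ema, latency_ms)
```

A single slow call moves the average only by a fraction of the difference, which is why a ranker would prefer it over the last raw sample.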
GET
/api/strategies
ADMIN
RESPONSE
[
{
"name": "fastest",
"description": "Lowest observed latency",
"is_builtin": true,
"definition": {
"prefer": [
{"when": "latency_ema_ms < 500", "weight": 30}
]
}
}, ...
]
Admin only. Lists every routing strategy. Built-ins can be edited but not
deleted. Create custom strategies with POST /api/strategies.
DSL reference: docs/STRATEGY_DSL.md.
GET
/api/analytics
ADMIN
QUERY / RESPONSE
GET /api/analytics?window_seconds=86400&bucket_count=24
{
"total_calls": 142,
"success_rate": 0.9577,
"by_provider": [...],
"by_strategy": [...],
"by_outcome": [...],
"by_client": [...],
"time_buckets": [...]
}
Admin only. Aggregated telemetry with breakdowns by provider, strategy, outcome,
and client. window_seconds: 60–604800 (7 d).
bucket_count: 1–168. Windows > 30 d are served from
usage_daily_rollup so they stay fast.
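The query string can be built with the documented ranges checked up front, as in this sketch. The helper name analytics_path is hypothetical, and validating on the client is a convenience added here; the bounds themselves are the ones listed above.

```python
from urllib.parse import urlencode


def analytics_path(window_seconds=86400, bucket_count=24):
    """Return the path and query for GET /api/analytics."""
    if not 60 <= window_seconds <= 604800:
        raise ValueError("window_seconds must be between 60 and 604800")
    if not 1 <= bucket_count <= 168:
        raise ValueError("bucket_count must be between 1 and 168")
    query = urlencode(
        {"window_seconds": window_seconds, "bucket_count": bucket_count}
    )
    return f"/api/analytics?{query}"
```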
PUBLIC / HEALTH
No auth · probes & metrics
02 / 12
RESPONSE
{ "status": "ok" }
Public liveness probe; no auth required. The response is intentionally
minimal so an unauthenticated scanner can't fingerprint the
deployment. Use /api/setup/status and
/api/auth/status for the frontend bootstrap flow, and
/api/providers and /api/analytics (both admin) for
fleet state.
PROMETHEUS EXPOSITION
# HELP freeai_provider_calls_total ...
# TYPE freeai_provider_calls_total counter
freeai_provider_calls_total{provider="groq",outcome="success"} 1847
freeai_provider_circuit_breaker_trips_total{provider="mistral"} 2
freeai_orchestrator_fallbacks_total{from_provider="mistral",to_provider="groq"} 14
...
Public. Prometheus exposition format.
See docs/OPERATIONS.md § 3.2
for the full metric inventory and the Grafana dashboard bundled under the
observability docker-compose profile.
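For ad hoc checks without a Prometheus server, sample lines like the ones above can be parsed with a few lines of Python. This is a simplified sketch: it ignores timestamps, escaped quotes inside label values, and metric names containing colons, and a real deployment would normally rely on Prometheus itself or an official client library.

```python
import re

SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels_dict, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append(
                (name, dict(LABEL_RE.findall(labels or "")), float(value))
            )
    return samples


exposition = """\
# TYPE freeai_provider_calls_total counter
freeai_provider_calls_total{provider="groq",outcome="success"} 1847
freeai_provider_circuit_breaker_trips_total{provider="mistral"} 2
"""
samples = parse_samples(exposition)
```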