
System & Operations

Health checks, error codes, rate limits, and operational information.

Health Check

Check the health of all Dolly services.

GET /api/health

Response (All Healthy)

{
  "status": "healthy",
  "timestamp": "2026-03-12T14:30:00Z",
  "services": {
    "api": { "status": "healthy", "latencyMs": 2 },
    "postgres": { "status": "healthy", "latencyMs": 5 },
    "keydb": { "status": "healthy", "latencyMs": 1 },
    "minio": { "status": "healthy", "latencyMs": 8 },
    "queue": { "status": "healthy", "pendingJobs": 5 },
    "embedder": { "status": "loading", "progress": "45%", "eta": "2s" }
  }
}

Status Values

  • healthy — Service is operational
  • loading — Service is warming up (first request, loading models)
  • degraded — Service has issues but is partially operational
  • unhealthy — Service is down

When to Worry

  • api: Check application logs
  • postgres: Database connectivity issue
  • keydb: Cache connectivity issue
  • queue: BullMQ jobs are stuck — check worker logs
  • embedder: ONNX models loading (normal on first request)
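
A monitoring script might reduce the health response to the list of services that need attention. The following is a minimal sketch, not part of the Dolly API: the function name servicesNeedingAttention is hypothetical, and treating "loading" as transient is an assumption based on the embedder note above.

```javascript
// Statuses taken from the "Status Values" list above.
const NEEDS_ATTENTION = new Set(["degraded", "unhealthy"]);

// Returns the names of services that need attention, given a parsed
// /api/health response body. "loading" is treated as transient here,
// which is an assumption based on the embedder note above.
function servicesNeedingAttention(health) {
  return Object.entries(health.services ?? {})
    .filter(([, info]) => NEEDS_ATTENTION.has(info.status))
    .map(([name]) => name);
}

// Illustrative payload, not real output from the service.
const sample = {
  status: "degraded",
  services: {
    api: { status: "healthy", latencyMs: 2 },
    postgres: { status: "unhealthy" },
    queue: { status: "degraded", pendingJobs: 900 },
    embedder: { status: "loading", progress: "45%" },
  },
};

console.log(servicesNeedingAttention(sample)); // [ 'postgres', 'queue' ]
```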

Rate Limiting

Limits

  • Default: 100 requests / minute per tenant
  • Webhook: Fair-queued (see /api/tenants/me/analytics for queue depth)

Headers

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1678951234
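
A client can read these headers to slow down before it hits the limit. The sketch below assumes that X-RateLimit-Reset is a Unix timestamp in seconds, which the example value suggests but this page does not state. The helper name readRateLimit is hypothetical.

```javascript
// Reads the three rate-limit headers from a fetch Headers object or a
// plain object and reports how long until the window resets.
// Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
function readRateLimit(headers, nowMs = Date.now()) {
  const get = (name) =>
    typeof headers.get === "function" ? headers.get(name) : headers[name];
  const limit = Number(get("X-RateLimit-Limit"));
  const remaining = Number(get("X-RateLimit-Remaining"));
  const resetMs = Number(get("X-RateLimit-Reset")) * 1000;
  return {
    limit,
    remaining,
    // Never negative, even if the reset time is already in the past.
    msUntilReset: Math.max(0, resetMs - nowMs),
    exhausted: remaining <= 0,
  };
}
```

With a plain object the key lookup is case-sensitive, so pass the headers with the casing shown above or use a fetch Headers object, whose get() ignores case.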

When Rate Limited

HTTP 429
{
  "error": {
    "code": "RATE_LIMIT",
    "message": "Rate limit exceeded",
    "retryAfter": 5
  }
}

Action: Wait the number of seconds in retryAfter, then retry.

Backoff Strategy

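async function retryWithBackoff(fn, maxRetries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only rate-limit errors are retried; everything else propagates.
      if (err.status !== 429) throw err;
      lastError = err;
      if (attempt === maxRetries) break; // do not wait after the final attempt
      // Prefer retryAfter (seconds) if the caller attached it to the error;
      // otherwise back off 1s, 2s, 4s, ... capped at 30s.
      const backoffMs = Math.min(1000 * Math.pow(2, attempt - 1), 30000);
      const delay = err.retryAfter ? err.retryAfter * 1000 : backoffMs;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  // Every attempt was rate limited: surface the last error rather than
  // silently returning undefined.
  throw lastError;
}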
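
The backoff helper above checks err.status, so the function passed to it has to throw an error that carries the HTTP status. One way to do that is sketched here. The names toApiError and getJson are hypothetical, and sending the JWT as a Bearer token is an assumption, since this page does not describe the auth header format.

```javascript
// Builds an Error carrying the fields that retry logic can inspect.
// `status` is the HTTP status; `body` is the parsed JSON error payload.
function toApiError(status, body) {
  const err = new Error(body?.error?.message ?? `HTTP ${status}`);
  err.status = status;
  err.code = body?.error?.code;
  err.retryAfter = body?.error?.retryAfter; // seconds, per the 429 example
  return err;
}

// Hypothetical caller: `url` and `token` are placeholders.
// Assumption: the JWT is sent as a Bearer token.
async function getJson(url, token) {
  const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
  const body = await res.json();
  if (!res.ok) throw toApiError(res.status, body);
  return body;
}
```

A call would then look like retryWithBackoff(() => getJson(url, token)).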

Error Code Reference

Client Errors (4xx)

Code | HTTP | Meaning | Action
INVALID_REQUEST | 400 | Malformed request (missing fields, wrong types) | Fix request schema
UNAUTHORIZED | 401 | Missing or invalid JWT token | Login again, refresh token
FORBIDDEN | 403 | Insufficient permissions (accessing another tenant's data) | Check tenant context
NOT_FOUND | 404 | Resource does not exist | Verify resource ID
CONFLICT | 409 | Resource already exists (e.g., duplicate KB entry) | Check for duplicates
RATE_LIMIT | 429 | Too many requests | Implement backoff
UNPROCESSABLE_ENTITY | 422 | Request validation failed (e.g., invalid email format) | Check field values

Server Errors (5xx)

Code | HTTP | Meaning | Action
INTERNAL_ERROR | 500 | Unexpected server error | Retry with exponential backoff
SERVICE_UNAVAILABLE | 503 | Service is temporarily down | Retry later

Special Codes

Code | HTTP | Context
INVALID_SIGNATURE | 401 | Webhook signature validation failed
TENANT_NOT_FOUND | 404 | Tenant was deleted or doesn't exist
GATE_FAILURE | 400 | AI response failed gate validation (debug only)
LLM_ERROR | 500 | Claude API error (usually temporary)
TOOL_EXECUTION_ERROR | 400 | Tool returned error or timeout
PANCAKE_API_ERROR | 500 | Pancake API unreachable
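
Client code often needs to decide from the error code whether a retry is worthwhile. The sketch below derives that from the tables above. The grouping is an interpretation, not an official list: treating PANCAKE_API_ERROR as retryable is an assumption, and the helper name isRetryable is hypothetical.

```javascript
// Codes whose documented action or context suggests a retry can succeed.
// Treating PANCAKE_API_ERROR as retryable is an assumption.
const RETRYABLE_CODES = new Set([
  "RATE_LIMIT",
  "INTERNAL_ERROR",
  "SERVICE_UNAVAILABLE",
  "LLM_ERROR",
  "PANCAKE_API_ERROR",
]);

// `errorBody` is the parsed JSON error payload, e.g.
// { error: { code: "RATE_LIMIT", message: "..." } }.
function isRetryable(errorBody) {
  return RETRYABLE_CODES.has(errorBody?.error?.code);
}
```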

Common Issues & Troubleshooting

"INVALID_SIGNATURE" on Webhook

Cause: Pancake webhook signature doesn't match.

Fix:

  1. Verify PANCAKE_WEBHOOK_SECRET is correct
  2. Ensure you're validating the raw request body (not parsed JSON)
  3. Use timing-safe comparison: crypto.timingSafeEqual()
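
The three steps above can be sketched in Node.js as follows. This is an illustration under assumptions: the signature is taken to be a hex-encoded HMAC-SHA256 of the raw body, which this page does not confirm, so check Pancake's documentation for the actual algorithm and encoding. The function name isValidSignature is hypothetical.

```javascript
const nodeCrypto = require("node:crypto");

// Assumption: the signature is a hex-encoded HMAC-SHA256 of the raw body.
// `rawBody` must be the unparsed request body (string or Buffer).
function isValidSignature(rawBody, signatureHex, secret) {
  const expected = nodeCrypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return nodeCrypto.timingSafeEqual(received, expected);
}
```

Pass process.env.PANCAKE_WEBHOOK_SECRET as the secret. Re-serializing parsed JSON can change whitespace or key order, which is why the raw bytes are hashed.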

"LLM_ERROR" When Processing Messages

Cause: The Claude API returned an error (rate limit, overloaded, etc.).

Fix:

  • 429 (Rate Limit): Dolly automatically retries with multi-key rotation
  • 500 (Server Error): Temporary — Dolly retries with exponential backoff
  • Check logs: View in Grafana/Loki for details

"TOOL_EXECUTION_ERROR"

Cause: Your tool endpoint failed or timed out.

Fix:

  1. Check your endpoint logs
  2. Verify endpoint URL is correct
  3. Ensure endpoint responds within 10 seconds
  4. Check auth headers and parameters
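
To check locally that an endpoint stays within the 10-second budget above, a small timeout wrapper can help. This is a generic sketch and does not reflect how Dolly itself enforces the limit. The name withTimeout is hypothetical.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
// The underlying operation is not cancelled; only the wait is abandoned.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

For example, await withTimeout(fetch(toolUrl), 10000) rejects if the endpoint takes longer than 10 seconds, where toolUrl is a placeholder for your tool endpoint.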

High Gate Failure Rate

Cause: AI responses repeatedly failing gates.

Fix:

  • Hallucination gate: Review KB entries. Add missing product info and clarify prices
  • Tone gate: Check the prohibited phrases list, which may be too aggressive
  • Repetition gate: The customer is sending rapid-fire messages. Consider adjusting the chunk buffer TTL
  • Relevance gate: Customer questions are off-topic for your product. Add more KB entries

Maintenance & Downtime

Scheduled Maintenance

Dolly performs minimal maintenance:

  • Database backups: Daily at 3 AM UTC
  • Model reloads: Every 24 hours (no downtime)
  • Log rotation: Every 7 days

Service Status

Check real-time service status:

GET https://status.dolly.shin0x.space

Subscriptions to status updates are planned for Phase 2.


Logging & Debugging

View Logs

All requests are logged with structured JSON:

{
  "timestamp": "2026-03-12T14:30:00Z",
  "level": "info",
  "service": "api",
  "tenantId": "ulid",
  "customerId": "ulid",
  "messageId": "ulid",
  "message": "Message processed successfully",
  "latencyMs": 2300,
  "inputTokens": 500,
  "outputTokens": 45
}

Access logs in Grafana → Loki with filters:

{job="dolly-api"} | json | tenantId="your_tenant_id"
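
If you generate these queries from code, for example for a dashboard link, a small builder keeps the quoting consistent. This sketch assumes the extra filter uses the level field from the log example above. The name buildLogQuery is hypothetical.

```javascript
// Builds a LogQL query for one tenant, optionally narrowed by log level.
// The `level` filter relies on the "level" field shown in the log example.
function buildLogQuery(tenantId, level) {
  // Escape backslashes and double quotes so values cannot break the query.
  const escape = (value) =>
    String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  let query = `{job="dolly-api"} | json | tenantId="${escape(tenantId)}"`;
  if (level) query += ` | level="${escape(level)}"`;
  return query;
}
```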

Enable Debug Logging

Request on-demand debug logs for a specific conversation:

GET /api/tenants/me/conversations/:conversationId/debug?verbosity=full

Data Retention

Conversation Data

  • Messages: Kept for 90 days
  • Gate results: Kept indefinitely (audit trail)
  • Token usage: Kept for 7 days (cost tracking)

Analytics Data

  • ClickHouse events: Kept for 1 year
  • Aggregated metrics: Permanent

Customer Memory

  • L1-L4 memory: Kept indefinitely (updated on each interaction)
  • Episodes: Kept for 2 years
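
When planning exports or audits, it can help to compute when a record falls out of retention. The sketch below encodes the periods listed above. Converting "1 year" and "2 years" to 365 and 730 days is an approximation, and both the category keys and the function name expiresAt are hypothetical.

```javascript
// Retention periods in days, from the lists above.
// Assumption: 1 year is approximated as 365 days and 2 years as 730.
// Categories kept indefinitely or permanently are mapped to null.
const RETENTION_DAYS = {
  messages: 90,
  gateResults: null,
  tokenUsage: 7,
  clickhouseEvents: 365,
  aggregatedMetrics: null,
  memory: null,
  episodes: 730,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the Date after which a record of the given kind is no longer
// retained, or null if that kind is kept indefinitely.
function expiresAt(kind, createdAt) {
  if (!(kind in RETENTION_DAYS)) throw new Error(`Unknown data kind: ${kind}`);
  const days = RETENTION_DAYS[kind];
  return days === null ? null : new Date(createdAt.getTime() + days * MS_PER_DAY);
}
```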

Compliance & Data Privacy

Data Isolation

Each tenant's data is:

  • Row-level secured in Postgres (RLS policies)
  • Isolated in ClickHouse by tenant_id
  • Isolated in KeyDB by key prefix
  • Isolated in Minio by bucket

Verify isolation: Cross-tenant queries return 403 (forbidden).

GDPR Compliance (Phase 2)

  • Data export API (customers can request all of their data)
  • Data deletion API (right to be forgotten)
  • Data processing agreement (DPA)
  • Privacy policy

Contact & Support


Last updated: 2026-03-12. API version: 1.0-alpha