System & Operations
Health checks, error codes, rate limits, and operational information.
Health Check
Check the health of all Dolly services.
GET /api/health
Response (Healthy, Embedder Still Loading)
{
"status": "healthy",
"timestamp": "2026-03-12T14:30:00Z",
"services": {
"api": { "status": "healthy", "latencyMs": 2 },
"postgres": { "status": "healthy", "latencyMs": 5 },
"keydb": { "status": "healthy", "latencyMs": 1 },
"minio": { "status": "healthy", "latencyMs": 8 },
"queue": { "status": "healthy", "pendingJobs": 5 },
"embedder": { "status": "loading", "progress": "45%", "eta": "2s" }
}
}
Status Values
- healthy: Service is operational
- loading: Service is warming up (first request, loading models)
- degraded: Service has issues but is partially operational
- unhealthy: Service is down
When to Worry
- api: Check application logs
- postgres: Database connectivity issue
- keydb: Cache connectivity issue
- queue: BullMQ jobs are stuck — check worker logs
- embedder: ONNX models loading (normal on first request)
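A monitoring script could reduce the health response to the services that need attention. The following is a minimal sketch, assuming the response shape shown in the sample above; the function names `findUnhealthyServices` and `needsAttention` are hypothetical and not part of the Dolly API.

```javascript
// Sketch: list the services in a /api/health response that are not "healthy".
// Field names follow the sample response; function names are hypothetical.
function findUnhealthyServices(health) {
  const services = health.services ?? {};
  return Object.entries(services)
    .filter(([, info]) => info.status !== "healthy")
    .map(([name, info]) => ({ name, status: info.status }));
}

// "loading" is treated as transient here, so only degraded or unhealthy
// services trigger an alert. That policy is an assumption of this sketch.
function needsAttention(health) {
  return findUnhealthyServices(health).some(
    (s) => s.status === "degraded" || s.status === "unhealthy"
  );
}
```

Treating `loading` as non-alerting matches the note that the embedder warming up on first request is normal.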
Rate Limiting
Limits
- Default: 100 requests / minute per tenant
- Webhook: Fair-queued (see /api/tenants/me/analytics for queue depth)
Headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1678951234
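A client can read these headers to slow down before it hits the limit. The sketch below assumes that X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but this page does not state; `readRateLimit` is a hypothetical helper name.

```javascript
// Sketch: read the rate-limit headers from a fetch Headers instance (or any
// object with a get() method) or from a plain object keyed by header name.
// ASSUMPTION: X-RateLimit-Reset is Unix epoch seconds.
function readRateLimit(headers, nowMs = Date.now()) {
  const get = (name) =>
    typeof headers.get === "function" ? headers.get(name) : headers[name];
  const limit = Number(get("X-RateLimit-Limit"));
  const remaining = Number(get("X-RateLimit-Remaining"));
  const resetSeconds = Number(get("X-RateLimit-Reset"));
  // Milliseconds until the window resets; never negative.
  const waitMs = Math.max(0, resetSeconds * 1000 - nowMs);
  return { limit, remaining, waitMs, exhausted: remaining <= 0 };
}
```

When `exhausted` is true, waiting `waitMs` before the next request avoids a 429 under the stated assumption. Plain header objects are case-sensitive, so the keys need to match the spelling used by your HTTP client.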
When Rate Limited
HTTP 429
{
"error": {
"code": "RATE_LIMIT",
"message": "Rate limit exceeded",
"retryAfter": 5
}
}
Action: Wait the number of seconds in retryAfter, then retry.
Backoff Strategy
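// Retry fn on HTTP 429 with exponential backoff: 1s, 2s, 4s, ... capped at 30s.
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Rethrow anything that is not a rate limit, and rethrow the 429 itself
      // once all attempts are used, so the caller never gets a silent undefined.
      if (err.status !== 429 || attempt === maxRetries) throw err;
      const delay = Math.min(1000 * Math.pow(2, attempt - 1), 30000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}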
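Because the 429 body carries a retryAfter value, a retry loop can prefer that hint over a computed delay. This is a minimal sketch, assuming the thrown error exposes the parsed response body as `err.body`; that property and the name `retryDelayMs` are hypothetical and depend on your HTTP client.

```javascript
// Sketch: pick a retry delay for a 429 response.
// ASSUMPTION: the parsed response body is available as err.body.
function retryDelayMs(err, attempt, capMs = 30000) {
  const retryAfter = err?.body?.error?.retryAfter;
  if (Number.isFinite(retryAfter) && retryAfter >= 0) {
    // Server-provided hint in seconds, as in the 429 example.
    return Math.min(retryAfter * 1000, capMs);
  }
  // Fall back to exponential backoff: 1s, 2s, 4s, ... up to capMs.
  return Math.min(1000 * 2 ** (attempt - 1), capMs);
}
```

Inside retryWithBackoff, the computed `delay` could be replaced with `retryDelayMs(err, attempt)`.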
Error Code Reference
Client Errors (4xx)
| Code | HTTP | Meaning | Action |
|---|---|---|---|
| INVALID_REQUEST | 400 | Malformed request (missing fields, wrong types) | Fix request schema |
| UNAUTHORIZED | 401 | Missing or invalid JWT token | Login again, refresh token |
| FORBIDDEN | 403 | Insufficient permissions (accessing another tenant's data) | Check tenant context |
| NOT_FOUND | 404 | Resource does not exist | Verify resource ID |
| CONFLICT | 409 | Resource already exists (e.g., duplicate KB entry) | Check for duplicates |
| UNPROCESSABLE_ENTITY | 422 | Request validation failed (e.g., invalid email format) | Check field values |
| RATE_LIMIT | 429 | Too many requests | Implement backoff |
Server Errors (5xx)
| Code | HTTP | Meaning | Action |
|---|---|---|---|
| INTERNAL_ERROR | 500 | Unexpected server error | Retry with exponential backoff |
| SERVICE_UNAVAILABLE | 503 | Service is temporarily down | Retry later |
Special Codes
| Code | HTTP | Context |
|---|---|---|
| INVALID_SIGNATURE | 401 | Webhook signature validation failed |
| TENANT_NOT_FOUND | 404 | Tenant was deleted or doesn't exist |
| GATE_FAILURE | 400 | AI response failed gate validation (debug only) |
| LLM_ERROR | 500 | Claude API error (usually temporary) |
| TOOL_EXECUTION_ERROR | 400 | Tool returned error or timeout |
| PANCAKE_API_ERROR | 500 | Pancake API unreachable |
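Client code often needs to decide whether an error is worth retrying. The sketch below groups the codes from the tables above; the grouping is one reading of the Action and Context columns, not an official list, and `isRetryable` is a hypothetical helper.

```javascript
// Sketch: error codes that look transient according to the tables above.
// This grouping is an interpretation, not something the API guarantees.
const RETRYABLE_CODES = new Set([
  "RATE_LIMIT",
  "INTERNAL_ERROR",
  "SERVICE_UNAVAILABLE",
  "LLM_ERROR",
  "PANCAKE_API_ERROR",
]);

// Returns true when retrying the same request may succeed later.
function isRetryable(code) {
  return RETRYABLE_CODES.has(code);
}
```

Codes such as INVALID_REQUEST or FORBIDDEN are left out because the request itself has to change before a retry can succeed.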
Common Issues & Troubleshooting
"INVALID_SIGNATURE" on Webhook
Cause: Pancake webhook signature doesn't match.
Fix:
- Verify PANCAKE_WEBHOOK_SECRET is correct
- Ensure you're validating the raw request body (not parsed JSON)
- Use timing-safe comparison: crypto.timingSafeEqual()
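The checks above can be sketched with Node's built-in crypto module. This page does not specify the signature scheme, so the example assumes a hex-encoded HMAC-SHA256 of the raw body keyed with PANCAKE_WEBHOOK_SECRET; treat the algorithm, the encoding, and the name `verifySignature` as assumptions to confirm against the Pancake documentation.

```javascript
const crypto = require("node:crypto");

// Sketch: verify a webhook signature over the raw request body.
// ASSUMPTION: signature = hex(HMAC-SHA256(secret, rawBody)).
function verifySignature(rawBody, signatureHex, secret) {
  if (typeof signatureHex !== "string") return false;
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}
```

`rawBody` should be the exact bytes received (a Buffer or string), captured before any JSON parsing, since re-serialized JSON may not match byte for byte.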
"LLM_ERROR" When Processing Messages
Cause: The Claude API returned an error (rate limit, overloaded, etc.).
Fix:
- 429 (Rate Limit): Dolly automatically retries with multi-key rotation
- 500 (Server Error): Temporary — Dolly retries with exponential backoff
- Check logs: View in Grafana/Loki for details
"TOOL_EXECUTION_ERROR"
Cause: Your tool endpoint failed or timed out.
Fix:
- Check your endpoint logs
- Verify endpoint URL is correct
- Ensure endpoint responds within 10 seconds
- Check auth headers and parameters
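To check the 10-second requirement locally, you could wrap a call to your tool handler in a deadline. This is a minimal sketch; `withTimeout` is a hypothetical helper and does not reflect how Dolly enforces the limit internally.

```javascript
// Sketch: reject if the given promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Clear the timer once either side settles so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

For example, `await withTimeout(callMyTool(params), 10000)` fails fast in a test if the handler would exceed the limit, where `callMyTool` stands in for your own endpoint call.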
High Gate Failure Rate
Cause: AI responses repeatedly failing gates.
Fix:
- Hallucination gate: Review KB entries — add missing product info, clarify prices
- Tone gate: Check prohibited phrases list — may be too aggressive
- Repetition gate: Customer is sending rapid-fire messages; consider adjusting the chunk buffer TTL
- Relevance gate: Customer questions are off-topic for your product — add more KB entries
Maintenance & Downtime
Scheduled Maintenance
Dolly performs minimal maintenance:
- Database backups: Daily at 3 AM UTC
- Model reloads: Every 24 hours (no downtime)
- Log rotation: Every 7 days
Service Status
Check real-time service status:
GET https://status.dolly.shin0x.space
Or subscribe to status updates (Phase 2).
Logging & Debugging
View Logs
All requests are logged with structured JSON:
{
"timestamp": "2026-03-12T14:30:00Z",
"level": "info",
"service": "api",
"tenantId": "ulid",
"customerId": "ulid",
"messageId": "ulid",
"message": "Message processed successfully",
"latencyMs": 2300,
"inputTokens": 500,
"outputTokens": 45
}
Access logs in Grafana → Loki with filters:
{job="dolly-api"} | json | tenantId="your_tenant_id"
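Once log lines are exported, they can be summarized offline. The sketch below totals token usage and averages latency, using the field names from the sample log entry above; `summarizeLogs` is a hypothetical helper, and lines that are not JSON objects are skipped.

```javascript
// Sketch: total token usage and average latency from structured log lines.
function summarizeLogs(lines) {
  let count = 0;
  let inputTokens = 0;
  let outputTokens = 0;
  let latencyMs = 0;
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // not a JSON log line
    }
    if (entry === null || typeof entry !== "object") continue;
    count += 1;
    // Missing fields count as zero.
    inputTokens += entry.inputTokens ?? 0;
    outputTokens += entry.outputTokens ?? 0;
    latencyMs += entry.latencyMs ?? 0;
  }
  const avgLatencyMs = count ? latencyMs / count : 0;
  return { count, inputTokens, outputTokens, avgLatencyMs };
}
```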
Enable Debug Logging
Fetch on-demand debug logs for a specific conversation:
GET /api/tenants/me/conversations/:conversationId/debug?verbosity=full
Data Retention
Conversation Data
- Messages: Kept for 90 days
- Gate results: Kept indefinitely (audit trail)
- Token usage: Kept for 7 days (cost tracking)
Analytics Data
- ClickHouse events: Kept for 1 year
- Aggregated metrics: Permanent
Customer Memory
- L1-L4 memory: Kept indefinitely (updated on each interaction)
- Episodes: Kept for 2 years
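If you mirror Dolly data in your own systems, the retention windows above can be encoded as a lookup. This is a sketch under stated assumptions: the keys are hypothetical labels, "1 year" and "2 years" are approximated as 365 and 730 days, and `isExpired` only describes the documented policy; it does not query Dolly.

```javascript
// Sketch: retention windows in days (null means kept indefinitely).
// Keys are hypothetical labels for the categories listed above.
const RETENTION_DAYS = {
  messages: 90,
  gateResults: null,
  tokenUsage: 7,
  clickhouseEvents: 365, // "1 year", approximated
  aggregatedMetrics: null,
  memory: null,
  episodes: 730, // "2 years", approximated
};

// Returns true when a record created at createdAt is past its window.
function isExpired(kind, createdAt, now = new Date()) {
  if (!(kind in RETENTION_DAYS)) throw new Error(`Unknown data kind: ${kind}`);
  const days = RETENTION_DAYS[kind];
  if (days === null) return false;
  const ageMs = now.getTime() - new Date(createdAt).getTime();
  return ageMs > days * 24 * 60 * 60 * 1000;
}
```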
Compliance & Data Privacy
Data Isolation
Each tenant's data is:
- Row-level secured in Postgres (RLS policies)
- Isolated in ClickHouse by tenant_id
- Isolated in KeyDB by key prefix
- Isolated in Minio by bucket
Verify isolation: Cross-tenant queries return 403 (forbidden).
GDPR Compliance (Phase 2)
- Data export API (customers can request all of their data)
- Data deletion API (right to be forgotten)
- Data processing agreement (DPA)
- Privacy policy
Contact & Support
- Documentation: https://docs.dolly.shin0x.space
- Status: https://status.dolly.shin0x.space
- Email: support@dolly.shin0x.space
- Slack: [Community channel] (Phase 2)
Last updated: 2026-03-12. API version: 1.0-alpha