System & Operations

Health checks, error codes, rate limits, and operational information.

Health Check

Check the health of all Dolly services.

GET /api/health

Response (All Healthy)

{
  "status": "healthy",
  "timestamp": "2026-03-12T14:30:00Z",
  "services": {
    "api": { "status": "healthy", "latencyMs": 2 },
    "postgres": { "status": "healthy", "latencyMs": 5 },
    "keydb": { "status": "healthy", "latencyMs": 1 },
    "minio": { "status": "healthy", "latencyMs": 8 },
    "queue": { "status": "healthy", "pendingJobs": 5 },
    "embedder": { "status": "loading", "progress": "45%", "eta": "2s" }
  }
}

Status Values

healthy — Service is operational
loading — Service is warming up (first request, loading models)
degraded — Service has issues but is partially operational
unhealthy — Service is down

When to Worry

api: Check application logs
postgres: Database connectivity issue
keydb: Cache connectivity issue
queue: BullMQ jobs are stuck — check worker logs
embedder: ONNX models loading (normal on first request)
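
As a rough illustration of how a monitoring script might read the health response, the sketch below lists only the services that need attention. It treats `loading` as a normal warm-up state, following the embedder note above. The helper name `servicesNeedingAttention` and the sample payload are made up for this example and are not part of the Dolly API.

```javascript
// Hypothetical helper (not part of the Dolly API): list services from a
// GET /api/health response whose status is "degraded" or "unhealthy".
// "loading" is treated as a normal warm-up state and is not reported.
function servicesNeedingAttention(health) {
  const alarming = new Set(["degraded", "unhealthy"]);
  return Object.entries(health.services ?? {})
    .filter(([, info]) => alarming.has(info.status))
    .map(([name, info]) => ({ name, status: info.status }));
}

// Made-up sample payload in the shape shown above.
const sample = {
  status: "degraded",
  services: {
    api: { status: "healthy", latencyMs: 2 },
    postgres: { status: "unhealthy" },
    queue: { status: "degraded", pendingJobs: 900 },
    embedder: { status: "loading", progress: "45%" },
  },
};

console.log(servicesNeedingAttention(sample));
```

An empty array means nothing needs attention, which keeps the alerting condition to a single length check.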

Rate Limiting

Limits

Default: 100 requests / minute per tenant
Webhook: Fair-queued (see /api/tenants/me/analytics for queue depth)

Headers

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1678951234
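
A client can use these headers to slow down before it hits the limit. The sketch below reads them from any object with a `get(name)` method, such as the `headers` of a `fetch` response. It assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which this document does not state explicitly; `readRateLimit` is a hypothetical helper name.

```javascript
// Sketch: read the rate-limit headers from a response.
// ASSUMPTION: X-RateLimit-Reset is a Unix timestamp in seconds.
// `headers` is anything with a get(name) method, e.g. fetch's response.headers.
function readRateLimit(headers, nowMs = Date.now()) {
  const limit = Number(headers.get("X-RateLimit-Limit"));
  const remaining = Number(headers.get("X-RateLimit-Remaining"));
  const resetAtMs = Number(headers.get("X-RateLimit-Reset")) * 1000;
  return {
    limit,
    remaining,
    // Milliseconds until the window resets; never negative.
    msUntilReset: Math.max(0, resetAtMs - nowMs),
  };
}

// A Map stands in for response.headers here.
const exampleHeaders = new Map([
  ["X-RateLimit-Limit", "100"],
  ["X-RateLimit-Remaining", "87"],
  ["X-RateLimit-Reset", "1678951234"],
]);
const info = readRateLimit(exampleHeaders, 1678951229000);
console.log(info);
```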

When Rate Limited

HTTP 429
{
  "error": {
    "code": "RATE_LIMIT",
    "message": "Rate limit exceeded",
    "retryAfter": 5
  }
}

Action: Wait the number of seconds in retryAfter, then retry.
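
One way to follow this advice is sketched below. It assumes a caller-supplied `doRequest` function that resolves to `{ status, body }`, where `body` is the parsed JSON shown above. Both `doRequest` and `callWithRetryAfter` are hypothetical names, and the one-second fallback is an arbitrary choice for when `retryAfter` is missing.

```javascript
// Sketch: wait for the server-provided retryAfter before retrying a 429.
// ASSUMPTION: doRequest is a caller-supplied function that resolves to
// { status, body }, where body is the parsed JSON error shown above.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Convert error.retryAfter (seconds) to milliseconds, with a fallback
// when the field is missing or malformed.
function retryDelayMs(body, fallbackMs = 1000) {
  const seconds = body?.error?.retryAfter;
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

async function callWithRetryAfter(doRequest, { maxAttempts = 3, sleep = defaultSleep } = {}) {
  for (let attempt = 1; ; attempt++) {
    const result = await doRequest();
    // Return anything that is not a 429, or the final 429 once attempts run out.
    if (result.status !== 429 || attempt >= maxAttempts) return result;
    await sleep(retryDelayMs(result.body));
  }
}
```

The `sleep` option exists only so the wait can be replaced in tests; production callers can leave it at the default.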

Backoff Strategy

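// Retries fn on HTTP 429 with exponential backoff (1s, 2s, 4s, ... capped at 30s).
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Rethrow non-429 errors immediately, and rethrow the 429 itself once
      // retries are exhausted so the caller never gets a silent undefined.
      if (err.status !== 429 || attempt === maxRetries) throw err;
      const delay = Math.min(1000 * Math.pow(2, attempt - 1), 30000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}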

Error Code Reference

Client Errors (4xx)

Code	HTTP	Meaning	Action
`INVALID_REQUEST`	400	Malformed request (missing fields, wrong types)	Fix request schema
`UNAUTHORIZED`	401	Missing or invalid JWT token	Log in again or refresh the token
`FORBIDDEN`	403	Insufficient permissions (accessing another tenant's data)	Check tenant context
`NOT_FOUND`	404	Resource does not exist	Verify resource ID
`CONFLICT`	409	Resource already exists (e.g., duplicate KB entry)	Check for duplicates
`RATE_LIMIT`	429	Too many requests	Implement backoff
`UNPROCESSABLE_ENTITY`	422	Request validation failed (e.g., invalid email format)	Check field values

Server Errors (5xx)

Code	HTTP	Meaning	Action
`INTERNAL_ERROR`	500	Unexpected server error	Retry with exponential backoff
`SERVICE_UNAVAILABLE`	503	Service is temporarily down	Retry later

Special Codes

Code	HTTP	Context
`INVALID_SIGNATURE`	401	Webhook signature validation failed
`TENANT_NOT_FOUND`	404	Tenant was deleted or doesn't exist
`GATE_FAILURE`	400	AI response failed gate validation (debug only)
`LLM_ERROR`	500	Claude API error (usually temporary)
`TOOL_EXECUTION_ERROR`	400	Tool returned error or timeout
`PANCAKE_API_ERROR`	500	Pancake API unreachable
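
A client usually needs to know which of these codes are worth retrying. The sketch below is one reading of the Action and Context columns above, not an official list: it retries only the codes whose rows mention backoff, retrying, or a temporary condition. `isRetryable` is a hypothetical helper that takes the parsed error body.

```javascript
// Sketch: decide whether an error code from the tables above is worth retrying.
// The grouping is an interpretation of the "Action" and "Context" columns,
// not an official list.
const RETRYABLE_CODES = new Set([
  "RATE_LIMIT",          // 429: implement backoff
  "INTERNAL_ERROR",      // 500: retry with exponential backoff
  "SERVICE_UNAVAILABLE", // 503: retry later
  "LLM_ERROR",           // 500: described as usually temporary
]);

// errorBody is the parsed JSON body, e.g. { error: { code: "RATE_LIMIT", ... } }.
function isRetryable(errorBody) {
  return RETRYABLE_CODES.has(errorBody?.error?.code);
}

console.log(isRetryable({ error: { code: "SERVICE_UNAVAILABLE" } })); // true
console.log(isRetryable({ error: { code: "INVALID_REQUEST" } }));     // false
```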

Common Issues & Troubleshooting

"INVALID_SIGNATURE" on Webhook

Cause: Pancake webhook signature doesn't match.

Fix:

Verify PANCAKE_WEBHOOK_SECRET is correct
Ensure you're validating the raw request body (not parsed JSON)
Use timing-safe comparison: crypto.timingSafeEqual()
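
The three checks above can be sketched as follows. This document does not specify the signature scheme, so the code assumes a hex-encoded HMAC-SHA256 of the raw body keyed with PANCAKE_WEBHOOK_SECRET; confirm the real algorithm and encoding before relying on it. `isValidSignature` is a hypothetical helper name.

```javascript
const crypto = require("node:crypto");

// ASSUMPTION: the signature is a hex-encoded HMAC-SHA256 of the raw request
// body, keyed with PANCAKE_WEBHOOK_SECRET. Verify the actual scheme first.
function isValidSignature(rawBody, signatureHex, secret) {
  // rawBody must be the exact bytes received, not re-serialized JSON.
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(String(signatureHex), "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}
```

Re-serializing parsed JSON can change whitespace or key order, which is why the comparison has to run on the raw bytes.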

"LLM_ERROR" When Processing Messages

Cause: The Claude API returned an error (rate limit, overloaded, etc.).

Fix:

429 (Rate Limit): Dolly automatically retries with multi-key rotation
500 (Server Error): Temporary — Dolly retries with exponential backoff
Check logs: View in Grafana/Loki for details

"TOOL_EXECUTION_ERROR"

Cause: Your tool endpoint failed or timed out.

Fix:

Check your endpoint logs
Verify endpoint URL is correct
Ensure endpoint responds within 10 seconds
Check auth headers and parameters
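
To stay inside the 10-second limit, a tool endpoint can put its own deadline on slow work and return a fast, explicit failure. The sketch below shows one way to do that with `Promise.race`. `withDeadline` is a hypothetical helper, and it does not cancel the underlying work; it only stops waiting for it.

```javascript
// Sketch: stop waiting for slow work once a deadline passes, so the tool
// endpoint can answer before the 10-second limit.
// NOTE: the underlying work keeps running; this only stops waiting for it.
function withDeadline(work, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards so it
  // cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

Inside a handler this might be called as `await withDeadline(doSlowLookup(params), 8000)`, where `doSlowLookup` stands for your own code and the 8-second budget is an arbitrary margin below the limit.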

High Gate Failure Rate

Cause: AI responses repeatedly failing gates.

Fix:

Hallucination gate: Review KB entries — add missing product info, clarify prices
Tone gate: Check prohibited phrases list — may be too aggressive
Repetition gate: The customer is sending rapid-fire messages; consider adjusting the chunk buffer TTL
Relevance gate: Customer questions are off-topic for your product — add more KB entries

Maintenance & Downtime

Scheduled Maintenance

Dolly performs minimal maintenance:

Database backups: Daily at 3 AM UTC
Model reloads: Every 24 hours (no downtime)
Log rotation: Every 7 days

Service Status

Check real-time service status:

GET https://status.dolly.shin0x.space

Or subscribe to status updates (Phase 2).

Logging & Debugging

View Logs

All requests are logged with structured JSON:

{
  "timestamp": "2026-03-12T14:30:00Z",
  "level": "info",
  "service": "api",
  "tenantId": "ulid",
  "customerId": "ulid",
  "messageId": "ulid",
  "message": "Message processed successfully",
  "latencyMs": 2300,
  "inputTokens": 500,
  "outputTokens": 45
}

Access logs in Grafana → Loki with filters:

{job="dolly-api"} | json | tenantId="your_tenant_id"
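
If you generate these queries from a script, a small builder avoids quoting mistakes. The sketch below reproduces the query above and optionally narrows it by level. `buildLogQuery` is a hypothetical helper, and the `level` filter assumes that field is extracted by the `json` stage, as the log format above suggests.

```javascript
// Sketch: build the LogQL query shown above, optionally narrowed by level.
// buildLogQuery is a hypothetical helper; field names come from the log format above.
function buildLogQuery({ tenantId, level }) {
  // Escape backslashes and double quotes so a value cannot break out of
  // the quoted string.
  const quote = (value) => `"${String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
  const stages = ['{job="dolly-api"}', "json", `tenantId=${quote(tenantId)}`];
  if (level) stages.push(`level=${quote(level)}`);
  return stages.join(" | ");
}

console.log(buildLogQuery({ tenantId: "your_tenant_id", level: "error" }));
```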

Enable Debug Logging

Fetch on-demand debug logs for a specific conversation:

GET /api/tenants/me/conversations/:conversationId/debug?verbosity=full
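
A minimal sketch of building this URL safely is shown below. The base URL is an assumption, since this document does not say where the API is hosted, and `debugUrl` is a hypothetical helper. `full` is the only verbosity value shown here, so it is used as the default.

```javascript
// Sketch: build the debug URL for one conversation.
// ASSUMPTION: baseUrl is wherever your Dolly API is served; it is not given
// in this document.
function debugUrl(baseUrl, conversationId, verbosity = "full") {
  // Encode the ID so unusual characters cannot change the path.
  const path = `/api/tenants/me/conversations/${encodeURIComponent(conversationId)}/debug`;
  const url = new URL(path, baseUrl);
  url.searchParams.set("verbosity", verbosity);
  return url.toString();
}

console.log(debugUrl("https://dolly.example.com", "01HZX"));
```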

Data Retention

Conversation Data

Messages: Kept for 90 days
Gate results: Kept indefinitely (audit trail)
Token usage: Kept for 7 days (cost tracking)

Analytics Data

ClickHouse events: Kept for 1 year
Aggregated metrics: Permanent

Customer Memory

L1-L4 memory: Kept indefinitely (updated on each interaction)
Episodes: Kept for 2 years
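
If you mirror Dolly data in your own systems, the retention windows above can be captured in a small lookup. The sketch below is a client-side illustration only and says nothing about how Dolly enforces retention. The record kind names are made up, and "1 year" and "2 years" are approximated as 365 and 730 days.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Retention windows in days, from the lists above. null means kept indefinitely.
// ASSUMPTION: kind names are illustrative; 1 and 2 years are taken as 365 and 730 days.
const RETENTION_DAYS = {
  message: 90,
  gateResult: null,
  tokenUsage: 7,
  clickhouseEvent: 365,
  aggregatedMetric: null,
  memory: null,
  episode: 730,
};

// Returns true when a record of the given kind is older than its window.
function isPastRetention(kind, createdAt, now = new Date()) {
  if (!(kind in RETENTION_DAYS)) throw new Error(`Unknown record kind: ${kind}`);
  const days = RETENTION_DAYS[kind];
  if (days === null) return false;
  return now.getTime() - new Date(createdAt).getTime() > days * DAY_MS;
}
```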

Compliance & Data Privacy

Data Isolation

Each tenant's data is:

Row-level secured in Postgres (RLS policies)
Isolated in ClickHouse by tenant_id
Isolated in KeyDB by key prefix
Isolated in Minio by bucket

Verify isolation: Cross-tenant queries return 403 (forbidden).

GDPR Compliance (Phase 2)

Data export API (customers can request all of their data)
Data deletion API (right to be forgotten)
Data processing agreement (DPA)
Privacy policy

Contact & Support

Documentation: https://docs.dolly.shin0x.space
Status: https://status.dolly.shin0x.space
Email: support@dolly.shin0x.space
Slack: [Community channel] (Phase 2)

Last updated: 2026-03-12. API version: 1.0-alpha
