Tags: analytics · monitoring · openclaw · costs · performance

Track Your AI Agent's Performance (Messages, Costs, Satisfaction)

How to monitor your OpenClaw agent's conversation volume, API costs, user satisfaction scores, and response quality — with dashboard setup and alerting.

By ClawPort Team

Deploying an AI agent without monitoring is flying blind. You don't know if it's giving good answers, how much it's costing you, or whether users are happy. This post covers the specific metrics that matter, how to collect them, and how to set up alerts before problems become expensive.

The Four Metrics That Actually Matter

Most monitoring tutorials list 20 metrics. Here are the four you should focus on first:

  1. Message volume — How many conversations per day/hour? Spot spikes early.
  2. API cost per conversation — Are some users costing 10x more than average?
  3. Conversation completion rate — Do users get an answer, or bail?
  4. CSAT / thumbs up rate — Are users satisfied with responses?

Everything else is nice-to-have. Start here.

Setting Up Basic Logging

OpenClaw logs events to stdout by default. For structured analytics, configure JSON logging:

logging:
  format: json
  level: info
  include_fields:
    - conversation_id
    - user_id
    - channel
    - model
    - tokens_used
    - latency_ms
    - message_count

Each log line looks like:

{
  "timestamp": "2026-03-10T14:22:01Z",
  "event": "conversation.message",
  "conversation_id": "conv_abc123",
  "user_id": "user_xyz789",
  "channel": "telegram",
  "model": "gpt-4o-mini",
  "tokens_used": {
    "input": 1240,
    "output": 387,
    "total": 1627
  },
  "latency_ms": 843,
  "message_count": 3
}

Ship these to your log aggregator of choice — Datadog, Grafana Loki, CloudWatch, or even a simple Postgres table.
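If you go the Postgres route, the nested `tokens_used` object needs flattening into columns first. A minimal sketch, assuming log lines arrive as JSON strings matching the example above (`logLineToRow` is a hypothetical helper, not part of OpenClaw):

```javascript
// Flatten one JSON log line into a flat row object, suitable for
// inserting into a Postgres table or any columnar log store.
function logLineToRow(line) {
  const e = JSON.parse(line);
  return {
    timestamp: e.timestamp,
    event: e.event,
    conversation_id: e.conversation_id,
    user_id: e.user_id,
    channel: e.channel,
    model: e.model,
    // tokens_used is nested in the log line; default to 0 if absent
    input_tokens: e.tokens_used?.input ?? 0,
    output_tokens: e.tokens_used?.output ?? 0,
    total_tokens: e.tokens_used?.total ?? 0,
    latency_ms: e.latency_ms,
    message_count: e.message_count,
  };
}
```

Run this over each line before your `INSERT`, and the analytics queries later in this post become straightforward.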

Tracking API Costs

Token costs vary by model. Here are the current rates (March 2026):

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
| Llama 3.3 70B (Groq) | $0.59 | $0.79 |

Calculate cost per conversation:

function calculateConversationCost(model, inputTokens, outputTokens) {
  // Per-1M-token prices in USD, matching the table above.
  // Deriving per-token rates at the end avoids error-prone
  // strings of zeros in the constants.
  const ratesPerMillion = {
    'gpt-4o':             { input: 2.50, output: 10.00 },
    'gpt-4o-mini':        { input: 0.15, output: 0.60 },
    'claude-3-5-sonnet':  { input: 3.00, output: 15.00 },
    'claude-3-5-haiku':   { input: 0.80, output: 4.00 },
  };

  const rate = ratesPerMillion[model];
  if (!rate) return null;  // unknown model — let the caller decide

  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}

// Example: a 5-message conversation on gpt-4o-mini
const cost = calculateConversationCost('gpt-4o-mini', 2000, 800);
// $0.00078 — less than a tenth of a cent

Set a cost alert: if average cost per conversation exceeds 2x your baseline, something is wrong (likely a prompt injection causing infinite loops, or a user finding a way to feed huge inputs).
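The baseline comparison itself is a one-liner; a sketch using the 2x threshold from above (`isCostAnomaly` is a hypothetical helper — compute the two averages however you store your data):

```javascript
// Flag an anomaly when the recent average cost per conversation
// exceeds a multiple of the long-run baseline average.
function isCostAnomaly(recentAvgCost, baselineAvgCost, threshold = 2) {
  if (baselineAvgCost <= 0) return false; // no baseline yet — nothing to compare
  return recentAvgCost > threshold * baselineAvgCost;
}
```

Compute the baseline over a trailing window (say, 30 days) so a gradual, legitimate growth in usage doesn't trip the alert.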

Measuring Conversation Completion Rate

A completed conversation is one where the user got an answer. Incomplete conversations look like:

  • User sends 1-2 messages, never responds again
  • User explicitly says "this didn't help"
  • Conversation ends with an unanswered question

Track it:

analytics:
  track_completion: true
  completion_signals:
    - user_message_contains: ["thanks", "thank you", "got it", "perfect", "solved"]
    - conversation_length_min: 3  # at least 3 exchanges
  abandonment_signals:
    - user_message_contains: ["useless", "not helpful", "wrong", "that's wrong"]
    - single_message_conversation: true

A healthy completion rate is 60-80% for support use cases. Below 50%? Your agent isn't answering questions well enough.
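If you'd rather compute this yourself from logged messages, the YAML signals above translate into a simple classifier. A sketch under the same phrase lists and thresholds (the function name and the three-way result are assumptions, not OpenClaw API):

```javascript
// Phrase lists mirroring the completion_signals / abandonment_signals config.
const COMPLETION_PHRASES = ['thanks', 'thank you', 'got it', 'perfect', 'solved'];
const ABANDONMENT_PHRASES = ['useless', 'not helpful', 'wrong', "that's wrong"];

// Classify a conversation from its user messages:
// 'completed', 'abandoned', or 'unknown' when neither signal fires.
function classifyConversation(userMessages) {
  const text = userMessages.join(' ').toLowerCase();
  if (ABANDONMENT_PHRASES.some((p) => text.includes(p))) return 'abandoned';
  if (userMessages.length <= 1) return 'abandoned'; // single-message conversation
  if (userMessages.length >= 3 && COMPLETION_PHRASES.some((p) => text.includes(p))) {
    return 'completed';
  }
  return 'unknown';
}
```

Count the `'unknown'` bucket separately rather than folding it into either side; it tells you how well your signal lists actually cover real conversations.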

User Satisfaction Tracking

The simplest approach: ask after each conversation.

satisfaction:
  enabled: true
  prompt: "Was this helpful? 👍 or 👎"
  collect_after: "conversation.ended"
  follow_up_on_negative: "I'm sorry it wasn't helpful. What were you trying to do?"

Track your thumbs-up rate over time. Any week-over-week drop of more than 10% should trigger a review of recent conversations.
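The week-over-week check is easy to get subtly wrong (absolute vs. relative drop). A sketch using a relative drop, matching the "more than 10%" rule above (`csatDropped` is a hypothetical helper):

```javascript
// True when this week's thumbs-up rate fell more than `maxDrop`
// (as a relative fraction) below last week's rate.
function csatDropped(lastWeekRate, thisWeekRate, maxDrop = 0.10) {
  if (lastWeekRate <= 0) return false; // no prior signal to compare against
  return (lastWeekRate - thisWeekRate) / lastWeekRate > maxDrop;
}
```

So 80% falling to 70% fires (a 12.5% relative drop), while 80% to 75% does not. If you prefer absolute percentage points, swap the division out, but pick one definition and keep it consistent in your alerting.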

Building a Simple Dashboard

If you're self-hosting, here's a minimal SQL schema for tracking:

CREATE TABLE conversations (
  id TEXT PRIMARY KEY,
  user_id TEXT,
  channel TEXT,
  started_at TIMESTAMPTZ,
  ended_at TIMESTAMPTZ,
  message_count INTEGER,
  total_tokens INTEGER,
  model TEXT,
  cost_usd DECIMAL(10,6),
  completed BOOLEAN,
  satisfaction_score INTEGER  -- 1 (positive) or -1 (negative)
);

CREATE TABLE messages (
  id TEXT PRIMARY KEY,
  conversation_id TEXT REFERENCES conversations(id),
  role TEXT,  -- 'user' or 'assistant'
  sent_at TIMESTAMPTZ,
  tokens INTEGER,
  latency_ms INTEGER
);

Query for your daily dashboard:

-- Today's summary
SELECT
  COUNT(*) as conversations,
  AVG(message_count) as avg_messages,
  SUM(cost_usd) as total_cost,
  AVG(cost_usd) as avg_cost_per_conversation,
  ROUND(100.0 * SUM(CASE WHEN completed THEN 1 ELSE 0 END) / COUNT(*), 1) as completion_rate,
  ROUND(100.0 * SUM(CASE WHEN satisfaction_score = 1 THEN 1 ELSE 0 END) / 
    NULLIF(SUM(CASE WHEN satisfaction_score IS NOT NULL THEN 1 ELSE 0 END), 0), 1) as csat
FROM conversations
WHERE started_at >= CURRENT_DATE;

Setting Up Alerts

Three alerts you should set from day one:

1. Cost spike alert

IF avg_cost_per_hour > 2x baseline THEN alert Slack

2. Low CSAT alert

IF csat_7day_rolling < 60% THEN alert email

3. Error rate alert

IF agent_errors_per_hour > 5 THEN alert PagerDuty/email

For OpenClaw self-hosted, you can connect to Grafana alerts or use a simple cron job that queries your database and sends a notification.
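For the cron-job route, the interesting part is the rule check; everything else is a query and a webhook call. The three alerts above as one hypothetical evaluator (field names are assumptions — map them to whatever your summary query returns):

```javascript
// Evaluate the three day-one alert rules against a metrics snapshot.
// Returns the names of the alerts that should fire.
function evaluateAlerts(snapshot, baseline) {
  const alerts = [];
  // 1. Cost spike: hourly average above 2x baseline
  if (snapshot.avgCostPerHour > 2 * baseline.avgCostPerHour) {
    alerts.push('cost_spike');
  }
  // 2. Low CSAT: 7-day rolling thumbs-up rate below 60%
  if (snapshot.csat7dayRolling < 0.60) {
    alerts.push('low_csat');
  }
  // 3. Error rate: more than 5 agent errors in the past hour
  if (snapshot.errorsPerHour > 5) {
    alerts.push('error_rate');
  }
  return alerts;
}
```

Run it every few minutes from cron, and post each returned alert name to Slack, email, or PagerDuty as the list above suggests.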

What to Do When Metrics Drop

If completion rate drops:

  1. Sample 20 recent abandoned conversations
  2. Identify patterns — is there a type of question the agent can't answer?
  3. Update SOUL.md with better handling for that topic
  4. Deploy, watch metrics for 24 hours

If CSAT drops:

  1. Look at thumbs-down conversations specifically
  2. Check for hallucinations or wrong information
  3. Check if a recent model update changed behavior
  4. Consider adding specific examples in your system prompt

If costs spike:

  1. Check for any conversations with unusually high token counts
  2. Look for prompt injection attempts (adversarial inputs trying to get the bot to do something expensive)
  3. Review if a recent knowledge base change dramatically increased context size

Analytics on ClawPort

ClawPort includes a built-in analytics dashboard — message volume, cost tracking, satisfaction scores, and conversation logs are all visible in one place. No separate logging infrastructure required.

You can export data as CSV for further analysis or connect your own dashboards via the API. It's one of the bigger time-savers compared to self-hosting, where the monitoring stack is often as complex as the agent itself.

Seven-day free trial at clawport.io, $10/month after that.

Ready to deploy your AI agent?

Get started with ClawPort in 60 seconds. No credit card required.

Get Started Free