Monitoring & Observability
Learn how to monitor agent performance, track metrics, debug issues, and gain visibility into agent behavior in production.
Overview
Effective monitoring enables you to:
- Track performance - Measure latency, throughput, and success rates
- Debug issues - Identify and diagnose problems quickly
- Optimize costs - Monitor token usage and API costs
- Ensure reliability - Detect and respond to failures
- Improve quality - Analyze agent behavior and outputs
Core Metrics
1. Performance Metrics
Track execution time and throughput:
import { createReActAgent } from '@agentforge/patterns';
// Assumes an agent instance created elsewhere, e.g. const agent = createReActAgent(...)
class MetricsCollector {
private metrics: Map<string, number[]> = new Map();
record(metric: string, value: number) {
if (!this.metrics.has(metric)) {
this.metrics.set(metric, []);
}
this.metrics.get(metric)!.push(value);
}
getStats(metric: string) {
const values = this.metrics.get(metric) || [];
if (values.length === 0) return null;
const sorted = [...values].sort((a, b) => a - b);
return {
count: values.length,
min: sorted[0],
max: sorted[sorted.length - 1],
avg: values.reduce((a, b) => a + b, 0) / values.length,
p50: sorted[Math.floor(sorted.length * 0.5)],
p95: sorted[Math.floor(sorted.length * 0.95)],
p99: sorted[Math.floor(sorted.length * 0.99)]
};
}
}
const metrics = new MetricsCollector();
// Measure agent execution time
const startTime = Date.now();
const result = await agent.invoke(input);
const duration = Date.now() - startTime;
metrics.record('agent.duration', duration);
metrics.record('agent.success', 1);
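// Hedged sketch (not an AgentForge API): a reusable wrapper that records
// duration plus success/failure for any async call, using the collector above.
async function timed<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const t0 = Date.now();
  try {
    const value = await fn();
    metrics.record(`${name}.success`, 1);
    return value;
  } catch (err) {
    metrics.record(`${name}.failure`, 1);
    throw err;
  } finally {
    metrics.record(`${name}.duration`, Date.now() - t0);
  }
}
// Usage: const result = await timed('agent', () => agent.invoke(input));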
console.log('Performance stats:', metrics.getStats('agent.duration'));
2. Token Usage Metrics
Track token consumption and costs:
class TokenMetrics {
private totalTokens = 0;
private totalCost = 0;
private requestCount = 0;
  record(usage: { promptTokens: number; completionTokens: number }, model: string) {
    const total = usage.promptTokens + usage.completionTokens;
    this.totalTokens += total;
    this.requestCount++;
    // Per-1K-token prices in USD; illustrative only, check your provider's current rates
    const pricing: Record<string, { input: number; output: number }> = {
      'gpt-4': { input: 0.03, output: 0.06 },
      'gpt-3.5-turbo': { input: 0.0015, output: 0.002 }
    };
    const rates = pricing[model];
    if (!rates) return; // unknown model: tokens are counted, cost is skipped
    const cost =
      (usage.promptTokens / 1000) * rates.input +
      (usage.completionTokens / 1000) * rates.output;
    this.totalCost += cost;
  }
}
getStats() {
return {
totalTokens: this.totalTokens,
totalCost: this.totalCost.toFixed(4),
requestCount: this.requestCount,
      avgTokensPerRequest: this.requestCount ? Math.round(this.totalTokens / this.requestCount) : 0,
      avgCostPerRequest: this.requestCount ? (this.totalCost / this.requestCount).toFixed(4) : '0.0000'
};
}
}
const tokenMetrics = new TokenMetrics();
const result = await agent.invoke(input, {
callbacks: [{
    handleLLMEnd: (output) => {
      const usage = output.llmOutput?.tokenUsage;
      if (usage) tokenMetrics.record(usage, 'gpt-4');
    }
}]
});
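// Note: the model name is hardcoded above for illustration; in practice, derive it
// from your agent's configuration so the pricing lookup stays in sync (assumption).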
console.log('Token stats:', tokenMetrics.getStats());
3. Error Metrics
Track failures and error rates:
class ErrorMetrics {
private errors: Map<string, number> = new Map();
private totalRequests = 0;
recordSuccess() {
this.totalRequests++;
}
recordError(errorType: string) {
this.totalRequests++;
this.errors.set(errorType, (this.errors.get(errorType) || 0) + 1);
}
getStats() {
const totalErrors = Array.from(this.errors.values()).reduce((a, b) => a + b, 0);
const errorRate = this.totalRequests > 0 ? (totalErrors / this.totalRequests) * 100 : 0;
return {
totalRequests: this.totalRequests,
totalErrors,
errorRate: errorRate.toFixed(2) + '%',
errorsByType: Object.fromEntries(this.errors)
};
}
}
const errorMetrics = new ErrorMetrics();
try {
const result = await agent.invoke(input);
errorMetrics.recordSuccess();
} catch (error) {
  errorMetrics.recordError(error instanceof Error ? error.name : 'UnknownError');
throw error;
}
Logging
Structured Logging
Use structured logs for better analysis:
import winston from 'winston';
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' })
]
});
// Log agent execution
const startTime = Date.now();
logger.info('Agent invocation started', {
agentType: 'ReActAgent',
input: input.messages[0].content,
timestamp: new Date().toISOString()
});
const result = await agent.invoke(input, {
callbacks: [{
handleLLMStart: (llm, prompts) => {
logger.debug('LLM call started', {
model: llm.modelName,
promptLength: prompts[0].length
});
},
handleLLMEnd: (output) => {
logger.info('LLM call completed', {
tokens: output.llmOutput?.tokenUsage,
duration: output.llmOutput?.estimatedDuration
});
},
handleToolStart: (tool, input) => {
logger.info('Tool execution started', {
tool: tool.name,
input
});
},
handleToolEnd: (output) => {
logger.info('Tool execution completed', {
outputLength: JSON.stringify(output).length
});
},
handleLLMError: (error) => {
logger.error('LLM error', {
error: error.message,
stack: error.stack
});
}
}]
});
logger.info('Agent invocation completed', {
success: true,
duration: Date.now() - startTime
});
Log Levels
Use appropriate log levels:
// ERROR: System errors, failures
logger.error('Agent execution failed', { error: error.message });
// WARN: Degraded performance, retries
logger.warn('Token limit approaching', { usage: tokenUsage });
// INFO: Important events, completions
logger.info('Agent task completed', { duration, tokens });
// DEBUG: Detailed execution flow
logger.debug('Tool selected', { tool: toolName, reasoning });
// SILLY: very detailed, token-level (winston's npm levels have no trace; silly is the finest)
logger.silly('LLM token generated', { token, position });
Tracing
Distributed Tracing
Track requests across services:
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
// Setup tracing
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
new BatchSpanProcessor(new JaegerExporter())
);
provider.register();
const tracer = trace.getTracer('agentforge');
// Trace agent execution
async function tracedAgentInvoke(input: any) {
const span = tracer.startSpan('agent.invoke');
try {
span.setAttribute('agent.type', 'ReActAgent');
span.setAttribute('input.length', input.messages[0].content.length);
const result = await agent.invoke(input, {
callbacks: [{
        handleLLMStart: () => {
          // SpanOptions has no `parent` field; link child spans through the context argument
          const llmSpan = tracer.startSpan('llm.call', {}, trace.setSpan(context.active(), span));
          llmSpan.end();
        },
        handleToolStart: (tool) => {
          const toolSpan = tracer.startSpan('tool.execute', {}, trace.setSpan(context.active(), span));
          toolSpan.setAttribute('tool.name', tool.name);
          toolSpan.end();
        }
}]
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
}
LangSmith Integration
Use LangSmith for comprehensive tracing:
import { Client } from 'langsmith';
const client = new Client({
apiKey: process.env.LANGSMITH_API_KEY
});
const result = await agent.invoke(input, {
callbacks: [{
handleChainStart: (chain, inputs, runId) => {
client.createRun({
id: runId,
name: chain.name,
run_type: 'chain',
inputs
});
},
handleChainEnd: (outputs, runId) => {
client.updateRun(runId, {
outputs,
end_time: Date.now()
});
}
}]
});
Metrics Exporters
Prometheus Integration
Export metrics to Prometheus:
import { register, Counter, Histogram, Gauge } from 'prom-client';
// Define metrics
const agentInvocations = new Counter({
name: 'agent_invocations_total',
help: 'Total number of agent invocations',
labelNames: ['agent_type', 'status']
});
const agentDuration = new Histogram({
name: 'agent_duration_seconds',
help: 'Agent execution duration',
labelNames: ['agent_type'],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60]
});
const tokenUsage = new Counter({
name: 'agent_tokens_total',
help: 'Total tokens used',
labelNames: ['agent_type', 'model']
});
const activeAgents = new Gauge({
name: 'agent_active_count',
help: 'Number of currently active agents'
});
// Instrument agent
async function monitoredAgentInvoke(input: any) {
activeAgents.inc();
const startTime = Date.now();
try {
const result = await agent.invoke(input, {
callbacks: [{
handleLLMEnd: (output) => {
tokenUsage.inc({
agent_type: 'ReActAgent',
model: 'gpt-4'
}, output.llmOutput?.tokenUsage?.totalTokens || 0);
}
}]
});
agentInvocations.inc({ agent_type: 'ReActAgent', status: 'success' });
return result;
} catch (error) {
agentInvocations.inc({ agent_type: 'ReActAgent', status: 'error' });
throw error;
} finally {
const duration = (Date.now() - startTime) / 1000;
agentDuration.observe({ agent_type: 'ReActAgent' }, duration);
activeAgents.dec();
}
}
// Expose metrics endpoint
import express from 'express';
const app = express();
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
app.listen(9090);
DataDog Integration
Send metrics to DataDog:
// hot-shots is the maintained DogStatsD client for Node.js
import { StatsD } from 'hot-shots';
const dogstatsd = new StatsD();
async function datadogMonitoredInvoke(input: any) {
const startTime = Date.now();
try {
const result = await agent.invoke(input, {
callbacks: [{
handleLLMEnd: (output) => {
dogstatsd.increment('agent.llm.calls', 1, ['model:gpt-4']);
dogstatsd.histogram('agent.llm.tokens',
output.llmOutput?.tokenUsage?.totalTokens || 0,
['model:gpt-4']
);
},
handleToolStart: (tool) => {
dogstatsd.increment('agent.tool.calls', 1, [`tool:${tool.name}`]);
}
}]
});
dogstatsd.increment('agent.invocations', 1, ['status:success']);
return result;
} catch (error) {
dogstatsd.increment('agent.invocations', 1, ['status:error']);
throw error;
} finally {
const duration = Date.now() - startTime;
dogstatsd.histogram('agent.duration', duration, ['agent:react']);
}
}
Dashboards
Grafana Dashboard
Create a comprehensive monitoring dashboard:
{
"dashboard": {
"title": "AgentForge Monitoring",
"panels": [
{
"title": "Agent Invocations",
"targets": [{
"expr": "rate(agent_invocations_total[5m])"
}]
},
{
"title": "Success Rate",
"targets": [{
"expr": "rate(agent_invocations_total{status=\"success\"}[5m]) / rate(agent_invocations_total[5m])"
}]
},
{
"title": "P95 Latency",
"targets": [{
"expr": "histogram_quantile(0.95, agent_duration_seconds_bucket)"
}]
},
{
"title": "Token Usage",
"targets": [{
"expr": "rate(agent_tokens_total[5m])"
}]
},
{
"title": "Active Agents",
"targets": [{
"expr": "agent_active_count"
}]
}
]
}
}
Custom Dashboard
Build a real-time dashboard:
import express from 'express';
import { Server } from 'socket.io';
import { createServer } from 'http';
const app = express();
const server = createServer(app);
const io = new Server(server);
// Serve dashboard
app.get('/dashboard', (req, res) => {
res.sendFile(__dirname + '/dashboard.html');
});
// Emit metrics to connected clients
setInterval(() => {
io.emit('metrics', {
timestamp: Date.now(),
performance: metrics.getStats('agent.duration'),
tokens: tokenMetrics.getStats(),
errors: errorMetrics.getStats()
});
}, 1000);
server.listen(3000);
<!-- dashboard.html -->
<!DOCTYPE html>
<html>
<head>
<title>Agent Dashboard</title>
<script src="/socket.io/socket.io.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
<body>
<h1>AgentForge Dashboard</h1>
<div>
<h2>Performance</h2>
<canvas id="performanceChart"></canvas>
</div>
<div>
<h2>Token Usage</h2>
<canvas id="tokenChart"></canvas>
</div>
<script>
const socket = io();
socket.on('metrics', (data) => {
updateCharts(data);
});
    // Minimal sketch: stream p95 latency into a Chart.js line chart; token and error charts follow the same pattern
    const perfChart = new Chart(document.getElementById('performanceChart'), {
      type: 'line',
      data: { labels: [], datasets: [{ label: 'p95 latency (ms)', data: [] }] }
    });
    function updateCharts(data) {
      perfChart.data.labels.push(new Date(data.timestamp).toLocaleTimeString());
      perfChart.data.datasets[0].data.push(data.performance?.p95 ?? 0);
      perfChart.update();
    }
</script>
</body>
</html>
Alerting
Alert Rules
Define alert conditions:
class AlertManager {
private rules: Array<{
name: string;
condition: (metrics: any) => boolean;
action: (metrics: any) => void;
}> = [];
addRule(name: string, condition: (metrics: any) => boolean, action: (metrics: any) => void) {
this.rules.push({ name, condition, action });
}
check(metrics: any) {
for (const rule of this.rules) {
if (rule.condition(metrics)) {
console.log(`🚨 Alert triggered: ${rule.name}`);
rule.action(metrics);
}
}
}
}
const alertManager = new AlertManager();
// High error rate alert
alertManager.addRule(
'High Error Rate',
(metrics) => parseFloat(metrics.errorRate) > 5,
(metrics) => {
sendSlackAlert(`Error rate is ${metrics.errorRate}`);
}
);
// High latency alert
alertManager.addRule(
'High Latency',
(metrics) => metrics.p95 > 10000,
(metrics) => {
sendSlackAlert(`P95 latency is ${metrics.p95}ms`);
}
);
// High token usage alert
alertManager.addRule(
'High Token Usage',
(metrics) => metrics.totalTokens > 100000,
(metrics) => {
sendSlackAlert(`Token usage: ${metrics.totalTokens}`);
}
);
// Check alerts periodically
setInterval(() => {
alertManager.check({
errorRate: errorMetrics.getStats().errorRate,
p95: metrics.getStats('agent.duration')?.p95,
totalTokens: tokenMetrics.getStats().totalTokens
});
}, 60000);
Notification Channels
Send alerts to various channels:
import { WebClient } from '@slack/web-api';
import nodemailer from 'nodemailer';
// Slack notifications
const slack = new WebClient(process.env.SLACK_TOKEN);
async function sendSlackAlert(message: string) {
await slack.chat.postMessage({
channel: '#alerts',
text: `🚨 AgentForge Alert: ${message}`
});
}
// Email notifications
const transporter = nodemailer.createTransport({
service: 'gmail',
auth: {
user: process.env.EMAIL_USER,
pass: process.env.EMAIL_PASS
}
});
async function sendEmailAlert(subject: string, message: string) {
await transporter.sendMail({
from: 'alerts@agentforge.com',
to: 'team@company.com',
subject: `[AgentForge] ${subject}`,
text: message
});
}
// PagerDuty integration
import { event } from '@pagerduty/pdjs';
async function sendPagerDutyAlert(severity: string, message: string) {
await event({
data: {
routing_key: process.env.PAGERDUTY_KEY,
event_action: 'trigger',
payload: {
summary: message,
severity,
source: 'agentforge'
}
}
});
}
Debugging Tools
Agent Execution Visualizer
Visualize agent execution flow:
class ExecutionVisualizer {
private steps: Array<{
type: string;
timestamp: number;
data: any;
}> = [];
recordStep(type: string, data: any) {
this.steps.push({
type,
timestamp: Date.now(),
data
});
}
generateMermaidDiagram(): string {
let diagram = 'graph TD\n';
this.steps.forEach((step, i) => {
const nodeId = `step${i}`;
      // Quote the label and strip double quotes so JSON snippets don't break Mermaid syntax
      const label = `${step.type}: ${JSON.stringify(step.data).replace(/"/g, "'").substring(0, 30)}`;
      diagram += `  ${nodeId}["${label}"]\n`;
if (i > 0) {
diagram += ` step${i-1} --> ${nodeId}\n`;
}
});
return diagram;
}
generateTimeline(): string {
const startTime = this.steps[0]?.timestamp || Date.now();
return this.steps.map((step, i) => {
const elapsed = step.timestamp - startTime;
return `${elapsed}ms: ${step.type} - ${JSON.stringify(step.data)}`;
}).join('\n');
}
}
const visualizer = new ExecutionVisualizer();
const result = await agent.invoke(input, {
callbacks: [{
handleLLMStart: () => visualizer.recordStep('LLM Start', {}),
handleLLMEnd: (output) => visualizer.recordStep('LLM End', { tokens: output.llmOutput?.tokenUsage }),
handleToolStart: (tool, input) => visualizer.recordStep('Tool Start', { tool: tool.name, input }),
handleToolEnd: (output) => visualizer.recordStep('Tool End', { output })
}]
});
console.log('Execution Timeline:');
console.log(visualizer.generateTimeline());
console.log('\nExecution Diagram:');
console.log(visualizer.generateMermaidDiagram());
Performance Profiler
Profile agent performance:
class PerformanceProfiler {
private profiles: Map<string, { count: number; totalTime: number; samples: number[] }> = new Map();
async profile<T>(name: string, fn: () => Promise<T>): Promise<T> {
const startTime = Date.now();
try {
return await fn();
} finally {
const duration = Date.now() - startTime;
if (!this.profiles.has(name)) {
this.profiles.set(name, { count: 0, totalTime: 0, samples: [] });
}
const profile = this.profiles.get(name)!;
profile.count++;
profile.totalTime += duration;
profile.samples.push(duration);
}
}
getReport() {
const report: any = {};
for (const [name, profile] of this.profiles.entries()) {
const sorted = [...profile.samples].sort((a, b) => a - b);
report[name] = {
calls: profile.count,
totalTime: profile.totalTime,
avgTime: profile.totalTime / profile.count,
minTime: sorted[0],
maxTime: sorted[sorted.length - 1],
p50: sorted[Math.floor(sorted.length * 0.5)],
p95: sorted[Math.floor(sorted.length * 0.95)]
};
}
return report;
}
}
const profiler = new PerformanceProfiler();
// Profile different operations
await profiler.profile('agent.invoke', async () => {
return await agent.invoke(input);
});
await profiler.profile('tool.webScraper', async () => {
return await webScraper.invoke({ url: 'https://example.com' });
});
console.log('Performance Report:');
console.log(JSON.stringify(profiler.getReport(), null, 2));
Best Practices
1. Monitor Key Metrics
Always track these essential metrics:
- Request rate and throughput
- Latency (p50, p95, p99)
- Error rate and types
- Token usage and costs
- Resource utilization
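As a sketch, the collectors defined earlier can emit all of these in one snapshot; the collectSnapshot helper below is illustrative, not an AgentForge API:
function collectSnapshot() {
  return {
    timestamp: Date.now(),
    latency: metrics.getStats('agent.duration'),   // p50/p95/p99 from MetricsCollector
    tokens: tokenMetrics.getStats(),               // usage and cost from TokenMetrics
    errors: errorMetrics.getStats(),               // rate and types from ErrorMetrics
    heapUsedBytes: process.memoryUsage().heapUsed  // basic resource utilization
  };
}
logger.info('metrics.snapshot', collectSnapshot());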
2. Set Up Alerts
Configure alerts for critical conditions:
- Error rate > 5%
- P95 latency > 10s
- Token usage > budget
- Memory usage > 80%
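The first three thresholds map directly onto the AlertManager rules shown earlier. A minimal sketch for the memory check, assuming a MEMORY_LIMIT_BYTES value you define for your deployment:
const MEMORY_LIMIT_BYTES = 512 * 1024 * 1024; // assumed per-process limit
alertManager.addRule(
  'High Memory Usage',
  () => process.memoryUsage().heapUsed / MEMORY_LIMIT_BYTES > 0.8,
  () => sendSlackAlert('Memory usage above 80% of limit')
);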
3. Use Structured Logging
Always use structured, searchable logs:
logger.info('event', { key: 'value', timestamp: Date.now() });
4. Implement Distributed Tracing
Track requests across services for complex systems.
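With OpenTelemetry, the active trace context can be carried over HTTP so spans emitted by a downstream service join the same trace. A minimal sketch, assuming the provider from the tracing section is registered; toolServiceUrl and req are placeholders, not real endpoints:
import { context, propagation } from '@opentelemetry/api';
// Caller: inject the current trace context into outgoing request headers
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
await fetch(toolServiceUrl, { headers }); // toolServiceUrl is hypothetical
// Callee: extract the parent context and start a child span under it
const parentCtx = propagation.extract(context.active(), req.headers);
const span = tracer.startSpan('tool.service.handle', {}, parentCtx);
// ... handle the request, then close the span
span.end();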
5. Create Dashboards
Build real-time dashboards for visibility.
Next Steps
- Deployment - Production deployment
- Resource Management - Optimize resources
- Streaming - Real-time monitoring
- Core API Reference - Core monitoring utilities
Further Reading
- OpenTelemetry - Observability framework
- Prometheus - Metrics and alerting
- Grafana - Visualization and dashboards
- LangSmith - LLM tracing