Troubleshooting

Even well-designed agent systems run into problems. This guide helps you diagnose and fix common issues, debug complex workflows, and keep your agent systems healthy.

Learning Objectives

By the end of this section, you'll be able to:

Diagnose common AgenticGoKit issues systematically
Use debugging tools and techniques effectively
Troubleshoot configuration, connectivity, and performance problems
Debug multi-agent workflows and orchestration issues
Resolve memory and tool integration problems
Implement monitoring and alerting for proactive issue detection
Apply best practices for maintaining healthy agent systems

Prerequisites

Before starting, make sure you've completed:

✅ Building Workflows - Understanding of complex agent systems

Systematic Troubleshooting Approach

When facing issues with AgenticGoKit systems, follow this systematic approach:

1. Identify the Problem

What is happening vs. what should happen?
When does the problem occur?
Where in the system does it manifest?
How consistently does it reproduce?

2. Gather Information

Check logs and error messages
Review configuration files
Test individual components
Monitor system resources

3. Form Hypotheses

Based on symptoms, what could be causing the issue?
What are the most likely root causes?
How can you test each hypothesis?

4. Test and Validate

Test hypotheses systematically
Make one change at a time
Document what works and what doesn't

5. Implement and Monitor

Apply the solution
Monitor to ensure the fix works
Document the solution for future reference

Common Issues and Solutions

Installation and Setup Issues

"agentcli: command not found"

Symptoms:

bash

$ agentcli version
bash: agentcli: command not found

Diagnosis:

bash

# Check if Go is installed
go version

# Check GOPATH and GOBIN
go env GOPATH
go env GOBIN

# Check if GOPATH/bin is in PATH
echo $PATH

Solutions:

bash

# Reinstall agentcli
go install github.com/kunalkushwaha/agenticgokit/cmd/agentcli@latest

# Add GOPATH/bin to PATH (add to ~/.bashrc or ~/.zshrc)
# If `go env GOBIN` prints a path, binaries are installed there instead of GOPATH/bin
export PATH="$PATH:$(go env GOPATH)/bin"

# Or use full path temporarily
$(go env GOPATH)/bin/agentcli version
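To confirm that the PATH change took effect, a small helper can check whether a directory is a complete PATH entry rather than eyeballing the output of `echo $PATH`. This is a minimal sketch; `path_contains` is a hypothetical helper name and not part of agentcli.

```bash
#!/usr/bin/env bash
# path_contains DIR [PATH_VALUE]
# Succeeds when DIR is a complete entry of the colon-separated PATH_VALUE
# (defaults to the current $PATH). A trailing slash on DIR or on the entry is ignored.
path_contains() {
    local dir="${1%/}"
    local path_value="${2:-$PATH}"
    case ":${path_value}:" in
        *":${dir}:"* | *":${dir}/:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `path_contains "$(go env GOPATH)/bin" || echo "GOPATH/bin is not on PATH"` reports the problem without printing the whole PATH.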

"provider not registered" Error

Symptoms:

Error: LLM provider "openai" not registered

Diagnosis: Check if plugin imports are missing in your main.go:

bash

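# Searching for "import" only matches the opening line of a grouped import block,
# so search for the plugin paths themselves
grep -n "agenticgokit/plugins" main.go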

Solution: Add missing plugin imports to your main.go:

import (
    _ "github.com/kunalkushwaha/agenticgokit/plugins/llm/openai"
    _ "github.com/kunalkushwaha/agenticgokit/plugins/llm/ollama"
    _ "github.com/kunalkushwaha/agenticgokit/plugins/llm/azure"
    _ "github.com/kunalkushwaha/agenticgokit/plugins/orchestrator/default"
    _ "github.com/kunalkushwaha/agenticgokit/plugins/runner/default"
)
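The check for a missing plugin import can be scripted so it is easy to repeat per provider. The sketch below assumes the import paths shown above; `has_llm_plugin_import` is a hypothetical helper name. It ignores commented-out imports, which are an easy cause of this error to overlook.

```bash
#!/usr/bin/env bash
# has_llm_plugin_import FILE PROVIDER
# Succeeds when FILE contains an uncommented blank import of the given LLM
# provider plugin, either inside an import ( ... ) group or as a single-line
# `import _ "..."` statement.
has_llm_plugin_import() {
    local file="$1" provider="$2"
    local path="github\.com/kunalkushwaha/agenticgokit/plugins/llm/${provider}"
    grep -Eq "^[[:space:]]*(import[[:space:]]+)?_[[:space:]]+\"${path}\"" "$file"
}
```

For example, `has_llm_plugin_import main.go openai || echo "openai plugin import is missing"`.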

Configuration Issues

Configuration Validation Failures

Symptoms:

bash

$ agentcli validate
Error: Invalid configuration at line 15: missing required field 'system_prompt'

Diagnosis Tools:

bash

# Validate TOML syntax
agentcli validate --verbose

# Check specific sections
agentcli config check --section agents

# Show configuration with resolved environment variables
agentcli config show --resolved

Common Configuration Fixes:

Missing Required Fields:

toml

# Bad: Missing system_prompt
[agents.assistant]
role = "helper"

# Good: All required fields present
[agents.assistant]
role = "helper"
description = "A helpful assistant"
system_prompt = "You are a helpful assistant."
enabled = true

Invalid TOML Syntax:

toml

# Bad: Missing quotes
[agents.assistant]
system_prompt = You are a helpful assistant.

# Good: Proper quoting
[agents.assistant]
system_prompt = "You are a helpful assistant."

Environment Variable Issues:

bash

# Check if environment variables are set
env | grep -E "(OPENAI|AZURE|OLLAMA)"

# Test with explicit values
export OPENAI_API_KEY="your-actual-key"
agentcli validate
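A variable that is set in the shell but not exported is invisible to `go run` and `agentcli`, and `env | grep` also prints secret values to the terminal. The following sketch lists only the names that a child process would not see; `missing_env_vars` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# missing_env_vars NAME...
# Prints each NAME that is unset, empty, or not exported, one per line, without
# printing any values. Returns non-zero if at least one is missing.
missing_env_vars() {
    local name status=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what child processes inherit.
        if [ -z "$(printenv "$name")" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_env_vars AZURE_OPENAI_ENDPOINT AZURE_OPENAI_API_KEY AZURE_OPENAI_DEPLOYMENT` prints nothing and returns 0 when all three are exported.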

LLM Provider Issues

OpenAI Connection Problems

Symptoms:

"Invalid API key" errors
"Rate limit exceeded" errors
"Model not found" errors

Diagnosis:

bash

# Test API key directly
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
     https://api.openai.com/v1/models

# Check rate limits and usage
agentcli llm status --provider openai

# Verify model availability
agentcli llm models --provider openai

Solutions:

bash

# Verify the API key is set without printing the whole secret
echo "${OPENAI_API_KEY:0:7}... (${#OPENAI_API_KEY} characters)"

# Check account status and billing
# Visit https://platform.openai.com/account/usage

# Use a different model if the current one is unavailable.
# In agentflow.toml (TOML, not a shell command):
#   [llm]
#   model = "gpt-3.5-turbo"  # Instead of "gpt-4"
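The three symptoms listed for OpenAI usually correspond to distinct HTTP status codes, so checking the status code of the `curl` test narrows the cause quickly. The mapping below is a sketch based on the commonly documented codes (401, 404, 429); `explain_openai_status` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# explain_openai_status HTTP_CODE
# Maps an HTTP status code returned by the OpenAI API to the most likely cause.
explain_openai_status() {
    case "$1" in
        200) echo "OK: key accepted" ;;
        401) echo "Invalid API key: check OPENAI_API_KEY" ;;
        404) echo "Model not found: check the model name and your account's access" ;;
        429) echo "Rate limit or quota exceeded: slow down or check billing" ;;
        5??) echo "Server error: retry later" ;;
        *)   echo "Unexpected status $1" ;;
    esac
}

# Example use (needs network access, so it is left commented out):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#       -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models)
#   explain_openai_status "$code"
```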

Azure OpenAI Issues

Symptoms:

"Resource not found" errors
"Deployment not found" errors
Authentication failures

Diagnosis:

bash

# Check all required environment variables (the key is masked)
echo "Endpoint: $AZURE_OPENAI_ENDPOINT"
echo "Key: ${AZURE_OPENAI_API_KEY:+set (${#AZURE_OPENAI_API_KEY} characters)}"
echo "Deployment: $AZURE_OPENAI_DEPLOYMENT"

# Test Azure connection (${VAR%/} drops a trailing slash to avoid "//" in the URL)
curl -H "api-key: $AZURE_OPENAI_API_KEY" \
     "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments?api-version=2024-02-15-preview"

Solutions:

bash

# Verify endpoint format (must include https://)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"

# Check deployment name matches Azure portal
export AZURE_OPENAI_DEPLOYMENT="your-exact-deployment-name"

# Verify API version compatibility
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"
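Endpoint format mistakes (a missing scheme, or a trailing slash that produces `//` once a path is appended) can be caught before any request is sent. A minimal sketch, with `normalize_azure_endpoint` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# normalize_azure_endpoint URL
# Prints URL without a trailing slash, or fails with a message on stderr
# if URL does not start with https://.
normalize_azure_endpoint() {
    local url="$1"
    case "$url" in
        https://?*) ;;
        *)
            echo "endpoint must start with https://: $url" >&2
            return 1
            ;;
    esac
    echo "${url%/}"
}
```

For example, `base=$(normalize_azure_endpoint "$AZURE_OPENAI_ENDPOINT") || exit 1` gives a value that is safe to join with a request path.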

Ollama Connection Issues

Symptoms:

"Connection refused" errors
"Model not found" errors
Slow response times

Diagnosis:

bash

# Check if Ollama is running
curl http://localhost:11434/api/version

# List available models
ollama list

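# Check Ollama logs (the ollama CLI has no "logs" subcommand)
journalctl -u ollama --no-pager -n 50   # Linux (systemd service)
cat ~/.ollama/logs/server.log           # macOS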

# Test model directly
ollama run llama3.1:8b "Hello, world!"

Solutions:

bash

# Start Ollama service
ollama serve

# Pull required model
ollama pull llama3.1:8b

# Check available system resources
free -h  # Memory
df -h    # Disk space

# Use smaller model if resources are limited
ollama pull gemma2:2b
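To pull a model only when it is actually missing, the output of `ollama list` can be checked first. This sketch assumes the listing starts with a header row and puts the model name in the first column; `model_in_list` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# model_in_list MODEL
# Reads `ollama list` output on stdin and succeeds when MODEL appears in the
# first column. Assumes line 1 is a header row and names are in column one.
model_in_list() {
    awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}
```

For example, `ollama list | model_in_list "llama3.1:8b" || ollama pull llama3.1:8b`. The match is exact, so `llama3.1` does not match `llama3.1:8b`.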

Multi-Agent Orchestration Issues

Agents Not Collaborating

Symptoms:

Only one agent responds in collaborative mode
Agents seem to ignore each other's work
Inconsistent results from multi-agent workflows

Diagnosis:

bash

# Enable debug logging
export AGENTICGOKIT_LOG_LEVEL=debug
go run . -m "Test message"

# Check agent registration
agentcli agents list

# Verify orchestration configuration
agentcli config show --section orchestration

Solutions:

Fix Collaborative Agent List:

toml

[orchestration]
mode = "collaborative"
# Ensure all agents are listed
collaborative_agents = ["researcher", "analyzer", "writer"]

# Verify agent names match exactly
[agents.researcher]  # Must match name in collaborative_agents
role = "researcher"
# ...
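Name mismatches between the agent list and the `[agents.*]` tables are easy to miss by eye. The sketch below reports listed agents that have no matching table. It is a rough text check rather than a TOML parser: it assumes the array is written on one line and that names contain no spaces. `unmatched_agents` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# unmatched_agents FILE [KEY]
# Prints every agent named in the KEY array (default: collaborative_agents) that
# has no matching [agents.NAME] table in FILE. Returns non-zero if any are unmatched.
unmatched_agents() {
    local file="$1" key="${2:-collaborative_agents}" name status=0
    for name in $(grep -E "^[[:space:]]*${key}[[:space:]]*=" "$file" \
                      | grep -oE '"[^"]+"' | tr -d '"'); do
        if ! grep -Eq "^[[:space:]]*\[agents\.${name}\]" "$file"; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}
```

For example, `unmatched_agents agentflow.toml` checks the collaborative list, and `unmatched_agents agentflow.toml sequential_agents` applies the same check to a sequential pipeline.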

Check Agent System Prompts:

toml

[agents.researcher]
system_prompt = """
You are part of a research team. Your role is to gather information.
Work with the analyzer and writer to create comprehensive reports.
Share your findings clearly so other agents can build on your work.
"""

Sequential Pipeline Breaks

Symptoms:

Pipeline stops at a specific agent
Agents can't process previous agent's output
Data loss between pipeline stages

Diagnosis:

bash

# Test each agent individually
agentcli test-agent researcher "Test input"
agentcli test-agent analyzer "Test input"

# Check pipeline configuration
agentcli config show --section orchestration

# Monitor pipeline execution
export AGENTICGOKIT_PIPELINE_DEBUG=true
go run . -m "Test message"

Solutions:

Fix Sequential Agent Order:

toml

[orchestration]
mode = "sequential"
# Ensure logical order
sequential_agents = ["collector", "validator", "processor", "analyzer", "reporter"]

Improve Agent Handoffs:

toml

[agents.validator]
system_prompt = """
Process the data from the collector and prepare it for the processor.
Always output your results in a structured format that the processor can understand:

VALIDATED_DATA:
- Field1: value
- Field2: value
- Status: VALID/INVALID
- Notes: any important observations
"""

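When a pipeline stage is prompted to emit a structured block like the VALIDATED_DATA format above, the handoff can be checked mechanically while debugging. The sketch below reads one agent's output and extracts the status field. It assumes the exact field layout from the prompt; `handoff_status` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# handoff_status
# Reads an agent's output on stdin. Prints VALID or INVALID when a
# "VALIDATED_DATA:" block with a "- Status:" line is present; fails otherwise.
handoff_status() {
    awk '
        /^VALIDATED_DATA:/ { in_block = 1; next }
        in_block && /^- Status:/ { status = $3; exit }
        END {
            if (status == "VALID" || status == "INVALID") {
                print status
                exit 0
            }
            exit 1
        }
    '
}
```

A non-zero exit means the stage did not follow the agreed format, which points at the prompt of that stage rather than at the next agent in the pipeline.
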
Memory System Issues

Memory Not Persisting

Symptoms:

Agents don't remember previous conversations
Knowledge base searches return no results
Memory-related errors in logs

Diagnosis:

bash

# Check memory provider status
agentcli memory status

# Test database connection
agentcli memory test-connection

# Check memory configuration
agentcli config show --section agent_memory

# List stored memories
agentcli memory list --limit 10

Solutions:

Database Connection Issues:

bash

# Test PostgreSQL connection
psql "postgresql://agent:agentpass@localhost:5432/agentdb" -c "SELECT 1;"

# Check if pgvector extension is installed
psql "postgresql://agent:agentpass@localhost:5432/agentdb" -c "SELECT * FROM pg_extension WHERE extname = 'vector';"

# Install pgvector if missing
psql "postgresql://agent:agentpass@localhost:5432/agentdb" -c "CREATE EXTENSION IF NOT EXISTS vector;"

Memory Configuration Fixes:

toml

[agent_memory]
provider = "pgvector"
enable_rag = true
enable_knowledge_base = true

[agent_memory.pgvector]
# Ensure connection string is correct
connection_string = "postgresql://agent:agentpass@localhost:5432/agentdb"

[agents.memory_agent]
# Ensure memory is enabled for agents
memory_enabled = true

RAG Not Finding Relevant Information

Symptoms:

Knowledge base searches return empty results
Agents claim no relevant information exists
Poor quality search results

Diagnosis:

bash

# Check if documents are uploaded
agentcli knowledge list

# Test search directly
agentcli knowledge search "test query"

# Check chunk configuration
agentcli config show --section agent_memory

# Verify vector embeddings
agentcli memory debug --type embeddings

Solutions:

Upload Documents:

bash

# Upload documents to knowledge base
agentcli knowledge upload ./docs/

# Verify upload
agentcli knowledge list

# Test search
agentcli knowledge search "your search term"

Optimize RAG Configuration:

toml

[agent_memory]
chunk_size = 800           # Smaller chunks for better precision
overlap_size = 200         # Overlap for context continuity
max_results = 8            # More results for comprehensive answers
similarity_threshold = 0.6  # Lower threshold for more results
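To see how `chunk_size` and `overlap_size` interact, it helps to estimate how many chunks a document produces. The sketch below assumes a plain sliding-window splitter in which each chunk starts `chunk_size - overlap_size` units after the previous one; the splitter AgenticGoKit actually uses, and whether sizes are counted in characters or tokens, may differ. `chunk_count` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# chunk_count DOC_LENGTH CHUNK_SIZE OVERLAP_SIZE
# Prints how many chunks a sliding-window splitter would produce, where each new
# chunk starts (CHUNK_SIZE - OVERLAP_SIZE) units after the previous one.
chunk_count() {
    local length="$1" chunk="$2" overlap="$3"
    local stride=$(( chunk - overlap ))
    if [ "$stride" -le 0 ]; then
        echo "overlap_size must be smaller than chunk_size" >&2
        return 1
    fi
    if [ "$length" -le "$chunk" ]; then
        echo 1
        return 0
    fi
    # Ceiling division of what remains after the first chunk, plus that first chunk.
    echo $(( (length - chunk + stride - 1) / stride + 1 ))
}
```

With the settings above, `chunk_count 2000 800 200` prints 3: the chunks cover 0-800, 600-1400, and 1200-2000. Raising the overlap shrinks the stride, so the same document yields more chunks to embed and search.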

Tool Integration Issues

MCP Tools Not Available

Symptoms:

"Tool not found" errors
Agents claim they can't perform actions
MCP server connection failures

Diagnosis:

bash

# Check MCP server status
agentcli mcp health

# List available tools
agentcli mcp tools

# Test specific server
agentcli mcp test web-search

# Check server logs
agentcli mcp logs filesystem

Solutions:

Install Missing Dependencies:

bash

# Install uv/uvx for MCP servers
curl -LsSf https://astral.sh/uv/install.sh | sh

# Test uvx installation
uvx --version

# Run a specific MCP server (uvx downloads the package on first use)
uvx mcp-server-web-search --help

Fix MCP Configuration:

toml

[[mcp.servers]]
name = "web-search"
command = "uvx"
args = ["mcp-server-web-search"]
# Add required environment variables
env = { "SEARCH_ENGINE" = "duckduckgo" }

Check Tool Permissions:

toml

[[mcp.servers]]
name = "filesystem"
command = "uvx"
args = ["mcp-server-filesystem"]
# Inline tables must fit on a single line in TOML 1.0
env = { "ALLOWED_DIRECTORIES" = "./workspace,./data", "MAX_FILE_SIZE" = "10MB" }

Tool Permission Errors

Symptoms:

"Permission denied" errors
"Access restricted" messages
Tools fail silently

Diagnosis:

bash

# Check file permissions
ls -la ./workspace/

# Verify environment variables
env | grep -E "(ALLOWED|DENIED)"

# Test tool access directly
agentcli mcp test filesystem --operation read --path ./workspace/test.txt

Solutions:

bash

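# Fix directory permissions
chmod 755 ./workspace/
# Apply 644 to regular files only, so subdirectories keep their execute (search) bit
find ./workspace -type f -exec chmod 644 {} +

# Update the MCP server configuration.
# In agentflow.toml (TOML, not a shell command; inline tables must fit on one line):
#   [[mcp.servers]]
#   name = "filesystem"
#   env = { "ALLOWED_DIRECTORIES" = "./workspace,./reports", "ALLOWED_EXTENSIONS" = ".txt,.md,.json,.csv" }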
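Before blaming the tool, it can help to check a path against the same allow-lists you configured. The sketch below uses plain string matching and is only an approximation: how the MCP filesystem server actually interprets `ALLOWED_DIRECTORIES` and `ALLOWED_EXTENSIONS` is an assumption here. `path_allowed` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# path_allowed FILE ALLOWED_DIRS ALLOWED_EXTS
# Succeeds when FILE sits under one of the comma-separated ALLOWED_DIRS and ends
# with one of the comma-separated ALLOWED_EXTS. Plain string matching only: it
# does not resolve symlinks or ".." segments.
path_allowed() {
    local file="${1#./}" dirs="$2" exts="$3" dir ext found=no
    local IFS=','
    for dir in $dirs; do
        dir="${dir#./}"
        dir="${dir%/}"
        case "$file" in
            "$dir"/*) found=yes ;;
        esac
    done
    [ "$found" = yes ] || return 1
    for ext in $exts; do
        case "$file" in
            *"$ext") return 0 ;;
        esac
    done
    return 1
}
```

For example, `path_allowed ./workspace/test.txt "./workspace,./reports" ".txt,.md,.json,.csv"` succeeds, while the same call with `./workspace/run.sh` fails on the extension check.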

Performance Issues

Slow Agent Responses

Symptoms:

Long wait times for agent responses
Timeouts in multi-agent workflows
High resource usage

Diagnosis:

bash

# Monitor system resources
top
htop
free -h

# Check agent performance
agentcli monitor --agent researcher --duration 60s

# Profile memory usage
agentcli profile --type memory

# Check network latency (for cloud LLMs)
ping api.openai.com

Solutions:

Optimize LLM Settings:

toml

[llm]
model = "gpt-3.5-turbo"  # More efficient than gpt-4
max_tokens = 1000        # Reduce for quicker responses
temperature = 0.7        # Doesn't affect speed significantly

Optimize Agent Configuration:

toml

[orchestration]
timeout_seconds = 120    # Reasonable timeout
max_concurrent_agents = 3 # Limit concurrent agents

[agent_memory]
max_results = 3          # Fewer RAG results for speed
chunk_size = 500         # Smaller chunks process more efficiently

Use Local Models for Development:

toml

[llm]
provider = "ollama"
model = "gemma2:2b"      # Efficient, lightweight model
host = "http://localhost:11434"

Memory Usage Issues

Symptoms:

Out of memory errors
System becomes unresponsive
Gradual memory leaks

Diagnosis:

bash

# Monitor memory usage over time
watch -n 5 'free -h'

# Check Go memory stats
agentcli debug --type memory

# Profile memory allocation
# (requires your application to import net/http/pprof and serve HTTP on localhost:6060)
go tool pprof http://localhost:6060/debug/pprof/heap

Solutions:

toml

# Limit memory usage
[agent_memory]
conversation_memory_limit = 20  # Limit conversation history
enable_compression = true       # Compress old memories
cleanup_interval = "1h"         # Regular cleanup

[orchestration]
max_concurrent_agents = 2       # Reduce concurrent agents

Debugging Tools and Techniques

Enable Debug Logging

bash

# Enable comprehensive debugging
export AGENTICGOKIT_LOG_LEVEL=debug
export AGENTICGOKIT_MEMORY_DEBUG=true
export AGENTICGOKIT_MCP_DEBUG=true
export AGENTICGOKIT_TOOL_DEBUG=true

# Run with debugging enabled
go run . -m "Debug test message"

Use CLI Debugging Commands

bash

# Test individual components
agentcli test-agent researcher "Test query"
agentcli test-memory "Test memory operation"
agentcli test-tool web-search "Test search"

# Health checks
agentcli health-check
agentcli mcp health
agentcli memory health

# Configuration validation
agentcli validate --verbose
agentcli config check --all

# Performance monitoring
agentcli monitor --duration 300s
agentcli metrics --timeframe 1h

Log Analysis

bash

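# Filter logs by severity or component (-i ignores case)
grep "ERROR" agenticgokit.log
grep -i "memory" agenticgokit.log
grep -i "mcp" agenticgokit.log

# Analyze performance: sort operations by duration, slowest last
# (assumes key=value log fields such as duration=1.25; adjust the pattern to your log format)
grep -oE "duration=[0-9.]+" agenticgokit.log | sort -t= -k2 -n

# Find error patterns: count how often each keyword appears
# (counting whole lines would report 1 for every line once timestamps differ)
grep -oiE "(failed|error|timeout)" agenticgokit.log | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn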

Monitoring and Alerting

Basic Monitoring Setup

toml

[monitoring]
enabled = true
log_level = "info"
metrics_collection = true
health_check_interval = "30s"

[monitoring.alerts]
max_execution_time = 300
memory_threshold = "1GB"
error_rate_threshold = 0.1
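As a worked example of `error_rate_threshold = 0.1`: 15 failed runs out of 100 is a rate of 0.15, which crosses the threshold, while 5 out of 100 (0.05) does not. The sketch below performs that comparison. Whether AgenticGoKit treats the threshold as strict or inclusive is an assumption here, and `error_rate_exceeded` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# error_rate_exceeded ERRORS TOTAL THRESHOLD
# Succeeds (exit 0) when ERRORS / TOTAL is strictly greater than THRESHOLD.
# awk does the floating-point math, since shell arithmetic is integer-only.
error_rate_exceeded() {
    awk -v e="$1" -v t="$2" -v limit="$3" \
        'BEGIN { if (t <= 0) exit 1; exit !((e / t) > limit) }'
}
```

For example, `error_rate_exceeded 15 100 0.1 && echo "ALERT: error rate above threshold"`. A total of zero is treated as "not exceeded" so that an idle system does not raise alerts.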

Health Check Endpoints

bash

# Check overall system health
curl http://localhost:8080/health

# Check specific components
curl http://localhost:8080/health/memory
curl http://localhost:8080/health/mcp
curl http://localhost:8080/health/agents

Automated Monitoring

bash

#!/bin/bash
# health-check.sh - Simple monitoring script

check_health() {
    if ! agentcli health-check > /dev/null 2>&1; then
        echo "ALERT: AgenticGoKit health check failed"
        # Send notification (email, Slack, etc.)
    fi
}

# Run periodically
while true; do
    check_health
    sleep 300
done
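A loop like the one above alerts on every failed check, so a single transient failure is noisy and a long outage repeats the alert every five minutes. One common refinement is to alert once, after a number of consecutive failures. The helpers below are a sketch of that idea; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# next_failure_count PREVIOUS_COUNT CHECK_EXIT_STATUS
# Prints the updated count of consecutive failures: reset to 0 on success,
# otherwise PREVIOUS_COUNT + 1.
next_failure_count() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 + 1 ))
    fi
}

# should_alert COUNT THRESHOLD
# Succeeds exactly when COUNT reaches THRESHOLD, so one outage sends one alert.
should_alert() {
    [ "$1" -eq "$2" ]
}

# Sketch of use inside the monitoring loop (left commented out):
#   failures=0
#   while true; do
#       agentcli health-check > /dev/null 2>&1
#       failures=$(next_failure_count "$failures" "$?")
#       should_alert "$failures" 3 && echo "ALERT: 3 consecutive health check failures"
#       sleep 300
#   done
```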

Best Practices for Troubleshooting

1. Implement Comprehensive Logging

toml

[agents.well_logged_agent]
system_prompt = """
Always log your decision-making process:
- What information you received
- How you interpreted it
- What actions you decided to take
- Why you made those decisions
- Any issues or limitations encountered

This helps with debugging and improvement.
"""

2. Use Structured Error Handling

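// Example error handling pattern.
// Agent, its Name method, and processInput are assumed to be defined elsewhere in your project.
// Logging uses the standard library's log/slog (Go 1.21+); substitute your own logger if you have one.
import (
    "context"
    "fmt"
    "log/slog"

    "github.com/kunalkushwaha/agenticgokit/core"
)

func (a *Agent) Run(ctx context.Context, state core.State) (core.State, error) {
    slog.Info("Agent starting", "agent", a.Name(), "input_keys", state.Keys())

    result, err := a.processInput(ctx, state)
    if err != nil {
        slog.Error("Agent processing failed",
            "agent", a.Name(),
            "error", err,
            "input_size", len(state.Keys()))
        // Return the unmodified input state and wrap the error with %w
        // so callers can still inspect it with errors.Is or errors.As.
        return state, fmt.Errorf("agent %s failed: %w", a.Name(), err)
    }

    slog.Info("Agent completed", "agent", a.Name(), "output_keys", result.Keys())
    return result, nil
}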

3. Create Reproducible Test Cases

bash

# Create test cases for common scenarios
mkdir -p tests/scenarios/

# Test case for memory issues
cat > tests/scenarios/memory-test.sh << 'EOF'
#!/bin/bash
echo "Testing memory persistence..."
go run . -m "Remember that my name is Alice"
go run . -m "What is my name?"
EOF

# Test case for tool integration
cat > tests/scenarios/tool-test.sh << 'EOF'
#!/bin/bash
echo "Testing web search tool..."
go run . -m "Search for the latest news about AI"
EOF

# Make the test scenarios executable
chmod +x tests/scenarios/*.sh

4. Document Known Issues

markdown

# Known Issues and Workarounds

## Issue: Memory not persisting after restart
**Symptoms**: Agents forget previous conversations
**Cause**: Database connection not properly configured
**Workaround**: Verify connection string and restart database
**Fix**: Update connection string in agentflow.toml

## Issue: Slow responses with GPT-4
**Symptoms**: Long wait times for responses
**Cause**: GPT-4 is slower than GPT-3.5-turbo
**Workaround**: Use GPT-3.5-turbo for development
**Fix**: Optimize prompts and reduce max_tokens

What You've Learned

✅ Systematic troubleshooting approach for diagnosing issues
✅ Common problem patterns and their solutions
✅ Debugging tools and techniques for complex systems
✅ Performance optimization strategies
✅ Monitoring and alerting setup for proactive issue detection
✅ Best practices for maintainable, debuggable systems
✅ Documentation and knowledge sharing for team environments

Understanding Check

Before moving on, make sure you can:

[ ] Diagnose common AgenticGoKit issues systematically
[ ] Use CLI tools and logging for debugging
[ ] Troubleshoot configuration and connectivity problems
[ ] Debug multi-agent workflows and orchestration issues
[ ] Resolve memory and tool integration problems
[ ] Set up monitoring and alerting for production systems
[ ] Document and share troubleshooting knowledge

Next Steps

Congratulations! You've completed the comprehensive AgenticGoKit getting-started tutorial. You now have the knowledge and skills to build sophisticated, production-ready agent systems. Let's explore what comes next in your AgenticGoKit journey.

Advanced Troubleshooting Resources

For Production Systems:

Deployment Guide - Production deployment patterns
Monitoring Guide - System health and observability
Performance Optimization - Scaling and optimization

For Development:

Debugging Guide - Advanced debugging techniques
Testing Strategies - Comprehensive testing approaches
Development Best Practices - Code quality and patterns

Community Support:

GitHub Discussions - Get help from the community
Known Issues - Check for known problems
FAQ - Frequently asked questions

Troubleshooting Mastery

You now have the skills to diagnose and solve issues in complex agent systems. These troubleshooting techniques will serve you well as you build and maintain sophisticated AgenticGoKit applications in production environments.
