# Troubleshooting
This guide covers the most common problems with a running mithai agent: skill loading failures, unexpected agent behavior, approval flow problems, configuration issues, and memory failures. Each section shows how to diagnose the problem and fix it.
## Checking logs

Run with `--verbose` to see debug-level output including skill loading, config parsing, LLM calls, tool routing, and approval decisions:

```sh
mithai run --verbose
```

or for the chat interface:

```sh
mithai chat --verbose
```

Key patterns to look for:
| Log message | Means |
|---|---|
| `Loaded skill: services (3 tools)` | Skill loaded successfully |
| `Skill services: missing TOOLS export` | `tools.py` does not define `TOOLS` |
| `Skill services: missing handle() function` | `tools.py` does not define `handle` |
| `Skipping services: missing prompt.md or tools.py` | Directory is incomplete |
| `Human MCP: requesting approve for services__restart_service` | Approval gate triggered |
| `Executing tool: services__restart_service` | Tool approved and running |
| `Tool denied by human: services__restart_service` | User clicked Deny |
| `LLM response: stop_reason=tool_use` | LLM called a tool |
| `LLM response: stop_reason=end_turn` | LLM finished responding |
For production deployments running under systemd:

```sh
sudo journalctl -u mithai -f
```

## Validating your setup
### mithai skill validate

Checks all skills in `./skills/` for structural correctness — presence of `prompt.md` and `tools.py`, a valid `TOOLS` schema, the required `handle` function, and valid `"human"` levels:

```sh
mithai skill validate
```

Validate a single skill:

```sh
mithai skill validate services
```

A passing skill reports:

```
✓ services: OK (3 tools)
```

A failing skill reports the exact errors:

```
✗ services: FAILED
  - Tool 2: missing 'description'
  - tools.py does not export handle() function
```

### mithai doctor

Runs a full connectivity check — LLM API, Slack/Telegram tokens, kubectl clusters if configured, skill loading, and filesystem permissions:

```sh
mithai doctor
```

The exit code is 0 when all checks pass, 1 if any fail. Use this in a deployment pipeline after config changes.
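Since `mithai doctor` signals pass/fail through its exit code, it slots naturally into a CI job. A hypothetical pipeline step (the step name and workflow syntax are our assumption; mithai does not ship this file):

```yaml
# Hypothetical deploy-pipeline step: fail the job if any skill is
# malformed or any connectivity check fails after a config change.
- name: Validate mithai setup
  run: |
    mithai skill validate
    mithai doctor
```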
## Skill issues

### Skill not loading

**Symptom:** The agent does not use a skill you added. `mithai skill list` does not show it. With `--verbose`, you see `Skipping services: missing prompt.md or tools.py`.

**Diagnosis:** Check that your skill directory has both required files:

```sh
ls skills/services/
# Expected: prompt.md  tools.py
```

Check that the directory name does not start with `.` or `_` — those are silently skipped by the loader.

Check that the skill directory is under a path listed in `skills.paths` in `config.yaml`:

```yaml
skills:
  paths:
    - ./skills   # relative to where mithai is run from
```

If the path is correct but the skill still does not appear, run `mithai skill validate services` to catch import errors that would otherwise be swallowed.

**Fix:** Ensure `prompt.md` and `tools.py` both exist and are non-empty, and that the path in `config.yaml` points at the parent directory containing the skill folder.
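If you want to script this check, a minimal sketch of the loader's acceptance rules as described above (the helper name `loadable_skills` is ours, not part of mithai):

```python
from pathlib import Path

def loadable_skills(root: str) -> list[str]:
    """Return skill directories the loader would accept: not hidden,
    not underscore-prefixed, and containing both required files."""
    names = []
    for d in sorted(Path(root).iterdir()):
        if not d.is_dir() or d.name.startswith((".", "_")):
            continue  # hidden and _-prefixed dirs are silently skipped
        if (d / "prompt.md").is_file() and (d / "tools.py").is_file():
            names.append(d.name)
    return names
```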
### Tool not appearing in the agent's tool list

**Symptom:** The agent ignores a specific tool or says it cannot perform an action that the tool handles. `--verbose` shows the skill loaded, but the tool is never called.

**Diagnosis:** The tool is not making it into the LLM's context for one of these reasons:

- The tool definition is missing from `TOOLS`. Verify:

  ```python
  # In tools.py — is your tool here?
  TOOLS = [
      {"name": "restart_service", ...},
  ]
  ```

- The `TOOLS` list has a syntax error that truncates it at parse time. Run `mithai skill validate services` — it imports `tools.py` and checks every entry.

- The tool description is ambiguous and the LLM is not selecting it. Improve the `"description"` field to be specific about when to use the tool.

**Fix:** Run `mithai skill validate` and correct any reported errors. If the tool definition is correct but the LLM is not calling it, rewrite the `"description"` to be more explicit, and update `prompt.md` to tell the agent under what conditions to use this tool.
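As an illustration of the description point, compare a vague and a specific definition (both strings are ours, not from the services skill):

```python
# Too vague — the LLM has little signal about when to select this tool:
VAGUE = {"name": "restart_service", "description": "Restarts things."}

# Specific — states what it does, what it needs, and when to use it:
SPECIFIC = {
    "name": "restart_service",
    "description": (
        "Restart a named service in a given environment. Use this when a "
        "service is unhealthy or after a config change. Requires 'service' "
        "and 'environment' inputs; do not use it for deploys."
    ),
}
```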
### handle() returning wrong type

**Symptom:** The agent's response includes raw Python objects, `None`, or an error like `TypeError: Object of type X is not JSON serializable`.

**Diagnosis:** `handle()` must return a `str`. The engine passes this string directly to the LLM as the tool result. Any non-string return causes a serialization error.

Check your `handle` signature:

```python
import json

def handle(name: str, input: dict, ctx: dict) -> str:
    # Must return str in every branch
    if name == "list_services":
        services = ctx.get("config", {}).get("services", {})
        return json.dumps({"services": services})  # correct: str

    # Wrong — returning a dict
    # return {"services": services}

    # Wrong — forgetting the fallthrough
    # (no return means None is returned)

    return json.dumps({"error": f"unknown tool: {name}"})
```

Every code path in `handle()` must return a `str`. The last line handling unknown tool names is a required safety net.

**Fix:** Add `-> str` as the return type annotation and return `json.dumps(...)` in every branch, including the fallthrough case.
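A quick way to exercise every branch without running the agent is a small harness that calls `handle()` directly and asserts the contract (the `check_handle` helper is ours; the toy `handle` mirrors the example above):

```python
import json

def check_handle(handle, cases):
    """Call handle() for each (name, input) pair and assert the contract:
    every branch must return a str that parses as JSON."""
    for name, payload in cases:
        result = handle(name, payload, ctx={})
        assert isinstance(result, str), f"{name} returned {type(result).__name__}"
        json.loads(result)  # raises if the string is not valid JSON

# Toy handle satisfying the contract, including the fallthrough:
def handle(name: str, input: dict, ctx: dict) -> str:
    if name == "list_services":
        return json.dumps({"services": ctx.get("config", {}).get("services", {})})
    return json.dumps({"error": f"unknown tool: {name}"})

check_handle(handle, [("list_services", {}), ("bogus_tool", {})])
```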
### resolve_human not being called

**Symptom:** The agent runs a tool that should require approval without asking. Or the tool always asks for approval when you expected it to auto-execute based on `resolve_human` logic.

**Diagnosis:** `resolve_human` is only called when the tool's `"human"` field is set to `"dynamic"`. Check the tool definition:

```python
TOOLS = [
    {
        "name": "restart_service",
        "description": "...",
        "input_schema": {...},
        "human": "dynamic",  # required for resolve_human to be called
    },
]
```

If `"human"` is set to `"approve"` or omitted, `resolve_human` is never called — the static level is used directly.

Also verify that `resolve_human` is exported at the module level (not nested inside a class or function):

```python
# Correct — top-level function
def resolve_human(name: str, input: dict, ctx: dict) -> str | None:
    ...
```

Run `mithai skill validate services` — it checks that `handle` is callable but does not currently validate `resolve_human`. Add a quick Python import check if in doubt:

```sh
python3 -c "from skills.services.tools import resolve_human; print('ok')"
```

**Fix:** Set `"human": "dynamic"` on the tool definition and ensure `resolve_human` is a top-level function in `tools.py`.
## Agent behavior

### Agent not responding in Slack

**Symptom:** You send a message in Slack and nothing happens. No response, no error visible in Slack.

**Diagnosis** — check in this order:

1. Is the bot running?

   ```sh
   # systemd
   sudo systemctl status mithai

   # Docker
   docker compose ps
   ```

2. Is the bot invited to the channel? The bot must be a member of the channel before it receives messages. In Slack, type `/invite @your-bot-name` in the channel.

3. Is the bot token valid? Run `mithai doctor` — it calls `auth.test` against the Slack API and reports the workspace name on success or the exact error on failure.

4. Is the app token configured for Socket Mode? The Socket Mode adapter (`adapter.types: [slack]`) requires both a bot token (`xoxb-...`) and an app-level token (`xapp-...`). The app-level token must have the `connections:write` scope. Check your app at https://api.slack.com/apps under Basic Information → App-Level Tokens.

5. Check the process logs:

   ```sh
   sudo journalctl -u mithai -n 50
   ```

   Look for `invalid_auth`, `missing_scope`, or connection error messages.
### Agent responding to everything instead of only mentions

**Symptom:** The bot replies to every message in a channel, not just messages that @mention it.

**Diagnosis:** Check `respond` in `config.yaml`:

```yaml
adapter:
  slack:
    respond: mentions   # correct: only reply to @mentions
    # respond: all      # this causes the bot to reply to everything
```

The default when `respond` is omitted is `"all"`. Set it explicitly to `"mentions"` for normal Slack deployments.

**Fix:**

```yaml
adapter:
  slack:
    bot_token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}
    respond: mentions
```

Restart the agent after changing this.
### Agent forgetting context between messages

**Symptom:** The agent does not remember things said earlier in the same thread. Each message is treated as a new conversation.

**Diagnosis:** The agent scopes sessions to `thread_ts` in Slack. If you are messaging at the channel top level (not in a thread), each top-level message starts a new session scoped to `channel_id`. The agent accumulates history within a thread, not across the channel.

Check session storage:

```sh
ls .mithai/state/sessions/
```

If no session files exist, the state backend is not writing to disk. Verify the path is correct and writable:

```yaml
state:
  backend: filesystem
  filesystem:
    path: ./.mithai/state
```

Check the directory permissions:

```sh
ls -la .mithai/state/
```

Also confirm `sessions.max_history` is not set to 0:

```yaml
sessions:
  max_history: 10   # number of turns to replay into context
  max_stored: 50    # number of turns to keep on disk
```

**Fix:** Start a thread by replying to the bot's first message — sessions accumulate within a thread. For cross-thread memory, the agent uses `MEMORY.md` via the memory skill.
### Agent stuck in a tool loop

**Symptom:** The agent calls tools repeatedly without producing a final response. It looks like it is looping — calling the same tool or a sequence of tools over and over.

**Diagnosis:** This happens when:

- A tool is returning an error and the agent retries it. Check what the tool returns — if it returns `{"error": "..."}`, the LLM may attempt the same call with slightly different inputs hoping to succeed.

- A tool is returning data the LLM does not understand and it keeps trying to reinterpret it. Check the JSON shape your tool returns and whether it matches what `prompt.md` describes.

- The LLM is calling a tool to complete a previous tool call's intent (chaining too deep). Check `--verbose` logs to see the full sequence of tool calls.

**Fix options:**

- Fix the underlying tool error so it returns useful results.
- Make `handle()` return clearer error messages that tell the LLM when to stop retrying.
- Add guidance to `prompt.md` telling the agent to stop after N failed attempts.
- Use the `human.overrides` config to force a specific tool to require approval, giving you a chance to interrupt the loop manually:

```yaml
human:
  overrides:
    services__restart_service: approve
```
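One way to make a failing tool break the loop is to return an error string that tells the model explicitly not to retry. The wording and the `hint` field here are our example, not a mithai convention:

```python
import json

def handle(name: str, input: dict, ctx: dict) -> str:
    if name == "restart_service":
        service = input.get("service")
        if service not in ctx.get("config", {}).get("services", {}):
            # Explicit, final wording discourages the LLM from retrying
            # the same call with slightly different inputs.
            return json.dumps({
                "error": f"unknown service '{service}'",
                "hint": "This is a permanent error. Do not retry. "
                        "Ask the user which service they meant.",
            })
        return json.dumps({"status": "restarted", "service": service})
    return json.dumps({"error": f"unknown tool: {name}"})
```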
## Approval issues

### Approval request never appears

**Symptom:** You expect an approval prompt but the tool runs automatically.

**Diagnosis** — check in order:

1. Is the `"human"` field set on the tool?

   ```python
   TOOLS = [
       {
           "name": "restart_service",
           "description": "...",
           "input_schema": {...},
           "human": "approve",  # this must be present
       },
   ]
   ```

   Run `mithai skill validate services` — it reports missing or invalid `"human"` values.

2. Is there a config override removing the approval requirement? Check `config.yaml`:

   ```yaml
   human:
     overrides:
       services__restart_service: null   # null removes the requirement
   ```

3. If using `"human": "dynamic"`, is `resolve_human` returning `None`? Add a temporary print or log statement to `resolve_human` to confirm it is being called and what it returns. The approval is skipped if `resolve_human` returns `None`.

4. Has auto-promote kicked in? After `learning.approval_auto_promote` consecutive approvals (default: 3) with no denials for the same tool+input combination, mithai auto-promotes that specific input to auto-execute. This is stored in `memory/approvals.json`. Check:

   ```sh
   cat memory/approvals.json
   ```

**Fix:** If auto-promote removed an approval you want to keep, delete the relevant entry from `memory/approvals.json`, or set `learning.approval_auto_promote: 0` to disable auto-promotion entirely.
### Approval times out before user responds

**Symptom:** The agent reports the tool was denied even though no one clicked Deny. Logs show a timeout.

**Diagnosis:** The default approval timeout is 300 seconds (5 minutes). Check your config:

```yaml
adapter:
  slack:
    approval_timeout: 300   # seconds
```

If your team needs more time, increase this. Note that the agent thread blocks while waiting for approval — a very long timeout (e.g. 3600) means the agent holds an open connection to Slack for that duration.

**Fix:** Increase `approval_timeout`:

```yaml
adapter:
  slack:
    approval_timeout: 600   # 10 minutes
```

### Approved command runs again with approval next time
**Symptom:** You approved a command, it ran successfully, but the next time the same command runs it still asks for approval. Auto-promote is not working.

**Diagnosis:** Auto-promote requires the same tool and the same normalized input to be approved `approval_auto_promote` times (default: 3) with zero denials. Check:

- Is `learning.enabled: true` in `config.yaml`?
- Is `memory` configured and the memory directory writable?
- Is the approval count in `memory/approvals.json` actually incrementing?

```sh
cat memory/approvals.json
```

You should see something like:

```json
{
  "services__restart_service": {
    "{\"environment\":\"staging\",\"service\":\"auth\"}": {
      "approved": 2,
      "denied": 0
    }
  }
}
```

The input key is the JSON-serialized input dict with sorted keys. If the inputs differ between calls (e.g. different whitespace or key order), each is tracked as a separate entry.

- Is `approval_auto_promote` set to a value other than 3?

```yaml
learning:
  approval_auto_promote: 3   # promote after this many consecutive approvals
```

**Fix:** Verify the memory backend is writing (`cat memory/approvals.json` shows real data), check the `approved` count is incrementing, and confirm the threshold matches your config.
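The stored key format implies inputs are normalized before comparison. Assuming sorted keys and compact separators, which is consistent with the sample entry above, two calls count as "the same input" only if they serialize identically:

```python
import json

def input_key(payload: dict) -> str:
    # Sorted keys + compact separators make key order and whitespace
    # irrelevant, so equivalent dicts produce the same key.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))

a = input_key({"service": "auth", "environment": "staging"})
b = input_key({"environment": "staging", "service": "auth"})
assert a == b == '{"environment":"staging","service":"auth"}'
```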
## Configuration

### Environment variable not substituted (shows ${VAR} literally)

**Symptom:** The agent fails to connect and logs show `invalid_auth` or a similar error. Inspecting the behavior reveals the token is literally the string `${SLACK_BOT_TOKEN}`.

**Diagnosis:** The environment variable is not set in the environment when `mithai run` starts. Check:

```sh
echo $SLACK_BOT_TOKEN
```

If it prints nothing, the variable is not exported. Also check whether a `.env` file is present in the same directory as `config.yaml`:

```sh
ls -la .env
```

mithai calls python-dotenv's `load_dotenv()` on the `.env` file in the config directory at startup. If the file does not exist or is not in the right directory, variables are not loaded.

**Fix:** Create `.env` in the same directory as `config.yaml`:

```sh
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
ANTHROPIC_API_KEY=sk-ant-...
```

Then restart mithai.
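To find every unexpanded placeholder at once, a small script of ours that scans `config.yaml` for `${VAR}` references and reports which are unset in the current environment:

```python
import os
import re
from pathlib import Path

def unset_vars(config_path: str) -> list[str]:
    """Return the names of ${VAR} placeholders in the config whose
    environment variables are missing or empty."""
    text = Path(config_path).read_text()
    referenced = set(re.findall(r"\$\{([A-Z0-9_]+)\}", text))
    return sorted(v for v in referenced if not os.environ.get(v))
```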
### config.yaml parse error

**Symptom:** `mithai run` exits immediately with a YAML parse error or `Config validation failed`.

**Diagnosis:** YAML is whitespace-sensitive. Common mistakes:

- Tabs instead of spaces (YAML requires spaces)
- Inconsistent indentation
- Unquoted strings containing `:` or `#`
- A multiline `system_prompt` missing the `|` block scalar indicator

Test the file standalone:

```sh
python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"
```

If it raises a `ScannerError`, it reports the line number. If it parses fine, the error is a schema validation failure — run `mithai doctor` for a clearer error message.

**Fix:** Correct the YAML at the reported line. For multiline strings, use the block scalar:

```yaml
bot:
  system_prompt: |
    You are a helpful operations assistant.
    You have access to skills that let you interact with infrastructure.
```

### Skill config not reaching ctx["config"]
**Symptom:** `ctx.get("config", {})` returns `{}` in your skill, but you have config defined in `config.yaml`.

**Diagnosis:** The `config` key in `ctx` is populated from `skills.config.<skill_name>`. The skill name must match the directory name exactly.

Check `config.yaml`:

```yaml
skills:
  config:
    services:   # this key must match the skill directory name
      services:
        checkout:
          url: https://...
```

Check the skill directory name:

```sh
ls skills/
```

If the skill is named `service` (singular) but the config key is `services` (plural), `ctx["config"]` will be `{}`.

**Fix:** Make the key under `skills.config` match the skill directory name exactly.
## Memory

### memory_write failing with path validation error

**Symptom:** The memory skill's `memory_write` tool returns an error like `Invalid path: ../etc/passwd` or `ValueError: Invalid path`.

**Diagnosis:** The memory backend enforces a path traversal guard. Any path that resolves outside the memory root directory is rejected. This includes paths starting with `../`, absolute paths starting with `/`, and any path that escapes the root after normalization.

Valid paths:

- `MEMORY.md`
- `daily/2024-01-15.md`
- `notes/services.md`

Invalid paths:

- `../config.yaml`
- `/etc/passwd`
- `../../secrets`

**Fix:** Use relative paths that stay within the memory root. The memory root is set by `learning.memory.filesystem.path` in `config.yaml` (default: `./memory`).
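The guard described above can be sketched like this. This is our approximation of the behavior, not mithai's actual implementation:

```python
from pathlib import Path

def safe_resolve(root: str, rel: str) -> Path:
    """Resolve rel against root; reject anything that escapes the root."""
    base = Path(root).resolve()
    target = (base / rel).resolve()   # absolute rel replaces base entirely
    if not target.is_relative_to(base):   # Python 3.9+
        raise ValueError(f"Invalid path: {rel}")
    return target
```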
### Memory not loading into system prompt

**Symptom:** The agent does not remember facts you told it to remember. `MEMORY.md` exists on disk but the agent behaves as if it has never seen it.

**Diagnosis:** Check that `learning.enabled: true` and the memory backend is correctly configured:

```yaml
learning:
  enabled: true
  memory:
    backend: filesystem
    filesystem:
      path: ./memory
```

Verify the file exists and is non-empty:

```sh
cat memory/MEMORY.md
```

With `--verbose`, look for the memory injection in the startup logs. The engine calls `self._memory.read("MEMORY.md")` on every request. If it returns `None` or an empty string, nothing is injected.

Check that the memory path is absolute or correctly relative to the working directory from which `mithai run` is executed. If you run `mithai run` from `/home/user/mybot/` but the config says `path: ./memory`, the backend expects `/home/user/mybot/memory/`.

**Fix:** Run `mithai run` from the project root (where `config.yaml` lives), or use an absolute path:

```yaml
learning:
  memory:
    backend: filesystem
    filesystem:
      path: /var/lib/mithai/memory
```