# Troubleshooting
This guide covers the most common problems with a running mithai agent: skill loading failures, unexpected agent behavior, approval flow problems, configuration issues, and memory failures. Each section shows how to diagnose the problem and fix it.
## Checking logs

Run with `--verbose` to see debug-level output including skill loading, config parsing, LLM calls, tool routing, and approval decisions:

```sh
mithai run --verbose
```

or for the chat interface:

```sh
mithai chat --verbose
```

Key patterns to look for:
| Log message | Means |
|---|---|
| `Loaded skill: services (3 tools)` | Skill loaded successfully |
| `Skill services: missing TOOLS export` | `tools.py` does not define `TOOLS` |
| `Skill services: missing handle() function` | `tools.py` does not define `handle` |
| `Skipping services: missing prompt.md or tools.py` | Directory is incomplete |
| `Human MCP: requesting approve for services__restart_service` | Approval gate triggered |
| `Executing tool: services__restart_service` | Tool approved and running |
| `Tool denied by human: services__restart_service` | User clicked Deny |
| `LLM response: stop_reason=tool_use` | LLM called a tool |
| `LLM response: stop_reason=end_turn` | LLM finished responding |
For production deployments running under systemd:

```sh
sudo journalctl -u mithai -f
```

## Validating your setup
### mithai skill validate

Checks all skills in `./skills/` for structural correctness — presence of `prompt.md` and `tools.py`, a valid `TOOLS` schema, the required `handle` function, and valid `"human"` levels:

```sh
mithai skill validate
```

Validate a single skill:

```sh
mithai skill validate services
```

A passing skill reports:

```
✓ services: OK (3 tools)
```

A failing skill reports the exact errors:

```
✗ services: FAILED
  - Tool 2: missing 'description'
  - tools.py does not export handle() function
```

### mithai doctor

Runs a full connectivity check — LLM API, Slack/Telegram tokens, kubectl clusters if configured, skill loading, and filesystem permissions:

```sh
mithai doctor
```

The exit code is 0 when all checks pass, 1 if any fail. Use this in a deployment pipeline after config changes.
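Since `mithai doctor` signals pass/fail through its exit code, it slots naturally into a CI job. A hypothetical pipeline step (the step name and workflow syntax are our assumption; mithai does not ship this file):

```yaml
# Hypothetical deploy-pipeline step: fail the job if any skill is
# malformed or any connectivity check fails after a config change.
- name: Validate mithai setup
  run: |
    mithai skill validate
    mithai doctor
```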
## Skill issues

### Skill not loading

**Symptom:** The agent does not use a skill you added. `mithai skill list` does not show it. With `--verbose`, you see `Skipping services: missing prompt.md or tools.py`.

**Diagnosis:** Check that your skill directory has both required files:

```sh
ls skills/services/
# Expected: prompt.md  tools.py
```

Check that the directory name does not start with `.` or `_` — those are silently skipped by the loader.

Check that the skill directory is under a path listed in `skills.paths` in `config.yaml`:

```yaml
skills:
  paths:
    - ./skills   # relative to where mithai is run from
```

If the path is correct but the skill still does not appear, run `mithai skill validate services` to catch import errors that would otherwise be swallowed.

**Fix:** Ensure `prompt.md` and `tools.py` both exist and are non-empty, and that the path in `config.yaml` points at the parent directory containing the skill folder.
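If you want to script this check, a minimal sketch of the loader's acceptance rules as described above (the helper name `loadable_skills` is ours, not part of mithai):

```python
from pathlib import Path

def loadable_skills(root: str) -> list[str]:
    """Return skill directories the loader would accept: not hidden,
    not underscore-prefixed, and containing both required files."""
    names = []
    for d in sorted(Path(root).iterdir()):
        if not d.is_dir() or d.name.startswith((".", "_")):
            continue  # hidden and _-prefixed dirs are silently skipped
        if (d / "prompt.md").is_file() and (d / "tools.py").is_file():
            names.append(d.name)
    return names
```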
### Tool not appearing in the agent's tool list

**Symptom:** The agent ignores a specific tool or says it cannot perform an action that the tool handles. `--verbose` shows the skill loaded, but the tool is never called.

**Diagnosis:** The tool is not making it into the LLM's context for one of these reasons:

- The tool definition is missing from `TOOLS`. Verify:

  ```python
  # In tools.py — is your tool here?
  TOOLS = [
      {"name": "restart_service", ...},
  ]
  ```

- The `TOOLS` list has a syntax error that truncates it at parse time. Run `mithai skill validate services` — it imports `tools.py` and checks every entry.

- The tool description is ambiguous and the LLM is not selecting it. Improve the `"description"` field to be specific about when to use the tool.

**Fix:** Run `mithai skill validate` and correct any reported errors. If the tool definition is correct but the LLM is not calling it, rewrite the `"description"` to be more explicit, and update `prompt.md` to tell the agent under what conditions to use this tool.
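As an illustration of the description point, compare a vague and a specific definition (both strings are ours, not from the services skill):

```python
# Too vague — the LLM has little signal about when to select this tool:
VAGUE = {"name": "restart_service", "description": "Restarts things."}

# Specific — states what it does, what it needs, and when to use it:
SPECIFIC = {
    "name": "restart_service",
    "description": (
        "Restart a named service in a given environment. Use this when a "
        "service is unhealthy or after a config change. Requires 'service' "
        "and 'environment' inputs; do not use it for deploys."
    ),
}
```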
### handle() returning wrong type

**Symptom:** The agent's response includes raw Python objects, `None`, or an error like `TypeError: Object of type X is not JSON serializable`.

**Diagnosis:** `handle()` must return a `str`. The engine passes this string directly to the LLM as the tool result. Any non-string return causes a serialization error.

Check your `handle` signature:

```python
import json

def handle(name: str, input: dict, ctx: dict) -> str:
    # Must return str in every branch
    if name == "list_services":
        services = ctx.get("config", {}).get("services", {})
        return json.dumps({"services": services})  # correct: str

    # Wrong — returning a dict
    # return {"services": services}

    # Wrong — forgetting the fallthrough
    # (no return means None is returned)

    return json.dumps({"error": f"unknown tool: {name}"})
```

Every code path in `handle()` must return a `str`. The last line handling unknown tool names is a required safety net.

**Fix:** Add `-> str` as the return type annotation and return `json.dumps(...)` in every branch, including the fallthrough case.
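A quick way to exercise every branch without running the agent is a small harness that calls `handle()` directly and asserts the contract (the `check_handle` helper is ours; the toy `handle` mirrors the example above):

```python
import json

def check_handle(handle, cases):
    """Call handle() for each (name, input) pair and assert the contract:
    every branch must return a str that parses as JSON."""
    for name, payload in cases:
        result = handle(name, payload, ctx={})
        assert isinstance(result, str), f"{name} returned {type(result).__name__}"
        json.loads(result)  # raises if the string is not valid JSON

# Toy handle satisfying the contract, including the fallthrough:
def handle(name: str, input: dict, ctx: dict) -> str:
    if name == "list_services":
        return json.dumps({"services": ctx.get("config", {}).get("services", {})})
    return json.dumps({"error": f"unknown tool: {name}"})

check_handle(handle, [("list_services", {}), ("bogus_tool", {})])
```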
### resolve_human not being called

**Symptom:** The agent runs a tool that should require approval without asking. Or the tool always asks for approval when you expected it to auto-execute based on `resolve_human` logic.

**Diagnosis:** `resolve_human` is only called when the tool's `"human"` field is set to `"dynamic"`. Check the tool definition:

```python
TOOLS = [
    {
        "name": "restart_service",
        "description": "...",
        "input_schema": {...},
        "human": "dynamic",  # required for resolve_human to be called
    },
]
```

If `"human"` is set to `"approve"` or omitted, `resolve_human` is never called — the static level is used directly.

Also verify that `resolve_human` is exported at the module level (not nested inside a class or function):

```python
# Correct — top-level function
def resolve_human(name: str, input: dict, ctx: dict) -> str | None:
    ...
```

Run `mithai skill validate services` — it checks that `handle` is callable but does not currently validate `resolve_human`. Add a quick Python import check if in doubt:

```sh
python3 -c "from skills.services.tools import resolve_human; print('ok')"
```

**Fix:** Set `"human": "dynamic"` on the tool definition and ensure `resolve_human` is a top-level function in `tools.py`.
## Agent behavior

### Agent not responding in Slack

**Symptom:** You send a message in Slack and nothing happens. No response, no error visible in Slack.

**Diagnosis** — check in this order:

1. Is the bot running?

   ```sh
   # systemd
   sudo systemctl status mithai

   # Docker
   docker compose ps
   ```

2. Is the bot invited to the channel? The bot must be a member of the channel before it receives messages. In Slack, type `/invite @your-bot-name` in the channel.

3. Is the bot token valid? Run `mithai doctor` — it calls `auth.test` against the Slack API and reports the workspace name on success or the exact error on failure.

4. Is the app token configured for Socket Mode? The Socket Mode adapter (`adapter.types: [slack]`) requires both a bot token (`xoxb-...`) and an app-level token (`xapp-...`). The app-level token must have the `connections:write` scope. Check your app at https://api.slack.com/apps under Basic Information → App-Level Tokens.

5. Check the process logs:

   ```sh
   sudo journalctl -u mithai -n 50
   ```

   Look for `invalid_auth`, `missing_scope`, or connection error messages.
### Agent responding to everything instead of only mentions

**Symptom:** The bot replies to every message in a channel, not just messages that @mention it.

**Diagnosis:** Check `respond` in `config.yaml`:

```yaml
adapter:
  slack:
    respond: mentions   # correct: only reply to @mentions
    # respond: all      # this causes the bot to reply to everything
```

The default when `respond` is omitted is `"all"`. Set it explicitly to `"mentions"` for normal Slack deployments.

**Fix:**

```yaml
adapter:
  slack:
    bot_token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}
    respond: mentions
```

Restart the agent after changing this.
### Agent forgetting context between messages

**Symptom:** The agent does not remember things said earlier in the same thread. Each message is treated as a new conversation.

**Diagnosis:** The agent scopes sessions to `thread_ts` in Slack. If you are messaging at the channel top level (not in a thread), each top-level message starts a new session scoped to `channel_id`. The agent accumulates history within a thread, not across the channel.

Check session storage:

```sh
ls .mithai/state/sessions/
```

If no session files exist, the state backend is not writing to disk. Verify the path is correct and writable:

```yaml
state:
  backend: filesystem
  filesystem:
    path: ./.mithai/state
```

Check the directory permissions:

```sh
ls -la .mithai/state/
```

Also confirm `sessions.max_history` is not set to 0:

```yaml
sessions:
  max_history: 10   # number of turns to replay into context
  max_stored: 50    # number of turns to keep on disk
```

**Fix:** Start a thread by replying to the bot's first message — sessions accumulate within a thread. For cross-thread memory, the agent uses `MEMORY.md` via the memory skill.
### Agent stuck in a tool loop

**Symptom:** The agent calls tools repeatedly without producing a final response. It looks like it is looping — calling the same tool or a sequence of tools over and over.

**Diagnosis:** This happens when:

- A tool is returning an error and the agent retries it. Check what the tool returns — if it returns `{"error": "..."}`, the LLM may attempt the same call with slightly different inputs hoping to succeed.

- A tool is returning data the LLM does not understand and it keeps trying to reinterpret it. Check the JSON shape your tool returns and whether it matches what `prompt.md` describes.

- The LLM is calling a tool to complete a previous tool call's intent (chaining too deep). Check `--verbose` logs to see the full sequence of tool calls.

**Fix options:**

- Fix the underlying tool error so it returns useful results.
- Make `handle()` return clearer error messages that tell the LLM when to stop retrying.
- Add guidance to `prompt.md` telling the agent to stop after N failed attempts.
- Use the `human.overrides` config to force a specific tool to require approval, giving you a chance to interrupt the loop manually:

```yaml
human:
  overrides:
    services__restart_service: approve
```
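One way to make a failing tool break the loop is to return an error string that tells the model explicitly not to retry. The wording and the `hint` field here are our example, not a mithai convention:

```python
import json

def handle(name: str, input: dict, ctx: dict) -> str:
    if name == "restart_service":
        service = input.get("service")
        if service not in ctx.get("config", {}).get("services", {}):
            # Explicit, final wording discourages the LLM from retrying
            # the same call with slightly different inputs.
            return json.dumps({
                "error": f"unknown service '{service}'",
                "hint": "This is a permanent error. Do not retry. "
                        "Ask the user which service they meant.",
            })
        return json.dumps({"status": "restarted", "service": service})
    return json.dumps({"error": f"unknown tool: {name}"})
```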
## Approval issues

### Approval request never appears

**Symptom:** You expect an approval prompt but the tool runs automatically.

**Diagnosis** — check in order:

1. Is the `"human"` field set on the tool?

   ```python
   TOOLS = [
       {
           "name": "restart_service",
           "description": "...",
           "input_schema": {...},
           "human": "approve",  # this must be present
       },
   ]
   ```

   Run `mithai skill validate services` — it reports missing or invalid `"human"` values.

2. Is there a config override removing the approval requirement? Check `config.yaml`:

   ```yaml
   human:
     overrides:
       services__restart_service: null   # null removes the requirement
   ```

3. If using `"human": "dynamic"`, is `resolve_human` returning `None`? Add a temporary print or log statement to `resolve_human` to confirm it is being called and what it returns. The approval is skipped if `resolve_human` returns `None`.

4. Has auto-promote kicked in? After `learning.approval_auto_promote` consecutive approvals (default: 3) with no denials for the same tool+input combination, mithai auto-promotes that specific input to auto-execute. This is stored in `memory/approvals.json`. Check:

   ```sh
   cat memory/approvals.json
   ```

**Fix:** If auto-promote removed an approval you want to keep, delete the relevant entry from `memory/approvals.json`, or set `learning.approval_auto_promote: 0` to disable auto-promotion entirely.
### Approval times out before user responds

**Symptom:** The agent reports the tool was denied even though no one clicked Deny. Logs show a timeout.

**Diagnosis:** The default approval timeout is 300 seconds (5 minutes). Check your config:

```yaml
adapter:
  slack:
    approval_timeout: 300   # seconds
```

If your team needs more time, increase this. Note that the agent thread blocks while waiting for approval — a very long timeout (e.g. 3600) means the agent holds an open connection to Slack for that duration.

**Fix:** Increase `approval_timeout`:

```yaml
adapter:
  slack:
    approval_timeout: 600   # 10 minutes
```

### Approved command runs again with approval next time
**Symptom:** You approved a command, it ran successfully, but the next time the same command runs it still asks for approval. Auto-promote is not working.

**Diagnosis:** Auto-promote requires the same tool and the same normalized input to be approved `approval_auto_promote` times (default: 3) with zero denials. Check:

- Is `learning.enabled: true` in `config.yaml`?
- Is `memory` configured and the memory directory writable?
- Is the approval count in `memory/approvals.json` actually incrementing?

```sh
cat memory/approvals.json
```

You should see something like:

```json
{
  "services__restart_service": {
    "{\"environment\":\"staging\",\"service\":\"auth\"}": {
      "approved": 2,
      "denied": 0
    }
  }
}
```

The input key is the JSON-serialized input dict with sorted keys. If the inputs differ between calls (e.g. different whitespace or key order), each is tracked as a separate entry.

- Is `approval_auto_promote` set to a value other than 3?

```yaml
learning:
  approval_auto_promote: 3   # promote after this many consecutive approvals
```

**Fix:** Verify the memory backend is writing (`cat memory/approvals.json` shows real data), check the `approved` count is incrementing, and confirm the threshold matches your config.
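The stored key format implies inputs are normalized before comparison. Assuming sorted keys and compact separators, which is consistent with the sample entry above, two calls count as "the same input" only if they serialize identically:

```python
import json

def input_key(payload: dict) -> str:
    # Sorted keys + compact separators make key order and whitespace
    # irrelevant, so equivalent dicts produce the same key.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))

a = input_key({"service": "auth", "environment": "staging"})
b = input_key({"environment": "staging", "service": "auth"})
assert a == b == '{"environment":"staging","service":"auth"}'
```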
## Configuration

### Environment variable not substituted (shows ${VAR} literally)

**Symptom:** The agent fails to connect and logs show `invalid_auth` or a similar error. Inspecting the behavior reveals the token is literally the string `${SLACK_BOT_TOKEN}`.

**Diagnosis:** The environment variable is not set in the environment when `mithai run` starts. Check:

```sh
echo $SLACK_BOT_TOKEN
```

If it prints nothing, the variable is not exported. Also check whether a `.env` file is present in the same directory as `config.yaml`:

```sh
ls -la .env
```

mithai calls python-dotenv's `load_dotenv()` on the `.env` file in the config directory at startup. If the file does not exist or is not in the right directory, variables are not loaded.

**Fix:** Create `.env` in the same directory as `config.yaml`:

```sh
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
ANTHROPIC_API_KEY=sk-ant-...
```

Then restart mithai.
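To find every unexpanded placeholder at once, a small script of ours that scans `config.yaml` for `${VAR}` references and reports which are unset in the current environment:

```python
import os
import re
from pathlib import Path

def unset_vars(config_path: str) -> list[str]:
    """Return the names of ${VAR} placeholders in the config whose
    environment variables are missing or empty."""
    text = Path(config_path).read_text()
    referenced = set(re.findall(r"\$\{([A-Z0-9_]+)\}", text))
    return sorted(v for v in referenced if not os.environ.get(v))
```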
### config.yaml parse error

**Symptom:** `mithai run` exits immediately with a YAML parse error or `Config validation failed`.

**Diagnosis:** YAML is whitespace-sensitive. Common mistakes:

- Tabs instead of spaces (YAML requires spaces)
- Inconsistent indentation
- Unquoted strings containing `:` or `#`
- A multiline `system_prompt` missing the `|` block scalar indicator

Test the file standalone:

```sh
python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"
```

If it raises a `ScannerError`, it reports the line number. If it parses fine, the error is a schema validation failure — run `mithai doctor` for a clearer error message.

**Fix:** Correct the YAML at the reported line. For multiline strings, use the block scalar:

```yaml
bot:
  system_prompt: |
    You are a helpful operations assistant.
    You have access to skills that let you interact with infrastructure.
```

### Skill config not reaching ctx["config"]
**Symptom:** `ctx.get("config", {})` returns `{}` in your skill, but you have config defined in `config.yaml`.

**Diagnosis:** The `config` key in `ctx` is populated from `skills.config.<skill_name>`. The skill name must match the directory name exactly.

Check `config.yaml`:

```yaml
skills:
  config:
    services:   # this key must match the skill directory name
      services:
        checkout:
          url: https://...
```

Check the skill directory name:

```sh
ls skills/
```

If the skill is named `service` (singular) but the config key is `services` (plural), `ctx["config"]` will be `{}`.

**Fix:** Make the key under `skills.config` match the skill directory name exactly.
## Memory

### memory_write failing with path validation error

**Symptom:** The memory skill's `memory_write` tool returns an error like `Invalid path: ../etc/passwd` or `ValueError: Invalid path`.

**Diagnosis:** The memory backend enforces a path traversal guard. Any path that resolves outside the memory root directory is rejected. This includes paths starting with `../`, absolute paths starting with `/`, and any path that escapes the root after normalization.

Valid paths:

- `MEMORY.md`
- `daily/2024-01-15.md`
- `notes/services.md`

Invalid paths:

- `../config.yaml`
- `/etc/passwd`
- `../../secrets`

**Fix:** Use relative paths that stay within the memory root. The memory root is set by `learning.memory.filesystem.path` in `config.yaml` (default: `./memory`).
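The guard described above can be sketched like this. This is our approximation of the behavior, not mithai's actual implementation:

```python
from pathlib import Path

def safe_resolve(root: str, rel: str) -> Path:
    """Resolve rel against root; reject anything that escapes the root."""
    base = Path(root).resolve()
    target = (base / rel).resolve()   # absolute rel replaces base entirely
    if not target.is_relative_to(base):   # Python 3.9+
        raise ValueError(f"Invalid path: {rel}")
    return target
```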
### Memory not loading into system prompt

**Symptom:** The agent does not remember facts you told it to remember. `MEMORY.md` exists on disk but the agent behaves as if it has never seen it.

**Diagnosis:** Check that `learning.enabled: true` and the memory backend is correctly configured:

```yaml
learning:
  enabled: true
  memory:
    backend: filesystem
    filesystem:
      path: ./memory
```

Verify the file exists and is non-empty:

```sh
cat memory/MEMORY.md
```

With `--verbose`, look for the memory injection in the startup logs. The engine calls `self._memory.read("MEMORY.md")` on every request. If it returns `None` or an empty string, nothing is injected.

Check that the memory path is absolute or correctly relative to the working directory from which `mithai run` is executed. If you run `mithai run` from `/home/user/mybot/` but the config says `path: ./memory`, the backend expects `/home/user/mybot/memory/`.

**Fix:** Run `mithai run` from the project root (where `config.yaml` lives), or use an absolute path:

```yaml
learning:
  memory:
    backend: filesystem
    filesystem:
      path: /var/lib/mithai/memory
```