2026-05-05
8 min read

An AI On-Call Engineer: Multi-Phase Triage with Grafana and GitHub MCPs

When a microservice goes down at 3am, the slow part isn't fixing it — it's figuring out which service is actually broken. We built an LLM agent that does the triage.

Agents · MCP · On-Call · Observability

The 3am page is bad for a specific reason: most of the time you spend on it is not fixing the bug. It is figuring out which of fifteen microservices is actually broken.

You stare at Grafana. You scroll through Loki. You try to remember what got deployed yesterday. You check three Slack channels. By the time you have a hypothesis, twenty minutes have gone by and you haven't typed a single line of code.

We built an AI agent that does that twenty-minute triage in about ninety seconds. This post is about how.

The Stack It's Triaging

Mynaksh's backend is a Node.js / Express microservice setup — a couple of dozen services, each with their own deploy cadence and ownership. AI workloads (the AI Astrologer, the personalization layer, the embedding/retrieval pipeline) live in a separate Python service that the Node services call into when they need LLM-shaped work done.

That two-language split matters for triage. When something is wrong, the question isn't just "which service" — it's also "which side of the Python/Node boundary." A latency spike in the user-facing API could be the Node layer, the Python AI service, the network between them, or any downstream dependency.

Twenty services × two stacks × dozens of recent deploys × thousands of log lines is a lot of search space for a paged engineer at 3am.

The Old Triage Process

Pre-agent, the on-call runbook was:

  1. Check the alert. Figure out which symptom triggered it.
  2. Open Grafana. Look at the relevant service's metrics dashboard. Check error rates, latency, throughput.
  3. If something looks off, switch to Loki. Query logs for the affected service in the time window. Scroll, filter, cross-reference.
  4. Open GitHub. Check what shipped recently. Look at PRs merged in the last 24 hours that touched suspect services.
  5. Form a hypothesis. Write it in the incident channel. Start fixing or escalating.

Steps 2-4 take time. They also require remembering URL paths, dashboard names, log query syntax, and which repo a service lives in. Half of the cognitive load is the tooling, not the actual reasoning.

This was the obvious thing to automate.

The Agent

The agent is an LLM with structured access to two MCP servers:

  • Grafana Loki MCP — for log queries against any service, with filters on time range, level, and search strings.
  • GitHub MCP — for repo state, recent commits, recent PRs, and file contents.

Plus a small set of internal tools: alert intake, service registry lookup, and a structured "post hypothesis" tool that writes the agent's conclusion to the incident channel.

The agent is not autonomous in the "auto-merge a fix" sense. It is autonomous in the "do the triage, hand a human the verdict" sense. The human still drives the response. The agent just removes the twenty minutes of grunt work that come before the actual thinking.

The Multi-Phase Approach

A single LLM call doesn't work for this. The search space is too big to stuff into one prompt, and the reasoning has to react to what each query returns. We use a multi-phase loop instead.
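
A minimal sketch of that loop, with hypothetical helper names (each phase is described in detail below); the point is that every phase is its own LLM call with its own tools, and it only sees the structured output of the phase before it:

```python
# Sketch of the multi-phase loop. All run_* / post_* names are hypothetical;
# the phase functions are passed in so the pipeline itself stays a plain function.
def triage(alert: dict, run_symptom_intake, run_log_investigation,
           run_change_correlation, post_hypothesis) -> dict:
    incident = run_symptom_intake(alert)            # Phase 1: confirm symptom, bound the search
    suspects = run_log_investigation(incident)      # Phase 2: Loki MCP, budgeted queries
    hypotheses = run_change_correlation(suspects)   # Phase 3: GitHub MCP + service registry
    return post_hypothesis(incident, hypotheses)    # Phase 4: write-up for the incident channel
```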

Phase 1: Symptom Intake

The alert comes in with structured data: which service triggered the alert, what metric, what threshold, what time window. The agent's first step is to confirm the symptom and bound the search.

Output of phase 1: a structured incident summary — affected service(s), metric(s), time window, severity.

This phase is mostly mechanical. The LLM is doing summarization, not investigation.
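
As a rough illustration, the phase-1 output looks something like the dataclass below. Field names are illustrative, not our exact schema; the only thing that matters is that later phases consume this summary rather than the raw alert.

```python
from dataclasses import dataclass

# Illustrative shape of the phase-1 incident summary (field names assumed).
@dataclass
class IncidentSummary:
    affected_services: list[str]   # e.g. the service the alert fired for
    metrics: list[str]             # which metric(s) crossed the threshold
    window_start: str              # ISO-8601 timestamps taken from the alert
    window_end: str
    severity: str                  # straight from the alerting rule
    symptom: str = ""              # one-line restatement of what fired and why
```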

Phase 2: Log Investigation

Now the agent reaches for the Loki MCP.

It queries logs for the affected service across the alert's time window. It looks for error patterns, stack traces, repeated failures, timeouts. It is allowed to issue follow-up queries — if it finds a stack trace pointing at a downstream service, it queries that service's logs too.

The interesting design decision was bounded autonomy. The agent gets a budget of 12 Loki queries per incident, weighted by query cost: a tight 5-minute single-service query costs 1 unit; a 1-hour multi-service query costs 4. The budget is enforced by a wrapper around the MCP tool call. The cost function is in the system prompt — the LLM is told upfront what each query type costs and asked to plan its investigation accordingly. This sounds elaborate; it took maybe a day to build and saves the agent from spending half its budget on broad "show me everything" queries that don't pinpoint anything.

Output of phase 2: a candidate set of suspect services and the evidence (specific log lines, error patterns, frequency) that pointed at them.

Phase 3: Change Correlation

Now the GitHub MCP.

For each suspect service, the agent pulls recent commits and PRs — anything that landed in the last 24-48 hours. It looks for changes that plausibly explain the failure mode it found in logs. If logs say "TypeError: cannot read property X of undefined" and a PR yesterday touched the file path that's in the stack trace, that's a strong candidate.

This is where the Node/Python split becomes relevant. Errors that originate in the Python AI service show up in the Node service's logs as upstream timeouts or 5xx responses. The agent has to know to follow the trail across the boundary — query the Python repo's recent changes when the Node service is reporting upstream failures from the AI service's address.

A small internal service registry tells the agent which services depend on which. The registry is hand-maintained but cheap — each service's owner annotates its service.yaml with a depends_on list. The agent reads this through a tool call, builds a dependency graph in its head, and walks the graph from symptom toward root. So a Node service reporting upstream timeouts from ai-personalization-svc (the Python service) prompts the agent to check ai-personalization-svc's logs and recent commits next, even though the original alert pointed at the Node service.
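
A sketch of that walk, assuming the registry has already been flattened into a name → depends_on mapping read from each service.yaml (service names other than ai-personalization-svc are made up for the example):

```python
from collections import deque

# Sketch of the dependency walk. The registry access and service names besides
# ai-personalization-svc are illustrative.
def downstream_of(service: str, registry: dict[str, list[str]]) -> list[str]:
    """registry maps service name -> its depends_on list; returns BFS order from the symptom."""
    seen, order, queue = {service}, [], deque([service])
    while queue:
        current = queue.popleft()
        for dep in registry.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

# The alert pointed at the Node service, but the walk surfaces the Python AI service too:
registry = {
    "user-api": ["ai-personalization-svc", "billing-svc"],
    "ai-personalization-svc": ["embedding-svc"],
}
print(downstream_of("user-api", registry))
# ['ai-personalization-svc', 'billing-svc', 'embedding-svc']
```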

Output of phase 3: ranked hypotheses about likely root cause, with citations to specific commits/PRs and log lines.

Phase 4: Hypothesis Post

The agent writes a structured incident hypothesis to the incident channel:

  • Most likely affected service(s)
  • Most likely root cause
  • Citations: log lines, commits, PRs
  • Confidence (calibrated, not always "high")
  • Suggested first action for the on-call

The on-call reads this and drives the response. The agent does not page anyone, does not roll back deploys, does not touch infrastructure. Hypothesis only.
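
For illustration, the payload the "post hypothesis" tool accepts is roughly this shape (field names are assumed, not our exact schema); the constraint that matters is that every claim carries a citation the on-call can check in one click.

```python
from dataclasses import dataclass

# Illustrative shape of the hypothesis the agent posts to the incident channel.
@dataclass
class Hypothesis:
    affected_services: list[str]
    root_cause: str              # one paragraph, plain language
    log_citations: list[str]     # representative log lines / Loki query links
    change_citations: list[str]  # commit SHAs and PR URLs that plausibly explain the failure
    confidence: str              # "low" | "medium" | "high", calibrated
    first_action: str            # suggested first step for the on-call
```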

Why MCPs and Not Custom Tools

The MCP-based approach was a deliberate call. We considered building bespoke tools — direct Loki client, direct GitHub API client — and decided against it for two reasons:

  • Standardization. MCP gives us a consistent interface for the agent to discover tool capabilities, retrieve schemas, and handle errors. Adding a new data source (PagerDuty, the metrics store, etc.) is "stand up another MCP server" rather than "write more glue."
  • Tool reuse. The same MCP servers we built for the on-call agent are now used by other internal tooling — engineers can ask the agent ad-hoc questions about logs and recent changes outside of incidents.

The downside is latency — MCP adds a layer of indirection on every tool call. For our use case (3am triage where 90 seconds is "fast"), it doesn't matter.

What It Gets Wrong

The agent is good at the obvious cases. It is mediocre on the genuinely subtle ones.

A representative example of an incident class the agent misses: a config change that becomes a problem only when traffic crosses a threshold. We changed a connection-pool size in a Python service two weeks before an incident; the change was fine at the time. Then traffic grew, the pool got saturated, and we started seeing intermittent timeouts. The agent looked at recent changes, saw nothing in the last 48 hours, and concluded "no recent change explains this — likely environmental." It was technically right (no recent change), but unhelpful.

We've since added a tool that lets the agent query for any config changes touching the suspect service in the last 30 days, with a higher score weight if the change was to a tunable parameter. That helps but doesn't fully solve the class. The hard cases remain hard.
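
A sketch of that lookback, with an assumed wrapper over the GitHub MCP commit listing and a deliberately crude "tunable parameter" heuristic:

```python
# Sketch of the 30-day config-change lookback. list_commits is an assumed wrapper
# around the GitHub MCP commit listing, and the tunable-parameter heuristic is
# intentionally simple: changes to things like pool sizes and timeouts rank higher.

TUNABLE_HINTS = ("pool", "timeout", "limit", "concurrency", "batch", "ttl")

def score_config_change(path: str, diff: str) -> float:
    score = 1.0
    if any(hint in path.lower() or hint in diff.lower() for hint in TUNABLE_HINTS):
        score += 1.0               # tunable parameters get a higher weight
    return score

def recent_config_changes(repo: str, list_commits, days: int = 30) -> list[dict]:
    """Keep only commits touching config-like files in the last `days` days, highest score first."""
    changes = []
    for commit in list_commits(repo=repo, since_days=days):
        for f in commit["files"]:
            if f["path"].endswith((".yaml", ".yml", ".toml", ".env", ".json")):
                changes.append({
                    "sha": commit["sha"],
                    "path": f["path"],
                    "score": score_config_change(f["path"], f.get("patch", "")),
                })
    return sorted(changes, key=lambda c: c["score"], reverse=True)
```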

We mitigate by having the agent always include a confidence rating, and the on-call always treats low-confidence verdicts as "look at this, maybe" rather than "this is the answer."

The Human Handoff Is the Product

The most important thing the agent does isn't pinpoint root cause. It's hand the on-call a complete triage packet — service, evidence, recent changes, log citations — so that the human's first ten minutes of incident response can be reasoning instead of clicking through dashboards.

That handoff is the actual moat. The agent isn't smarter than a senior engineer. It's faster at the legwork that a senior engineer used to do before they were allowed to start being smart.

What's Next

Three things on the roadmap:

  • Metrics MCP. Currently the agent only reads logs and code. Adding the metrics store as a third MCP source would let it correlate "logs say X" with "the metric for X spiked at this time."
  • Multi-service incident reasoning. Right now the agent is good at single-service incidents and mediocre at incidents that span the call graph. Better graph-walking and parallel investigation would help.
  • Post-incident summarization. A separate agent run after the incident closes that drafts a retrospective from the incident channel transcript, the agent's hypothesis, and what actually fixed it. Saves the on-call from writing it cold.