abhinavagarwal.me
·
observability engineering-leadership kafka-connect

What I learned migrating 85 monitors before a $6M deadline

An observability platform migration is mostly a sociotechnical problem dressed up as a tooling one. Notes from leading one across a 70-person org.

In early 2025 I picked up a migration that had been stalled for almost a year. The task, on paper, was simple: move our team’s observability from Datadog to Grafana. 85+ monitors, 22+ dashboards. The real task was less simple. We had a hard cost-savings deadline attached to org-wide Datadog deprecation, a 70-person org that had to actually do the work, and a strong prior expectation, formed across several earlier attempts, that this thing was going to slip.

It didn’t slip. We finished with zero observability loss and, somewhat unexpectedly, with less alert noise on the other side than we started with.

A lot of the things I’d assumed going in turned out to be wrong. These are the notes I wish I’d had at the start.

The work is 80% coordination, not tooling

The first instinct on a migration like this is to treat it as an engineering project. Write a translation guide, build some tooling to convert queries, run it, done. I spent the first two weeks doing exactly that.

It wasn’t useless work, but it was nowhere near the bottleneck. The actual bottleneck was that no individual engineer was going to spend their week migrating monitors unless someone made it concrete, scheduled, and socially expensive to skip.

What worked was much less clever:

  • A single source-of-truth tracker, owner-per-monitor.
  • A standing weekly office hour where anyone could bring a confusing query and we’d rewrite it live. Surprisingly high attendance.
  • A war room channel for the final two weeks where I personally pinged owners on stale items.
  • A weekly status email with the burndown chart, sent to a wide enough audience that the burndown was visible to people who could apply pressure.

None of that is engineering work. All of it was load-bearing.
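For a sense of how little machinery the tracker and burndown need, here is a minimal sketch. It is an illustration under assumptions, not the actual tracker: the field names, the seven-day staleness threshold, and the sample monitors and owners are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MonitorRow:
    name: str                            # monitor name (hypothetical examples below)
    owner: str                           # exactly one owner per monitor
    done_on: Optional[date] = None       # date migrated or retired; None while open
    last_update: Optional[date] = None   # last time the owner touched the row


def burndown(rows, week_ends):
    """Count monitors still open at the end of each given week."""
    return [
        (week_end, sum(1 for r in rows if r.done_on is None or r.done_on > week_end))
        for week_end in week_ends
    ]


def stale_by_owner(rows, today, max_age_days=7):
    """Group open rows with no recent update by owner, ready for a ping."""
    cutoff = today - timedelta(days=max_age_days)
    stale = {}
    for r in rows:
        if r.done_on is None and (r.last_update is None or r.last_update < cutoff):
            stale.setdefault(r.owner, []).append(r.name)
    return stale


# Hypothetical sample data.
rows = [
    MonitorRow("consumer-lag", "asha", done_on=date(2025, 3, 5), last_update=date(2025, 3, 5)),
    MonitorRow("connector-restarts", "ben", last_update=date(2025, 2, 20)),
    MonitorRow("disk-usage", "ben", last_update=date(2025, 3, 12)),
]
weeks = [date(2025, 3, 2), date(2025, 3, 9), date(2025, 3, 16)]

print(burndown(rows, weeks))
print(stale_by_owner(rows, today=date(2025, 3, 14)))
```

The list from `burndown` is what would feed the weekly status email, and the output of `stale_by_owner` is the war room ping list. A spreadsheet with the same columns would do just as well; the structure matters more than the tool.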

Translate, then prune

The default mode in a migration is to translate everything one-for-one. Every monitor becomes a monitor; every dashboard becomes a dashboard. This is a mistake, because it preserves the sediment of years of “let’s just add an alert for that one time it happened.”

We made it a rule that every monitor had to be re-justified before it got translated. Concretely: an owner had to be able to write one sentence answering “what action does this page lead to?” If they couldn’t, the monitor got deleted or downgraded to a dashboard panel.

A meaningful chunk of monitors got retired this way. That’s where the post-migration alert noise reduction came from. Not better tooling, but the forcing function of having to re-touch every alert and ask whether it was still earning its keep.

If you’re ever doing a migration, build the prune step into the workflow. It’s much harder to do as a separate project later, because the activation energy of opening a dashboard you don’t normally look at is non-trivial.
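One way to build the prune step into the workflow is to make the justification sentence a required input to translation. The sketch below is hypothetical: the `Justification` fields and the `useful_as_panel` flag are assumptions used to illustrate the delete-or-downgrade choice, which in practice was a judgment call by the owner.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    TRANSLATE = "translate to the new system"
    DOWNGRADE = "downgrade to a dashboard panel"
    DELETE = "delete"


@dataclass
class Justification:
    monitor: str
    # Owner's one-sentence answer to "what action does this page lead to?"
    action_sentence: str = ""
    # Assumed flag: the signal is still worth seeing, just not worth paging on.
    useful_as_panel: bool = False


def triage(j: Justification) -> Decision:
    """A monitor is translated only if its owner can name the action it triggers."""
    if j.action_sentence.strip():
        return Decision.TRANSLATE
    return Decision.DOWNGRADE if j.useful_as_panel else Decision.DELETE


# Hypothetical examples.
print(triage(Justification("connector-restarts", "Restart the failed connector task.")))
print(triage(Justification("cpu-above-70", useful_as_panel=True)))
print(triage(Justification("one-off-2021-incident")))
```

The point of the gate is ordering: nothing reaches the translation step without passing through `triage` first, so pruning cannot be deferred to a later project.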

A small core team beats a large committee

We had three people at the center. Me, plus two engineers who knew the query languages on both sides cold. That’s it. Everyone else owned their slice and pulled the core team in when they got stuck.

I’d previously seen migrations done with a much larger steering committee, and they almost always went worse. The committee creates a diffusion of responsibility where no one feels acutely on the hook for the burndown, and the cadence of decisions slows to the cadence of the committee meeting.

A small core team, by contrast, is a fast unblock function. An engineer in the war room asks a question, gets a real answer within an hour, and is unblocked. That speed compounds over the duration of the migration in a way that’s hard to overstate.

“Zero observability loss” is the wrong success metric

We held ourselves to zero observability loss, and we hit it. But in retrospect that was mostly a comfort metric. Two other questions turned out to be more interesting.

First: did the alerts that paged people in the new system map to the right problems? Answer: yes, and slightly better than before, because of the pruning step.

Second: did anyone’s debugging workflow get worse? Answer: sort of. The muscle memory transition cost was real, and we underweighted it. Engineers spent ~2 weeks being slower at incidents because they were re-learning where things lived. We could have shortened this with more dashboard-walkthrough sessions before cutover.

Next time I’d track “time to find the relevant dashboard during an incident” as an explicit post-migration metric for a quarter. We didn’t, and I think we left some easy improvement on the table by not doing it.
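As a rough sketch of what tracking that could look like, assuming each incident record carries a page timestamp and the time the responder reached the relevant dashboard (both fields are hypothetical and would have to be captured somewhere, for example in the incident timeline):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class IncidentTiming:
    paged_at: datetime
    dashboard_opened_at: datetime  # when the responder reached the relevant dashboard


def median_seconds_to_dashboard(timings):
    """Median delay between the page and reaching the right dashboard."""
    return median(
        (t.dashboard_opened_at - t.paged_at).total_seconds() for t in timings
    )


# Hypothetical incidents: delays of 120, 300, and 90 seconds.
timings = [
    IncidentTiming(datetime(2025, 4, 1, 10, 0, 0), datetime(2025, 4, 1, 10, 2, 0)),
    IncidentTiming(datetime(2025, 4, 8, 3, 15, 0), datetime(2025, 4, 8, 3, 20, 0)),
    IncidentTiming(datetime(2025, 4, 20, 18, 30, 0), datetime(2025, 4, 20, 18, 31, 30)),
]

print(median_seconds_to_dashboard(timings))
```

Computing the same number for incidents before and after cutover gives a simple comparison. The median is used here rather than the mean so that a single long incident does not dominate the figure.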

On running a 70-person initiative as one person

The hardest part of leading this wasn’t technical and wasn’t even organizational, exactly. It was managing my own anxiety about something I couldn’t directly do.

You can’t migrate 85 monitors by yourself. You also can’t make 50 engineers move on a schedule by force of will. What you can do is build the conditions where the work is the obvious next thing for each person. Clear ownership. Low-friction help. A visible burndown. A real deadline. And then trust the system you built and let people do their jobs.

I found that the weeks I was most tempted to “just go fix it myself” were the weeks I was the least useful to the project. The discipline of staying in the coordination role, even when the engineering itch flared, was probably the single biggest thing I had to learn.