APIANT

Ops & Support

Monitor production automations, diagnose failures, and control alert behavior.

Ops covers everything after deploy: keeping production automations healthy, diagnosing failures when something goes wrong, and tuning alert noise so real signals don't drown in the inbox.

What you can ask Claude to do

  • Account health checks — read the state of a customer account: which automations ran, which had elevated failure rates, which connections are looking sick.
  • Diagnosing failed runs — pinpoint the root cause of a specific failure and either propose or apply a fix.
  • Tuning alert noise — adjust which errors page you, which retry, which suppress, and what the customer-facing alert text says.

Account health checks

You ask Claude to check on a customer; it pulls a focused report — which automations ran, which had elevated failure rates, which connections are looking sick. Use it for daily check-ins, weekly reviews, or as a first response when a customer reports that something feels off.

```text
> How's Acme Fitness's account today?
> Anything broken on Boring Co?
```

The report is brief by design: recent error counts grouped by automation, with names and IDs. No log walls. If anything errored, Claude points you at diagnosis rather than digging in itself; the health check is read-only and meant to run quickly, including on a recurring schedule.
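The shape of that report can be sketched as a small aggregation over run records. This is a hypothetical sketch, not the platform's actual data model: the field names (`automation_id`, `automation_name`, `status`) are assumptions for illustration.

```python
from collections import Counter

def health_report(runs):
    """Summarize recent runs as brief error counts grouped by automation.

    `runs` is a list of dicts with hypothetical fields:
    automation_id, automation_name, status ("ok" or "error").
    """
    errors = Counter()
    names = {}
    for run in runs:
        names[run["automation_id"]] = run["automation_name"]
        if run["status"] == "error":
            errors[run["automation_id"]] += 1
    # Brief by design: one line per erroring automation, name plus ID,
    # sorted by error count. No log walls.
    return [
        f"{names[aid]} ({aid}): {count} error(s)"
        for aid, count in errors.most_common()
    ]
```

The read-only nature of the check is what makes it safe to run on a recurring schedule: it aggregates existing run records and never touches the automations themselves.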

Diagnosing failed runs

When something specific is wrong, you describe the symptom and Claude walks the failed runs, identifies the root cause, and proposes (or applies) a fix.

```text
> Acme says contacts aren't syncing since yesterday. Find out why.
> This automation's been retrying for hours. Stop it and figure out what's broken.
> A vendor sent us a payload shape we've never seen. Update the trigger.
```

Claude switches into the customer's account, searches execution history for the record or symptom you named, drills into the failing step, and pulls the actual HTTP request and response that the platform sent and received. The diagnosis comes back as a single summary: which automation, which step, the curl that failed, the API's response, what the response means, and what to change.
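One way to picture that single summary is as a renderer over a captured failed-step record. A hypothetical sketch only: the record fields and the curl rendering are assumptions, not the platform's actual output format.

```python
def diagnosis_summary(record):
    """Render a failed-step record as a one-screen diagnosis:
    which automation, which step, the failing request as curl,
    the API's response, what it means, and what to change.

    All field names here are illustrative assumptions.
    """
    req = record["request"]
    parts = ["curl", "-X", req["method"]]
    for key, value in req.get("headers", {}).items():
        parts += ["-H", f"'{key}: {value}'"]
    parts.append(f"'{req['url']}'")
    curl = " ".join(parts)
    return "\n".join([
        f"Automation: {record['automation']} / step: {record['step']}",
        f"Failed request: {curl}",
        f"Response: {record['response']['status']} {record['response']['body']}",
        f"Meaning: {record['meaning']}",
        f"Fix: {record['fix']}",
    ])
```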

If the root cause turns out to be a handling-policy problem (a transient error that should retry but isn't, a misleading alert message, a domain that returns 401 for non-auth reasons), Claude tells you and points you at alert tuning. Code-level fixes happen on dev through the build and edit skills, not in production.

Tuning alert noise

Production automations are noisy by default — vendor blips, transient 502s, and known-good 404s all fire alerts until you tune them. Four levers decide what happens when an action errors, evaluated in this order:

  1. Retry eligibility — is the error transient enough to go to the auto-retry queue instead of alerting?
  2. 401 auto-turnoff bypass — for APIs that return 401 for non-auth reasons (rate limits, IP restrictions), keep the auto-shutoff from firing.
  3. Alert message transform — rewrite or suppress the message at the step, automation, or tenant-wide level.
  4. Dismissal list — kill an alert whose first line exactly matches a saved entry.

You describe the symptom; Claude works out which lever applies and proposes the change.

```text
> 404s on the find step are normal here. Suppress them.
> Don't auto-disable the connection on 401 from this domain — they rotate tokens hourly.
> Rewrite the alert text for failed shipping syncs to be customer-readable.
```

System-wide changes (the retry list, the 401-ignore list, system-level mappings) require Switch Account permission on your APIANT account; if you don't have it, Claude says so instead of attempting the change.


Last updated May 4, 2026