Alarms

Built at 2026-05-03 21:03:24 · all times EDT
Cron engine: OK. All checks healthy. Cycle 5 min. Next check in 95s.
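The status line gives a fixed 5-minute cycle, and the run list below shows each run starting a second or two after a 5-minute wall-clock mark. The following is a minimal sketch of how a "next check in" countdown could be derived. It assumes that runs align to multiples of the cycle counted from local midnight. The engine's actual scheduler is not shown on this page, and `next_check` and `seconds_until_next_check` are hypothetical names.

```python
from datetime import datetime, timedelta

CYCLE = timedelta(minutes=5)  # "Cycle 5 min" from the status line


def next_check(now: datetime, cycle: timedelta = CYCLE) -> datetime:
    """Return the next cycle boundary strictly after `now`.

    ASSUMPTION: runs are aligned to multiples of `cycle` since local midnight
    (:00, :05, :10, ...), which matches the run start times listed below.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    periods_done = (now - midnight) // cycle  # whole cycles elapsed today
    return midnight + (periods_done + 1) * cycle


def seconds_until_next_check(now: datetime, cycle: timedelta = CYCLE) -> int:
    """Whole seconds from `now` to the next boundary."""
    return int((next_check(now, cycle) - now).total_seconds())
```

For the build time 21:03:24, this returns 96. That is within a second of the 95 s shown on the page.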
Teams — reusable recipient lists, referenced as @teamname
Name | Title | Members | Modified
@prodops | Production ops | srahman1@bnl.gov, wenaus@gmail.com | 20260421 18:22:28
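A recipient field can list addresses directly, as Alarm 1 does, or reference a team, as Alarm 2 does with @prodops. The following is a minimal sketch of expanding such a field into a de-duplicated address list. It assumes a comma- or whitespace-separated recipient string and an in-memory team table that mirrors the one above. `TEAMS` and `expand_recipients` are hypothetical names, and how the real engine stores teams is not shown.

```python
import re

# HYPOTHETICAL in-memory copy of the Teams table above; the real storage is not shown.
TEAMS: dict[str, list[str]] = {
    "prodops": ["srahman1@bnl.gov", "wenaus@gmail.com"],
}


def expand_recipients(recipients: str, teams: dict[str, list[str]] = TEAMS) -> list[str]:
    """Expand '@team' tokens into member addresses, keeping first-seen order
    and dropping duplicates. Plain tokens are taken as email addresses."""
    out: list[str] = []
    for token in re.split(r"[,\s]+", recipients.strip()):
        if not token:
            continue  # an empty field splits to ['']
        if token.startswith("@"):
            name = token[1:]
            if name not in teams:
                # Fail loudly rather than silently dropping an unknown team.
                raise KeyError(f"unknown team @{name}")
            members = teams[name]
        else:
            members = [token]
        for addr in members:
            if addr not in out:
                out.append(addr)
    return out
```

Raising an error on an unknown @team, rather than dropping it, prevents an alarm from firing with no recipients.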
Alarm summary — one row per configured alarm
Alarm | Emails | Alarm tasks in last 24h | Current alarm tasks | Last alarm
Alarm 1: PanDA task failure rate — catch-all | off | 0 | 0 | 3 days, 13 hours ago
Alarm 2: PanDA task failure rate — Sakib's tasks | off | 0 | 0 | 3 days, 13 hours ago
Alarm 1: PanDA task failure rate — catch-all · emails off · alarm_panda_failure_rate_eic_all
Created / modified 20260421 17:06:48 / 20260421 21:33:34
Recipients wenaus@gmail.com
Source alarms/swf_alarms/alarms/panda_failure_rate_eic_all.py

Current alarm tasks — 0

No active alarms right now.

Description / email body
Catch-all alert on any PanDA task whose computed failure rate exceeds the configured threshold within the configured look-back window. Sent only to Torre (wenaus@gmail.com) as a tuning channel for shaping future per-owner alarms. The threshold and window are set by threshold and since_days in the Params below.

Dashboard: https://epic-devcloud.org/prod/alarms/
Params
Param | Value | Type | Required | Default | Description
threshold | 0.05 | float | yes | - | failure-rate threshold (e.g. 0.03 = 3%)
since_days | 1 | int | no | 1 | look back this many days into PanDA
username | - | str | no | - | optional task-owner filter (supports % LIKE)
processingtype | - | str | no | - | optional PanDA processingtype filter
min_terminal_jobs | 5 | int | no | 5 | ignore tasks with fewer finished+failed jobs than this
statuses | - | list | no | - | task statuses to consider; default running/failed/broken
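The description does not define "computed failure rate". Because min_terminal_jobs counts finished+failed jobs, one plausible reading is failed / (finished + failed) per task. The following is a minimal sketch of the threshold check under that assumption. It also assumes that "exceeds" means strictly greater than, and that the since_days window and the owner and processingtype filters were already applied when the tasks were fetched. `TaskCounts` and `failing_tasks` are hypothetical names, not the alarm source's code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence

DEFAULT_STATUSES = ("running", "failed", "broken")  # default listed for `statuses`


@dataclass
class TaskCounts:
    # HYPOTHETICAL per-task summary; the real PanDA payload is not shown on this page.
    task_id: int
    status: str
    finished: int
    failed: int


def failing_tasks(tasks: Iterable[TaskCounts], threshold: float,
                  min_terminal_jobs: int = 5,
                  statuses: Optional[Sequence[str]] = None) -> list[tuple[TaskCounts, float]]:
    """Return (task, rate) for tasks whose failure rate is strictly above threshold.

    ASSUMPTION: rate = failed / (finished + failed).
    """
    wanted = set(statuses) if statuses else set(DEFAULT_STATUSES)
    hits = []
    for t in tasks:
        if t.status not in wanted:
            continue
        terminal = t.finished + t.failed  # jobs that reached a terminal state
        if terminal == 0 or terminal < min_terminal_jobs:
            continue  # too few terminal jobs to judge
        rate = t.failed / terminal
        if rate > threshold:
            hits.append((t, rate))
    return hits
```

Under this reading, a task at exactly 5 failures out of 100 is not flagged at Alarm 1's 0.05 threshold but would be flagged at Alarm 2's 0.03.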
Alarm 2: PanDA task failure rate — Sakib's tasks · emails off · alarm_panda_failure_rate_sakib
Created / modified 20260421 17:06:48 / 20260421 19:22:01
Recipients @prodops
Source alarms/swf_alarms/alarms/panda_failure_rate_sakib.py

Current alarm tasks — 0

No active alarms right now.

Description / email body
Alert on PanDA tasks owned by Sakib Rahman whose computed failure rate exceeds the configured threshold within the configured look-back window. The threshold, window, and minimum number of terminal jobs are set by threshold, since_days, and min_terminal_jobs in the Params below.

Dashboard: https://epic-devcloud.org/prod/alarms/
Params
Param | Value | Type | Required | Default | Description
threshold | 0.03 | float | yes | - | failure-rate threshold (e.g. 0.03 = 3%)
since_days | 1 | int | no | 1 | look back this many days into PanDA
username | Sakib Rahman | str | no | - | optional task-owner filter (supports % LIKE)
processingtype | - | str | no | - | optional PanDA processingtype filter
min_terminal_jobs | 5 | int | no | 5 | ignore tasks with fewer finished+failed jobs than this
statuses | - | list | no | - | task statuses to consider; default running/failed/broken
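The username param is documented as supporting % LIKE. As in SQL, % there matches any run of characters. The following is a minimal sketch of that matching in Python. It assumes that only % is special, as the description states, and that the comparison is case-sensitive. Whether the real filter is applied server-side in SQL, and under what case rules, is not shown here. `like_to_regex` and `owner_matches` are hypothetical names.

```python
import re
from typing import Optional


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a LIKE-style pattern where '%' matches any run of characters
    (including none) and every other character matches itself literally."""
    literal_parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(literal_parts), re.DOTALL)


def owner_matches(owner: str, pattern: Optional[str]) -> bool:
    """An empty or missing pattern means no owner filter (the param is optional)."""
    if not pattern:
        return True
    return like_to_regex(pattern).fullmatch(owner) is not None
```

Alarm 2's value Sakib Rahman contains no %, so under this sketch it reduces to an exact match on the owner name.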
Recent cron engine runs — each run, with per-alarm breakdown
Started Per-alarm results
20260503 21:00:02
20260503 20:55:01
20260503 20:50:01
20260503 20:45:01
20260503 20:40:01
20260503 20:35:01
20260503 20:30:02
20260503 20:25:01
20260503 20:20:02
20260503 20:15:01
20260503 20:10:01
20260503 20:05:01
20260503 20:00:02
20260503 19:55:01
20260503 19:50:01
20260503 19:45:02
20260503 19:40:01
20260503 19:35:01
alarm_panda_failure_rate_sakib: seen 0, error — fetch: 404 Not Found from https://epic-devcloud.org/prod/api/panda/tasks/: The requested URL was not found on this server.
alarm_panda_failure_rate_eic_all: seen 0, error — fetch: 404 Not Found from https://epic-devcloud.org/prod/api/panda/tasks/: The requested URL was not found on this server.
20260503 19:30:01
alarm_panda_failure_rate_eic_all: seen 0, error — fetch: request failed: https://epic-devcloud.org/prod/api/panda/tasks/: The read operation timed out
20260503 19:25:01
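The failed runs above show two distinct failure modes from the same endpoint. At 19:35 both alarms got an HTTP 404, and at 19:30 the catch-all alarm hit a read timeout. The following is a minimal sketch of a per-alarm fetch step that turns each failure mode into a one-line result like those in the log. It assumes, hypothetically, that the endpoint returns a JSON list of tasks. `run_alarm`, `fetch_tasks`, and the 30 s timeout are illustrative, not the engine's code.

```python
import json
import socket
import urllib.error
import urllib.request
from typing import Callable

# Endpoint taken from the run log above; its query parameters are not shown, so none are sent.
TASKS_URL = "https://epic-devcloud.org/prod/api/panda/tasks/"


def fetch_tasks(url: str = TASKS_URL, timeout: float = 30.0) -> bytes:
    """Fetch the raw task list; raises HTTPError, URLError, or a timeout on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def run_alarm(alarm_id: str, fetch: Callable[[str], bytes] = fetch_tasks,
              url: str = TASKS_URL) -> str:
    """Run one alarm's fetch step and return a one-line per-alarm result."""
    try:
        body = fetch(url)
    except urllib.error.HTTPError as e:  # must precede URLError: HTTPError subclasses it
        return f"{alarm_id}: seen 0, error: fetch: {e.code} {e.reason} from {url}"
    except (urllib.error.URLError, TimeoutError, socket.timeout) as e:
        reason = getattr(e, "reason", e)  # URLError wraps the underlying cause in .reason
        return f"{alarm_id}: seen 0, error: fetch: request failed: {url}: {reason}"
    try:
        tasks = json.loads(body)  # ASSUMPTION: response is a JSON list of tasks
    except json.JSONDecodeError as e:
        return f"{alarm_id}: seen 0, error: parse: {e}"
    if not isinstance(tasks, list):
        return f"{alarm_id}: seen 0, error: parse: expected a JSON list"
    # A real alarm would now apply its threshold check to `tasks`.
    return f"{alarm_id}: seen {len(tasks)}, ok"
```

Passing `fetch` as a parameter lets the 404 and timeout paths be exercised without network access, for example with a stub that raises `urllib.error.HTTPError`.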