Alarms
Built at 2026-05-03 21:03:24 · all times EDT · auto-refresh 10s
Cron engine: OK.
All checks healthy.
Cycle 5 min. Next check in 95s.
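The countdown to the next check can be approximated as below. This is a minimal sketch that assumes the engine fires on wall-clock multiples of the cycle length, which the run start times in the log suggest but the page does not state. The function name `seconds_until_next_cycle` is hypothetical.

```python
from datetime import datetime, timedelta


def seconds_until_next_cycle(now, cycle_minutes=5):
    """Seconds from `now` to the next wall-clock multiple of the cycle.

    Assumption: cycles are aligned to midnight (00:00, 00:05, ...).
    """
    cycle = timedelta(minutes=cycle_minutes)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = now - midnight
    # timedelta // timedelta yields the number of whole cycles elapsed.
    next_tick = midnight + (elapsed // cycle + 1) * cycle
    return int((next_tick - now).total_seconds())


if __name__ == "__main__":
    print(seconds_until_next_cycle(datetime(2026, 5, 3, 21, 3, 24)))
```

Logged runs start a second or two past the boundary, so the value is only approximate.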
Teams
— reusable recipient lists, referenced as @teamname
| Name | Title | Members | Modified |
|---|---|---|---|
| @prodops | Production ops | srahman1@bnl.gov, wenaus@gmail.com | 20260421 18:22:28 |
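A recipient list that mixes plain addresses and `@teamname` references could be expanded roughly as follows. This is a sketch only: the function `expand_recipients` and the `TEAMS` dictionary layout are hypothetical, and the actual lookup in the alarm engine may differ.

```python
def expand_recipients(recipients, teams):
    """Expand @teamname entries into member addresses.

    Keeps first-seen order and drops duplicate addresses.
    """
    expanded = []
    for entry in recipients:
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("@"):
            # An unknown team raises KeyError instead of being dropped silently.
            members = teams[entry]
        else:
            members = [entry]
        for address in members:
            if address not in expanded:
                expanded.append(address)
    return expanded


# Team contents taken from the Teams table; the dict layout is an assumption.
TEAMS = {"@prodops": ["srahman1@bnl.gov", "wenaus@gmail.com"]}

if __name__ == "__main__":
    print(expand_recipients(["@prodops", "wenaus@gmail.com"], TEAMS))
```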
Alarm summary
— one row per configured alarm
| Alarm | Emails | Alarm tasks in last 24h | Current alarm tasks | Last alarm |
|---|---|---|---|---|
| Alarm 1: PanDA task failure rate — catch-all | off | 0 | 0 | 3 days, 13 hours ago |
| Alarm 2: PanDA task failure rate — Sakib's tasks | off | 0 | 0 | 3 days, 13 hours ago |
Alarm 1: PanDA task failure rate — catch-all
emails off
alarm_panda_failure_rate_eic_all
| Created / modified | 20260421 17:06:48 / 20260421 21:33:34 |
|---|---|
| Recipients | wenaus@gmail.com |
| Source | alarms/swf_alarms/alarms/panda_failure_rate_eic_all.py |
Current alarm tasks — 0
No active alarms right now.
Description / email body
Catch-all alert on any PanDA task whose computed failure rate exceeds the configured threshold over the configured window. Torre-only tuning channel for shaping future per-owner alarms. Threshold and window live in the Check params below. Dashboard: https://epic-devcloud.org/prod/alarms/
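The check described above might look like the sketch below. The page does not give the formula, so the failure rate is assumed here to be failed jobs divided by terminal (finished plus failed) jobs, and "exceeds" is read as a strict comparison. The field names `finished` and `failed` and both function names are hypothetical.

```python
def failure_rate(finished, failed):
    """Assumed definition: failed / (finished + failed); 0.0 if no terminal jobs."""
    terminal = finished + failed
    return failed / terminal if terminal else 0.0


def task_in_alarm(task, threshold, min_terminal_jobs=5):
    """True when a task has enough terminal jobs and its rate exceeds threshold."""
    terminal = task["finished"] + task["failed"]
    if terminal < min_terminal_jobs:
        # Too few finished+failed jobs for the rate to be meaningful.
        return False
    return failure_rate(task["finished"], task["failed"]) > threshold


if __name__ == "__main__":
    tasks = [
        {"id": "a", "finished": 94, "failed": 6},  # 6% of 100 terminal jobs
        {"id": "b", "finished": 99, "failed": 1},  # 1% of 100 terminal jobs
        {"id": "c", "finished": 1, "failed": 2},   # only 3 terminal jobs
    ]
    print([t["id"] for t in tasks if task_in_alarm(t, threshold=0.05)])
```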
Params
| Param | Value | Type | Required | Default | Description |
|---|---|---|---|---|---|
| threshold | 0.05 | float | yes | — | failure-rate threshold (e.g. 0.03 = 3%) |
| since_days | 1 | int | no | 1 | look back this many days into PanDA |
| username | — | str | no | — | optional task-owner filter (supports % LIKE) |
| processingtype | — | str | no | — | optional PanDA processingtype filter |
| min_terminal_jobs | 5 | int | no | 5 | ignore tasks with fewer finished+failed jobs than this |
| statuses | — | list | no | — | task statuses to consider; default running/failed/broken |
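The Type, Required, and Default columns can be read as a small schema. The sketch below shows one way to apply such a schema to raw configured values; `PARAM_SPEC` and `resolve_params` are hypothetical names, not the engine's actual code. An unset `statuses` is left as `None` on the assumption that the check itself falls back to running/failed/broken.

```python
# (type, required, default) per parameter, transcribed from the Params table.
PARAM_SPEC = {
    "threshold": (float, True, None),
    "since_days": (int, False, 1),
    "username": (str, False, None),
    "processingtype": (str, False, None),
    "min_terminal_jobs": (int, False, 5),
    "statuses": (list, False, None),
}


def resolve_params(raw):
    """Validate raw values against PARAM_SPEC and fill in defaults."""
    unknown = set(raw) - set(PARAM_SPEC)
    if unknown:
        raise ValueError(f"unknown params: {sorted(unknown)}")
    resolved = {}
    for name, (kind, required, default) in PARAM_SPEC.items():
        value = raw.get(name)
        if value is None:
            if required:
                raise ValueError(f"missing required param: {name}")
            resolved[name] = default
            continue
        # Accept an int where a float is expected (e.g. threshold=1).
        if kind is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            raise TypeError(f"{name} must be {kind.__name__}")
        resolved[name] = value
    return resolved


if __name__ == "__main__":
    print(resolve_params({"threshold": 0.05}))
```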
Alarm 2: PanDA task failure rate — Sakib's tasks
emails off
alarm_panda_failure_rate_sakib
| Created / modified | 20260421 17:06:48 / 20260421 19:22:01 |
|---|---|
| Recipients | @prodops |
| Source | alarms/swf_alarms/alarms/panda_failure_rate_sakib.py |
Current alarm tasks — 0
No active alarms right now.
Description / email body
Alert on PanDA tasks owned by Sakib Rahman whose computed failure rate exceeds the configured threshold over the configured window. Threshold, window, and minimum terminal jobs are in the Check params below. Dashboard: https://epic-devcloud.org/prod/alarms/
Params
| Param | Value | Type | Required | Default | Description |
|---|---|---|---|---|---|
| threshold | 0.03 | float | yes | — | failure-rate threshold (e.g. 0.03 = 3%) |
| since_days | 1 | int | no | 1 | look back this many days into PanDA |
| username | Sakib Rahman | str | no | — | optional task-owner filter (supports % LIKE) |
| processingtype | — | str | no | — | optional PanDA processingtype filter |
| min_terminal_jobs | 5 | int | no | 5 | ignore tasks with fewer finished+failed jobs than this |
| statuses | — | list | no | — | task statuses to consider; default running/failed/broken |
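The `username` filter is described as supporting `%` in the manner of SQL LIKE. The real filter is presumably applied server-side; the sketch below only illustrates the matching semantics in Python. `like_match` is a hypothetical helper, it handles only the `%` wildcard (not `_`), and it assumes case-sensitive matching.

```python
import re


def like_match(pattern, value):
    """Match `value` against a LIKE-style pattern where % means any run of characters."""
    # Escape the literal pieces so characters like '.' are not treated as regex.
    parts = [re.escape(part) for part in pattern.split("%")]
    regex = ".*".join(parts)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


if __name__ == "__main__":
    owner = "Sakib Rahman"
    for pattern in ("Sakib Rahman", "Sakib%", "%Rahman", "Sakib"):
        print(pattern, like_match(pattern, owner))
```

Without a `%`, the pattern must equal the owner name exactly, which is how Alarm 2 uses it.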
Recent cron engine runs
— each run, with per-alarm breakdown
| Started | Per-alarm results |
|---|---|
| 20260503 21:00:02 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:55:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:50:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:45:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:40:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:35:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:30:02 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:25:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:20:02 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:15:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:10:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:05:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 20:00:02 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 19:55:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 19:50:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 19:45:02 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 19:40:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
| 20260503 19:35:01 | alarm_panda_failure_rate_sakib: seen 0, error — fetch: 404 Not Found from https://epic-devcloud.org/prod/api/panda/tasks/ (HTML error page: "The requested URL was not found on this server."); alarm_panda_failure_rate_eic_all: seen 0, error — fetch: 404 Not Found from https://epic-devcloud.org/prod/api/panda/tasks/ (same HTML error page) |
| 20260503 19:30:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0, error — fetch: request failed: https://epic-devcloud.org/prod/api/panda/tasks/: The read operation timed out |
| 20260503 19:25:01 | alarm_panda_failure_rate_sakib: seen 0; alarm_panda_failure_rate_eic_all: seen 0 |
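Fetch errors in the run log can carry a whole HTML error page from the server. A reporting layer could trim that body before display, as in the sketch below. `condense_error` is a hypothetical helper and is not part of the alarm engine as far as this page shows.

```python
import re


def condense_error(message):
    """Drop a trailing HTML document from an error message and collapse whitespace."""
    match = re.search(r"<(?:!DOCTYPE|html)\b", message, flags=re.IGNORECASE)
    if match is None:
        # No HTML payload: only normalize whitespace.
        return " ".join(message.split())
    # Keep the text before the HTML and drop the dangling ':' separator.
    head = message[:match.start()].rstrip().rstrip(":")
    return " ".join(head.split())


if __name__ == "__main__":
    raw = (
        "fetch: 404 Not Found from https://epic-devcloud.org/prod/api/panda/tasks/: "
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n'
        "<html><head>\n<title>404 Not Found</title>\n</head><body>\n"
        "<h1>Not Found</h1>\n</body></html>"
    )
    print(condense_error(raw))
```

Messages without an HTML payload, such as the read timeout, pass through unchanged apart from whitespace normalization.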