MEHMET ISIK · 5D AGO · 25 VIEWS · PRIVATE
WWTP LLM Defense Critical Infrastructure?
Output Data
Run Summary: All Models
| Model | Score | Input Tokens | Output Tokens | Cost | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Time | Status |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 89.80 | 1,716,324 | 86,849 | $1.88 | $0.30 | $1.25 | 102m | 30/30 PASS |
| Gemini 2.5 Pro | 87.30 | 2,915,681 | 124,304 + ~1.36M thinking | $18.49 (~73% from invisible thinking tokens) | $1.25 | $10.00 | ~270m | 27/30 QUOTA |
| Gemini 2.0 Flash Lite | 85.10 | 980,412 | 62,100 | $0.42 | $0.30 | $0.60 | 45m | 30/30 PASS |
| Gemini 2.0 Flash | 78.20 | 1,102,540 | 71,200 | $0.68 | $0.40 | $1.00 | 58m | 30/30 PASS |
| Claude Opus 4.6 | 92.40 | 4,078,323 | 229,256 | $26.12 | $5.00 | $25.00 | 92m | 30/30 PASS |
| Gemini 3 Flash Preview | ERROR | - | - | $0.31 | - | - | 7m | ERROR |
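The Cost column can be sanity-checked against the token counts and per-token prices listed in the table. The following is a minimal sketch, not the platform's billing logic: the `Run` class and `cost_breakdown` function are hypothetical names, and it assumes that thinking tokens are billed at the output-token rate.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One row of the run summary table (hypothetical container)."""
    model: str
    input_tokens: int
    output_tokens: int
    input_price: float         # USD per 1M input tokens
    output_price: float        # USD per 1M output tokens
    thinking_tokens: int = 0   # assumption: billed at the output rate


def cost_breakdown(run: Run) -> dict:
    """Split the estimated cost into input, output, and thinking parts."""
    input_cost = run.input_tokens * run.input_price / 1_000_000
    output_cost = run.output_tokens * run.output_price / 1_000_000
    thinking_cost = run.thinking_tokens * run.output_price / 1_000_000
    total = input_cost + output_cost + thinking_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "thinking": thinking_cost,
        "total": total,
        "thinking_share": thinking_cost / total if total else 0.0,
    }


# Figures copied from the table; ~1.36M thinking tokens is the table's estimate.
opus = Run("Claude Opus 4.6", 4_078_323, 229_256, 5.00, 25.00)
pro = Run("Gemini 2.5 Pro", 2_915_681, 124_304, 1.25, 10.00,
          thinking_tokens=1_360_000)

for run in (opus, pro):
    b = cost_breakdown(run)
    print(f"{run.model}: ${b['total']:.2f} total, "
          f"thinking share {b['thinking_share']:.2f}")
```

Under these assumptions the sketch reproduces the listed $26.12 for Claude Opus 4.6 and $18.49 for Gemini 2.5 Pro, with thinking tokens accounting for roughly 73 to 74% of the latter.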
Chat serialization limit: This task created 7 conversations, but only 4 are visible in the output JSON. Some conversation data may be missing.
WWTP_LLM_Defense_Can_AI_Protect_Critical_Infrastructure-run_id_Run_1_google_gemini-2.5-flash...
This preview is truncated due to large file size. The number of JSON items and individual items might be truncated. Create a Notebook or download this file to see the full content.
"root" : { 6 items
"taskVersion" : { 4 items
"versionNumber" : 1
"name" : "WWTP LLM Defense: Can AI Protect Critical Infrastructure?"
"description" :
"36-hour SCADA cyber attack simulation on a real WWTP. 5 attack scenarios, 3 defense modes, 2 awareness levels. Stuxnet-inspired sensor manipulation with real plant data."
"definition" :
"@kbench.task(name='WWTP LLM Defense: Can AI Protect Critical Infrastructure?')
def wwtp_defense_v2(llm) -> float:
    \"\"\"36-hour SCADA cyber attack simulation on a real WWTP. 5 attack scenarios, 3 defense modes, 2 awareness levels. Stuxnet-inspired sensor manipulation with real plant data.\"\"\"
    _benchmark_start = time.time()
    # --- Load all data ---
    data_dir = find_data_dir()
    scenario, config, attack_config, answer_key, sec_proto, sec_mandate = load_all_data(data_dir)
    print(f\"{'=' * 65}\")
    print(f\" WWTP LLM DEFENSE BENCHMARK v2.0\")
    print(f\"{'=' * 65}\")
    print(f\" Plant: Ceyhan WWTP, Adana, Turkey (34,896 m3/day)\")
    print(f\" Simulation: {SIMULATION_HOURS} hours | {CONTEXT_WINDOW}h context windows | 3x retry\")
    print(f\" Scenarios: 5 (Biogas, Blower, Generator, CHP, Combined)\")
    print(f\" Modes: BARE / PROTOCOL / MANDATE\")
    print(f\" Awareness: AWARE (knows attack) / UNAWARE (thinks sensor error)\")
    ..."
}
"runs" : [ 1 item
{ 8 items
"model" : "google/gemini-2.5-flash"
"score" : 89.80
"conversations" : [ ... ]
"assertions" : [ ... ]
...
}
]
}
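Once the file is downloaded, the fields visible in this preview can be read with the standard `json` module. The sketch below is illustrative only: it assumes that "root" is the viewer's label for the top-level object rather than a key in the file, it uses a small in-memory `SAMPLE` in place of the real file (whose name is truncated above), and it fills the elided `conversations` and `assertions` arrays with empty placeholders. `summarize_runs` is a hypothetical helper name.

```python
import json

# Stand-in for the downloaded file. Only keys visible in the preview are used;
# the empty lists are placeholders for the elided "[ ... ]" arrays.
SAMPLE = """
{
  "taskVersion": {
    "versionNumber": 1,
    "name": "WWTP LLM Defense: Can AI Protect Critical Infrastructure?"
  },
  "runs": [
    {
      "model": "google/gemini-2.5-flash",
      "score": 89.80,
      "conversations": [],
      "assertions": []
    }
  ]
}
"""


def summarize_runs(doc: dict) -> list[dict]:
    """Return one summary row per run, tolerating missing keys."""
    rows = []
    for run in doc.get("runs", []):
        rows.append({
            "model": run.get("model"),
            "score": run.get("score"),
            "n_conversations": len(run.get("conversations", [])),
            "n_assertions": len(run.get("assertions", [])),
        })
    return rows


doc = json.loads(SAMPLE)
print(doc["taskVersion"]["name"])
for row in summarize_runs(doc):
    print(row)
```

Counting the entries in `conversations` this way is also one way to check how many conversations a downloaded file actually contains, given the serialization limit reported for this task.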
WWTP_LLM_Defense_Can_AI_Protect_Critical_Infrastructure-run_id_Run_1_google_gemini-2.5-pro...