38 Skills, three layers of defense, one hard lesson: natural language instructions are not a safety mechanism.
TL;DR
I have 38 Claude Code Skills (24 personal + 14 repo). After auditing them, I found that 12 Skills had zero checkpoints (5 of them destructive), and 10 destructive Skills relied solely on natural language like “Confirm with user before proceeding.” This post documents how I systematically fixed the problem using a three-layer protection model.
Core insight: Natural language instructions are a “sign,” not a “physical barrier.” You don’t put up a “Please Keep Back” sign at the edge of a cliff and call it safety.
The Problem: Are Your Skills Actually Safe?
Claude Code Skills are reusable workflow templates. A well-written Skill can compress 10 minutes of work into 30 seconds. But here’s the thing — many Skills contain irreversible operations:
- git push: pushing to a remote
- aws s3 rm: deleting S3 objects
- kubectl delete: removing K8s resources
- adb push: overwriting system APKs
- --execute --yes: writing to production DB
I did a full audit of my 38 Skills:
| Category | Count | Examples |
|---|---|---|
| Destructive, no checkpoint | 5 | bads-skynet-e2e, device-test, commit-phase, test-folio, ventura-memory |
| Destructive, natural language only | 10 | bads-update, self-evolution, phase-impl, writing, code-review, etc. |
| Destructive, has AskUserQuestion | 3 | prepare-feature-for-ux-review, sprint-plan, release-notes |
| Read-only | 17 | aosp-analysis, director, trade-analysis, etc. |
The scariest finding: My bads-update Skill runs 8 steps end-to-end — from token refresh to production DB write to S3 cleanup. The only thing between dry-run and execution was a single “Confirm with user before proceeding.” In long-context conversations, the model may skip right past it.
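An audit like this can be partly mechanized. The sketch below is one way to bucket a Skill's text into the four categories from the table; it is not the procedure that produced the table. The pattern list and the classify_skill name are assumptions for illustration, and a plain regex scan will miss destructive commands hidden behind wrapper scripts or variables.

```python
import re

# Hypothetical patterns for irreversible commands; extend for your own tooling.
DESTRUCTIVE_PATTERNS = [
    r"git\s+push",
    r"aws\s+s3\s+rm",
    r"kubectl\s+delete",
    r"adb\s+push",
    r"--execute\s+--yes",
]


def classify_skill(text: str) -> str:
    """Bucket a Skill's Markdown text into one of the audit categories."""
    destructive = any(re.search(p, text) for p in DESTRUCTIVE_PATTERNS)
    if not destructive:
        return "read-only"
    if "AskUserQuestion" in text:
        return "destructive, has AskUserQuestion"
    # Assumed phrasings of a natural-language checkpoint.
    if re.search(r"confirm with user|wait for user", text, re.IGNORECASE):
        return "destructive, natural language only"
    return "destructive, no checkpoint"
```

Running it over every Skill file gives a first-pass inventory; anything it labels read-only still deserves a manual look.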
The Three-Layer Protection Model
I categorize safety mechanisms into three layers, ordered by increasing reliability:
Layer 1 (Sign) → Natural language: "Wait for user confirmation"
Layer 2 (Traffic light) → AskUserQuestion tool call
Layer 3 (Physical wall) → PreToolUse Hook / Skill split / disallowed-tools
Layer 1: Natural Language — “Please Stop”
**→ WAIT** for user response before proceeding.
Confirm with user before proceeding.
This is a sign. The model sees it, weighs it against all other instructions, and decides whether to comply. Compliance drops under these conditions:
- Long context dilution: The longer the conversation, the lower the weight of earlier instructions
- Task completion bias: The model prefers completing tasks over stopping
- Vague wording: “Confirm with user” vs “Use AskUserQuestion tool”
Conclusion: Layer 1 should only be a last resort, never the primary defense.
Layer 2: AskUserQuestion — “Traffic Light That Might Break”
Use AskUserQuestion to confirm:
> "Push to origin refs/for/main? (Change-Id: I1234)"
Only proceed after user confirms.
AskUserQuestion is a tool call instruction. If the model decides to call it, the runtime forces a pause and waits for user response. But the first step — “the model decides to call it” — is still probabilistic.
GitHub Issue #19308 reports exactly this behavior: the model ignoring explicit tool call instructions in Skills.
Probabilistic invocation → (if invoked) → Runtime blocks deterministically
           ↑                                             ↑
        May fail                                   100% reliable
Better than Layer 1 thanks to the runtime blocking, but the entry point is still probabilistic.
Layer 3: Runtime Mechanisms — “You Can’t Move”
3a. PreToolUse Hook
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "if": "Bash(git push*)",
            "command": "/path/to/safety-hook.sh",
            "timeout": 5
          }
        ]
      }
    ]
  }
}
The hook intercepts before the tool call executes. The model cannot bypass it. The if field ensures the hook process only spawns when matching a dangerous pattern, so normal commands aren’t affected.
The hook script reads JSON from stdin, inspects the command, and returns a permissionDecision:
#!/usr/bin/env bash
# Read the hook payload (JSON) from stdin.
INPUT=$(cat)
# Extract tool_input.command; if parsing fails, COMMAND stays empty and the call passes through.
COMMAND=$(printf '%s' "$INPUT" | python3 -c "
import json,sys
print(json.load(sys.stdin).get('tool_input',{}).get('command',''))
" 2>/dev/null)
# DENY: hard block
if printf '%s\n' "$COMMAND" | grep -qE 'git\s+push\s+.*(--force|-f)\b'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"deny",
    "permissionDecisionReason":"Force push rewrites remote history."
  }}'
  exit 0
fi
# ASK: prompt user (matches a bare "git push" as well as one with arguments)
if printf '%s\n' "$COMMAND" | grep -qE 'git\s+push(\s|$)'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"ask",
    "permissionDecisionReason":"Git push detected. Confirm target branch."
  }}'
  exit 0
fi
# Passthrough: no output, exit 0
exit 0
Three decisions:
| Decision | Effect | Use Case |
|---|---|---|
| deny | Hard block, user must re-authorize | Force push, S3 deletion, production DB writes |
| ask | Prompt for confirmation, can proceed | Regular push, device reboot |
| (empty) | Pass through | Normal commands |
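The same stdin-to-stdout contract can be sketched in Python, which makes the decision logic easier to unit test. This mirrors only the two git push rules from the shell script; decide and hook_response are hypothetical names, and the output shape is copied from that script rather than from a specification.

```python
import json
import re


def decide(command: str):
    """Return (decision, reason) for a Bash command, or None to pass through."""
    if re.search(r"git\s+push\s+.*(--force|-f)\b", command):
        return ("deny", "Force push rewrites remote history.")
    if re.search(r"git\s+push(\s|$)", command):
        return ("ask", "Git push detected. Confirm target branch.")
    return None


def hook_response(raw_input: str) -> str:
    """Map the hook's stdin JSON to its stdout text (empty string = pass through)."""
    try:
        command = json.loads(raw_input).get("tool_input", {}).get("command", "")
    except (json.JSONDecodeError, AttributeError):
        return ""  # unparseable payload: pass through, like the shell version
    if not isinstance(command, str):
        return ""
    result = decide(command)
    if result is None:
        return ""
    decision, reason = result
    return json.dumps({"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "permissionDecision": decision,
        "permissionDecisionReason": reason,
    }})
```

To run it as a hook, a small wrapper would write hook_response(sys.stdin.read()) to stdout and exit 0.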
3b. Skill Splitting
/bads-update → Steps 1-4 (dry-run only)
/bads-update-execute → Steps 5-8 (execute + verify + cleanup)
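The user must manually type /bads-update-execute. This is a runtime guarantee only as long as the model cannot invoke that Skill on its own; Claude Code's disable-model-invocation frontmatter field exists for that purpose, and it is worth confirming it is set on the execute Skill.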
3c. disallowed-tools
---
name: aosp-analysis
disallowed-tools:
- Edit
- Write
---
Removes write tools from the model’s available tool pool. A read-only Skill doesn’t need to modify files.
Implementation
Phase 1: PreToolUse Hook
Created a shared hook script with 6 if filters:
| Pattern | Decision | Reason |
|---|---|---|
| git push --force | deny | Rewrites remote history |
| aws s3 rm | deny | S3 deletion is irreversible |
| kubectl delete (except tunnel cleanup) | deny | K8s resource deletion |
| --execute --yes | deny | Production DB write |
| git push | ask | Confirm target branch |
| adb reboot | ask | Device restart |
| docker compose down | ask | Service shutdown |
The if field is critical — it uses permission rule syntax filtering to ensure the hook process only spawns on matches. It doesn’t affect ls, git status, ./gradlew, or any other normal command.
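The pattern table maps naturally onto an ordered rule list. The following is a sketch of that idea, not the actual shared script: the regexes are my reading of the table, and TUNNEL_CLEANUP is a made-up stand-in for the "tunnel cleanup" exception, whose real form is not shown in this post.

```python
import re

KUBECTL_DELETE = r"kubectl\s+delete\b"
# Hypothetical exception pattern; replace with your real tunnel cleanup command.
TUNNEL_CLEANUP = r"kubectl\s+delete\s+pod\s+db-tunnel\b"

# (pattern, decision, reason), checked in order: deny rules before ask rules.
RULES = [
    (r"git\s+push\s+.*(--force|-f)\b", "deny", "Rewrites remote history"),
    (r"aws\s+s3\s+rm\b", "deny", "S3 deletion is irreversible"),
    (KUBECTL_DELETE, "deny", "K8s resource deletion"),
    (r"--execute\s+--yes\b", "deny", "Production DB write"),
    (r"git\s+push(\s|$)", "ask", "Confirm target branch"),
    (r"adb\s+reboot\b", "ask", "Device restart"),
    (r"docker\s+compose\s+down\b", "ask", "Service shutdown"),
]


def classify(command: str) -> str:
    """Return 'deny', 'ask', or 'passthrough' for a Bash command string."""
    is_tunnel_cleanup = re.search(TUNNEL_CLEANUP, command) is not None
    for pattern, decision, _reason in RULES:
        # The exception only disables the kubectl rule, never the others.
        if pattern == KUBECTL_DELETE and is_tunnel_cleanup:
            continue
        if re.search(pattern, command):
            return decision
    return "passthrough"
```

Ordering matters: the force-push rule must come before the plain push rule, or every force push would be downgraded to ask. String matching stays best-effort, so a command that chains the tunnel cleanup with another kubectl delete would slip past the kubectl rule here.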
Phase 2: Split the Most Dangerous Skill
bads-update was my most dangerous Skill:
Before:
Steps 1-8: token → DB tunnel → dry-run → execute → verify → S3 cleanup
Only a single "Confirm with user before proceeding" in between
After:
/bads-update (Steps 1-4):
token → DB tunnel → dry-run → STOP
"Dry-run complete. Run /bads-update-execute when ready."
/bads-update-execute (Steps 5-8):
execute → verify → S3 cleanup
Hook intercepts --execute --yes and aws s3 rm
Double protection: Skill split (user must manually invoke) + Hook interception (blocks even within the execute skill).
Phase 3: Add AskUserQuestion to Destructive Skills
| Skill | Checkpoint Location | Question |
|---|---|---|
| commit-phase | Before push | “Push to {remote} refs/for/{branch}?” |
| device-test | Before APK push | “Push {apk} to {device}:/product/priv-app/?” |
| self-evolution | Before modifications | “Apply these changes to CLAUDE.md / skills?” |
| writing | Before file write | “Write this content to {filepath}?” |
Phase 4: disallowed-tools for 17 Read-Only Skills
One line of frontmatter. Analysis Skills can’t modify files.
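With 17 files to touch, it is easy to miss one. Here is a small checker sketch that confirms a Skill's frontmatter lists both Edit and Write under disallowed-tools. It assumes the block-list layout shown in the aosp-analysis example and deliberately avoids a YAML dependency, so an inline form such as [Edit, Write] would not be recognized; both function names are hypothetical.

```python
def disallowed_tools(skill_md: str) -> list[str]:
    """Extract the disallowed-tools list from simple block-style frontmatter."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter at all
    tools, in_list = [], False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of frontmatter
            break
        if stripped.startswith("disallowed-tools:"):
            in_list = True
        elif in_list and stripped.startswith("- "):
            tools.append(stripped[2:].strip())
        elif stripped:
            in_list = False  # a different key started
    return tools


def is_locked_read_only(skill_md: str) -> bool:
    """True when both write tools are removed from the Skill's tool pool."""
    return {"Edit", "Write"} <= set(disallowed_tools(skill_md))
```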
Verification
Tested the hook script with Python subprocess — all 10 test cases passed:
PASS | safe cmd | expected=passthrough | actual=passthrough
PASS | adb reboot | expected=ask | actual=ask
PASS | docker down | expected=ask | actual=ask
PASS | s3 rm | expected=deny | actual=deny
PASS | kubectl del | expected=deny | actual=deny
PASS | tunnel cleanup | expected=passthrough | actual=passthrough
PASS | execute yes | expected=deny | actual=deny
PASS | regular push | expected=ask | actual=ask
PASS | force push | expected=deny | actual=deny
PASS | git status | expected=passthrough | actual=passthrough
The most entertaining validation: the hook intercepted its own test commands during testing. Because the Bash command string contained --execute --yes, the PreToolUse hook thought it was a production DB write and denied it. This turned out to be the strongest end-to-end proof — the hook works in real runtime.
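A harness along these lines can reproduce that PASS table. This is a sketch rather than the exact test script: run_hook and check are hypothetical names, and the payload contains only the tool_input.command field that the hook script reads, whereas the real hook payload carries more fields.

```python
import json
import subprocess


def run_hook(argv: list[str], command: str) -> str:
    """Feed one simulated Bash tool call to a hook and return its decision."""
    payload = json.dumps({"tool_input": {"command": command}})
    proc = subprocess.run(
        argv, input=payload, capture_output=True, text=True, timeout=5
    )
    out = proc.stdout.strip()
    if not out:
        return "passthrough"  # empty stdout means the hook let the call through
    return json.loads(out)["hookSpecificOutput"]["permissionDecision"]


def check(argv: list[str], cases) -> bool:
    """Print PASS/FAIL per (name, command, expected) case; True if all passed."""
    all_ok = True
    for name, command, expected in cases:
        actual = run_hook(argv, command)
        status = "PASS" if actual == expected else "FAIL"
        all_ok = all_ok and actual == expected
        print(f"{status} | {name} | expected={expected} | actual={actual}")
    return all_ok
```

Calling check(["/path/to/safety-hook.sh"], [("force push", "git push --force origin main", "deny")]) exercises the script directly, outside Claude Code, so the hook cannot intercept the test run itself.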
Before and After
Before
38 skills:
Layer 3: 0 skills ← Not a single one used Runtime protection
Layer 2: 1 skill
Layer 1: 10 skills
No checkpoint: 12 skills (5 destructive)
Read-only: 17 skills (unprotected)
After
38 skills:
PreToolUse Hook: 6 if-filters covering all dangerous Bash patterns
Skill split: bads-update → bads-update + bads-update-execute
AskUserQuestion: 4 destructive skills
disallowed-tools: 17 read-only skills
Every destructive operation has at least one Runtime defense ✓
Design Principles
1. Defense in Depth, But Primary Defense Must Be Layer 3
Layer 3 (must-have) → PreToolUse Hook / Skill split
Layer 2 (nice-to-have) → AskUserQuestion for UX
Layer 1 (last resort) → Natural language hints
Layers 2 and 1 provide defense-in-depth. They are not the primary defense.
2. if Filters Are a Performance Requirement
Without if filters, every Bash command spawns a hook process. ls, git status, ./gradlew build — each adds 100-200ms of latency.
// Good: only spawn hook on git push
{"if": "Bash(git push*)", "command": "safety-hook.sh"}
// Bad: every Bash command spawns hook
{"command": "safety-hook.sh"}
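To reason about which commands a filter will catch, it helps to model the rule as a tool name plus a glob. The sketch below is only a rough approximation built on Python's fnmatch; it is not Claude Code's actual matcher, and would_spawn is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def would_spawn(if_rule: str, tool: str, command: str) -> bool:
    """Approximate an `if` filter of the form 'Tool(glob)' against one command."""
    rule_tool, _, rest = if_rule.partition("(")
    glob = rest[:-1] if rest.endswith(")") else rest
    return tool == rule_tool and fnmatchcase(command, glob)
```

Under this approximation, a compound command such as cd repo && git push would not match Bash(git push*). Whether the real matcher handles that case is something to test against your own Claude Code version rather than assume.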
3. Skill Splitting Is More Reliable Than Checkpoints
Adding a checkpoint inside a Skill (whether Layer 1 or Layer 2) means the model can probabilistically skip it. Skill splitting instead requires the user to manually type another slash command, a step that sits outside the model's control, provided the second Skill can only be invoked by the user.
4. allowed-tools ≠ Restriction
Claude Code’s allowed-tools is pre-approval, not restriction. Using allowed-tools: [Bash, Read] doesn’t prevent the model from using Edit — it just triggers a permission prompt when Edit is called.
For actual restriction, use disallowed-tools.
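The difference is easiest to see as a lookup. This toy model encodes only the behavior described in this section; tool_outcome is a hypothetical function, and real outcomes also depend on user-level permission settings that it ignores.

```python
def tool_outcome(tool: str, allowed: set[str], disallowed: set[str]) -> str:
    """Toy model: what happens when the model tries to call `tool`."""
    if tool in disallowed:
        return "unavailable"  # removed from the tool pool entirely
    if tool in allowed:
        return "runs without prompt"  # pre-approved
    return "permission prompt"  # still callable, subject to user approval
```

With allowed-tools set to Bash and Read, an Edit call lands in the third branch, so it is gated by a prompt but not prevented.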
Conclusion
The core question of AI Agent safety isn’t “will the model listen?” — it’s “what happens if it doesn’t?”
For read-only operations — no big deal. Worst case is some wasted context.
For irreversible operations — git push --force, aws s3 rm, production DB writes — you don’t need “please ask me first.” You need “you literally can’t.”
PreToolUse Hook is that “you literally can’t.”
References
- Claude Code Hooks Guide: Official PreToolUse Hook documentation
- Claude Code Skills: Skill frontmatter (allowed-tools, disallowed-tools)
- Handle approvals and user input: AskUserQuestion blocking behavior
- GitHub Issue #19308: Model ignoring explicit tool call instructions in Skills
- GitHub Issue #18454: Model ignoring MANDATORY natural language checkpoints in CLAUDE.md
This is part of the “Claude Code in Practice” series. Previous post: Claude Code Skill Safety Gates: A Three-Layer Model from Auditing 35 Skills.