Prompt Injection Classifier

A lightweight scikit-learn classifier that detects prompt injection and jailbreak attacks against LLMs. Built using the autoresearch autonomous experimentation pattern with Claude Code + Opus 4.6.

Architecture: Conservative ensemble (LinearSVC + LogisticRegression) — both models must agree to flag as malicious. Trained on neuralchemy/Prompt-injection-dataset. Val accuracy: 96.1% | Test accuracy: 95.2% | Inference: < 5ms

Examples