Evaluations for AI Sabotage Risks

Anthropic's new evaluations test AI models for potential sabotage risks, with the goal of maintaining safety as model capabilities increase.