Evaluations for AI Sabotage Risks

Anthropic's new evaluations test AI models for potential sabotage capabilities, aiming to keep safety assessments ahead of those risks as model capabilities increase.