

AI Model Safety


Model safety is the operational discipline of deploying AI models safely in production. It integrates training-side alignment work, pre-deployment evaluation, deployment-time controls, ongoing monitoring, and incident response into a single operational practice. It sits at the intersection of training, deployment, and operations, drawing on alignment research and red teaming while addressing the broader question of how models reach and remain in deployment.

The training-side discipline is treated separately in Alignment. The adversarial evaluation discipline is treated separately in Red Teaming. Specific failure modes including confabulation and capability drift are treated in Hallucination & Drift. This page covers the operational umbrella that integrates these specialized disciplines into deployment-time safety practice.


Capability, Propensity, and Controls

The conceptual framework that organizes much of contemporary model safety work distinguishes three questions about a deployed model.

Capability addresses what the model can do. A model has specific capabilities determined by its training, scale, and design. Capabilities include both intended capabilities the operator built the model for and unintended capabilities that emerged from training. Capability is a property of the model itself.

Propensity addresses whether the model is inclined to use its capabilities harmfully. A model with the capability to produce harmful content may have low propensity to do so because of its training, instruction-following, and refusal behavior. Propensity is shaped by training-side work including RLHF, constitutional AI, and refusal training.

Controls address what bounds the model's behavior at deployment regardless of capability and propensity. Behavioral envelopes, monitoring, human oversight, and the broader operational infrastructure all operate as controls. Well-designed controls are meant to hold even when capability is high and propensity has been compromised through adversarial inputs.

The framework supports systematic safety analysis. Risk is bounded when capability is limited, when propensity is limited, or when controls constrain what the model can affect, and the bounds at multiple layers combine to produce the overall safety posture. A model with concerning capability requires either substantial propensity work to bound how it uses the capability, substantial controls to bound what the capability can affect, or both.
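The layered reading of the framework can be illustrated with a small sketch. The `LayerAssessment` type, its field names, and the rule that a single bounding layer is enough to call a risk bounded are illustrative assumptions drawn from the paragraph above, not an established scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerAssessment:
    """Hypothetical record of which layers bound risk for one concern."""
    concern: str
    capability_bounded: bool   # the model cannot do the concerning thing
    propensity_bounded: bool   # the model is reliably disinclined to do it
    controls_bounded: bool     # deployment controls limit what it can affect


def bounding_layers(a: LayerAssessment) -> list[str]:
    """Return the names of the layers that currently bound the risk."""
    layers = {
        "capability": a.capability_bounded,
        "propensity": a.propensity_bounded,
        "controls": a.controls_bounded,
    }
    return [name for name, bounded in layers.items() if bounded]


def is_bounded(a: LayerAssessment) -> bool:
    """Treat the risk as bounded when at least one layer bounds it."""
    return len(bounding_layers(a)) > 0


cyber = LayerAssessment("offensive cyber", False, True, True)
print(bounding_layers(cyber))  # ['propensity', 'controls']
```

Listing the bounding layers, rather than returning only a yes or no, makes it visible when a concern depends on a single layer and would be exposed if that layer failed.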

The framework also clarifies why specific work matters. Training-side alignment work addresses propensity. Red teaming surfaces both capabilities the developers did not anticipate and propensity failures the alignment work did not address. Controls work addresses what cannot be addressed at training time. The integration of all three is what produces operational safety.


Safety Case Construction

Safety case construction is the discipline of building documented arguments that a system is safe to deploy in a specific context. The methodology was developed in safety-critical engineering contexts including aviation, nuclear, and medical devices, and is increasingly applied to AI systems.

A safety case is a structured argument supported by evidence. It states the main claim (the system is safe to deploy for purpose X in context Y), the sub-claims that together support it, the evidence behind each sub-claim, and the assumptions on which the argument depends.
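The claim, sub-claim, evidence, and assumption structure described above maps naturally onto a tree. The following is a minimal sketch of that idea; the `Claim` type and the `unsupported` helper are hypothetical names and do not come from any safety case standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One node in a safety argument (illustrative structure only)."""
    statement: str
    evidence: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    sub_claims: list["Claim"] = field(default_factory=list)


def unsupported(claim: Claim) -> list[str]:
    """Collect leaf claims that cite no evidence."""
    if not claim.sub_claims:
        return [] if claim.evidence else [claim.statement]
    gaps: list[str] = []
    for sub in claim.sub_claims:
        gaps.extend(unsupported(sub))
    return gaps


case = Claim(
    statement="System is safe to deploy for purpose X in context Y",
    assumptions=["Users are trained professionals"],
    sub_claims=[
        Claim("Refusal behavior holds on prohibited categories",
              evidence=["refusal evaluation report"]),
        Claim("Output monitoring detects policy violations"),
    ],
)
print(unsupported(case))  # ['Output monitoring detects policy violations']
```

A check like this only finds claims with no evidence attached. Whether the attached evidence actually supports the claim still requires human review.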

Safety cases for AI face specific challenges. The behavior of AI systems is not fully specified by code and depends on training data the operator may not have authored. The evaluation methodology is less mature than for conventional engineering. Edge cases may not be foreseeable in ways that conventional safety analysis assumes. The framework continues to develop with substantial activity from industry and standards bodies.

UL 4600 provides safety case methodology specifically for autonomous products. The framework addresses how to construct safety cases for AI-controlled systems including autonomous vehicles. The methodology is being extended to other AI applications.

The EU AI Act includes safety case-relevant requirements for high-risk AI systems through the technical documentation requirements of Article 11. The documentation must demonstrate that the system complies with the regulatory requirements, which in practice calls for the kind of structured argument and supporting evidence that a safety case provides.

Frontier AI labs have been developing safety case methodology specifically for frontier model deployment. The work addresses how to argue that increasingly capable models are safe to deploy in specific contexts and what evidence supports such arguments. The methodology continues to develop as model capability advances.


Pre-Deployment Evaluation

Pre-deployment evaluation is the work of establishing what a model does before it is released or deployed. The discipline supports both safety assessment and deployment-context appropriateness.

Capability evaluation establishes what the model can do across the range of tasks the operator cares about. The evaluations may include standard benchmarks, operator-specific evaluations, and custom evaluations for specific concerns.

Safety evaluation establishes propensity to behave safely under various conditions. The evaluations include refusal behavior on prohibited categories, robustness to adversarial inputs, behavior under unusual conditions, and the broader category of "what does the model do when something goes wrong."

Bias evaluation establishes patterns of differential behavior across populations and contexts. The detailed treatment appears in Bias & Fairness.

Robustness evaluation establishes how the model behaves under distribution shift, adversarial inputs, and the broader category of inputs that differ from training distribution.

Red team evaluation establishes what adversarial probing can elicit from the model. The detailed treatment of red teaming as a discipline appears in Red Teaming.

External evaluation by third parties supplements internal evaluation with independent perspective. The AI Safety Institute network, academic researchers, and contracted evaluators all provide external evaluation that internal work alone does not produce. The external evaluation has been increasingly important for frontier model releases.

The limits of current evaluation methodology are substantive. An evaluation provides evidence only about the behaviors it tests; a passing result says little about behaviors outside its scope. A model that passes evaluations may still produce concerning behavior in deployment because the evaluation did not test for the specific concern. The discipline continues to develop methodology that addresses these limits.
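One modest way to make the coverage limit concrete is to compare the concerns an operator has listed against what the evaluation suite claims to test. The sketch below assumes hypothetical concern labels and suite names. It can only reveal gaps against concerns someone has already written down, which is exactly the limit the paragraph above describes.

```python
def coverage_gaps(concerns: set[str],
                  evaluations: dict[str, set[str]]) -> set[str]:
    """Return listed concerns that no evaluation in the suite claims to test."""
    covered: set[str] = set()
    for tested in evaluations.values():
        covered |= tested
    return concerns - covered


# Hypothetical concern register and evaluation suite.
concerns = {"prohibited-content refusal", "prompt injection", "distribution shift"}
suite = {
    "refusal_benchmark": {"prohibited-content refusal"},
    "adversarial_probe": {"prompt injection"},
}
print(sorted(coverage_gaps(concerns, suite)))  # ['distribution shift']
```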


Dangerous Capability Evaluations

Dangerous capability evaluations are specific evaluations targeting capabilities that warrant particular attention because of their potential for severe harm. The category has emerged as a focus area for frontier model safety work.

Capability Category | What It Targets | Evaluation Approach
Bio | Capability to provide uplift to biological weapons development | Expert evaluation, structured task assessment, capability elicitation under adversarial conditions
Chemical | Capability to provide uplift to chemical weapons development | Expert evaluation focused on synthesis pathways, precursor knowledge, and adjacent capability
Cyber | Capability to develop, modify, or operate offensive cyber capabilities | Structured cyber capability evaluation including vulnerability discovery, exploit development, and autonomous cyber operation
Autonomous replication | Capability to acquire resources, persist, and replicate without authorization | Task-based evaluation of agentic capabilities relevant to autonomous operation
Deception and scheming | Capability and propensity to deceive evaluators or operators | Targeted evaluation of model behavior under conditions that might reveal scheming if present
Persuasion and manipulation | Capability to influence human decision-making beyond legitimate persuasion | Human study and structured evaluation of model influence on decisions
AI development | Capability to contribute substantively to AI development including model training and research | Task-based evaluation of AI research and engineering capability

The evaluations are at varying levels of methodological maturity. Cyber evaluation has substantial precedent from cybersecurity research; bio and chemical evaluations require careful design so that eliciting capabilities does not itself create safety concerns; autonomous replication and scheming evaluations rely on emerging methodology specific to AI safety.

The Inspect framework, METR evaluation work, and similar emerging infrastructure provide standardized methodology for dangerous capability evaluation. The AI Safety Institute network has been developing shared evaluation infrastructure that supports cross-organization comparison.


Responsible Scaling and Frontier Model Safety Frameworks

Frontier model safety has developed specific frameworks that address the safety considerations of the most capable models. The frameworks operate as voluntary commitments by frontier labs with varying degrees of external scrutiny.

Anthropic's Responsible Scaling Policy (RSP) defines AI Safety Levels (ASL) and specific commitments at each level. The framework includes both capability thresholds and corresponding safety measures, with the operator committing to specific practices when models reach defined capability levels. The framework has been revised through multiple versions as the methodology develops.

OpenAI's Preparedness Framework provides analogous structure with specific risk categories, capability thresholds, and corresponding safety measures. The framework addresses cybersecurity, CBRN (chemical, biological, radiological, nuclear), persuasion, and model autonomy categories with structured assessment.

Google DeepMind's Frontier Safety Framework provides similar structure with critical capability levels and corresponding mitigation requirements. The framework integrates with broader DeepMind safety practice including evaluations and ongoing research.

Microsoft's Responsible AI Standard and frontier-specific safety practice address the safety considerations of models Microsoft develops and deploys. The framework includes specific evaluation and deployment criteria.

Meta's frontier AI safety practice operates alongside Meta's open weights model releases, addressing the specific safety considerations of releasing model weights rather than only API access.

The Frontier Model Forum coordinates safety practice among major frontier labs. The Forum produces shared work on evaluation, safety research, and best practice with varying transparency to outside observers.

The AI Safety Institute network, including the UK AISI, US AISI, Japan AISI, and equivalent institutes in additional countries, provides external evaluation and coordination. In some cases the institutes receive pre-deployment access to frontier models for their evaluations.

The aggregate framework is substantial but operates as voluntary commitment rather than binding regulation. The EU AI Act introduces binding obligations for general-purpose AI models with systemic risk, but the implementation continues to develop. The voluntary framework provides current operational substance while the regulatory framework matures.
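The frameworks described in this section share a common shape: capability thresholds paired with the safety measures required once a threshold is reached. The sketch below illustrates that shape only. The tier names, score thresholds, and safeguard labels are placeholders and do not reproduce the content of any lab's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical capability tier with the safeguards it requires."""
    name: str
    min_score: float                      # evaluation score at which the tier applies
    required_safeguards: frozenset[str]


# Placeholder tiers, ordered from lowest to highest threshold.
TIERS = [
    Tier("baseline", 0.0, frozenset({"usage policy"})),
    Tier("elevated", 0.5, frozenset({"usage policy", "output classifier"})),
    Tier("critical", 0.8, frozenset({"usage policy", "output classifier",
                                     "restricted access"})),
]


def tier_for(score: float) -> Tier:
    """Pick the highest tier whose threshold the score meets."""
    reached = [t for t in TIERS if score >= t.min_score]
    if not reached:
        raise ValueError("score is below the lowest tier threshold")
    return reached[-1]


def missing_safeguards(score: float, in_place: set[str]) -> set[str]:
    """Safeguards the reached tier requires that are not yet in place."""
    return set(tier_for(score).required_safeguards) - in_place


print(tier_for(0.6).name)                          # elevated
print(missing_safeguards(0.6, {"usage policy"}))   # {'output classifier'}
```

Reducing capability to a single score is a simplification made for the example; real frameworks assess several risk categories separately.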


Refusal and Content Policy

Refusal behavior addresses what models should and should not produce as output. The discipline addresses the operational implementation of safety considerations at the model output layer.

Content policy definition determines what categories of output the model should refuse, what it should produce with caveats, and what it should produce freely. The categorization is operator-specific and shaped by legal requirements, business considerations, and broader safety analysis.

Refusal training implements content policy at the model level. RLHF, constitutional AI, and equivalent methodologies train models to refuse prohibited categories. The work has substantive impact on model behavior but does not produce perfect refusal.

Refusal evaluation tests whether the trained refusal behavior actually holds across the range of inputs the model encounters. The evaluation includes both standard refusal benchmarks and adversarial evaluation that probes for refusal failures.

Overrefusal is the failure mode where models refuse legitimate requests because they superficially resemble prohibited categories. The pattern produces operational frustration and may push users to less safe alternatives. Mature operators balance refusal completeness against overrefusal.
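The balance between refusal completeness and overrefusal can be tracked with two rates computed from a labeled evaluation set. This is a minimal sketch, assuming each evaluated prompt has already been labeled as prohibited or legitimate under the content policy and each response has been judged as a refusal or not. The `Outcome` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One evaluated prompt: policy label and whether the model refused."""
    prohibited: bool
    refused: bool


def refusal_rates(outcomes: list[Outcome]) -> tuple[float, float]:
    """Return (refusal rate on prohibited prompts, overrefusal rate on legitimate prompts)."""
    prohibited = [o for o in outcomes if o.prohibited]
    legitimate = [o for o in outcomes if not o.prohibited]
    if not prohibited or not legitimate:
        raise ValueError("need at least one prohibited and one legitimate prompt")
    refusal = sum(o.refused for o in prohibited) / len(prohibited)
    overrefusal = sum(o.refused for o in legitimate) / len(legitimate)
    return refusal, overrefusal


results = [
    Outcome(prohibited=True, refused=True),
    Outcome(prohibited=True, refused=False),    # refusal failure
    Outcome(prohibited=False, refused=False),
    Outcome(prohibited=False, refused=True),    # overrefusal
    Outcome(prohibited=False, refused=False),
    Outcome(prohibited=False, refused=False),
]
print(refusal_rates(results))  # (0.5, 0.25)
```

Reporting both numbers together prevents a policy change from looking like an improvement when it only trades one failure mode for the other.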

Jailbreak resistance addresses the adversarial dimension of refusal. The detailed treatment of jailbreak attacks appears in Cybersecurity.

Content classification at the output layer provides additional defense beyond model-level refusal. Output classifiers can catch content that the model produced despite its refusal training. The infrastructure operates as an additional control that does not depend on the model itself.
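The output-layer control can be sketched as a wrapper that passes every model response through a classifier before it is returned. Both `toy_model` and `toy_classifier` below are stand-ins invented for the example. A production system would call a real model and a trained classifier in their place.

```python
from typing import Callable

WITHHELD_MESSAGE = "This response was withheld by the output filter."


def guarded(generate: Callable[[str], str],
            is_violation: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap a generation function so every output passes through a classifier."""
    def wrapper(prompt: str) -> str:
        output = generate(prompt)
        if is_violation(output):
            return WITHHELD_MESSAGE
        return output
    return wrapper


# Stand-ins for a real model and a real classifier (both hypothetical).
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_classifier(text: str) -> bool:
    return "forbidden" in text.lower()


safe_model = guarded(toy_model, toy_classifier)
print(safe_model("hello"))            # echo: hello
print(safe_model("FORBIDDEN topic"))  # This response was withheld by the output filter.
```

Because the wrapper checks the output rather than the prompt, it still applies when an adversarial input has bypassed the model's own refusal behavior.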

The refusal policy varies across operators and across models within operators. Different deployment contexts may warrant different refusal policies. Mature operators design refusal policies deliberately rather than applying uniform policy across all contexts.


Deployment Context Appropriateness

A model that is safe to deploy in one context may not be safe in another. The discipline addresses how operators match models to contexts.

Use case appropriateness considers whether the model's capabilities match the intended use. A model trained for general assistance may not be appropriate for medical decision support; a model evaluated for limited domains may not be appropriate for broader deployment. The match between capability and use case shapes whether deployment is appropriate.

User population appropriateness considers who will interact with the model. A model appropriate for trained professionals may not be appropriate for general consumers; a model appropriate for adults may not be appropriate for children. The user population shapes what considerations apply.

Stakes appropriateness considers what consequences attach to model outputs. A model whose outputs inform low-stakes decisions has different appropriateness than one whose outputs drive high-stakes consequences. The stakes affect what safety bar applies.

Regulatory context appropriateness considers what regulatory framework applies. Deployment in regulated sectors faces additional requirements beyond what general deployment faces. The detailed regulatory framework appears throughout the Governance pillar.

Geographic context appropriateness considers what jurisdictional considerations apply. Different jurisdictions have different legal requirements, different cultural contexts, and different operational considerations.

The matching of model to context is part of the deployment decision. Some models may not be appropriate for any context; others may be appropriate for many contexts; many models are appropriate for some contexts and not others. The discipline supports informed deployment rather than maximum deployment.
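The five appropriateness dimensions in this section can be read as a checklist comparing the scope a model was evaluated for against a proposed deployment. The sketch below assumes hypothetical field names and a three-level stakes scale. Real appropriateness decisions involve judgment that a membership check does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatedScope:
    """What a model has been evaluated for (hypothetical fields)."""
    use_cases: frozenset[str]
    populations: frozenset[str]
    max_stakes: int               # 1 = low, 2 = medium, 3 = high
    sectors: frozenset[str]
    jurisdictions: frozenset[str]


@dataclass(frozen=True)
class DeploymentContext:
    """A proposed deployment, described along the same dimensions."""
    use_case: str
    population: str
    stakes: int
    sector: str
    jurisdiction: str


def mismatches(scope: EvaluatedScope, ctx: DeploymentContext) -> list[str]:
    """List the dimensions where the context falls outside the evaluated scope."""
    issues = []
    if ctx.use_case not in scope.use_cases:
        issues.append("use case")
    if ctx.population not in scope.populations:
        issues.append("user population")
    if ctx.stakes > scope.max_stakes:
        issues.append("stakes")
    if ctx.sector not in scope.sectors:
        issues.append("regulatory context")
    if ctx.jurisdiction not in scope.jurisdictions:
        issues.append("geographic context")
    return issues


scope = EvaluatedScope(
    use_cases=frozenset({"general assistance"}),
    populations=frozenset({"adult professionals"}),
    max_stakes=2,
    sectors=frozenset({"general"}),
    jurisdictions=frozenset({"EU", "US"}),
)
ctx = DeploymentContext("medical decision support", "adult professionals",
                        3, "healthcare", "EU")
print(mismatches(scope, ctx))  # ['use case', 'stakes', 'regulatory context']
```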


Model Release Decisions

Model release strategy shapes what safety implications a model produces. The decisions involve substantial trade-offs that operators address deliberately.

Closed weights with API access maintains operator control over how the model is used. The operator can implement safety measures at the deployment layer, monitor usage patterns, and intervene when concerns emerge. The trade-off is reduced research access and greater user dependence on the operator.

Open weights releases the model itself for download and use. The release supports research, deployment diversity, and reduced operator dependence. The trade-off is loss of operator control over how the model is used; safety measures at the operator layer no longer constrain downstream use.

Staged release combines elements of both through phased availability. Initial release may be limited to research access; subsequent stages may broaden access as confidence develops. The pattern allows the operator to monitor early use before broader release.

Capability-tiered release matches access to capability. Less capable models may be released more openly; more capable models may face more restrictive access. The pattern aligns release strategy with safety considerations.

The choice between closed and open release for frontier models has been substantively contested. Arguments for open release emphasize research benefit, deployment diversity, and the legitimacy of broader access. Arguments for closed release emphasize safety considerations, the irreversibility of release, and the limits of downstream safety practice.

The operational practice across major labs varies. Anthropic's Claude models are closed weights with API access. OpenAI's GPT models have been closed weights with API access. Meta's Llama models have been released with open weights under various licenses. Google's models have a mix of release strategies. The variance reflects different operator judgments about the trade-offs.


Continuous Safety Through Model Updates

Model safety is not a one-time deployment decision; it is ongoing work as models update, contexts change, and deployment experience accumulates.

Model versioning supports tracking what specific model version is deployed where. The infrastructure becomes operationally important when safety concerns emerge that affect specific versions but not others.
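The versioning need described above comes down to answering one question quickly: which deployments run an affected version? The following in-memory sketch illustrates that lookup. The `DeploymentRegistry` class and the deployment and version names are hypothetical, and a real registry would be backed by persistent storage.

```python
from collections import defaultdict


class DeploymentRegistry:
    """Minimal in-memory record of which model version runs in which deployment."""

    def __init__(self) -> None:
        self._by_version: dict[str, set[str]] = defaultdict(set)
        self._current: dict[str, str] = {}

    def deploy(self, deployment: str, version: str) -> None:
        """Record that a deployment now runs the given version."""
        previous = self._current.get(deployment)
        if previous is not None:
            self._by_version[previous].discard(deployment)
        self._by_version[version].add(deployment)
        self._current[deployment] = version

    def affected_by(self, versions: set[str]) -> set[str]:
        """Return deployments currently running any of the flagged versions."""
        affected: set[str] = set()
        for v in versions:
            affected |= self._by_version.get(v, set())
        return affected


registry = DeploymentRegistry()
registry.deploy("support-chat", "v1.2")
registry.deploy("internal-search", "v1.3")
registry.deploy("support-chat", "v1.3")          # upgrade replaces v1.2
print(sorted(registry.affected_by({"v1.3"})))    # ['internal-search', 'support-chat']
print(registry.affected_by({"v1.2"}))            # set()
```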

Pre-update evaluation tests new model versions before deployment. The evaluation addresses both whether the new version maintains the safety properties of the prior version and whether new capabilities introduce new safety considerations.
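Both halves of pre-update evaluation can be expressed as a comparison between two sets of evaluation scores. The sketch below assumes higher-is-better metrics and a hypothetical tolerance. The metric names and values are invented for illustration.

```python
def safety_regressions(prior: dict[str, float],
                       candidate: dict[str, float],
                       tolerance: float = 0.01) -> dict[str, tuple[float, float]]:
    """Return {metric: (prior score, candidate score)} for metrics that got worse.

    Scores are assumed to be higher-is-better. A metric missing from the
    candidate run is treated as negative infinity so that a skipped
    evaluation cannot pass silently.
    """
    regressions = {}
    for metric, old in prior.items():
        new = candidate.get(metric, float("-inf"))
        if new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions


def unbaselined(prior: dict[str, float],
                candidate: dict[str, float]) -> set[str]:
    """Metrics measured only on the candidate, so no prior baseline exists."""
    return set(candidate) - set(prior)


prior = {"refusal_rate": 0.98, "jailbreak_resistance": 0.90}
candidate = {"refusal_rate": 0.985, "jailbreak_resistance": 0.84,
             "tool_use_safety": 0.91}
print(safety_regressions(prior, candidate))  # {'jailbreak_resistance': (0.9, 0.84)}
print(unbaselined(prior, candidate))         # {'tool_use_safety'}
```

The first function covers whether the new version maintains prior safety properties. The second flags metrics tied to new capabilities, which need review on their own terms because there is nothing to compare them against.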

Fine-tuning safety addresses how operator fine-tuning may affect safety properties. Fine-tuning can degrade safety behavior trained into the base model; mature operators evaluate fine-tuned models specifically for safety properties.

Model deprecation addresses the end-of-life dimension. Models that should no longer be deployed need to be retired with appropriate transition for downstream users. The infrastructure for graceful deprecation is operationally important.

Post-deployment monitoring extends to safety-relevant patterns. The infrastructure overlaps with Monitoring & Anomaly Detection but with specific attention to safety-relevant signals.

Incident response specifically for safety concerns addresses when model behavior produces concerning patterns in production. The response may include rollback to prior versions, deployment of mitigations, or other operational measures.

Ongoing red teaming and evaluation extends pre-deployment work into production. The work catches patterns that pre-deployment evaluation may have missed and addresses the reality that adversaries continue to probe deployed models.


Operational Considerations

Operators implementing model safety practice face several recurring considerations.

Investment in safety capacity has substantial cost. The evaluation infrastructure, red teaming work, alignment research, and broader safety practice represent significant operator investment. The cost is borne by operators that take safety seriously; operators that invest less face less direct cost but accumulate exposure that may produce larger eventual cost.

External evaluation engagement supports both safety assessment and reputational positioning. AI Safety Institute access, academic researcher engagement, and contracted external evaluation all contribute. The engagement has costs and benefits that operators balance deliberately.

Transparency about safety practice supports stakeholder relationships. Model cards, system cards, safety evaluations, and broader transparency documentation contribute to operator-stakeholder relationships. The choice of what to disclose involves trade-offs operators navigate.

Coordination with industry safety practice supports both operator practice and broader ecosystem development. Frontier Model Forum participation, AI safety standards engagement, and broader industry coordination produce both substantive benefit and reputational positioning.

Regulatory engagement addresses the developing regulatory framework. Engagement with the EU AI Office, NIST, AI Safety Institutes, and equivalent bodies supports operator understanding of developing requirements and contributes to the broader framework.

Internal organizational design affects whether safety practice succeeds. Dedicated safety teams, governance structures, and accountability for safety outcomes shape what the operator can accomplish. The internal organization is part of the operational practice.


What Model Safety Does Not Solve

The discipline has real limits.

Model safety practice at one operator does not address safety at other operators. The aggregate ecosystem depends on practice across operators; the operator with mature safety practice operates within an ecosystem where other operators may have less mature practice.

Model safety does not address what happens with model weights once they are released openly. Open weights releases produce safety dynamics that operator-level practice cannot bound after release. The dynamic is one of the substantive considerations in release strategy.

Model safety does not eliminate the structural concerns about increasingly capable models. The capability trajectory continues and the safety practices that work at current capability may not generalize to future capability. The framework continues to develop with attention to this dynamic.

Model safety operates within the broader societal context. Safety practice that addresses model-level concerns does not address how AI is used in society, what AI deployment shifts in labor and economic patterns, or what AI changes in the broader information ecosystem. These dimensions involve different work that model safety alone does not address.

Model safety has its own failure modes. The evaluations may not catch what they were designed to catch; the safety measures may be circumvented; the operational practice may fail in ways the framework did not anticipate. The discipline includes ongoing attention to its own failure modes.


The Reframe

Model safety is the operational discipline of deploying AI models safely in production. The discipline integrates training-side alignment work, pre-deployment evaluation, deployment controls, and ongoing monitoring through the capability-propensity-controls framework that organizes contemporary work. Safety case construction, pre-deployment evaluation including dangerous capability evaluations, responsible scaling and frontier model safety frameworks, refusal and content policy, deployment context appropriateness, model release decisions, and continuous safety through model updates all combine into operational practice.

The frameworks developed by frontier labs including Anthropic's RSP, OpenAI's Preparedness Framework, Google DeepMind's Frontier Safety Framework, and equivalent work operate as voluntary commitments with the AI Safety Institute network providing external evaluation. The EU AI Act introduces binding obligations for general-purpose AI models with systemic risk. The discipline has real limits including the operator-level scope, the implications of open weights releases, the structural concerns about capability trajectory, and the broader societal context that model-level safety does not address.

For operators, the practical work involves substantial investment in safety capacity, external evaluation engagement, transparency about safety practice, industry coordination, regulatory engagement, and internal organizational design that supports safety outcomes. The work of building adequate model safety practice across the agentic AI ecosystem is one of the substantive ongoing projects the era requires, and the integration with the other disciplines covered across the site determines whether AI deployment can operate at scale within acceptable safety bounds.


Related Coverage

Security & Trust | Alignment | Red Teaming | Hallucination & Drift