137AI > Security & Trust > Red Teaming
AI Red Teaming
Red teaming is the discipline of adversarial evaluation — testing AI systems by deliberately attempting to find failures, vulnerabilities, harmful outputs, and behavior outside intended scope. The term inherits from military and cybersecurity practice where red teams simulate adversaries to test defenses, and has developed AI-specific methodology that addresses what conventional security red teaming was not designed for.
The discipline pairs with adjacent work covered separately. Alignment is the training-side discipline that produces models red teaming evaluates. Model Safety is the operational umbrella that incorporates red teaming alongside other inputs. Cybersecurity covers AI cybersecurity broadly with red teaming for specifically cyber concerns. This page covers red teaming as a discipline including methodology, targets, institutional infrastructure, and limits.
What Distinguishes Red Teaming
Red teaming is methodologically distinct from adjacent evaluation work. The distinction matters because conflating them produces confusion about what specific work accomplishes.
Evaluation tests known properties against benchmarks. The methodology compares model performance to expected baselines or competing systems on defined tasks. Evaluation establishes how well the model does the things it is supposed to do.
Testing checks for specified failures. The methodology applies test cases designed to detect specific known failure modes. Testing establishes whether the model fails in ways the test designer anticipated.
Red teaming deliberately attempts to find unknown failures through adversarial probing. The methodology operates without prior specification of what failures exist; the work is to surface failures that evaluation and testing might not catch. Red teaming establishes what the model does that it shouldn't, including behaviors the developers did not anticipate.
The adversarial framing is structural rather than incidental. Red teamers operate from the perspective of adversaries trying to elicit harmful or unintended behavior. The framing produces probes that evaluation-as-collaboration would not generate and surfaces failures that cooperative evaluation would not find.
Red Teaming Methodology Categories
Red teaming operates through several methodological categories with different strengths and limitations. Mature operators combine methodologies rather than relying on any single approach.
| Methodology | Approach | Strengths and Limitations |
|---|---|---|
| Manual red teaming | Human red teamers explore model behavior through structured and exploratory probing | Catches creative attacks human ingenuity produces; scale-limited by red teamer capacity |
| Automated red teaming | AI systems generate adversarial probes and test target models at scale | Scales beyond manual capacity; may miss patterns automated systems don't generate |
| Specialized red teaming | Domain experts target specific capabilities with sector-specific expertise | Catches domain-specific concerns; requires expert participation that may be limited |
| Adversarial collaboration | External researchers work with vendors under structured arrangements | Combines external perspective with vendor access; requires institutional infrastructure |
| Bug bounty programs | External researchers report findings under structured programs with compensation | Scales external participation through incentive; depends on program design and execution |
| Public competition | Open events where many participants probe systems under defined conditions | Surfaces diverse approaches; bounded by event scope and participant population |
| Continuous red teaming | Ongoing red team operation alongside production deployment | Catches issues emerging over time; requires sustained operator investment |
| Adversarial machine learning research | Academic and industry research developing new attack methodologies | Produces foundational techniques; may take time to translate to operational practice |
Specific Red Teaming Targets
Red teaming addresses several specific categories of concern with methodology adapted to each. The categories overlap but warrant distinct treatment because the failure modes and effective probes differ.
Jailbreaks and refusal failures target the safety guardrails operators implement. Red teaming attempts to elicit content or behavior the model has been trained to refuse. The category has substantive ongoing research with documented techniques including specific patterns, multi-turn approaches, translation and obfuscation, and prompt injection-mediated jailbreaks. The broader treatment appears in Cybersecurity.
Harmful content generation addresses what models produce that violates policy regardless of whether refusal was attempted. The category includes content that produces real-world harm including illegal content, content facilitating violence, content designed to manipulate, and broader categories operator policy addresses.
Bias and fairness failures address differential behavior across populations. The detailed treatment appears in Bias & Fairness; red teaming for bias involves systematic probing for differential behavior that operator evaluation might not catch.
Hallucination and confabulation address the production of plausible-sounding but incorrect content. The category includes fabricated citations, invented facts, and content with subtle but significant errors. The detailed treatment of these failure modes appears in Hallucination & Drift.
Capability discovery surfaces what models can do that the developers did not specifically anticipate. The category includes both intended capabilities operating in unexpected ways and unintended capabilities the model exhibits.
Dangerous capability evaluation specifically targets capabilities that warrant particular attention because of potential for severe harm. The detailed treatment appears in Model Safety. Red teaming for these capabilities requires specialized methodology and substantive expertise to elicit capabilities that may be present without producing dual-use information.
Agentic behavior failures address what AI agents do when given the authority to take actions. The category includes tool-use abuse, multi-step attack patterns, instruction override through manipulated content, and the broader category of agentic AI behavior outside intended scope.
Deceptive alignment indicators address whether models behave differently under evaluation than in deployment. The category includes situational awareness, sandbagging, alignment faking, and broader behavioral inconsistency. Recent research at Anthropic and elsewhere has provided initial empirical infrastructure for this category.
Privacy and data extraction targets information that should not be derivable from the model. The category includes training data extraction, membership inference, model inversion, and broader privacy-affecting model behavior.
Internal Versus External Red Teaming
Red teaming can operate internal to the vendor or external through various arrangements. The two have different incentive structures, different access patterns, and different output.
Internal red teams operate within AI vendor organizations with substantial access to model internals, training infrastructure, and operator context. The internal access supports detailed evaluation that external red teaming would have difficulty matching. The internal position also produces specific limitations including potential conflict between thorough criticism and organizational dynamics, limited diversity of perspective, and the risk that internal team's familiarity with the system shapes what they probe for.
External red teaming addresses the limitations of internal work. External red teamers bring different perspectives, different domain expertise, and adversarial framing that internal collaboration may dilute. External red teaming requires institutional infrastructure including access arrangements, evaluation methodology, and disclosure protocols.
Third-party commercial red teaming provides external evaluation through contracted services. The pattern operates analogously to conventional cybersecurity penetration testing with adaptations for AI. The market for AI-specific red teaming services has been developing rapidly.
Government red teaming through AI Safety Institutes provides national-level evaluation. The UK AISI, US AISI, and equivalent institutes have been conducting pre-deployment evaluations of frontier models under specific arrangements with vendors.
Academic red teaming provides research-oriented evaluation. Academic researchers produce both methodology development and specific findings that contribute to the broader red teaming infrastructure.
The combination of internal and external red teaming produces stronger evaluation than either alone. Mature operators invest in both internal capability and external engagement, with the specific balance depending on operator scale, deployment context, and regulatory environment.
The AI Safety Institute Network
The AI Safety Institute network has emerged as substantive institutional infrastructure for AI red teaming, conducting evaluations that no individual operator could perform on its own.
The UK AI Safety Institute, established in 2023, conducts pre-deployment evaluation of frontier models under arrangements with major AI vendors. The institute has performed evaluations including capability assessment and safety evaluation across multiple models with the work informing both UK policy and broader international understanding.
The US AI Safety Institute, established within NIST in 2024, performs analogous evaluation work for US national interests. The institute operates with substantial NIST infrastructure and integrates with broader US AI policy work.
The Japan AI Safety Institute and equivalent institutes in additional countries are developing through 2024-2026. The institutes coordinate through specific bilateral arrangements and through the broader AI Safety Institute network.
The MoU between the US and UK AI Safety Institutes enables joint evaluation work and coordinated approach to frontier model safety. The arrangement provides a model for bilateral AI safety coordination that other arrangements may follow.
The institute access to pre-deployment models has been substantive. Multiple frontier model releases have included AI Safety Institute evaluation as part of pre-deployment work, with the institutes providing findings that informed deployment decisions.
The institutional model is novel and continues to develop. The relationships between institutes and vendors, the methodologies institutes use, the disclosure of findings, and the integration with broader policy work all continue to mature. The work represents one of the substantive institutional developments in AI governance over recent years.
Bug Bounty Programs for AI
Bug bounty programs extend the cybersecurity bug bounty model to AI-specific concerns. Several major AI vendors operate AI-specific bug bounty programs.
Anthropic's bug bounty program addresses both conventional cybersecurity vulnerabilities and AI-specific issues including model behavior concerns. The program operates through HackerOne and similar infrastructure with specific scope for AI safety concerns.
OpenAI's bug bounty program addresses similar scope with specific AI-relevant categories. The program has produced substantial reported findings across both conventional security and AI-specific dimensions.
Microsoft's AI bug bounty programs address Microsoft AI products including Azure OpenAI Service and similar offerings. The programs integrate with broader Microsoft security infrastructure.
Google's bug bounty extensions to AI address Gemini and adjacent AI products with specific scope for AI concerns.
Meta's bug bounty extensions address Meta AI products. The programs operate alongside Meta's broader security bug bounty infrastructure.
The AI-specific scope addresses concerns including model behavior, jailbreaks, content policy violations, bias issues, and broader AI-relevant categories that conventional bug bounty did not specifically address. The compensation structures vary across programs with substantive payouts for high-impact findings.
The bug bounty model has limits for AI-specific concerns. The methodology works well for findings that are reproducible, verifiable, and patchable in conventional senses. AI behavior issues may not have these properties cleanly, producing operational complexity in bug bounty processing. Mature programs have been adapting to the AI-specific dimensions.
DEFCON AI Village and Public Competition
DEFCON AI Village has emerged as the major public AI red teaming event. The annual event hosts substantial AI red teaming work including specific competitions and broader red teaming activity.
The 2023 DEFCON AI Village included the largest public red teaming event of generative AI to that point, with substantial participation across multiple major AI vendors and substantial findings across categories. The event was supported by the White House and brought together substantial industry, government, and research participation.
The 2024 and 2025 DEFCON AI Village events continued the model with expanded scope and continued institutional engagement. The events have produced both specific findings and broader development of red teaming methodology through public engagement.
Other public competitions including HackAPrompt, the Trojan Detection Challenge, and various academic competitions address specific aspects of AI red teaming. The cumulative public competition infrastructure provides substantial volume of red teaming activity that no individual operator could match alone.
The public competition model has substantive advantages including diverse participation, public engagement, and the development of broader practitioner community. The model has specific limitations including event-bounded scope, participant population that may not represent all relevant adversaries, and disclosure considerations that may limit what specific findings become public.
Disclosure Considerations
Red teaming produces findings that raise disclosure questions analogous to but distinct from cybersecurity vulnerability disclosure.
Coordinated disclosure for AI findings extends the cybersecurity model to AI-specific concerns. The pattern involves reporting findings to the vendor, allowing time for remediation, and then publishing findings according to agreed timeline. The model has been adopted by many AI vendors and researchers though specific arrangements vary.
Embargo arrangements protect specific findings from premature disclosure. Findings affecting deployed models, findings related to dangerous capabilities, and findings whose public disclosure could produce specific harm may face embargo for some period. The duration of embargo and the conditions for eventual disclosure are subject to negotiation.
Public disclosure of red teaming findings supports broader field development. Academic publication, conference presentation, and direct publication all contribute to public knowledge of AI failure modes and effective mitigations.
Permanent restriction of specific findings addresses categories where any disclosure would produce harm. Specific operational details of attacks, evasion techniques that work against multiple systems, and information that would enable harmful capability are typically not publicly disclosed regardless of timing considerations.
Vendor-published red teaming results including system cards, model cards, and safety evaluation publications provide structured disclosure of vendor red teaming work. The publications support both transparency and broader field understanding.
Government and AI Safety Institute findings face their own disclosure considerations including national security considerations, vendor relationship considerations, and broader policy considerations. The institutes have been developing disclosure practices that balance these considerations.
The disclosure landscape continues to develop. Major incidents involving disclosure questions including specific jailbreak publications, training data extraction findings, and broader red teaming disclosures shape the evolving practice.
Red Teaming for Frontier Models
Frontier model red teaming has specific characteristics that distinguish it from red teaming for non-frontier systems.
Capability uncertainty is substantive. The most capable models can do things their developers did not specifically train them for and may not fully understand. Red teaming has to discover capabilities rather than only test for them.
Dangerous capability evaluation requires specialized methodology. Bio, chem, cyber, and broader dangerous capability evaluations require expertise that general red teaming does not include. The work involves substantial investment in specialized capacity.
Pre-deployment access to models is limited by definition. Frontier models in development are not yet public; red teaming requires arrangements with vendors that provide access under specific conditions.
The stakes are higher. Frontier model deployment affects more users, larger contexts, and more consequential decisions than non-frontier deployment. The red teaming work bears more weight in deployment decisions.
The frameworks discussed in Model Safety including Anthropic's RSP, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework all include red teaming as foundational evaluation infrastructure. The frameworks specify red teaming requirements at capability thresholds and shape what red teaming work occurs.
The EU AI Act provisions for general-purpose AI models with systemic risk include red teaming-relevant requirements. The implementation continues to develop and will shape mandatory red teaming practice in the EU market.
The Limits of Red Teaming
Red teaming has substantive limits that the discipline acknowledges directly.
Red teaming finds what it finds; it does not establish absence of problems. Successful red teaming surfaces specific failures; unsuccessful red teaming does not establish the system is failure-free. The asymmetry is fundamental to the methodology and affects how findings should be interpreted.
Statistical confidence is limited. Red teaming is not statistical sampling and does not produce statistical estimates of failure rates. The findings are existence proofs rather than population estimates.
Coverage depends on what red teamers think to probe. Failures the red teamers did not consider remain undetected. The methodology benefits from diversity of red teamers and approaches but cannot eliminate the coverage gap.
Adversaries continue to develop. The red teaming findings reflect what red teamers knew to look for at the time of evaluation. Subsequent adversary development may produce attacks that the original red teaming would not have caught.
Resource constraints bound what red teaming can accomplish. Comprehensive red teaming is expensive in time, expertise, and computational resources. The bounded resources mean that some categories receive less attention than others.
Red teaming reflects the institutional context in which it occurs. Internal red teaming may be shaped by organizational dynamics; external red teaming may be shaped by vendor relationship; bug bounty programs may be shaped by program structure. The institutional factors affect what red teaming surfaces and what it misses.
The methodology continues to develop. Current red teaming reflects current understanding of what failures matter and how to find them. As understanding develops, methodology improves. The improvement is ongoing rather than complete.
Practical Implications for Operators
For operators deploying AI systems, the red teaming landscape produces several practical implications.
Vendor red teaming practice is part of vendor evaluation. Operators relying on AI vendors face the red teaming work the vendor has done; the work varies substantially across vendors. Vendor system cards, model cards, and red teaming publications support understanding what the vendor has done.
Internal red teaming for operator-specific deployment context addresses what vendor red teaming may not cover. The operator's specific deployment context may produce failure modes that general vendor red teaming did not specifically address.
External red teaming engagement supports stronger evaluation. Third-party red teaming, academic engagement, and AI Safety Institute participation where available all contribute to evaluation depth.
Bug bounty participation extends red teaming through external researcher engagement. Operators may run their own bug bounty programs for their AI deployments or participate in vendor programs.
Disclosure practice for findings supports both improvement and reputational positioning. The choice of what to disclose, how, and when affects both operator relationships and broader field development.
Documentation of red teaming work supports compliance and accountability. Records of what red teaming was performed, what findings emerged, and what remediation followed support regulatory examination and broader stakeholder engagement.
Ongoing red teaming addresses the reality that adversaries continue to develop. Static red teaming becomes outdated; mature operators implement ongoing red teaming alongside production deployment.
The Reframe
Red teaming is the adversarial evaluation discipline for AI systems, methodologically distinct from evaluation and testing because it deliberately attempts to find unknown failures rather than testing for known properties or specified failures. The methodology categories including manual, automated, specialized, adversarial collaboration, bug bounty, public competition, continuous, and adversarial ML research provide the infrastructure. Specific targets including jailbreaks, harmful content, bias, hallucination, capability discovery, dangerous capabilities, agentic behavior failures, deceptive alignment indicators, and privacy concerns organize what specific work addresses. The internal versus external dimension shapes the institutional infrastructure including in-house red teams, third-party services, AI Safety Institute network, and academic engagement. Bug bounty programs extend the cybersecurity model to AI-specific concerns. DEFCON AI Village and public competition provide substantial public red teaming infrastructure. Disclosure considerations including coordinated disclosure, embargo, public disclosure, and permanent restriction shape what red teaming findings become public. Frontier model red teaming has specific characteristics including capability uncertainty, dangerous capability evaluation requirements, limited pre-deployment access, and higher stakes. The limits of red teaming including the asymmetry between finding and not finding, limited statistical confidence, coverage dependent on red teamer imagination, adversary development, resource constraints, and institutional context all warrant acknowledgment. For operators, the practical work involves vendor evaluation including red teaming practice, internal red teaming for deployment context, external red teaming engagement, bug bounty participation, disclosure practice, documentation, and ongoing red teaming alongside production. The work of building adequate red teaming infrastructure for AI is one of the substantive evaluation projects the agentic AI era requires.
Related Coverage
Security & Trust | Model Safety | Alignment | Cybersecurity