137AI > Security & Trust > Cybersecurity


AI Cybersecurity


Cybersecurity for AI systems extends conventional information security practice into the AI-specific layers that conventional practice was not designed to address. The page focuses on what is distinctive about AI cybersecurity rather than re-explaining established cybersecurity practice. Conventional cybersecurity remains foundational for AI deployment; the AI-specific work covered here adds to that foundation rather than replacing it.

The discipline addresses model file security, the prompt injection attack class, AI supply chain security, adversarial robustness, jailbreak and bypass attacks, model extraction and IP theft, training pipeline security, inference infrastructure security, AI agent-specific attack patterns, and the integration of AI cybersecurity with broader security architecture. The engineering controls that operationalize cybersecurity practice are covered in the Controls pillar; this page is the trust-posture treatment of cybersecurity as a discipline.


What Makes AI Cybersecurity Distinctive

AI systems present attack surfaces that conventional cybersecurity practice did not specifically anticipate. The distinctive surfaces emerge from properties of AI components that differ from conventional software.

AI behavior is shaped by training data the operator did not author. Conventional software behavior is determined by code; AI behavior is determined by code plus training. The training dimension introduces attack vectors that conventional security analysis does not capture.

AI processes natural language inputs that may contain instructions. The boundary between data and instruction in natural language is not crisp; content the agent processes can affect the agent's behavior in ways that conventional input validation does not address.

AI models are valuable assets in themselves. Training a frontier model requires substantial compute investment. The trained model represents intellectual property that adversaries may target through extraction, distillation, or theft.

AI systems operate in adversarial conditions. Adversaries can interact with AI systems through their normal interfaces, probe for failure modes, and construct inputs specifically designed to elicit unintended behavior. The adversarial conditions are continuous rather than punctuated.

AI supply chain extends through foundation models, training data, fine-tuning data, and AI vendor libraries. The supply chain compromise dimension covers paths that conventional software supply chain attention did not address.

The distinctive surfaces require AI-specific cybersecurity work that complements conventional security practice. The discipline has been developing rapidly through research, operator practice, and emerging frameworks including OWASP Top 10 for LLM Applications, MITRE ATLAS, and equivalent work.


Model File Security

Model files are increasingly valuable assets that require specific security practice. Foundation models represent substantial investment; fine-tuned models represent operator-specific intellectual property; deployment-ready model artifacts may contain proprietary capabilities.

Model storage security applies conventional storage security to model artifacts with attention to model-specific concerns. Encryption at rest, access control on storage, audit logging of access, and integrity verification on read all apply. The cost of compromise is higher for model artifacts than for many other assets because models cannot be revoked once exfiltrated.

Model transit security applies cryptographic protection to models in transit. Signed model artifacts support verification of authenticity downstream. Cryptographic integrity over the transit path prevents tampering. The protection is particularly important for models distributed across organizational boundaries.

Model deployment security addresses the inference infrastructure that serves models to applications. Access control on inference endpoints, rate limiting, authentication of callers, and integrity verification of the served model all apply. The deployment infrastructure is itself an attack surface with specific concerns.

Model provenance and attestation supports verification of where models came from. Signed model artifacts, attestation of training pipelines, and AI bill of materials all contribute to provenance discipline. The detailed treatment of attestation appears in Identity & Cryptographic Attestation.

Model leak and exposure concerns include both direct exfiltration and indirect exposure through inference outputs. Membership inference attacks, training data extraction attacks, and model inversion attacks all involve adversaries extracting information about the model or its training data through legitimate inference interfaces.


Prompt Injection as an Attack Class

Prompt injection is the canonical AI-specific attack class. The category exploits the property that natural language inputs to AI agents may contain instructions the agent will follow.

Direct prompt injection occurs when an adversarial user provides input designed to override the agent's intended behavior. The user input contains instructions that conflict with operator-defined policy; the agent may follow the injected instructions rather than maintaining policy. The category has been substantively studied and produces both research demonstrations and production-impacting events.

Indirect prompt injection is the more consequential variant. Content the agent processes from external sources can contain instructions that affect the agent's behavior. An email the agent processes, a web page the agent fetches, a document the agent reads, or a tool output the agent consumes may all carry injected instructions. The defense surface is broader than direct prompt injection because the adversary need not interact with the agent directly.

The vulnerability is structural to how current AI agents handle natural language. The boundary between data the agent should process and instructions the agent should follow is not crisply enforceable when both arrive through the same natural language interface. Mitigations bound the risk without eliminating it.

Mitigation Category Approach Effectiveness
Input sanitization Filtering content for known prompt injection patterns before passing to agents Catches known patterns; less effective against novel injection patterns
Instruction-data separation discipline Structural separation of instructions and data in agent prompts; clear delimiters and roles Improves baseline; does not fully prevent injection through sophisticated patterns
Privilege scoping Limiting agent authority so that injected instructions cannot accomplish high-stakes actions Bounds consequence rather than preventing injection; foundational for high-stakes deployments
Output validation Checking agent outputs against expected patterns and policies before downstream use Catches some injection effects; cannot catch effects that produce policy-compliant outputs
Approval gates Human approval for consequential actions regardless of agent decision Strong for high-stakes contexts; not scalable to all agent operations
Content provenance discipline Tracking source of content the agent processes and applying differential trust Supports differentiated handling; depends on provenance infrastructure that is uneven

AI Supply Chain Security

The AI supply chain reaches from training data through foundation models through fine-tuning and integration to deployment. Compromise at any point in the chain affects downstream operators.

Foundation model supply chain security addresses the risk that the foundation models operators build on may have been compromised at training time. The compromise could affect specific behaviors or could introduce backdoors that activate on specific triggers. The defense involves both upstream practices (foundation model providers implementing security discipline) and downstream practices (operators verifying provenance and applying their own evaluation).

Training data supply chain addresses the risk of poisoning attacks on training data. The broader treatment appears in Training Data Poisoning. The cybersecurity practice that bounds the risk includes data provenance, anomaly detection on training data, robust training methods that limit the effect of contaminated examples, and the broader infrastructure for verifying data integrity.

Model artifact supply chain addresses how models move from training to deployment. Signed model artifacts, integrity verification on receipt, and audit logging of model movements all support supply chain security.

AI vendor library supply chain addresses the libraries and packages that AI development depends on. Conventional software supply chain attacks (typosquatting, dependency confusion, malicious commits) reach AI infrastructure and may produce AI-specific consequences. The PyPI typosquatting attacks against ML packages and equivalent incidents illustrate the surface.

Model registry security addresses the platforms where models are stored and distributed. Compromise of model registries could distribute compromised models to many downstream operators simultaneously. The infrastructure for model registry security continues to develop.

AI bill of materials practices extend software bill of materials to AI components. The infrastructure supports downstream consumers in evaluating what they are receiving and continues to mature through industry work.


Adversarial Robustness

Adversarial robustness addresses the property that AI models can be made to produce incorrect outputs through carefully crafted inputs that humans would not recognize as adversarial.

Adversarial examples are inputs designed to elicit specific incorrect outputs from a target model. The classic demonstrations in computer vision show how small perturbations to images can cause classifiers to produce wildly incorrect labels. The category extends to natural language, audio, and other modalities.

The threat model varies by application. Adversarial robustness matters substantially for security applications, autonomous systems where adversarial physical-world manipulation could affect operation, and other contexts where adversaries have incentive and access to construct adversarial inputs.

Adversarial training improves model robustness by including adversarial examples in training. The technique has known limits and produces robustness for specific attack patterns without generalizing to all attacks.

Certified robustness approaches provide mathematical guarantees about model behavior within specified input regions. The approaches are limited in scale and applicability but produce stronger guarantees than empirical robustness.

Defensive distillation, randomization, and ensemble methods address adversarial robustness through architectural choices. The methods have been studied extensively with varying results.

The discipline continues to develop alongside the AI capability frontier. The pattern is that defenses are developed for known attacks, new attacks emerge that defeat the defenses, and the cycle continues. Operators implement current best practice while recognizing the ongoing nature of the work.


Jailbreak and Bypass Attacks

Jailbreak attacks specifically target the safety guardrails and behavioral constraints that AI services implement. The category emerged with the deployment of large language models and continues to evolve.

Direct jailbreaks use specific input patterns designed to elicit responses that policy would otherwise refuse. The DAN ("Do Anything Now") pattern and many subsequent variants demonstrate the category. Research has documented substantial collections of effective jailbreak patterns across multiple production AI services.

Multi-turn jailbreaks build up to prohibited behavior over conversation turns rather than in single inputs. The pattern exploits limitations in single-turn safety evaluation by establishing context that eventually licenses behavior the initial input would not.

Translation and obfuscation attacks use language manipulation to evade content classifiers. Translation to less-resourced languages, encoding schemes, and creative framing all produce attack patterns that defeat specific classifiers.

Indirect jailbreaks combine prompt injection with jailbreak patterns. Content the agent processes from external sources may contain jailbreak instructions that affect the agent's subsequent behavior.

The defensive landscape includes multiple layers. Pre-training safety work, RLHF and constitutional AI methods, content classifiers on inputs and outputs, monitoring for jailbreak patterns, and broader safety practice all contribute to bounding jailbreak success. The cumulative effect bounds the risk without eliminating it.

The category is the subject of substantial research and operator practice. The OWASP Top 10 for LLM Applications, MITRE ATLAS jailbreak coverage, and academic work including the Carlini group's analysis at Berkeley provide systematic treatment.


Model Extraction and IP Theft

Model extraction attacks attempt to recover proprietary model capabilities through legitimate inference queries. The category addresses theft of the intellectual property that model training represents.

Distillation attacks use the target model's outputs as training data for a smaller substitute model. The substitute can capture substantial capability of the target without the target's training cost. The attack has been demonstrated against multiple production AI services with varying success.

Query-based reconstruction attempts to recover model parameters or training data through carefully chosen queries. The category includes membership inference (whether specific data was in training set), training data extraction (recovering training examples), and model inversion (reconstructing inputs from outputs).

Side-channel attacks on inference infrastructure address compromise through timing, power, or other side channels rather than through the inference interface itself. The category is less developed for AI but follows established patterns from conventional cryptographic side-channel attacks.

The defenses include rate limiting on inference (bounding the queries an adversary can submit), differential privacy and output perturbation (reducing what can be learned from any specific query), monitoring for extraction patterns, and the broader infrastructure for protecting model assets.

The legal framework for AI model extraction is developing. Computer Fraud and Abuse Act enforcement, breach of terms of service, and trade secret claims all reach extraction in specific contexts. The case law continues to develop.


Training Pipeline Security

Training pipeline security addresses the security of the infrastructure and data flows that produce trained models. Compromise during training affects the resulting model in ways that downstream defenses may not detect.

Training infrastructure security extends conventional security to the compute environments where training occurs. The substantial compute resources required for frontier training represent attractive targets; the infrastructure includes substantial sensitive data and credentials.

Training data pipeline security addresses the data flows from data sources through preprocessing to training. Compromise at any stage may produce model effects that survive into deployment. The detailed treatment of data poisoning specifically appears in Training Data Poisoning.

Training process monitoring detects anomalies in training that may indicate compromise. Loss anomalies, gradient anomalies, and unexpected training behavior can all surface issues. The infrastructure for training monitoring continues to mature.

Reproducibility and provenance support verification that the trained model matches the training that produced it. Recorded training configurations, signed training artifacts, and the broader infrastructure for training provenance contribute to post-hoc verification.

Fine-tuning security addresses the additional training that operators apply to foundation models. The fine-tuning may use sensitive data that requires specific protection; the resulting fine-tuned model may itself contain information that warrants protection.


Inference Infrastructure Security

Inference infrastructure security addresses the deployment-time security of AI systems. The category includes both AI-specific concerns and conventional cybersecurity applied to AI infrastructure.

Vector database security addresses the embedding stores used in retrieval-augmented generation and similar architectures. Vector databases may contain sensitive content as embeddings, support similarity queries that could be exploited for extraction, and operate as critical infrastructure for AI applications. The security practice extends conventional database security with vector-specific considerations.

RAG-specific security addresses retrieval-augmented generation systems. Document store security, retrieval scoping, and the prompt injection risks discussed earlier all apply with RAG-specific dimensions.

Embedding security addresses the embeddings themselves as potential security concerns. Embeddings can leak information about the content they represent; embedding manipulation can affect downstream inference; embedding stores require their own access control discipline.

Prompt and response logging concerns address the cybersecurity dimension of the logging that AI services typically implement. The logs may contain sensitive content, may be subject to subpoena and litigation hold, and may require their own security and retention discipline.

API security for AI services extends conventional API security with AI-specific concerns including rate limiting that addresses extraction risk, authentication that supports usage attribution, and audit logging that captures AI-relevant operations.


AI Agent-Specific Attack Patterns

AI agents that take actions on external systems face attack patterns specific to their agency.

Agent hijacking attempts to redirect the agent's actions toward adversary purposes. The pattern combines prompt injection with the agent's authority to act, producing effects that exceed what either alone could accomplish.

Instruction override targets the agent's instruction-handling specifically. The pattern attempts to replace the operator's intended instructions with adversary instructions.

Tool-use abuse exploits the tool authority the agent has been granted. An agent with email-sending authority can be made to send adversary content; an agent with transaction authority can be made to execute adversary transactions; an agent with code-modification authority can be made to introduce adversary code.

Multi-step attack patterns target workflow agents specifically by manipulating sequences of operations rather than individual actions. The patterns can construct combinations that no individual operation would have produced.

Cross-agent attacks target multi-agent systems by compromising one agent to affect others. The orchestrator layer is a particularly consequential target because compromise reaches the agents the orchestrator controls.

The defensive practice combines the controls covered in the Controls pillar including identity attestation, behavioral envelopes, access control, monitoring, and human oversight. The integration with conventional cybersecurity practice including least privilege, defense in depth, and incident response provides the foundation.


Frameworks and References

Several specific frameworks address AI cybersecurity systematically.

OWASP Top 10 for LLM Applications provides the OWASP framework's treatment of LLM-specific risks. The current version covers prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. The framework is widely referenced and continues to develop.

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a framework analogous to MITRE ATT&CK for adversarial AI threats. The framework catalogs adversarial tactics and techniques with substantial documentation supporting threat modeling and defense.

NIST AI Risk Management Framework addresses AI risk including substantial cybersecurity dimensions. The framework integrates AI-specific concerns with broader risk management practice.

NIST SP 800-218A (Secure Software Development Practices for Generative AI) extends the NIST Secure Software Development Framework to generative AI specifically. The publication provides operational guidance.

CISA work on AI security including the CISA AI Roadmap and ongoing publications addresses AI cybersecurity from the critical infrastructure protection perspective.

UK National Cyber Security Centre AI security guidance provides UK national perspective on AI cybersecurity including specific guidance documents and ongoing work.

ENISA work on AI cybersecurity provides EU-level guidance including threat landscapes for AI and specific publications on AI security.

Industry-led work through the Frontier Model Forum, AI Safety Institute network, and equivalent bodies addresses cybersecurity dimensions of frontier AI development. The work operates partly through industry coordination and partly through public publication.


Integration with Conventional Cybersecurity

AI cybersecurity does not replace conventional cybersecurity; it adds to it. Operators integrate AI-specific work with broader security architecture rather than treating it as a separate discipline.

The conventional cybersecurity foundation including identity management, access control, network segmentation, encryption, monitoring, incident response, and the broader security stack remains essential for AI deployment. AI components run on conventional infrastructure with conventional security requirements.

AI-specific work extends this foundation rather than replacing it. The AI cybersecurity practice operates alongside conventional security practice with integration points at multiple layers.

Security operations centers extend to AI-specific monitoring including the patterns covered in Monitoring & Anomaly Detection. SOC analysts trained on conventional threats need AI-specific training to handle AI-specific patterns.

Incident response procedures extend to AI-specific incidents. The detection, classification, response, and resolution patterns for conventional incidents apply with AI-specific extensions for AI failure modes.

Vulnerability disclosure programs extend to AI-specific vulnerabilities. Coordinated disclosure for AI vulnerabilities continues to develop as a discipline, with major AI vendors implementing AI-specific disclosure programs alongside conventional vulnerability disclosure.

Threat intelligence integration brings AI-specific threat information into broader threat intelligence operations. The information sharing infrastructure discussed elsewhere on the site supports this integration.


Practical Implications for Operators

For operators deploying AI agents, the cybersecurity landscape produces several practical implications.

Comprehensive cybersecurity discipline including both conventional and AI-specific dimensions is operational baseline. Operators that implement conventional security but not AI-specific work face the AI-specific surfaces unprotected; operators that implement AI-specific work but not conventional security have a fragile foundation.

Continuous evolution addresses the rapidly developing threat landscape. Static security programs become outdated quickly in AI cybersecurity. Mature operators implement ongoing capability development, threat intelligence engagement, and adaptive practice.

Vendor security assessment extends to AI vendor practice. Operators relying on AI vendors must understand and assess the vendor's security practice, with attention to the specific AI cybersecurity dimensions.

Red teaming and adversarial evaluation specific to AI complements conventional security testing. The detailed treatment appears in Red Teaming as the dedicated discipline.

Documentation and audit support compliance and incident response. AI cybersecurity documentation including threat models, security control descriptions, and incident response procedures supports both compliance and operational response.

Industry engagement through information sharing, vulnerability disclosure, and broader cybersecurity community participation provides both information access and operational relationships.


The Reframe

AI cybersecurity addresses the security practices specific to AI systems and the AI extensions of conventional cybersecurity. The distinctive surfaces include model file security, the prompt injection attack class, AI supply chain security, adversarial robustness, jailbreak and bypass attacks, model extraction and IP theft, training pipeline security, inference infrastructure security, and AI agent-specific attack patterns. The discipline integrates with conventional cybersecurity rather than replacing it, with the AI-specific work adding to the established foundation. Frameworks including OWASP Top 10 for LLM Applications, MITRE ATLAS, NIST AI RMF, and equivalent bodies provide systematic treatment of the discipline. The threat landscape continues to evolve rapidly with research demonstrations, production incidents, and adversary capability all developing alongside defensive practice. For operators, the practical work involves implementing comprehensive cybersecurity discipline that addresses both conventional and AI-specific dimensions, with continuous evolution to match the developing threat landscape. The work is one of the substantive engineering and operational projects the agentic AI era requires, and the integration with broader security architecture determines whether AI deployment can operate at scale without compounding cybersecurity exposure.


Related Coverage

Security & Trust | Cyber-Physical Compromise | Training Data Poisoning | Controls