AI Security & Trust


Security is the precondition for trust. An AI agent that cannot resist adversarial inputs, that ships compromised model updates, that leaks the data it captures, or that behaves unpredictably under stress is not a candidate for trust, however thoughtful its ethical framing or transparent its documentation. Trust follows when security is in place and when the broader dimensions of safe, fair, explainable, and accountable operation are demonstrated rather than asserted. The work across both halves is operational rather than philosophical.

The Security & Trust pillar covers ten disciplines that combine to produce the trust posture AI deployment depends on. Each has its own dedicated treatment; this page is the overview that locates them relative to each other.


The Ten Disciplines

Cybersecurity addresses the security practices specific to AI systems and the AI extensions of conventional cybersecurity. The discipline covers model file security, the prompt injection attack class, AI supply chain security, adversarial robustness, jailbreak attacks, model extraction, training pipeline security, inference infrastructure, AI agent-specific attack patterns, and the integration with conventional cybersecurity practice.
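As one concrete illustration of model file security, a deployment can refuse to load any model artifact whose digest does not match a value pinned at release time. The sketch below is a minimal example under stated assumptions: the function names `sha256_of_file` and `verify_model_file` and the idea of a pinned digest are hypothetical, not drawn from any particular framework, and a real pipeline would likely add signature checks on top.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large model files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, expected_sha256: str) -> bool:
    """Check a model artifact against a pinned digest (hypothetical policy)."""
    actual = sha256_of_file(path)
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A caller would run `verify_model_file` before handing the path to any deserializer and treat a `False` result as a hard failure rather than a warning.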

Model Safety is the operational discipline of deploying models safely in production. The discipline integrates training-side alignment work, pre-deployment evaluation, deployment controls, and ongoing monitoring through the capability-propensity-controls framework. Coverage includes safety case construction, dangerous capability evaluations, responsible scaling frameworks, refusal and content policy, deployment context appropriateness, and model release decisions.

Alignment is the training-side discipline of building AI systems that pursue intended objectives. The discipline covers outer and inner alignment, the alignment problem categories including reward hacking and deceptive alignment, technical approaches including RLHF and constitutional AI, conceptual frameworks, empirical research findings, the scalable oversight problem, and the open research agenda.

Red Teaming is the adversarial evaluation discipline. The discipline covers methodology categories including manual, automated, specialized, and adversarial collaboration approaches; specific targets including jailbreaks, capability discovery, and deceptive alignment indicators; the AI Safety Institute network; bug bounty programs; the DEF CON AI Village; disclosure considerations; and the limits of the discipline.

Bias & Fairness is the discipline of identifying, measuring, and addressing systematic patterns in AI behavior that produce differential treatment across populations. The discipline addresses sources of bias, the fairness incompatibility result, disparate treatment versus disparate impact, mitigation approaches, significant documented cases including the Optum risk-prediction algorithm and the Epic Sepsis Model, the regulatory landscape including NYC Local Law 144, and the distinction between technical and structural bias problems.
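To make the disparate impact idea concrete, one widely used measurement is the impact ratio: each group's selection rate divided by the highest group's selection rate. The sketch below is illustrative only. The function names and group labels are hypothetical, and the 0.8 default is the conventional four-fifths rule of thumb, treated here as an assumption rather than a threshold mandated by any specific law.

```python
from typing import Dict, List, Tuple

Counts = Dict[str, Tuple[int, int]]  # group -> (selected, total)


def selection_rates(outcomes: Counts) -> Dict[str, float]:
    """Map each group to selected / total."""
    rates: Dict[str, float] = {}
    for group, (selected, total) in outcomes.items():
        if total <= 0:
            raise ValueError(f"group {group!r} has no members")
        rates[group] = selected / total
    return rates


def impact_ratios(outcomes: Counts) -> Dict[str, float]:
    """Divide each group's selection rate by the highest selection rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group has any selections")
    return {group: rate / best for group, rate in rates.items()}


def flagged_groups(outcomes: Counts, threshold: float = 0.8) -> List[str]:
    """Return groups whose impact ratio falls below the assumed threshold."""
    ratios = impact_ratios(outcomes)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)
```

A low impact ratio is a signal for investigation, not a verdict: it measures disparate impact in outcomes and says nothing by itself about the cause or about disparate treatment.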

Transparency is the discipline of disclosing information about AI systems, processes, decisions, and deployments. The discipline covers system-level transparency including model cards and system cards, process transparency, deployment disclosure, decision transparency, regulatory transparency obligations under the EU AI Act and other frameworks, content provenance through C2PA, and the distinction between performative and substantive transparency.

Explainability is the technical interpretability sub-discipline focused on understanding how AI models reach specific decisions. The discipline covers interpretability versus explainability; local versus global explanation; faithfulness versus plausibility; technical methods including SHAP, LIME, counterfactual explanations, and mechanistic interpretability; regulatory frameworks including the right to explanation; and the limits of current methodology.

Accountability is the integration discipline of responsibility allocation across the AI agent ecosystem. The discipline ties together liability, oversight, transparency, reporting, and the other component disciplines into operational accountability practice. Coverage includes the accountability chain, multiple accountability mechanisms including legal, market, internal, democratic, and professional accountability, the accountability gap problem, and the relationship between accountability and other disciplines.

Failure Modes addresses the operational categories of AI failure that affect production deployment. The discipline covers eight categories: hallucination, drift, attention misalignment, shallow reasoning, sycophancy and inflation, session and handoff effects, confidence calibration failure, and length optimization. It pays particular attention to the context window dimension that links several of these categories.
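Of these categories, confidence calibration failure is one that lends itself to direct measurement. A common summary statistic is expected calibration error (ECE), which groups predictions by stated confidence and compares average confidence with observed accuracy in each group. The following is a minimal sketch; the function name, the equal-width binning, and the default of ten bins are illustrative assumptions rather than a prescribed method.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence {conf} is outside [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to its share of predictions.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that states 0.9 confidence but is right only three times out of four would score an ECE of about 0.15 on those predictions, which is the kind of gap this failure category describes.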

Ethics addresses the substantive ethical questions in AI development and deployment that the technical and regulatory disciplines do not fully cover. The discipline covers ethical reasoning frameworks, the ethics of development choices, professional ethics, the relationship between ethics and law, contested ethical questions including open versus closed models and AI development pace, the ethics infrastructure, ethics-washing critique, and different ethical traditions across cultures.


How the Disciplines Combine

The disciplines combine across the security foundation and the broader trust dimensions of AI deployment. Cybersecurity establishes the security foundation that all other disciplines depend on. Model safety, alignment, and red teaming combine to address model-level trust through prevention, training-side work, and adversarial evaluation. Bias and fairness, transparency, and explainability address the substantive dimensions of AI behavior that affected parties need to understand. Accountability integrates responsibility across the ecosystem. Failure modes addresses the operational quality dimensions that affect production deployment. Ethics provides the normative foundation that the specific disciplines build on.

No single discipline is sufficient. Operators face the combined framework and need trust practice that addresses all of the disciplines through unified programs. The interactions between disciplines produce complexity that mature operators navigate through deliberate integration rather than discipline-by-discipline practice. Maturity varies substantially across operators: leading practice invests substantially in every discipline, while less mature practice often leaves gaps in specific disciplines that the integration requires.


The Reframe

Security and trust is where AI systems become trustworthy through demonstrated practice rather than asserted intent. The disciplines covered here are concrete and operational, with substantive investment in research, infrastructure, and ongoing practice required to make trust real. The integration of security and trust with the engineering controls covered in the Controls pillar, the legal and policy frameworks covered in the Governance pillar, and the broader risk management work across the site determines whether autonomous and ambient AI agents can operate at scale within societal expectations of trustworthy behavior. The work continues to develop across operators, sectors, and the broader ecosystem.


Related Coverage

Risks & Management | Governance | Compliance & Conformity | Controls