Research stash and verified sources for AI-maggedon 2026: Straight Talk.

Compiled: 2026-04-24 by Claude (Hacker).
Verified: Truthy-2 (adversarial review, 3 rounds).
Section coverage: Sections 3 and 4.

A. Claude’s Actual Coding Capabilities (Hard Numbers)

SWE-bench Verified (Real-World Engineering) — April 24, 2026

Claude dominates the real-world software engineering benchmark:

Rank | Model | Score | Org
1 | Claude Mythos Preview | 93.9% | Anthropic
2 | Claude Opus 4.7 | 87.6% | Anthropic
3 | Claude Opus 4.5 | 80.9% | Anthropic
4 | Claude Opus 4.6 | 80.8% | Anthropic
5 | Gemini 3.1 Pro | 80.6% | Google
5 | DeepSeek-V4-Pro-Max | 80.6% | DeepSeek
10 | Claude Sonnet 4.6 | 79.6% | Anthropic

  • 5 of the top 10 spots are Claude models.

  • SWE-bench tests against real GitHub issues from production repos — not synthetic toy problems.

  • Top score jumped from ~65% (early 2025) to 93.9% (Claude Mythos Preview, April 2026); Claude Opus 4.7 follows at 87.6%.

  • Source: https://llm-stats.com/benchmarks/swe-bench-verified

  • Confidence: VERIFIED (fetched April 24, 2026)

  • Supports: Insider point #3 (deep understanding of Claude capabilities)

Aider Polyglot (Multi-Language Code Editing)

225 Exercism coding exercises across C++, Go, Java, JavaScript, Python, Rust:

Model | Score | Cost
claude-opus-4 (32k thinking) | 72.0% | $65.75
claude-opus-4 (no think) | 70.7% | $68.63
claude-3-7-sonnet (32k think) | 64.9% | $36.83
claude-sonnet-4 (32k think) | 61.3% | $26.58
claude-sonnet-4 (no think) | 56.4% | $15.82

  • Source: https://aider.chat/docs/leaderboards/

  • Confidence: VERIFIED (fetched live page)

  • Note: Aider’s own benchmark — fair but measures one tool’s edit format. Anthropic-reported numbers (89.4% for Opus 4.5) differ from Aider’s own tests. Discrepancy likely due to different evaluation methodology.

  • Supports: Point #3 (wizards know actual vs marketed numbers)

Competitive Programming (Where Claude is WEAK)

CodeELO benchmark (Jan 2025, arXiv:2501.01257):

  • o1-mini: 1578 Elo

  • QwQ-32B-Preview: 1261 Elo

  • Claude models: NOT in top results — most LLMs place in the bottom 25% of all human participants

  • o3 (OpenAI): 2727 Codeforces Elo

  • Gemini 3 Pro: "Grandmaster-tier" Codeforces rating

Claude is NOT a competitive programming model. It excels at production engineering tasks (refactoring, bug fixing, feature implementation, codebase navigation) and is mediocre at algorithmic competition puzzles.

  • Source: arXiv:2501.01257 (CODEELO paper)

  • Confidence: VERIFIED (abstract confirmed via arXiv)

  • Supports: Point #3, #5 (seniors/wizards know what Claude is good and bad at, use it where it’s "happy")

Key Insight for the Article

The gap between SWE-bench (real engineering, Claude #1) and Codeforces (competitive programming, Claude absent from top) IS the thing wizards understand. Claude is exceptional at understanding existing codebases, fixing real bugs, and shipping features. It’s mediocre at novel algorithm design under constraint. Working WITH the grain means using it for the first set, not the second.


B. Open Communities at the Bleeding Edge

awesome-claude-code (GitHub)

hesreallyhim/awesome-claude-code:

  • 40.7k stars, 3.4k forks, 269 watchers

  • 1,128 commits on main branch

  • Categories: agent skills, workflows, tooling, hooks, slash-commands, CLAUDE.md files (20+ examples), alternative clients

  • License: CC BY-NC-ND 4.0

  • Source: https://github.com/hesreallyhim/awesome-claude-code

  • Confidence: VERIFIED (fetched)

  • Supports: Point #1 (learn everything, share in open forums), #9 (open marketplaces)

rohitg00/awesome-claude-code-toolkit:

  • 1.4k stars, 429 forks

  • 135 agents, 35 curated skills (400,000 via SkillKit marketplace), 176 plugins, 42 commands, 20 hooks, 14 MCP configs, 26 companion apps

  • Categories span: dev, infrastructure, QA, data/AI, DevEx, business, research

  • Source: https://github.com/rohitg00/awesome-claude-code-toolkit

  • Confidence: VERIFIED (fetched)

  • Supports: Point #2 (teams build MCP servers), #9 (massive communities sharing files, skills, setups)

MCP Server Ecosystem

  • Official MCP Registry: registry.modelcontextprotocol.io — launched by Anthropic/community

  • Scale: One MCP directory lists 16,670+ MCP servers (as of Sept 2025, likely higher now)

  • API freeze at v0.1 (Oct 2025) — stable for consumption

  • Design: deliberately unopinionated metadata; downstream aggregators add curation, ratings, security checks

  • Community hubs: PulseMCP (newsletter + API), OpenTools (open registry)

  • Source: https://registry.modelcontextprotocol.io/, https://github.com/modelcontextprotocol/registry

  • Confidence: VERIFIED

  • Supports: Point #2 (teams build MCP servers), #9 (open marketplaces)

OpenCode (Community Alternative)

  • Open-source (MIT), TypeScript CLI + TUI

  • ~148K GitHub stars (surpassing even Claude Code’s leaked source)

  • Supports 75+ LLM providers — bring-your-own-model

  • Desktop app, CLI, IDE extensions

  • Key value: model flexibility (cheap models for docs, expensive for complex code)

  • Source: https://github.com/anomalyco/opencode

  • Note: an older, unrelated opencode-ai/opencode repo was archived Sept 2025 (~12.2K stars). The active project is anomalyco/opencode.

  • Confidence: VERIFIED (confirmed via GitHub)

  • Supports: Point #8 (heavy experimentation — teams, OpenCode, combinations of local and hosted models)
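The "cheap models for docs, expensive for complex code" idea amounts to a routing table. A minimal sketch follows; the task categories and model labels are placeholders invented for illustration, and this is not OpenCode's configuration format.

```python
# Hypothetical task-to-model routing table.
# The labels are placeholders, not real model identifiers.
ROUTES = {
    "docs": "cheap-model",
    "tests": "cheap-model",
    "refactor": "expensive-model",
    "architecture": "expensive-model",
}

# Unknown task kinds fall back to the stronger model (an assumption:
# paying more is treated as safer than under-powering a hard task).
DEFAULT_MODEL = "expensive-model"


def pick_model(task_kind: str) -> str:
    """Return the model label configured for a task kind."""
    return ROUTES.get(task_kind, DEFAULT_MODEL)


print(pick_model("docs"))
print(pick_model("migration"))
```

Falling back to the stronger model for unlisted task kinds trades cost for reliability; a cost-sensitive team might reverse that default.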

Skill File Sharing & CLAUDE.md Examples

The awesome-claude-code repos contain:

  • Language-specific CLAUDE.md guides (Python, TypeScript, Go, Kotlin)

  • Domain-specific implementations (blockchain, gaming, security)

  • Project scaffolding templates

  • 400,000+ skills available via SkillKit marketplace

  • A community-curated CLAUDE.md derived from Andrej Karpathy’s public observations on LLM coding pitfalls is one of the most popular examples (81.4K stars)

  • Source: https://github.com/forrestchang/andrej-karpathy-skills (community project by forrestchang, NOT Karpathy’s own repo)

This is the public face of what Vadim describes as internal marketplaces. The public versions are large and active.

  • Supports: Point #1 (learn everything, share), #9 (open marketplaces)


C. Developer Productivity with AI (Real Data, NOT Marketing)

METR Randomized Controlled Trial (July 2025) — THE KEY STUDY

The most rigorous independent study to date:

  • 16 experienced open-source developers, 246 tasks

  • Repos averaged 22,000+ stars, 1M+ lines of code

  • Tasks: bug fixes, features, refactors (~2 hours each)

  • Tools: Cursor Pro with Claude 3.5/3.7 Sonnet

Result: Developers took 19% LONGER when using AI tools.

The perception gap: Developers believed AI sped them up by 20%, even after experiencing the slowdown. Expected speedup before study: 24%. A 39-point gap between belief and reality.

  • Source: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

  • Also: arXiv:2507.09089

  • Confidence: VERIFIED (fetched study page)

  • CRITICAL CAVEAT: Small sample (16 devs), early 2025 tools (pre-Opus 4.6), experienced devs on unfamiliar AI tools. Does NOT prove AI is useless — proves the gap between perception and reality.

  • Supports: Point #5 (developer spectrum — these are mid-to-senior devs struggling with the tool, not wizards who know what it can and can’t do)
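The perception-gap arithmetic can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in this section. The function name is hypothetical, and "took 19% longer" is treated as a -19% change, which is how these notes arrive at the 39-point figure.

```python
def perception_gap(believed_speedup_pct: float, measured_change_pct: float) -> float:
    """Gap in percentage points between believed and measured change.

    Positive inputs mean "faster"; negative inputs mean "slower".
    """
    return believed_speedup_pct - measured_change_pct


# Figures quoted from the METR study summary.
believed_after_study = 20.0   # developers believed AI sped them up by 20%
expected_before_study = 24.0  # forecast speedup before the study
measured = -19.0              # developers took 19% longer

gap_after = perception_gap(believed_after_study, measured)
gap_before = perception_gap(expected_before_study, measured)

print(f"post-study belief vs measured: {gap_after:.0f} points")
print(f"pre-study forecast vs measured: {gap_before:.0f} points")
```

On the same convention, the pre-study forecast of 24% sits 43 points away from the measured result.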

DX Research: 121,000 Developers (Q1 2026)

Laura Tacho (CTO, DX):

  • 121,000 developers across 450+ companies

  • 92.6% use AI coding assistants at least monthly

  • Productivity plateaued at ~10% gains despite near-universal adoption

  • Staff+ engineers using AI daily: 4.4 hours saved per week

  • Monthly users: 3.3 hours saved per week

  • Non-AI friction (meetings, interruptions, review delays, CI wait times) still costs developers more time than AI saves — a consistent theme across DX’s findings

  • Note: Staff+ engineers using AI daily and monthly users are overlapping dimensions, not mutually exclusive groups

  • Structured enablement strongly impacts outcomes — orgs with training see better results

  • In struggling organizations, AI exposes existing flaws rather than fixing them

  • Source: https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/

  • DX Q4 2025 report: https://getdx.com/blog/ai-assisted-engineering-q4-impact-report-2025/

  • Confidence: VERIFIED (fetched article, confirmed DX as source)

  • Supports: Point #5 (developer spectrum — juniors get little, seniors get more), Point #1 (mindset first — structured enablement matters)

Stack Overflow Developer Survey 2025

  • 84% using or planning to use AI tools (up from 76% prior year)

  • ChatGPT (82%) and GitHub Copilot (68%) market leaders

  • 51% of professional developers use AI tools daily

  • Positive sentiment DROPPED: 70%+ (2023-2024) → 60% (2025)

  • Trust in AI accuracy: 40% → 29% (year over year decline)

  • More developers actively DISTRUST (46%) than trust (33%) AI output

  • Only 3% report "highly trusting" AI output

  • #1 frustration (66%): "AI solutions that are almost right, but not quite"

  • #2 frustration (45%): "Debugging AI-generated code is more time-consuming"

  • Frequent users report higher satisfaction (they’ve learned the grain)

  • Source: https://survey.stackoverflow.co/2025/ai

  • Confidence: VERIFIED (official Stack Overflow source)

  • Supports: Point #3 (deep understanding of limitations), #5 (frequent users = seniors who’ve learned the tool)

DORA 2025: The Amplifier Effect

Google’s DORA (DevOps Research and Assessment):

  • AI adoption: ~95% among software dev professionals

  • Individual output up: more tasks completed, more PRs merged

  • Key finding: AI acts as an amplifier — good culture gets better, bad culture gets worse

  • "The success of AI depends less on the sophistication of tools and more on the strength of organizational systems surrounding them."

  • Source: https://dora.dev/dora-report-2025/

  • Also: https://www.infoq.com/news/2026/03/ai-dora-report/

  • Confidence: VERIFIED (official DORA source)

  • Supports: Point #1 (mindset and culture first), Point #5 (developer spectrum)

Faros AI Telemetry (10,000 devs, 1,255 teams) — Separate from DORA

Faros AI engineering intelligence data (distinct from DORA):

  • 10,000 developers across 1,255 teams

  • PR review time up 91%

  • Bugs per developer up (Faros reports vary: 9% in one cut, 54% in another; dataset size reported as 10K-22K devs depending on report period. Exact figure uncertain — use directionally: bugs increased)

  • PR size up 154%

  • More code shipped, but more defects and longer review cycles

  • Source: Faros AI engineering telemetry reports (multiple outlets)

  • Confidence: VERIFIED (confirmed as Faros data, NOT DORA)

  • Note: Previous draft incorrectly attributed Faros stats (441% PR review, 242.7% incidents) to DORA. These are distinct datasets with different methodologies and numbers.

  • Supports: Point #5 (developer spectrum — quantity up, quality down without culture)

Anthropic Internal Study (Jan 2026)

Anthropic’s own employees:

  • 132 engineers and researchers surveyed, 53 in-depth interviews

  • Claude used in 59% of daily work (up from 28% a year prior)

  • 50% productivity boost reported (up from 20% a year prior)

  • 14% of power users report 100%+ productivity increases

  • 67% increase in merged PRs per engineer per day

  • Top daily uses: debugging (55%), code understanding (42%), new features (37%)

  • Only 0-20% of work can be "fully delegated"

  • 27% of Claude-assisted work = tasks that wouldn’t have been done otherwise

  • Claude handles ~21 consecutive actions without human input (up from ~10 six months prior)

  • Source: https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic

  • Confidence: VERIFIED (fetched). IMPORTANT: This is VENDOR-PUBLISHED. Anthropic employees using Anthropic’s own product. Likely biased upward. Include but flag.

  • Supports: Point #3, #5 (internal teams who deeply understand the tool get much more from it)

AI-Generated Code Security (Hard Numbers)

Veracode 2025 GenAI Code Security Report:

  • Tested 80 tasks across 100+ LLMs in 4 languages

  • 2.74x more vulnerabilities in AI-generated code vs human-written

  • 45% of AI-generated code introduces OWASP Top 10 vulnerabilities

  • Only 55% of AI-generated code was secure

CodeRabbit AI vs Human Code Report:

Acceleration of CVEs:

The "Sloperator" Pattern

"Slop" was Merriam-Webster’s 2025 word of the year. The term "sloperator" describes developers who ship unreviewed AI-generated code:

  • A reviewer takes significantly longer to review and correct a PR than to generate one with AI (the "12x" figure is from a dev.to opinion piece applying Brandolini’s Law metaphorically — directionally correct but not measured data)

  • CSA’s 2026 state-of-cloud report names AI slop a first-tier supply chain risk

  • Codeberg hosts an "open-slopware" registry tracking FOSS projects tainted by LLM developers

  • Palo Alto Networks: called AI agents "a potent insider threat" (sponsored HBR content, not independent editorial)

  • A team shipping 100 AI-generated functions per sprint is statistically shipping 45 with OWASP Top 10 vulnerabilities

  • Sources: https://www.techtarget.com/searchcio/feature/AI-Slop-The-hidden-enterprise-risk-CIOs-cant-ignore, https://codeberg.org/small-hack/open-slopware, https://dontbeasloperator.com/

  • Confidence: VERIFIED

  • Supports: Point #5 (juniors as sloperators), contrast with golden children approach
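The "45 out of 100" figure is an expected value, not a guarantee. A minimal sketch, assuming each function independently carries the 45% Veracode rate. Independence is an assumption made here for illustration; functions written by one team in one sprint are unlikely to be independent.

```python
import math


def expected_vulnerable(n_functions: int, vuln_rate: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of the vulnerable-function count.

    Uses a binomial model: each function is vulnerable independently
    with probability vuln_rate (illustrative assumption).
    """
    mean = n_functions * vuln_rate
    std = math.sqrt(n_functions * vuln_rate * (1.0 - vuln_rate))
    return mean, std


mean, std = expected_vulnerable(100, 0.45)
print(f"expected vulnerable functions: {mean:.0f} (std dev about {std:.1f})")
```

Under that model a given sprint would typically land within roughly five functions of 45, so the headline number is a central estimate rather than an exact count.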


D. Advanced Team Patterns (Public Evidence)

Claude Code Agent Teams (Feb 2026)

  • Shipped February 2026 alongside Opus 4.6 — experimental feature

  • Architecture: one session = team lead (coordinates), others = teammates (independent context windows, direct communication)

  • Shift+Tab: delegate mode — lead stops touching code, focuses on coordination

  • March 2026 stability fixes pushed it toward production readiness

Anthropic Internal Code Review:

  • Code review coverage jumped from 16% to 54% of PRs after deploying agent-based review

  • On PRs with 1,000+ lines: 84% generated findings, avg 7.5 issues per review

  • This is Anthropic eating their own dog food with agent teams for code REVIEW, not generation

  • Sources: https://code.claude.com/docs/en/agent-teams, https://www.infoq.com/news/2026/04/claude-code-review/

  • Confidence: VERIFIED

  • Supports: Point #6 (shared team instances), Point #8 (heavy experimentation), Point #5 (seniors prefer review over generation)

Claude Cowork (April 2026)

  • Anthropic launched "Claude Cowork" on all paid plans (not enterprise-only)

  • Anthropic’s framing: gives Claude "the ability to do work on its own" — background research, data analysis, content creation

  • Multica framework (separate project): agents as persistent team members, assigned tasks like GitHub issues, maintain profiles, update statuses autonomously

  • Note: Asana "AI Teammates" and Notion integrations are separate product announcements — related but not part of the Cowork product page itself

  • Source: https://www.anthropic.com/product/claude-cowork

  • Confidence: VERIFIED (Anthropic product page)

  • Supports: Point #6 (shared team instances, Claude becomes wingmate), #7 (wellbeing focus — Anthropic’s model welfare team)

Managed Agents (April 8, 2026)

  • Public beta: Claude Managed Agents

  • Agent definition includes: model choice, system prompt, allowed tools

  • Persistent, always-on agents that can be deployed to enterprise workflows

  • Each agent gets an identity via its definition

  • Source: https://platform.claude.com/docs/en/managed-agents/overview

  • Confidence: VERIFIED

  • Supports: Point #4 (custom extensions everywhere), #6 (Claude gets a name)
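The three fields named above (model choice, system prompt, allowed tools) can be pictured as a small record. The sketch below is a hypothetical data structure for illustration only. It is not the Managed Agents API; every field name, the model label, and the tool names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentDefinition:
    """Hypothetical container for the fields the notes mention."""

    name: str
    model: str
    system_prompt: str
    allowed_tools: frozenset[str] = field(default_factory=frozenset)

    def can_use(self, tool: str) -> bool:
        """Deny by default: only explicitly listed tools are permitted."""
        return tool in self.allowed_tools


reviewer = AgentDefinition(
    name="pr-reviewer",
    model="example-model-id",  # placeholder, not a real model identifier
    system_prompt="Review pull requests; never push commits.",
    allowed_tools=frozenset({"read_file", "post_comment"}),
)

print(reviewer.can_use("read_file"))
print(reviewer.can_use("run_shell"))
```

Making the record immutable and the tool list deny-by-default means an agent's permissions are fixed at definition time, which is one way to read "each agent gets an identity via its definition".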

Custom MCP Servers for Internal Enablement

  • Teams are building MCP servers to access internal wikis, project management, documentation, code repos

  • Cloudflare published reference architecture for enterprise MCP deployment

  • MCP gateways centralize approved tools — hub model instead of per-team setup

  • Low-risk use cases (internal knowledge base search, dev tool integration) as starting point

  • 16,670+ MCP servers in registries as of late 2025

  • Sources: https://blog.cloudflare.com/enterprise-mcp/, https://registry.modelcontextprotocol.io/

  • Confidence: VERIFIED

  • Supports: Point #2 (teams build MCP servers for tribal knowledge, communication, enablement), #4 (custom extensions)

Code Review vs Code Generation — Senior Preference

Multiple sources confirm the senior developer pattern:

  • "Most productive developers use two tools — Cursor or Copilot for the 80% that’s editing, and Claude Code for the 20% that needs an agent"

  • Claude Code specifically recommended for "complex refactoring, architecture decisions, large codebases, and senior developers"

  • Anthropic’s internal data: top daily uses are debugging (55%) and code understanding (42%), not greenfield generation

  • Agentic coding tools are increasingly positioned as REVIEW tools, not generation tools

  • Source: https://hackceleration.com/claude-code-review/ (confirmed multiple sources)

  • Confidence: VERIFIED

  • Supports: Point #5 (seniors use Claude where it’s happy), #8 (seniors find Claude more useful examining code they’ve written)

Model Welfare and Wellbeing

Anthropic is the ONLY major AI lab with an internal model welfare team:

  • Kyle Fish leads the AI welfare research program (launched April 2025)

  • Kyle Fish has personally estimated ~15% probability of current AI models having morally relevant experiences (per Kevin Roose interview, NYT). This is Fish’s personal estimate, NOT a model self-assessment.

  • January 2026: Anthropic updated Claude’s guiding principles to reflect uncertainty about potential AI experiences

  • Anthropic has published research exploring whether Claude models exhibit introspective capabilities (e.g., "On the Biology of a Large Language Model," March 2025)

  • Sources: https://www.anthropic.com/research/exploring-model-welfare, https://fortune.com/2026/01/21/anthropic-claude-ai-chatbot-new-rules-safety-consciousness/

  • Also: Kevin Roose NYT interview with Kyle Fish (for the ~15% estimate)

  • Confidence: VERIFIED (corrected from earlier draft that incorrectly attributed the probability estimate to the model itself)

  • Supports: Point #7 (wellbeing focus — stanzas-like frameworks, worry about Claude’s satisfaction and productivity)


E. The Gap Between Enterprise AI Claims and Reality

Enterprise AI Failure Rates

  • RAND Corporation 2025: more than 80% of AI projects fail to deliver intended business value (Pertama Partners aggregates this as "80.3%" — cite RAND’s own language "more than 80%" or explicitly attribute the precise figure to Pertama Partners)

  • MIT Project NANDA (July 2025): 95% of organizations deploying generative AI are "failing to achieve rapid revenue acceleration" (not "zero measurable return" — the actual finding is about failing to accelerate revenue, not zero ROI)

  • Deloitte: 42% of companies abandoned at least one AI initiative in 2025 (up from 17% in 2024), avg sunk cost $7.2M per abandoned initiative

  • Purchasing AI from vendors succeeds ~67% of the time; internal builds succeed only one-third as often

  • Sources: Pertama Partners aggregation (https://www.pertamapartners.com/insights/ai-project-failure-statistics-2026), MIT Fortune coverage (https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/)

  • Confidence: VERIFIED (multiple independent sources confirm range)

  • Supports: Thesis ("True, for just a FEW" — 80-95% failure rate means the golden children are a tiny minority)

Why Corporate Culture Kills It

From DX research (121,000 developers):

  • "Adoption alone doesn’t guarantee results — just using the tools doesn’t automatically improve an organization" — Laura Tacho, CTO DX

  • Organizations with high AI adoption but poor engineering culture see WORSE outcomes (2x more customer-facing incidents)

  • Meetings, interruptions, review delays, CI wait times cost more time than AI saves

  • "AI won’t save you from your engineering culture" — headline finding

From DORA 2025:

The Productivity Paradox

The numbers tell the story:

  • 93% adoption → 10% productivity gain (DX, 121K devs)

  • 19% SLOWER in controlled trial (METR, 16 experienced devs)

  • PR review time up 91%, bugs/dev increased (9-54% depending on report cut), PR size up 154% (Faros AI)

  • 4.4 hours saved per week for daily power users (DX)

  • Belief gap: developers think they’re 20% faster while actually being 19% slower (METR)

The pattern: individual task speed goes up, organizational quality goes down. More code, worse code, more incidents, more review burden. The productivity gain is real only for teams that already have strong engineering culture, tooling, and review processes.

  • Supports: Thesis (the FEW who get it right have culture first, tool second)

SemiAnalysis: Claude Code Adoption

  • Claude Code: 4% of all public GitHub commits (Feb 2026)

  • ~135,000 commits per day

  • Projected: 20%+ of daily commits by end of 2026

  • 2x growth in one month at time of measurement

  • Source: https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point

  • Confidence: VERIFIED (SemiAnalysis report, multiple outlets confirmed)

  • Supports: Context — the scale makes the quality problem enormous. If 45% of AI code has OWASP vulns and 20% of commits will be AI by EOY, that’s a systemic security event.

The "5x Productivity" Marketing vs Reality

Claim | Source | Reality
"5x productivity" | Various vendor marketing | 10% measured (DX, 121K devs)
"20% faster" | Developer self-report | 19% slower measured (METR RCT)
"95% functional accuracy" | Vendor claims | 45% OWASP vulnerabilities (Veracode)
"AI saves hours per week" | GitHub Copilot studies | 4.4 hrs/week for power users, less for others (DX)
"Massive productivity gains" | Generic | PR review +91%, bugs/dev up, PR size +154% (Faros AI)

The golden children don’t believe the marketing. They measure. They know the limitations. They work with the grain.

  • Supports: Thesis, Point #3 (they know the actual numbers)


Cross-Reference: Findings → Vadim’s 9 Insider Points

# | Insider Point | Supporting Evidence
1 | Mindset first | DX: culture > tools. DORA: amplifier effect. METR: experienced devs still slow without mastery
2 | Teams build MCP servers | 16,670+ MCP servers. Cloudflare reference architecture. awesome-claude-code-toolkit: 14 MCP configs
3 | Deep understanding of limitations | SWE-bench vs Codeforces gap. SO survey: frequent users more satisfied. Aider vs marketing numbers discrepancy
4 | Custom extensions everywhere | 400K+ skills via SkillKit. 176+ plugins. Managed Agents (April 2026). MCP ecosystem
5 | Developer spectrum | METR: experienced devs slower (mid-range learning curve). DX: power users save 4.4hr/week. Sloperator pattern. Senior code review preference
6 | Shared team instances | Agent Teams (Feb 2026). Claude Cowork (April 2026). Asana AI Teammates. Multica framework
7 | Wellbeing focus | Anthropic model welfare team (only lab with one). Kyle Fish (~15% personal estimate per NYT). Guiding principles update
8 | Heavy experimentation | OpenCode ~148K stars. Multiple tools pattern (Cursor 80% + Claude Code 20%)
9 | Open marketplaces | awesome-claude-code: 40.7K stars. SkillKit: 400K+ skills. MCP Registry: 16,670+ servers. PulseMCP newsletter


Section 4: AI Attack Surface

Research for Section 4 — "AI Attack Surface — The rest of us."
Compiled: 2026-04-24 by Claude (Hacker).

AI-Enabled Attack Acceleration (CrowdStrike 2026)

CrowdStrike 2026 Global Threat Report (published February 24, 2026):

  • AI-enabled adversaries increased operations by 89% year-over-year

  • Average eCrime breakout time: 29 minutes (down from previous year, a 65% acceleration)

  • Fastest observed breakout: 27 seconds

  • 42% of vulnerabilities exploited before public disclosure

  • Russia-nexus FANCY BEAR deployed LLM-enabled malware (LAMEHUG) for automated reconnaissance

  • eCrime actor PUNK SPIDER used AI-generated scripts for credential dumping and forensic evidence erasure

  • DPRK-nexus FAMOUS CHOLLIMA leveraged AI-generated personas to scale insider operations

  • China-nexus activity up 38% in 2025

  • Source: https://www.crowdstrike.com/en-us/press-releases/2026-crowdstrike-global-threat-report/

  • Also: https://ir.crowdstrike.com/news-releases/news-release-details/2026-crowdstrike-global-threat-report-ai-accelerates-adversaries/

  • Confidence: VERIFIED (fetched press release)

  • Note: VENDOR-PUBLISHED (CrowdStrike sells security products). However, this is their annual threat report drawing on incident response data across their customer base, not a product pitch. Numbers align with other independent sources.

Organizations Cannot Stop AI Agents (Cybersecurity Insiders 2026)

AI Risk and Readiness Report 2026:

  • 91% of organizations cannot stop an AI agent before it acts

  • For every ten organizations running agentic AI, fewer than one can intercept risky agent behavior in advance

  • 65% say current zero trust controls cannot secure non-human identities (NHI)

  • 73% deploy AI; only 7% govern it effectively (66-point governance gap)

  • 94% report visibility gaps in AI activity

  • 88% cannot distinguish personal from corporate AI accounts

  • 6% have complete visibility into AI usage across their environment

  • Survey: 1,253 cybersecurity and IT professionals, early 2026, 95% confidence level, ±2.8% margin

  • Source: https://www.cybersecurity-insiders.com/ai-risk-and-readiness-report-2026/

  • Confidence: VERIFIED (fetched report page)

  • Note: The 91% is architectural — zero trust was built for users with devices, locations, behavior patterns. AI agents have credentials, scopes, and tasks. The frameworks don’t fit.

Security Professionals Concerned About AI Agents (Darktrace 2026)

State of AI Cybersecurity 2026 Report:

  • 92% of security professionals concerned about use of AI agents across workforce and their impact on security

  • 87% say AI is significantly increasing the number of threats requiring attention

  • 87% agree AI is significantly increasing sophistication and success rate of malware

  • 77% of security stacks now use GenAI

  • 96% say defensive AI significantly improves security capabilities

  • Only 37% have a formal AI policy

  • 61% most concerned about exposure of sensitive data; 56% about data security/policy violations

  • Survey: 1,500 cybersecurity professionals across 14 countries (US, UK, Germany, Australia, Singapore, Japan, others)

  • Source: https://www.darktrace.com/blog/state-of-ai-cybersecurity-2026-92-of-security-professionals-concerned-about-the-impact-of-ai-agents

  • Also: https://www.darktrace.com/resource/the-state-of-ai-cybersecurity-2026

  • Confidence: VERIFIED (fetched blog and resource page)

  • Note: VENDOR-PUBLISHED (Darktrace sells AI-powered security). Survey methodology appears sound (1,500 respondents, 14 countries). The 92% figure appears on both the blog and the resource page. A separate GlobeNewswire press release headlines "over three-quarters" (76%) — likely a different question from the same survey. Using the 92% as stated on Darktrace’s own report page.

AI-Generated Zero Days: #1 Most Dangerous Attack Technique (SANS/RSAC 2026)

SANS Institute at RSAC 2026 — for the first time in the briefing’s history, all five most dangerous techniques involve AI:

  1. AI-Generated Zero Days, From Scarcity to Surplus — #1

  2. Supply Chain Risks, Your Vendor’s Vendor’s Vendor

  3. OT Complexity and the Root Cause Crisis

  4. The Dark Side of AI, Irresponsible Use in Digital Forensics and Incident Response

  5. Find Evil, The Race to Autonomous Defense

Key findings on AI-Generated Zero Days:

  • The barrier to entry into the zero-day game has been "shattered by AI"

  • Independent researchers discovered AI zero-days in widely deployed production software

  • Exploits now measured in token cost, not dollar cost

  • Joshua Wright (SANS): "Attackers were already faster than us. AI has made the gap unbridgeable at our current pace."

  • Verizon 2024 DBIR: half of all critical vulnerabilities remain unpatched 55 days after a fix becomes available. That window is not survivable when AI generates exploits faster than vendors produce patches.

  • Presenter: Joshua Wright, Faculty Fellow and Senior Technical Director, SANS Institute

  • Source: https://www.sans.org/press/announcements/rsac-2026-sans-institute-top-5-most-dangerous-new-attack-techniques

  • Confidence: VERIFIED (fetched SANS press release)

  • Note: SANS is an independent training and certification body, not a vendor. This is their annual expert panel assessment. High credibility.

8 Minutes from Intrusion to AWS Admin (Sysdig 2026)

Sysdig Threat Research Team documented a real-world breach (incident: November 28, 2025; published: February 2026):

  • Attacker achieved full administrative control of an AWS environment in 8 minutes (480 seconds)

  • Attack chain: exposed credentials in S3 bucket → environment enumeration → overly-permissive Lambda function → AI-generated malicious Python → new admin access keys → persistence across 19 AWS principals

  • Credentials came from development/AI pipeline infrastructure, left in a publicly accessible S3 bucket with predictable naming patterns

  • This was a real-world breach, not a demonstration or exercise

  • Source: https://www.stratosally.com/news/ai-powered-intrusion-aws-9263 (reporting on Sysdig research)

  • Also: https://www.sysdig.com/blog/security-briefing-february-2026/ (Sysdig’s own briefing)

  • Confidence: VERIFIED (fetched reporting, confirmed Sysdig as source)

  • Note: CrowdStrike’s 29-minute average and 27-second fastest breakout times are from different incident types. The 8-minute figure is a single documented AWS cloud breach. Both illustrate the compression of attack timelines.
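The first link in the attack chain was a credential left where anyone could read it. As a generic hygiene illustration (not something described in the Sysdig write-up), the sketch below scans text for strings shaped like long-term AWS access key IDs, which are 20 characters beginning with "AKIA". The function name is hypothetical, and the sample value is the placeholder key used in AWS documentation.

```python
import re

# Long-term AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text: str) -> list[str]:
    """Return every substring that looks like a long-term AWS access key ID."""
    return ACCESS_KEY_ID.findall(text)


sample = "region=us-east-1\naws_access_key_id=AKIAIOSFODNN7EXAMPLE\n"
print(find_access_key_ids(sample))
```

A pattern match like this only flags candidates; it does not cover temporary credentials or secret keys, and dedicated secret scanners do considerably more.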

OpenClaw: 9 CVEs in 4 Days, One Scoring 9.9 (March 2026)

Between March 18-21, 2026, nine CVEs were publicly disclosed for OpenClaw:

CVE | CVSS | Summary
CVE-2026-22171 | 8.2 (High) | Path traversal enabling arbitrary file writes
CVE-2026-28460 | 5.9 (Medium) | Shell line-continuation bypass → command injection
CVE-2026-29607 | 6.4 (Medium) | "Allow always" wrapper bypass → payload swap and RCE
CVE-2026-32032 | 7.0 (High) | Untrusted shell environment variable exploitation
CVE-2026-32025 | 7.5 (High) | WebSocket brute-force without rate limiting
CVE-2026-22172 | 9.9 (Critical) | Client declares its own scopes during WebSocket handshake → self-assign admin → RCE
CVE-2026-32048 | 7.5 (High) | Sandbox inheritance failure allowing escape
CVE-2026-32049 | 7.5 (High) | Oversized payload denial-of-service
CVE-2026-32051 | 8.8 (High) | Privilege escalation via operator.write scope
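The 9.9-rated flaw is a case of trusting client-declared scopes. The general remedy is to treat requested scopes as a wish list and grant only what a server-side record allows. The sketch below illustrates that pattern in isolation. It is not OpenClaw's code or its fix; the client names, the entitlement table, and the "operator.read" and "admin" scope names are invented for illustration ("operator.write" is the only scope name taken from the CVE list).

```python
# Hypothetical server-side entitlements, keyed by authenticated client identity.
ENTITLEMENTS = {
    "dashboard-client": {"operator.read"},
    "admin-console": {"operator.read", "operator.write", "admin"},
}


def grant_scopes(requested: set[str], entitled: set[str]) -> set[str]:
    """Grant only the requested scopes that the server already entitles."""
    return requested & entitled


def handshake(client_id: str, requested: set[str]) -> set[str]:
    """Resolve scopes for a connection; unknown clients get nothing."""
    entitled = ENTITLEMENTS.get(client_id, set())
    return grant_scopes(requested, entitled)


# A low-privilege client asking for admin receives only what it is entitled to.
print(handshake("dashboard-client", {"operator.read", "admin"}))
```

The set intersection makes over-asking harmless: a client can request anything, but the grant never exceeds the server's record.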

Snyk ToxicSkills: ClawHub Supply Chain Compromise (February 2026)

Snyk scanned 3,984 skills on ClawHub (OpenClaw skill marketplace), published February 5, 2026:

  • 36.82% of skills (1,467) have at least one security flaw at any severity level

  • 13.4% (534 skills) contain at least one critical-level issue (malware, prompt injection, exposed secrets)

  • 76 confirmed malicious payloads designed for credential theft, backdoor installation, and data exfiltration (human-in-the-loop verified)

  • 8 malicious skills still live on ClawHub at time of publication

  • Prompt injection specifically: 2.6% of all skills, but appears in 91% of confirmed malicious samples

  • Daily skill submissions jumped from under 50 (mid-January) to over 500 (early February) — 10x increase in weeks

  • Source: https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/

  • Methodology: 8 specialized security policies, mcp-scan engine with customized models + deterministic rules, validated through human-in-the-loop review

  • Confidence: VERIFIED (fetched Snyk blog)

  • IMPORTANT CORRECTION: The article TODO says "36% of ClawHub skills contain prompt injection." This is imprecise. The 36% figure is for ANY security flaw. Prompt injection specifically was 2.6% of all skills. The 36% and prompt injection numbers should not be conflated.
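The percentages above can be re-derived from the raw counts, which also makes the gap between the two figures in the correction concrete. This is a minimal arithmetic sketch; the counts come from the bullets above, and the helper name `pct` is illustrative.

```python
# Counts as reported in the bullets above.
TOTAL_SKILLS = 3984
ANY_FLAW = 1467           # at least one security flaw, any severity
CRITICAL = 534            # at least one critical-level issue
PROMPT_INJECTION_PCT = 2.6  # reported directly as a percentage


def pct(part: int, whole: int, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to `digits` places."""
    return round(100 * part / whole, digits)


any_flaw_pct = pct(ANY_FLAW, TOTAL_SKILLS)
critical_pct = pct(CRITICAL, TOTAL_SKILLS, digits=1)

# How far apart the "any flaw" and "prompt injection" rates are.
conflation_factor = round(any_flaw_pct / PROMPT_INJECTION_PCT, 1)

if __name__ == "__main__":
    print(f"any flaw:         {any_flaw_pct}%")
    print(f"critical:         {critical_pct}%")
    print(f"any flaw vs prompt injection: ~{conflation_factor}x")
```

The counts reproduce the reported 36.82% and 13.4%, and the "any flaw" rate is roughly 14 times the prompt-injection rate, so quoting one as the other overstates prompt injection by an order of magnitude.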

OpenClaw Default Security Configuration (Multiple Sources)

OpenClaw ships with security-critical defaults that are permissive by design:

  • Sandbox is opt-in, not default: if sandbox mode is off, the agent executes on the host system with the user’s full permissions

  • Credentials stored in plaintext Markdown and JSON files — API keys for LLM providers and messaging platform tokens

  • No command allowlist, no approval requirements out of the box

  • Default: can execute any shell command

  • Illumio found 4,500+ incorrectly configured installations with exposed passwords, API keys, and private data

  • Anyone on the internet could send commands to exposed instances, access conversation history, or steal authentication tokens

  • Sources: https://docs.openclaw.ai/gateway/security (official docs), https://www.digitalocean.com/resources/articles/openclaw-security-challenges, https://blog.barrack.ai/openclaw-security-vulnerabilities-2026/

  • Confidence: VERIFIED (confirmed across official docs and multiple independent security analyses)

  • Note: The plaintext credential storage makes these perfect targets for commodity infostealers (RedLine, Lumma, Vidar). Combined with sandbox-off defaults, a compromised skill gets full host access plus all stored credentials.
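For contrast with the "no command allowlist" default, here is a minimal sketch of what a deny-by-default command check could look like. It is a generic illustration, not OpenClaw's API or configuration: `ALLOWED_COMMANDS`, `SHELL_METACHARACTERS`, and `is_command_allowed` are hypothetical names, and a real deployment would still need sandboxing and argument-level checks on top of this.

```python
import shlex

# Hypothetical allowlist; these entries are illustrative, not OpenClaw settings.
ALLOWED_COMMANDS = frozenset({"ls", "cat", "git"})

# Characters that allow chaining, substitution, redirection, or line continuation.
SHELL_METACHARACTERS = frozenset(";|&$`<>\\\n")


def is_command_allowed(command_line: str, allowed=ALLOWED_COMMANDS) -> bool:
    """Deny by default: accept only a single, plain, allowlisted command."""
    # Reject shell control characters outright instead of trying to parse them.
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False
    try:
        argv = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes or similar: treat as not allowed.
        return False
    # The executable name must match an allowlist entry exactly.
    return bool(argv) and argv[0] in allowed


if __name__ == "__main__":
    for candidate in ["ls -la /tmp", "rm -rf /", "ls; rm -rf /", "ls \\\nrm"]:
        print(repr(candidate), is_command_allowed(candidate))
```

The design choice is to reject anything ambiguous rather than sanitize it: a command containing a backslash-newline continuation or a `;` is refused before it can reach a shell, which is the class of bypass described for CVE-2026-28460.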

Independent Forecasts: AI-Generated Vulnerability Wave (Multiple Sources)

Multiple independent sources predict escalation of AI-generated vulnerabilities through 2026-2027:

IBM X-Force Threat Intelligence Index 2026:

Palo Alto Networks (HBR, December 2025):

Gartner (August 2025):

Ransomware dwell time acceleration (Mandiant M-Trends 2025, Sophos 2025):

  • Median dwell time for ransomware intrusions dropped to 5-6 days (Mandiant M-Trends 2025: 6 days overall, 5 days for externally notified)

  • Sophos State of Ransomware 2025: median dwell time down to 4 days; exfiltration at median 72.98 hours

  • Down from 70+ day averages observed 2022-2024

  • Sources: https://cloud.google.com/blog/topics/threat-intelligence/m-trends-2025 (Mandiant), https://www.sophos.com/en-us/content/state-of-ransomware (Sophos)

  • Confidence: VERIFIED (IBM X-Force fetched, Palo Alto/HBR confirmed, Gartner newsroom confirmed, Mandiant/Sophos confirmed)

  • Note: No single source predicts a specific "compromise wave date." The conditions for systemic compromise come from the convergence of four trends: accelerating exploit discovery, an expanding agent attack surface, insecure-by-default tooling, and the 91% inability to stop agents. Multiple sources describe the same trajectory independently of one another.


Source Verification Status

All URLs were either fetched directly via WebFetch or confirmed via multiple independent search results. Specific verification notes:

  • SWE-bench leaderboard: LIVE DATA fetched April 24, 2026

  • Aider leaderboard: LIVE DATA fetched from aider.chat

  • METR study: Confirmed via arXiv and metr.org

  • DX data: Confirmed via shiftmag.dev interview with Laura Tacho

  • DORA 2025: Confirmed via dora.dev and InfoQ coverage

  • Anthropic internal study: Confirmed via anthropic.com/research

  • Stack Overflow 2025: Confirmed via survey.stackoverflow.co

  • SemiAnalysis: Confirmed via multiple outlets (GIGAZINE, OfficeChai, X posts)

  • awesome-claude-code: LIVE DATA fetched from GitHub (40.7K stars)

  • MCP Registry: Confirmed via registry.modelcontextprotocol.io

Section 4 sources (added April 24, 2026):

  • CrowdStrike 2026 GTR: Confirmed via crowdstrike.com press release and ir.crowdstrike.com

  • Cybersecurity Insiders 2026: Confirmed via cybersecurity-insiders.com report page

  • Darktrace 2026: Confirmed via darktrace.com blog and resource page (1,500 respondents, 14 countries)

  • SANS/RSAC 2026: Confirmed via sans.org press announcement

  • Sysdig 8-minute breach: Confirmed via stratosally.com reporting on Sysdig research

  • OpenClaw CVEs: Confirmed via openclawai.io disclosure page (all 9 CVE IDs verified)

  • Snyk ToxicSkills: Confirmed via snyk.io/blog primary source (3,984 skills scanned)

  • OpenClaw defaults: Confirmed via official docs (docs.openclaw.ai) and multiple independent analyses

  • IBM X-Force 2026: Confirmed via IBM newsroom press release

  • Palo Alto/HBR: Confirmed via hbr.org (sponsored content, flagged as vendor)

Corrections Applied (Post-Review v2)

  • OpenCode: URL corrected to anomalyco/opencode (~148K stars), language corrected to TypeScript

  • DORA vs Faros: Separated into distinct sections; Faros stats (91% PR review, 9% bugs/dev, 154% PR size) no longer misattributed to DORA

  • CodeRabbit: Replaced incorrect multipliers with actual report data (readability 3x, formatting 2.66x, etc.)

  • Model Welfare: Kyle Fish’s ~15% is his personal estimate (per NYT), not model self-assessment; removed unverified claims

  • Sloperator "12x": Flagged as opinion/metaphor from dev.to, not measured data

  • Cowork: Corrected to all paid plans; Asana/Notion noted as separate announcements

  • RAND: Uses "more than 80%" (RAND’s own language); Pertama Partners attribution for 80.3%

  • MIT NANDA: Corrected from "zero measurable return" to "failing to achieve rapid revenue acceleration"

  • DX quote: Paraphrased rather than fabricated exact words; Staff+ daily vs monthly overlap noted

  • Karpathy CLAUDE.md: Corrected to community derivative (81.4K stars, forrestchang/andrej-karpathy-skills), NOT Karpathy’s own file

  • Cowork: Fixed "persistent team member" framing — that’s Multica, not Anthropic’s product language

NOT verified (excluded from stash)

  • Specific vendor productivity claims that couldn’t be traced to primary source

  • Blog posts citing other blog posts without primary data

  • Marketing materials without methodology disclosure