12,000+ API Keys and Passwords Exposed in AI Training Data
Key Takeaways
- Researchers at Truffle Security scanned the December 2024 archive of Common Crawl and discovered nearly 12,000 live secrets that still authenticated successfully: API keys, passwords, and tokens for AWS, Mailchimp, Slack, GitHub, and more. (BleepingComputer)
- These credentials were embedded in HTML, JavaScript, and configuration files on public websites that were later ingested into LLM training sets, meaning models and AI copilots may reproduce insecure practices or expose the keys themselves. (TechRadar)
- The leak underlines a profound risk: simply training on publicly available data (like Common Crawl) can inadvertently bake sensitive secrets into the resulting models.
- Organizations deploying or fine-tuning such models, or using AI coding assistants, face elevated risk of credential reuse, cloud-account compromise, and downstream supply-chain vulnerabilities.
Summary
In early 2025, security researchers revealed a systemic flaw in widely used AI training data: datasets drawn from the public web (notably Common Crawl) contained thousands of valid credentials, including API keys, passwords, and tokens, that remained active. These secrets were embedded in publicly accessible code, configuration files, and web pages, and were inadvertently ingested into the training pipelines of large language models and AI tools. (BleepingComputer)
Because many enterprises rely on pretrained models or AI assistants derived from these datasets, the exposure raises serious supply-chain and operational risks. AI-generated code might embed sensitive credentials, or attackers might use model-suggested secrets directly to compromise cloud accounts. This incident demonstrates that data provenance and credential hygiene are critical — not optional — when using open data sources for AI development.
What Happened: Incident Overview
- Researchers from Truffle Security scanned ~400 TB of public web data from Common Crawl’s December 2024 archive — covering 2.67 billion web pages. (BleepingComputer)
- Their open-source scanner, TruffleHog, identified 11,908 live secrets (roughly 12,000): valid API keys, access tokens, and passwords that successfully authenticated against services such as AWS, Slack, and Mailchimp (see the sketch after this list). (BleepingComputer)
- This data is commonly used by researchers and companies to train or fine-tune large language models (LLMs). As a result, models trained on this data may inadvertently learn insecure patterns or even embed these credentials in outputs. (Security Boulevard)
- The leaked credentials remained valid, and many were reused across multiple domains — increasing the risk of broad cloud-service exploits or downstream supply-chain compromise. (TechRadar)
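For illustration, the sketch below shows the general technique applied in this kind of research: pattern matching for credential-shaped strings combined with live verification against the issuing service. It is a simplified Python approximation, not Truffle Security's actual code; the file name, the single AWS detector, and the crude candidate-pairing logic are placeholders.

```python
# Minimal sketch of regex-plus-verification secret scanning, in the spirit of
# tools like TruffleHog. Paths and patterns here are illustrative only.
import re
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

# Illustrative detector: AWS access key IDs and candidate 40-char secret keys.
AWS_KEY_ID_RE = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")
AWS_SECRET_RE = re.compile(r"\b([A-Za-z0-9/+=]{40})\b")

def find_candidates(text: str):
    """Yield (key_id, secret) pairs that match AWS credential patterns.
    Pairing every ID with every 40-char string is deliberately naive."""
    ids = AWS_KEY_ID_RE.findall(text)
    secrets = AWS_SECRET_RE.findall(text)
    for key_id in ids:
        for secret in secrets:
            yield key_id, secret

def is_live(key_id: str, secret: str) -> bool:
    """Return True if the pair authenticates against AWS STS, i.e. a live secret."""
    try:
        sts = boto3.client("sts", aws_access_key_id=key_id,
                           aws_secret_access_key=secret)
        sts.get_caller_identity()
        return True
    except (ClientError, NoCredentialsError):
        return False

# Example: scan a locally saved page or config file from a crawl snapshot.
with open("crawled_page.html", encoding="utf-8", errors="ignore") as fh:
    page = fh.read()

for key_id, secret in find_candidates(page):
    status = "LIVE" if is_live(key_id, secret) else "inactive"
    print(f"{key_id[:8]}… -> {status}")
```

Production scanners ship hundreds of service-specific detectors and rate-limit their verification calls; the point here is only that detection plus verification is straightforward to automate at crawl scale.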
Why It Matters
This incident signals a critical blind spot in AI safety and enterprise security:
- AI supply-chain risk: Public datasets often used for training (like Common Crawl) can carry live secrets — meaning AI models may unknowingly propagate sensitive credentials.
- Widespread exposure: Thousands of keys across many services (cloud, APIs, SaaS) puts any organization using or deploying AI at risk.
- Model-driven credential leakage: AI coding tools or assistants may suggest insecure defaults learned from unsafe training data, such as hardcoded API keys, embedded tokens, or credentials (illustrated in the example at the end of this section).
- Attack surface expansion: Attackers can abuse exposed credentials to gain cloud access, exfiltrate data, or pivot deeper into applications tied to AI usage.
For enterprises working with AI, this incident underlines the urgent need for secret hygiene, provenance checking, and supply-chain governance in AI development pipelines.
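As a concrete illustration of the coding-assistant risk above, the hypothetical Python snippet below contrasts the hard-coded-credential pattern a model may have absorbed from leaked training data with the safer runtime lookup teams should enforce. The key shown is a made-up placeholder, not a real credential.

```python
import os

# Anti-pattern an assistant may reproduce from unsafe training data:
# a credential hard-coded in source (placeholder value, not a real key).
MAILCHIMP_API_KEY = "0123456789abcdef0123456789abcdef-us1"  # do not do this

# Safer pattern: resolve secrets at runtime from the environment or a
# dedicated secret manager, so they never land in code, repos, or crawls.
def get_mailchimp_key() -> str:
    key = os.environ.get("MAILCHIMP_API_KEY")
    if not key:
        raise RuntimeError("MAILCHIMP_API_KEY not set; fetch it from your secret store")
    return key
```

Enforcing the second pattern (via code review, pre-commit secret scanning, and CI policy) keeps credentials out of the public artifacts that crawls like Common Crawl ultimately capture.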
PointGuard AI Perspective
At PointGuard AI, we consider this a foundational example of why end-to-end AI security must include dataset and supply-chain hygiene, not just runtime model monitoring. To address these risks, our platform offers:
- ML-Asset Discovery & Inventory: Detects all external models, datasets, and training assets imported into projects — highlighting third-party data with embedded secrets or questionable provenance.
- Secret-Scan & Serialization Policies: Scans model inputs and training data for hard-coded secrets, insecure serialization formats, or leaked credentials before deployment (a simplified sketch of this kind of pre-ingestion scanning appears at the end of this section).
- Behavioral & Output Monitoring: Watches for suspicious model outputs or code suggestions that reference credentials or private keys — blocking or flagging them before they can be codified.
- Governance & Supply-Chain Risk Controls: Maintains an AI-SBOM (software bill of materials) for ML assets and enforces lifecycle controls, version audits, and credential rotation.
- Developer Awareness & Safe Coding Guidelines: Integrates best practices to prevent embedding credentials in code and encourages use of secure secret-management tools.
This incident reinforces a key principle: the security of AI depends on the cleanliness and governance of the data feeding it — not just the models themselves.
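To make the idea of pre-ingestion secret scanning concrete, here is a minimal, generic Python sketch that redacts credential-shaped strings from a corpus before training. It is an illustration of the concept only, using a handful of assumed example patterns; it is not PointGuard AI's implementation.

```python
import re
from typing import Iterable, Iterator

# Illustrative patterns only; production scanners ship hundreds of
# service-specific detectors, often with live verification.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),             # AWS access key ID
    re.compile(r"\bxox[baprs]-[A-Za-z0-9-]{10,}\b"),  # Slack token
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),           # GitHub personal access token
]

def redact_secrets(record: str) -> str:
    """Replace anything that looks like a credential with a fixed placeholder."""
    for pattern in SECRET_PATTERNS:
        record = pattern.sub("[REDACTED_SECRET]", record)
    return record

def filter_corpus(records: Iterable[str]) -> Iterator[str]:
    """Redact secrets from every record before it enters the training corpus."""
    for record in records:
        yield redact_secrets(record)

# Usage: wrap the raw document stream before tokenization / training.
raw_docs = ["api_key = 'AKIAABCDEFGHIJKLMNOP'", "plain prose with no secrets"]
clean_docs = list(filter_corpus(raw_docs))
print(clean_docs[0])  # -> "api_key = '[REDACTED_SECRET]'"
```

Redaction at ingestion time is a last line of defense; it complements, rather than replaces, rotating any credential that has already appeared in a public crawl.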
Incident Scorecard Details
Total AISSI Score: 7.1 / 10
Criticality = 7: Thousands of valid credentials (tokens, API keys, passwords) exposed in publicly available datasets used for AI training.
Propagation = 6: The dataset is widely distributed and used by many organizations and researchers; the keys are publicly accessible and easily embedded downstream.
Exploitability = 8: The attack requires minimal effort; any developer or attacker can load the dataset, extract keys, and attempt cloud or API access.
Supply Chain = 8: The root cause is shared public data used across ML supply chains, indicating a systemic vulnerability affecting all consumers of such datasets.
Business Impact = 7: Potential for credential misuse, data theft, cloud-account compromise, and regulatory exposure depending on downstream usage.
Sources
- The Hacker News — 12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training (The Hacker News)
- BleepingComputer — Nearly 12,000 API Keys and Passwords Found in AI Training Dataset (BleepingComputer)
- Truffle Security Blog — Research Finds 12,000 ‘Live’ Secrets in Common Crawl Dataset (trufflesecurity.com)
- Security Boulevard — Thousands of Live API Keys and Passwords Exposed in Public Datasets (Security Boulevard)
