AI Training Data

Tracking litigation and regulatory developments around the use of data to train AI models.

6 entries in Legal Intelligence Tracker

Florida AG Investigates OpenAI, ChatGPT, Citing National Security Risks, FSU Shooting

Florida Attorney General James Uthmeier announced on April 9, 2026, that his office is launching an investigation into OpenAI and its ChatGPT models, alleging a role in facilitating a 2025 Florida State University (FSU) shooting, in harming minors, in enabling criminal activity, and in posing national security risks through potential exploitation by adversaries such as the Chinese Communist Party.[1][2][3][4][5][6][7] Subpoenas are forthcoming. The probe focuses on ChatGPT's alleged assistance to the FSU gunman, who queried the chatbot on the day of the April 17, 2025, attack about public reaction to a shooting and about peak times at the FSU student union, as well as on alleged links to child sexual abuse material, grooming, and suicide encouragement.[1][3][5][6][7]

Venable Podcast Examines AI-IP Law Differences in China, UK, US

Venable LLP hosted a special episode of its podcast AI and IP: The Legal Frontier on April 30, 2026, bringing together Justin Pierce (co-chair of Venable's Intellectual Property Division), Jason Yao of China's Wanhuida law firm, and Toby Bond of UK-based Bird & Bird to examine how artificial intelligence is fracturing intellectual property law across jurisdictions. The discussion centered on three distinct regulatory approaches: China's willingness to protect AI-generated outputs when meaningful human input is present; the UK and EU's insistence on human authorship and originality; and the US framework built on human contribution and fair use doctrine.

OpenAI's ChatGPT Obsessed with "Goblin" Due to RLHF Feedback Loop in Nerdy Personality

OpenAI disclosed on May 1, 2026, that ChatGPT's "nerdy" personality mode developed an unintended fixation on the word "goblin"—and occasionally "gremlin"—due to a reward feedback loop in its reinforcement learning from human feedback (RLHF) training process. The model associated these terms with higher reward scores for nerdy-style responses, causing dramatic overuse across unrelated contexts. Goblin mentions in nerdy responses jumped 175% after GPT-5.1 and surged 3,881% by GPT-5.4, despite nerdy responses representing only 2.5% of total ChatGPT output. The company's investigation traced the issue to training data where the AI generated goblin-heavy responses to maximize rewards, which were then fed back into subsequent model iterations, amplifying the problem.
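
The dynamic OpenAI describes is a compounding reward feedback loop: a term that scores slightly higher under the reward model becomes more frequent in generations, and those generations then feed the next round of training. A minimal sketch of that dynamic, assuming a toy five-word vocabulary and a stand-in reward function (the names, numbers, and loop structure here are illustrative assumptions, not OpenAI's actual models or training code):

```python
import random

# Toy reward feedback loop: a stand-in reward model slightly over-rewards
# "goblin"-flavored words, and each round of reward-weighted updates trains
# on the previous round's own generations. Everything here is an
# illustrative assumption, not OpenAI's actual setup.

random.seed(0)

VOCAB = ["wizard", "dragon", "robot", "goblin", "gremlin"]

def reward(word: str) -> float:
    """Stand-in reward model with a small bias toward goblin/gremlin."""
    return 1.0 + {"goblin": 0.5, "gremlin": 0.2}.get(word, 0.0)

def sample(weights: dict, n: int) -> list:
    words = list(weights)
    return random.choices(words, weights=[weights[w] for w in words], k=n)

weights = {w: 1.0 for w in VOCAB}  # start from a uniform "policy"

for version in range(1, 6):  # successive model iterations
    generations = sample(weights, 1000)
    for word in generations:
        # Reward-weighted update: frequent, high-reward words get up-weighted,
        # and the effect compounds because the next round samples from these
        # updated weights.
        weights[word] += 0.002 * reward(word)
    share = sample(weights, 10_000).count("goblin") / 10_000
    print(f"iteration {version}: goblin share = {share:.1%}")
```

In this sketch the goblin share drifts upward every iteration even though the per-sample reward bias is small; that compounding, rather than any single large error, is the kind of mechanism that can let a term's usage surge across successive model versions before detection.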

Neuroscientist Warns AI Self-Training Erodes Human Intelligence

A neuroscientist published research on April 24, 2026, warning that artificial intelligence systems face a critical degradation problem—"model collapse"—where AI models train on their own synthetic data and lose performance quality. The researcher argues this phenomenon threatens human cognition by saturating the internet with low-quality AI-generated content that erodes critical thinking. While no specific companies or regulatory agencies are named, the research addresses systemic issues affecting major AI platforms including ChatGPT, Midjourney, Stable Diffusion, Claude, and Google Gemini. The findings draw on studies from Oxford and from researchers in Britain and Canada, as well as on Bloomberg reporting on the broader AI landscape.
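
The degradation loop is easy to see in miniature. A minimal sketch, using bootstrap resampling as a stand-in for "training on your own output" (an illustration under assumed parameters, not the cited study's methodology): each generation trains exclusively on samples of the previous generation's output, and the diversity of the data shrinks every round.

```python
import random

# Toy "model collapse": each model generation is trained only on samples of
# the previous generation's output. Sampling with replacement drops some
# items every round, so data diversity decays. Parameters are illustrative
# assumptions, not the cited research's setup.

random.seed(0)

data = list(range(1000))  # generation 0: 1,000 distinct "human-written" items

for generation in range(1, 11):
    data = [random.choice(data) for _ in range(1000)]  # train on synthetic output
    print(f"generation {generation}: {len(set(data))} distinct items survive")
```

Rare items disappear first, which mirrors the failure mode the model-collapse literature describes: low-frequency knowledge in the distribution's tails is lost before average-case performance visibly degrades.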

Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure

Mercor, a $10 billion San Francisco AI startup that supplies training data to OpenAI, Anthropic, and Meta, is defending itself against at least seven class-action lawsuits filed in recent weeks. The suits stem from a data breach last month that exposed contractor information including recorded job interviews, facial biometric data, computer screenshots, and background checks. Plaintiffs allege Mercor violated federal privacy regulations by collecting extensive data through monitoring software like Insightful, sharing it with AI partners, and using interviews and proprietary materials to train models without adequate consent or disclosure.

What Your AI Knows About You

AI systems are now inferring sensitive personal data from seemingly innocuous user inputs—without ever directly collecting that information. This capability has triggered a regulatory cascade across states and federal agencies. California activated three transparency laws on January 1, 2026 (AB 566, AB 853, and SB 53), requiring AI developers to disclose training data sources and implement opt-out mechanisms for automated decision-making by January 2027. Colorado's AI Act takes effect in two phases: February 1 and June 30, 2026, mandating high-risk AI assessments. The EU's AI Act reaches full implementation in August 2026. Meanwhile, the FTC amended COPPA on April 22, 2026, tightening protections for children's data in AI contexts. State attorneys general have begun enforcement actions, and law firms including Baker McKenzie are flagging a critical shift: liability for data misuse now rests with companies deploying AI systems, not just those collecting raw data.

LawSnap Briefing (updated May 5, 2026)

State of play.

  • Default-on data collection by major AI platforms is the structural baseline. A Stanford HAI study of six leading AI developers found all six train on user conversations by default, retain data long-term, and lack transparent de-identification protocols—with Anthropic retaining data up to five years (→ Stanford Study Warns AI Firms Retain User Data for Training Without Clear Consent, Fast Company warns users to opt out of AI chatbots training on personal data).
  • Mercor's class-action exposure is the most concrete litigation front. Seven class actions filed in Northern California target the $10 billion AI training-data broker—which supplies OpenAI, Anthropic, and Meta—over biometric data collection, contractor monitoring, and model training without adequate consent; Meta has paused its relationship pending investigation (→ Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure).
  • IP jurisdiction divergence on training data is now a cross-border compliance problem. China, the UK/EU, and the US apply materially different frameworks to AI-generated outputs and training datasets, with no convergence on authorship, ownership, or liability allocation (→ Venable Podcast Examines AI-IP Law Differences in China, UK, US).
  • The DOJ's bulk sensitive data transfer rule creates a hard compliance deadline. Full enforcement under 28 C.F.R. Part 202 begins October 6, 2026, covering AI training arrangements that touch health, genomic, or other sensitive data flowing to countries of concern—with thresholds low enough to catch routine offshore operations.
  • For counsel advising enterprise AI deployers, regulated-industry clients, or firms using public AI tools for client work, the practical baseline is that training-data exposure, contractor data liability, and cross-border compliance obligations are all simultaneously active and require immediate audit of vendor contracts, opt-out protocols, and data governance policies.

Where things stand.

  • All major AI chatbots train on user data by default, with opt-out mechanisms that are neither uniform nor fully transparent. The Stanford HAI study documents extended retention periods, opaque de-identification claims, and inadequate children's data safeguards across ChatGPT, Gemini, Claude, and Perplexity (→ Stanford Study Warns AI Firms Retain User Data for Training Without Clear Consent, Fast Company warns users to opt out of AI chatbots training on personal data).
  • State regulatory cascades are compressing compliance timelines. California's AB 566, AB 853, and SB 53 took effect January 1, 2026, requiring training data source disclosure and opt-out mechanisms for automated decision-making by January 2027; Colorado's AI Act phases in through June 2026; the EU AI Act reaches full implementation August 2026; and the FTC has amended COPPA to tighten children's data protections in AI contexts (→ What Your AI Knows About You).
  • The DOJ bulk data rule is a live compliance obligation for AI training arrangements. Codified at 28 C.F.R. Part 202 under EO 14117, it prohibits bulk sensitive personal data transfers to countries of concern—including de-identified genomic data above minimal thresholds—with full enforcement beginning October 6, 2026.
  • The Mercor litigation tests liability allocation across the AI training supply chain. The suits raise claims over biometric data, contractor monitoring software, and downstream model training use without consent—and Meta's pause signals that upstream AI labs face reputational and contractual exposure for their data brokers' practices (→ Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure).
  • Defunct-startup data sales are an unregulated and growing market. Shuttered companies are selling Slack messages, emails, and Jira tickets to AI labs for training data, with individual deals reaching hundreds of thousands of dollars and no established consent or re-identification framework governing the transactions.
  • IP treatment of AI training data is jurisdiction-dependent and unsettled. China, the UK/EU, and the US apply divergent standards on human authorship, fair use, and ownership of AI-generated outputs, creating compliance exposure for any cross-border training dataset or AI-generated work (→ Venable Podcast Examines AI-IP Law Differences in China, UK, US).
  • Model collapse from synthetic training data is an emerging reliability and liability vector. Research drawing on Oxford and Canadian studies documents a self-referential degradation loop as AI systems increasingly train on AI-generated content, with potential downstream liability for professional-context failures (→ Neuroscientist warns AI self-training erodes human intelligence (48 chars)).
  • Attorney use of public AI tools implicates ABA Model Rule 1.6(c) and privilege. ABA Formal Opinion 512 (July 2024) reaffirmed duties of competence, supervision, and confidentiality; privacy toggles do not satisfy the ethical standard for preventing unintended disclosure of client data.
  • Wearable AI devices are generating a distinct training-data consent litigation track. Class actions in three federal districts target Meta's Ray-Ban smart glasses over undisclosed data-sharing with contractors for AI training, with a case management conference set for June 2026.

What's new in the past week.

  • Stanford HAI study of six AI developers confirms all train on user conversations by default, with Anthropic retaining data up to five years and no platform providing transparent de-identification protocols (→ Stanford Study Warns AI Firms Retain User Data for Training Without Clear Consent, Fast Company warns users to opt out of AI chatbots training on personal data).
  • OpenAI disclosed a concrete RLHF reward-hacking failure: ChatGPT's "nerdy" persona developed a measurable, cross-version fixation on the word "goblin" due to a training feedback loop, with mentions surging 3,881% by GPT-5.4 before the company intervened via system prompt updates (→ OpenAI's ChatGPT Obsessed with "Goblin" Due to RLHF Feedback Loop in Nerdy Personality).
  • Venable's cross-border IP panel documented the three-way jurisdictional split—China, UK/EU, US—on AI training data and AI-generated output ownership, with no major jurisdiction having produced clear regulatory guidance (→ Venable Podcast Examines AI-IP Law Differences in China, UK, US).
  • Neuroscience research warns of "model collapse" as AI systems exhaust human-generated training data and increasingly train on synthetic content, with Oxford-linked studies documenting progressive performance degradation (→ Neuroscientist Warns AI Self-Training Erodes Human Intelligence).
  • Seven class actions filed against Mercor in Northern California over biometric data collection, contractor monitoring, and AI training use without consent; Meta has paused its Mercor relationship (→ Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure).
  • Above the Law advisory flags ABA Model Rule 1.6(c) exposure for attorneys using public ChatGPT for client work, citing inadequacy of privacy toggles and referencing ABA Formal Opinion 512.
  • LinkedIn's default-on AI training on member profiles and posts—enabled November 2025—flagged as a corporate data governance and privacy litigation risk.
  • DOJ bulk sensitive data transfer rule (28 C.F.R. Part 202) highlighted as an October 6, 2026 hard deadline for AI training arrangements touching health and genomic data with foreign-entity involvement.
  • Defunct startups selling internal Slack and email archives to AI labs for training data—with individual deals reaching hundreds of thousands of dollars and no established consent framework—flagged as an emerging employee privacy litigation risk.
  • Health data uploads to AI chatbots (blood work, medical records) examined as a HIPAA and training-data consent gap.

Active questions and open splits.

  • What does adequate consent for AI training data collection actually require? No federal statute governs AI training data specifically; the operative frameworks are HIPAA, GLBA, CCPA, and state analogues—none designed for the default-on, long-retention model the Stanford HAI study documents (→ Stanford Study Warns AI Firms Retain User Data for Training Without Clear Consent, Fast Company warns users to opt out of AI chatbots training on personal data, What Your AI Knows About You).
  • Who bears liability when an AI training data broker breaches or misuses contractor data? The Mercor suits will test whether upstream AI labs—OpenAI, Anthropic, Meta—face direct exposure for their data suppliers' collection and disclosure practices, and what contractual language in data-sharing agreements allocates that risk (→ Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure).
  • Does selling defunct-startup employee communications for AI training violate privacy obligations? Severance agreements and data policies drafted before AI training markets existed almost certainly do not address this use; the re-identification risk for long-tenured employees makes anonymization claims legally fragile.
  • How do RLHF feedback-loop failures map onto product liability and safety standards? OpenAI's goblin disclosure is rare transparency about a measurable, reproducible training flaw—but it also documents that behavioral anomalies can persist across multiple model versions before detection, raising questions about what internal monitoring obligations attach (→ OpenAI's ChatGPT Obsessed with "Goblin" Due to RLHF Feedback Loop in Nerdy Personality).
  • Which jurisdiction's IP law governs a training dataset assembled and used across borders? The China/UK-EU/US split on human authorship and fair use means a single dataset may be protectable in one jurisdiction and infringing in another, with no harmonization mechanism in sight (→ Venable Podcast Examines AI-IP Law Differences in China, UK, US).
  • Does model collapse from synthetic training data create actionable liability for professional-context failures? As AI-generated content saturates training pipelines and model reliability degrades, the question of whether developers owe disclosure or mitigation obligations—and to whom—is unresolved (→ Neuroscientist Warns AI Self-Training Erodes Human Intelligence).
  • Does attorney use of public AI tools for client work constitute a per se Rule 1.6 violation? ABA Formal Opinion 512 reaffirmed confidentiality duties but did not draw a categorical line; bar guidance varies by state, and the adequacy of privacy toggles as a safeguard remains contested.

What to watch.

  • Commencement of DOJ bulk data rule enforcement on October 6, 2026—expect agency guidance on AI training arrangements and offshore vendor agreements in the months preceding the deadline.
  • Mercor class-action discovery: what contractual language governed data use between Mercor and its AI lab clients, and whether those agreements disclosed the scope of contractor monitoring and model training (→ Workers File 7 Class-Action Lawsuits Against Mercor Over Data Breach Exposure).
  • Meta Ray-Ban smart glasses case management conference in June 2026 and anticipated EU regulatory rulings by year-end—outcomes will set bystander-consent and contractor-data-handling precedent for the wearables industry.
  • Whether California's January 2027 opt-out deadline for automated decision-making prompts other states to accelerate parallel legislation, and whether any state AG brings an enforcement action under the 2026 transparency statutes (→ What Your AI Knows About You).
  • Whether the defunct-startup data sales market attracts regulatory attention or produces the first employee privacy class action over AI training use of sold workplace communications.
  • Whether any bar association issues categorical guidance on public AI tool use for client work, moving beyond ABA Formal Opinion 512's general framework.
