Reddit Sues Startups It Says Illegally Scraped the Site to Train AI — what happened and why it matters

Reddit has sued Perplexity AI and several data-scraping firms, accusing them of illegally harvesting millions of user posts to train AI models. The lawsuit claims they bypassed Reddit’s protections and violated its terms, threatening its data licensing business. The case could set a key precedent for how AI firms use online content.

Reddit has launched a sweeping lawsuit against Perplexity AI and three data-scraping intermediaries, accusing them of illegally harvesting millions of user posts and comments to train commercial AI systems. The case, filed in federal court in New York, signals a major escalation in the ongoing war between social platforms and AI developers over who controls the vast reservoirs of online human knowledge.


⚖️ The Lawsuit

Reddit’s 58-page complaint targets:

  • Perplexity AI, a fast-growing AI search startup recently valued at over $20 billion,

  • Oxylabs UAB (Lithuania), a large proxy and web-scraping infrastructure provider,

  • AWMProxy, which Reddit describes as “a former Russian botnet operation,” and

  • SerpApi, a U.S.-based search API firm.

According to Reddit, the defendants used automated scraping systems at industrial scale to bypass Reddit’s anti-bot protections and harvest content even after being explicitly blocked. The company alleges that Perplexity not only ignored its robots.txt restrictions but continued scraping through Google Search to evade detection.
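For readers unfamiliar with robots.txt, the sketch below shows how a well-behaved crawler might consult a site's directives before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names and the example.com URLs are hypothetical and are not taken from Reddit's actual file or from the complaint. robots.txt is a voluntary convention, so a check like this only constrains crawlers that choose to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only (not Reddit's real file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is disallowed everywhere by the first rule group.
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot", "https://example.com/r/test/"))
    # Any other bot falls under the wildcard group, which only blocks /private/.
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "OtherBot", "https://example.com/r/test/"))
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

A compliant crawler would skip any URL for which `is_allowed` returns False. Reddit's allegation is that the defendants either ignored such directives or reached the same content indirectly through search results.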

“After Reddit issued a cease-and-desist letter in May 2024, Perplexity’s product showed a 40-fold increase in Reddit citations,” the complaint claims.

Reddit says this scraping violated its Terms of Service, provisions of the Digital Millennium Copyright Act (DMCA), and the U.S. Computer Fraud and Abuse Act (CFAA), and that it amounts to unfair competition and unjust enrichment.


🧠 What Reddit Says Is at Stake

At the heart of the case is Reddit’s argument that its vast database of user-generated discussions — spanning two decades — is a valuable commercial asset. The company already licenses data to major partners, including Google and OpenAI, under paid agreements. Unauthorized scraping, Reddit argues, undermines its ability to monetize and control access to this data.

Reddit likens the defendants to “would-be bank robbers who, after failing to get into the vault, break into the armored truck carrying the cash instead,” according to court filings.

“Reddit built one of the world’s largest archives of human conversation. No company should be able to simply take it, repackage it, and profit from it,” Reddit said in a statement to The Verge.


🤖 Perplexity’s Response

Perplexity AI has denied the allegations, saying its systems do not train on Reddit data and that the company obtains information from “publicly available and fair-use sources.”

In a statement to Axios, a Perplexity spokesperson said:

“We respect robots.txt and other site directives. Reddit’s claims are misguided and we intend to defend our position vigorously.”

Perplexity has positioned itself as an “AI-powered answer engine,” a hybrid between a search engine and a chatbot that offers real-time answers with citations. However, earlier investigations by Wired and Forbes suggested that Perplexity sometimes drew on content from sites that block web crawlers, raising questions about data provenance.


🔍 The Technical Arms Race

The lawsuit describes a sophisticated proxy network designed to cloak the scraping activity. Reddit alleges that Oxylabs and AWMProxy helped Perplexity rotate IP addresses and use Google’s search index as a backdoor to scrape Reddit posts when direct access was blocked.

SerpApi, another defendant, allegedly facilitated “massive automated queries” to extract Reddit content through search results — a method Reddit says constitutes “indirect scraping.”

These tactics, if proven, could set a legal precedent for indirect scraping liability — expanding accountability beyond the AI companies themselves to the infrastructure providers that enable data harvesting.


🧩 Broader Context: Reddit’s Crackdown on AI Use

This isn’t Reddit’s first confrontation with AI developers. In June 2025, the company sued Anthropic, creator of the Claude AI assistant, over similar alleged data scraping. That earlier case is still ongoing, and legal experts believe the outcomes of both suits could reshape how tech companies access and train on public data.

Reddit CEO Steve Huffman has been vocal about the issue, calling unlicensed scraping “data laundering” that exploits community contributions without compensation. The company’s IPO filing highlighted data licensing as a core revenue strategy.


📊 Implications for the AI Industry

If Reddit prevails, it could:

  1. Force AI firms to pay licensing fees for online data used in model training.

  2. Deter scraping via legal precedent, making “robots.txt” violations more enforceable in court.

  3. Encourage regulatory frameworks governing fair data use and AI transparency.

On the other hand, if Perplexity and its co-defendants successfully argue fair use or lack of direct copyright infringement, the decision could embolden other AI developers to continue training on publicly available data without compensation.


🚨 The Bigger Picture

The case lands amid a global reckoning over AI’s appetite for human data. Publishers, artists, and digital communities are fighting to reclaim ownership of their work from opaque machine-learning pipelines. Reddit’s aggressive legal stance could mark a turning point in how online communities define the boundaries of consent in the AI era.

As one legal analyst told Business Insider:

“If Reddit wins, data scraping for AI training may become a licensed business overnight.”


🗞️ Sources

Reuters, AP News, Axios, The Verge, Business Insider, SiliconAngle, Wired, PBS, and Reddit court filings (SDNY 25-cv-0921).