Web crawlers deployed by Perplexity to scrape websites are allegedly skirting restrictions, according to a new report from Cloudflare. Specifically, the report claims that the company’s bots appear to be “stealth crawling” sites by disguising their identity to get around robots.txt files and firewalls.
Robots.txt is a simple file that websites host to tell web crawlers whether they may scrape a site’s content. Perplexity’s official web crawling bots are “PerplexityBot” and “Perplexity-User.” In Cloudflare’s tests, Perplexity was still able to display the content of a new, unindexed website, even when those specific bots were blocked by robots.txt. The behavior extended to websites with specific Web Application Firewall (WAF) rules that restricted web crawlers, as well.
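To illustrate why a robots.txt block depends on a crawler identifying itself honestly, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt contents, the `is_allowed` helper, and the example URL are assumptions made up for this illustration. They are not taken from Cloudflare’s tests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Perplexity's two declared crawlers
# and leaves every other user agent unrestricted.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    chrome_like = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    for agent in ("PerplexityBot", "Perplexity-User", chrome_like):
        print(agent[:20], is_allowed(agent, page))
```

The rules are matched only against the name a client reports, so a request that presents a generic browser user agent is not covered by the two `Disallow` entries. Robots.txt is also advisory: nothing in the file itself prevents a client from fetching a page anyway.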
Cloudflare believes that Perplexity is getting around these obstacles by using “a generic browser intended to impersonate Google Chrome on macOS” when robots.txt prohibits its normal bots. In Cloudflare’s tests, the company’s undeclared crawler could also rotate through IP addresses not listed in Perplexity’s official IP range to get past firewalls. Cloudflare says that Perplexity appears to be doing the same thing with autonomous system numbers (ASNs), which identify blocks of IP addresses operated by the same organization, writing that it spotted the crawler switching ASNs “across tens of thousands of domains and millions of requests per day.”
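A rough sketch of why rules keyed on a declared user agent and a published IP range can miss this kind of traffic is shown below. The `classify_request` helper, the labels it returns, and the IP range are all hypothetical: the range is a documentation-only network, not Perplexity’s real one, and this is not Cloudflare’s detection method.

```python
import ipaddress

# Placeholder values for illustration only. 192.0.2.0/24 is a
# documentation network (RFC 5737), not Perplexity's published range.
DECLARED_AGENTS = ("perplexitybot", "perplexity-user")
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify_request(user_agent: str, source_ip: str) -> str:
    """Label a request by its self-reported user agent and its source IP."""
    ua = user_agent.lower()
    declared_ua = any(name in ua for name in DECLARED_AGENTS)
    addr = ipaddress.ip_address(source_ip)
    in_range = any(addr in net for net in PUBLISHED_RANGES)

    if declared_ua and in_range:
        return "declared crawler"
    if declared_ua:
        return "declared name, unlisted IP"
    if in_range:
        return "listed IP, undeclared name"
    # A generic browser user agent from an unlisted IP lands here,
    # indistinguishable from an ordinary visitor by these two checks.
    return "no match"
```

Under these assumptions, a request that reports a Chrome-style user agent from an address outside the listed range returns `"no match"`, so a firewall rule built only on those two signals would let it through. That gap is why identifying such a crawler requires signals beyond the name and address it presents.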
Engadget has reached out to Perplexity for comment on Cloudflare’s report. We’ll update this article if we hear back.
Up-to-date information from websites is vital to companies training AI models, especially as services like Perplexity are used as replacements for search engines. Perplexity has also been caught circumventing the rules in the past to stay up to date. Multiple websites reported in 2024 that Perplexity was still accessing their content despite them forbidding it in robots.txt, something the company blamed on the third-party web crawlers it was using at the time. Perplexity later partnered with multiple publishers to share revenue earned from ads displayed alongside their content, seemingly as a make-good for its past behavior.
Stopping companies from scraping content from the web will likely remain a game of whack-a-mole. In the meantime, Cloudflare has removed Perplexity’s bots from its list of verified bots and implemented a way to identify and block Perplexity’s stealth crawler from accessing its customers’ content.


