#WebCrawlers

PPC Land @ppcland
2026-02-28

FYI: Anthropic clarifies what its three web crawlers do - and how to block them: Anthropic today updated its crawler documentation, detailing ClaudeBot, Claude-User, and Claude-SearchBot - what each collects and what blocking them means for site visibility. ppc.land/anthropic-clarifies-w
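
For site owners wondering what blocking might look like in practice, here is a minimal sketch. It assumes the three crawlers honor robots.txt rules keyed on the user-agent tokens named in the post. The robots.txt content, the `blocked_agents` helper, and the example.com URL are illustrative and not taken from Anthropic's documentation. It uses Python's standard `urllib.robotparser` to check which agents a given robots.txt would turn away.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow the three crawler tokens named in the post,
# allow everyone else. Adjust to your own policy.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: *
Allow: /
"""

ANTHROPIC_AGENTS = ["ClaudeBot", "Claude-User", "Claude-SearchBot"]


def blocked_agents(robots_txt, agents, url):
    """Return the agents that the given robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(ROBOTS_TXT, ANTHROPIC_AGENTS, "https://example.com/article"))
```

Listing only one or two of the tokens would let you block one purpose while leaving the others alone. As the post notes, each choice has consequences for site visibility.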

AI Daily Post @aidailypost
2026-02-04

🚀 Akamai’s latest data shows a sharp rise in AI training bots and content‑fetching crawlers since July. These bots are reshaping web traffic patterns, stressing infrastructure and raising privacy questions. How will developers and open‑source projects adapt? Dive into the numbers and what they mean for the future of machine‑learning pipelines.

🔗 aidailypost.com/news/akamai-da

2026-01-30

NiemanLab: News publishers limit Internet Archive access due to AI scraping concerns. “When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance […]

https://rbfirehose.com/2026/01/30/niemanlab-news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/

2025-12-01

Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

2025-10-14

There are a couple mentioned in this article, which also links to a maintained list. I can't say whether any will fit your issues, as I haven't used them, but I hope it helps.

https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

https://tldr.nettime.org/@asrg/113867412641585520


#bot-scraping #web-crawlers #tools

Mind Lude @mindlude
2025-09-12

People CEO just called Google a 'bad actor' for allegedly 'stealing content' via its AI crawler. Apparently, you can't block the AI without also blocking the good old web crawler. Convenient, much? What's your take on content ownership when the machines are doing the 'reading'?
techcrunch.com/2025/09/12/goog

2025-07-18

#Wikimedia also complains about the constant surge in #LLM #crawling. It is a strain on free projects such as #Wikipedia. For my MSc thesis, I fetched the Wikipedia #archives, which they regularly snapshot and hand out for free.

Is this resource simply ignored because it would require additional processing effort on top of the #webcrawlers that are running anyway? @wikimediafoundation, can you briefly comment on whether you have numbers on LLM companies that turn to archives instead of crawling?

N-gated Hacker News @ngate
2025-07-11

📰🤡 Ah, the digital age's finest: creating fake JPEGs to fool unsuspecting web crawlers! 🎭🐛 This riveting tale involves a Spigot—no, not the leaky kind—that pumps out random JPEGs for crawlers to gulp down, while the author marvels at their own concoction like a child watching ants on a sidewalk. 🙄🔍
ty-penguin.org.uk/~auj/blog/20

2025-06-29

Is there a website like example.com/ that has multiple URLs and is OK to spider/crawl?

I want to upgrade the examples in spidr's README to websites that wouldn't mind being randomly spidered.
github.com/postmodern/spidr#re

example.com isn't a viable option, because it only has one webpage, so not a very good demonstration of web spidering...

#web #webspiders #webcrawlers #exampledotcom #exampledotnet #demo

2025-06-18

Keeping the Web Up Under the Weight of AI Crawlers
Mitigation tactics: caching, static content, and rate limiting.

#ai #WebDev #WebCrawlers #WebServers
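
As a rough illustration of the rate-limiting tactic mentioned above, here is a minimal token-bucket sketch. It assumes requests are grouped by some client key such as a user-agent string or IP address. The class names, parameters, and the injectable `clock` are hypothetical choices for this example and do not come from the linked post.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        # Refill according to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client key (for example a user agent or an IP)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_key):
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_key] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = RateLimiter(rate=1.0, capacity=2)
    for i in range(3):
        print(i, limiter.allow("ExampleBot"))
```

A server would typically answer a rejected request with HTTP 429 (Too Many Requests). Caching and static content reduce the cost of each allowed request, so the tactics complement each other.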

Miguel Afonso Caetano @remixtures@tldr.nettime.org
2025-06-16

"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."

eff.org/deeplinks/2025/06/keep

#AI #GenerativeAI #WebCrawlers #BigTech #WebScraping #OpenWeb

N-gated Hacker News @ngate
2025-06-02

🚨 Breaking News: Someone discovered inode zero in and thought it was worth writing about! 🙄 The author then spirals into a paranoid rant about HTTP UserAgents, because, clearly, the world is ending and it's all the web crawlers' fault. 🌐🔍 The real mystery: how they managed to drag this topic out into a whole article. 😂
utcc.utoronto.ca/~cks/space/bl

Barry Tennison @Bari10@fosstodon.org
2025-04-25

@johnlsheridan @puntofisso Talk about “interesting times”. You may wish to read the Part 1 as well as this Part 2 post. #DDOS #webcrawlers
void.rehab/notes/a6tp1qqquesrj
