#webscrapers

2025-10-25

Press Gazette: AI companies steal publisher traffic then undermine trust by getting answers wrong. “The research points to a generally corrosive impact of AI answer engines on the news ecosystem. It found that the likes of Perplexity and Google AI Overviews are stealing publisher traffic but also contributing to declining trust in the news industry by giving distorted answers.”

https://rbfirehose.com/2025/10/25/press-gazette-ai-companies-steal-publisher-traffic-then-undermine-trust-by-getting-answers-wrong/

2025-09-13

#Google says it is fighting #adblockers, which it frames as tools for theft. Under that framing, #Invidious @invidious , #Piped, #webscrapers and #scriptblockers all count as »adblockers«. Judges, especially in the post-democratic USA, are starting to ignore #antitracking arguments and to back Google in its fight against »adblockers«.

We should take these recent developments as another argument to stop feeding #GAFAM / #BigTech with our content, rather than developing countermeasures and arguing about tracking. Instead, let us use #Peertube @peertube , YouTube competitors (#concurrency), #OnlyOffice @ONLYOFFICE , #Cryptpad @CryptPad , #Nextcloud @nextcloud , #HedgeDoc @hedgedoc , #posteo @posteo.de , #Threema @threemaapp , #Signal @signalapp , #LibreOffice @libreoffice, #LibreWolf @librewolf , …, i.e., software that is as #libre as possible.

2025-09-06

Press Gazette: AI bots bombard publisher websites with ‘no meaningful value exchange’. “Unwanted AI scraping of publisher websites is placing a costly financial burden on publishers, according to one leading industry executive. Chris Dicker, chief executive of Candr Media Group and board member of the Independent Publishers Alliance, said that the publisher’s Trusted Reviews website was […]

https://rbfirehose.com/2025/09/06/press-gazette-ai-bots-bombard-publisher-websites-with-no-meaningful-value-exchange/

2025-06-17

404 Media: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums. “AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline, according to a new survey published today.” As you might imagine this drives me absolutely WILD.

https://rbfirehose.com/2025/06/17/404-media-ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/

2025-03-09

WTF? On Tuesday morning I was using 37GB of space. Today I'm using 41GB of space. That's nearly 10% of my server's storage used in five days!

I checked my logs, and apparently a new bunch of Alibaba Cloud IPs have been hammering my servers. One server alone is sending two orders of magnitude more traffic than the second-highest IP. Between the lot of them? FOUR ORDERS OF MAGNITUDE!

They're now getting redirected to a tarpit.

#SysAdminProblems #FuckTheBots #WebScrapers

2025-01-11

TechCrunch: How OpenAI’s bot crushed this seven-person company’s website ‘like a DDoS attack’. “On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous […]

https://rbfirehose.com/2025/01/11/techcrunch-how-openais-bot-crushed-this-seven-person-companys-website-like-a-ddos-attack/

2024-12-26

Search Engine Journal: Google Warns: Beware Of Fake Googlebot Traffic. “Google’s Developer Advocate, Martin Splitt, warns website owners to be cautious of traffic that appears to come from Googlebot. Many requests pretending to be Googlebot are actually from third-party scrapers.”

https://rbfirehose.com/2024/12/26/google-warns-beware-of-fake-googlebot-traffic-search-engine-journal/

2024-06-10

😤 #Scraperbots are automating data theft, extracting your website's content without permission! 🌐

💣 Learn about the impact of scraper bots and how to prevent them: bit.ly/3RiXgya

#contentscraping #bots #webscrapers #webcrawlers #scraping #waf #botmanagement #waap #scrapingbots #apptrana #indusface

Jeremia Kimelman jeremiak@journa.host
2022-12-18

New technical blog post from me where I share a pattern I've been using a LOT recently: recursive promises!

Promises (and their friends `async/await`) are *freaking* great in JavaScript. And they make it easy to add retry behavior to make your web scraper more stable.

If you build web scrapers using JS I'd love to hear what kinds of techniques you use to avoid hung or broken scraping scripts!

jeremiak.com/blog/more-stable-

#javascript #webscrapers
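
For readers who have not seen the pattern, here is a minimal sketch of what a "recursive promise" retry could look like. It is not the code from the linked blog post; the helper name `retry` and its parameters `retriesLeft` and `delayMs` are assumptions made up for this illustration.

```javascript
// Hypothetical helper (not from the linked post): runs an async task and,
// if it rejects, waits and then calls itself again with one fewer retry.
function retry(task, retriesLeft = 3, delayMs = 1000) {
  // Promise.resolve().then(task) also turns a synchronous throw into a rejection.
  return Promise.resolve()
    .then(task)
    .catch((err) => {
      // Out of retries: give up and surface the last error to the caller.
      if (retriesLeft <= 0) throw err;
      // Wait delayMs, then recurse. Returning the inner promise chains it,
      // so the outer promise settles with the result of the final attempt.
      return new Promise((resolve) => setTimeout(resolve, delayMs)).then(() =>
        retry(task, retriesLeft - 1, delayMs)
      );
    });
}
```

In a scraper, the task would typically be a function that wraps one page request, something like `retry(() => fetchPage(url), 3, 2000)`, where `fetchPage` is a placeholder for whatever request function the script already uses. A fixed delay keeps the sketch short; a growing delay between attempts is a common variation.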
