Major websites, including Amazon, Quora, The New York Times, CNN, ABC, and Reuters, are blocking AI crawlers from accessing their content. According to Originality.AI, an AI detection tool, almost 20% of the top 1,000 websites in the world now block crawler bots from collecting web data for AI use. Large language models (LLMs) such as those behind OpenAI’s ChatGPT and Google’s Bard require massive amounts of data for training. OpenAI recently released its own web crawler, GPTBot, to scan webpages and improve its AI services, and has also published instructions for how websites can block it.
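For reference, the blocking mechanism OpenAI described relies on the long-standing robots.txt protocol: a site operator adds a rule disallowing the GPTBot user agent (for example, "User-agent: GPTBot" followed by "Disallow: /"). The sketch below shows how such a rule can be checked programmatically with Python's standard urllib.robotparser; the domain and page path are placeholders, not real sites.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; OpenAI's published robots.txt rule for blocking GPTBot is:
#   User-agent: GPTBot
#   Disallow: /
robots_url = "https://example.com/robots.txt"

parser = RobotFileParser(robots_url)
parser.read()  # fetch and parse the site's robots.txt

# Check whether each crawler's user agent may fetch a given page.
for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent} allowed: {allowed}")
```

Note that robots.txt is advisory: it only keeps out crawlers that choose to respect it, which is why operators who want stronger guarantees also turn to rate limiting, user-agent blocking at the server level, or legal pressure.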

The blocked crawlers include GPTBot and CCBot, the crawler of Common Crawl, an open repository of web data. These crawlers scan web pages and scrape data that helps train AI products. Website operators, however, are increasingly concerned about the impact of this scraping on their content and want to protect their intellectual property.

What is a web crawler?

A web crawler, also known as a web spider or web bot, is a software program that systematically navigates the internet, visiting web pages and collecting (or scraping) data from them. Web crawlers are primarily used for indexing web content for search engines and gathering data for AI training.
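For concreteness, the core loop of a crawler can be sketched in a few lines of Python: starting from a seed URL, it downloads each page, stores the content, extracts the links it finds, and queues them for later visits. This is a minimal, standard-library-only illustration, not how production crawlers such as GPTBot or Googlebot are actually built; the seed URL and page limit are placeholders.

```python
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, staying on the same host."""
    seen, queue, pages = set(), deque([seed_url]), {}
    host = urlparse(seed_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html  # scraped content, e.g. for indexing or training data
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return pages

if __name__ == "__main__":
    results = crawl("https://example.com")
    print(f"Fetched {len(results)} pages")
```

A well-behaved crawler would also consult the site's robots.txt (as in the earlier snippet) and throttle its requests before fetching anything.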

Why does it matter?

Most of the text and images available on the internet are under copyright, and crawlers do not request permission or pay for a license to extract this data. As generative AI tools such as ChatGPT take centre stage, awareness is growing about who owns the data these crawlers collect to train LLM-based AI models.

Website operators are now taking the protection of their content and intellectual property into their own hands.
OpenAI and others are facing a backlash from mainstream authors such as Stephen King, as well as multiple lawsuits from well-known outlets such as The New York Times. Last month, Agence France-Presse, Getty Images, and other reputable media organisations called for AI regulation, including transparency about the datasets used to train models and consent for the use of copyrighted material. Denying AI crawlers access to major websites could have significant implications for the future development of AI bots: if these crawlers are blocked on more sites, the amount and quality of data available to train AI models, and therefore their progress, could be limited.
