August 8, 2024

Major News Sites Block SearchGPT

In July 2024, OpenAI was poised to become a key figure in the world of search. Just a couple of weeks later, their new search engine, SearchGPT, hit a significant roadblock: the search engine’s crawler, OAI-SearchBot, was blocked by several major news outlets.

As a result, the world of search is at an impasse. This blockade has led to a renewed debate about data collection, privacy and intellectual property.

Here we investigate why these websites have blocked SearchGPT and what these actions could mean for the future of AI-driven search.

What is OpenAI’s SearchGPT?

Launched on 26 July 2024, SearchGPT is a new search engine from OpenAI. Although the company is best known for its AI-powered chatbot, ChatGPT, SearchGPT marks its first real foray into the world of search.

This “temporary prototype” was initially released to an audience of 10,000 users and publishers. This limited launch was designed to allow OpenAI to refine its functionalities and improve its search experience.

Like other search engines, SearchGPT operates by sourcing information from publicly-accessible web pages. This is powered by OAI-SearchBot, the company’s new web crawler. It’s this bot that is responsible for indexing the content that informs SearchGPT’s answers.

Alongside OAI-SearchBot, SearchGPT uses natural language processing (NLP) to understand the web and interpret search queries. According to OpenAI, this enables their search engine to go beyond traditional keyword matching, compiling answers that are both relevant and contextually rich.

In terms of layout, SearchGPT features a similar format to GPT-4o. It’s accessed via a search box that poses the question: “What are you searching for.” Users simply enter their query and SearchGPT populates the page with information as well as links to relevant sources.

Why Have Sites Blocked SearchGPT?

Despite the hype around this release, SearchGPT has received a somewhat lacklustre reception. In fact, several of the world’s biggest news organisations have made a bold move: they have blocked the bot from accessing, crawling and indexing their content.

But why is this? When these sites are already crawled by tech giants like Google and Microsoft, why is this search engine seemingly being singled out? This backlash appears to stem from a lack of trust.

According to several reports, OpenAI has previously collected data without consent. This data was scooped up by GPTBot, a web crawler that is used to train and refine the company’s future AI models.

The issue here is that GPTBot did not differentiate between content for consumption, such as articles and blogs, and personal information. It’s the collection of this personal data that has sparked concerns and led regulators to scrutinise the company’s AI models. And according to recent headlines, it appears these fears are justified.

Despite being hit by several lawsuits, including a 160-page complaint in 2023, OpenAI still appears to be falling short of data standards. The issue resurfaced in May 2024, when the European Data Protection Board (EDPB) found that the firm was continuing to fall short of EU regulation.

As a result, there is growing unease about data privacy and the potential misuse of information. However, OpenAI has been quick to address these concerns. When launching SearchGPT, the company explicitly stated that OAI-SearchBot is used solely for search:

“OAI-SearchBot is used to link to and surface websites in search results in the SearchGPT prototype. It is not used to crawl content to train OpenAI’s generative AI foundation models. To help ensure your site appears in search results, we recommend allowing OAI-Searchbot in your site’s robots.txt file and allowing requests from our published IP ranges.”
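For site owners who want to see how a robots.txt file treats each crawler, the check can be sketched with Python's standard library. This is a minimal illustration, not OpenAI's own tooling: the robots.txt content and the example.com URL below are hypothetical, and the sketch only covers robots.txt rules, not the published IP ranges mentioned in the statement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the training crawler (GPTBot)
# while allowing the search crawler (OAI-SearchBot).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/news/") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "GPTBot"):
        status = "allowed" if is_allowed(ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {status}")
```

With this sample file, OAI-SearchBot is reported as allowed and GPTBot as blocked, which mirrors the distinction OpenAI draws between its search crawler and its training crawler.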

Privacy aside, several publications have also expressed concern about intellectual property rights. This is an issue that goes beyond OpenAI, with notable names like The New York Times also suing Microsoft.

Charlie Stadtlander, spokesperson for The New York Times, explained the company’s stance:

“The Times does not authorize the use of our works for generative search or AI training purposes without an express written agreement, regardless of whether we do or do not block or restrict any particular bot from crawling our content.”

The Times’s legal complaint against OpenAI and Microsoft makes a similar argument:

“By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue.”

Which News Outlets Have Blocked OAI-SearchBot?

Despite OpenAI’s assurances, news outlets remain concerned about data collection and usage. As a result, several sites have blocked OAI-SearchBot. This includes major publications like The New York Times, Daily Mail, Vogue, GQ and CNBC.

But it’s not just OAI-SearchBot that has come under fire. The blockade is far wider when we look at GPTBot, another of OpenAI’s crawlers. This controversial bot has been blocked by dozens of news websites including The Guardian, BBC and Washington Post.

At the time of writing, 14 (or 1.4%) of the top 1,000 websites have opted to block OAI-SearchBot. In contrast, more than one third of these sites (358 out of 1,000, or 35.8%) have blocked GPTBot.

These figures were sourced from Originality on 8th August 2024.
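Shares like these can be tallied by checking each site's robots.txt and dividing the number of blocking sites by the total (14 / 1,000 = 1.4% and 358 / 1,000 = 35.8%). The sketch below shows one way to do that; it is an assumption about the general approach rather than a description of Originality's methodology, and the four sites and their robots.txt contents are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_bot(robots_txt: str, user_agent: str) -> bool:
    """True if this robots.txt disallows the site root for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "https://example.com/")


def block_share(robots_by_site: dict[str, str], user_agent: str) -> tuple[int, float]:
    """Return (number of blocking sites, percentage of all sites)."""
    blocked = sum(blocks_bot(txt, user_agent) for txt in robots_by_site.values())
    return blocked, 100 * blocked / len(robots_by_site)


# Hypothetical sample: site name -> robots.txt content.
SAMPLE = {
    "news-a.example": "User-agent: GPTBot\nDisallow: /\n\n"
                      "User-agent: OAI-SearchBot\nDisallow: /\n",
    "news-b.example": "User-agent: GPTBot\nDisallow: /\n",
    "shop.example": "User-agent: *\nDisallow: /checkout/\n",
    "blog.example": "",
}

if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "GPTBot"):
        count, percent = block_share(SAMPLE, bot)
        print(f"{bot}: {count} of {len(SAMPLE)} sites ({percent:.1f}%)")
```

A site only counts as blocking here if it disallows the root path for that crawler, so a rule that restricts a single directory is not treated as a block.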

This blockade has been spotted by several experts within the SEO community. Among them is Glenn Gabe, who shared his findings in a post on X.

While news websites are taking a stand, the issue of blocking AI crawlers is not confined to the news industry alone. Other sectors have also closed their proverbial door on OpenAI, with the likes of Instagram, Amazon and Healthline continuing to block GPTBot.

However, we noticed something interesting when researching this piece: Reddit has not blocked any of OpenAI’s crawlers. This is surprising given that Steve Huffman, Reddit’s CEO, recently confirmed the company’s stance on search engines.

In July 2024, Huffman stated that Reddit would continue blocking major search engines from crawling its content. The exception to the rule is Google: the tech giant pays Reddit $60 million per year for the privilege.

You can read our full story on Reddit’s decision in this blog post.

Perhaps the keyword here is ‘major.’ While ChatGPT remains popular, with the bot currently boasting 200 million monthly active users, OpenAI is still a relatively small player in search. And, without being able to crawl every corner of the web, SearchGPT could struggle to compete.

The Broader Implications of Blocking Crawlers

From the actions taken by major news organisations to Reddit’s extensive blockade, these decisions reflect a growing resistance to unregulated AI-driven data collection. The crux of the issue is this: while AI promises to enhance search results, it also raises significant concerns surrounding data privacy and intellectual property.

However, it’s important to note that these issues go beyond OpenAI. Major websites, including The New York Times, have also raised complaints against Microsoft. Meanwhile, Google’s reversal on third-party cookie withdrawal has sparked widespread debate.

It’s clear that this standoff highlights the need for a balanced approach: one that allows for technological innovation while also protecting users and ensuring the integrity of intellectual property. With AI still in its infancy, these blockades could lead to significant changes in how this technology is deployed and regulated.

Do you have any questions about AI-powered search? Want to make sure your SEO strategies are hitting the mark? Or find out if your web pages are being indexed? Our team can help. Get in touch with our experts today to find out more.

Marcus Hearn

Marcus has spent his career growing the organic search visibility of both large organisations and SMEs. He specialises in technical SEO but he’s obsessed with curating strategies that leverage expertise and unlock potential.
