Blocking ChatGPT with QUIC.cloud

Blocking ChatGPT with QUIC.cloud is as easy as accessing our CDN Security settings. But why would you want to do this?

Large Language Models

The internet is abuzz with talk about “Artificial Intelligence.” Generally speaking, what people are really talking about is large language models, or LLMs, which are designed to process and generate human-like language. The most popular of these right now is OpenAI’s ChatGPT.

In addition to the text-based LLMs, we have the text-to-image models, which are designed to generate digital images from natural-language prompts. DALL-E, also developed by OpenAI, is a good example of one of these.

Large language models and text-to-image models are trained on massive sets of data, much of which is procured by scraping public websites.

This can be a divisive issue. For example:

  • Pro: It’s for the “greater good.” When AI solutions are trained using your website content, you are contributing to their understanding of natural language and context. This can lead to more relevant and accurate interactions for everyone who uses these tools in the future.
  • Con: Copyright infringement. Many writers, artists, photographers, and other content creators feel that allowing AI to train on their content infringes upon their copyright and intellectual property rights. They argue that they should not be required to provide their work for free for use in this manner.

There are more pros and cons, but we’re not here to say who’s right and who’s wrong.

We’re just here to let you know that if, as a content creator, you do not want your articles or your artwork to be used to teach “artificial intelligence” models, QUIC.cloud can help keep the bots away from your site.

Content Scrapers

LLMs and related models use content scrapers to collect text and images from your site. This content is used to populate their massive datasets, which are in turn used for training the models.

The reputations of these content scrapers vary. Some, like the scrapers behind ChatGPT and Google Bard, claim to respect robots.txt files and publish their user agents to make opting out easier.

Others, like img2dataset, can be set to actively ignore robots.txt. By default, img2dataset also spoofs a Mozilla user agent, so site owners cannot easily block it with readily accessible tools.
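
To make the robots.txt opt-out concrete, here is a minimal Python sketch of the check a well-behaved scraper would run before fetching a page. It uses the standard library's urllib.robotparser. The robots.txt contents, the example.com URL, and the allowed() helper are all assumptions made up for illustration, not anything published by a scraper vendor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of two crawlers and allows the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a made-up agent that falls through to the "*" rule.
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
        print(agent, allowed(agent, "https://example.com/article"))
```

Run as a script, this reports GPTBot and CCBot as disallowed and the made-up third agent as allowed. The catch is that the check runs on the scraper's side: a tool that skips it, or that presents a different user agent, is never affected by robots.txt at all.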

The most effective way to keep LLMs and text-to-image models from scraping your future content is to stop them before they get anywhere near your server: block them at the CDN level. For this, we need to know the scraper’s user agent.

Content Scraper User Agents

These are the AI content scraper user agents we know of as of this writing (frankly, we don’t know a lot, and more transparency in this area would be a welcome development):

  • CCBot: This is Common Crawl’s crawler. Its data is used to train many models, including ChatGPT and Google Bard. It is also used by LAION (Large-scale Artificial Intelligence Open Network), which publishes massive datasets used to train several models. LAION collects image URLs, which are then crawled by the aforementioned img2dataset. You may not be able to block img2dataset directly, but you can at least stop it from finding your content through LAION.
  • GPTBot: This is the dedicated scraper bot used to populate ChatGPT’s training dataset.
  • ChatGPT-User: If you prompt ChatGPT and ask it to refer to a website, this is the user agent it uses to visit that website and collect the data.
  • img2dataset: We’ve covered the tactics this scraper tool will resort to, but there is an option that more honest users of the tool may enable, if they wish. Under this option, users may add an “img2dataset” token to the user agent header. This allows the bot to be properly identified and blocked.

We’ll update this list as we learn more user agents. Feel free to leave a comment if you know of others we should include.

Blocking ChatGPT with QUIC.cloud

If you have decided to block ChatGPT and its ilk from your content, here’s how: Start by visiting your QUIC.cloud Dashboard. Choose the domain you wish to protect. Then, navigate to CDN > CDN Config > Security > Access Control > User Agent. The Blocklist field is where all of the action is.

Enter one user agent per line. You may use regex if you wish. An entry is considered a match if the text is found anywhere in the User-Agent header. QUIC.cloud will automatically reject any request from user agents listed in the Blocklist.
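
The "found anywhere in the header" rule can be sketched in a few lines of Python. This is only an illustration of the matching behavior described above, not QUIC.cloud's actual implementation: the is_blocked() helper is hypothetical, the sample User-Agent strings are illustrative, and details such as case sensitivity and the exact regex dialect are assumptions.

```python
import re

# The user agents listed earlier in this article, one entry per item.
BLOCKLIST = ["CCBot", "GPTBot", "ChatGPT-User", "img2dataset"]


def is_blocked(user_agent: str, blocklist=BLOCKLIST) -> bool:
    """Return True if any entry matches anywhere in the User-Agent header.

    re.search scans the whole string, so a plain entry behaves as a
    substring test and a regex entry behaves as an unanchored pattern.
    Case-sensitive matching is an assumption of this sketch.
    """
    return any(re.search(entry, user_agent) for entry in blocklist)


if __name__ == "__main__":
    samples = [
        # Illustrative header of the kind a crawler might send.
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "GPTBot/1.0; +https://openai.com/gptbot)",
        # An ordinary browser header, which should not match.
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

Because the match is unanchored, a short entry such as GPTBot is enough to catch a long header that merely contains it. The flip side is that an overly short or generic entry could also catch legitimate visitors, so it pays to keep entries specific.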

Example

Let’s assume you want to block all of the bots that we mentioned above. Add them one per line to the Blocklist field, like so:

CCBot
GPTBot
ChatGPT-User
img2dataset

Press the Save Access Control Settings button.

You can update this list at any time, as more scraper user agent names become known.
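
Once the settings are saved, one way to confirm they took effect is to request a page while presenting a blocked user agent and compare the result with a normal browser-style request. The Python sketch below uses only the standard library. The build_request() and status_for() helpers are hypothetical names, the URL in the usage note is a placeholder, and the exact status code returned for a rejected request is not stated in this article, so the sketch simply reports whatever code comes back.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code received for url with this user agent."""
    try:
        with urllib.request.urlopen(
            build_request(url, user_agent), timeout=timeout
        ) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want here.
        return err.code
```

For example, calling status_for("https://example.com/", "GPTBot/1.0") and then status_for("https://example.com/", "Mozilla/5.0") against your own domain should return different codes if the Blocklist is working: an error status for the first and a normal one for the second.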

Don’t Wait

When you use QUIC.cloud to repel site scrapers, you are protecting your content from every scraper on your Blocklist.

IMPORTANT! Blocking a site scraper is not retroactive. If bots have already scraped your content, there’s not much you can do about it. But adding the bot user agents to the QUIC.cloud User Agent Blocklist NOW will prevent them from accessing any additional content from this point forward.

The User Agent Blocklist is one of the many security features available to QUIC.cloud Standard Plan users. Want to know more? Visit our knowledge base!
