This collection is a curated list of websites that employ the robots.txt file to restrict access to AI agents, AI crawlers and GPTs.
It will be updated monthly.
The robots.txt file lets website owners control and limit the access of these user agents to certain areas of their website by specifying rules and directives.
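As a rough illustration of how such rules are evaluated, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` content, the `is_allowed` helper and the `example.com` URLs are made up for this example and are not taken from this project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawlers are blocked everywhere,
# every other user agent is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/"))       # False
    print(is_allowed(ROBOTS_TXT, "Mozilla", "https://example.com/page"))  # True
```

Note that robots.txt is advisory: it only has an effect on crawlers that choose to honor it.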
# OpenAI’s web crawler: GPT3.5, GPT4, ChatGPT
# https://platform.openai.com/docs/bots
User-agent: GPTBot
# ChatGPT plugins
# https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
# OpenAI Search bot
# https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
# Google's web crawler: Bard, VertexAI, Gemini
# https://blog.google/technology/ai/an-update-on-web-publisher-controls/
User-agent: Google-Extended
# Apple's web crawler, dedicated to GenAI projects
# https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
# Claude
User-agent: anthropic-ai
# Claude Bot
User-agent: ClaudeBot
# Claude web
User-agent: Claude-Web
# Cohere
User-agent: Cohere-ai
# Perplexity
User-agent: PerplexityBot
# Common Crawl
# https://commoncrawl.org/ccbot
User-agent: CCBot
# Omgilibot: webz.io
# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
User-agent: Omgilibot
User-agent: Omgili
User-agent: Webzio-Extended
# Facebook: Llama
# https://developers.facebook.com/docs/sharing/bot/
User-agent: FacebookBot
# ByteDance: Doubao
User-agent: Bytespider
# Censorship area
Disallow: /
Please note that this blocklist is intended for informational purposes only. Despite the provocative project name, it is fine to disallow web crawling and protect content ownership.
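One way a site's status could be derived from its robots.txt is sketched below. This is an assumption about the approach, not the actual logic of `scrape.py`: the `classify` helper is hypothetical, a site counts as blocked here as soon as any listed AI user agent is denied the root path, and a robots.txt that could not be retrieved (passed as `None`) is reported as unknown.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# User agents taken from the blocklist above.
AI_USER_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "Google-Extended",
    "Applebot-Extended", "anthropic-ai", "ClaudeBot", "Claude-Web",
    "Cohere-ai", "PerplexityBot", "CCBot", "Omgilibot", "Omgili",
    "Webzio-Extended", "FacebookBot", "Bytespider",
]


def classify(robots_txt: Optional[str]) -> str:
    """Map a robots.txt body to a status symbol (assumed rule, see lead-in)."""
    if robots_txt is None:
        # Assumption: the file could not be fetched, so nothing can be said.
        return "❓"
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Collect every AI user agent that is not allowed to fetch the root path.
    denied = [ua for ua in AI_USER_AGENTS if not parser.can_fetch(ua, "/")]
    return "🔐" if denied else "✅"
```

For instance, `classify("User-agent: GPTBot\nDisallow: /\n")` yields `🔐`, while a file that only restricts `/admin/` for all agents yields `✅`. A stricter rule (all agents blocked, or a majority) would be an equally plausible design.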
- Scanned: 66
- ✅ Passing: 38 %
- 🔐 Blocked: 62 %
- ❓ Unknown: 0 %
- Scanned: 9
- ✅ Passing: 56 %
- 🔐 Blocked: 44 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Prime Video | 🌍 | ✅ |
Netflix | 🌍 | ✅ |
Disney+ | 🌍 | 🔐 |
Hulu | 🇺🇸 | 🔐 |
HBO Max | 🇺🇸 | ✅ |
Canal+ | 🇫🇷 | 🔐 |
FranceTV | 🇫🇷 | ✅ |
TF1 | 🇫🇷 | 🔐 |
6Play | 🇫🇷 | ✅ |
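The percentages shown for each category can be reproduced from the status column. The sketch below is an assumption about how they are computed (share of each status, rounded to the nearest integer with Python's built-in `round`); the `summarize` helper is hypothetical and not part of the project.

```python
from collections import Counter
from typing import List


def summarize(statuses: List[str]) -> List[str]:
    """Build the summary lines for a list of status symbols."""
    total = len(statuses)
    counts = Counter(statuses)
    lines = [f"- Scanned: {total}"]
    for symbol, label in (("✅", "Passing"), ("🔐", "Blocked"), ("❓", "Unknown")):
        # Guard against an empty category to avoid dividing by zero.
        pct = round(100 * counts[symbol] / total) if total else 0
        lines.append(f"- {symbol} {label}: {pct} %")
    return lines


if __name__ == "__main__":
    # Status column of the nine-row table above.
    rows = ["✅", "✅", "🔐", "🔐", "✅", "🔐", "✅", "🔐", "✅"]
    print("\n".join(summarize(rows)))
```

With five passing and four blocked entries out of nine, this gives 56 % and 44 %. Because each share is rounded independently, the three percentages do not always add up to exactly 100.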
- Scanned: 6
- ✅ Passing: 67 %
- 🔐 Blocked: 33 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
SoundCloud | 🌍 | 🔐 |
YouTube | 🌍 | ✅ |
Apple Music | 🌍 | ✅ |
Spotify | 🌍 | 🔐 |
Deezer | 🇫🇷 | ✅ |
Last.fm | 🇬🇧 | ✅ |
- Scanned: 8
- ✅ Passing: 75 %
- 🔐 Blocked: 25 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Google Podcasts | 🌍 | ✅ |
Apple Podcasts | 🌍 | ✅ |
Spotify for Podcasters | 🌍 | 🔐 |
Buzzsprout | 🌍 | ✅ |
Podbean | 🌍 | ✅ |
Acast | 🇬🇧 | ✅ |
AudioMeans | 🇫🇷 | ✅ |
Radio France | 🇫🇷 | 🔐 |
- Scanned: 6
- ✅ Passing: 67 %
- 🔐 Blocked: 33 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
PornHub | 🌍 | 🔐 |
YouPorn | 🌍 | 🔐 |
Xnxx | 🌍 | ✅ |
Xvideos | 🌍 | ✅ |
Xhamster | 🌍 | ✅ |
OnlyFans | 🌍 | ✅ |
- Scanned: 5
- ✅ Passing: 100 %
- 🔐 Blocked: 0 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Bible | 🇺🇸 | ✅ |
Bible Gateway | 🇺🇸 | ✅ |
Jehovah's Witnesses | 🇺🇸 | ✅ |
Vatican | 🇻🇦 | ✅ |
Islamweb | 🌍 | ✅ |
- Scanned: 13
- ✅ Passing: 31 %
- 🔐 Blocked: 62 %
- ❓ Unknown: 8 %
Name | Country | Status |
---|---|---|
(name missing) | 🌍 | 🔐 |
(name missing) | 🌍 | 🔐 |
(name missing) | 🌍 | ✅ |
Hacker News | 🌍 | ❓ |
Lobsters | 🌍 | 🔐 |
(name missing) | 🌍 | 🔐 |
TikTok | 🌍 | ✅ |
(name missing) | 🌍 | 🔐 |
(name missing) | 🌍 | ✅ |
Quora | 🌍 | 🔐 |
VK | 🇷🇺 | ✅ |
TripAdvisor | 🌍 | 🔐 |
Yelp | 🌍 | 🔐 |
- Scanned: 42
- ✅ Passing: 76 %
- 🔐 Blocked: 19 %
- ❓ Unknown: 5 %
Name | Country | Status |
---|---|---|
Michael Jackson | 🇺🇸 | ✅ |
Madonna | 🇺🇸 | ✅ |
Taylor Swift | 🇺🇸 | 🔐 |
Rihanna | 🇧🇧 | ✅ |
Bruno Mars | 🇺🇸 | ✅ |
Justin Bieber | 🇨🇦 | 🔐 |
Beyoncé | 🇺🇸 | ✅ |
Katy Perry | 🇺🇸 | 🔐 |
Lady Gaga | 🇺🇸 | 🔐 |
Hardwell | 🇳🇱 | ✅ |
Dimitri Vegas & Like Mike | 🇧🇪 | ✅ |
Kanye West | 🇺🇸 | ❓ |
Black Eyed Peas | 🇺🇸 | ✅ |
Imagine Dragons | 🇺🇸 | ✅ |
Twenty One Pilots | 🇺🇸 | ✅ |
Maroon 5 | 🇺🇸 | 🔐 |
Selena Gomez | 🇺🇸 | 🔐 |
Usher | 🇺🇸 | 🔐 |
Stromae | 🇧🇪 | ✅ |
Aya Nakamura | 🇫🇷 | ❓ |
Soprano | 🇫🇷 | ✅ |
Johnny Hallyday | 🇫🇷 | ✅ |
Grand Corps Malade | 🇫🇷 | ✅ |
Zaho | 🇫🇷 | ✅ |
Jean-Louis Aubert | 🇫🇷 | ✅ |
Camélia Jordana | 🇫🇷 | ✅ |
Indochine | 🇫🇷 | ✅ |
Tryo | 🇫🇷 | ✅ |
David Guetta | 🇫🇷 | ✅ |
MC Solaar | 🇫🇷 | ✅ |
Zaz | 🇫🇷 | ✅ |
Christine and the Queens | 🇫🇷 | ✅ |
Boulevard des Airs | 🇫🇷 | ✅ |
Calogero | 🇫🇷 | ✅ |
Hoshi | 🇫🇷 | ✅ |
Avicii | 🇸🇪 | ✅ |
Adele | 🇬🇧 | ✅ |
Calvin Harris | 🇬🇧 | ✅ |
Ed Sheeran | 🇬🇧 | ✅ |
Arctic Monkeys | 🇬🇧 | ✅ |
Coldplay | 🇬🇧 | ✅ |
The Weeknd | 🇨🇦 | 🔐 |
- Scanned: 3
- ✅ Passing: 100 %
- 🔐 Blocked: 0 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
White House | 🇺🇸 | ✅ |
Elysée | 🇫🇷 | ✅ |
Europe | 🇪🇺 | ✅ |
- Scanned: 28
- ✅ Passing: 82 %
- 🔐 Blocked: 18 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Google Scholar | 🌍 | ✅ |
Sci-Hub | 🌍 | ✅ |
PubPeer | 🌍 | ✅ |
Scopus | 🇳🇱 | 🔐 |
Elsevier | 🇳🇱 | 🔐 |
ScienceDirect | 🇳🇱 | 🔐 |
MDPI | 🇨🇭 | ✅ |
Springer | 🇩🇪 | ✅ |
Wiley | 🇺🇸 | ✅ |
American Chemical Society | 🇺🇸 | ✅ |
PubMed | 🇺🇸 | ✅ |
Academia | 🇺🇸 | ✅ |
Science | 🇺🇸 | 🔐 |
ArXiv | 🇺🇸 | ✅ |
American Physical Society | 🇺🇸 | ✅ |
Mendeley | 🇬🇧 | ✅ |
Nature | 🇬🇧 | 🔐 |
Taylor & Francis | 🇬🇧 | ✅ |
Oxford University Press | 🇬🇧 | ✅ |
Cambridge University Press | 🇬🇧 | ✅ |
Royal Society of Chemistry | 🇬🇧 | ✅ |
ResearchGate | 🇩🇪 | ✅ |
BNF | 🇫🇷 | ✅ |
Cairn | 🇫🇷 | ✅ |
Persée | 🇫🇷 | ✅ |
Gallica | 🇫🇷 | ✅ |
HAL | 🇫🇷 | ✅ |
OpenEdition | 🇫🇷 | ✅ |
- Scanned: 3
- ✅ Passing: 67 %
- 🔐 Blocked: 33 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
GitHub | 🌍 | ✅ |
GitLab | 🌍 | ✅ |
Stack Overflow | 🌍 | 🔐 |
- Scanned: 19
- ✅ Passing: 74 %
- 🔐 Blocked: 26 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Wikipedia | 🌍 | ✅ |
Medium | 🌍 | 🔐 |
Substack | 🌍 | ✅ |
Common Crawl | 🌍 | ✅ |
Internet Archive | 🌍 | ✅ |
Wayback Machine | 🌍 | ✅ |
Notion | 🌍 | ✅ |
Weather | 🇺🇸 | 🔐 |
AccuWeather | 🇺🇸 | ✅ |
Météo France | 🇫🇷 | ✅ |
Getty Images | 🇺🇸 | ✅ |
Shutterstock | 🇺🇸 | 🔐 |
Adobe Stock | 🇺🇸 | 🔐 |
Unsplash | 🇨🇦 | 🔐 |
Pexels | 🇩🇪 | ✅ |
Pixabay | 🇩🇪 | ✅ |
Flickr | 🇺🇸 | ✅ |
500px | 🇨🇦 | ✅ |
Giphy | 🇺🇸 | ✅ |
- Scanned: 1
- ✅ Passing: 100 %
- 🔐 Blocked: 0 %
- ❓ Unknown: 0 %
Name | Country | Status |
---|---|---|
Indeed | 🇺🇸 | ✅ |
A.k.a: do they understand their business model? 💸
Name | Status |
---|---|
Getty Images | ✅ |
Pexels | ✅ |
500px | ✅ |
A.k.a: this is public interest. 🖕
Name | Status |
---|---|
Medium | 🔐 |
Quora | 🔐 |
Elsevier | 🔐 |
Scopus | 🔐 |
Science | 🔐 |
ScienceDirect | 🔐 |
Nature | 🔐 |
Looking for contributions:
- Enrich website database
- Chinese websites
- New categories
Please open issues!
- Ping me on Twitter @samuelberthe (DMs, mentions, whatever :))
- Fork the project
- Fix open issues or request new features
Don't hesitate ;)
python3 -m venv venv
source ./venv/bin/activate
pip3 install -r requirements.txt
python3 scrape.py
# then copy the latest output into the README
Give a ⭐️ if this project helped you!
Copyright © 2024 Samuel Berthe.
This project is MIT licensed.