CCBot
Common Crawl's web crawler, which collects pages to build an open repository of web crawl data.
What is CCBot?
CCBot (Common Crawl Bot) is the web crawler operated by Common Crawl, a nonprofit that maintains an open repository of web crawl data. Many AI companies and researchers train models on Common Crawl's dataset, including several prominent large language models. Although CCBot itself is not owned by an AI company, allowing it means your content may end up in training datasets used by multiple organizations. The Common Crawl corpus is one of the largest publicly available web archives.
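Because CCBot respects robots.txt, you can opt your site out of Common Crawl's collection entirely or exclude specific sections. A minimal example, placed at your site root as /robots.txt (CCBot is the user-agent token Common Crawl documents):

    # Block Common Crawl's crawler from the whole site
    User-agent: CCBot
    Disallow: /

To exclude only part of the site, narrow the Disallow path (for example, Disallow: /drafts/). Note that blocking CCBot only affects future crawls; it does not remove pages already present in published Common Crawl archives.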
How Qwairy Makes This Actionable
Qwairy tracks CCBot visits to your website. Monitor when Common Crawl indexes your content and understand how your pages contribute to open AI training datasets.
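If you want to verify CCBot traffic yourself, the crawler identifies itself in its User-Agent header (strings such as CCBot/2.0). Below is a minimal sketch in Python that counts CCBot hits per page; the file name access.log and the combined log format are assumptions about your server setup:

    # Minimal sketch: count CCBot hits per path in a combined-format access log.
    # Assumes a log file named "access.log"; adjust for your server.
    from collections import Counter
    import re

    # Combined format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
    LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*".*"(?P<ua>[^"]*)"$')

    hits = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE_RE.search(line)
            # CCBot identifies itself with a user-agent containing "CCBot"
            if m and "CCBot" in m.group("ua"):
                hits[m.group("path")] += 1

    for path, count in hits.most_common(10):
        print(f"{count:6d}  {path}")

Running this against a day of logs shows which of your pages Common Crawl is fetching most often, which you can compare against the visit data Qwairy reports.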
Related Terms
AI Crawler
Crawler used by AI companies to collect data that trains or feeds their models.
robots.txt
Text file placed at a website's root that tells crawlers which pages they may or may not visit.
GPTBot
OpenAI's web crawler, which collects data used to train GPT models.