    Most Major News Publishers Block AI Training & Retrieval Bots

    By Awais, January 8, 2026
    Most top news publishers block AI training bots via robots.txt, but they’re also blocking the retrieval bots that determine whether sites appear in AI-generated answers.

    BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found 79% block at least one training bot. More notably, 71% also block at least one retrieval or live search bot.

    Training bots gather content to build AI models, while retrieval bots fetch content in real time when users ask questions. Sites blocking retrieval bots may not appear when AI tools try to cite sources, even if the underlying model was trained on their content.

    What The Data Shows

    BuzzStream examined the top 50 news sites in each market based on SimilarWeb traffic share, then deduplicated the list. The study grouped bots into three categories: training, retrieval/live search, and indexing.
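As a rough illustration of this kind of audit, the sketch below groups the bots named in this article into the three categories and checks a robots.txt file against each one using Python's standard `urllib.robotparser`. The `blocked_bots` helper, the `SAMPLE` file, and the choice to probe only the site root are assumptions made for the example. BuzzStream's actual method is not described in enough detail to reproduce here.

```python
from urllib.robotparser import RobotFileParser

# Bot names and categories as they appear in this article. The token
# spellings come from the text and may differ from vendor documentation.
BOT_CATEGORIES = {
    "training": ["CCBot", "anthropic-ai", "ClaudeBot", "GPTBot", "Google-Extended"],
    "retrieval": ["Claude-Web", "OAI-SearchBot", "ChatGPT-User", "Perplexity-User"],
    "indexing": ["PerplexityBot"],
}


def blocked_bots(robots_txt, probe_url="https://example.com/"):
    """Return, per category, the bots that may not fetch probe_url.

    Assumption: a bot counts as "blocked" if it cannot fetch the site root.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: [bot for bot in bots if not parser.can_fetch(bot, probe_url)]
        for category, bots in BOT_CATEGORIES.items()
    }


# Hypothetical robots.txt, invented for illustration only.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(SAMPLE))
```

Run against the sample file, the helper reports two training bots and one retrieval bot as blocked, with no indexing bots blocked. A real audit would fetch each publisher's live robots.txt and might also need to handle partial path blocks.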

    Training Bot Blocks

    Among training bots, Common Crawl’s CCBot was the most frequently blocked at 75%, followed by Anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%.

    Google-Extended, which trains Gemini, was the least blocked training bot at 46% overall. US publishers blocked it at 58%, nearly double the 29% rate among UK publishers.

    Harry Clarkson-Bennett, SEO Director at The Telegraph, told BuzzStream:

    “Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.”

    Retrieval Bot Blocks

    The study found 71% of sites block at least one retrieval or live search bot.

    Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot, which powers ChatGPT’s live search, was blocked by 49%. ChatGPT-User was blocked by 40%.

    Perplexity-User, which handles user-initiated retrieval requests, was the least blocked at 17%.

    Indexing Blocks

    PerplexityBot, which Perplexity uses to index pages for its search corpus, was blocked by 67% of sites.

    Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none.
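Shares like "blocked all" and "blocked none" can be derived from per-site results with a small aggregation. The sketch below shows one way to do it. The site names and blocked sets are made-up placeholders, not figures from the study, and `blocking_summary` is a hypothetical helper.

```python
def blocking_summary(site_blocks, tracked):
    """Return the percentage of sites blocking all, none, or some tracked bots.

    site_blocks maps a site name to the set of bot names it blocks.
    tracked is the set of bot names included in the audit.
    """
    total = len(site_blocks)
    if total == 0:
        raise ValueError("no sites to summarize")
    blocks_all = sum(1 for blocked in site_blocks.values() if tracked <= blocked)
    blocks_none = sum(1 for blocked in site_blocks.values() if not (blocked & tracked))
    blocks_some = total - blocks_all - blocks_none
    return {
        "all": 100 * blocks_all / total,
        "none": 100 * blocks_none / total,
        "some": 100 * blocks_some / total,
    }


# Placeholder data for illustration only; not taken from the study.
TRACKED = {"GPTBot", "CCBot", "OAI-SearchBot"}
EXAMPLE_SITES = {
    "site-a.example": {"GPTBot", "CCBot", "OAI-SearchBot"},
    "site-b.example": set(),
    "site-c.example": {"GPTBot"},
    "site-d.example": {"GPTBot", "CCBot"},
}

print(blocking_summary(EXAMPLE_SITES, TRACKED))
```

With these four placeholder sites, one blocks everything, one blocks nothing, and two block a subset, so the summary is 25% all, 25% none, and 50% some.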

    The Enforcement Gap

    The study acknowledges that robots.txt is a directive, not a barrier, and bots can ignore it.

    We covered this enforcement gap when Google’s Gary Illyes confirmed robots.txt can’t prevent unauthorized access. It functions more like a “please keep out” sign than a locked door.

    Clarkson-Bennett raised the same point in BuzzStream’s report:

    “The robots.txt file is a directive. It’s like a sign that says please keep out, but doesn’t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.”

    Cloudflare documented that Perplexity used stealth crawling behavior to bypass robots.txt restrictions. The company rotated IP addresses, changed ASNs, and spoofed its user agent to appear as a browser.

    Cloudflare delisted Perplexity as a verified bot and now actively blocks it. Perplexity disputed Cloudflare’s claims and published a response.

    For publishers serious about blocking AI crawlers, CDN-level blocking or bot fingerprinting may be necessary beyond robots.txt directives.
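To make the difference from robots.txt concrete, here is a minimal sketch of server-side enforcement: a WSGI middleware that refuses requests whose User-Agent header contains a listed token. It is an application-level stand-in, not a CDN rule or a fingerprinting system. The token list, `block_ai_crawlers`, `demo_app`, and `status_for` are all hypothetical names chosen for the example.

```python
# Lowercase substrings to refuse; an illustrative list, not a recommendation.
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "perplexitybot")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent matches a token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Placeholder application that always serves a short page."""
    body = b"Hello, reader\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(user_agent):
    """Call the wrapped app in-process and return the response status line."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/", "HTTP_USER_AGENT": user_agent}
    list(block_ai_crawlers(demo_app)(environ, start_response))
    return captured["status"]


# The User-Agent strings below are illustrative, not real crawler headers.
print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

Unlike a robots.txt rule, this check is enforced by the server rather than left to the crawler. It still relies on the crawler identifying itself honestly, so a bot that spoofs a browser user agent, as Cloudflare described, would pass straight through. That limitation is why fingerprinting is mentioned alongside CDN-level blocking.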

    Why This Matters

    The retrieval-blocking numbers warrant attention here. In addition to opting out of AI training, many publishers are opting out of the citation and discovery layer that AI search tools use to surface sources.

    OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Blocking one doesn’t block the other. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval.
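Because each crawler is addressed by its own user-agent token, a publisher could opt out of training and indexing while leaving retrieval open. The sketch below generates such a robots.txt and checks it with Python's `urllib.robotparser`. `build_robots_txt` is a hypothetical helper, and Python's parser is only an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents):
    """Build a robots.txt that disallows the listed agents and allows the rest."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


# Block the training and indexing crawlers named in the article,
# leaving the retrieval crawlers unmentioned.
robots_txt = build_robots_txt(["GPTBot", "PerplexityBot"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/story"  # placeholder URL
access = {
    bot: parser.can_fetch(bot, url)
    for bot in ["GPTBot", "OAI-SearchBot", "PerplexityBot", "Perplexity-User"]
}
print(access)
```

In this sketch GPTBot and PerplexityBot are refused while OAI-SearchBot and Perplexity-User remain allowed, which mirrors the point that blocking one crawler does not block its sibling.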

    These blocking choices affect where AI tools can pull citations from. If a site blocks retrieval bots, it may not appear when users ask AI assistants for sourced answers, even if the model already contains that site’s content from training.

    The Google-Extended pattern is worth watching. US publishers block it at nearly twice the UK rate, though whether that reflects different risk calculations around Gemini’s growth or different business relationships with Google isn’t clear from the data.

    Looking Ahead

    The robots.txt method has limits, and sites that want to block AI crawlers may find CDN-level restrictions more effective than robots.txt alone.

    Cloudflare’s Year in Review found GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks for Googlebot and Bingbot rather than full blocks, reflecting the dual role Google’s crawler plays in search indexing and AI training.

    For those tracking AI visibility, the retrieval bot category is what to watch. Training blocks affect future models, while retrieval blocks affect whether your content shows up in AI answers right now.


    Featured Image: Kitinut Jinapuck/Shutterstock
