{"id":6074,"date":"2025-08-12T10:03:38","date_gmt":"2025-08-12T10:03:38","guid":{"rendered":"https:\/\/serisec.com\/index.php\/2025\/08\/12\/reddit-to-block-internet-archive-as-ai-companies-have-scraped-data-from-wayback-machine\/"},"modified":"2025-08-12T10:03:38","modified_gmt":"2025-08-12T10:03:38","slug":"reddit-to-block-internet-archive-as-ai-companies-have-scraped-data-from-wayback-machine","status":"publish","type":"post","link":"https:\/\/serisec.com\/index.php\/2025\/08\/12\/reddit-to-block-internet-archive-as-ai-companies-have-scraped-data-from-wayback-machine\/","title":{"rendered":"Reddit to Block Internet Archive as AI Companies Have Scraped Data From Wayback Machine"},"content":{"rendered":"<div>\n<p>Reddit has announced plans to significantly restrict the Internet Archive\u2019s Wayback Machine from indexing its platform, citing concerns that AI companies have been exploiting the archival service to circumvent Reddit\u2019s data protection policies.<\/p>\n<p>The move represents another escalation in Reddit\u2019s ongoing battle to control access to its user-generated content amid the AI training data boom.<\/p>\n<pre class=\"wp-block-preformatted\"><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">Key Takeaways<\/mark><\/strong><br>1. The Wayback Machine will be able to archive only Reddit's homepage, not individual posts or comments.<br>2. AI companies were using archived data to bypass Reddit's direct access restrictions.<br>3. 
Reddit prefers paid licensing deals over free data access.<\/pre>\n<h2 class=\"wp-block-heading\" id=\"h-block-wayback-machine-access-nbsp\"><strong>Block Wayback Machine Access<\/strong><\/h2>\n<p>Starting today, Reddit says it is \u201cramping up\u201d restrictions that will block the Wayback Machine from accessing post detail pages, comment threads, and user profiles.<\/p>\n<p>The Internet Archive will retain only the ability to index Reddit\u2019s homepage, effectively limiting historical records to snapshots of trending headlines and popular posts on given dates.<\/p>\n<p>\u201cInternet Archive provides a service to the open web, but we\u2019ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,\u201d Reddit spokesperson Tim Rathschmidt <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/reddit_will_block_the_internet_archive\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">explained<\/a>.<\/p>\n<p>According to the company, AI training companies have pulled Reddit content from archived copies, which are served by the Wayback Machine rather than by Reddit and therefore fall outside the platform\u2019s robots.txt rules, API rate limiting, and crawler-blocking mechanisms.<\/p>\n<p>Reddit\u2019s technical implementation will likely involve adding rules to its robots.txt file for the User-Agent strings used by Internet Archive crawlers, and potentially server-side blocking of IP ranges associated with the Wayback Machine\u2019s infrastructure.<\/p>\n<p>This approach mirrors the platform\u2019s recent strategy of blocking search engine crawlers unless companies enter paid licensing agreements.<\/p>\n<p>This restriction forms part of Reddit\u2019s comprehensive approach to monetizing its data assets in the AI era.<\/p>\n<p>The platform has entered into significant deals with 
Google and <a href=\"https:\/\/cybersecuritynews.com\/tag\/openai\/\" target=\"_blank\" rel=\"noreferrer noopener\">OpenAI<\/a> for official data access, while simultaneously pursuing legal action against companies like Anthropic for allegedly continuing to scrape content after claiming to have stopped.<\/p>\n<p>Reddit justified its 2023 API pricing changes, which effectively shuttered popular third-party applications, with similar reasoning about preventing unauthorized AI training.<\/p>\n<p>The company has implemented rate limiting, authentication requirements, and usage monitoring across its technical infrastructure to maintain control over data access.<\/p>\n<p>Mark Graham, director of the <a href=\"https:\/\/cybersecuritynews.com\/internet-archive-breached-again\/\" target=\"_blank\" rel=\"noreferrer noopener\">Wayback Machine<\/a>, acknowledged ongoing discussions with Reddit about the matter, suggesting that technical solutions may be explored.<\/p>\n<p>However, Reddit\u2019s position appears firm: until the Internet Archive can guarantee compliance with platform policies on user privacy and respect for content deletions, access will remain severely limited.<\/p>\n<p>This development highlights the growing tension between open web archival principles and commercial data control in the AI training landscape.<\/p>\n<p>The post <a 
href=\"https:\/\/cybersecuritynews.com\/reddit-to-block-internet-archive\/\">Reddit to Block Internet Archive as AI Companies Have Scraped Data From Wayback Machine<\/a> appeared first on <a href=\"https:\/\/cybersecuritynews.com\/\">Cyber Security News<\/a>.<\/p>\n<\/div>\n<p>Florence Nightingale<br \/>\n<a href=\"https:\/\/cybersecuritynews.com\/reddit-to-block-internet-archive\/\">Go to cyber-security-news<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reddit to Block Internet Archive as AI Companies Have Scraped Data From Wayback Machine Reddit has announced plans to significantly restrict the Internet Archive\u2019s Wayback Machine from indexing its platform, citing concerns that AI companies have been exploiting the archival service to circumvent Reddit\u2019s data protection policies.\u00a0 The move represents another escalation in Reddit\u2019s ongoing [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[129,63,1440],"tags":[130],"class_list":["post-6074","post","type-post","status-publish","format-standard","hentry","category-cyber-security","category-cyber-security-news","category-tech-news","tag-cyber-security-news"],"_links":{"self":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts\/6074"}],"collection":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/comments?post=6074"}],"version-history":[{"count":0,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/posts\/6074\/revision
s"}],"wp:attachment":[{"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/media?parent=6074"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/categories?post=6074"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/serisec.com\/index.php\/wp-json\/wp\/v2\/tags?post=6074"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}