How are websites able to remove their content from the Wayback Machine?

If you click on this link (from the Wayback Machine), you'll notice that the web content successfully loads.

But then, something else loads, and the content is destroyed. (perhaps robots.txt?)

What is Economist.com loading that makes this 404 happen? Somehow the page is loading something external, and that is what triggers the 404.

Is there a way to prevent this?

Maybe a Chrome extension that loads a website step by step, so that the user can decide what to load and what not to load.

Anyway, I'm just looking for ideas, because it seems like a lot of websites are unhappy that their content is on the Wayback Machine and are sabotaging the archive so that a 404 replaces the content.

LINK:
https://web.archive.org/web/20150803014731/http://worldif.economist.com/article/11/what-if-autonomous-vehicles-rule-the-world-from-horseless-to-driverless

UPDATE:

I found one way to do it. But it'd be nice if there were a more elegant solution, maybe a Chrome extension.
Video: http://tinyurl.com/v7vgf28

Tags
robots-txt


2 Solutions


For your particular link (the Economist one), I was able to hit View Page Source at that instant and got the HTML, but no CSS.
HERE IS THE LINK TO THE HTML.

To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt) and then submit your site below.

The robots.txt file will do two things:

1. It will remove all documents from your domain from the Wayback Machine.

2. It will tell the Internet Archive's crawler not to crawl your site in the future.

To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

    User-agent: ia_archiver
    Disallow: /

Robots.txt is the most widely used method for controlling the behavior of automated robots on your site (all major robots, including those of Google, AltaVista, etc., respect these exclusions). It can be used to block access to the whole domain, or to any file or directory within it. There are many resources for webmasters and site owners describing this method and how to use it. Here are a few:

- http://www.global-positioning.com/robots_text_file/index.html
- http://www.webtoolcentral.com/webmaster/tools/robots_txt_file_generator
- http://pageresource.com/zine/robotstxt.htm

Once you have put a robots.txt file up, submit your site (www.yourdomain.com) via the form at http://pages.alexa.com/help/webmasters/index.html#crawl_site.

The robots.txt file must be placed at the root of your domain (www.yourdomain.com/robots.txt). If you cannot put a robots.txt file up, submit a request to wayback2@archive.org.
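As a rough sanity check, you can verify that a given site's robots.txt actually excludes ia_archiver before submitting it. A minimal sketch in TypeScript (Node 18+, run as an ES module; the domain is a placeholder, and the parser only handles the simple exact-match case shown above, not the full robots exclusion protocol):

    // Check whether a robots.txt excludes the Internet Archive's crawler.
    const domain = "www.yourdomain.com"; // placeholder domain

    const text = await (await fetch(`https://${domain}/robots.txt`)).text();

    let inGroup = false; // inside a group that applies to ia_archiver (or *)?
    let blocked = false;
    for (const raw of text.split("\n")) {
      const line = raw.split("#")[0].trim(); // strip comments
      const colon = line.indexOf(":");
      if (colon < 0) continue;
      const field = line.slice(0, colon).trim().toLowerCase();
      const value = line.slice(colon + 1).trim();
      if (field === "user-agent") {
        inGroup = value.toLowerCase() === "ia_archiver" || value === "*";
      } else if (inGroup && field === "disallow" && value === "/") {
        blocked = true;
      }
    }
    console.log(blocked ? "ia_archiver is excluded" : "ia_archiver may crawl");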

Right, but my goal is to see the content at the above link from the Economist. E.g. if you click the link and watch closely, you'll notice that the webpage content actually does load successfully, but then it is wiped away a second later. Is there any way to halt this? Any way to prevent the Economist from wiping their content from the Wayback Machine?
tonloc 10 months ago
HEY I'VE UPDATED THE SOLUTION!
ST2-EV 10 months ago
Hi, thanks. But what method did you use to get this HTML? My goal is to be able to do this on future links when I stumble upon them in the Wayback Machine. E.g. there should be some way to download content from a web source and "press pause" to prevent further downloading.
tonloc 10 months ago
Update: hmm, I found one method to do it. It'd be nice if there were a more reliable way. Video in the original post.
tonloc 10 months ago
I just hit View Page Source at the instant the page loads.
ST2-EV 10 months ago
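A more reliable variant of that trick: fetch the snapshot with a plain HTTP client instead of a browser. The client never executes the page's JavaScript, so the redirect can never fire. A minimal sketch in TypeScript (Node 18+, run as an ES module; the output filename is arbitrary):

    import { writeFileSync } from "node:fs";

    // The snapshot URL from the original post.
    const url =
      "https://web.archive.org/web/20150803014731/http://worldif.economist.com/article/11/what-if-autonomous-vehicles-rule-the-world-from-horseless-to-driverless";

    const html = await (await fetch(url)).text();
    writeFileSync("snapshot.html", html); // open locally; no scripts ever ran
    console.log(`saved ${html.length} characters of HTML`);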
Winning solution

Another solution is to go to Chrome > Settings > Advanced > Site Settings > JavaScript and add a new blocking URL: web.archive.org

Screenshot: https://prnt.sc/qm8czr

This should disable JavaScript when loading a Web Archive page and thus allow the full Economist page to load without the 404 message.

Video walkthrough: https://drive.google.com/open?id=1p7bJOgBsZHlGg2OsrQCGrpJRPEH64vBz
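If you would rather not click through Settings each time, the same per-site rule can be set programmatically from a small extension. A minimal sketch (TypeScript background script, assuming "contentSettings" is declared under "permissions" in the manifest):

    // Block JavaScript on web.archive.org, mirroring the manual steps above.
    chrome.contentSettings.javascript.set(
      {
        primaryPattern: "https://web.archive.org/*",
        setting: "block",
      },
      () => console.log("JavaScript is now blocked on web.archive.org"),
    );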

Hi, thanks. So in your video, what you did was prevent Chrome from executing JavaScript at archive.org? And that prevented the Economist's "destroyer script" from firing after the website loaded?
tonloc 10 months ago
There is a script loaded in the page (filename: 0d74ccc6.js) that causes the 404 redirect. If you can prevent this script from executing, you can probably browse through all the archived pages without redirections. You can use a Chrome extension such as Resource Override (https://chrome.google.com/webstore/detail/resource-override/pkoacgokdfckfpndoffpifphamojphii) to exclude a specific file from execution, e.g. the aforementioned one. When the archived page is loaded with the extension active and the script blocked, the article loads just fine.
kostasx 10 months ago
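Building on that comment, an extension could cancel just that one request instead of disabling JavaScript wholesale. A sketch (TypeScript, Manifest V2 background script with the "webRequest", "webRequestBlocking", and web.archive.org host permissions; the filename pattern comes from the comment above and may differ between snapshots):

    // Cancel only the suspect script; the rest of the archived page loads.
    chrome.webRequest.onBeforeRequest.addListener(
      () => ({ cancel: true }),
      { urls: ["*://web.archive.org/*0d74ccc6.js*"] },
      ["blocking"],
    );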
I am not sure whether it is the Economist that is blocking archived pages, or a conflict with some of archive.org's own scripts used to load the archived pages. Further research is needed to see what's happening under the hood.
kostasx 10 months ago