Tutorial: Snapshot a website over time


I'm a developer on a marketing team, and one of the features that often gets requested is: can we go back and see what our site (or what page X) looked like back at time X?

Are there any good solutions for this request?


Source control should be able to solve this request in-house. Label things appropriately and have an internal server to deploy that label to, and you should have no issue. If you have an automated deployment tool and choose your labels wisely, it should be relatively simple to write an app that checks out your source at label X and deploys it, with the user entering only the label. If your labels were something like the date, users would just have to enter the date in the correct format and wait five minutes for the deploy.
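If the source control system happens to be Subversion (an assumption; adapt for your tool), the deploy-by-label step could be as small as this sketch. The repository URL, tag name, and deploy path are all placeholders:

```shell
#!/bin/sh
# Hypothetical deploy-by-label script; repo URL and paths are assumptions.
LABEL="${1:-20090101}"                    # e.g. a date-based tag the user enters
REPO="http://svn.example.com/site/tags"   # assumption: one tag created per release
DEPLOY_DIR="/var/www/preview"             # assumption: internal preview server docroot

# Export (a checkout without .svn metadata) the tagged tree over the docroot.
svn export --force "$REPO/$LABEL" "$DEPLOY_DIR"
```

With date-formatted tags, wiring this to a single text box on an internal page gives exactly the "enter the date and wait for the deploy" workflow described above.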


Have a look at the Wayback Machine. It's not perfect, but there are some embarrassing old sites still in there that I worked on :)


Have you looked at the Wayback Machine at archive.org?


If that doesn't meet your needs, maybe you could automate something with your source control repository that could pull a version for a specific date.
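If that repository happens to be Subversion (an assumption), it already understands date-based revision specifiers, so "pull a version for a specific date" needs no extra automation. The repository URL below is a placeholder:

```shell
# Check out the site's source as it stood at a given date.
# Subversion accepts {YYYY-MM-DD} as a revision specifier.
svn checkout -r "{2009-03-15}" http://svn.example.com/site/trunk site-as-of-20090315
```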


Similar to what others have suggested (assuming a dynamic website), I would use output caching to capture the generated HTML of each page, and then use Subversion to track the changes.

Using the Wayback Machine is probably only a last resort, such as when someone asks to see a page from before you set this system up. You cannot rely on the Wayback Machine to contain everything you need.


My suggestion would be to simply run wget over the site every night and store the result on archive.yourdomain.com. Add a control to each page, for those with the appropriate permissions, that passes the URL of the current page to a date picker. Once a date is chosen, load archive.yourdomain.com/YYYYMMDD/original_url.
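A minimal nightly cron job along these lines might look like the following sketch; the site URL and archive root are assumptions you would swap for your own:

```shell
#!/bin/sh
# Nightly snapshot: mirror the live site into a dated directory that maps
# onto the archive.yourdomain.com/YYYYMMDD/... URL scheme described above.
SITE_URL="http://www.example.com/"    # assumption: your public site
ARCHIVE_ROOT="/var/www/archive"       # assumption: docroot of archive.yourdomain.com
STAMP=$(date +%Y%m%d)

mkdir -p "$ARCHIVE_ROOT/$STAMP"
# --mirror recurses with timestamping; --convert-links rewrites links so the
# copy browses locally; --no-parent keeps wget from wandering above the start URL.
wget --quiet --mirror --convert-links --no-parent \
     --directory-prefix="$ARCHIVE_ROOT/$STAMP" "$SITE_URL"
```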

Letting users browse the entire site without broken links on archive.yourdomain.com might require some URL rewriting, or copying the archived copy of the site from some repository to the root of archive.yourdomain.com. To save disk space, that might be the best option: store the wget copies zipped, then extract the date the user requests. There are some issues with this, such as how you deal with multiple users wanting to view archived pages from different dates at the same time.
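The zip-per-date variant could be sketched like this (all paths are placeholders, and it deliberately ignores the concurrent-request problem noted above):

```shell
#!/bin/sh
# Space-saving variant: keep each dated snapshot zipped, unpack on demand.
ARCHIVE_ROOT="/var/www/archive"   # assumption: docroot of archive.yourdomain.com
STAMP="${1:-20090101}"            # the date being compressed or requested

# After the nightly wget run: compress the snapshot, then drop the raw tree.
( cd "$ARCHIVE_ROOT" && zip -qr "$STAMP.zip" "$STAMP" && rm -rf "$STAMP" )

# On a user request: unpack the chosen date before serving it.
( cd "$ARCHIVE_ROOT" && unzip -qo "$STAMP.zip" )
```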

I'd suggest that running wget over your site each night is superior to retrieving it from source control since you would obtain the page as it was shown to WWW visitors, complete with any dynamically served content, errors, omissions, random rotated ads, etc.

EDIT: You could store the wget output in source control, though I'm not sure what that would buy you over zipping it up on a file system somewhere outside source control. Also note this plan will use up a large amount of disk space over time for a website of any size.


As Grant says, you could combine wget with revision control for space-savings. I am actually trying to write a script to do this for my usual browsing since I don't trust the Internet Archive or WebCite to be around indefinitely (and they are not very searchable).

The script would go something like this: cd to directory; invoke the correct wget --mirror command or whatever; run darcs add $(find .) to check into the repository any new files; then darcs record --all.

Wget ought to overwrite any changed files with the updated version; darcs add will record any new files/directories; darcs record will save the changes.
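A minimal version of that script, under the assumption that the mirror lives in ~/www-archive and darcs is installed, might look like this (the site URL is a placeholder):

```shell
#!/bin/sh
# Mirror-then-record loop: wget refreshes the tree, darcs snapshots it.
ARCHIVE="$HOME/www-archive"   # assumption: where the mirror lives
mkdir -p "$ARCHIVE"
cd "$ARCHIVE" || exit 1

wget --mirror --no-parent "http://www.example.com/"    # overwrite changed files in place
darcs add --recursive . 2>/dev/null                    # register any new files/directories
darcs record --all -m "snapshot $(date +%Y-%m-%d)"     # commit everything as one patch
```

Note that `darcs add --recursive .` does the same job as `darcs add $(find .)` without risking the shell's argument-length limit on a large mirror.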

To get the view as of date X, you simply pull from your repo all patches up to date X.

You don't store indefinitely many duplicate copies because DVCSs don't save history unless there are actual changes to file content. You will get 'garbage' in the sense of pages changing so they no longer require CSS, JS, or images you previously downloaded, but you can just periodically delete everything and record that as a patch; the next wget invocation will only pull in what is needed for the latest version of a page. (And you can still do full-text search; you just search the history rather than the files on disk.)

(If there are big media files being downloaded, you can toss in something like rm $(find . -size +2M) to delete them before they get darcs added.)

EDIT: I wound up not bothering with explicit version control, but letting wget create duplicates and occasionally weeding them with fdupes. See http://www.gwern.net/Archiving%20URLs


The Wayback Machine might be able to help.


Depending on your pages and exactly what you are asking for, you might consider putting copies of the pages in source control.

This probably won't work if your content is in a database, but if they are just HTML pages that you are changing over time, then SCM would be the normal way to do this. The Wayback Machine that everyone mentions is great, but this solution is more company-specific, allowing you to capture every nuance of the changes over time. You have no control over the Wayback Machine (to my knowledge).

In Subversion, you can set up hooks and automate this. In fact, this might even work if you are using content from a database...
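For example, a hypothetical post-commit hook (the hooks/post-commit script inside the repository) could export every commit straight into a dated archive directory. All paths here are assumptions:

```shell
#!/bin/sh
# Hypothetical Subversion post-commit hook. Subversion invokes hooks with
# the repository path and the new revision number as its two arguments.
REPOS="${1:-/srv/svn/site}"       # assumption: repository path on disk
REV="${2:-1}"
ARCHIVE_ROOT="/var/www/archive"   # assumption: docroot of the archive host
STAMP=$(date +%Y%m%d)

# Export the just-committed tree (no .svn metadata) under today's date.
svn export --quiet --force -r "$REV" "file://$REPOS/trunk" "$ARCHIVE_ROOT/$STAMP"
```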
