MongoDB Replica-Set Aware Backup Script

I’ve created a nice little bash script for taking MongoDB backups that is replica-set aware.

It only takes a backup from a replica, so if you have the classic master, replica, arbiter configuration you can set up the script via cron on both the (current) master and the replica, and the backup will only run on the replica.
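To illustrate the replica check, here is a minimal sketch in Python (not the bash script itself). It assumes you already have the reply of MongoDB's `isMaster` command as a dictionary; the function name `should_run_backup` is made up for this example.

```python
def should_run_backup(is_master_reply):
    """Return True only when this node reports itself as a secondary (replica).

    `is_master_reply` is assumed to be the document returned by MongoDB's
    isMaster command, e.g. via pymongo: client.admin.command("isMaster").
    """
    if is_master_reply.get("ismaster"):
        # Current master: skip, the replica will take the backup.
        return False
    if is_master_reply.get("arbiterOnly"):
        # Arbiters hold no data, so there is nothing to back up.
        return False
    return bool(is_master_reply.get("secondary"))
```

With a check like this at the top of the job, the same cron entry can live on every node and still produce exactly one backup, even after a failover swaps the roles.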

It then compresses the backup into a tar.gz archive and uploads it to Google Storage. It can easily be adapted to upload the backup to S3 using s3cmd or the AWS CLI (aws-cli).
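The archiving step could look roughly like the following Python sketch. The function name, the `mongodb-backup` prefix and the timestamp format are assumptions made for illustration, and the upload itself is left out since it depends on which storage tool you use.

```python
import os
import tarfile
import time


def archive_dump(dump_dir, out_dir, prefix="mongodb-backup"):
    """Pack dump_dir into <out_dir>/<prefix>-<timestamp>.tar.gz and return the path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(out_dir, "%s-%s.tar.gz" % (prefix, stamp))
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the dump under its directory name rather than its full path.
        tar.add(dump_dir, arcname=os.path.basename(os.path.normpath(dump_dir)))
    return archive_path
```

The returned path is what you would then hand to the upload command of your choice.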

Cross posted at Forecast:Cloudy (my cloud blog).

Clone S3 Bucket Script

I had to back up an S3 bucket, so I whipped up a small script to clone a bucket.

It’s written in Python and depends on the excellent Boto library. If you are running Python < 2.7 you’ll also need the argparse library (both are also available via pip).
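As a rough idea of what such a clone involves (this is a sketch, not the gist's code), the core loop with classic Boto (2.x) bucket objects might look like this. The function name `clone_bucket` is made up, and the sketch assumes both buckets already exist and that overwriting keys in the destination is acceptable.

```python
def clone_bucket(src_bucket, dst_bucket):
    """Copy every key from src_bucket into dst_bucket; return the number copied.

    Both arguments are assumed to be Boto 2.x Bucket objects, obtained e.g. with:
        import boto
        conn = boto.connect_s3()
        src_bucket = conn.get_bucket("source-bucket-name")  # hypothetical name
    """
    copied = 0
    for key in src_bucket.list():
        # Server-side copy: S3 copies the object, nothing is downloaded locally.
        dst_bucket.copy_key(key.name, src_bucket.name, key.name)
        copied += 1
    return copied
```

Because the copy happens inside S3, the script's running time depends mostly on the number of keys rather than on their size.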

View the gist here: https://gist.github.com/1275085


Crawling to the people

Yaniv let the cat out of the bag about some of our ideas for making other parts of the search and its relevant data open, free and accessible to all of us.

I thought I’d add some background and my thoughts on the subject.

First, the idea went through a couple of iterations while we were in that place where you have a solution (or several) and are looking for a problem (or several) to solve.

It all started from this post by Jeremie Miller. Jeremie, being the good guy that he is, was thinking about creating standards and protocols to make the crawling, processing and sharing of data for search and search engines public, free and accessible. While neither Yaniv nor I am in Jeremie’s loop, and we have no idea what he is up to (but you can count on it to be interesting, that’s for sure), we talked about it a bit and it sank in.

We both liked the idea of having the raw data accessible, as well as being able to run custom post-processors that make something useful out of it, so that no one is tied to whatever logic and algorithms the crawler writer enforces.

Then came the announcement from Kevin Burton about spinn3r, a service that reuses the web index of the blogosphere crawled by TailRank’s crawler and allows you (and everyone else) to use that crawled data.

This information also sank in, and today at lunch (which did take quite a while :-) ) we started to brainstorm about it a bit more seriously.

This can really open up search and drive innovation from the bottom up, giving a lot of people access to APIs and capabilities that were previously available only to big companies. This is the kind of platform that can create something very interesting.

We would love to hear your comments.