Disco Tip – Crunching web server logs

At my day job we use Disco, a Python + Erlang based Map-Reduce framework, to crunch our web server and application logs and generate useful data.

Each web server produces a couple of GB of log data per day, which adds up to a lot of data that needs to be processed on a daily basis.

Since the files are big, it was easier for us to perform all the necessary filtering to find the rows of interest in the “map” function. The problem is that this requires us to return some generic null value for the rows that are not interesting to us, which causes the intermediate files to contain a lot of unnecessary data mapped from those uninteresting rows.

To significantly reduce this number, we started to use the “combiner” function, so that our intermediate results contain an already summed-up result for the file the node is currently processing, composed only of the rows we found interesting using the filtering in the “map” phase.

For example, if we have 1,000 rows and only 200 match a certain filtering criterion for a particular report, instead of getting 1,000 rows in the intermediate file, 800 of which carry the same null value, we now get only 200 rows.
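Here is a minimal, framework-free sketch of that map + combiner pattern. The function names, the log format and the filtering criterion are all hypothetical (Disco’s actual API differs), but the idea is the same: filter in “map”, pre-aggregate in “combine”, so uninteresting rows never reach the intermediate files.

```python
def map_fn(line):
    """Emit (status, 1) only for rows we care about; skip the rest
    instead of emitting a generic null value."""
    parts = line.split()
    status = parts[1]          # hypothetical log layout: "<url> <status>"
    if status == "500":        # the filtering criterion lives in "map"
        yield (status, 1)

def combine(pairs):
    """Sum counts per key for the chunk this node is processing, so the
    intermediate file holds one summed row per key, not one per hit."""
    summed = {}
    for key, value in pairs:
        summed[key] = summed.get(key, 0) + value
    return summed

if __name__ == "__main__":
    # 1,000 rows, 200 of which match the filter (as in the example above)
    lines = ["/index 200"] * 800 + ["/buy 500"] * 200
    intermediate = combine(pair for line in lines
                           for pair in map_fn(line))
    print(intermediate)        # → {'500': 200}
```

Instead of 1,000 rows (800 of them null) or even 200 filtered rows, the intermediate result for this chunk is a single summed entry per key.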

In some cases we saw run time improve by up to 50% (the speed-up comes from having fewer rows to reduce from the intermediate files), not to mention reduced disk space usage during execution thanks to the smaller intermediate files.

That way, we can keep the filtering logic in the “map” function while making sure we don’t end up reducing unnecessary data.

New programming languages force you to re-think a problem in a fresh way (or: why we always need new programming languages)

Whenever a new programming language appears, some claim it’s the best thing since sliced bread (tm – not mine ;-) ), while others claim it’s the worst thing that could happen and that you could implement everything the language provides in programming language X (assign X to your favorite low-level programming language and append a suitable library).

After seeing Google’s new Go programming language I must say I’m excited. Not because it’s from Google and has gotten a huge buzz around the net – I am excited by the fact that people decided to think differently before they went on and created Go.

I’m reading Masterminds of Programming: Conversations with the Creators of Major Programming Languages (a good read for any programming language fanatic), a collection of interviews with the creators of various programming languages, and it’s very interesting to see the thoughts and processes behind some of the most widely used programming languages (and even some not-so-widely-used ones).

In a recent interview Brad Fitzpatrick (of LiveJournal fame and now a Google employee) was asked:

You’ve done a lot of work in Perl, which is a pretty high-level language. How low do you think programmers need to go – do programmers still need to know assembly and how chips work?

To which he replied:

… I see people that are really smart – I would say they’re good programmers – but say they only know Java. The way they think about solving things is always within the space they know. They don’t think ends-to-ends as much. I think it’s really important to know the whole stack even if you don’t operate within the whole stack.

I subscribe to Brad’s point of view because: a) you need to know your stack from end to end – from the bare metal of your servers (i.e. server configuration) and the operating system internals to the data structures used in your code – and b) you need to know more than one programming language, to open your mind to different ways of implementing a solution to a problem.

Perl has regular expressions baked into the language, making every Perl developer think in terms of pattern matching when performing string operations, instead of writing tedious find-and-replace code. Of course you can always use various find-and-replace methods, but the mindset of compiled pattern matching makes string handling much more accessible, powerful and useful.
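The same contrast shows up in Python’s re module. A small sketch (the log line format is made up) of hand-rolled string surgery versus declaring the pattern once:

```python
import re

line = "10.0.0.1 - GET /reports?id=42 500"

# Find-and-replace style: slice and search by hand, step by step
start = line.find("/")
end = line.find(" ", start)
path_manual = line[start:end]

# Pattern-matching style: declare the shape of the data once, reuse it
request = re.compile(r"(?P<method>GET|POST) (?P<path>\S+) (?P<status>\d{3})")
m = request.search(line)
path_pattern = m.group("path")

assert path_manual == path_pattern == "/reports?id=42"
```

The compiled pattern also hands you the method and status code for free, where the manual version would need two more rounds of slicing.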

Python has lists and dictionaries (the latter using a VERY efficient hashtable implementation, at least in CPython) baked into the language, because lists and dictionaries are very powerful data structures that can be used in a lot of solutions to problems.
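For instance, counting status codes in a log – the kind of task the Disco jobs above do at scale – falls out naturally once a hashtable is a built-in type (the data here is made up):

```python
statuses = ["200", "200", "500", "404", "200"]

counts = {}
for s in statuses:
    # dict lookups and inserts are O(1) on average in CPython
    counts[s] = counts.get(s, 0) + 1

print(counts)  # → {'200': 3, '500': 1, '404': 1}
```

No library, no declaration of a map type – the data structure is simply part of how you think in the language.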

One of Go’s baked-in features is concurrency support in the form of goroutines. Goroutines make it very easy to use multi-core systems without the complexities that exist in multi-process or multi-threaded programming, such as synchronization. This feature actually shares some ancestry with Erlang (which by itself has a very unique syntax and vocabulary for scalable functional programming).

Every programming language brings something new to the table – a new way of looking at things and solving problems. That’s why each one is so special :-)