At my day job we use Disco, a Python- and Erlang-based MapReduce framework, to crunch our web server and application logs and generate useful data.
Each daily web server log file is a couple of GB, which adds up to a lot of log data that needs to be processed every day.
Since the files are big, it was easier for us to perform all the necessary filtering, i.e. finding the rows of interest, in the “map” function. The problem is that this requires us to return some generic null value for rows that are not interesting to us, so the intermediate files end up containing a lot of unnecessary data: the mappings of all our uninteresting rows.
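To make that concrete, here is a rough, hypothetical sketch of that kind of map function. The log layout, the field positions and the status-code filter are made up for illustration and are not our actual report logic:

```python
# Hypothetical sketch of the original approach: all the filtering happens in
# the map function, but every row still produces a pair, so uninteresting
# rows turn into generic null pairs in the intermediate files.
def fun_map(line, params):
    fields = line.split()               # assume a space-separated access log
    url, status = fields[6], fields[8]  # positions as in a typical combined log format
    if status == "500":                 # the rows this particular report cares about
        return [(url, 1)]
    return [(None, 0)]                  # generic null pair for every other row
```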
To significantly cut this down, we started using the “combiner” function, so that the intermediate result each node writes out is an already-summed-up aggregate of the file it is currently processing, composed only of the rows that passed the filtering in the “map” phase.
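The combiner hook gives us exactly that per-node aggregation. A minimal sketch, assuming the combiner signature documented by Disco, combiner(key, value, buffer, done, params), and the hypothetical map function above:

```python
# Sketch of a combiner, following the combiner signature from the Disco docs.
# `buf` persists between calls on the same node; when `done` is True the
# accumulated pairs are returned, and only that (much smaller) result
# reaches the intermediate file.
def fun_combiner(key, value, buf, done, params):
    if done:
        return buf.items()          # only the interesting, pre-summed rows
    if key is None:                 # drop the null pairs emitted for filtered-out rows
        return
    buf[key] = buf.get(key, 0) + value
```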
For example, if we have 1,000 rows and only 200 of them match the filtering criteria for a particular report, instead of getting 1,000 rows in the intermediate file, 800 of which carry the same null value, we now get only those 200 rows.
In some cases we saw run times improve by up to 50% (the speedup comes from having far fewer rows from the intermediate files to reduce), not to mention a reduction in disk space used during execution thanks to the smaller intermediate files.
That way, we can keep the filtering logic in the “map” function while making sure we don’t end up reducing unnecessary data.
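For completeness, wiring the pieces together might look roughly like this; the input tag is a placeholder, the keyword names follow Disco's job arguments as documented, and fun_map / fun_combiner are the sketches from above:

```python
from disco.core import Job, result_iterator
from disco.util import kvgroup

def fun_reduce(iter, params):
    # sum the per-node partial counts that the combiner already produced
    for url, counts in kvgroup(sorted(iter)):
        yield url, sum(counts)

job = Job().run(input=["tag://weblogs:2012-01-01"],  # placeholder input tag
                map=fun_map,
                combiner=fun_combiner,
                reduce=fun_reduce)

for url, count in result_iterator(job.wait()):
    print(url, count)
```

Because the combiner only sees data local to its own node, the reduce step still has to merge the per-node sums, but it now operates on a small fraction of the original rows.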