At my day job we use Disco, a Python + Erlang based Map-Reduce framework, to crunch our web server and application logs into useful data.
Each day's log file from a single web server is a couple of GB, which adds up to a lot of log data that needs to be processed daily.
Since the files are big, it was easier for us to perform all the filtering needed to find the rows of interest in the “map” function. The problem is that this requires us to return some generic null value for the rows that don't interest us, so the intermediate files end up containing a lot of unnecessary data mapped from those uninteresting rows.
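To make this concrete, here is a minimal sketch of the kind of map job this describes. It is not our actual job code: the log format, the `/api/` filter, the `__skip__` sentinel, and the `access.log` input are all made up for illustration. The point is simply that every uninteresting line still has to yield a key/value pair, and all of those pairs land in the intermediate files:

```python
from disco.core import Job, result_iterator

def log_map(line, params):
    fields = line.split()
    # Hypothetical filter: keep only requests for our API endpoints
    # (assumes a combined-log-style line with the request path in field 7).
    if len(fields) > 6 and fields[6].startswith("/api/"):
        yield fields[6], 1
    else:
        # Uninteresting lines still have to emit something, so we fall back
        # to a generic "skip" pair; this is what bloats the intermediate files.
        yield "__skip__", 0

def log_reduce(iter, params):
    from disco.util import kvgroup
    for key, counts in kvgroup(sorted(iter)):
        if key != "__skip__":
            yield key, sum(counts)

if __name__ == "__main__":
    # "access.log" is a placeholder; the real job reads the day's log files
    # distributed across the Disco cluster.
    job = Job().run(input=["access.log"], map=log_map, reduce=log_reduce)
    for path, hits in result_iterator(job.wait()):
        print(path, hits)
```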