Disco Tip – Crunching web server logs

March 21, 2010

At my day job we use Disco, a Python + Erlang based Map-Reduce framework, to crunch our web server and application logs and generate useful data.

Each web server's log file for a single day is a couple of GB, which adds up to a lot of log data that needs to be processed on a daily basis.

Since the files are big, it was easier for us to do all the filtering needed to find the rows of interest in the “map” function. The problem is that this requires us to return some generic null value for rows that do not interest us. As a result, the intermediate files contain a lot of unnecessary data: the mappings of all those uninteresting rows.
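To make the problem concrete, here is a minimal sketch in plain Python. It does not use Disco's API; the log format, the `map_row` name, the 404 filter, and the `NULL_KEY` placeholder are all assumptions made up for illustration. It only shows how a filtering “map” that must return something for every row ends up writing records for rows nobody wants.

```python
# Plain-Python illustration, not Disco's API.
# Hypothetical log format: "<ip> <path> <status>".
NULL_KEY = None  # generic placeholder for rows we do not care about


def map_row(line):
    """Return one (key, value) pair per input row, as a filtering map would."""
    ip, path, status = line.split()
    if status == "404":      # the filter for this imaginary report
        return (path, 1)
    return (NULL_KEY, 0)     # an uninteresting row still produces a record


rows = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.3 /about 200",
]

intermediate = [map_row(r) for r in rows]
null_records = sum(1 for key, _ in intermediate if key is NULL_KEY)

# Every input row yields an intermediate record, interesting or not.
print(len(intermediate), null_records)
```

Here two of the three intermediate records carry nothing but the placeholder, which is the overhead described above.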

To significantly reduce the amount of this data, we started using the “combiner” function, so that the intermediate results contain an already summed-up result for the file the node is currently processing, built only from the rows that passed the filtering in the “map” phase.

For example, if we have 1,000 rows and only 200 match the filtering criteria for a particular report, then instead of getting 1,000 rows in the intermediate file, 800 of which hold the same null value, we now get only 200 rows.
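The numbers in this example can be reproduced with a small simulation. This is a sketch in plain Python under stated assumptions, not our actual job: Disco's real combiner hook has its own signature and calling convention, which this does not reproduce. The `combine` function, the synthetic log rows, and the 404 filter are hypothetical. Each matching row gets its own path here, so the combined output has exactly 200 entries.

```python
from collections import defaultdict

# Plain-Python simulation, not Disco's API.
NULL_KEY = None  # placeholder emitted for uninteresting rows


def map_row(line):
    """Filtering map: real key for matching rows, placeholder otherwise."""
    ip, path, status = line.split()
    if status == "404":
        return (path, 1)
    return (NULL_KEY, 0)


def combine(pairs):
    """Pre-aggregate one file's map output, skipping placeholder records."""
    totals = defaultdict(int)
    for key, value in pairs:
        if key is NULL_KEY:
            continue             # uninteresting rows never reach the output
        totals[key] += value     # sum up values that share a key
    return sorted(totals.items())


# 1,000 synthetic rows; every fifth row is a 404 with its own path.
rows = [
    "10.0.0.1 /page%d %s" % (i, "404" if i % 5 == 0 else "200")
    for i in range(1000)
]

without_combiner = [map_row(r) for r in rows]
with_combiner = combine(without_combiner)

print(len(without_combiner), len(with_combiner))  # prints: 1000 200
```

If several matching rows shared a key, the summing step would shrink the output further, to one entry per distinct key.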

In some cases we saw run time improve by up to 50% (the speedup comes from having fewer rows to reduce from the intermediate files), not to mention lower disk space use during execution thanks to the smaller intermediate files.

That way, we can keep the filtering logic in the “map” function while making sure we don’t end up reducing unnecessary data.

posted in Development, Disco, Tips n' Tricks by Eran Sandler


