Richard Bucker

Aggregated logging - the Google lesson

Posted at — Mar 14, 2015

Whenever I build or deploy distributed systems, the topic of log aggregation pops up. I wish it were a difficult topic, because then there would be money to be made with a solution, and given the number of times I’ve built these systems, I’d be very wealthy.

A friend of mine pointed me to a service/application that I was already familiar with. Bosun is a monitoring application from the makers of Stack Exchange. If my memory serves me, this system is the first Go application written by Fog Creek/Stack Exchange. I do not remember my exact first impressions, but looking at the code today I see a number of questionable practices. However, in the end it comes down to answering a few brief questions, which will all direct you to the same place.

- are you going to aggregate 100% of the messages from the system being monitored?
- how big is that 100%?
- if you have 100 or 20,000 systems being monitored, will the log aggregator be able to hold all of that data?
- what is the data retention policy, and do you have enough storage?
- how long is it going to take to aggregate the data?
- when it’s time to start deleting data, will the system be available? (usually not)
- what sort of queries will need to be performed on the data?
- will you map-reduce?
- what happens when the primary aggregator fails?
- will you replicate the primary DB to a hot backup?
- how many users will query the data in real time?
- what sort of monitoring and alerting dashboard is there?
- clock drift, latency, and queuing cause event ordering issues.

And so on…
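To make the sizing questions concrete, here is a back-of-the-envelope sketch. Every input is a made-up assumption (100 log lines per second per server, 200 bytes per line, 30 days of retention, a primary plus one hot backup), and the function names are hypothetical; plug in numbers measured from your own systems.

```python
def daily_log_bytes(servers: int, lines_per_sec: float, bytes_per_line: int) -> int:
    """Raw bytes of log data produced per day across the whole fleet."""
    seconds_per_day = 86_400
    return int(servers * lines_per_sec * bytes_per_line * seconds_per_day)


def retained_bytes(daily_bytes: int, retention_days: int, replicas: int = 1) -> int:
    """Bytes on disk for the retention window, counting every replica."""
    return daily_bytes * retention_days * replicas


# Hypothetical inputs, not measurements: 100 lines/sec per server,
# 200 bytes per line, 30 days retained, primary + one hot backup.
for servers in (100, 20_000):
    daily = daily_log_bytes(servers, lines_per_sec=100, bytes_per_line=200)
    total = retained_bytes(daily, retention_days=30, replicas=2)
    print(f"{servers:>6} servers: {daily / 1e9:,.1f} GB/day, "
          f"{total / 1e12:,.1f} TB retained")
```

Under these assumptions, 100 servers produce roughly 173 GB of logs per day, while 20,000 servers produce roughly 34.6 TB per day and about 2 PB over the retention window, before any indexing overhead.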

Basically, if you think you’re going to aggregate the logs from 20K servers to some sort of logging/aggregation server like Logstash, Loggly, or New Relic, then you might not actually know your data or your systems.

Google has considerably more servers, and yet they do not do log aggregation. Google performs event aggregation. When something happens that requires intervention, the system being monitored sends an alert to the monitor, which then alerts the appropriate staff. This is an actionable event. If the event requires more information, as in the real logs, then the operator or SRE must log into the system that raised the alert or request the logs from it.
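The difference between shipping logs and shipping events can be sketched in a few lines. This is only an illustration of the idea, not Google's actual tooling: the `Event` type, the `actionable_events` function, the `"<SEVERITY> <message>"` line format, and the choice of which severities count as actionable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical rule: only these severities require human intervention.
ACTIONABLE = {"ERROR", "CRITICAL"}


@dataclass
class Event:
    host: str
    severity: str
    summary: str
    log_path: str  # where the operator can find the full logs on the host


def actionable_events(host: str, log_path: str, lines: Iterable[str]) -> List[Event]:
    """Scan local log lines and return only those that need intervention.

    Assumes each line looks like "<SEVERITY> <message>". The raw lines stay
    on the host; only the resulting events would be sent to the monitor.
    """
    events = []
    for line in lines:
        severity, _, message = line.partition(" ")
        if severity in ACTIONABLE:
            events.append(Event(host, severity, message.strip(), log_path))
    return events


lines = [
    "INFO request served in 12ms",
    "ERROR disk /dev/sda1 at 98% capacity",
    "DEBUG cache hit",
]
events = actionable_events("web-01", "/var/log/app.log", lines)
for event in events:
    print(f"ALERT {event.host} [{event.severity}] {event.summary} "
          f"(logs: {event.log_path})")
```

Three log lines stay on the host and a single alert leaves it. The alert carries the host name and log path so the operator knows where to look if more detail is needed.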