My mother always told me about the carpenter who would measure twice and cut once. It didn’t make any sense to me as a kid, but I soon understood exactly what she meant the day I built my first set of shelves.
Hrm. If only I had listened to her sooner …
It seems that this advice can be applied to software engineering too: one of the most important aspects of building a scalable online application (or any software architecture, for that matter) is the ability to measure what's going on. After all, if you can't tell what your application is up to, how on earth are you going to tune it, fix it when it goes wrong, or even notice that it's broken?
Makes total sense, eh?
Of course it makes sense!
Which is why we recommend it to all our clients — monitor what your application is doing, otherwise you won’t be able to know when you have to put your scalability plan into place. (You do have a scalability plan, don’t you?!)
So recently we've been investigating different solutions for monitoring the hosts and applications we install for clients. Traditionally we've used plain old Ganglia for host performance characteristics (CPU, RAM, disk, network I/O, etc.) and Nagios to warn us when certain things aren't working how we expect: is that host up and running, are there 23 newsletter sign-ups happening every hour, does that web page say what we expect, is that database responding to queries, and so on.
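For the curious, a Nagios check is just a small program that prints one status line and exits with a conventional code (0 = OK, 1 = WARNING, 2 = CRITICAL). Here's a minimal sketch in Python of what a check like "are there enough newsletter sign-ups per hour?" could look like; the thresholds and the idea of fetching the count from your own data store are illustrative assumptions, not our actual plugin.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_signups(count, warn_below=23, crit_below=5):
    """Return (exit_code, status_line) in the Nagios plugin style.

    Thresholds are made up for illustration: warn if we drop below
    23 sign-ups in the hour, go critical below 5.
    """
    if count < crit_below:
        return CRITICAL, f"SIGNUPS CRITICAL - only {count} in the last hour"
    if count < warn_below:
        return WARNING, f"SIGNUPS WARNING - {count} in the last hour"
    return OK, f"SIGNUPS OK - {count} in the last hour"

# In a real plugin you'd fetch the count from your data store, then:
#   status, line = check_signups(count)
#   print(line); sys.exit(status)
```

Nagios reads nothing but that printed line and the exit code, which is what makes checks so easy to write in any language.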
If something isn't working how we expect it to, then we get told about it. Depending on the system, the responsible engineer(s) receive an email, an SMS, or a ping via the aNag app on their Android device. Sweet.
But we never really get much insight into how our software actually performs once it goes live.
Don't get me wrong: we have logging throughout the development/deployment/production life cycle, we run performance and load testing before go-live, and we collect stats. But none of it is easily harnessed and displayed via a simple dashboard.
We've been looking into different technologies to solve this, ranging from the roll-your-own approach à la Etsy, built on statsd plus Graphite, all the way to the SaaS (software as a service) solution from New Relic.
Most reporting mechanisms implement some kind of UDP fire-and-forget network ping, which is perfect for the kind of web-based applications we all love to develop. It honestly doesn't matter if the odd reporting metric doesn't get through; remember, we still have logs on the actual host in case of emergency. Meanwhile, it's great to see performance stats charted on a cool dashboard in (nigh on) real time.
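The fire-and-forget part really is that simple: a single UDP datagram, no connection, no acknowledgement. A minimal sketch in Python (the metric name and the statsd-style gauge payload format are just examples):

```python
import socket

def send_metric(name, value, host="127.0.0.1", port=8125):
    """Fire one metric datagram at a collector and move on.

    UDP gives no delivery guarantee, and that's the point: a lost
    sample costs us one data point, never a blocked request.
    """
    payload = f"{name}:{value}|g".encode("ascii")  # statsd gauge format
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

# If no collector is listening, the datagram is silently dropped
# and the application carries on regardless.
send_metric("newsletter.signups", 23)
```

Because there's no handshake and no retry, the overhead per metric is a few microseconds, which is why this pattern is safe to sprinkle through hot code paths.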
New Relic gives us that in-depth x-ray look into how our apps are performing — was it that pymongo query affecting throughput, or was it the network causing excessive latency? We can even drill down to the Python method/function layer in our RESTful services if that’s what it takes.
The statsd approach requires a little more forethought about how we implement measurement recording, but in return it gives us greater flexibility in exactly what we measure.
In the meantime, Johnny is working on an internal Erlang product that, as part of its plugin module nature, will be able to measure, report, and take necessary action on significant events that occur within an application or that occur based on the server’s performance characteristics. I can’t wait to get this into production!
Either way, measuring performance is the ultimate goal, because without measurements, you can’t take any informed action, and without informed action, you’re effectively running your online business in the dark.
And you wouldn’t want that, in the same way you wouldn’t want a set of shelves that didn’t fit. Now where did I put that tape measure and saw …?