Manager of IT: February 2012

Monday, February 13, 2012

Different takes on why monitoring sucks and what's to be done about it

Why monitoring sucks — for now

http://gigaom.com/2012/02/12/why-monitoring-sucks-for-now/

A new (old) model
I’d suggest that any well-designed monitoring tool can help automate the OODA loop for operations teams.
1. Deep integration
2. Contextual alerting and pattern recognition
3. Timeliness
4. High resolution
5. Dynamic configuration
What’s next for monitoring?

Why Alerts Suck and Monitoring Solutions need to become Smarter

http://www.appdynamics.com/blog/2012/01/23/why-alerts-suck-and-monitoring-solutions-need-to-become-smarter/

#1 Problem Identification – Do I have a problem?
#2 Problem Isolation – Where is my problem?
#3 Problem Resolution – How do I fix my problem?

My ideal monitoring system
http://forecastcloudy.net/2012/01/12/my-ideal-monitoring-system/

Hosted (CloudKick, ServerDensity, CloudWatch, RevelCloud and others) vs Installed (Nagios, Munin, Ganglia, Cacti)
Hosted solutions' pricing plans use varied parameters such as price/server, price/metric, retention policy, # of metrics tracked, realtime-ness, etc.
Poll-based method, where a collecting server polls the other servers/services, vs. push, where a client on each server pushes locally collected data to a logging/monitoring server
Allowing custom metrics: not all systems allow monitoring, plotting, and alerting on custom data (at least not in an easy manner)

Friday, February 3, 2012

puppet day #2 -- and I need a custom fact

Objective
One of the first things I wanted to accomplish with puppet is to track down rogue cron jobs under the accounts of people who are no longer here. The broader objective is to delete old/unused accounts.

Problem
But there was some evidence that a few of these old accounts still had cron jobs running. So, we couldn't just delete the old accounts, but needed to proceed cautiously to ensure we didn't stomp on some cron job that was actually needed!

I was looking for puppet to tell me which systems had cron jobs under this old account. Now, puppet is a declarative language, so something like:

if /var/spool/cron/userfoo exists, notify me, so I can take a look and see what I need to fix/replace

doesn't exist! In puppet, you have to declare whether something should or should not exist and then puppet will take the corresponding action. I just wanted puppet to tell me about something on my system. I didn't want puppet to take an action!

Solution
It's up to the Puppet Labs-provided facter to help out here. Puppet ships with a tool called facter that collects a lot of bits of information about systems, like their OS, RAM, kernel version, etc. The code to gather these facts is written in ruby and is extensible. I needed a custom fact that would indicate whether or not /var/spool/cron/userfoo or (on solaris) /var/spool/cron/crontabs/userfoo exists. Writing that code is actually straightforward (my first ruby code ever! yay!). Getting that code onto my agents was the obstacle to overcome.
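A minimal sketch of what such a fact could look like, not the actual code from this post. The fact name userfoo_cron and the helper method are made up for illustration; the two paths are the ones mentioned above. With pluginsync, a file like this would normally sit under lib/facter/ inside a module.

```ruby
# Hypothetical custom fact: does the old account "userfoo" still have a crontab?
# The fact name and helper name are illustrative assumptions, not from the post.

# Crontab locations mentioned in the post (Linux first, then Solaris).
CRON_PATHS = [
  '/var/spool/cron/userfoo',
  '/var/spool/cron/crontabs/userfoo'
].freeze

# True if any of the given crontab files exists on this host.
def userfoo_cron_present?(paths = CRON_PATHS)
  paths.any? { |path| File.exist?(path) }
end

# Register the fact only when Facter is loaded, so the file also runs as plain Ruby.
if defined?(Facter)
  Facter.add(:userfoo_cron) do
    # Facts are reported as strings here: "true" or "false".
    setcode { userfoo_cron_present?.to_s }
  end
end
```

The fact only reports; it changes nothing on the host, which is the "just tell me" behaviour a manifest can't express on its own.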

Problem #2
Puppet does not deliver custom facts to agents by default. Agents and the puppetmaster need this set in puppet.conf (typically /etc/puppet/puppet.conf):

pluginsync = true

This required using puppet to update the puppet.conf and restart puppet. That's what I built. Getting puppet to allow delivery of custom facts by default is a listed feature request: http://projects.puppetlabs.com/issues/5454

The only gotcha here is to make sure you include:

hasrestart => true,

in your init.pp for the puppet service. Otherwise puppet handles the restart as a stop followed by a start, and once the stop has killed the running agent, there is nothing left to send the start!

Resources
http://conshell.net/wiki/index.php/Puppet
grabbed this:

kill -USR1 `cat /var/run/puppet/puppetd.pid`; tail -f /var/log/syslog

from the above link. Which I shortened to:

kill -USR1 `pgrep puppet`; tail -f /var/log/syslog

Config details after the jump

Puppet installed -- let's do something!

Having got a critical mass (but not all) of my servers running puppet and talking to the puppetmaster, I was ready to start actually doing something with puppet. The first thing I wanted to do was update the motd on the servers. I appreciate a standard look and feel when logging into a server, along with some useful info about the host I'm on. Moreover, I wanted to communicate to system users that

I found this: https://github.com/aussielunix/puppet-motd, which uses a puppet template to collect a number of facts along with a really big ASCII banner, which I quite like.
                              _
 _ __  _   _ _ __  _ __   ___| |_
| '_ \| | | | '_ \| '_ \ / _ \ __|
| |_) | |_| | |_) | |_) |  __/ |_
| .__/ \__,_| .__/| .__/ \___|\__|
|_|         |_|   |_|
                                            _   _
 _ __ ___   __ _ _ __   __ _  __ _  ___  __| | | |
| '_ ` _ \ / _` | '_ \ / _` |/ _` |/ _ \/ _` | | |
| | | | | | (_| | | | | (_| | (_| |  __/ (_| | |_|
|_| |_| |_|\__,_|_| |_|\__,_|\__, |\___|\__,_| (_)
                             |___/

Any files that have a 'Puppet' header need to be changed in puppet.

Interesting tidbit
In my motd.erb template, I included:

Uptime: <%= uptime %>

What happens here is that the "uptime" fact (along with the other facts in the template) is collected on the client on every puppet run, the template is rendered with those values, and a flat file without the template markup is laid down on the file system. That file is compared against the freshly rendered content on every run. Here's the point: every day the uptime changes, so a new file is laid down in /etc/motd and the old file is backed up. This is clearly pretty inefficient; the uptime should instead come from something that computes it at login, not on every puppet run.
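To see why the file keeps getting replaced, here is a small stand-alone sketch using ruby's own ERB library. It is not Puppet's code: render_motd and motd_checksum are hypothetical helpers, and the MD5 comparison is only a rough stand-in for how changed file content gets noticed.

```ruby
require 'erb'
require 'digest'

# The same one-line template as in motd.erb.
TEMPLATE = 'Uptime: <%= uptime %>'

# Render the template with a given value for the "uptime" fact.
# The local variable "uptime" is visible to the template through binding.
def render_motd(uptime)
  ERB.new(TEMPLATE).result(binding)
end

# Checksum of the rendered text (illustrative stand-in for change detection).
def motd_checksum(uptime)
  Digest::MD5.hexdigest(render_motd(uptime))
end

puts render_motd('3 days')   # prints "Uptime: 3 days"
puts render_motd('4 days')   # prints "Uptime: 4 days"

# Different rendered content means a different checksum, so the file is rewritten.
puts motd_checksum('3 days') == motd_checksum('4 days')   # prints "false"
```

Any fact whose value drifts over time has the same effect when baked into a template; facts that stay put (OS, kernel version) render to the same text run after run.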

Resources
https://github.com/aussielunix/puppet-motd
~~My init.pp and motd.erb are in the comments~~ I just discovered that I can't format in comments, so adding the file specs after the jump...

managing users with puppet

useful resource:
http://itand.me/using-puppet-to-manage-users-passwords-and-ss