Never wake up at 3:00 AM because an instance went down ever again.
(Most everything on this page is a work in progress. Feel free to send any feedback directly to me at zach@zachgoldberg.com, or file bugs on GitHub.)
Cerebro simplifies and automates some of the most common system administration/ops headaches: building and scaling a cluster of virtual instances in the cloud, and covering common 'problematic scenarios', with the goal of minimizing infrastructure-related 3:00 AM headaches. For example, deploy a cluster of 4 MongoDB nodes with the push of a button. If, 3 weeks later, one of them somehow disappears on a Saturday at 3:00 AM, when all your sysadmins are out on the town finishing their last round of drinks before last call, a new instance will automatically be provisioned and Mongo deployed to the machine and booted, all without forcing anybody to sober up and get to a terminal.
Cerebro has mostly been developed by me, with some help and brainstorming from Ryan Borgeouix. Most of the features described in the overview are at least partially implemented. It is also in use in at least one production environment and a number of development/testing environments. That said, it is sorely lacking in documentation, and getting a fully working stack up is probably not yet as easy as it will be.
Our goal right now is to get a few more people using the system and contributing in one way or another. Patches, forks, bug reports, etc. are all more than welcome on the project's GitHub page, or you can contact me directly at zach@zachgoldberg.com with any issues you may be having.
Puppet and Chef
Puppet's self description: "Puppet Enterprise is IT automation software that gives system administrators the power to easily automate repetitive tasks, quickly deploy critical applications, and proactively manage infrastructure, on-premises or in the cloud."
Chef's self description. Anyway, that means something along the lines of writing Ruby to describe your infrastructure, after which it gives you tools to deploy that infrastructure. If a machine that you've deployed goes down, an admin will have to get a page (from another system, not Chef; maybe Nagios or New Relic) and spin up a new one manually.
Cerebro's focus is not on automating 'repetitive' tasks. Cerebro focuses on automating the random problematic tasks that wake you up in the morning, such as processes that have bloated too much or instances that fall over, as well as one-time deployment tasks.
The key distinction here is that Cerebro focuses a lot on your code as a running entity in the cloud. That is: where is it, how many instances of it are there, is it alive, is it using too much RAM, and so on. All of the things you usually monitor with Nagios etc. Chef and Puppet tackle problems earlier on in the production lifecycle, namely in the machine setup phase -- ensuring code and configuration are deployed appropriately. Many deployments are very sophisticated and will need more automation in the code deployment phase than Cerebro provides, in which case Puppet and Chef may be better solutions. There may even be a world in which both tools are used in the same stack -- Cerebro to manage machines and supervise processes, and Puppet/Chef to do deployments -- though this is not actively supported at present.
MCollective, Capistrano, Func and Fabric
These are self-described remote server automation and deployment tools. Effectively, they manage servers that are already part of your cluster, over SSH or similar. As with Puppet and Chef, the core focus is on an earlier part of the sysadmin/ops lifecycle than Cerebro's. That said, Cerebro does have a remote automation step, during which it would be perfectly appropriate to invoke your Fabric/Chef/Func/MCollective etc. scripts to set up a machine before the Cerebro process supervisor initiates and manages your jobs.
And, if all goes well and you need to scale up, that's a simple 2 step process:
Cerebro is made up of three parts: a Task Sitter, a Machine Sitter and a Cluster Sitter.
Task Sitter -- A harness to manage an arbitrary task or process.
Goal: Instead of thinking about how many machines you need to run a process on, the task sitter's goal is to force the admin to think in terms of CPU and RAM, and to plan ahead of time how much of each resource a process should use.
The Task Sitter's job is to enforce the limits that the admin thinks a process should obey. It can handle the cases where a process disobeys these limits.
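As a rough illustration of the limit check described above, here is a minimal sketch. The names `Limits`, `Usage` and `check_limits` are hypothetical and are not part of Cerebro; the sketch assumes CPU is expressed as a fraction of one CPU and memory in megabytes, and that usage samples are collected elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    cpu: float  # fraction of one CPU the task may use, e.g. 0.5 (assumed unit)
    mem: int    # megabytes of RAM the task may use, e.g. 1200 (assumed unit)


@dataclass(frozen=True)
class Usage:
    cpu: float  # measured CPU fraction for the task
    mem: int    # measured memory in megabytes for the task


def check_limits(usage: Usage, limits: Limits) -> List[str]:
    """Return the names of the limits that this usage sample exceeds."""
    violations = []
    if usage.cpu > limits.cpu:
        violations.append("cpu")
    if usage.mem > limits.mem:
        violations.append("mem")
    return violations
```

A supervisor loop could call `check_limits` on every sample and decide what to do (log, restart, kill) when the returned list is non-empty; which of those actions Cerebro actually takes is not specified here.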
Together with a machine sitter, a machine can be completely managed to run various tasks efficiently within the resource constraints of the machine.
With a cluster sitter, an admin can define how many CPUs and how much RAM a particular task can use, and the cluster sitter can go to machines, look for available CPU and RAM where the process fits, and slot it in there.
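The "look for available CPU and RAM where the process fits" idea can be sketched as a simple first-fit search. This is only an illustration under assumptions: `Machine` and `place_task` are hypothetical names, and first fit is one possible strategy, not necessarily the one Cerebro uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    free_cpu: float  # CPUs not yet reserved on this machine
    free_mem: int    # megabytes of RAM not yet reserved on this machine


def place_task(machines: List[Machine], cpu: float, mem: int) -> Optional[Machine]:
    """Reserve room on the first machine that can hold the task (first fit)."""
    for machine in machines:
        if machine.free_cpu >= cpu and machine.free_mem >= mem:
            # Reserve the resources so later placements see the reduced capacity.
            machine.free_cpu -= cpu
            machine.free_mem -= mem
            return machine
    # Nothing fits; a real cluster sitter might provision a new machine here.
    return None
```

For example, placing a task that needs 0.5 CPU and 1200 MB on a list of machines returns the first machine with enough free capacity, or `None` when no machine has room.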
Task Sitter Details
Define constraints
Define runtime metadata
Monitoring
Machine Sitter Details
Cluster Sitter Details
Cerebro Configuration File: # See settings.py
Example Job Configuration Format
{
    "dns_basename": "redis.startup.com",
    "deployment_recipe": "mystartup.recipes.deploy",
    "deployment_layout": {
        "aws-us-west-2a": {
            "mem": 500,
            "cpu": 1
        },
        "aws-us-east-1b": {
            "mem": 50,
            "cpu": 10
        }
    },
    "recipe_options": {
        # Passed as a dictionary to your jobs
        "release_dir": "/opt/startup/releases/"
    },
    "persistent": true,
    "task_configuration": {
        # Tasksitter configuration.
        "allow_exit": false,
        "name": "Portal Server",
        "command": "/opt/code/run_portal_server",
        "auto_start": false,
        "ensure_alive": true,
        "max_restarts": -1,
        "restart": true,
        "uid": 0,
        "cpu": 0.5, # Allow this job to use 50% of CPU
        "mem": 1200 # Allow this job to use 1.2GB of RAM
    }
},
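To make the restart-related keys in `task_configuration` more concrete, here is a small sketch of how a supervisor might interpret them. The semantics are assumptions, not confirmed Cerebro behavior: `allow_exit` is taken to mean that a clean exit is acceptable, `restart` to mean restarts are enabled, and a `max_restarts` of -1 to mean "no limit". The function name `should_restart` is hypothetical.

```python
def should_restart(task_config: dict, exit_code: int, restarts_so_far: int) -> bool:
    """Decide whether an exited task should be started again.

    Assumed semantics (not confirmed by Cerebro):
      - "allow_exit": a clean exit (code 0) is acceptable and ends supervision
      - "restart": restarts are enabled at all
      - "max_restarts": restart budget, with a negative value meaning unlimited
    """
    if task_config.get("allow_exit", False) and exit_code == 0:
        return False
    if not task_config.get("restart", False):
        return False
    max_restarts = task_config.get("max_restarts", -1)
    return max_restarts < 0 or restarts_so_far < max_restarts
```

Under these assumptions, the example configuration above (`"allow_exit": false`, `"restart": true`, `"max_restarts": -1`) would be restarted every time the process exits.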
def run_deploy(options):
    # API? Kick off your Chef recipe? TODO: More work and structure is needed here.
    logger.*()
(This is all a bit complex right now; some simplification and optionality is needed.)
1. Create #.PROVIDER_REGION.basename as an A record to the machine
2. Add another CNAME to PROVIDER_REGION.basename to #.PROVIDER_REGION etc.
You should manually set up, e.g., "redis.startup.com" to be a CNAME to all of the PROVIDER_REGION.redis.startup.com names. A complete DNS layout looks as follows:
startup.com
  redis.startup.com (Admin Created)
    -> CNAME aws-us-west-1.redis.startup.com (Admin Created)
    -> CNAME aws-us-east-1.redis.startup.com (Admin Created)
  aws-us-west-1.redis.startup.com (Admin Created)
    -> A 45.67.20.106 (Cerebro Created)
    -> A 45.67.20.105 (Cerebro Created)
  0.aws-us-west-1.redis.startup.com (Cerebro Created)
    -> A 45.67.20.106 (Cerebro Created)
  1.aws-us-west-1.redis.startup.com (Cerebro Created)
    -> A 45.67.20.105 (Cerebro Created)
  aws-us-east-1.redis.startup.com (Admin Created)
    -> A 12.67.20.106 (Cerebro Created)
  0.aws-us-east-1.redis.startup.com (Cerebro Created)
    -> A 12.67.20.106 (Cerebro Created)
So, if you point your servers to redis.startup.com, they should get either of the regional CNAMEs.
Each CNAME returns an A record for each machine of that type, e.g. redis.startup.com -> aws-us-west-1.redis.startup.com -> 45.67.20.106
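The naming scheme above can be sketched as a small function that, given a basename and the machine IPs per region, produces the records in the layout. This is an illustration only: `dns_layout` and the `Record` tuple are hypothetical names, and the function just builds a list; it does not talk to any DNS provider.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (name, record type, value)


def dns_layout(basename: str, machines: Dict[str, List[str]]) -> List[Record]:
    """Build the records from the layout above for {region: [machine IPs]}."""
    records: List[Record] = []
    for region, ips in machines.items():
        regional = f"{region}.{basename}"
        # Admin-created: the basename is a CNAME to every regional name.
        records.append((basename, "CNAME", regional))
        for index, ip in enumerate(ips):
            # Cerebro-created: the regional name and the numbered name
            # each get an A record pointing at the machine.
            records.append((regional, "A", ip))
            records.append((f"{index}.{regional}", "A", ip))
    return records
```

Calling `dns_layout("redis.startup.com", {"aws-us-east-1": ["12.67.20.106"]})` yields the three east-coast entries from the layout: the CNAME from the basename, the regional A record, and the numbered `0.` A record.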
I've had a few questions on Cerebro's security model. Namely, that there is none. This is for two reasons: time, and the fact that it's not immediately obvious to me that one is required. Your cloud should, in an ideal world, be completely firewalled off from the outside world. All of Cerebro's management is done via HTTP connections on non-standard ports, which should be accessible only within your firewalled cloud or VPC. To manage my machines within this environment, I usually poke a hole or two with a reverse SSH port forward, or simply VPN in behind the firewall. This isn't a perfect scenario, since anybody within your cloud can do some bad things, but it seems 'good enough' until somebody cares to beef up the internal security model.