Cerebro

Never again wake up at 3:00 AM because an instance went down.



(Most everything on this page is a work in progress. Feel free to send any feedback directly to me at zach@zachgoldberg.com, or file bugs on GitHub.)

Overview

Cerebro simplifies and automates some of the most common system administration/ops headaches: building and scaling a cluster of virtual instances in the cloud, and covering common 'problematic scenarios', with the goal of minimizing infrastructure-related 3:00 AM headaches. For example, deploy a cluster of 4 MongoDB nodes with the push of a button. If, 3 weeks later, one of them somehow disappears on a Saturday at 3:00 AM when all your sysadmins are out on the town finishing their last round of drinks before last call, a new instance will automatically be provisioned, and Mongo deployed to the machine and booted, all without forcing anybody to sober up and get to a terminal.

Project Status

Cerebro has mostly been developed by me, with some help and brainstorming from Ryan Borgeouix. Most of the features described in the overview are at least partially implemented. It is also in use in at least one production environment and a number of development/testing environments. That said, it is sorely lacking in documentation, and getting a fully working stack up is probably not yet as easy as it will eventually be.

Our goal right now is to get a few more people using the system and contributing in one way or another. Patches, forks, bug reports, etc. are all more than welcome on the project's GitHub page, or you can contact me directly at zach@zachgoldberg.com with any issues you may be having.

Comparison to other tools

Puppet and Chef

Puppet's self-description: _Puppet Enterprise is IT automation software that gives system administrators the power to easily automate repetitive tasks, quickly deploy critical applications, and proactively manage infrastructure, on-premises or in the cloud._

Chef's self-description is similar. Anyway, both mean something along the lines of writing Ruby to describe your infrastructure, which then gives you tools to deploy that infrastructure. If a machine that you've deployed goes down, an admin will have to get a page (from another system, not Chef; maybe Nagios or New Relic) and spin up a new one manually.

Cerebro's focus is not on automating 'repetitive' tasks. Cerebro focuses on automating the random problematic events that wake you up in the morning, as well as one-time deployment tasks: processes that have bloated too much, instances that fall over, etc.

The key distinction here is that Cerebro focuses a lot on your code as a running entity in the cloud. That is: where is it, how many instances of it are there, is it alive, is it using too much RAM, etc.; all of the things you usually monitor with Nagios and the like. Chef and Puppet tackle problems earlier in the production lifecycle, namely the machine setup phase: ensuring code and configuration are deployed appropriately. Many deployments are very sophisticated and will need more automation in the code deployment phase than Cerebro provides, in which case Puppet and Chef may be better solutions. There may even be a world in which both tools are used in the same stack (Cerebro to manage machines and supervise processes, Puppet/Chef to do deployments), though this is not actively supported at present.

MCollective, Capistrano, Func and Fabric

These are self-described remote server automation and deployment tools: effectively, they manage servers that are already part of your cluster over SSH or similar. As with Puppet and Chef, the core focus is on an earlier part of the sysadmin/ops lifecycle than Cerebro. That said, Cerebro does have a remote automation step, during which it would be perfectly appropriate to invoke your Fabric/Chef/Func/MCollective scripts to set up a machine before the Cerebro process supervisor initiates and manages your jobs.

Current Capabilities

Getting Started in 6 steps

  1. Bundle your software into an easily deployable package (using Python buildout, an egg, a WAR file, a .deb, etc. If you don't already do this, it's not a bad idea to start now).
  2. Write a Python "deployment class" (see the docs below) for your package.
  3. Write a system-deployment configuration file which defines how many machines you want your code to run on and provides credentials for the cloud APIs you need (VM provider, DNS provider, etc.).
  4. Install cerebrod on a management server (e.g. via easy_install) and start the daemon with "cerebrod" (see cerebrod --help for more).
  5. Install cerebrod on your development machine and run "cerebro updatejobcfg" locally, passing it the configuration you created in step 3 and the location of your clustersitter.
  6. Check out the Cerebro web UI (http://managementserver:30000/overview) and watch things happen. You'll find the generated DNS names for your machines and lots more info there.

And, if all goes well and you need to scale up, that's a simple 2-step process (see the example below):

  1. Update your job config file to require more machines.
  2. Run "updatejobcfg" with your updated config file.
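For example, with the job configuration format shown later on this page, and assuming the cpu/mem fields in deployment_layout express how much capacity you want in each zone, scaling up is just a config edit followed by a re-run of updatejobcfg (numbers here are illustrative):

"deployment_layout": {
    "aws-us-west-2a": {
        "mem": 500,
        "cpu": 4      # was 1; the cluster sitter should provision machines to cover the difference
    },
    ...
}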

Under the Hood

Cerebro is made up of three parts: a Task Sitter, a Machine Sitter, and a Cluster Sitter.

Task Sitter

Task Sitter -- A harness to manage an arbitrary task or process.

Goal: instead of thinking about how many machines you need to run a process on, the task sitter forces the admin to think in terms of CPU and RAM, and to plan ahead of time how much of each resource a process should use.

The Task Sitter's job is to enforce the limits that the admin thinks a process should obey. It can handle the cases where a process disobeys these limits.
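To make that concrete, here is a minimal conceptual sketch of what "enforcing limits" can look like. This is not Cerebro's actual code; it uses the third-party psutil library and made-up thresholds purely for illustration:

import time

import psutil  # third-party library, used here only for illustration


def supervise(pid, max_cpu=0.5, max_mem_mb=1200):
    # Watch one process; if it exceeds its declared CPU/RAM budget, kill it.
    # (A real sitter would restart it according to its restart policy.)
    proc = psutil.Process(pid)
    while proc.is_running():
        cpu = proc.cpu_percent(interval=1.0) / 100.0          # fraction of one CPU
        mem_mb = proc.memory_info().rss / (1024.0 * 1024.0)   # resident memory in MB
        if cpu > max_cpu or mem_mb > max_mem_mb:
            proc.kill()
            break
        time.sleep(5)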

Together with a machine sitter, a machine can be completely managed to run various tasks efficiently within its resource constraints.

With a cluster sitter, an admin can define how many CPUs and how much RAM a particular task needs, and the cluster sitter can go out to the machines, look for available CPU and RAM where the process fits, and slot it in there.
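Conceptually, this placement is a bin-packing problem. Here is a simplified, hypothetical first-fit sketch (not the actual cluster sitter algorithm) of slotting a task onto a machine with spare capacity:

def place_task(task, machines):
    # task and machines are plain dicts with cpu/mem numbers, e.g.
    # task = {"cpu": 0.5, "mem": 1200}, machine = {"free_cpu": 2, "free_mem": 4000}
    for machine in machines:
        if machine["free_cpu"] >= task["cpu"] and machine["free_mem"] >= task["mem"]:
            machine["free_cpu"] -= task["cpu"]
            machine["free_mem"] -= task["mem"]
            return machine
    return None  # no machine has room; the cluster sitter would provision a new one here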

Task Sitter Details

Machine Sitter

Machine Sitter Details

Cluster Sitter

Cluster Sitter Details

Providers

Configuration


Cerebro Configuration File: # See settings.py

Example Job Configuration Format

{
    "dns_basename": "redis.startup.com",
    "deployment_recipe": "mystartup.recipes.deploy",
    "deployment_layout": {
        "aws-us-west-2a": {
            "mem": 500,
            "cpu": 1
        },
        "aws-us-east-1b": {
            "mem": 50,
            "cpu": 10
        }
    },
    "recipe_options": {
        # Passed as a dictionary to your jobs
        "release_dir": "/opt/startup/releases/"
    },
    "persistent": true,
    "task_configuration":
        {
            # Tasksitter configuration. 
            "allow_exit": false,
            "name": "Portal Server",
            "command": "/opt/code/run_portal_server",
            "auto_start": false,
            "ensure_alive": true,
            "max_restarts": -1,
            "restart": true,
            "uid": 0,
    "cpu": .5, # allow this job to use 50% of CPU
    "mem": 1200, # Allow this job to use 1.2GB of RAM
        }
},

Deployment Recipe Interface

import logging
logger = logging.getLogger(__name__)  # stdlib logging, for illustration

def run_deploy(options):
    # options: the recipe_options dict from your job configuration.
    logger.info("Running deploy: %s", options)  # API?  Kick off your chef recipe?  TODO: more structure needed here.
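For instance, given the example job configuration above, the recipe_options dictionary would be handed to your recipe, so the call would presumably look something like:

run_deploy({"release_dir": "/opt/startup/releases/"})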

DNS Setup

(This is all a bit complex right now; some simplification and optionality is needed.)

  1. Create #.PROVIDER_REGION.basename as an A record pointing to the machine
  2. Add another CNAME from PROVIDER_REGION.basename to #.PROVIDER_REGION.basename, etc.

So, if you point your servers at redis.startup.com, they should get either:

  1. If you're using global load balancing, a CNAME to one of aws-us-west-1.redis.startup.com or aws-us-east-1.redis.startup.com, based on the caller's location
  2. or, otherwise, both CNAMEs

Each regional CNAME resolves to an A record for each machine of that type, e.g. redis.startup.com -> aws-us-west-1.redis.startup.com -> 12.67.20.106.
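For concreteness, a hypothetical record set for dns_basename "redis.startup.com" with a single machine in us-west (the machine number and IP are illustrative):

0.aws-us-west-1.redis.startup.com.   A       12.67.20.106
aws-us-west-1.redis.startup.com.     CNAME   0.aws-us-west-1.redis.startup.com.
redis.startup.com.                   CNAME   aws-us-west-1.redis.startup.com.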

Security

I've had a few questions on Cerebro's security model; namely, that there is none. This is for two reasons: time, and the fact that it's not immediately obvious to me that one is required. Your cloud should, in an ideal world, be completely firewalled off from the outside world. All of Cerebro's management is done via HTTP connections on non-standard ports which should be accessible only within your firewalled cloud or VPC. To manage my machines within this environment I usually poke a hole or two with a reverse SSH port forward, or simply VPN beyond the firewall. This isn't a perfect scenario (anybody within your cloud can do some bad things), but it seems 'good enough' until somebody cares to beef up the internal security model.