Lessons from Etsy: Avoiding Kitchen Nightmares - #ChefConf 2012
mcdonnps
May 16, 2012
About This Presentation
Talk by Patrick McDonnell (@mcdonnps) at #ChefConf 2012
Chef makes it so easy to change configuration en masse that it can be dangerous unless it is used with certain precautions and a well-thought-out testing workflow. At Etsy, we have devised many in-house best practices in response to failures, and these have helped greatly in avoiding catastrophic outages. This talk focuses on mistakes we've made and how we've avoided repeating them: enforcing standards in cookbooks, testing changes before rollout using environments together with the Spork plugin for Knife, and linting cookbooks with Foodcritic. I'll also talk about using handlers intelligently to monitor Chef runs and how to generate reports from the myriad data available in CouchDB.
How Do You Enforce This?
•
Documented standards and communicated best practices
•
Robust testing workflow
•
Environments
•
Knife Plugins
•
Linting with rules derived from standards
•
Foodcritic
Testing Workflow
How We Use Environments
•
Three environments: production, development, testing
•
Testing is unconstrained
•
Test nodes are depooled and “flipped” to the testing environment,
then repooled and analyzed
•
Test nodes are then flipped back to production
Working with Environments
•
knife-flip by Etsy engineer Jon Cowie
(https://github.com/jonlives/knife-flip)
% knife node flip somenode.etsy.com testing
% knife role flip SomeRole testing
•
knife-bulkchangeenvironment (https://github.com/jonlives/knife-bulkchangeenvironment)
% knife node bulk_change_environment testing production
Keeping Environments in Sync
•
knife-env-diff by Etsy engineer John Goulah
•
Get it at https://github.com/jgoulah/knife-env-diff
% knife environment diff development production
diffing environment development against production
cookbook: hadoop
development version: = 0.1.0
production version: = 0.1.8
cookbook: mysql
development version: = 0.2.4
production version: = 0.2.5
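The comparison behind this output can be sketched in a few lines of plain Ruby. This is a minimal illustration, not knife-env-diff's own code: the `diff_environments` helper is hypothetical, each environment is modelled as a simple Hash of cookbook name to constraint string, and the `ldap` entry is added only to show that matching constraints are skipped.

```ruby
# Hypothetical helper (not part of knife-env-diff): return the cookbooks whose
# version constraints differ between two environments.
# Each environment is a Hash of cookbook name => constraint string.
def diff_environments(left, right)
  (left.keys | right.keys).sort.each_with_object({}) do |cookbook, diff|
    l = left[cookbook]
    r = right[cookbook]
    diff[cookbook] = [l, r] unless l == r
  end
end

# The ldap entry is illustrative; identical constraints are not reported.
development = { "hadoop" => "= 0.1.0", "mysql" => "= 0.2.4", "ldap" => "= 0.1.27" }
production  = { "hadoop" => "= 0.1.8", "mysql" => "= 0.2.5", "ldap" => "= 0.1.27" }

diff_environments(development, production).each do |cookbook, (dev, prod)|
  puts "cookbook: #{cookbook}"
  puts " development version: #{dev}"
  puts " production version: #{prod}"
end
```

A cookbook present in only one environment shows up with `nil` on the other side, which is usually worth reviewing too.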
Introducing Knife Spork
•
Knife plugin providing a testing/versioning workflow
•
Authored by Jon Cowie
•
Get it at https://github.com/jonlives/knife-spork
Spork Features
•
Four stage process
•
Check: Look at versioning info for a cookbook
•
Bump: Automatically increment the cookbook’s version number
•
Upload: Knife upload and freeze
•
Promote: Set environment constraints equal to specified version
% knife spork check foodcritic
Checking versions for cookbook foodcritic...
Current local version: 0.0.4
Remote versions (Max. 5 most recent only):
*0.0.4, frozen
0.0.3, frozen
0.0.2, unfrozen
0.0.1, frozen
DANGER: Your local cookbook has same version number as the
starred version above!
Please bump your local version or you won't be able to
upload.
% knife spork bump foodcritic
Loaded config file /home/pmcdonnell/git/chef-repo/config/spork-config.yml...
Loaded config file /etc/spork-config.yml...
Pulling latest changes from git
Pulling latest changes from git submodules (if any)
Bumping patch level of the foodcritic cookbook from 0.0.4 to
0.0.5
Git add'ing /home/pmcdonnell/git/chef-repo/cookbooks/foodcritic/metadata.rb
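The patch-level bump shown above amounts to rewriting the version line in metadata.rb. A rough sketch, assuming a plain `major.minor.patch` version string; `bump_patch` and `bump_metadata` are hypothetical helpers, not knife-spork's implementation:

```ruby
# Hypothetical helper, not knife-spork's own code: increment the patch level
# of a "major.minor.patch" cookbook version string.
def bump_patch(version)
  parts = version.split(".")
  raise ArgumentError, "expected major.minor.patch, got #{version.inspect}" unless parts.size == 3
  major, minor, patch = parts.map { |part| Integer(part, 10) }
  [major, minor, patch + 1].join(".")
end

# Rewrite the first version line of a metadata.rb body passed in as a string.
def bump_metadata(metadata_source)
  metadata_source.sub(/^(\s*version\s+["'])(\d+\.\d+\.\d+)(["'])/) do
    "#{$1}#{bump_patch($2)}#{$3}"
  end
end

puts bump_patch("0.0.4")
puts bump_metadata(%(name "foodcritic"\nversion "0.0.4"\n))
```

Working on the string rather than the file keeps the sketch side-effect free; a real tool would also stage the changed file in git, as the output above shows.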
% knife spork promote foodcritic --remote
Pulling latest changes from git
Checking that foodcritic version 0.0.5 exists on the server
before promoting (any error means it hasn't been uploaded
yet)...
foodcritic version 0.0.5 found on server!
Environment: production
Adding version constraint foodcritic = 0.0.5
Saving changes into production.json
Git add'ing /home/pmcdonnell/git/chef-repo/environments/production.json
Uploading production to server
WARNING: You're about to promote changes to several
cookbooks:
logrotate: = 0.1.24 changed to = 0.1.23
foodcritic: = 0.0.4 changed to = 0.0.5
Are you sure you want to continue? (Y/N) n
You said no, so I'm done here.
Would you like to reset your local production.json to match
the server?? (Y/N) y
Git add'ing /home/pmcdonnell/git/chef-repo/environments/production.json
production.json reset.
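The warning above fires because the local production.json differs from the server in more than the cookbook being promoted, and one of the differences is a downgrade. A sketch of that kind of check, assuming every constraint is an exact `= x.y.z` pin; the helper names are hypothetical and this is not knife-spork's code:

```ruby
require "rubygems" # provides Gem::Version for version comparison

# Strip the "= " operator and parse the version (assumes exact pins only).
def pinned_version(constraint)
  Gem::Version.new(constraint.sub(/\A=\s*/, ""))
end

# True when the new pin is lower than the old one.
def downgrade?(before, after)
  return false if before.nil? || after.nil?
  pinned_version(after) < pinned_version(before)
end

# Hypothetical pre-promote check: list every constraint that differs between
# the server's copy of an environment and the local copy.
def constraint_changes(server, local)
  (server.keys | local.keys).sort.each_with_object([]) do |cookbook, changes|
    before = server[cookbook]
    after = local[cookbook]
    next if before == after
    changes << { cookbook: cookbook, before: before, after: after,
                 downgrade: downgrade?(before, after) }
  end
end

server = { "logrotate" => "= 0.1.24", "foodcritic" => "= 0.0.4" }
local  = { "logrotate" => "= 0.1.23", "foodcritic" => "= 0.0.5" }

constraint_changes(server, local).each do |change|
  flag = change[:downgrade] ? " (downgrade)" : ""
  puts "#{change[:cookbook]}: #{change[:before]} changed to #{change[:after]}#{flag}"
end
```

Seeing more than one change when only one cookbook was promoted is the signal that the local checkout is stale.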
Spork’s Logging Mechanisms
•
Irccat: Logs to IRC channel (https://github.com/RJ/irccat)
Environment production uploaded at 2012-05-15 18:35:42 UTC by pmcdonnell
Constraints updated on server in this version:
ldap: = 0.1.26 changed to = 0.1.27
•
Gist: Added to irccat notifications on promote --remote
•
Graphite: promote --remote sends to deploys.chef metric
[11:35:33] <irccat> CHEF: pmcdonnell uploaded and froze cookbook ldap version 0.1.27
[11:35:43] <irccat> CHEF: pmcdonnell uploaded environment production
https://github.etsycorp.com/gist/376967
[11:35:43] <irccat> CHEF: pmcdonnell uploaded environment development
https://github.etsycorp.com/gist/376968
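Recording a deploy in Graphite comes down to writing one line of its plaintext protocol, `<metric> <value> <timestamp>`, to a Carbon listener. A minimal sketch, assuming the default plaintext port 2003; the host is a placeholder and `send_deploy_metric` is a hypothetical helper, not Spork's code:

```ruby
require "socket"

# Build one line of Graphite's plaintext protocol: "<metric> <value> <timestamp>\n".
def graphite_line(metric, value, time = Time.now)
  "#{metric} #{value} #{time.to_i}\n"
end

# Send a single deploy event to a Carbon listener.
# Host is a placeholder supplied by the caller; 2003 is Carbon's usual plaintext port.
def send_deploy_metric(host, port = 2003, metric = "deploys.chef")
  TCPSocket.open(host, port) { |socket| socket.write(graphite_line(metric, 1)) }
end

# Format only; nothing is sent over the network here.
puts graphite_line("deploys.chef", 1, Time.utc(2012, 5, 15, 18, 35, 42))
```

Drawing the metric as vertical lines over other graphs makes it easy to correlate a Chef promote with a change in system behavior.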
Linting
Foodcritic
•
A lint tool for Chef cookbooks written by Andrew Crump
(http://acrmp.github.com/foodcritic/)
•
Comes with a good set of default rules and is very easily extensible
•
To enable in spork config:
foodcritic:
enabled: true
fail_tags: [any]
tags: [foo]
include_rules: [/home/me/myrules]
Etsy’s Rules
•
A work in progress, but newly open-sourced at
https://github.com/etsy/foodcritic-rules
•
Our rules are “style”-tagged rules that serve to enforce what we
consider to be best practices in our environment
•
ETSY001 - Package or yum_package resource used with :upgrade action
•
ETSY002 - Execute resource used to run git commands
•
ETSY003 - Execute resource used to run curl or wget commands
•
ETSY004 - Execute resource defined without conditional or action :nothing
•
ETSY005 - Action :restart sent to a core service
Rule Resulting from Image Outage
•
ETSY005 - Action :restart sent to a core service
•
Trippable services include httpd, mysql, memcached, postgresql-server
% foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/apache
ETSY005: Action :restart sent to a core service: /home/pmcdonnell/git/chef-repo/cookbooks/apache/recipes/default.rb:39
02:27 < jallspaw> [Sat, 10 Jul 2010 01:45:01 +0000]
INFO: Upgrading package[memcached] version from
1.4.2-1.fc10 to 1.4.5-1.el5
Don’t leave
“known unknowns”
lying in wait
Resulting Foodcritic Rule
•
ETSY001 - Package or yum_package resource used with :upgrade action
•
Enforces always using :install
% foodcritic -t etsy -I ~/git/chef-repo/config/rules.rb ~/git/chef-repo/cookbooks/memcache
ETSY001: Package or yum_package resource used with :upgrade action: /home/pmcdonnell/git/chef-repo/cookbooks/memcache/recipes/default.rb:20
Resulting Foodcritic Rule
20 package "memcached" do
21 action :upgrade
22 end
Changed to:
20 package "memcached" do
21 version "1.4.2-1.fc10"
22 action :install
23 end
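To make the idea of the check concrete, here is a deliberately simplified stand-in in plain Ruby. A real Foodcritic rule is written in Foodcritic's rule DSL and inspects the parsed recipe; this sketch only pattern-matches the text, does not handle nested blocks, and `upgrade_action_lines` is a hypothetical name.

```ruby
# Simplified stand-in for a lint rule (not Foodcritic's API): report the line
# number of every package or yum_package resource whose block contains
# "action :upgrade". Nested blocks are not handled.
def upgrade_action_lines(recipe_source)
  offending = []
  resource_line = nil
  recipe_source.each_line.with_index(1) do |line, number|
    if line =~ /^\s*(?:yum_)?package\s+["'][^"']+["']\s+do\b/
      resource_line = number          # remember where the resource starts
    elsif resource_line && line =~ /^\s*action\s+:upgrade\b/
      offending << resource_line      # flag the resource, not the action line
    elsif line =~ /^\s*end\b/
      resource_line = nil             # resource block closed
    end
  end
  offending
end

before = <<~RECIPE
  package "memcached" do
    action :upgrade
  end
RECIPE

after = <<~RECIPE
  package "memcached" do
    version "1.4.2-1.fc10"
    action :install
  end
RECIPE

puts upgrade_action_lines(before).inspect
puts upgrade_action_lines(after).inspect
```

Reporting the resource's first line mirrors the foodcritic output above, which points at the line where the package resource begins.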
Reporting and Monitoring
Using Handlers
•
Etsy’s handlers (https://github.com/etsy/chef-handlers)
•
Log failures to IRC
•
Graph aggregated metrics with Graphite
•
Graph chef “deploys”
[10:52:03] <irccat> Chef run failed on dev-dbtasks01.ny4dev.etsy.com
[10:52:03] <irccat> https://github.etsycorp.com/gist/371229
Graph with Graphite
•
Metrics reporting made possible by knife-lastrun, authored by
John Goulah (https://github.com/jgoulah/knife-lastrun)
•
Provides a handler and knife plugin for reporting on the most recent
chef run, storing data as node attributes
•
Elapsed, starting, and ending time
•
Exit code status
•
Backtrace/exception information
% dsh -g all -c -M 'grep "Chef Run complete in" /var/log/chef/client.log | head -n 3' 2>&1 | tee /tmp/tee && grep 'Chef Run complete' /tmp/tee | sort -n -k +13 | tail -5
dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef
Run complete in 512.936813012 seconds
dn0004.doop: [Mon, 14 May 2012 04:28:03 +0000] INFO: Chef
Run complete in 677.423964906 seconds
dn0006.doop: [Mon, 14 May 2012 04:29:51 +0000] INFO: Chef
Run complete in 770.231469266 seconds
dn0025.doop: [Mon, 14 May 2012 04:26:13 +0000] INFO: Chef
Run complete in 787.183615612 seconds
dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef
Run complete in 848.586507872 seconds
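The same ranking can be done by parsing the log lines rather than sorting on a field position. A small sketch, assuming one `host: [timestamp] INFO: Chef Run complete in N seconds` record per line as in the dsh output above; `slowest_runs` is a hypothetical helper:

```ruby
# Matches "host: ... Chef Run complete in <seconds> seconds".
RUN_COMPLETE = /\A(?<host>[^:\s]+):.*Chef Run complete in (?<seconds>\d+(?:\.\d+)?) seconds/

# Return the slowest runs as [host, seconds] pairs, slowest last.
def slowest_runs(lines, count = 5)
  runs = lines.map do |line|
    match = RUN_COMPLETE.match(line)
    [match[:host], match[:seconds].to_f] if match
  end.compact
  runs.sort_by { |_, seconds| seconds }.last(count)
end

log = [
  "dn0006.doop: [Mon, 14 May 2012 04:29:51 +0000] INFO: Chef Run complete in 770.231469266 seconds",
  "dn0035.doop: [Mon, 14 May 2012 03:21:07 +0000] INFO: Chef Run complete in 512.936813012 seconds",
  "dn0030.doop: [Mon, 14 May 2012 04:30:42 +0000] INFO: Chef Run complete in 848.586507872 seconds"
]

slowest_runs(log).each { |host, seconds| puts "#{host} #{seconds}" }
```

Matching on the message text avoids depending on how many whitespace-separated fields precede the elapsed time.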
Finding Run Time Outliers
•
Knife doesn’t currently support Lucene’s NumericRangeQuery
•
Elapsed time is a floating point number, but we can only match it as a
string due to query limitations in knife
•
Work around it with knife search -a
% knife search node 'elapsed:[200 TO 225]' -a lastrun.runtimes.elapsed
4 items found
id: cent6-vmtemplate.ny4dev.etsy.com
lastrun.runtimes.elapsed: 21.642378406
id: sandboxmisc01.ny4.etsy.com
lastrun.runtimes.elapsed: 211.749555
id: smardenfeld.vm.ny4dev.etsy.com
lastrun.runtimes.elapsed: 22.184596
id: bob0120.vm.ny4dev.etsy.com
lastrun.runtimes.elapsed: 21.348335354
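The results above are what a string comparison would produce: "21.64..." sorts between "200" and "225" character by character. The sketch below reproduces that effect and then filters numerically on the client side. It assumes the range is evaluated lexicographically, which matches the output shown but is not a statement about the search server's internals; both helper names are hypothetical.

```ruby
# Elapsed times as returned by the search above (host => elapsed seconds as text).
ELAPSED = {
  "cent6-vmtemplate.ny4dev.etsy.com" => "21.642378406",
  "sandboxmisc01.ny4.etsy.com"       => "211.749555",
  "smardenfeld.vm.ny4dev.etsy.com"   => "22.184596",
  "bob0120.vm.ny4dev.etsy.com"       => "21.348335354"
}.freeze

# Lexicographic range test, the way a string-typed field would be matched.
def string_range_match?(value, low, high)
  value.to_s.between?(low.to_s, high.to_s)
end

# Numeric range test applied after fetching the attribute values.
def numeric_range(nodes, low, high)
  nodes.select { |_, seconds| Float(seconds).between?(low, high) }
end

ELAPSED.each do |host, seconds|
  puts "#{host}: string match=#{string_range_match?(seconds, 200, 225)}"
end
puts numeric_range(ELAPSED, 200, 225).keys.inspect
```

All four hosts pass the string test, while only the host with a run time actually between 200 and 225 seconds survives the numeric filter.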
% knife node lastrun sandboxmisc01.ny4.etsy.com
Status failed
Elapsed Time 211.78604
Start Time 2012-05-15 07:43:18 +0000
End Time 2012-05-15 07:46:50 +0000
Backtrace
Omitted for brevity
Exception
Chef::Exceptions::Package: package[diffutils]
(installerz::diffutils line 1) had an error: Yum failed -
#<Process::Status: pid 21293 exit 1> - returns: ["yum-dump
Repository Error: Cannot retrieve repository metadata
(repomd.xml) for repository: PostgreSQL-8.3-x86_64. Please
verify its path and try again"]
What Did Chef Just Do?
•
chefrecentupdates by Etsy engineer Laurie Denness
(https://github.com/lozzd/ChefScripts)
% chefrecentupdates
...
1 resources updated in /var/log/chef/client.log-20120505.gz:
[Fri, 04 May 2012 17:49:42 +0000]
INFO: cookbook_file[/usr/bin/gist]
...
Preventative Measures
Knife Preflight
% knife preflight memcache::datacache
Searching for nodes containing memcache::datacache in their expanded
run_list...
4 Nodes found
datacache03.ny4.etsy.com
datacache04.ny4.etsy.com
datacache01.ny4.etsy.com
datacache02.ny4.etsy.com
Searching for roles containing memcache::datacache in their run_list...
1 Roles found
Datacache
Found 4 nodes and 1 roles using the specified search criteria
•
By Jon Cowie (https://github.com/jonlives/knife-preflight)
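The two searches in the preflight output can be illustrated with plain data structures. This is a sketch under assumptions: nodes and roles are modelled as Hashes of name to run list, `expand_run_list` and `preflight` are hypothetical helpers rather than the plugin's code, and `web01.example.com` is an invented node included only as a non-match.

```ruby
# Expand role[...] entries recursively into the recipes they contain.
# "seen" guards against roles that include each other.
def expand_run_list(run_list, roles, seen = [])
  run_list.flat_map do |item|
    if item =~ /\Arole\[(.+)\]\z/
      role = $1
      next [] if seen.include?(role)
      expand_run_list(roles.fetch(role, []), roles, seen + [role])
    else
      [item[/\Arecipe\[(.+)\]\z/, 1] || item]
    end
  end
end

# Find nodes whose expanded run list contains the recipe, and roles that
# reference it directly.
def preflight(recipe, nodes, roles)
  {
    nodes: nodes.select { |_, run_list| expand_run_list(run_list, roles).include?(recipe) }.keys.sort,
    roles: roles.select { |_, run_list| run_list.include?("recipe[#{recipe}]") }.keys.sort
  }
end

roles = { "Datacache" => ["recipe[memcache::datacache]"] }
nodes = {
  "datacache01.ny4.etsy.com" => ["role[Datacache]"],
  "datacache02.ny4.etsy.com" => ["role[Datacache]"],
  "web01.example.com"        => ["recipe[apache]"] # invented non-matching node
}

result = preflight("memcache::datacache", nodes, roles)
puts "#{result[:nodes].size} Nodes found"
puts result[:nodes]
puts "#{result[:roles].size} Roles found"
puts result[:roles]
```

Knowing the blast radius before uploading a change is the point: a recipe used by four cache nodes warrants a different rollout than one used by a single sandbox.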
Continuous Chef
•
Using Jenkins and base virtual machine images
“Out-of-Band” Management
•
dsh (distributed shell) works even if Chef server is down
•
Etsy’s dsh groups are managed by Chef and generated from the list of
nodes corresponding to each role
Configs Bundled with Packages
•
Be careful: config files shipped inside packages can overwrite
Chef-managed configs when the package is installed or upgraded
•
Chef must put its own version back before any service restarts, so
watch out for resource order
Jon will be at Velocity!
•
Workshop: Michelin Starred Cooking with Chef
•
11:00am Monday, 06/25/2012
•
Topics
•
Team-wide familiarity and understanding
•
Critical approach and experimentation with workflows
•
Plugin writing 101
We’re Hiring!
•
TONS of engineering positions open!
•
Especially looking for a talented network engineer; referrals welcome!
http://www.etsy.com/careers