Monitoring Far Beyond the Operating System - WeOp 2014

vechiato 11 views 21 slides May 16, 2024

About This Presentation

It discusses various aspects such as the implementation of monitoring solutions, the significance of automation in incident management, integration via APIs, prioritization of incidents, and the involvement of senior team members in the implementation and management processes. Additionally, it stres...


Slide Content

Monitoring
Far Beyond the
Operating System
WeOp 2014
Marcus Vechiato - @vechiato
http://weop.com.br

Agenda
⦿Goal
⦿How do we envision a monitoring system?
⦿From simple to complex
⦿What to monitor?
⦿What to track?
⦿Locaweb numbers
⦿Where some get lost
⦿Configuration automation
⦿ITIL and ITSM Tools
⦿Automatic Incident Creation
⦿Tools already being used
⦿Challenges

Goal
The objective of this presentation is to explore monitoring
implementations without focusing on tools.

Best practices highlighting what worked well and the
lessons learned from mistakes made over the years.

How do we envision a monitoring system?
⦿It's not just a tool.
⦿The monitoring tool is one of the components of the
process.
⦿Process - it can lead to bureaucracy if it's not effective.

Locaweb numbers
⦿Network
⚫Brocade / Cisco / Force10 and others
⦿~21k servers (physical and virtual)
⚫Windows (2003/2008/2012)
⚫Linux (CentOS/Red Hat/Debian)
⚫Oracle/MySQL/PostgreSQL/MSSQL/MongoDB
⚫VMware/Xen
⦿~500 thousand items/services monitored every minute
⦿~17 thousand incidents handled per month

From simple to complex
⦿Have a clear understanding of your biggest challenges to define your
objectives.
⦿Do not idealize the perfect system that will cover all the gaps, it does not
exist.
⦿Remember: what are your resources and what are the real skills of the
team.
⦿Prefer a gradual implementation with well-defined deliverables.

What to monitor?
⦿Core Services and Infrastructure - network/uninterruptible power
supply/temperature/DNS
⦿Operating System (memory/CPU/local network/disk) where applicable
⦿Applications
⚫User perspective (HTTP/TCP requests)
⚫Local (memory usage/threads/processes/etc.)
⦿Business Indicators/errors
⚫Example: Sales per hour
⚫Example: Authentication failures per minute
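
A business indicator such as "authentication failures per minute" can be derived from application events before it is handed to the monitoring tool. The sketch below is only an illustration: the event shape (a timestamp plus a success flag) and the function names are assumptions, not something the slides define.

```python
from collections import Counter
from datetime import datetime


def failures_per_minute(events):
    """Count failed authentications per calendar minute.

    `events` is an iterable of (timestamp, succeeded) tuples.
    This input shape is an assumption made for the example.
    """
    counts = Counter()
    for timestamp, succeeded in events:
        if not succeeded:
            # Truncate to the minute so all failures in it share one key.
            counts[timestamp.replace(second=0, microsecond=0)] += 1
    return dict(counts)


def minutes_over_threshold(counts, threshold):
    """Return the minutes whose failure count exceeds the threshold."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


if __name__ == "__main__":
    sample = [
        (datetime(2014, 5, 1, 10, 0, 5), False),
        (datetime(2014, 5, 1, 10, 0, 40), False),
        (datetime(2014, 5, 1, 10, 1, 2), True),
        (datetime(2014, 5, 1, 10, 1, 30), False),
    ]
    per_minute = failures_per_minute(sample)
    print(minutes_over_threshold(per_minute, threshold=1))
```

The threshold is a per-business decision; the point is that the indicator is computed from what users experience, not from OS counters.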

What to track?
Convert the view of infrastructure indicators to products/components/teams

⦿Dashboards for different audiences
⚫Operations
○KPI view by teams/infrastructure
⚫Ex.: MTTR of N1 incidents by priority
⚫Ex.: SLA and MTTR of storage abc
⚫Products/Business
○Common and specific indicator view
⚫Ex.: SLA of product xyz 99.89%
⚫Ex.: MTTR of product xyz 0h45m
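
As a rough sketch of how the two dashboard figures above could be computed from incident records, assuming each incident is an (opened, resolved) pair, incidents do not overlap, and every incident counts as full downtime (all three are simplifications made for this example):

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolve: average of (resolved - opened)."""
    if not incidents:
        raise ValueError("no incidents to average")
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def availability_percent(incidents, period_start, period_end):
    """Percentage of the period not covered by incidents.

    Assumes non-overlapping incidents that each mean a full outage.
    """
    downtime = sum((resolved - opened for opened, resolved in incidents),
                   timedelta())
    return 100.0 * (1 - downtime / (period_end - period_start))


def format_hm(delta):
    """Render a timedelta in the 0h45m style used on the slide."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


if __name__ == "__main__":
    start = datetime(2014, 4, 1)
    incidents = [
        (start + timedelta(days=1), start + timedelta(days=1, minutes=30)),
        (start + timedelta(days=5), start + timedelta(days=5, hours=1)),
    ]
    print(format_hm(mttr(incidents)))
    print(round(availability_percent(incidents, start,
                                     start + timedelta(days=30)), 2))
```

Grouping the incident list by product, team, or component before calling these functions gives the per-audience views the slide describes.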

Where some get lost
⦿It's a mistake to jump to the diagnosis: "the xyz tool doesn't work, we need a new one."
⦿Monitoring probe intervals are too short.
⦿Retries are important to reduce false positives.
⦿From my experience:
⚫Standard probe intervals range from 1 to 5 minutes
⚫Retries:
○5 minutes during deployment/with known instabilities.
○3 minutes in stable environments.
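
One common way to apply retries is to alert only after several consecutive failed probes, so a single blip does not open an incident. The class below is a minimal sketch of that idea; the name `CheckState` and the default of 3 attempts are assumptions for the example, and how the retry figures above map to an attempt count depends on your probe interval.

```python
class CheckState:
    """Track consecutive probe failures and alert only when confirmed."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, ok):
        """Feed one probe result; return "ALERT", "RECOVERY" or None."""
        if ok:
            was_alerting = self.alerting
            self.consecutive_failures = 0
            self.alerting = False
            return "RECOVERY" if was_alerting else None
        self.consecutive_failures += 1
        # Notify once, at the moment the failure is confirmed.
        if self.consecutive_failures >= self.max_attempts and not self.alerting:
            self.alerting = True
            return "ALERT"
        return None


if __name__ == "__main__":
    state = CheckState(max_attempts=3)
    for result in [False, False, True, False, False, False, True]:
        print(result, state.record(result))
```

A lone failure followed by a success never reaches the alert threshold, which is exactly the false positive the retries are meant to absorb.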

Configuration Automation
⦿Monitoring is the best place to start managing component installation and
configurations.
⚫Start with the monitoring agent (if available).
⚫Monitoring server
○Via API where possible
○Configuration files
⦿Which tool to use for automation?
⚫It depends on your environment and the team's knowledge. Chef and
Puppet are good options to start with.
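
For the "configuration files" path, the idea is that monitoring definitions are generated from an inventory instead of being edited by hand. The sketch below renders Nagios-style host definitions from a list of hosts. The inventory shape and the `generic-host` template name are assumptions for the example; in practice a tool such as Chef or Puppet would own both the inventory and the rendered file.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios-style host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_inventory(hosts):
    """Render all hosts, sorted by name so the output is stable."""
    ordered = sorted(hosts, key=lambda host: host["name"])
    return "\n".join(render_host(h["name"], h["address"]) for h in ordered)


if __name__ == "__main__":
    inventory = [
        {"name": "web02", "address": "10.0.0.2"},
        {"name": "web01", "address": "10.0.0.1"},
    ]
    print(render_inventory(inventory))
```

Sorting the output keeps diffs small between runs, which makes generated configuration easier to review.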

ITIL and ITSM Tools
⦿ITSM Tools
⚫I strongly recommend using one
⚫If you intend to manage incidents automatically, spend more time
evaluating which tool will be used
⦿Processes are the backbone
⚫Incident Management
⚫Problem Management
⚫Change Management
⦿CMDB - registration/control is mandatory
⚫In small installations, your monitoring tool is your CMDB
⚫In larger environments, you will need to synchronize it with the ITSM
tool

Automatic Incident Creation
Some benefits of automatic incident creation in larger environments:
⦿Addresses the inefficiency of manual incident logging
⦿Registers failures exactly when they occur
⦿Allows predefining the importance of each component/service and
prioritizing its resolution in case of failure
⦿Reduces informal incident resolution without logging
⦿Provides insight for in-depth analysis of the environment
⦿Integrated with crisis management, reduces resolution time and improves
related communication
⦿Enables realistic calculation of OLAs and SLAs

Automatic Incident Creation

⦿Integration via:
⚫API preferably (REST/SOAP)
⚫Email - with templates, most tools allow it (only use as a last resort)
⦿Set the priority when the incident is opened so the resolving team can
prioritize its work. According to ITIL, on a scale of 1-5:
⚫Priorities (think of a pyramid):
○1 and 2: should be less than 5% of incidents
○3: 20%
○4: 30%
○5: 45%
⦿For each priority, define different resolution OLAs. Remember that this will
directly affect the size of the team.
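
A quick way to see whether real incidents follow the pyramid is to compare the observed share of each priority with the targets above. This is a sketch only; the bucket names and function names are made up for the example.

```python
from collections import Counter

# Target shares from the slide, in percent. Priorities 1 and 2 share a bucket.
TARGET_SHARE = {"1-2": 5.0, "3": 20.0, "4": 30.0, "5": 45.0}


def bucket(priority):
    """Map a priority (1-5) to its pyramid bucket."""
    if priority not in (1, 2, 3, 4, 5):
        raise ValueError(f"unknown priority: {priority}")
    return "1-2" if priority in (1, 2) else str(priority)


def priority_shares(priorities):
    """Return the observed percentage of incidents in each bucket."""
    counts = Counter(bucket(p) for p in priorities)
    total = sum(counts.values())
    return {b: 100.0 * counts.get(b, 0) / total for b in TARGET_SHARE}


def top_heavy(shares):
    """True when priorities 1 and 2 are not below the 5% ceiling."""
    return shares["1-2"] >= TARGET_SHARE["1-2"]


if __name__ == "__main__":
    observed = [1] * 2 + [2] * 2 + [3] * 20 + [4] * 30 + [5] * 46
    shares = priority_shares(observed)
    print(shares, top_heavy(shares))
```

If the top of the pyramid keeps growing, either the priority assignments or the resolution OLAs need revisiting, since both drive team size.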

Automatic Incident Creation

⦿Automatically reopen an incident if it was resolved but
the check is still failing, or fails again within 30
minutes.
⦿Open a new incident if a new alarm fires more than 30
minutes after the last incident was resolved.
⦿Suppress incident creation during scheduled
maintenance
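
The three rules on this slide can be sketched as a single decision function. The function name, its return values, and the choice to treat an alarm at exactly 30 minutes as a reopen are assumptions for the example.

```python
from datetime import datetime, timedelta

REOPEN_WINDOW = timedelta(minutes=30)


def decide(alarm_time, last_resolved_at, in_maintenance):
    """Return "suppress", "reopen" or "create" for a new alarm.

    `last_resolved_at` is when the previous incident for this check was
    resolved, or None if there is no previous incident.
    """
    if in_maintenance:
        return "suppress"
    if (last_resolved_at is not None
            and alarm_time - last_resolved_at <= REOPEN_WINDOW):
        return "reopen"
    return "create"


if __name__ == "__main__":
    resolved = datetime(2014, 5, 1, 12, 0)
    print(decide(resolved + timedelta(minutes=10), resolved, False))
    print(decide(resolved + timedelta(minutes=45), resolved, False))
    print(decide(resolved + timedelta(minutes=10), resolved, True))
```

Reopening instead of creating keeps the history of a flapping component in one incident, which makes repeated failures visible to whoever resolved it the first time.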

Automatic Incident Creation

⦿Automatically closing incidents with status "no intervention" when
monitoring normalizes before the team intervenes allows:
⚫Refinement of the solution and its efficiency
⚫Adjustment of very tight thresholds
⚫Gathering information for opening Problems
⚫Identifying failures in planning/execution of changes
⚫Quickly resuming incident treatment after events with
hundreds/thousands of incidents opened in a short period of time
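
A minimal sketch of the auto-closure rule, plus a report of which checks most often close without intervention (candidates for threshold adjustment or a Problem record). The `Incident` fields and the `minimum` cutoff are assumptions for the example, not taken from any particular ITSM tool.

```python
from collections import Counter
from dataclasses import dataclass

NO_INTERVENTION = "closed: no intervention"


@dataclass
class Incident:
    check: str
    status: str = "open"
    touched_by_team: bool = False


def on_recovery(incident):
    """Close an untouched open incident when monitoring reports recovery."""
    if incident.status == "open" and not incident.touched_by_team:
        incident.status = NO_INTERVENTION
    return incident


def noisy_checks(incidents, minimum=2):
    """List checks that closed without intervention at least `minimum` times."""
    counts = Counter(i.check for i in incidents if i.status == NO_INTERVENTION)
    return [check for check, n in counts.most_common() if n >= minimum]


if __name__ == "__main__":
    history = [
        on_recovery(Incident("disk /var on db01")),
        on_recovery(Incident("disk /var on db01")),
        on_recovery(Incident("http on web01", touched_by_team=True)),
    ]
    print(noisy_checks(history))
```

Incidents the team has already started working on stay open, so auto-closure never hides work in progress.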

Tools already being used
⦿Monitoring (open source):
⚫Nagios
⚫Check_mk – Locaweb
⚫Zabbix
⦿ITSM:
⚫ServiceNow (API) – Locaweb
⚫CA – Service Desk Manager (API) – Locaweb
⚫HP – Service Center (API)
⚫OTRS – (API)

Challenges

⦿Golden Rule: "Every alarm must have a corrective action" even if it's just
adjusting the thresholds in case of false positives.
⦿Don't be fooled - in the beginning, you will have many false positives.
Persistence is key.
⦿If you don't close incidents automatically during instabilities, typically
network-related, you will be buried in incidents and will miss important
alarms when the instability ceases.

Challenges
⦿Who implements the solution and who administers day-to-day operations?
⚫Implementation of the solution: naturally the most Senior team/person.
⚫Who should enable monitoring on new systems? If you thought of
the intern or the junior members of the team, you're mistaken. It's also
the responsibility of the most senior members. It should be automated.

Challenges
More important than the tools are the people and adherence to the defined
processes, end-to-end.

Periodically revisit the processes to adjust and evolve according to current
needs.

If any process is not working, change it. Do not allow it to be abandoned or
circumvented.

Q&A ?