KCD Porto: Choose Your Own Adventure - Cloud Native Observability Pitfalls

eschabell 79 views 63 slides Sep 08, 2024

About This Presentation

Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? When you're moving so fast with agile practices across your DevOps, SREs, and platform e...


Slide Content

Choose Your Own Adventure: Cloud Native Observability Pitfalls. Eric D. Schabell, Director Evangelism, chronosphere.io, @ericschabell{@fosstodon.org}. KCD Porto, 27-28 Sep 2024

Cloud Native Observability

Cloud Native

Data volume experiment: Hello World app on a 4-node Kubernetes cluster with tracing, End User Metrics (EUM), logs, and metrics (containers / nodes). 30 days == +450 GB
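
To put the experiment's figure in perspective, the arithmetic can be sketched as below. Only the 450 GB, 30 days, and 4 nodes come from the slide. The 100-node projection and the assumption that volume scales linearly with node count are hypothetical, added for illustration only.

```python
# Figures taken from the slide: a 4-node cluster produced about 450 GB in 30 days.
TOTAL_GB = 450
DAYS = 30
NODES = 4


def daily_volume_gb(total_gb: float, days: int) -> float:
    """Average telemetry volume per day for the whole cluster."""
    return total_gb / days


def per_node_daily_gb(total_gb: float, days: int, nodes: int) -> float:
    """Average telemetry volume per node per day."""
    return total_gb / days / nodes


def projected_monthly_gb(per_node_daily: float, nodes: int, days: int = 30) -> float:
    """ASSUMPTION: volume grows linearly with node count (not stated in the slides)."""
    return per_node_daily * nodes * days


daily = daily_volume_gb(TOTAL_GB, DAYS)
per_node = per_node_daily_gb(TOTAL_GB, DAYS, NODES)
# Hypothetical 100-node cluster, for illustration only.
projection_100_nodes = projected_monthly_gb(per_node, nodes=100)

print(f"{daily} GB/day, {per_node} GB/node/day, {projection_100_nodes} GB/month at 100 nodes")
```

Real workloads rarely scale this cleanly, so treat the projection as a rough order-of-magnitude estimate rather than a forecast.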

Retention

Cloud Native at Scale

Observability…

Cloud Native Observability at Scale

O11y at Scale (need)

Picking Your Pitfalls: Ignoring existing landscape; Focusing on The Pillars; Sneaky sprawling mess; Controlling costs; The protocol jungle; Underestimating cardinality

1. Ignoring existing landscape

If they can’t see me… they can’t hurt me ...

Prometheus for metrics, alerting, queries

Prometheus auto discovery

Manual instrumentation (java client lib)

Short link: bit.ly/prom-workshop

OpenTelemetry (Auto) instrumentation [diagram: Applications (Java) with OTel Auto Instrumentation (libraries), OTel API, and OTel SDK, connected over OTLP to an OTel Collector]

OpenTelemetry Collector (agent) [diagram: on a single host, Applications with OTel Auto Instrumentation, OTel API, and OTel SDK, plus an OTel Collector Agent, connected over OTLP to an Observability Backend (Prometheus, Jaeger, Fluent Bit, etc.)]

OTel Collector Gateway [diagram: on multiple hosts, Applications with OTel Auto Instrumentation, OTel API, and OTel SDK, plus OTel Collector Agents, connected over OTLP to a Collector (gateway) and an Observability Backend (Prometheus, Jaeger, Fluent Bit, etc.)]

Short link: bit.ly/opentelemetry-workshop

2. Focusing on The Pillars

Pillars Phases

Developer Technology Bottom up

Pillar problems…

Car is on fire…

Better outcomes… Faster remediation… Easier detection… Happier customers…

Phase 1: Know something is happening as fast as possible…

Phase 2: Triage with specific information…

Phase 3: Understand, to ensure it never happens again…

3. Sneaky sprawling mess

Over 66% of organizations use more than 10 different observability tools (ESG report on exploding data volumes)

Know Triage Understand

4. Controlling costs

“It’s remarkable how common this situation is, where an organization is paying more for their observability data, than they do for their production infrastructure.” ?

O11y data storage costs are broken. Keeping everything model?

Know the cost of observability metrics data?

Control costs and improve productivity [diagram: Observability Platform (Data Collection, Control Plane, Store, Lens); Telemetry Pipeline (Reduce, Enrich, Secure); transform and route data in your environment; store data in third-party log & SIEM solutions]

Chronosphere named a Leader in the 2024 Gartner® Magic Quadrant™ for Observability Platforms Gartner, Magic Quadrant for Observability Platforms: By Gregg Siegfried, Padraig Byrne, Mrudula Bangera, Matt Crossley (12 August 2024) GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose https://chronosphere.io/2024-gartner-magic-quadrant

5. The protocol jungle

Without open standards, you’ll not find a way back…

OpenTelemetry Collector (agent) [diagram: on a single host, Applications with OTel Auto Instrumentation, OTel API, and OTel SDK, plus an OTel Collector Agent, connected over OTLP to an Observability Backend (Prometheus, Jaeger, Fluent Bit, etc.)]

Prometheus for metrics, alerting, queries

6. Underestimating cardinality
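
A minimal sketch of why cardinality is easy to underestimate, assuming the usual metrics-system meaning: each unique combination of label values is its own time series, so the worst-case series count for one metric is the product of the distinct values per label. The label names and counts below are hypothetical and do not come from the slides.

```python
from math import prod


def series_cardinality(label_values: dict[str, int]) -> int:
    """Worst-case number of unique time series for a single metric.

    Each key is a label name; each value is how many distinct values
    that label can take. Every combination is a separate series.
    """
    return prod(label_values.values())


# HYPOTHETICAL label sets, for illustration only.
base_labels = {"service": 20, "endpoint": 10, "status_code": 5}
with_pod = {**base_labels, "pod": 50}
with_user_id = {**with_pod, "user_id": 10_000}

base_series = series_cardinality(base_labels)
pod_series = series_cardinality(with_pod)
user_series = series_cardinality(with_user_id)

print(base_series, pod_series, user_series)
```

In this sketch, adding a single unbounded label such as a user ID multiplies the series count by the number of users, which is how a modest metric can grow into hundreds of millions of series.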

The struggle is real: “I don't yet collect spans/traces because I can hardly get our devs to care about basic metrics, let alone traces.” “This is a large enterprise with approx. 1000 developers. Cultivating a culture of engineering that cares about availability is a challenge that we need to solve alongside any technical implementations.”

10 hours per week, on average, spent trying to triage and understand incidents: a quarter of a 40-hour work week

33% said those issues disrupted their personal life; 39% admitted they are frequently stressed out

Cloud Native Observability at Scale

Control costs and improve productivity [diagram: Observability Platform (Data Collection, Control Plane, Store, Lens); Telemetry Pipeline (Reduce, Enrich, Secure); transform and route data in your environment; store data in third-party log & SIEM solutions]

What should be the #1 item on your cloud native observability wishlist?

Questions? Eric D. Schabell, Director Evangelism, chronosphere.io, @ericschabell{@fosstodon.org}