Telemetry: The Overlooked Treasure in Axon Server-Centric Applications

RichardBouka 42 views 56 slides Oct 03, 2024

About This Presentation

This talk delves into the often-underappreciated telemetry capabilities of Axon Server, demonstrating its pivotal role in enhancing microservice architectures.

While the benefits of modularity, CQRS + ES (Event Sourcing), and location transparency brought by Axon are widely recognized, Axon Server...


Slide Content

Telemetry: The Overlooked Treasure
in Axon Server-Centric Applications
Richard Bouška

The most uninteresting topic

Maintenance & support
●Reliable and effective administration of your
application portfolio
●Flexible on-demand analyses, training and
support for your business


ASSIST / me
DevOps & Observability
●Qualified technical and technological
support
●Solution automatisation
●Supervision of problem-free operation of
applications


Software Engineering
●Modern and professional custom-made SW
solutions
●Wide technological stack with emphasis on
cloud solutions


Low Code & Data Platforms
●Data transformation
●Easy-to-understand data visualisation


32
years on the market
100+
employees
500+
apps in our portfolio
250k+
delivered MD

ASSIST / me
Richard Bouška
ASSIST CTO

Our perception through the years
●2021: CQRS/ES & DDD … not only, it's
●2022: easy microservices … not only, it's
●2023: location transparency with awareness … not only, it's
●2024: Telemetry
●2025: …

Hmm, metrics
is that a good idea anyway?

You will get stressed (Don’t Panic!)
●Peril Sensitive Sunglasses
○Joo Janta 200 Super-Chromatic Peril Sensitive Sunglasses:
designed to help people develop a relaxed attitude to danger.
○At the first hint of trouble, they turn totally black and thus
prevent you from seeing anything that might alarm you.

You can break your system

Prometheus, Grafana, Dynatrace
●PromQL
●Micrometer
●There are plenty of different metrics that Axon Server exposes
●There are even more metrics that Axon Framework exposes
●While we are at it … the JVM and Spring Boot expose even more
●and Dynatrace
●Counters, Gauges, Timers,
●Distributions, Histograms, Natural Histograms
●Percentiles, Mean, Median
●Sampling (theorem) and aliasing effects
●Grafana
●Dashboarding
●Layout systems
●Color perception
●Alerting
●New infrastructure

One day you resolve the last obstacle
Richard Bouška:
Last crucial blocker resolved:
Prometheus has dark mode.
Ondra Halata:
Well, if it has dark mode, then I'll start working on it
immediately. Until now, I wasn't convinced of its priority.

Statistics, sampling, convolution
Avg over 30 sec → max 1200
Avg over 1 min → max 700
Avg over 5 min → max 300
Avg over 15 min → max 150
Avg over 30 min → max 110
Avg over 1 h → max 70
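
The maxima on this slide shrink as the averaging window grows. A minimal sketch of why, using a synthetic series (the baseline, the burst and the helper name `max_windowed_average` are assumptions for illustration, not data from the talk): a short burst is diluted by every quiet second that shares its window.

```python
def max_windowed_average(per_second_counts, window_seconds):
    """Return the largest mean over any contiguous window of the given length."""
    if window_seconds <= 0 or window_seconds > len(per_second_counts):
        raise ValueError("window must fit inside the series")
    # Sliding-window sum: add the newest sample, drop the oldest one.
    window_sum = sum(per_second_counts[:window_seconds])
    best = window_sum
    for i in range(window_seconds, len(per_second_counts)):
        window_sum += per_second_counts[i] - per_second_counts[i - window_seconds]
        best = max(best, window_sum)
    return best / window_seconds


# Synthetic one-hour series (assumption): 10 calls/s baseline
# with a single 30-second burst of 1200 calls/s.
series = [10] * 3600
for second in range(1000, 1030):
    series[second] = 1200

for window in (30, 60, 300, 900, 1800, 3600):
    peak = max_windowed_average(series, window)
    print(f"avg over {window:>4} s -> max {peak:.1f}")
```

The same burst reads as 1200 calls/s with a 30-second window and far less with an hour-long one, so a "maximum" is only meaningful together with the window it was averaged over.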

Statistics, sampling theorem, math
What is the maximum number of UpdateAppointment calls per second?

But there must be some
benefits as well, no?

They are visually pleasant

Metrics are useful

What you get… (for free)
●Application architecture cleanup
●3rd party performance and stability monitoring
●Global system stability wearout
●Performance tuning support
●Finops / Green IT support data
●Post deployment regression detection
●Performance degradation detection
●3rd party component bugs detection
●Business-oriented metrics
●CQRS ramp-up support for newcomers
When you use a tool or system with an understanding of how it operates best.
You don't have to be an engineer to be a racing driver, but you do have to have
Mechanical Sympathy.
Jackie Stewart, racing driver
When you understand how a system is designed to be used, you can align
with the design to gain optimal performance.

Basic server metrics
●Axon Server runs in a JVM → heap memory was our concern …
○Good to check sometimes but no issues here ever … just properly size your servers

First useful dashboard
Event store, the source of truth … maintain the disk space or …
●90 days 12.3% → 15.5%
○→ we still have plenty of time
You don't want to hit 100% … keep your system under ~80%
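
A minimal sketch of the "we still have plenty of time" estimate, assuming disk usage keeps growing linearly at the rate observed over the 90 days (the function name and the linear-growth assumption are illustrative, not from the talk):

```python
def days_until_threshold(start_pct, end_pct, observed_days, threshold_pct):
    """Linear extrapolation: days until usage reaches threshold_pct."""
    growth_per_day = (end_pct - start_pct) / observed_days
    if growth_per_day <= 0:
        return float("inf")  # usage is not growing, threshold is never reached
    return max(0.0, (threshold_pct - end_pct) / growth_per_day)


# Figures from the slide: 12.3% -> 15.5% over 90 days, target ceiling ~80%.
days = days_until_threshold(12.3, 15.5, 90, 80.0)
print(f"~{days:.0f} days (~{days / 365:.1f} years) until 80% at the current rate")
```

At this rate the 80% mark is roughly 1,800 days away. Real growth is rarely linear, so the estimate is worth re-checking whenever traffic or event sizes change.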

Basic Axon Server metrics
# of apps connected to individual nodes per context
Active applications → better than /health
# of apps events processed per node
Since node restart per context
3 nodes, 3 contexts, last 90 days
Last event token #
Since context created (they should overlap)

Counters are used for … Events
Application activity
(indirect info)
Node Load balance
(indirect info)
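
Because these counters start again from zero when a node restarts, activity over a period is better read from their increase than from their raw value. A minimal sketch, assuming any drop between two samples means a restart (similar in spirit to how Prometheus treats counter resets; the function name and sample values are hypothetical):

```python
def counter_increase(samples):
    """Total increase of a monotonically increasing counter, tolerating resets.

    Any drop between consecutive samples is treated as a restart from zero,
    so the whole post-reset value counts as new increase.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from 0
    return total


# Hypothetical scrape values of an events-processed counter; the node
# restarts between the third and fourth sample.
samples = [100, 250, 400, 30, 90]
print(counter_increase(samples))
```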

Now down the rabbit hole
Few stories from the battlefield
… but before we go a reminder

Teams work together
SRE/Observability team DEVOPS team Development team
Especially for DDD and CQRS architectures:
●co-locate the Dev, DEVOPS and Observability teams.
○Instant feedback after deployment
○Flexible deployment support (same time zone)
○Made-to-measure ad-hoc dashboards
○Simple communication about new metrics definitions
○Easier communication → improved understanding of the architecture
Architecture!!

Business as usual behavior
Mere observation effect (Look and See)
●How many trains did not crash?
●How many Chernobyls did not explode?
Are we sure about our backup strategy
“Invisible” bug in command handling
Input file parsing retry bug
Eyesessment!

20 segments/week
Is event segment size well chosen?
●What is your backup strategy?
●Which files do you need to backup?
●By sizing event store segments properly, you can save backup storage.
Setting Event Store Segment Size - closed segments (last 30 days)
1 segment/week 1 segment/3 weeks

Snapshots: better late than later
~30ms mean handling time
~1.5s mean handling time
26 event segment files
6 event segment files
file.segment.open
Counts the number of event store
segments opened since start
local.aggregate.segments
Monitors the number of segments that were
accessed for reading aggregate event
requests
300 event segment files
100 event segment files
285

(Un)Subscription Queries
Do we have serious leak in the SubscriptionQueries unsubscribe
handling?

Is the system tired each evening?
10 ms in the morning, 45 ms in the evening

Is the application converging (DDD)?
Usual application behavior
(average over last 7 weeks)
Current application
behavior

Event Processor Quiz
Event Replay (events that were ever processed in our system)
●Do you see the deployments / optimizations?
An event that made “more sense” was added
●AllReferencesMarkedAsUrgent instead of iterating over
ReferenceMarkedAsUrgent for all references currently in the basket
ReferenceMarkedAsUrgent
AllReferencesMarkedAsUrgent

Quiz 2 Query Handlers
We have 2 instances of QueryHandler, so why is only one called?

Event Processor Replay Quiz
eventProcessor_latency:

Measures the difference between an
event's timestamp and the current
time, showing how far behind an event
processor is.
●How many processors?
●Are they processed in order?
●Why is the “last” value != 0?
How long was the replay operation?
●~10 hours!
Is the 10-hour replay time important?
●IDK … it depends
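
A minimal sketch of the latency definition quoted on this slide (current time minus the event's timestamp). The timestamps and the helper name are hypothetical, and this is not the framework's implementation:

```python
from datetime import datetime, timedelta, timezone


def processor_latency(event_timestamp, now):
    """Latency as defined on the slide: current time minus the event's timestamp."""
    return now - event_timestamp


# Hypothetical moment at which the processor handles each event.
now = datetime(2024, 10, 3, 12, 0, 0, tzinfo=timezone.utc)

# During a replay the processor handles old events, so the latency is large.
replayed_event = now - timedelta(hours=10)
# A freshly published event still has a small, non-zero age when it is handled.
fresh_event = now - timedelta(milliseconds=40)

print(processor_latency(replayed_event, now))
print(processor_latency(fresh_event, now))
```

By this definition the value reaches exactly zero only if an event is handled at the very instant it was stamped, which is worth keeping in mind when reading the tail of the graph.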

Event Processor Replay Quiz
●How many processors?
●Why did some finish earlier?
●Are they processed in order?
●Why is the “last” value != 0?
SequentialPerAggregatePolicy

It will force domain events that were raised from the same
aggregate to be handled sequentially.
Thus, events from different aggregates may be handled
concurrently.
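
A conceptual sketch of that guarantee, not Axon Framework's implementation: events are routed to lanes by aggregate identifier, so one aggregate's events stay in order inside a single lane while separate lanes could be handled by separate threads. The function name, the lane count and the use of `crc32` for routing are assumptions for illustration.

```python
from collections import defaultdict
from zlib import crc32


def assign_lanes(events, lane_count):
    """Split (aggregate_id, payload) events into lanes keyed by aggregate id.

    Events of one aggregate always land in the same lane, in their original
    order; different lanes are independent of each other.
    """
    lanes = defaultdict(list)
    for aggregate_id, payload in events:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        lane = crc32(aggregate_id.encode("utf-8")) % lane_count
        lanes[lane].append((aggregate_id, payload))
    return dict(lanes)


# Hypothetical event stream with two interleaved aggregates.
events = [
    ("basket-1", "created"),
    ("basket-2", "created"),
    ("basket-1", "item added"),
    ("basket-2", "item added"),
    ("basket-1", "checked out"),
]

lanes = assign_lanes(events, lane_count=4)
for lane, lane_events in sorted(lanes.items()):
    print(lane, lane_events)
```

Two aggregates can still share a lane, in which case they are simply handled one after the other; the policy only promises ordering within an aggregate.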
eventProcessor_latency:

Measures the difference between an event's timestamp
and the current time.
Showing how far behind an event processor is.
How long was the replay operation?
●~45 min
Is 45 min a good replay time?
●IDK … it depends

Do I have a regression or a bug in the infrastructure?
The server sometimes refuses to ingest
events we are creating.
●Where is the problem?
○In the application?
○In the server?
○In the framework?
Mere observation effect
(Look and See) Again!
New version deployment

Quick performance feedback
Most frequent and most time-consuming commands and queries

Dashboard our SRE team
provides to our client

Main Application dashboard structure
Alerts
Business Metrics
Downstream Systems
Deployments
Axon Server
Mongo Projections
(micro) Services
Protocol WS

Axon Server panels for context
Commands
over selected period
Commands over equivalent period
Command rate
for selected period
equivalent period:
4 weeks avg over the same span
Command rate
for equivalent period
Are we winning? :-) over the last week
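
A minimal sketch of the "equivalent period" comparison described on this slide: the selected period is compared with the average of the same span in each of the previous four weeks. The command counts and function names are hypothetical.

```python
def equivalent_period_average(previous_spans, weeks=4):
    """Mean of the same time span in each of the preceding `weeks` weeks."""
    history = previous_spans[-weeks:]
    if not history:
        raise ValueError("need at least one previous span")
    return sum(history) / len(history)


def relative_change(current, baseline):
    """Change of the selected period against the baseline, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


# Hypothetical command counts for the same span in the last four weeks,
# oldest first, followed by the count for the currently selected period.
previous_spans = [1200, 1100, 1300, 1250]
selected_period = 1400

baseline = equivalent_period_average(previous_spans)
change = relative_change(selected_period, baseline)
print(f"baseline {baseline:.1f}, selected {selected_period}, change {change:+.1%}")
```

Comparing against the same span of earlier weeks keeps daily and weekly rhythms out of the comparison, which a plain "previous hour" baseline would not.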

Command and Query services

up to date business metrics

up to date business metrics

Time for quick architecture sidenote
«protocol»
«projection»
«write-model»
«process»
«saga»
«adapter»

High level technical overview

Dashboard follows architecture
«protocol»
«projection»
«write-model»
«process»
«saga»
«adapter»

Ad-hoc Dashboard example

Event Processor Quiz
600 Events/sec ← server ~50 Ev/sec per EP 400% max capacity
10 hours duration
8 sec max
78% capacity max

Event Processor Quiz
220k Ev/sec ← server 3200% max capacity
45 min duration
3.5 sec max
~25k Ev/sec per EP
3200% capacity

Standard dashboards
SRE team provides to our
development team

Observe our services
? Can we have fewer events than commands, or fewer commands than events?
«projection»
«write-model»
Queries in Events in
? Why are there two threads in (yellow and green) but only one thread for event processing?
(Is that) a bug?
Commands in Events emitted

Per Aggregate Command Handlers
Failed command
Mean handling time
90% handling time
Call Rate
Aggregate 1
Aggregate 5
Aggregate 6
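
A minimal sketch of why the panels show both the mean and the 90% handling time: a single slow call moves the mean a lot and the percentile hardly at all. The handling times are hypothetical, and the nearest-rank percentile used here is one of several common definitions; monitoring backends may compute theirs differently.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(handling_times_ms, failed_calls):
    """Collect the four figures shown per aggregate on the dashboard."""
    return {
        "calls": len(handling_times_ms),
        "failed": failed_calls,
        "mean_ms": mean(handling_times_ms),
        "p90_ms": percentile_nearest_rank(handling_times_ms, 90),
    }


# Hypothetical handling times for one aggregate: nine fast calls, one outlier.
times_ms = [20, 22, 25, 21, 23, 24, 26, 22, 21, 1500]
summary = summarize(times_ms, failed_calls=1)
print(summary)
```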

Per Aggregate Command Handlers

Per Projection Queries dashboard
Failed command
Mean handling time
90% handling time
Call Rate
Projection 1
Projection 5
Projection 6

Per Projection Query Handlers
Still the same drill

Downstream system
Still the same pattern

Event Processors Dashboard

To wrap it up

Server vs framework metrics
Axon Framework metrics
●Are orthogonal: easier to reason about
●Consume handler resources: more details = more heap space
●Can change name → (EventStore → EventBus)
●More flexible → per service configuration
●Not fully qualified, i.e. package name missing
●Contain info about the PayloadType of emitted events
Axon Server metrics
●All metrics for all apps for each context
●Immediately available for new application
●Only Axon specific metrics
●Contain full package names
●Contain info about dispatched commands and queries
●No additional infrastructure

Take Home Message
●Metrics are fun
●Metrics are important
●Metrics are your friend
●They are the most important benefit when using Axon
Thank you
?