Telemetry: The Overlooked Treasure in Axon Server-Centric Applications

RichardBouka 42 views 56 slides Oct 03, 2024

About This Presentation

This talk delves into the often-underappreciated telemetry capabilities of Axon Server, demonstrating its pivotal role in enhancing microservice architectures.

While the benefits of modularity, CQRS + ES (Event Sourcing), and location transparency brought by Axon are widely recognized, Axon Server...

Slide Content

Telemetry: The Overlooked Treasure
in Axon Server-Centric Applications
Richard Bouška

The most uninteresting topic

Maintenance & support
●Reliable and effective administration of your application portfolio
●Flexible on-demand analyses, training and support for your business

ASSIST / me
DevOps & Observability
●Qualified technical and technological support
●Solution automation
●Supervision of problem-free operation of applications

Software Engineering
●Modern and professional custom-made SW solutions
●Wide technological stack with emphasis on cloud solutions

Low Code & Data Platforms
●Data transformation
●Easy-to-understand data visualisation

●32 years on the market
●100+ employees
●500+ apps in our portfolio
●250k+ delivered MD

ASSIST / me
Richard Bouška
ASSIST CTO

Our perception through the years
●2021: CQRS/ES & DDD … not only, it's
●2022: easy microservices … not only, it's
●2023: location transparency with awareness … not only, it's
●2024: Telemetry
●2025: …

Hmm, metrics
is that a good idea anyway?

You will get stressed (Don’t Panic!)
●Peril Sensitive Sunglasses
○Joo Janta 200 Super-Chromatic Peril Sensitive Sunglasses: designed to help people develop a relaxed attitude to danger.
○At the first hint of trouble, they turn totally black and thus prevent you from seeing anything that might alarm you.

You can break your system

Prometheus, Grafana, Dynatrace
●PromQL
●Micrometer
●There are plenty of different metrics that Axon Server exposes
●There are even more metrics that Axon Framework exposes
●While we are at it: the JVM and Spring Boot expose even more
●and Dynatrace
●Counters, Gauges, Timers
●Distributions, Histograms, Natural Histograms
●Percentiles, Mean, Median
●Sampling (theorem) and aliasing effects
●Grafana
●Dashboarding
●Layout systems
●Color perception
●Alerting
●New infrastructure

One day you resolve the last obstacle
Richard Bouška:
Last crucial blocker resolved:
Prometheus has dark mode.
Ondra Halata:
Well, if it has dark mode, then I'll start working on it immediately. Until now, I wasn't convinced of its priority.

Statistics, sampling, convolution
Same data, different averaging windows:
●Avg over 30 sec → max: 1200
●Avg over 1 min → max: 700
●Avg over 5 min → max: 300
●Avg over 15 min → max: 150
●Avg over 30 min → max: 110
●Avg over 1 h → max: 70
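
The longer the averaging window, the lower the reported peak. The following is a minimal sketch of that effect on synthetic data (the series, the burst and the function name are illustrative assumptions, not the data behind the panels above): a short burst is averaged over non-overlapping windows of growing length.

```python
def max_of_window_averages(per_second_counts, window_seconds):
    """Largest average over consecutive, non-overlapping windows."""
    averages = []
    for start in range(0, len(per_second_counts), window_seconds):
        window = per_second_counts[start:start + window_seconds]
        averages.append(sum(window) / len(window))
    return max(averages)


# Synthetic one-hour series (assumption): a baseline of 10 calls/s
# with a single 30-second burst of 1200 calls/s.
series = [10] * 3600
for second in range(600, 630):
    series[second] = 1200

for window in (30, 60, 300, 900, 3600):
    print(f"avg over {window:>4} s -> max {max_of_window_averages(series, window):.1f}")
```

The burst is only fully visible when the window is no longer than the burst itself; every wider window dilutes it with baseline traffic.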

Statistics, sampling theorem, math
What is the maximum number of UpdateAppointment calls per second?

But there must be some benefits as well, no?

They are visually pleasant

Metrics are useful

What you get… (for free)
●Application architecture cleanup
●3rd party performance and stability monitoring
●Global system stability wear-out
●Performance tuning support
●FinOps / Green IT support data
●Post-deployment regression detection
●Performance degradation detection
●3rd party component bug detection
●Business-oriented metrics
●CQRS ramp-up support for newcomers
Mechanical sympathy: using a tool or system with an understanding of how it operates best.
“You don't have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy.”
Jackie Stewart, racing driver
When you understand how a system is designed to be used, you can align with the design to gain optimal performance.

Basic server metrics
●Axon Server runs in a JVM → heap memory was our concern …
○Good to check sometimes, but never an issue here … just size your servers properly

First useful dashboard
Event store, the source of truth … maintain the disk space or …
●90 days 12.3% → 15.5%
○→ we still have plenty of time
You don't want to hit 100% … keep your system under ~80%
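
"Plenty of time" can be put into numbers with a simple linear extrapolation. A minimal sketch, assuming disk usage keeps growing at the rate observed over the last 90 days (the function name is hypothetical; the percentages are the ones from the slide):

```python
def days_until_threshold(start_pct, end_pct, observed_days, threshold_pct):
    """Linear extrapolation: days left until usage reaches threshold_pct."""
    growth_per_day = (end_pct - start_pct) / observed_days
    if growth_per_day <= 0:
        return float("inf")  # usage is not growing, threshold is never reached
    return max(0.0, (threshold_pct - end_pct) / growth_per_day)


# 12.3% -> 15.5% over 90 days, with ~80% as the level to stay under.
remaining = days_until_threshold(12.3, 15.5, 90, 80.0)
print(f"about {remaining:.0f} days (~{remaining / 365:.1f} years) until 80%")
```

At the observed growth rate the 80% mark is roughly five years away, provided the event volume does not change.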

Basic Axon Server metrics
# of apps connected to individual nodes per context
Active applications → better than /health
# of events processed per node
Since node restart per context
3 nodes, 3 contexts, last 90 days
Last event token #
Since context created (they should overlap)

Counters are used for … Events
●Application activity (indirect info)
●Node load balance (indirect info)
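
Counters that count "since node restart" drop back to zero whenever a node restarts, so activity is usually derived from increases rather than from raw values. A minimal sketch of a reset-tolerant increase, similar in spirit to how Prometheus treats counter resets but without its extrapolation (the function name and the sample values are assumptions):

```python
def counter_increase(samples):
    """Total increase of a monotonically increasing counter, tolerating resets."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped: assume a restart and count from zero.
            total += current
    return total


# Hypothetical scrapes of an event counter; the node restarts after the third one.
samples = [100, 150, 220, 5, 40]
print(counter_increase(samples))
```

Dividing the increase by the length of the observed interval gives an average rate, which is what makes counters from nodes with different uptimes comparable.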

Now down the rabbit hole
A few stories from the battlefield
… but before we go a reminder

Teams work together
SRE/Observability team, DevOps team, Development team
Especially for DDD and CQRS architectures:
●Co-locate the Dev, DevOps and Observability teams.
○Instant feedback after deployment
○Flexible deployment support (same time zone)
○Made-to-measure ad-hoc dashboards
○Simple communication about new metric definitions
○Easier communication → improved understanding of the architecture
Architecture!!

Business as usual behavior
Mere observation effect (Look and See)
●How many trains did not crash?
●How many Chernobyls did not explode?
Are we sure about our backup strategy?
“Invisible” bug in command handling
Input file parsing retry bug
Eyesessment!

20 segments/week
Is the event segment size well chosen?
●What is your backup strategy?
●Which files do you need to back up?
●By properly sizing event store segments you can save backup storage.
Setting Event Store Segment Size - closed segments (last 30 days)
1 segment/week, 1 segment/3 weeks

Snapshots: better late than later
~30 ms mean handling time
~1.5 s mean handling time
26 event segment files
6 event segment files
file.segment.open: counts the number of event store segments opened since start
local.aggregate.segments: monitors the number of segments that were accessed for reading aggregate event requests
300 event segment files
100 event segment files

(Un)Subscription Queries
Do we have a serious leak in the SubscriptionQueries unsubscribe handling?

Is the system tired each evening?
10 ms in the morning, 45 ms in the evening

Is the application converging (DDD)?
Usual application behavior (average over last 7 weeks)
Current application behavior

Event Processor Quiz
Event Replay (events that were ever processed in our system)
●Do you see the deployments / optimizations?
An event that made “more sense” was added:
●AllReferencesMarkedAsUrgent instead of iterating over ReferenceMarkedAsUrgent for all references currently in the basket
ReferenceMarkedAsUrgent
AllReferencesMarkedAsUrgent

Quiz 2 Query Handlers
We have 2 instances of QueryHandler; why is only one called?

Event Processor Replay Quiz
eventProcessor_latency:
Measures the difference between an event's timestamp and the current time, showing how far behind an event processor is.
●How many processors?
●Are they processed in order?
●Why is the “last” value != 0?
How long was the replay operation?
●~10 hours!
Is the 10-hour replay time important?
●IDK … it depends
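
The latency definition above boils down to one subtraction. A minimal sketch (function name and timestamps are hypothetical) that shows why the value is tiny for live traffic and huge at the start of a replay, when the processor is handling events that were stored long ago:

```python
from datetime import datetime, timedelta, timezone


def processor_latency(event_timestamp, now):
    """How far behind the processor is when it handles this event."""
    return now - event_timestamp


# Hypothetical wall-clock time at which both events are handled.
now = datetime(2024, 10, 3, 12, 0, 0, tzinfo=timezone.utc)

live_event = now - timedelta(milliseconds=40)  # published a moment ago
replayed_event = now - timedelta(days=365)     # re-read from the event store

print("live:    ", processor_latency(live_event, now))
print("replayed:", processor_latency(replayed_event, now))
```

During a replay the reported latency therefore starts at roughly the age of the oldest event and shrinks as the processor catches up.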

Event Processor Replay Quiz
●How many processors?
●Why did some finish earlier?
●Are they processed in order?
●Why is the “last” value != 0?
SequentialPerAggregatePolicy:
It forces domain events that were raised from the same aggregate to be handled sequentially. Thus, events from different aggregates may be handled concurrently.
eventProcessor_latency:
Measures the difference between an event's timestamp and the current time, showing how far behind an event processor is.
How long was the replay operation?
●~45 min
Is 45 min a good replay time?
●IDK … it depends
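
The idea behind sequencing per aggregate can be illustrated without the framework. The sketch below is not Axon's implementation; it assumes a simple "hash of the aggregate identifier modulo number of segments" assignment (the names `segment_for` and `partition` are hypothetical) to show that events of one aggregate stay in order on one segment while different aggregates can spread across segments.

```python
import zlib
from collections import defaultdict


def segment_for(aggregate_id, segment_count):
    """Stable segment choice derived from the aggregate identifier (assumption)."""
    return zlib.crc32(aggregate_id.encode("utf-8")) % segment_count


def partition(events, segment_count):
    """Group (aggregate_id, payload) pairs per segment, keeping arrival order."""
    segments = defaultdict(list)
    for aggregate_id, payload in events:
        segments[segment_for(aggregate_id, segment_count)].append((aggregate_id, payload))
    return segments


# Hypothetical event stream with three aggregates.
events = [
    ("basket-1", "created"), ("basket-2", "created"), ("basket-1", "item-added"),
    ("basket-3", "created"), ("basket-2", "item-added"), ("basket-1", "closed"),
]

for segment, assigned in sorted(partition(events, 4).items()):
    print(segment, assigned)
```

Each segment can be worked on by its own thread, which is one reason why processors with more segments may finish a replay earlier.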

Do I have a regression or a bug in the infrastructure?
The server sometimes refuses to ingest events we are creating.
●Where is the problem?
○In the application?
○In the server?
○In the framework?
Mere observation effect (Look and See) again!
New version deployment

Quick performance feedback
Most frequent
Most time-consuming
Commands, Queries

Dashboard our SRE team provides to our client

Main Application dashboard structure
Alerts
Business Metrics
Downstream Systems
Deployments
Axon Server
Mongo Projections
(micro) Services
Protocol WS

Axon Server panels for context
Commands over selected period
Commands over equivalent period
Command rate for selected period
Command rate for equivalent period
Equivalent period: 4-week average over the same span
Are we winning? :-) over the last week
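
The "equivalent period" comparison can be sketched as follows. This is an illustration under assumptions, not the dashboard's actual query: the same time span is taken from each of the previous four weeks, averaged, and used as the baseline for the currently selected span (function names and numbers are hypothetical).

```python
def equivalent_period_baseline(previous_weeks):
    """Average of the same time span taken from the previous weeks."""
    return sum(previous_weeks) / len(previous_weeks)


def change_vs_equivalent(current, previous_weeks):
    """Relative change of the selected period against the baseline, in percent."""
    baseline = equivalent_period_baseline(previous_weeks)
    return (current - baseline) / baseline * 100.0


# Hypothetical command counts for the same span in the last four weeks.
previous_weeks = [1000, 1100, 900, 1000]
current = 1200

print(f"baseline: {equivalent_period_baseline(previous_weeks):.0f} commands")
print(f"change:   {change_vs_equivalent(current, previous_weeks):+.1f} %")
```

Comparing against the same weekday and hour range removes the daily and weekly rhythm from the picture, so what remains is closer to a real change in behavior.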

Command and Query services

Up-to-date business metrics

Time for a quick architecture side note
«protocol»
«projection»
«write-model»
«process»
«saga»
«adapter»

High level technical overview

Dashboard follows architecture
«protocol»
«projection»
«write-model»
«process»
«saga»
«adapter»

Ad-hoc Dashboard example

Event Processor Quiz
600 events/sec ← server, ~50 ev/sec per EP, 400% max capacity
10 hours duration
8 sec max
78% capacity max

Event Processor Quiz
220k ev/sec ← server, 3200% max capacity
45 min duration
3.5 sec max
~25k Ev/sec per EP
3200% capacity

Standard dashboards our SRE team provides to our development team

Observe our services
? Can we have fewer events than commands, or fewer commands than events?
«projection»
«write-model»
Queries in, Events in
? Why are there two threads in (yellow and green) but only one thread for event processing? (Is that) a bug?
Commands in, Events emitted

Per Aggregate Command Handlers
Failed command
Mean handling time
90% handling time
Call Rate
Aggregate 1
Aggregate 5
Aggregate 6

Per Aggregate Command Handlers

Per Projection Queries dashboard
Failed command
Mean handling time
90% handling time
Call Rate
Projection 1
Projection 5
Projection 6

Per Projection Query Handlers
Still the same drill

Downstream system
Still the same pattern

Event Processors Dashboard

To wrap it up

Server vs framework metrics
Axon Framework metrics
●Are orthogonal: easier to reason about
●Consume handler resources: more details = more heap space
●Can change name → (EventStore → EventBus)
●More flexible → per-service configuration
●Not fully qualified, i.e. package name missing
●Contain info about the PayloadType of emitted events
Axon Server metrics
●All metrics for all apps for each context
●Immediately available for new application
●Only Axon-specific metrics
●Contain full package names
●Contain info about dispatched commands and queries
●No additional infrastructure

Take Home Message
●Metrics are fun
●Metrics are important
●Metrics are your friend
●They are the most important benefit when using Axon
Thank you
