Telemetry: The Overlooked Treasure in Axon Server-Centric Applications
RichardBouka
Oct 03, 2024
About This Presentation
This talk delves into the often-underappreciated telemetry capabilities of Axon Server, demonstrating its pivotal role in enhancing microservice architectures.
While the benefits of modularity, CQRS + ES (Event Sourcing), and location transparency brought by Axon are widely recognized, Axon Server's telemetry provides invaluable insights that are crucial for optimizing performance and resilience.
Slide Content
Telemetry: The Overlooked Treasure
in Axon Server-Centric Applications
Richard Bouška
The most uninteresting topic
Maintenance & support
●Reliable and effective administration of your
application portfolio
●Flexible on-demand analyses, training and
support for your business
ASSIST / me
DevOps & Observability
●Qualified technical and technological
support
●Solution automatisation
●Supervision of problem-free operation of
applications
Software Engineering
●Modern and professional custom-made SW
solutions
●Wide technological stack with emphasis on
cloud solutions
Low Code & Data Platforms
●Data transformation
●Easy-to-understand data visualisation
32 years on the market
100+ employees
500+ apps in our portfolio
250k+ delivered MD
ASSIST / me
Richard Bouška
ASSIST CTO
Our perception through the years
●2021: CQRS/ES & DDD … not only, it's
●2022: easy microservices … not only, it's
●2023: location transparency with awareness … not only, it's
●2024: Telemetry
●2025: …
Hmm, metrics
is that a good idea anyway?
You will get stressed (Don’t Panic!)
●Peril Sensitive Sunglasses
○Joo Janta 200 Super-Chromatic Peril Sensitive Sunglasses:
designed to help people develop a relaxed attitude to danger.
○At the first hint of trouble, they turn totally black and thus
prevent you from seeing anything that might alarm you.
You can break your system
Prometheus, Grafana, Dynatrace
●PromQL
●Micrometer
●There are plenty of different metrics that Axon Server exposes
●There are even more metrics that Axon Framework exposes
●On top of that, the JVM and Spring Boot expose even more
●and Dynatrace
●Counters, Gauges, Timers,
●Distributions, Histograms, Natural Histograms
●Percentiles, Mean, Median
●Sampling (theorem) and aliasing effects
●Grafana
●Dashboarding
●Layout systems
●Color perception
Alerting New infrastructure
One day you resolve the last obstacle
Richard Bouška:
Last crucial blocker resolved:
Prometheus has dark mode.
Ondra Halata:
Well, if it has dark mode, then I'll start working on it
immediately. Until now, I wasn't convinced of its priority.
Statistics, sampling, convolution
Avg over 30 sec: max 1200
Avg over 1 min: max 700
Avg over 5 min: max 300
Avg over 15 min: max 150
Avg over 30 min: max 110
Avg over 1 h: max 70
Statistics, sampling theorem, math
What is the maximum UpdateAppointment call count per second?
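The reported maximum depends on the averaging window, which is why the same traffic can show a peak of 1200 or of 70. A minimal sketch of that effect, assuming a synthetic per-second call series (the burst shape and the function name are made up for illustration, not taken from the talk):

```python
def max_of_moving_average(per_second_counts, window_seconds):
    """Return the largest average calls/sec observed over any full window."""
    if window_seconds <= 0 or window_seconds > len(per_second_counts):
        raise ValueError("window must fit inside the series")
    window_sum = sum(per_second_counts[:window_seconds])
    best = window_sum
    # Slide the window one second at a time, updating the running sum.
    for i in range(window_seconds, len(per_second_counts)):
        window_sum += per_second_counts[i] - per_second_counts[i - window_seconds]
        best = max(best, window_sum)
    return best / window_seconds

# Hypothetical traffic: one hour at 10 calls/sec with a single 10-second burst.
series = [10] * 3600
for t in range(1800, 1810):
    series[t] = 1000

for window in (1, 30, 60, 300, 3600):
    print(f"avg over {window:>4} s -> max {max_of_moving_average(series, window)}")
```

The wider the window, the more the burst is smeared out, so "maximum per second" is only meaningful together with the window it was averaged over.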
But there must be some
benefits as well, no?
They are visually pleasant
Metrics are useful
What you get… (for free)
●Application architecture cleanup
●3rd party performance and stability monitoring
●Global system stability wearout
●Performance tuning support
●FinOps / Green IT support data
●Post-deployment regression detection
●Performance degradation detection
●3rd party component bug detection
●Business-oriented metrics
●CQRS ramp-up support for newcomers
When you use a tool or system with an understanding of how it operates best.
You don't have to be an engineer to be a racing driver, but you do have to have
Mechanical Sympathy.
Jackie Stewart, racing driver
When you understand how a system is designed to be used, you can align
with the design to gain optimal performance.
Basic server metrics
●Axon server runs in JVM → heap memory was our concern …
○Good to check sometimes but no issues here ever … just properly size your servers
First useful dashboard
Event store, the source of truth … maintain the disk space or …
●90 days 12.3% → 15.5%
○→ we still have plenty of time
You don't want to hit 100% … keep your system under ~80%
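The "we still have plenty of time" estimate can be checked with a straight-line extrapolation. This is a rough sketch that assumes disk usage keeps growing at the rate observed over the last 90 days; the function name and the linear model are illustrative assumptions:

```python
def days_until_threshold(start_pct, end_pct, observed_days, threshold_pct=80.0):
    """Linear extrapolation: days until usage reaches threshold_pct."""
    growth_per_day = (end_pct - start_pct) / observed_days
    if end_pct >= threshold_pct:
        return 0.0   # already at or above the threshold
    if growth_per_day <= 0:
        return None  # usage is flat or shrinking, no forecast possible
    return (threshold_pct - end_pct) / growth_per_day

# Figures from the dashboard above: 12.3% -> 15.5% over 90 days.
days_left = days_until_threshold(12.3, 15.5, 90)
print(f"about {days_left:.0f} days until 80%")
```

With these numbers the 80% mark is roughly 1,800 days away, about five years, provided the event rate does not change.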
Basic Axon Server metrics
# of apps connected to individual nodes per context
Active applications → better than /health
# of apps events processed per node
Since node restart per context
3 nodes, 3 contexts, last 90 days
Last event token #
Since context created (they should overlap)
Counters are used for … Events
Application activity
(indirect info)
Node Load balance
(indirect info)
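Because these counters restart from zero when a node restarts, a raw difference between two readings can be misleading. The sketch below shows one way to turn counter samples into a per-second rate while tolerating resets. It follows the same idea as PromQL's rate(), which treats a decrease as a reset, but it is a simplification without extrapolation; the sample values and scrape interval are hypothetical:

```python
def counter_increase(samples):
    """Total increase across consecutive counter readings, tolerating resets."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so assume the node restarted from zero.
            total += current
    return total

def per_second_rate(samples, scrape_interval_seconds):
    """Average events per second over the sampled span."""
    if len(samples) < 2:
        return 0.0
    elapsed = (len(samples) - 1) * scrape_interval_seconds
    return counter_increase(samples) / elapsed

# Hypothetical readings scraped every 15 s; the node restarts before the 4th one.
readings = [100, 160, 220, 30, 90]
print(per_second_rate(readings, 15))
```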
Now down the rabbit hole
Few stories from the battlefield
… but before we go a reminder
Team work together
SRE/Observability team DEVOPS team Development team
Especially for DDD and CQRS architectures:
●co-locate the Dev, DEVOPS and Observability teams.
○Instant feedback after deployment
○Flexible deployment support (same time zone)
○Made-to-measure ad-hoc dashboards
○Simple communication about new metrics definition
○Easier communication → improved architecture understanding
Architecture!!
Business as usual behavior
Mere observation effect (Look and See)
●How many trains did not crash?
●How many Chernobyls did not explode?
Are we sure about our backup strategy
“Invisible” bug in command handling
Input file parsing retry bug
Eyesessment!
20 segments/week
Is event segment size well chosen?
●What is your backup strategy?
●Which files do you need to backup?
●By properly sizing event store segments you can save backup storage.
Setting Event Store Segment Size - closed segments (last 30 days)
1 segment/week    1 segment/3 weeks
Snapshots: better late than later
~30ms mean handling time
~1.5s mean handling time
26 event segment files
6 event segment files
file.segment.open
Counts the number of event store
segments opened since start
local.aggregate.segments
Monitors the number of segments that were
accessed for reading aggregate event
requests
300 event segment files
100 event segment files
285
(Un)Subscription Queries
Do we have a serious leak in the SubscriptionQueries unsubscribe handling?
Is the system tired each evening?
10 ms in the morning, 45 ms in the evening
Is the application converging (DDD)?
Usual application behavior
(average over last 7 weeks)
Current application
behavior
Event Processor Quiz
Event Replay (events that were ever processed in our system)
●Do you see the deployments / optimizations?
An event that made “more sense” was added
●AllReferencesMarkedAsUrgent instead of iterating over ReferenceMarkedAsUrgent for all references currently in the basket
ReferenceMarkedAsUrgent
AllReferencesMarkedAsUrgent
Quiz 2 Query Handlers
We have 2 instances of QueryHandler, why only one is called?
Measures the difference between an event's timestamp and the current time, showing how far behind an event processor is.
●How many processors?
●Are they processed in order?
●Why “last” value is != 0?
How long was the replay operation?
●~10hours!
Is the 10 hours replay time important?
●IDK … it depends
Event Processor Replay Quiz
●How many processors?
●Why some finished earlier?
●Are they processed in order?
●Why “last” value is != 0?
SequentialPerAggregatePolicy
It will force domain events that were raised from the same
aggregate to be handled sequentially.
Thus, events from different aggregates may be handled
concurrently.
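The ordering guarantee described above can be pictured as one ordered lane per aggregate. The following is only an illustration of that idea, not Axon Framework's implementation; the event tuples and the function name are hypothetical:

```python
from collections import defaultdict

def partition_by_aggregate(events):
    """Group events into per-aggregate lanes, keeping arrival order in each lane."""
    lanes = defaultdict(list)
    for aggregate_id, payload in events:
        lanes[aggregate_id].append(payload)
    return dict(lanes)

# Hypothetical stream: (aggregate identifier, event name) in arrival order.
stream = [
    ("basket-1", "ReferenceAdded"),
    ("basket-2", "ReferenceAdded"),
    ("basket-1", "ReferenceMarkedAsUrgent"),
]
lanes = partition_by_aggregate(stream)
print(lanes)
```

Each lane has to be worked through in order, while separate lanes are independent of each other and can be processed in parallel.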
eventProcessor_latency:
Measures the difference between an event's timestamp
and the current time.
Showing how far behind an event processor is.
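The definition above amounts to a single subtraction. A minimal sketch, assuming timezone-aware timestamps; the function name and the clamping to zero are illustrative choices, not the framework's code:

```python
from datetime import datetime, timedelta, timezone

def processor_latency(event_timestamp, now):
    """How far behind the processor is: current time minus the event's timestamp."""
    # Clamp to zero in case of small clock differences between machines.
    return max(now - event_timestamp, timedelta(0))

now = datetime(2024, 10, 3, 12, 0, tzinfo=timezone.utc)
live_event = now - timedelta(seconds=8)
replayed_event = now - timedelta(days=3)

print(processor_latency(live_event, now).total_seconds())   # 8.0
print(processor_latency(replayed_event, now))                # 3 days, 0:00:00
```

During a replay the processor handles old events, so under this definition the value starts high and shrinks as the processor catches up with the head of the event stream.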
How long was the replay operation?
●~45 min
Is the 45 min good replay time?
●IDK … it depends
Do I have a regression or a bug in the infrastructure?
Server sometimes refuses to ingest
events we are creating.
●Where is the problem?
○In Application ?
○In Server?
○In Framework?
Mere observation effect
(Look and See) Again!
New version deployment
Quick performance feedback
Most frequent
Most time
consuming
Commands Queries
Dashboard our SRE team
provides to our client
Main Application dashboard structure
Alerts
Business Metrics
Downstream Systems
Deployments
Axon Server
Mongo Projections
(micro) Services
Protocol WS
Axon Server panels for context
Commands
over selected period
Commands over equivalent period
Command rate
for selected period
equivalent period:
4 weeks avg over the same span
Command rate
for equivalent period
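The "equivalent period" comparison can be sketched as the current count against the average of the same span in each of the previous four weeks. The counts and function names below are hypothetical; only the four-week averaging comes from the panel description:

```python
def equivalent_period_baseline(previous_counts):
    """Average command count over the same span in earlier weeks."""
    return sum(previous_counts) / len(previous_counts)

def change_vs_baseline_pct(current_count, previous_counts):
    """Percentage change of the selected period against the baseline."""
    baseline = equivalent_period_baseline(previous_counts)
    return (current_count - baseline) * 100 / baseline

# Hypothetical command counts for the same weekday and hours, last 4 weeks.
previous_four_weeks = [1000, 1100, 900, 1000]
change = change_vs_baseline_pct(1200, previous_four_weeks)
print(f"{change:+.1f}% vs the 4-week average")
```

Comparing against the same span of earlier weeks keeps daily and weekly traffic patterns from showing up as false regressions.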
Are we winning? :-) over the last week
Command and Query services
up to date business metrics
up to date business metrics
Time for quick architecture sidenote
«protocol»
«projection»
«write-model»
«process»
«saga»
«adapter»
Event Processor Quiz
600 Events/sec ← server
~50 Ev/sec per EP
400% max capacity
10 hours duration
8 sec max
78% capacity max
Event Processor Quiz
220k Ev/sec ← server
3200% max capacity
45 min duration
3.5 sec max
~25k Ev/sec per EP
3200% capacity
Standard dashboards
SRE team provides to our
development team
Observe our services
? Can we have fewer events than commands, or fewer commands than events?
«projection»
«write-model»
Queries in Events in
? Why are there two threads in (yellow and green) but only one thread for event processing?
(Is that) a bug?
Commands in Events emitted
Per Aggregate Command Handlers
Failed command
Mean handling time
90% handling time
Call Rate
Aggregate 1
Aggregate 5
Aggregate 6
Per Aggregate Command Handlers
Per Projection Queries dashboard
Failed command
Mean handling time
90% handling time
Call Rate
Projection 1
Projection 5
Projection 6
Per Projection Query Handlers
Still the same drill
Downstream system
Still the same pattern
Event Processors Dashboard
To wrap it up
Server vs framework metrics
Axon Framework metrics
●Are orthogonal: easier to reason about
●Consume handler resources: more details = more heap space
●Can change name → (EventStore → EventBus)
●More flexible → per service configuration
●Not fully qualified, i.e. package name missing
●Contains info about emitted events PayloadType
Axon Server metrics
●All metrics for all apps for each context
●Immediately available for new application
●Only Axon specific metrics
●Contain full package names
●Contains info about dispatched commands and queries
●No additional infrastructure
Take Home Message
●Metrics are fun
●Metrics are important
●Metrics are your friend
●They are the most important benefit when using Axon
Thank you
?