Diagnosing Production Akka.NET Problems with OpenTelemetry.pptx

petabridge 212 views 30 slides Aug 22, 2024

About This Presentation

We recently leveraged OpenTelemetry, Phobos, semantic logging, and more to resolve several high-impact, difficult bugs in Akka.NET itself. We want to share the techniques, best practices, methodologies, and tooling we used to do that - so in this webinar you will learn how to:

- Collect useful trac...


Slide Content

Diagnosing Production Akka.NET Problems with OpenTelemetry
Phobos, Seq, Grafana, and More

Table of Contents

- OpenTelemetry Basics
- Local vs. “Production” Telemetry Stacks
- Working with Phobos Data
- Noise Control (this is what will make OTEL actually useful)
  - OTEL Filtering
- Custom Metrics & Traces
- Crafting Useful Alerts

OpenTelemetry at a Glance

- Unified, vendor-neutral observability + monitoring standard for all application runtimes
  - Logging
  - Metrics
  - Tracing
- Can work locally
- All major APM vendors involved and on board

OTEL Configuration as Code

- Resources: describe the topology of the code being observed, e.g. “service: web-service, namespace: QA environment, instance: host1, version: 1.5.21”
- Metrics: subscriptions to metric sources
- Traces: subscriptions to activity sources
- Exporter: where metrics + traces are sent for processing

    services // IServiceCollection
        .AddOpenTelemetry()
        .ConfigureResource(builder =>
        {
            builder
                .AddEnvironmentVariableDetector()
                .AddTelemetrySdk()
                .AddServiceVersionDetector();
        })
        .WithMetrics(c =>
        {
            c.AddRuntimeInstrumentation()
                .AddPhobosInstrumentation()
                .AddHttpClientInstrumentation()
                .AddTestLabMetrics();
        })
        .WithTracing(c =>
        {
            c.AddHttpClientInstrumentation()
                .AddPhobosInstrumentation();
        })
        .UseOtlpExporter(OtlpExportProtocol.Grpc, new Uri(otlpEndpoint));

OTEL Logging Configuration

- Requires a separate configuration against the logging builder (ILoggingBuilder)
- Probably a good idea to standardize how resources are detected
- No need to add an exporter again if you already added the OTLP exporter

    hostBuilder.ConfigureLogging(builder => // ILoggingBuilder
    {
        builder.ClearProviders();
        builder.AddConsole();
        builder.Services.Configure<LoggerFilterOptions>(opt => { opt.MinLevel = logLevel; });

        var resourceBuilder = ResourceBuilder.CreateDefault();
        resourceBuilder
            .AddEnvironmentVariableDetector()
            .AddTelemetrySdk()
            .AddServiceVersionDetector();

        builder.AddOpenTelemetry(options => { options.SetResourceBuilder(resourceBuilder); });
    });

OTLP: OpenTelemetry Protocol

- OTLP: dedicated wire format used for processing metrics, traces, and logs
- OTLP Collector: dedicated OTLP receiver; forwards data to appropriate destinations
- Standardizes OTEL collection across any stack

    receivers:
      otlp:
        protocols:
          grpc:

    exporters:
      debug:
      prometheus:
        endpoint: "0.0.0.0:9464"
      otlp/jaeger:
        endpoint: rpi-pb-stresstest1:4317
        tls:
          insecure: true
      otlphttp/seq:
        endpoint: http://rpi-pb-stresstest1/ingest/otlp
      otlp/aspire:
        endpoint: rpi-pb-stresstest1:18889
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          # the zipkin exporter is referenced here but its declaration is not shown on the slide
          exporters: [debug, otlp/jaeger, zipkin, otlphttp/seq, otlp/aspire]
        metrics:
          receivers: [otlp]
          exporters: [debug, prometheus, otlp/aspire]
        logs:
          receivers: [otlp]
          exporters: [debug, otlphttp/seq, otlp/aspire]

Local vs. Production Telemetry Stacks
No, you don’t need DataDog everywhere

Local Telemetry Stack

- https://github.com/StephenCleary/LocalTelemetry
  - OTLP, .NET Aspire, Prometheus, Grafana, Seq, Zipkin, Jaeger
- What I use: OTLP, Grafana, Prometheus, Seq
  - Jaeger / Zipkin are memory hogs
  - .NET Aspire metrics explorer is helpful, but Grafana dashboards are far superior

Production Telemetry Stack

- Probably best to go cloud-hosted
  - Difficult to manage ingestion / storage at high volumes
- Be very careful with cardinality and data volume
  - Vendors might bill per metric dimension
  - Vendors might bill per volume
  - Phobos naturally accounts for these
  - OTEL filtering also helps
- Self-hosting is still an option!
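To make the cardinality warning concrete, here is a rough back-of-the-envelope sketch. The tag names and distinct-value counts are hypothetical, not Phobos defaults. It only illustrates that the worst-case number of time series for one metric is the product of the distinct values of each tag.

```csharp
using System;
using System.Linq;

// Hypothetical tag dimensions for a single metric and how many distinct
// values each one can take (illustrative numbers only).
var dimensions = new (string Tag, int DistinctValues)[]
{
    ("actor.type", 40),
    ("message.type", 120),
    ("address", 12),
};

// Worst case: every combination of tag values becomes its own time series.
long worstCaseSeries = dimensions.Aggregate(1L, (acc, d) => acc * d.DistinctValues);
Console.WriteLine($"Worst-case series for one metric: {worstCaseSeries:N0}");

// Adding one unbounded-looking tag (e.g. a per-entity actor path) multiplies it again.
long withActorPath = worstCaseSeries * 10_000;
Console.WriteLine($"With a 10,000-value 'actor.path' tag: {withActorPath:N0}");
```

Real series counts are usually lower because not every combination occurs, but the multiplication shows why a single high-cardinality tag can dominate a vendor bill.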

Working with Phobos Data
Looking for useful Akka.NET signals with Phobos

About Phobos

- OpenTelemetry instrumentation for Akka.NET
- Designed to be high-performance
- Automatically captures useful data, suppresses noise – no instrumentation code required by users
- Highly configurable
- Automatically correlates Akka.NET activity with non-Akka activity
- Costs $4000 per year per organization
- https://phobos.petabridge.com/

Metrics: What Data Does Phobos Capture?

- Akka.Cluster: nodes by status, reachability
- Akka.Cluster.Sharding: shard + entity allocations by region, address
- Akka:
  - Actor starts / stops, current alive total by type & address
  - Processed messages by actor type / message type / address
  - Latency for message processing
  - Log rates by level / address / exception types
  - Actor crashes / restarts by type

Built-in & Custom Dashboards Demo

Tracing: What Data Does Phobos Capture?

- akka.msg.recv {MsgType} – tracks actors receiving messages
- akka.actor.ask {MsgType} – tracks end-to-end Ask operation
- akka.actor.start / spawn – tracks instantiation of an actor
- akka.actor.crash – tracks actor crashes
- /system/unhandled – tracks unhandled messages
- /system/deadletters – tracks deadletters
- Automatically appends all Akka.NET logs to all msgs

Tracing Data Demo

Generally Useful Phobos Metrics Signals

- Did the cluster form correctly?
  - Unreachable nodes
  - Detected down / up nodes
- Is the cluster healthy?
  - Error rate (via logs)
  - Actor crash rate
- Is the application doing what it should?
  - Live actors by type
  - Message rates by type

Generally Useful Phobos Tracing Signals

- Specific errors / specific actors / specific messages
- Unhandled messages and dead letters
- akka.actor.ask duration
- akka.actor.start

Tracing Examples

Messages Processed by Actor + Msg Type

    select count(*) as MsgCount
    from stream
    where akka.actor.recv.msgType is not null and akka.actor.type is not null
    group by(akka.actor.type) as ActorType, group by(akka.actor.recv.msgType) as MsgType
    order by MsgCount desc

Tracking Shard Handoffs by Node

    select count(*) as MsgCount
    from stream
    where akka.actor.recv.msgType = 'Akka.Cluster.Sharding.ShardCoordinator+HandOff'
    group by(@Resource.service.instance.id) as Host
    order by Host desc

Tracing Examples

Errors by Version (Vanilla OTEL)

    select count(*) as ErrorCount
    from stream
    where @Level in ['e','err','error'] and Has(@Resource.service.version)
    group by @Resource.service.name, @Resource.service.version

Phobos Noise Control

- Carefully balancing signal vs. noise is key to making tracing useful + cost effective
- Trace filters are the most comprehensive solution
- Can also use Props or HOCON to disable / enable metrics on select actors
- Tracing / metrics of all /system actors disabled by default

    public sealed class DrawTogetherTraceFilter : ITraceFilter
    {
        public bool ShouldTraceMessage(object message, bool alreadyInTrace)
        {
            switch (message)
            {
                case IClusterShardingSerializable:
                    return true;
                case IClusterSingletonMessage:
                    return true;
                case IWithDrawingSessionId:
                    return true;
                case not null when alreadyInTrace:
                    return true;
                default:
                    return false;
            }
        }
    }

Phobos Noise Control

- Can decorate messages, actors with Phobos.Actor.Common interfaces
  - INeverTrace – no traces
  - INeverMonitor – no metrics
  - INeverInstrumented – nothing

    public interface IDupeTestMessage : INeverInstrumented { }

    public sealed class AkkaClusterFormationDuration : UntypedActor, INeverInstrumented

OTEL Filtering

Metrics

    .WithMetrics(c =>
    {
        c.AddRuntimeInstrumentation()
            .AddPhobosInstrumentation()
            .AddHttpClientInstrumentation()
            .AddAspNetCoreInstrumentation()
            .AddMeter(EmailSendingTelemetry.EmailOtelName)
            // removes all `akka.messages.latency` metrics from export
            .AddView("akka.messages.latency*", MetricStreamConfiguration.Drop);
    })

Tracing

    .WithTracing(c =>
    {
        c.AddHttpClientInstrumentation()
            .AddPhobosInstrumentation()
            .AddAspNetCoreInstrumentation()
            .AddSource(EmailSendingTelemetry.EmailOtelName)
            // only keep 10% of all traces
            .SetSampler(new TraceIdRatioBasedSampler(0.1d));
    });
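To give a feel for what "only keep 10% of all traces" means, the sketch below imitates ratio-based head sampling using only System.Diagnostics. It is not the OpenTelemetry SDK's TraceIdRatioBasedSampler. The source name "Sketch.Sampling" and the ShouldSample helper are made up for illustration, and the threshold math is a simplified stand-in.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical helper: keep a trace when the number formed by the first
// 8 bytes of its trace id falls below ratio * 2^64. The decision depends
// only on the trace id, so the same trace id always gets the same verdict.
static bool ShouldSample(ActivityTraceId traceId, double ratio)
{
    if (ratio >= 1.0) return true;
    if (ratio <= 0.0) return false;
    ulong prefix = ulong.Parse(
        traceId.ToHexString().AsSpan(0, 16),
        NumberStyles.HexNumber,
        CultureInfo.InvariantCulture);
    return prefix < (ulong)(ratio * ulong.MaxValue);
}

using var activitySource = new ActivitySource("Sketch.Sampling");
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Sketch.Sampling",
    // Unsampled activities are never created: StartActivity returns null.
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ShouldSample(options.TraceId, 0.1)
            ? ActivitySamplingResult.AllDataAndRecorded
            : ActivitySamplingResult.None,
};
ActivitySource.AddActivityListener(listener);

int kept = 0;
for (int i = 0; i < 10_000; i++)
{
    using var activity = activitySource.StartActivity("work");
    if (activity is not null) kept++;
}
Console.WriteLine($"kept {kept} of 10000 root activities");
```

Because the sampling decision is made when the root activity starts, dropped traces cost almost nothing, but rare errors inside dropped traces are lost too.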

Custom OpenTelemetry Metrics and Traces

Actually Doing Things with OTEL

- Meter: this is where metrics come from
- ActivitySource: this is where traces come from
- Declare both as static singletons (logs can come normally from Microsoft.Extensions.Logging)

    public static class DuplicateDetectorInstrumentation
    {
        public const string InstrumentationName = "DuplicateDetector";

        public static readonly Meter DupeMeter = new(InstrumentationName);

        public static readonly ActivitySource DupeActivitySource = new(InstrumentationName);

        public const string DuplicatesFoundName = "duplicates.detected";

        /// <summary>
        /// All actors we're interested in for potential duplicates
        /// </summary>
        public const string ActorTrackedName = "duplicates.tracked";

        public const string DuplicatesAliveDuration = "duplicates.detected.duration";

        public const string DuplicatesUnit = "actors";
    }

Actually Doing Things with OTEL: Metrics

- Use the meter from earlier to create an “observable gauge”
- Metrics can return simple values (int, double) or they can return simple values with “tag” metadata

    ObservedDuplicates = DupeMeter.CreateObservableGauge(DuplicatesFoundName, () =>
    {
        var state = _state;
        var allTrackedDuplicates = 0;
        foreach (var (actorPath, duplicates) in state.FoundDuplicates)
        {
            // var actorPathTag = new KeyValuePair<string, object?>("actor.path", actorPath.ToString());
            // allTrackedDuplicates.Add(new Measurement<int>(duplicates.Count, actorPathTag));
            allTrackedDuplicates += duplicates.Count;
        }

        return allTrackedDuplicates;
    }, DuplicatesUnit, "Number of duplicate actors found");
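The "simple values with tag metadata" variant can be sketched with nothing but System.Diagnostics.Metrics. The meter name, actor paths, and counts below are hypothetical. The MeterListener stands in for an exporter so the measurements can be inspected locally.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

using var meter = new Meter("Sketch.DuplicateDetector");

// Hypothetical state: duplicate counts keyed by actor path.
var foundDuplicates = new Dictionary<string, int>
{
    ["/user/orders/42"] = 2,
    ["/user/orders/77"] = 3,
};

// Callback returning one measurement per actor path, each tagged with "actor.path".
IEnumerable<Measurement<int>> ObserveDuplicates()
{
    var measurements = new List<Measurement<int>>();
    foreach (var (actorPath, count) in foundDuplicates)
    {
        measurements.Add(new Measurement<int>(
            count, new KeyValuePair<string, object?>("actor.path", actorPath)));
    }
    return measurements;
}

meter.CreateObservableGauge<int>(
    "duplicates.detected", ObserveDuplicates, "actors", "Number of duplicate actors found");

// Stand-in for an exporter: collect the tagged measurements in memory.
var observed = new Dictionary<string, int>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Sketch.DuplicateDetector")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) =>
{
    foreach (var tag in tags)
    {
        if (tag.Key == "actor.path" && tag.Value is string path)
            observed[path] = value;
    }
});
listener.Start();
listener.RecordObservableInstruments(); // polls the gauge callback once

foreach (var (path, count) in observed)
    Console.WriteLine($"{path}: {count} duplicates");
```

Tagging by actor path produces one time series per path, which is presumably why the per-path lines are commented out on the slide: a single summed value keeps cardinality flat.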

Actually Doing Things with OTEL: Traces

- Use the activity source from earlier to start a new trace
- Can append events or tags to activities to make them more detailed / searchable

    using var dupeCheck = DupeActivitySource.StartActivity("duplicate-check", ActivityKind.Server,
        Activity.Current?.Context ?? default);

    foreach (var (actorPath, duplicates) in stateWithDuplicates.FoundDuplicates)
    {
        var duplicateTimes = duplicates.Select(c => c.state.Started).OrderByDescending(c => c).ToArray();

        // compute max duration
        var duration = DateTime.UtcNow - duplicateTimes.First();

        dupeCheck?.AddEvent(new ActivityEvent("duplicate-found", DateTimeOffset.UtcNow,
            new ActivityTagsCollection(new Dictionary<string, object?>
            {
                ["actor.path"] = actorPath,
                ["duplicates"] = duplicates.Count,
                ["duration"] = duration,
                ["servers"] = string.Join(",", duplicates.Select(c => c.node.ToString()))
            })));

        _log.Warning("Found {0} duplicates [over {1}] for actor {2}: {3}", duplicates.Count, duration,
            actorPath, string.Join(", ", duplicates));
    }
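A detail that is easy to miss: StartActivity returns null unless something is listening, which is why the slide code uses `?.` everywhere. The sketch below registers an ActivityListener in place of the OTEL SDK so the activity, its tag, and its event can be inspected in memory. The source name and tag values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Stand-in for the OTEL SDK: record every stopped activity from our source.
var stopped = new List<Activity>();
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Sketch.DuplicateDetector",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = a => stopped.Add(a),
};
ActivitySource.AddActivityListener(listener);

using var activitySource = new ActivitySource("Sketch.DuplicateDetector");

using (var check = activitySource.StartActivity("duplicate-check", ActivityKind.Internal))
{
    // Tags describe the whole activity; events mark points in time within it.
    check?.SetTag("duplicates.total", 2);
    check?.AddEvent(new ActivityEvent(
        "duplicate-found",
        DateTimeOffset.UtcNow,
        new ActivityTagsCollection(new Dictionary<string, object?>
        {
            ["actor.path"] = "/user/orders/42", // hypothetical actor path
            ["duplicates"] = 2,
        })));
} // disposing the activity stops it and notifies the listener

foreach (var a in stopped)
{
    Console.WriteLine($"{a.OperationName} took {a.Duration.TotalMilliseconds} ms");
    foreach (var evt in a.Events)
        Console.WriteLine($"  event: {evt.Name}");
}
```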

Crafting Useful Alerts
Using Seq and Grafana

Alert Sources

- Metrics
  - Coarse-grained
  - Can easily detect “big” problems
  - Can identify when a problem happened easily, but not why or where
- Traces & Logs
  - Fine-grained
  - Takes tremendous computing power to detect a “big” problem
  - Can identify when, why, and where a problem happened

Useful Akka.NET Alerts

- Metrics
  - More than N DOWNed or Unreachable nodes
  - Error rate greater than the 30-minute average
  - Actor crash count greater than the 30-minute average
  - Specific alerts, e.g. Akka.Persistence errors
- Traces
  - Look for acute problems, e.g. “SQL Timeout” errors
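The "greater than the 30-minute average" rule is normally written in the alerting tool's own query language (Grafana or Seq). As a language-neutral illustration, here is the same comparison in plain C#. The sample data, the ShouldAlert helper, and the multiplier are assumptions, not anything Phobos or Grafana ships.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rule: alert when the newest per-minute sample exceeds the
// average of the preceding `windowMinutes` samples times `factor`.
// A factor of 1.0 matches the slide's wording; a larger factor reduces noise.
static bool ShouldAlert(IReadOnlyList<int> samples, int windowMinutes, double factor)
{
    if (samples.Count <= windowMinutes) return false; // not enough history yet
    double baseline = samples
        .Skip(samples.Count - 1 - windowMinutes)
        .Take(windowMinutes)
        .Average();
    int current = samples[samples.Count - 1];
    return current > baseline * factor;
}

// Hypothetical per-minute error counts; the last entry is the current minute.
var errorsPerMinute = new List<int>(Enumerable.Repeat(4, 30)); // 30 quiet minutes
bool quiet = ShouldAlert(new List<int>(errorsPerMinute) { 4 }, 30, 2.0);

errorsPerMinute.Add(25); // current minute spikes
bool spike = ShouldAlert(errorsPerMinute, 30, 2.0);

Console.WriteLine($"steady traffic alerts: {quiet}, spike alerts: {spike}");
```

Comparing against a rolling baseline rather than a fixed threshold lets the same rule work for both busy and idle clusters.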

Seq & Grafana Alerts Demo