Diagnosing Production Akka.NET Problems with OpenTelemetry.pptx
petabridge
212 views
30 slides
Aug 22, 2024
About This Presentation
We recently leveraged OpenTelemetry, Phobos, semantic logging, and more to resolve several high-impact, difficult bugs in Akka.NET itself. We want to share the techniques, best practices, methodologies, and tooling we used to do that - so in this webinar you will learn how to:
- Collect useful traces, logs, and metrics using a combination of built-in and custom OpenTelemetry instrumentation;
- Create useful alerts and dashboards using OpenTelemetry metrics;
- Leverage semantic logging and tracing data in platforms like Seq to get behavioral questions answered;
- Eliminate wasteful noise and cost using Phobos’ built-in trace filtration system; and
- Configure and deploy the OpenTelemetry agent to collect logs, traces, and metrics and ship them to APM destinations like .NET Aspire, Grafana, Prometheus, DataDog, and so on.
This webinar will take about an hour, and everything you learn in it can be easily applied locally on your developers' machines, in shared environments, or in production. Everything you learn here will be highly portable and a great tool to keep in your belt.
Slide Content
Diagnosing Production Akka.NET Problems with OpenTelemetry
Phobos, Seq, Grafana, and More
Table of Contents
- OpenTelemetry Basics
- Local vs. "Production" Telemetry Stacks
- Working with Phobos Data
- Noise Control (this is what will make OTEL actually useful)
- OTEL Filtering
- Custom Metrics & Traces
- Crafting Useful Alerts
OpenTelemetry at a Glance
- Unified, vendor-neutral observability + monitoring standard for all application runtimes
- Covers logging, metrics, and tracing
- Can work locally
- All major APM vendors involved and on-board
OTEL Configuration as Code
- Resources: describe the topology of the code being observed, i.e. "service: web-service, namespace: QA environment, instance: host1, version: 1.5.21"
- Metrics: subscriptions to metric sources
- Traces: subscriptions to activity sources
- Exporter: where metrics + traces are sent for processing
    services // IServiceCollection
        .AddOpenTelemetry()
        .ConfigureResource(builder =>
        {
            builder
                .AddEnvironmentVariableDetector()
                .AddTelemetrySdk()
                .AddServiceVersionDetector();
        })
        .WithMetrics(c =>
        {
            c.AddRuntimeInstrumentation()
                .AddPhobosInstrumentation()
                .AddHttpClientInstrumentation()
                .AddTestLabMetrics();
        })
        .WithTracing(c =>
        {
            c.AddHttpClientInstrumentation()
                .AddPhobosInstrumentation();
        })
        .UseOtlpExporter(OtlpExportProtocol.Grpc, new Uri(otlpEndpoint));
OTEL Logging Configuration
- Requires a separate configuration against the ILoggingBuilder
- Probably a good idea to standardize how resources are detected
- No need to add an exporter again if you already added the OTLP exporter
    hostBuilder.ConfigureLogging(builder =>
    {
        // builder is an ILoggingBuilder
        builder.ClearProviders();
        builder.AddConsole();
        builder.Services.Configure<LoggerFilterOptions>(opt =>
        {
            opt.MinLevel = logLevel;
        });

        var resourceBuilder = ResourceBuilder.CreateDefault();
        resourceBuilder
            .AddEnvironmentVariableDetector()
            .AddTelemetrySdk()
            .AddServiceVersionDetector();

        builder.AddOpenTelemetry(options =>
        {
            options.SetResourceBuilder(resourceBuilder);
        });
    });
Local vs. Production Telemetry Stacks
No, you don't need DataDog everywhere
Local Telemetry Stack
- https://github.com/StephenCleary/LocalTelemetry
- Includes OTLP, .NET Aspire, Prometheus, Grafana, Seq, Zipkin, Jaeger
- What I use: OTLP, Grafana, Prometheus, Seq
- Jaeger / Zipkin are memory hogs
- .NET Aspire metrics explorer is helpful, but Grafana dashboards are far superior
Production Telemetry Stack
- Probably best to go cloud-hosted: it is difficult to manage ingestion / storage at high volumes
- Be very careful with cardinality and data volume
  - Vendors might bill per metric dimension
  - Vendors might bill per volume
  - Phobos naturally accounts for these
  - OTEL filtering also helps
- Self-hosting is still an option!
Working with Phobos Data
Looking for useful Akka.NET signals with Phobos
About Phobos
- OpenTelemetry instrumentation for Akka.NET
- Designed to be high-performance
- Automatically captures useful data and suppresses noise, with no instrumentation code required by users
- Highly configurable
- Automatically correlates Akka.NET activity with non-Akka activity
- Costs $4000 per year per organization
- https://phobos.petabridge.com/
Metrics: What Data Does Phobos Capture?
- Akka.Cluster: nodes by status, reachability
- Akka.Cluster.Sharding: shard + entity allocations by region, address
- Akka:
  - Actor starts / stops, current alive total by type & address
  - Processed messages by actor type / message type / address
  - Latency for message processing
  - Log rates by level / address / exception types
  - Actor crashes / restarts by type
Built-in & Custom Dashboards Demo
Tracing: What Data Does Phobos Capture?
- akka.msg.recv {MsgType} – tracks actors receiving messages
- akka.actor.ask {MsgType} – tracks end-to-end Ask operation
- akka.actor.start / spawn – tracks instantiation of an actor
- akka.actor.crash – tracks actor crashes
- /system/unhandled – tracks unhandled messages
- /system/deadletters – tracks deadletters
- Automatically appends all Akka.NET logs to all msgs
Tracing Data Demo
Generally Useful Phobos Metrics Signals
- Did the cluster form correctly?
  - Unreachable nodes
  - Detected down / up nodes
- Is the cluster healthy?
  - Error rate (via logs)
  - Actor crash rate
- Is the application doing what it should?
  - Live actors by type
  - Message rates by type
Generally Useful Phobos Tracing Signals
- Specific errors / specific actors / specific messages
- Unhandled messages and dead letters
- akka.actor.ask duration
- akka.actor.start
Tracing Examples
Messages Processed by Actor + Msg Type
    select count(*) as MsgCount
    from stream
    where akka.actor.recv.msgType is not null and akka.actor.type is not null
    group by(akka.actor.type) as ActorType, group by(akka.actor.recv.msgType) as MsgType
    order by MsgCount desc
Tracking Shard Handoffs by Node
    select count(*) as MsgCount
    from stream
    where akka.actor.recv.msgType = 'Akka.Cluster.Sharding.ShardCoordinator+HandOff'
    group by(@Resource.service.instance.id) as Host
    order by Host desc
Tracing Examples
Errors by Version (Vanilla OTEL)
    select count(*) as ErrorCount
    from stream
    where @Level in ['e','err','error'] and Has(@Resource.service.version)
    group by @Resource.service.name, @Resource.service.version
Tracking Shard Handoffs by Node
    select count(*) as MsgCount
    from stream
    where akka.actor.recv.msgType = 'Akka.Cluster.Sharding.ShardCoordinator+HandOff'
    group by(@Resource.service.instance.id) as Host
    order by Host desc
Phobos Noise Control
- Carefully balancing signal vs. noise is key to making tracing useful + cost effective
- Trace filters are the most comprehensive solution
- Can also use Props or HOCON to disable / enable metrics on select actors
- Tracing / metrics of all /system actors disabled by default
    public sealed class DrawTogetherTraceFilter : ITraceFilter
    {
        public bool ShouldTraceMessage(object message, bool alreadyInTrace)
        {
            switch (message)
            {
                case IClusterShardingSerializable:
                    return true;
                case IClusterSingletonMessage:
                    return true;
                case IWithDrawingSessionId:
                    return true;
                case not null when alreadyInTrace:
                    return true;
                default:
                    return false;
            }
        }
    }
Phobos Noise Control
- Can decorate messages and actors with Phobos.Actor.Common interfaces
  - INeverTrace – no traces
  - INeverMonitor – no metrics
  - INeverInstrumented – nothing
    public interface IDupeTestMessage : INeverInstrumented { }

    public sealed class AkkaClusterFormationDuration : UntypedActor, INeverInstrumented
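The trace filter above can be exercised in isolation to see which messages it admits. The following is a minimal sketch, not Phobos code: ITraceFilter here is a local stand-in that only mirrors the method signature shown on the slide (the real interface ships with Phobos), and IWithOrderId, PlaceOrder, Heartbeat, and OrderTraceFilter are hypothetical names invented for illustration.

```csharp
using System;

// Local stand-in that mirrors the signature on the slide.
// Assumption: the real ITraceFilter comes from Phobos and may differ in detail.
public interface ITraceFilter
{
    bool ShouldTraceMessage(object message, bool alreadyInTrace);
}

// Hypothetical domain marker and messages, for illustration only.
public interface IWithOrderId { string OrderId { get; } }
public sealed record PlaceOrder(string OrderId) : IWithOrderId;
public sealed record Heartbeat();

public sealed class OrderTraceFilter : ITraceFilter
{
    public bool ShouldTraceMessage(object message, bool alreadyInTrace) => message switch
    {
        IWithOrderId => true,                 // domain messages may always start or join a trace
        not null when alreadyInTrace => true, // never cut a hole in a trace that is already running
        _ => false                            // everything else is treated as noise
    };
}
```

With this filter a Heartbeat on its own never starts a trace, but a Heartbeat sent while handling a PlaceOrder is still recorded, which keeps existing traces contiguous.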
OTEL Filtering: Metrics and Tracing
    .WithMetrics(c =>
    {
        c.AddRuntimeInstrumentation()
            .AddPhobosInstrumentation()
            .AddHttpClientInstrumentation()
            .AddAspNetCoreInstrumentation()
            .AddMeter(EmailSendingTelemetry.EmailOtelName)
            // removes all `akka.messages.latency` metrics from export
            .AddView("akka.messages.latency*", MetricStreamConfiguration.Drop);
    })
    .WithTracing(c =>
    {
        c.AddHttpClientInstrumentation()
            .AddPhobosInstrumentation()
            .AddAspNetCoreInstrumentation()
            .AddSource(EmailSendingTelemetry.EmailOtelName)
            // only keep 10% of all traces
            .SetSampler(new TraceIdRatioBasedSampler(0.1d));
    });
Custom OpenTelemetry Metrics and Traces
Actually Doing Things with OTEL
- Meter: this is where metrics come from
- ActivitySource: this is where traces come from
- Declare both as static singletons (logs can come normally from Microsoft.Extensions.Logging)
    public static class DuplicateDetectorInstrumentation
    {
        public const string InstrumentationName = "DuplicateDetector";

        public static readonly Meter DupeMeter = new(InstrumentationName);
        public static readonly ActivitySource DupeActivitySource = new(InstrumentationName);

        public const string DuplicatesFoundName = "duplicates.detected";

        /// <summary>
        /// All actors we're interested in for potential duplicates
        /// </summary>
        public const string ActorTrackedName = "duplicates.tracked";

        public const string DuplicatesAliveDuration = "duplicates.detected.duration";
        public const string DuplicatesUnit = "actors";
    }
Actually Doing Things with OTEL: Metrics
- Use the meter from earlier to create an "observable gauge"
- Metrics can return simple values (int, double) or they can return simple values with "tag" metadata
    ObservedDuplicates = DupeMeter.CreateObservableGauge(DuplicatesFoundName, () =>
    {
        var state = _state;
        var allTrackedDuplicates = 0;
        foreach (var (actorPath, duplicates) in state.FoundDuplicates)
        {
            // var actorPathTag = new KeyValuePair<string, object?>("actor.path", actorPath.ToString());
            // allTrackedDuplicates.Add(new Measurement<int>(duplicates.Count, actorPathTag));
            allTrackedDuplicates += duplicates.Count;
        }
        return allTrackedDuplicates;
    }, DuplicatesUnit, "Number of duplicate actors found");
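The "simple values with tag metadata" variant can be sketched with nothing but System.Diagnostics.Metrics. This is an illustration under assumptions, not the presenter's code: the class name DuplicateGaugeSketch, the meter name "DuplicateDetector.Sketch", and the dictionary standing in for the actor's state are all hypothetical. The gauge emits one measurement per actor path, and a MeterListener plays the role the OpenTelemetry SDK would normally play. Note that one time series per actor path can drive cardinality up quickly.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class DuplicateGaugeSketch : IDisposable
{
    public const string MeterName = "DuplicateDetector.Sketch"; // hypothetical meter name

    private readonly Meter _meter = new(MeterName);
    private readonly IReadOnlyDictionary<string, int> _duplicatesByPath; // stand-in for actor state

    public DuplicateGaugeSketch(IReadOnlyDictionary<string, int> duplicatesByPath)
    {
        _duplicatesByPath = duplicatesByPath;
        // The callback runs each time a listener (or exporter) collects the gauge.
        _meter.CreateObservableGauge<int>(
            "duplicates.detected", Observe, "actors", "Number of duplicate actors found");
    }

    // One measurement per actor path, each carrying an "actor.path" tag.
    private IEnumerable<Measurement<int>> Observe()
    {
        foreach (var pair in _duplicatesByPath)
        {
            yield return new Measurement<int>(
                pair.Value, new KeyValuePair<string, object?>("actor.path", pair.Key));
        }
    }

    // Collects the gauge once and returns the observed value per actor path.
    public static Dictionary<string, int> CollectOnce()
    {
        var seen = new Dictionary<string, int>();
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == MeterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) =>
        {
            foreach (var tag in tags)
            {
                if (tag.Key == "actor.path" && tag.Value is string path)
                    seen[path] = value;
            }
        });
        listener.Start();
        listener.RecordObservableInstruments();
        return seen;
    }

    public void Dispose() => _meter.Dispose();
}
```

In a real application the Meter would stay a static singleton as described earlier and be registered through AddMeter; the instance-based shape here only keeps the sketch self-contained.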
Actually Doing Things with OTEL: Traces
- Use the activity source from earlier to start a new trace
- Can append events or tags to activities to make them more detailed / searchable
    using var dupeCheck = DupeActivitySource.StartActivity("duplicate-check", ActivityKind.Server,
        Activity.Current?.Context ?? default);

    foreach (var (actorPath, duplicates) in stateWithDuplicates.FoundDuplicates)
    {
        var duplicateTimes = duplicates.Select(c => c.state.Started).OrderByDescending(c => c).ToArray();

        // compute max duration
        var duration = DateTime.UtcNow - duplicateTimes.First();

        dupeCheck?.AddEvent(new ActivityEvent("duplicate-found", DateTimeOffset.UtcNow,
            new ActivityTagsCollection(new Dictionary<string, object?>
            {
                ["actor.path"] = actorPath,
                ["duplicates"] = duplicates.Count,
                ["duration"] = duration,
                ["servers"] = string.Join(",", duplicates.Select(c => c.node.ToString()))
            })));

        _log.Warning("Found {0} duplicates [over {1}] for actor {2}: {3}", duplicates.Count, duration,
            actorPath, string.Join(", ", duplicates));
    }
Crafting Useful Alerts
Using Seq and Grafana
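The null-conditional calls in the trace example matter because ActivitySource.StartActivity returns null when nothing is listening to the source. A minimal sketch using only System.Diagnostics shows both cases; the class name ActivitySketch and the source name "DuplicateDetector.Sketch" are hypothetical, and the ActivityListener stands in for what the OpenTelemetry SDK registers when AddSource is called.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ActivitySketch
{
    public const string SourceName = "DuplicateDetector.Sketch"; // hypothetical source name

    public static readonly ActivitySource Source = new(SourceName);

    // Starts an activity, tags it, and appends one event.
    // Returns false when no listener is subscribed, because StartActivity then yields null.
    public static bool RunCheck(string actorPath, int duplicateCount)
    {
        using var activity = Source.StartActivity("duplicate-check", ActivityKind.Internal);
        activity?.SetTag("actor.path", actorPath);
        activity?.AddEvent(new ActivityEvent("duplicate-found", DateTimeOffset.UtcNow,
            new ActivityTagsCollection(new Dictionary<string, object?>
            {
                ["duplicates"] = duplicateCount
            })));
        return activity is not null;
    }

    // Subscribes to the source and appends every finished activity to the given list.
    public static ActivityListener Listen(List<Activity> stopped)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = stopped.Add
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Calling RunCheck before Listen produces no activity at all, which is why a forgotten AddSource registration usually shows up as silently missing traces instead of an error.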
Alert Sources
- Metrics
  - Coarse-grained
  - Can easily detect "big" problems
  - Can easily identify when a problem happened, but not why or where
- Traces & Logs
  - Fine-grained
  - Takes tremendous computing power to detect a "big" problem
  - Can identify when, why, and where a problem happened
Useful Akka.NET Alerts
- Metrics
  - More than N DOWNed or Unreachable nodes
  - Error rate > 30 min average
  - Actor crash count > 30 min average
  - Specific alerts, i.e. Akka.Persistence errors
- Traces
  - Look for acute problems, i.e. "SQL Timeout" errors
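The "> 30 min average" rules above come down to comparing the newest sample against a rolling mean. In practice this would be expressed as an alert rule in Grafana or Seq; the sketch below shows only the arithmetic. The class name RollingAverageAlert and the one-sample-per-minute cadence are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class RollingAverageAlert
{
    private readonly Queue<int> _window = new();
    private readonly int _windowSize;

    // windowSize = 30 with one sample per minute approximates a 30-minute average.
    public RollingAverageAlert(int windowSize = 30) => _windowSize = windowSize;

    // Returns true when the new sample exceeds the average of the samples already in the window.
    public bool Observe(int errorsThisMinute)
    {
        var breached = _window.Count > 0 && errorsThisMinute > _window.Average();

        _window.Enqueue(errorsThisMinute);
        if (_window.Count > _windowSize)
            _window.Dequeue(); // drop the oldest sample once the window is full

        return breached;
    }
}
```

Comparing against the window before the new sample is added keeps a single spike from raising its own baseline.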