Detect operational anomalies in Serverless Applications with Amazon DevOps Guru at AWS Cloud Day Warsaw 2024
Vadym Kazulkin
80 slides
Sep 21, 2024
About This Presentation
In this talk we'll use a standard serverless application built on API Gateway, Lambda, DynamoDB, SQS, SNS, Kinesis, Step Functions, Aurora Serverless and other AWS managed services. We'll explore how Amazon DevOps Guru recognizes operational issues and anomalies such as increased latency and elevated error rates (timeouts, throttling). We will also explore DevOps Guru "Proactive Insights", which recognize configuration anti-patterns such as a missing failure destination for a Lambda function consuming a Kinesis data stream, a missing DLQ on SQS, or over-provisioned AWS resources such as DynamoDB tables. We'll also integrate DevOps Guru with PagerDuty for better incident management, and look at the current shortcomings of the DevOps Guru service.
Amazon DevOps Guru analyzes data like application metrics, logs, events, and traces to establish baseline operational behavior and then uses ML to detect anomalies. The service uses pre-trained ML models that are able to identify spikes in application requests, so it knows when to alert and when not to.
Slide Content
Vadym Kazulkin | @VKazulkin | ip.labs GmbH
by Vadym Kazulkin
ip.labs GmbH
17.10.2023
How to Reduce Cold Starts for Java Serverless Applications in AWS
GraalVM, AWS SnapStart and Co
Contact
Vadym Kazulkin
ip.labs GmbH Bonn, Germany
Co-Organizer of the Java User Group Bonn
@VKazulkin
https://dev.to/vkazulkin
https://github.com/Vadym79/
https://de.slideshare.net/VadymKazulkin/
https://www.linkedin.com/in/vadymkazulkin
https://www.iplabs.de/
About ip.labs
Amazon DevOps Guru for the Serverless Applications
DevOps Lifecycle
Amazon DevOps Guru
AIOps
Artificial Intelligence for IT Operations (AIOps) is the process of using machine learning techniques to solve operational problems. The goal of AIOps is to reduce human intervention in IT operations processes. By using advanced machine learning techniques, you can reduce operational incidents and increase service quality. AIOps can help you:
•Increase service quality, for example by grouping related incidents based on time and language
•Predict incidents before they happen
https://aws.amazon.com/devops-guru
What is Amazon DevOps Guru
Amazon DevOps Guru offers a fully managed AIOps platform powered by machine learning (ML) that is designed to make it easy to improve an application's operational performance and availability.
DevOps Guru helps detect behaviors that deviate from normal operating patterns so you can identify operational issues long before they impact your customers:
•increased latency
•error rates (timeouts, throttles, CPU, memory, and disk utilization)
•resource constraints (exceeding AWS account limits)
https://aws.amazon.com/devops-guru
Benefits of DevOps Guru
https://aws.amazon.com/devops-guru
https://aws.amazon.com/devops-guru
How DevOps Guru works
DevOps Guru is powered by pre-trained ML models
•Built as domain-specific, single-purpose models that identify known failure modes instead of modeling normal metric behavior.
•DevOps Guru relies on a large ensemble of detectors: statistical models tuned to detect common adverse scenarios in a variety of operational metrics.
•DevOps Guru detectors don't need to be trained or configured. They work instantly as long as enough history is available.
•Individual detectors work in preconfigured ensembles to generate anomalies on some of the most important metrics: error rates, availability, latency, incoming request rates, CPU, memory, and disk utilization, among others.
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
DevOps Guru pre-trained ML detectors with periodic behaviors
•Many metrics, such as the number of incoming requests in customer-facing APIs, exhibit periodic behavior.
•The purpose of the causal convolution detector is to analyze temporal data with such patterns and to determine expected periodic behavior.
•When the detector infers that a metric is periodic, it adapts normal metric behavior thresholds to the seasonal pattern.
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
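To make the idea of seasonal thresholds concrete, here is a minimal sketch. It is not the causal convolution detector from the blog post and not DevOps Guru code: it swaps in a much simpler technique (a mean and standard deviation per hour of day), and the function names, the 24-slot period and the factor `k` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def seasonal_thresholds(history, period=24, k=3.0):
    """Compute a (low, high) band per slot of the period.

    history: metric values sampled once per slot (e.g. one value per hour).
    period:  assumed season length in slots (24 = daily pattern, hourly data).
    k:       assumed band width in standard deviations.
    """
    buckets = defaultdict(list)
    for index, value in enumerate(history):
        buckets[index % period].append(value)

    thresholds = {}
    for slot, values in buckets.items():
        centre, spread = mean(values), pstdev(values)
        thresholds[slot] = (centre - k * spread, centre + k * spread)
    return thresholds


def is_anomalous(thresholds, index, value, period=24):
    """True when the value falls outside the band for its slot in the period."""
    low, high = thresholds[index % period]
    return value < low or value > high
```

With a band per hour of day, 100 requests at noon can be normal while the same 100 requests at 3 a.m. are flagged, which is the behavior a single static threshold cannot express.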
What the future of software developers may look like
Monitoring & Alerting of the Serverless Applications
DevOps Guru Example Application
https://github.com/Vadym79/DevOpsGuruWorkshopDemo inspired by https://github.com/aws-samples/serverless-java-frameworks-samples
DevOps Guru Set Up
DevOps Guru Set Up with AWS Organizations
https://aws.amazon.com/blogs/mt/how-to-easily-configure-devops-guru-across-your-organization-with-systems-manager-quick-setup/
DevOps Guru Dashboard
DevOps Guru Reactive Insights
DevOps Guru Examples
•Warm up the application (takes between 1 and 24 hours) to create a baseline
•Design a test experiment to provoke errors and increased latency
  •Reduce the service quota of the AWS service (API Gateway, Lambda, DynamoDB)
  •Set very low service quotas to keep AWS costs down
  •Add latency artificially
•Stress test with the hey tool to run into the operational issues
•See whether DevOps Guru recognizes the operational issues
•Remediate the operational issues by increasing the service quota, removing the artificial latency or stopping the stress test
•See whether DevOps Guru closes the incident when it's resolved
https://github.com/rakyll/hey
DevOps Guru: Recognize Operational Issues in DynamoDB
DevOps Guru Examples: DynamoDB Throttling
hey -q 20 -z 15m -c 20 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1
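The load offered by a hey command can be estimated from its flags. According to the hey README, `-c` is the number of concurrent workers, `-q` is the rate limit in queries per second per worker, and `-z` is the test duration. The sketch below is only an upper bound under those semantics; the helper names are hypothetical and the duration parser handles only single-unit values such as `15m`.

```python
def parse_duration_seconds(text):
    """Parse a simple single-unit duration such as '30s', '15m' or '1h'."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def max_offered_load(qps_per_worker, workers, duration):
    """Return (requests per second, total requests) as an upper bound.

    The real rate is lower whenever responses are slower than the rate limit
    allows, because each worker waits for its response before sending again.
    """
    rate = qps_per_worker * workers
    return rate, rate * parse_duration_seconds(duration)


# Flags from the DynamoDB throttling example: -q 20 -c 20 -z 15m
print(max_offered_load(20, 20, "15m"))  # (400, 360000)
```

Comparing this upper bound with the configured quota or provisioned capacity of the target service shows in advance whether a given command is likely to provoke throttling.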
DevOps Guru: Recognize Operational Issues in API Gateway
DevOps Guru Examples: API Gateway
HTTP 429 "Too Many Requests" Error
Query to exhaust the quota
hey -q 10 -z 1m -c 10 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1
DevOps Guru Examples: API Gateway
HTTP 404 "Not Found" Error
Query for a non-existent product id, e.g. 200
hey -q 1 -z 15m -c 1 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/200
DevOps Guru: Recognize Operational Issues in Lambda
DevOps Guru Examples: Lambda Throttling 1
hey -q 5 -z 15m -c 5 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1
DevOps Guru Examples: Lambda Timeout Error
Add 31 seconds of latency in the code of the Lambda function
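One way to inject such latency without redeploying code for every experiment is to read the delay from configuration. The sketch below is an illustration only: the demo application is written in Java, while this handler is Python, and the environment variable name `ARTIFICIAL_DELAY_SECONDS` is hypothetical. If the delay exceeds the function's configured timeout, Lambda ends the invocation and reports a timeout error.

```python
import os
import time


def handler(event, context, sleep=time.sleep):
    """Lambda-style handler that optionally sleeps before responding.

    ARTIFICIAL_DELAY_SECONDS is a hypothetical variable for this sketch;
    the sleep function is injectable so the handler can be tested quickly.
    """
    delay = float(os.environ.get("ARTIFICIAL_DELAY_SECONDS", "0"))
    if delay > 0:
        sleep(delay)  # simulate a slow downstream dependency
    return {"statusCode": 200, "body": "ok"}
```

Setting the variable back to 0 (or removing it) is then the remediation step of the experiment.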
DevOps Guru Examples: Lambda Error
DevOps Guru Examples: Lambda Increased Latency
Temporarily add 28 seconds of latency in the code of the Lambda function
DevOps Guru: Recognize Operational Issues in SQS
DevOps Guru: Operational Issues in SQS
Temporarily add 26 seconds of latency in the code of the Lambda function
DevOps Guru: Recognize Operational Issues in Amazon Kinesis
DevOps Guru Examples: Operational Issues in
Amazon Kinesis Data Stream -> Lambda -> (S3)
DevOps Guru: Recognize Operational Issues in
AWS Step Functions
DevOps Guru Examples: Operational Issues in AWS Step Functions -> Lambda
DevOps Guru: Recognize Operational Issues in Aurora
Serverless v2 PostgreSQL
DevOps Guru Examples: Enabling Performance
Insights for Aurora Serverless v2
DevOps Guru Examples: Operational Issues Lambda -> Aurora Serverless v2 w/o RDS Proxy
hey -q 100 -z 15m -c 100 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithoutDataApi/2
DevOps Guru: Recognize Operational Issues in Aurora
Serverless v2 PostgreSQL using DataAPI
DevOps Guru Examples: Operational Issues Lambda -> Aurora Serverless v2 using Data API
hey -q 100 -z 15m -c 100 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithDataApi/2
No anomalous Aurora Serverless DB metrics detected
hey -q 100 -z 15m -c 100 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithDataApi/1
[Screenshots: metrics labeled "Data API" and "Non Data API"]
DevOps Guru Proactive Insights
DevOps Guru Proactive Examples: DynamoDB table reads/writes are underutilized
DevOps Guru Proactive Examples: DynamoDB table
point in time recovery not enabled
DevOps Guru Proactive Examples: Lambda
timeout exceeds recommended SQS visibility
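The rule behind this kind of insight can be checked before deployment. AWS guidance for SQS event sources is to set the queue's visibility timeout to at least six times the function timeout, so that a message is not redelivered while a batch is still being processed or retried. The sketch below encodes that guidance; whether DevOps Guru applies exactly this factor is an assumption, and the function names are hypothetical.

```python
def recommended_min_visibility_timeout(function_timeout_seconds, factor=6):
    """Minimum SQS visibility timeout for a given Lambda timeout.

    factor=6 follows the AWS guidance for SQS event sources; treating it as
    the threshold DevOps Guru uses is an assumption of this sketch.
    """
    return function_timeout_seconds * factor


def visibility_timeout_ok(function_timeout_seconds, visibility_timeout_seconds):
    """True when the queue's visibility timeout meets the recommendation."""
    minimum = recommended_min_visibility_timeout(function_timeout_seconds)
    return visibility_timeout_seconds >= minimum
```

For example, a function with a 30 second timeout paired with the default 30 second visibility timeout fails the check, while a 180 second visibility timeout passes.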
DevOps Guru Proactive Examples: SQS Triggered Lambda
Does Not Have a DLQ
DevOps Guru Proactive Examples: Lambda Function Consuming
DynamoDB/Kinesis Stream Without Failure Destination
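The same anti-pattern can be looked for in your own account. The sketch below works on plain dictionaries shaped like the entries the Lambda `ListEventSourceMappings` API returns (`UUID`, `EventSourceArn`, `DestinationConfig.OnFailure.Destination`); fetching them, for example with boto3, is left out so the example runs offline. The function name and the ARN matching are assumptions for illustration, not DevOps Guru's own logic.

```python
def mappings_without_failure_destination(mappings):
    """Return UUIDs of stream event source mappings lacking an on-failure destination.

    mappings: list of dicts shaped like Lambda event source mapping
    descriptions. Only Kinesis and DynamoDB stream sources are considered.
    """
    missing = []
    for mapping in mappings:
        arn = mapping.get("EventSourceArn", "")
        is_stream = ":kinesis:" in arn or (":dynamodb:" in arn and "/stream/" in arn)
        destination = (
            mapping.get("DestinationConfig", {})
            .get("OnFailure", {})
            .get("Destination")
        )
        if is_stream and not destination:
            missing.append(mapping.get("UUID"))
    return missing
```

Without a failure destination, records that still fail after the retries are exhausted are discarded, which is why this configuration is worth flagging.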
DevOps Guru Proactive Examples: Lambda Function Has Concurrency Spillover
hey -q 1 -z 30m -c 9 -m DELETE -H "X-API-Key: XXXa6XXXX" -H "Content-Type: application/json;charset=utf-8" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/11
DevOps Guru Proactive Examples: Lambda Function
does not have enough subnets
DevOps Guru Integration with Incident Management Tools
•AWS Systems Manager OpsCenter
•PagerDuty
•Atlassian Opsgenie
DevOps Guru Integration Settings
DevOps Guru Integration with PagerDuty
https://www.pagerduty.com/docs/guides/amazon-devops-guru-integration-guide/
Enter the "Integration URL" generated by PagerDuty
DevOps Guru PagerDuty Incidents
DevOps Guru Supported Services and Pricing
https://aws.amazon.com/de/devops-guru/pricing/
$3.024 per resource per month
$2.016 per resource per month
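The two monthly figures, read with a decimal point (about three and about two US dollars), follow from an hourly per-resource rate multiplied by the hours in a month. The sketch below shows that arithmetic. The hourly rates of $0.0042 and $0.0028 and the 720-hour month are assumptions back-calculated from the slide's numbers; check the pricing page for current values and for which resource types fall into which price group.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_cost(hourly_rate_usd, resources=1, hours=HOURS_PER_MONTH):
    """Monthly DevOps Guru resource-analysis cost for always-on resources."""
    return round(hourly_rate_usd * hours * resources, 3)


# Assumed hourly rates, derived from the monthly figures on the slide.
print(monthly_cost(0.0042))  # 3.024
print(monthly_cost(0.0028))  # 2.016
```

Multiplying by the `resources` argument gives a quick estimate for a whole stack, for example `monthly_cost(0.0028, resources=25)` for 25 resources in the cheaper group.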
DevOps Guru Cost Estimator
https://aws.amazon.com/de/devops-guru/pricing/
DevOps Guru Conclusions, Observations, Suggestions
•Most operational issues have been correctly recognized so far
•It took several (at least 7) minutes to create an incident after the anomaly appeared
•Correctly, no insights were created for temporary incidents
  •Short-lived Lambda, DynamoDB and API Gateway throttling
•Lambda duration anomaly insights (Duration p90)
  •It took time to create such an insight (sometimes more than 30 minutes), possibly because of the medium severity
•Recommendations for the insight reason could be more precise (these are limitations of CloudWatch, though)
  •No precise HTTP response code in the API Gateway response, only 4XX and 5XX
  •No differentiation between Lambda throttling caused by reaching the individual function concurrency limit and throttling caused by reaching the total AWS account concurrency limit
  •No differentiation between Lambda Timeout and Init Error
•DevOps Guru Proactive Insights
  •Missed some important ones, such as Lambda Provisioned Concurrency left unused for a long period of time
#AWS #Wishlist for DevOps Guru
•Support for EventBridge (and EventBridge Pipes)
•Support for AppSync
•Support for Aurora (Serverless v2) over Data API
•Better support for tracing, e.g. AWS X-Ray and CloudWatch ServiceLens, and integrations with third-party observability tools, e.g. Lumigo, Datadog