Apidays Singapore 2024 - API Monitoring x SRE by Ryan Ashneil and Eugene Wong, GovTech Singapore

APIdays_official 97 views 31 slides May 02, 2024

About This Presentation

API Monitoring x SRE (Site Reliability Engineering)
Ryan Ashneil, Software Engineer - Government Technology Agency of Singapore
Eugene Wong, Senior DevOps Engineer - Government Technology Agency of Singapore

Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2...


Slide Content

API Monitoring x SRE
Eugene Wong - Senior DevOps Engineer
Ryan Ashneil - Software Engineer
18 April 2024, apidays

SYNOPSIS
Discover the underlying technologies that power Singapore Government APIs. Peek into how we design our central API Monitoring Dashboards around the SRE principles.

CONTENTS
CONTEXT BUILDING
- Introduction to our platforms
- API Monitoring within the API Lifecycle
DASHBOARD DEEP DIVE
- SRE-designed dashboards
- Dashboarding: for preparation of major events
- Dashboarding: for authentication of APIs
CONCLUSION
- Q&A

Learn about our platforms Context building

Introduction of our platforms: APEX
A full-fledged API management platform of the Singapore Government Tech Stack (SGTS) that allows government agencies, businesses and developers to manage and share APIs:
- Cross-zone bridging between the internet and intranet realms
- Experience a diversity of APIs (>30 agencies) to build meaningful government services
- Secure, and kept current through regular updates to policies, standards and best practices

Introduction of our platforms: StackOps
Based on the Elastic Stack, StackOps is a key monitoring component of the SGTS, designed to boost observability and support SRE:
- Simplified monitoring setup on the Government Commercial Cloud
- Reduction of operation overheads to run monitoring
- Accelerated Mean Time to Resolution (MTTR)
- Helps APEX to monitor the 4 Golden Signals of SRE
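The four golden signals referred to above are latency, traffic, errors and saturation. As a rough illustration of what a dashboard panel computes for one time window, here is a minimal Python sketch. It is not StackOps or Elastic code: the `ApiCall` record, the `golden_signals` function and the `in_flight` / `capacity` inputs are hypothetical names chosen for this example, and counting only 5xx responses as errors is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class ApiCall:
    """One gateway log record (hypothetical shape, for illustration only)."""
    status: int        # HTTP status code returned to the API consumer
    latency_ms: float  # time taken to serve the request


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(calls, window_seconds, in_flight, capacity):
    """Summarise latency, traffic, errors and saturation for one time window.

    in_flight / capacity stand in for whatever resource is the bottleneck
    (connections, CPU, queue depth); both are assumptions of this sketch.
    """
    if not calls:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": in_flight / capacity}
    # Assumption: only 5xx responses count as service errors.
    server_errors = sum(1 for c in calls if c.status >= 500)
    return {
        "latency_p95_ms": percentile([c.latency_ms for c in calls], 95),
        "traffic_rps": len(calls) / window_seconds,
        "error_rate": server_errors / len(calls),
        "saturation": in_flight / capacity,
    }
```

Whether 4xx responses should also count towards the error signal is a design choice; for a gateway whose consumers frequently send bad credentials, tracking them as a separate series is one option.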

API Monitoring within the API Lifecycle (Where our platforms intersect)
- As active applications consume published APIs, their traffic transactions are logged and piped into StackOps.
- Within APEX itself, our API publishers can run their entire API lifecycle activities, which are also logged and monitored.
- Critical metrics of our infrastructure are also captured and shipped to StackOps.
Diagram labels: API Lifecycle; Services, Portals, Gateways, Infrastructure

Credits and Reference (SRE-designed Dashboards)
The Site Reliability Workbook: Practical Ways to Implement SRE (Niall Richard Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne). Published by O’Reilly (3rd Release).

SRE-designed dashboards Dashboard deep dive

Monitoring in SRE (SRE-designed Dashboards)
Which of these SRE Principles are related to monitoring?
- Monitoring
- Risk (evaluating risk of unexpected failures)
- SLOs
- Eliminating Toil (automating work to increase reliability and productivity)
- Automation (including testing, software deployment, incident response, team communication)
- Release Engineering
- Simplicity (simple to manage)

Monitoring in SRE – Interfaces (cont’d) (SRE-designed Dashboards)
SRE Workbook Pg 63: Interfaces. You’ll likely need to offer different views of the same data based on audience… Be specific about creating dashboards that make sense to the people consuming the content.

Monitoring in SRE – Interfaces SRE-designed Dashboards

Monitoring in SRE – Ownership & Tooling (SRE-designed Dashboards)
SRE Workbook Pg 7: Use the Same Tooling, Regardless of Function or Job Title. …teams minding a service should use the same tools, regardless of their role in the organization… The more divergence you have, the less your company benefits from each effort to improve each individual tool.

Monitoring in SRE – Ownership & Tooling (SRE-designed Dashboards)
- We used a common monitoring tool to measure API Performance (traffic, latency), Business Metrics (SLO, SLI), application logging and infrastructure metrics/logging.
- Product Managers, Devs and DevOps, as well as API publishers (customers), all used the same monitoring software – StackOps.
- This has enabled us to work together and link events across different areas, fostering enhancements and streamlining the monitoring tool.
- Monitoring spans multiple domains, allowing APEX to oversee system performance during events and assist API users in real-time troubleshooting.
- Incentivizes us to make tool improvements to benefit all parties.

Monitoring in SRE – Speed (SRE-designed Dashboards)
- API metrics needing additional logging pipeline treatment were received in StackOps between a few minutes and a couple of hours after the API calls. Data was not fresh.
- Logging agents had been deployed “out-of-the-box” from the vendor’s Kubernetes manifest.
- We monitored every “hop” of the logging containers and re-architected the logging infrastructure to ensure that each “hop” of logging was right-sized and optimised (in configuration) for performance.
SRE Workbook Pg 62: Speed. Data should be available when you need it… Data more than four to five minutes stale might significantly impact how quickly you can respond to an incident.
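Monitoring every hop of a logging pipeline comes down to comparing the timestamps a record picks up along the way. The Python sketch below shows one way to find the hop that adds the most delay and to flag records that breach the workbook's freshness guidance. The hop names (`api_call`, `agent`, `pipeline`, `indexed`) and the function names are hypothetical; the slides do not describe how the APEX team instrumented their pipeline.

```python
from datetime import datetime, timedelta, timezone

# Upper end of the workbook's "four to five minutes" guidance.
STALE_AFTER = timedelta(minutes=5)


def hop_lags(timestamps):
    """Given ordered (hop_name, datetime) pairs, return the delay added by each hop."""
    lags = {}
    for (_, previous), (name, current) in zip(timestamps, timestamps[1:]):
        lags[name] = current - previous
    return lags


def slowest_hop(timestamps):
    """Name of the hop that added the largest delay."""
    lags = hop_lags(timestamps)
    return max(lags, key=lags.get)


def is_stale(timestamps, threshold=STALE_AFTER):
    """True when the end-to-end delay (first hop to last hop) exceeds the threshold."""
    return timestamps[-1][1] - timestamps[0][1] > threshold


# Example record: the API call happened at 09:00:00 UTC and was searchable 7m10s later.
t0 = datetime(2024, 4, 18, 9, 0, 0, tzinfo=timezone.utc)
record = [
    ("api_call", t0),
    ("agent", t0 + timedelta(seconds=2)),
    ("pipeline", t0 + timedelta(minutes=7)),
    ("indexed", t0 + timedelta(minutes=7, seconds=10)),
]
```

For this record, `slowest_hop(record)` points at the `pipeline` hop, which is the kind of evidence that would justify resizing or reconfiguring that stage first.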

Monitoring in SRE – Modelling (SRE-designed Dashboards)
SRE Workbook Pg 22: Draw a high-level architecture diagram of your system; show the key components; the request flow, the data flow, and the critical dependencies.
SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.
SRE Workbook Pg 72: Implementing Purposeful Metrics. Each exposed metric should serve a purpose… When you write a postmortem, think about which additional metrics would have allowed you to diagnose the issue faster.

Monitoring in SRE – Modelling (Cont’d) (SRE-designed Dashboards)
- How we define Critical Metrics: traffic that affects the bottom line (revenue) of APEX
- Being intentional and concise about which metrics to display in the Critical Metrics graph
- Mapped out the paths of critical flow inter-dependencies, including: K8s nodes, Traffic, Pods, Database, Load Balancers, Forward Proxies, API Traffic

Dashboarding: for preparation of major events Dashboard deep dive

Preparations for National Event (Dashboarding: for preparation of major events)
Devising a custom dashboard for the event based on the principles below:
- The dashboard shows the business metrics (API metrics) which are important for the event (i.e., API status codes, API latency).
- Links to other critical data which allow troubleshooting are also embedded in the dashboard.
- Participating in end-to-end load tests with the API consumer and publisher, both to test the usefulness of the dashboard and to ascertain the performance of the API backend server.
SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.

Preparations for National Event - Dashboarding Dashboarding: for preparation of major events

Preparations for National Event - Dashboarding (Dashboarding: for preparation of major events)
SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.

Dashboarding: for authentication of APIs Dashboard deep dive

JWT Authentication (Dashboarding: for authentication of APIs)
What is JWT Authentication and how is it used?
JWT Authentication is a client-assertion-based security mechanism for our APIs that incorporates authentication, authorisation, data integrity and non-repudiation. It is loosely based on a JWT authorisation header: the API consumer signs the JWT, and the APEX system verifies the claims and signature of the signed JWT.

JWT Authentication (Dashboarding: for authentication of APIs)
Opportunities: Our monitoring setup and uniquely defined error codes allowed the APEX Operations team to be lean and focus on other higher-value work instead of being bogged down by the time-consuming work of studying logs to diagnose the root cause of issues.
Problems: As with API gateway systems, authentication and authorisation errors form a good percentage of the troubleshooting tickets and were time-consuming to investigate.

Customised Error Codes of JWT Authentication (Dashboarding: for authentication of APIs)
Code     Meaning
403/432  Invalid JWKS endpoint of API Key
433      Invalid JWKS
434      JWT header missing
435      Invalid JWT format
436      Missing ‘iss’ claim
437      Unable to find matching Key Id
438      Missing or invalid ‘alg’ claim
439      Missing or invalid ‘typ’ claim
440      Invalid ‘iss’ claim
441      Missing or invalid ‘iat’ claim
442      Missing or invalid ‘aud’ claim
443      Missing or invalid ‘jti’ claim
445      Missing or invalid ‘sub’ claim
446      Missing or invalid ‘data’ claim
447      Missing or invalid ‘exp’ claim
448      Invalid JWT format
449      Missing or invalid ‘iss’ claim
450      Invalid API Key
452      Invalid JWT Signature
4XX      Reuse of nonce
4XX      > 2 API Keys detected in ‘iss’ claim
4XX      Attempt to use none in ‘alg’ claim
4XX      Attempt to use claims – ‘jku’, ‘x5u’, ‘x5t’, ‘x5c’
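To show how one distinct code per failure mode makes a rejected request self-explanatory, here is a minimal Python sketch that maps JWT checks to a subset of the codes listed above. It is an illustration only, not the APEX implementation: the order of checks, the `ALLOWED_ALGS` set, the function names and the rule for comparing `aud` are all assumptions. Signature verification (452), the JWKS lookups (432, 433, 437) and the unnumbered 4XX security codes are deliberately left out.

```python
import base64
import json

# Assumption: the slides do not say which signing algorithms APEX accepts.
ALLOWED_ALGS = {"RS256", "ES256"}


def _decode_segment(segment):
    """Decode one base64url JWT segment into a dict, or raise ValueError."""
    padded = segment + "=" * (-len(segment) % 4)
    decoded = json.loads(base64.urlsafe_b64decode(padded))
    if not isinstance(decoded, dict):
        raise ValueError("segment is not a JSON object")
    return decoded


def jwt_error_code(token, expected_aud, now):
    """Return the first matching custom error code, or None if all checks pass.

    `now` is the current time in seconds since the epoch. The signature is
    NOT verified here; a real gateway would do that as well (code 452).
    """
    if not token:
        return 434  # JWT header missing
    parts = token.split(".")
    if len(parts) != 3:
        return 435  # Invalid JWT format
    try:
        header = _decode_segment(parts[0])
        claims = _decode_segment(parts[1])
    except ValueError:
        return 435  # Invalid JWT format
    alg = header.get("alg")
    if not isinstance(alg, str) or alg not in ALLOWED_ALGS:
        # The slides reserve a separate code for "none"; not modelled here.
        return 438
    if header.get("typ") != "JWT":
        return 439
    if not claims.get("iss"):
        return 436
    if not isinstance(claims.get("iat"), (int, float)) or claims["iat"] > now:
        return 441
    if claims.get("aud") != expected_aud:
        return 442
    if not claims.get("jti"):
        return 443
    if not isinstance(claims.get("exp"), (int, float)) or claims["exp"] <= now:
        return 447
    return None


def build_unsigned_token(header, claims):
    """Test helper: assemble a token with a dummy signature segment."""
    def seg(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{seg(header)}.{seg(claims)}.c2ln"
```

Because each check returns its own code, a dashboard that groups rejected calls by code tells an API consumer which claim to fix without anyone reading the gateway logs.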

Error Codes of JWT Authentication Dashboarding Dashboarding: for authentication of APIs

Outcomes of JWT Authentication Dashboarding (Dashboarding: for authentication of APIs)
- This empowered the users to diagnose and rectify errors using a self-serve model, and our DevOps teams to identify issues without needing to read detailed logs.
- The DevOps team was also able to diagnose whether there were potential security threats to the API gateway by looking out for error codes which were reserved for security errors.
SRE Workbook Pg 101: Engineer Toil Out of the System. The optimal strategy for handling toil is to eliminate it at the source… working with product development teams to develop operationally friendly software that is not only less toilsome, but also more scalable, secure, and resilient.
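The self-serve and security outcomes above both rest on the same aggregation: counting rejected calls by error code, and singling out the codes reserved for security errors. A minimal Python sketch follows. The event shape and the function name are hypothetical, and the slides do not give the numeric values of the security-reserved codes, so the set is passed in by the caller; code 452 in the example is used purely for illustration.

```python
from collections import Counter


def summarise_errors(events, security_codes):
    """Count rejected calls per error code and list API keys that hit security codes.

    events: iterable of (api_key, error_code) pairs (hypothetical log shape).
    security_codes: set of codes treated as security-relevant (caller-supplied).
    """
    events = list(events)  # allow generators, since the data is scanned twice
    per_code = Counter(code for _, code in events)
    suspicious_keys = sorted({key for key, code in events if code in security_codes})
    return per_code, suspicious_keys


# Example: two expired tokens, one bad signature, one missing 'jti' claim.
events = [("key-a", 447), ("key-a", 447), ("key-b", 452), ("key-c", 443)]
per_code, suspicious_keys = summarise_errors(events, security_codes={452})
```

Here `per_code` feeds the per-code panel that consumers use to fix their own requests, while `suspicious_keys` is the short list an operations team would look at first.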

The floor is open for Q&A Conclusion

Thank You!
Eugene Wong / Senior DevOps Engineer
Ryan Ashneil / Software Engineer
18 April 2024
API Monitoring X SRE