Time-State Analytics: MinneAnalytics 2024 Talk

EvanChan2 302 views 36 slides Jul 10, 2024

About This Presentation

Slides from my talk at MinneAnalytics 2024 - June 7, 2024

https://datatech2024.sched.com/event/1eO0m/time-state-analytics-a-new-paradigm

Across many domains, we see a growing need for complex analytics to track precise metrics at Internet scale to detect issues, identify mitigations, and analyze p...


Slide Content

Time-State Analytics
Evan Chan, Principal Engineer

© 2024 Conviva. All Rights Reserved.
In every app there are critical user flows that strongly affect growth, engagement, and retention in real time
•Can't access high-value content
•Can't catch a ride
•Can't subscribe to your service
•Can't log in to the app

Monitoring Infrastructure & Services is not Enough to
Ensure a Great User Experience
[Diagram labels: App, Infrastructure & Services]
•Sampled view of application performance (RUM) with no user quality of experience visibility, plus custom solutions with high cost and long time to insights
•Noisy alerts from server-side telemetry (APM, Infra, etc.) with no definitive understanding of user impact
Engineering, Operations, Product:
•MTTR for QoE issues is hours to days
•Blind to 80% of QoE issues
•Focus and investments are misaligned with QoE
Impact of poor user QoE is 7x higher than the impact of outages

Conviva Combines Server and Client Side Analytics to
Measure End User QoE
User Data: Measure product activity, 10s of events per minute, per endpoint
Machine Data: Measure machine activity, 1,000s of events per minute, per endpoint
Infrastructure Scale (thousands of endpoints): Network, server & cloud monitoring focused on real-time mission-critical operational use cases
Internet Scale (millions of endpoints): Marketing & Product analytics on non-real-time BI use cases
Complete user experience & app performance based on full-census client-side telemetry
•7 Billion sensors deployed
•5 Trillion events processed every day

Stateful Context-Sensitive Metrics
are Critical for User Experience
OTT Video
•Connection Induced Rebuffering: How much time did this session spend buffering while using CDN C1? -> Change CDN
•Rebuffering Related Exits: Which sessions quit immediately after a connection induced rebuffer? -> Churn
App/Web
•Login/Discovery: How long does login take and how often did users go to the next page? -> Simplify Login
•Signup: How long does the signup process take, and what is the success rate? -> Fix Signup Page

The world is a state machine!
•Need real-time stateful computation at Internet scale to compute QoE metrics from client-side event streams
•Need real-time stateful computation to connect the dots across multiple telemetry sources
•Need real-time high cardinality to perform root-cause analysis across millions of dimension combinations
Computing QoE at Scale: A Big Data/Streaming Challenge
Requirements: Efficient, Internet Scale, Real Time, Stateful Computation, High Cardinality

Agenda
1.Quick Intro to Conviva
2.Primer on Stateful Analytics
3.Limitations of Status Quo
4.How we tackled the problem

Conviva Stateful Metric: Connection-Induced Rebuffering
[Figure: raw measurements from a video session on a timeline t1 to t15, with tracks for Player State (Buffer, Play, Pause), Bitrate (B1, B2, B3), CDN (C1, C2), and Seek; the interval of connection-induced rebuffering with C1 is marked]
How much time did this session spend in a connection-induced rebuffering state while using CDN C1?
Count the duration where:
1.Currently buffering &
2.Play has already initialized &
3.Hasn’t seeked in last 5 seconds &
4.Using CDN C1
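The four conditions above can be sketched as a sweep over the session's events. This is only an illustration, not Conviva's implementation: the slide gives symbolic times t1 to t15, so the numeric times below (t_k = 2 * k seconds), the session end time, and the reading of "Play has already initialized" as "the player has entered Play at least once" are all assumptions.

```python
SEEK_WINDOW = 5  # seconds, from condition 3

# (time_in_seconds, field, value). The times are invented: t_k = 2 * k.
EVENTS = [
    (2, "state", "Buffer"), (4, "cdn", "C1"), (6, "bitrate", "B1"),
    (8, "state", "Play"), (10, "state", "Buffer"), (12, "state", "Play"),
    (14, "seek", True), (16, "bitrate", "B2"), (18, "state", "Paused"),
    (20, "state", "Play"), (22, "bitrate", "B3"), (24, "state", "Buffer"),
    (26, "cdn", "C2"), (28, "state", "Play"), (30, "bitrate", "B1"),
]


def connection_induced_rebuffering(events, cdn, end_time):
    """Seconds spent buffering on `cdn`, after first play, with no recent seek."""
    # Breakpoints: every event time, the moment each seek window expires,
    # and the (assumed) end of the session.
    seek_expiries = {t + SEEK_WINDOW for t, field, _ in events if field == "seek"}
    candidates = {t for t, _, _ in events} | seek_expiries | {end_time}
    points = sorted(p for p in candidates if p <= end_time)

    changes = {}
    for t, field, value in events:
        changes.setdefault(t, []).append((field, value))

    state = current_cdn = last_seek = None
    has_played = False
    total = 0
    for start, stop in zip(points, points[1:]):
        # Apply the events that happen at the start of this segment.
        for field, value in changes.get(start, []):
            if field == "state":
                state = value
                has_played = has_played or value == "Play"
            elif field == "cdn":
                current_cdn = value
            elif field == "seek":
                last_seek = start
        recent_seek = last_seek is not None and start < last_seek + SEEK_WINDOW
        if state == "Buffer" and has_played and not recent_seek and current_cdn == cdn:
            total += stop - start
    return total


RESULT = connection_induced_rebuffering(EVENTS, cdn="C1", end_time=30)
```

With these invented times the two qualifying intervals are t5 to t6 and t12 to t13, so RESULT is 4 seconds. Even this small sketch needs hand-written state tracking and an extra breakpoint for the seek window, which is the kind of effort the later slides argue against.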

An easier example: Events from a Fitness Tracker
Time  Stress change  Activity change  Sleep stats  VO2  Heart rate (bpm)
6:00  5              Rest             Deep         -    58
7:00                                  -            -    52
8:00                                  Light        -    61
9:30  6              Work             -            -    94
1:30  8                               -            -    122
5:00  7                               -            -    105
6:40  9              Run: 4 mi/hr     -            44   143
6:50                 Run: 5 mi/hr     -            44   163

Simple Stateless Analytics: “Count Events”
Stateless -> Agnostic to Sequence, Timing, System State
[Figure: two rows of events on a shared time axis. Activity Events: Rest (6:00AM), Work (9:30AM), Run (6:40PM). Stress Events at 6:00AM, 9:30AM, 1:30PM, 5:00PM, and 6:40PM.]

Stateful Analysis Critical for Better Outcomes!
[Figure: Activity (Rest, Work, Run) and Stress Levels (5 to 9) drawn as timelines, with time marks at 6:00AM, 9:30AM, and 6:40PM]
•Is my avg stress high when “Working”? -> Have fewer meetings
•How long was I in high stress? -> Take mini breaks
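A small sketch of the difference between the stateless and stateful views, using the stress and activity changes from the fitness tracker table (times converted to 24-hour form using the AM/PM labels on the slides). The end of the observation window (18:50, the last tracker row) and all function names are assumptions made for illustration.

```python
def minutes(hhmm):
    """Convert "HH:MM" to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


# State changes taken from the fitness tracker table.
STRESS = [("06:00", 5), ("09:30", 6), ("13:30", 8), ("17:00", 7), ("18:40", 9)]
ACTIVITY = [("06:00", "Rest"), ("09:30", "Work"), ("18:40", "Run")]
END = "18:50"  # assumed end of observation (time of the last tracker row)


def value_at(timeline, t):
    """Step-function lookup: the latest value set at or before minute t."""
    current = None
    for start, value in timeline:
        if minutes(start) <= t:
            current = value
    return current


def segments():
    """Intervals between consecutive change points; state is constant inside each."""
    points = sorted({minutes(s) for s, _ in STRESS + ACTIVITY} | {minutes(END)})
    return list(zip(points, points[1:]))


def stateless_mean_stress():
    """Stateless view: average the stress events, ignoring timing and activity."""
    return sum(value for _, value in STRESS) / len(STRESS)


def avg_stress_during(activity):
    """Stateful view: time-weighted average stress while in `activity`."""
    weighted = total = 0
    for start, stop in segments():
        if value_at(ACTIVITY, start) == activity:
            weighted += value_at(STRESS, start) * (stop - start)
            total += stop - start
    return weighted / total


def minutes_in_high_stress(threshold=6):
    """Stateful view: total minutes with stress at or above the threshold."""
    return sum(stop - start for start, stop in segments()
               if value_at(STRESS, start) >= threshold)
```

The stateless mean treats every stress event equally, while the two stateful functions depend on how long each state lasted and on which activity was in effect at the time.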

Stateful Analytics Appear in Every Domain
•Video: How much time did this session spend buffering while using CDN C1? -> Change CDN
•Finance: Which credit card users made purchases at geographically-separate locations in the last 5 minutes? -> Fraud alert, block
•Cybersecurity: Which Android users sent a sequence of anomalous DNS requests after visiting website xyz.com in the last hour? -> Block URL, Quarantine host
•Manufacturing: How many machines from vendor X are showing degrading health status over time? -> Rebalance load, Repair

Stateful Real-Time Analytics is Critical for Business Operational Outcomes (e.g., Engagement, Fraud/Attack Detection, etc.)
Event Streams / Telemetry (Video, App, Infra O11Y, E-commerce, Health/Fitness, Security, Fintech, ….) -> Stateful Analytics
Actionable insights in dynamic systems need Stateful Analysis:
⬢Sequence
⬢Timing
⬢State of system

Agenda
1.Quick Intro to Conviva
2.Primer on Stateful Analytics
3.Limitations of Status Quo
4.How we tackled the problem

Stateful Analytics is Hard: Fitness
Count duration of high stress (>=6) when working
•High dev effort
•High cost

Stateful Analytics is Hard: Video
Count the duration where:
1.Currently buffering &
2.Play has already initialized &
3.Hasn’t seeked in last 5 seconds &
4.Using CDN C1
•High dev effort
•High cost

Why?
Most data tech stems from 1970s abstraction: Relational Model & Algebra
•Data is modeled as relations (tables)
•Algebraic relational operators to compute relations

Tabular Model Isn’t Well-Suited for Stateful
Timestamp  Player State  Bitrate  CDN  Seek
t1         Buffer
t2                                C1
t3                       B1
t4         Play
t5         Buffer
t6         Play
t7                                     Seek
t8                       B2
t9         Paused
t10        Play
t11                      B3
t12        Buffer
t13                               C2
t14        Play
t15                      B1
[Figure: the same raw measurements from a video session drawn as a timeline, t1 to t15, with tracks for Player State, Bitrate, CDN, and Seek]
Count the duration where:
1.Currently buffering &
2.Play has already initialized &
3.Hasn’t seeked in last 5 seconds &
4.Using CDN C1

State and Context Over Continuous Time is Hard
The sparse event table (t1 to t15, one change per row) with each state column carried forward to every row:
Timestamp  Player State  Bitrate  CDN  Seek
t1         Buffer
t2         Buffer                 C1
t3         Buffer        B1       C1
t4         Play          B1       C1
t5         Buffer        B1       C1
t6         Play          B1       C1
t7         Play          B1       C1   Seek
t8         Play          B2       C1
t9         Paused        B2       C1
t10        Play          B2       C1
t11        Play          B3       C1
t12        Buffer        B3       C1
t13        Buffer        B3       C2
t14        Play          B3       C2
t15        Play          B1       C2
Open questions on the filled table: Duration? t7 + 5 seconds?
Count the duration where:
1.Currently buffering &
2.Play has already initialized &
3.Hasn’t seeked in last 5 seconds &
4.Using CDN C1
[…]
[…]
[…]
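The step from the sparse event table to the filled table is a forward fill of the state columns. A minimal sketch follows, with column and function names chosen for illustration. Player State, Bitrate, and CDN are carried forward, while Seek is treated as a point event that applies only to its own row, matching the filled table on the slide.

```python
# Sparse rows from the slide: each row records only what changed at that time.
SPARSE = [
    ("t1", {"state": "Buffer"}), ("t2", {"cdn": "C1"}), ("t3", {"bitrate": "B1"}),
    ("t4", {"state": "Play"}), ("t5", {"state": "Buffer"}), ("t6", {"state": "Play"}),
    ("t7", {"seek": True}), ("t8", {"bitrate": "B2"}), ("t9", {"state": "Paused"}),
    ("t10", {"state": "Play"}), ("t11", {"bitrate": "B3"}), ("t12", {"state": "Buffer"}),
    ("t13", {"cdn": "C2"}), ("t14", {"state": "Play"}), ("t15", {"bitrate": "B1"}),
]


def forward_fill(rows, state_columns=("state", "bitrate", "cdn")):
    """Carry the last known value of each state column into every later row."""
    current = {column: None for column in state_columns}
    filled = []
    for timestamp, changes in rows:
        for column in state_columns:
            if column in changes:
                current[column] = changes[column]
        # Seek is an instantaneous event, so it is not carried forward.
        filled.append({"ts": timestamp, **current, "seek": changes.get("seek", False)})
    return filled


FILLED = forward_fill(SPARSE)
```

The filled rows still exist only at event times. Nothing in the table marks how long each state lasted or when the 5-second window after the seek at t7 ends, which is why the slide asks "Duration?" and "t7 + 5 seconds?".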

Agenda
1.Quick Intro to Conviva
2.Primer on Stateful Analytics
3.Limitations of Status Quo
4.How we tackled the problem

Let’s go back to first principles
Time  Stress change  Activity change  Sleep stats  VO2  Heart rate (bpm)
6:00  5              Rest             Deep         -    58
7:00                                  -            -    52
8:00                                  Light        -    61
9:30  6              Work             -            -    94
1:30  8                               -            -    122
5:00  7                               -            -    105
6:40  9              Run: 4 mi/hr     -            44   143
6:50                 Run: 5 mi/hr     -            44   163

A visual Timeline interpretation of the data
“Geometric abstractions are powerful tools” – Fred Brooks
[Figure: Stress level (y-axis, 3 to 7) over time (x-axis, 6:00 to 10:30 in 30-minute steps)]

Geometric view makes it easy to do stateful analytics!
[Figure: Stress level (y-axis, 3 to 7) over time (x-axis, 6:00 to 10:30 in 30-minute steps)]
How long was I in High Stress (>= 6) state?
Total Duration = 3 hours

Intuitive “Geometric” Logic
How long was I in High Stress (>= 6) state?
1.Interpret the Stress column as “states” over time
2.Check when Stress State >= 6
3.Calculate Duration when condition was True

Time-State Analytics, in a Nutshell
•Data abstraction with 3 types of timeline dynamics: Event, Step Function, Continuously Evolving
•Library of operators
•Compositional language for defining DAG of operators
•Connectors with external data sources/sinks
Raw events -> Metric

Example Operator: Greater than equal to
[Figure: input timeline of Stress level (3 to 7) over 6:00 to 10:30; output timeline of True/False over the same time axis]

Example Operator: DurationTrue
[Figure: input timeline of True/False over 6:00 to 10:30; output timeline over the same time axis with levels labeled 1 hour, 1.5 hours, and 2 hours]

Stateful Metrics == Operator Composition
How long was I in High Stress (>= 6) state?
1.Interpret the Stress column as “states” over time: GetState(“Stress Level”)
2.Check when Stress State >= 6: Greater than equal to (“6”)
3.Calculate Duration when condition was True: DurationTrue
Event Data -> GetState(“Stress Level”) -> Greater than equal to (“6”) -> DurationTrue
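The three-operator composition can be sketched in a few lines. The talk does not show Conviva's operator signatures or its timeline representation, so everything here is an assumption: a step-function timeline is a sorted list of (start_minute, value) pairs, the function names mirror the operator names on the slide, and the stress values are made up rather than read from the slide's chart.

```python
def get_state(events, field):
    """GetState: project raw events onto one field, keeping only value changes."""
    timeline = []
    for t, fields in events:
        if field in fields and (not timeline or timeline[-1][1] != fields[field]):
            timeline.append((t, fields[field]))
    return timeline


def greater_than_or_equal(timeline, threshold):
    """Map a numeric step timeline to a boolean step timeline."""
    result = []
    for t, value in timeline:
        flag = value >= threshold
        if not result or result[-1][1] != flag:
            result.append((t, flag))
    return result


def duration_true(timeline, end):
    """DurationTrue: total time the boolean timeline is True before `end`."""
    total = 0
    boundaries = timeline[1:] + [(end, None)]
    for (start, flag), (stop, _) in zip(timeline, boundaries):
        if flag:
            total += stop - start
    return total


# Hypothetical raw events, in minutes since midnight (6:00 = 360).
EVENT_DATA = [
    (360, {"Stress Level": 4}),
    (420, {"Stress Level": 6}),
    (480, {"Stress Level": 5}),
    (510, {"Stress Level": 7}),
]

stress = get_state(EVENT_DATA, "Stress Level")
high_stress = greater_than_or_equal(stress, 6)
HIGH_STRESS_MINUTES = duration_true(high_stress, end=630)  # window ends at 10:30
```

Each operator takes a timeline and returns a timeline (or a value), so the metric is just the composition of the three calls. For the made-up data, stress is at or above 6 from 7:00 to 8:00 and from 8:30 to 10:30, giving 180 minutes.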

How Long Until Stress from Work?
Duration between start of Work and High Stress (>= 6) state
Operator DAG:
•Event Data -> GetState(“Stress”) -> Greater than equal to (“6”) -> State Change Events
•Event Data -> GetState(“Activities”) -> Equals(“Working”) -> State Change Events
•Both State Change Events streams -> Merge Events -> TwoEventPattern -> Duration (Work -> Stress)

Zoom in: Operators Transform Timelines
GetState(“Stress”) -> Greater than equal to (“6”) -> State Change Events
[Figure: three timelines over 6:00 to 10:30: Stress level (3 to 7), the True/False result of the comparison, and the resulting Stress Event]

Zoom in: Merging Timelines and Pattern Recognition
State Change Events (Start Work) + State Change Events (Stress Event) -> Merge Events -> TwoEventPattern -> Duration (Work -> Stress)
[Figure: Start Work and Stress Event timelines over 6:00 to 10:30, merged into a single event timeline]
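A rough sketch of the Work -> Stress pattern, under the same kind of assumptions as any illustration of this DAG: timelines are lists of (minute, value) pairs, the function names only echo the operator names on the slides, and the activity and stress values are invented. It is not Conviva's operator library.

```python
def equals(timeline, target):
    """Equals: boolean timeline that is True while the state matches `target`."""
    return [(t, value == target) for t, value in timeline]


def greater_than_or_equal(timeline, threshold):
    """Boolean timeline that is True while the value is at or above `threshold`."""
    return [(t, value >= threshold) for t, value in timeline]


def state_change_events(bool_timeline, label):
    """Emit a (time, label) event at each False -> True transition."""
    events, previous = [], False
    for t, flag in bool_timeline:
        if flag and not previous:
            events.append((t, label))
        previous = flag
    return events


def merge_events(*streams):
    """Merge event streams by time; ties keep the order the streams were passed in."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event[0])


def two_event_pattern_duration(events, first, second):
    """For each `first` event, the time until the next `second` event."""
    durations, pending = [], None
    for t, label in events:
        if label == first and pending is None:
            pending = t
        elif label == second and pending is not None:
            durations.append(t - pending)
            pending = None
    return durations


# Hypothetical state timelines, in minutes since midnight (8:00 = 480).
ACTIVITIES = [(360, "Rest"), (480, "Working")]
STRESS = [(360, 4), (510, 5), (570, 6)]

work_starts = state_change_events(equals(ACTIVITIES, "Working"), "work_start")
stress_starts = state_change_events(greater_than_or_equal(STRESS, 6), "high_stress")
DURATIONS = two_event_pattern_duration(
    merge_events(work_starts, stress_starts), "work_start", "high_stress"
)
```

In this made-up session work starts at 8:00 and stress reaches 6 at 9:30, so DURATIONS is [90] minutes. The pattern matcher only sees a merged event stream, so the same function would work for any pair of state-change events.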

Abstraction Reduces Dev Effort to Support New Metrics
•Onboarding: Weeks -> Days
•Semantic Bugs: Dropped by 80%
•Prototype query language

Abstraction Enables High Performance: Outperforms State-of-Art Systems
[Bar chart: Normalized Cost (using CPU-seconds as proxy) for Conviva Time-State, Spark Streaming, Spark Streaming + Clickhouse, Flink Streaming + Clickhouse, and Flink Streaming. Bar values shown: 9.24, 9.25, 9.82, 12.01, and 1.00.]

Rich Operator Set for Rich Analytics
•Duration and Time Management
•State Management
•Pattern Matching
•AI/ML/Feature Eng
•+ - * / And Or Not
•If, Flow Control
•ETL, Data Transformation
•… more coming

Takeaways on TimeState Analytics
Need for Real-Time Stateful Analytics
•Fundamentally hard problem!
•Existing systems are not effective: high cost, low performance; high dev effort, bugs
Why: Classical tabular model is ineffective
Conviva TimeState: A geometric basis for Stateful Analytics
•10X better cost/performance
•10X reduced effort
•Democratizes Stateful Analytics @ Scale
This is a general, industry-wide problem
•Would love to hear about your Stateful Analytics and Experience use cases

To reach me:
[email protected]
•Twitter @evanfchan
•LinkedIn / Instagram (@platypus.arts)
Please come by the Conviva booth and talk to us!
We are hiring!