Mathematical Approaches and Algorithms for Data Stream Analysis by Arthur Tabatchnic

DevClub_lv, 31 slides, Sep 03, 2024

About This Presentation

Data streams can be challenging to analyze on the fly. Hardware constraints, large volumes of data, and rapid changes in patterns can all lead to difficult problems. In this talk, I will provide an overview of the different types of analyses that can be performed on data streams and the theory behin...


Slide Content

Mathematical Approaches and Algorithms for Data Stream Analysis

Topics: Classification; Change diagnosis; Distributed data mining

1. Classification

Problems: High speed; Unbounded requirements; Drift; Accuracy vs Efficiency; Distributed processing

Data-based techniques: Sampling; Load shedding; Aggregation; Sketching
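The slide lists sampling without naming a specific method. As an illustration only, here is a minimal sketch of reservoir sampling (Algorithm R), one common way to keep a fixed-size uniform sample from a stream of unknown length. The function name `reservoir_sample` and its parameters are assumptions made for this example, not something taken from the talk.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (zero-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: memory stays at k items no matter how long the stream is.
sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

Memory use depends only on `k`, which is why this kind of sampling suits streams whose length is unbounded.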

Classification Algorithms

Very Fast Decision Trees
Pros: Very Fast; Memory usage*
Cons: Quite drifty
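Very Fast Decision Trees decide when to split a leaf by using the Hoeffding bound: after n examples, the observed mean of a variable with range R is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below shows only that split test, assuming information gain as the split criterion (so R = log2 of the number of classes). The function names are hypothetical, and the tie-breaking rule used by full implementations is left out.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the observed
    mean with probability 1 - delta, after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)  # range of information gain
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Usage: the same gain difference becomes significant once enough examples arrive.
print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=100))
print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=1000))
```

Because epsilon shrinks as n grows, a leaf only needs to see as many examples as it takes to separate the two best attributes, which is where the speed comes from.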

CluStream/On Demand Classification
Pros: All of them?; Distribution-friendly
Cons: None of them?

CluStream/On Demand Classification
Continued work: DenStream; ClusCTA
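CluStream summarizes the stream with micro-clusters, each holding additive statistics (a count plus linear and squared sums of the points and of their timestamps). The sketch below illustrates only that data structure. The class name `MicroCluster` and its method names are assumptions for this example, and the logic for creating, deleting, and snapshotting micro-clusters is omitted.

```python
import math

class MicroCluster:
    """Additive summary of the points absorbed so far (a cluster feature vector)."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim
        self.time_sum = 0.0
        self.time_squared_sum = 0.0

    def insert(self, point, timestamp):
        """Absorb one point; cost is O(dim) and the raw point is not stored."""
        self.n += 1
        for d, value in enumerate(point):
            self.linear_sum[d] += value
            self.squared_sum[d] += value * value
        self.time_sum += timestamp
        self.time_squared_sum += timestamp * timestamp

    def merge(self, other):
        """Merging two summaries is component-wise addition."""
        self.n += other.n
        for d in range(len(self.linear_sum)):
            self.linear_sum[d] += other.linear_sum[d]
            self.squared_sum[d] += other.squared_sum[d]
        self.time_sum += other.time_sum
        self.time_squared_sum += other.time_squared_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def rms_radius(self):
        """Root-mean-square deviation of the absorbed points from the centroid."""
        variance = sum(
            max(0.0, sq / self.n - (ls / self.n) ** 2)
            for ls, sq in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(variance)

# Usage: build two summaries independently, then combine them.
a = MicroCluster(dim=2)
a.insert([0.0, 0.0], timestamp=1)
a.insert([2.0, 0.0], timestamp=2)
b = MicroCluster(dim=2)
b.insert([1.0, 3.0], timestamp=3)
a.merge(b)
print(a.n, a.centroid())
```

Since merging is plain addition, summaries built at different sites or time windows can be combined without revisiting the raw data, which fits the "distribution-friendly" point on the slide.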

ANNCAD (Adaptive Nearest Neighbor Classification Algorithm for Data Streams)
Pros: Good with slow drift
Cons: Bad for real-time; Hungry

Ensemble Based Classification (Wang)
Pros: Good with drift
Cons: Bad for real-time; Hungry
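The slide does not say how the ensemble members are weighted. The sketch below assumes the accuracy-weighted scheme usually associated with Wang et al.: each classifier is scored on the most recent chunk by its mean squared error, and its weight is the error of a random guesser minus its own error. All function names are hypothetical, and clamping negative weights to zero is a simplification made here.

```python
def mse_of_classifier(true_class_probs):
    """Mean squared error, given the probability the classifier assigned
    to the correct class for each example in the latest chunk."""
    return sum((1.0 - p) ** 2 for p in true_class_probs) / len(true_class_probs)

def mse_of_random_guess(class_priors):
    """Error of a classifier that predicts according to the class priors."""
    return sum(p * (1.0 - p) ** 2 for p in class_priors)

def ensemble_weights(per_classifier_probs, class_priors):
    """Weight = MSE_random - MSE_classifier, clamped at zero (assumption)."""
    baseline = mse_of_random_guess(class_priors)
    return [max(0.0, baseline - mse_of_classifier(p)) for p in per_classifier_probs]

# Usage: an old classifier that no longer fits the latest chunk gets weight 0.
weights = ensemble_weights(
    per_classifier_probs=[[0.9, 0.8], [0.4, 0.5]],
    class_priors=[0.5, 0.5],
)
print(weights)
```

Re-scoring every member on each new chunk is what lets the ensemble follow drift, and it is also the source of the cost noted under the cons.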

2. Change diagnosis

Velocity density

Goals
Find significant changes: Dissolution; Coagulation; Shift
Keep the model up-to-date
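As a rough illustration of the idea behind velocity density, the sketch below estimates the density at a location in two consecutive time windows and divides the difference by the window length. This is a deliberately simplified, one-dimensional stand-in rather than the estimator from the talk: the Gaussian kernel, the fixed bandwidth, and the function names are all assumptions. A negative value at a location suggests dissolution there, and a positive value suggests coagulation.

```python
import math

def gaussian_kde(points, x, bandwidth):
    """Kernel density estimate at x from one-dimensional points."""
    if not points:
        return 0.0
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

def velocity_density(older, newer, x, window_length, bandwidth=1.0):
    """Rate of change of the estimated density at x between two windows."""
    change = gaussian_kde(newer, x, bandwidth) - gaussian_kde(older, x, bandwidth)
    return change / window_length

# Usage: the data mass moves from around 0 to around 5 (a shift).
older = [-0.1, 0.0, 0.1]
newer = [4.9, 5.0, 5.1]
print(velocity_density(older, newer, x=0.0, window_length=10.0))  # negative
print(velocity_density(older, newer, x=5.0, window_length=10.0))  # positive
```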

Spatial velocity

3. Distributed data mining

Problems: Many independent sites of observation; Data stream per site; Locally non-obvious variable relationships to other sites; Performance

Bayesian Network Learning

Steps
1. Learn local BNs
2. Identify key observations
3. Combine observations and learn a non-local BN
4. Communicate updated probabilities back to local sites
5. Obtain a collective BN
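The slide does not define "key observations". One plausible reading, assumed here, is that a site forwards only the observations its local model explains poorly, since those are the ones most likely to depend on variables held at other sites. In the sketch below, the local model is a hypothetical stand-in (independent per-variable marginals, that is, a BN with no edges), and the function names and threshold are likewise assumptions.

```python
import math

def log_likelihood(observation, marginals):
    """Log-likelihood under a stand-in local model with independent variables."""
    return sum(math.log(marginals[var][value]) for var, value in observation.items())

def key_observations(observations, marginals, threshold):
    """Select observations the local model finds unlikely (below the threshold)."""
    return [o for o in observations if log_likelihood(o, marginals) < threshold]

# Usage: only the poorly explained observation would be sent to the central site.
marginals = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.8, 1: 0.2}}
observations = [{"a": 0, "b": 0}, {"a": 1, "b": 1}]
print(key_observations(observations, marginals, threshold=-2.0))
```

Forwarding only a subset keeps communication low, which addresses the performance problem listed earlier.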

BNs
Pros: Missing data; Transparent; Cause & effect; Causal intervention; Resource-friendly; Pretty easy
Cons: Assumptions; Assumptions; Assumptions

Q&A