A Short Course in Data Stream Mining


About This Presentation

Slides for a short course in data stream mining, covering classification, ensemble and adaptive methods, clustering, and frequent pattern mining.


Slide Content

A Short Course in Data Stream Mining
Albert Bifet
Huawei Noah's Ark Lab
Shenzhen, 23 January 2015

Real Time Analytics

Introduction: Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed it is discarded or archived
Example
Puzzle: Finding Missing Numbers
Let p be a permutation of {1, ..., n}.
Let p−1 be p with one element missing.
p−1[i] arrives in increasing order
Task: Determine the missing number
Naive solution: use an n-bit vector to memorize all the numbers (O(n) space)
Data stream solution: O(log(n)) space. Store
n(n+1)/2 − Σ_{j≤i} p−1[j]
(A small sketch of this solution follows below.)
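The running-sum trick fits in a few lines; a minimal sketch using nothing beyond the formula on the slide:

```python
# Minimal sketch: find the missing number from a stream containing every
# value of {1, ..., n} except one, keeping only a running sum
# (O(log n) bits of state) instead of an n-bit presence vector.
def find_missing(stream, n):
    remainder = n * (n + 1) // 2      # sum of 1..n
    for x in stream:                  # each element is inspected once, then discarded
        remainder -= x
    return remainder

print(find_missing([1, 2, 4, 5], 5))  # -> 3
```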

Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed it is discarded or archived
Tools:
approximation
randomization, sampling
sketching

Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed it is discarded or archived
Approximation algorithms
Small error rate with high probability
An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.

Data Streams Approximation Algorithms
[Animation: a bit string 1011000111... viewed through a window that slides one position per arriving element]
Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
N is the length of the sliding window
ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002

Classification

Classification
Definition
Given n_C different classes, a classifier algorithm builds a model that predicts for every unlabelled instance I the class C to which it belongs with accuracy.
Example
A spam filter
Example
Twitter sentiment analysis: analyze tweets with positive or negative feelings

Data stream classification cycle
1. Process an example at a time, and inspect it only once (at most)
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point

Classification
Data set that describes e-mail features for deciding if it is spam.
Example
Contains "Money" | Domain type | Has attach. | Time received | spam
yes | com | yes | night | yes
yes | edu | no  | night | yes
no  | com | yes | night | yes
no  | edu | no  | day   | no
no  | com | no  | day   | no
yes | cat | no  | day   | yes
Assume we have to classify the following new instance:
Contains "Money" | Domain type | Has attach. | Time received | spam
yes | edu | yes | day | ?

Bayes Classifiers
Naïve Bayes
Based on Bayes' Theorem:
P(c|d) = P(c) P(d|c) / P(d)
posterior = (prior × likelihood) / evidence
Estimates the probability of observing attribute a and the prior probability P(c)
Probability of class c given an instance d:
P(c|d) = P(c) Π_{a∈d} P(a|c) / P(d)
(A small streaming sketch of this follows below.)
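A minimal streaming naive Bayes sketch for categorical attributes, matching the counting view above. The learn_one/predict_one method names and the add-one smoothing are this sketch's own choices, not something prescribed by the slides.

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Counts are updated one example at a time, so the model is ready
    to predict at any point in the stream."""
    def __init__(self):
        self.class_counts = defaultdict(int)    # n(c)
        self.attr_counts = defaultdict(int)     # n(attribute index, value, class)
        self.n = 0

    def learn_one(self, x, y):                  # x: list of attribute values, y: class
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # log P(c) + sum_i log P(a_i | c), with simplified add-one smoothing
            score = math.log(nc / self.n)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(i, v, c)] + 1) / (nc + 1))
            if score > best_score:
                best, best_score = c, score
        return best

# The e-mail example from the earlier slide:
rows = [("yes com yes night", "yes"), ("yes edu no night", "yes"),
        ("no com yes night", "yes"), ("no edu no day", "no"),
        ("no com no day", "no"), ("yes cat no day", "yes")]
nb = StreamingNaiveBayes()
for feats, label in rows:
    nb.learn_one(feats.split(), label)
print(nb.predict_one("yes edu yes day".split()))  # "yes" with these counts and this smoothing
```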

Bayes Classifiers
Multinomial Naïve Bayes
Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior probability P(c)
Probability of class c given a test document d:
P(c|d) = P(c) Π_{w∈d} P(w|c)^{n_wd} / P(d)

Perceptron
[Diagram: inputs Attribute 1–5 with weights w1–w5 feeding the output h_w(x_i)]
Data stream: ⟨x_i, y_i⟩
Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = ½ Σ (y_i − h_w(x_i))²

Perceptron
[Diagram: inputs Attribute 1–5 with weights w1–w5 feeding the output h_w(x_i)]
We use the sigmoid function h_w = σ(w^T x) where
σ(x) = 1 / (1 + e^{−x})
σ′(x) = σ(x)(1 − σ(x))

Perceptron
Minimize mean-square error: J(w) = ½ Σ (y_i − h_w(x_i))²
Stochastic gradient descent: w = w − η ∇J x_i
Gradient of the error function:
∇J = −Σ_i (y_i − h_w(x_i)) ∇h_w(x_i)
∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))
Weight update rule
w = w + η Σ_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i

Perceptron
PERCEPTRON LEARNING(Stream, η)
1 for each class
2    do PERCEPTRON LEARNING(Stream, class, η)
PERCEPTRON LEARNING(Stream, class, η)
1 Let w0 and w be randomly initialized
2 for each example (x, y) in Stream
3    do if class = y
4          then δ = (1 − h_w(x)) h_w(x) (1 − h_w(x))
5          else δ = (0 − h_w(x)) h_w(x) (1 − h_w(x))
6       w = w + η δ x
PERCEPTRON PREDICTION(x)
1 return arg max_class h_{w_class}(x)
(A small Python sketch of this per-class update follows below.)
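A minimal sketch of the per-class sigmoid update in the pseudocode above, processing one example at a time. The learning rate and initialization scale are arbitrary choices of this sketch.

```python
import numpy as np

class OnlinePerceptron:
    """One sigmoid unit per class, trained with the stochastic gradient
    step  w <- w + eta * (y - h(x)) * h(x) * (1 - h(x)) * x  from the slide."""
    def __init__(self, n_classes, n_features, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=(n_classes, n_features))
        self.eta = eta

    def _h(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))        # sigmoid output per class

    def learn_one(self, x, y):                          # y: class index
        h = self._h(x)
        target = np.zeros(len(self.w))
        target[y] = 1.0                                  # 1 for the true class, 0 otherwise
        delta = (target - h) * h * (1.0 - h)
        self.w += self.eta * np.outer(delta, x)

    def predict_one(self, x):
        return int(np.argmax(self._h(x)))
```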

Classification
Assume we have to classify the following new instance:
Contains "Money" | Domain type | Has attach. | Time received | spam
yes | edu | yes | day | ?
[Decision tree: Time = Night → YES; Time = Day → test Contains "Money": Yes → YES, No → NO]

Decision Trees
Basic induction strategy:
A ← the "best" decision attribute for the next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes

Hoeffding Trees
Hoeffding Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
With theoretical guarantees on the error rate
[Decision tree figure: Time = Night → YES; Time = Day → Contains "Money": Yes → YES, No → NO]

Hoeffding Bound Inequality
Bounds the probability that a sum of random variables deviates from its expected value.

Hoeffding Bound Inequality
Let X = Σ_i X_i where X_1, ..., X_n are independent and identically distributed in [0, 1]. Then
1. Chernoff: For each ε < 1,
   Pr[X > (1 + ε) E[X]] ≤ exp(−ε² E[X] / 3)
2. Hoeffding: For each t > 0,
   Pr[X > E[X] + t] ≤ exp(−2t² / n)
3. Bernstein: Let σ² = Σ_i σ_i² be the variance of X. If X_i − E[X_i] ≤ b for each i ∈ [n], then for each t > 0,
   Pr[X > E[X] + t] ≤ exp(−t² / (2σ² + (2/3) b t))

Hoeffding Tree or VFDT
HT(Stream, δ)
1 Let HT be a tree with a single leaf (root)
2 Init counts n_ijk at root
3 for each example (x, y) in Stream
4    do HTGROW((x, y), HT, δ)
HTGROW((x, y), HT, δ)
1 Sort (x, y) to leaf l using HT
2 Update counts n_ijk at leaf l
3 if examples seen so far at l are not all of the same class
4    then Compute G for each attribute
5       if G(Best Attr.) − G(2nd best) > sqrt(R² ln(1/δ) / (2n))
6          then Split leaf on best attribute
7             for each branch
8                do Start new leaf and initialize counts
(A small sketch of the split test on line 5 follows below.)
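The split test on line 5 is just the Hoeffding bound with R the range of the split criterion; a small sketch, where the tie threshold τ (from a later slide) and the default values are illustrative only:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that, with probability 1 - delta, the true mean of n
    observations with range R is within epsilon of the sample mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, delta=1e-7, R=1.0, tau=0.05):
    """Split if the best attribute beats the runner-up by more than epsilon,
    or if the two are so close that the choice no longer matters (tie)."""
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second > eps) or (eps < tau)

# e.g. a leaf with 2000 examples and criterion scores 0.32 vs 0.21:
print(should_split(0.32, 0.21, n=2000))   # True: 0.11 > eps ~= 0.06
```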

Hoeffding Trees
HT features
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
Ties: when two attributes have similar G, split if
G(Best Attr.) − G(2nd best) < sqrt(R² ln(1/δ) / (2n)) < τ
Compute G every n_min instances
Memory: deactivate least promising nodes with lower p_l × e_l
p_l is the probability to reach leaf l
e_l is the error in the node

Hoeffding Naive Bayes Tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer. Stress-testing Hoeffding trees. 2005
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method

Bagging
Example
Dataset of 4 instances: A, B, C, D
Bootstrap samples, drawn with replacement and shown with the number of copies of each instance:
Classifier 1: A, B, B, C — A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D — A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C — A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C — A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D — A(1) B(0) C(2) D(1)
Bagging builds a set of M base models, each trained on a bootstrap sample created by drawing random samples with replacement.
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
[Figure: the binomial distribution of K, which tends to Poisson(1) as the dataset grows]

Oza and Russell's Online Bagging for M models
1: Initialize base models h_m for all m ∈ {1, 2, ..., M}
2: for all training examples do
3:    for m = 1, 2, ..., M do
4:       Set w = Poisson(1)
5:       Update h_m with the current example with weight w
6: anytime output:
7: return hypothesis: h_fin(x) = arg max_{y∈Y} Σ_{t=1}^{T} I(h_t(x) = y)
(A small sketch of this procedure follows below.)
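A direct sketch of the procedure above; the base learner is left abstract (any object with learn_one/predict_one, for example the naive Bayes or perceptron sketches earlier):

```python
import numpy as np
from collections import Counter

class OnlineBagging:
    """Oza & Russell online bagging: each incoming example is given to each
    base model k times, with k ~ Poisson(1) approximating the resampling
    with replacement of batch bagging."""
    def __init__(self, make_model, M=10, seed=0):
        self.models = [make_model() for _ in range(M)]
        self.rng = np.random.default_rng(seed)

    def learn_one(self, x, y):
        for h in self.models:
            k = self.rng.poisson(1.0)          # weight of this example for model h
            for _ in range(k):
                h.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(h.predict_one(x) for h in self.models)
        return votes.most_common(1)[0][0]      # majority vote
```

Leveraging Bagging (later in the course) uses the same loop with Poisson(λ), λ > 1, to resample more aggressively.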

Evolving Stream Classification

Data Mining Algorithms with Concept Drift
No Concept Drift:
[Diagram: input → DM Algorithm (Counter1 ... Counter5) → output]
Concept Drift, first approach:
[Diagram: input → DM Algorithm with a Static Model plus a Change Detector → output]
Concept Drift, second approach:
[Diagram: input → DM Algorithm with Estimator1 ... Estimator5 replacing the counters → output]

Optimal Change Detector and Predictor
High accuracy
Low false positive and false negative rates
Theoretical guarantees
Fast detection of change
Low computational cost: minimum space and time needed
No parameters needed

Algorithm ADaptive Sliding WINdow (ADWIN)
ADWIN: ADAPTIVE WINDOWING ALGORITHM
1 Initialize Window W
2 for each t > 0
3    do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4       repeat Drop elements from the tail of W
5       until |μ̂_W0 − μ̂_W1| < ε_c holds
6             for every split of W into W = W0 · W1
7       Output μ̂_W
Example: W = 101010110111111. The algorithm examines every split W = W0 · W1:
W0 = 1, W1 = 01010110111111
W0 = 10, W1 = 1010110111111
... (one cut point at a time) ...
W0 = 101010110, W1 = 111111: here |μ̂_W0 − μ̂_W1| ≥ ε_c, so CHANGE DETECTED!
Elements are then dropped from the tail of W (W = 01010110111111, ...) until no split violates the test.
(A simplified sketch of this procedure follows below.)
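A deliberately naive sketch of the loop above: the window is stored explicitly and every cut point is tested, so it costs O(W) per item, whereas the real ADWIN only checks O(log W) cut points over the bucket summary shown on the next slide. The threshold formula used here is one common form of ε_c and should be treated as an assumption of this sketch.

```python
import math

class SimpleAdwin:
    """Naive ADWIN-style change detector: after each new value, drop old
    elements while some split W = W0 . W1 has sub-window means differing
    by more than a Hoeffding-style threshold eps_cut."""
    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []

    def _eps_cut(self, n0, n1):
        m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic-mean-style term
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * len(self.window) / self.delta))

    def _has_cut(self):
        w, total, head = self.window, sum(self.window), 0.0
        for i in range(1, len(w)):             # every split W0 = w[:i], W1 = w[i:]
            head += w[i - 1]
            mu0, mu1 = head / i, (total - head) / (len(w) - i)
            if abs(mu0 - mu1) >= self._eps_cut(i, len(w) - i):
                return True
        return False

    def add(self, x):
        """Add x; return True if change was detected (window was shrunk)."""
        self.window.append(x)
        shrunk = False
        while self._has_cut():
            self.window.pop(0)                 # drop one element from the tail (oldest)
            shrunk = True
        return shrunk

d = SimpleAdwin()
changes = [t for t, x in enumerate([0] * 100 + [1] * 100) if d.add(x)]
print(changes[:1])                             # a change is flagged shortly after t = 100
```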

Algorithm ADaptive Sliding WINdow
Theorem
At every time step we have:
1. (False positive rate bound). If μ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2. (False negative rate bound). Suppose that for some partition of W in two parts W0 W1 (where W1 contains the most recent items) we have |μ_W0 − μ_W1| > 2ε_c. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.

Algorithm ADaptive Sliding WINdow
ADWIN, using a Data Stream Sliding Window Model,
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)
Sliding Window Model: the window 10101011011111 is summarized by buckets
Content:  4 2 2 1 1
Capacity: 7 3 2 1 1
(A sketch of this kind of bucket summary follows below.)
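The bucket rows above can be illustrated with the classic DGIM exponential histogram, a close relative of the structure ADWIN actually uses, so take this as an illustration of the idea rather than ADWIN's own code:

```python
class DGIMCounter:
    """Approximate count of 1s among the last `window` bits.  Buckets are
    (timestamp of most recent 1, size); sizes are powers of two and at
    most `max_per_size` buckets of each size are kept."""
    def __init__(self, window, max_per_size=2):
        self.window, self.max_per_size = window, max_per_size
        self.t = 0
        self.buckets = []                       # newest first

    def add(self, bit):
        self.t += 1
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()                  # oldest bucket fell out of the window
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        size = 1
        while True:                             # merge oldest pairs when a size overflows
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.max_per_size:
                break
            i, j = same[-1], same[-2]           # the two oldest buckets of this size
            self.buckets[j] = (self.buckets[j][0], 2 * size)
            del self.buckets[i]
            size *= 2

    def estimate(self):
        """Estimate is off by at most half the size of the oldest bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2

c = DGIMCounter(window=100)
for i in range(1000):
    c.add(1 if i % 2 == 0 else 0)
print(c.estimate())                             # near the true count of 50
```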

VFDT / CVFDT
Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001
It keeps its model consistent with a sliding window of examples
Constructs "alternative branches" as preparation for changes
If the alternative branch becomes more accurate, a switch of tree branches occurs
[Decision tree figure]

Decision Trees: CVFDT
[Decision tree figure]
No theoretical guarantees on the error rate of CVFDT
CVFDT parameters:
1. W: the example window size
2. T0: number of examples used to check at each node if the splitting attribute is still the best
3. T1: number of examples used to build the alternate tree
4. T2: number of examples used to test the accuracy of the alternate tree

Decision Trees: Hoeffding Adaptive Tree
Hoeffding Adaptive Tree:
replace frequency statistics counters by estimators
no need for a window to store examples, because the statistics needed are maintained by the estimators
change the way the substitution of alternate subtrees is checked, using a change detector with theoretical guarantees
Advantages over CVFDT:
1. Theoretical guarantees
2. No parameters

ADWIN Bagging (KDD'09)
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
On the ratio of false positives and negatives
On the relation between the size of the current window and the change rate
ADWIN Bagging
When a change is detected, the worst classifier is removed and a new classifier is added.

Leveraging Bagging for Evolving Data Streams (ECML-PKDD'10)
Randomization as a powerful tool to increase accuracy and diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets

Leveraging Bagging for Evolving Data Streams
Leveraging Bagging
Using Poisson(λ)
Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = e_T / (1 − e_T)

Empirical evaluation
Method | Accuracy | RAM-Hours
Hoeffding Tree | 74.03% | 0.01
Online Bagging | 77.15% | 2.98
ADWIN Bagging | 79.24% | 1.48
Leveraging Bagging | 85.54% | 20.17
Leveraging Bagging MC | 85.37% | 22.04
Leveraging Bagging ME | 80.77% | 0.87
Leveraging Bagging: using Poisson(λ)
Leveraging Bagging MC: using Poisson(λ) and Random Output Codes
Leveraging Bagging ME: using weight 1 if misclassified, otherwise e_T / (1 − e_T)

Clustering

Clustering
Definition
Clustering is the distribution of a set of instances (examples) into previously unknown groups according to some common relations or affinities.
Example
Market segmentation of customers
Example
Social network communities

Clustering
Definition
Given
a set of instances I
a number of clusters K
an objective function cost(I)
a clustering algorithm computes an assignment of a cluster to each instance
f: I → {1, ..., K}
that minimizes the objective function cost(I)

Clustering
Definition
Given
a set of instances I
a number of clusters K
an objective function cost(C, I)
a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function
cost(C, I) = Σ_{x∈I} d²(x, C)
where
d(x, c): distance function between x and c
d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C

k-means
1. Choose k initial centers C = {c_1, ..., c_k}
2. while stopping criterion has not been met
   For i = 1, ..., N
      find the closest center c_k ∈ C to each instance p_i
      assign instance p_i to cluster C_k
   For k = 1, ..., K
      set c_k to be the center of mass of all points in C_k

k-means++
1. Choose an initial center c_1
   For k = 2, ..., K
      select c_k = p ∈ I with probability d²(p, C) / cost(C, I)
2. while stopping criterion has not been met
   For i = 1, ..., N
      find the closest center c_k ∈ C to each instance p_i
      assign instance p_i to cluster C_k
   For k = 1, ..., K
      set c_k to be the center of mass of all points in C_k
(A small sketch of the seeding step follows below.)
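A small sketch of the d²-weighted seeding in step 1; the toy points and the uniform choice of the first center are this sketch's own details:

```python
import random

def d2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k, seed=0):
    """k-means++ seeding: each new center is a point p chosen with
    probability d^2(p, C) / cost(C, points)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        dists = [min(d2(p, c) for c in centers) for p in points]   # d^2(p, C)
        r = rng.random() * sum(dists)                              # cost(C, points)
        acc = 0.0
        for p, w in zip(points, dists):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeans_pp_seeds(pts, k=3))               # three seeds, typically well spread out
```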

Performance Measures
Internal measures
Sum of squared distances
Dunn index D = d_min / d_max
C-Index C = (S − S_min) / (S_max − S_min)
External measures
Rand measure
F measure
Jaccard
Purity

BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
Clustering Features CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points
Properties:
Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
Easy to compute: average inter-cluster distance and average intra-cluster distance
Uses a CF tree
Height-balanced tree with two parameters
B: branching factor
T: radius leaf threshold
(A small sketch of clustering features follows below.)
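A minimal sketch of a clustering feature and its additivity; the radius formula follows from the three summaries, and everything else about the CF tree is omitted:

```python
import math

class CF:
    """BIRCH-style clustering feature CF = (N, LS, SS)."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, [0.0] * dim, 0.0

    def add(self, x):                            # absorb one point
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):                      # additivity: CF1 + CF2
        out = CF(len(self.ls))
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):                            # RMS distance of the points to the centroid
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```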

BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
Phase 1: scan the data and build an initial in-memory CF tree
Phase 2: condense the data into a smaller CF tree (optional)
Phase 3: global clustering of the leaf CF entries
Phase 4: cluster refinement (optional, needs additional passes)

Clu-Stream
Clu-Stream
Uses micro-clusters to store statistics on-line
Clustering Features CF = (N, LS, SS, LT, ST)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points
LT: linear sum of the time stamps
ST: square sum of the time stamps
Uses a pyramidal time frame

Clu-Stream
On-line Phase
For each new point that arrives, either
the point is absorbed by a micro-cluster, or
the point starts a new micro-cluster of its own, and to make room we
delete the oldest micro-cluster, or
merge two of the oldest micro-clusters
Off-line Phase
Apply k-means using the micro-clusters as points
(A small sketch of the on-line decision follows below.)
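A much-simplified sketch of the on-line decision (absorb, or create and make room). The boundary factor, the eviction rule, and the treatment of singleton clusters are simplifications of this sketch; the merge option and the pyramidal time frame are omitted entirely.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class MicroCluster:
    """CF = (N, LS, SS, LT, ST) for the points absorbed so far."""
    def __init__(self, x, t):
        self.n, self.ls, self.ss = 1, list(x), sum(v * v for v in x)
        self.lt, self.st = t, t * t

    def add(self, x, t):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)
        self.lt += t
        self.st += t * t

    def center(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        c2 = sum(v * v for v in self.center())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

def process(point, t, clusters, q=100, boundary_factor=2.0):
    if clusters:
        mc = min(clusters, key=lambda m: dist(point, m.center()))
        if dist(point, mc.center()) <= boundary_factor * max(mc.radius(), 1e-9):
            mc.add(point, t)                              # absorbed
            return
    if len(clusters) >= q:                                # make room: drop the oldest
        clusters.remove(min(clusters, key=lambda m: m.lt / m.n))
    clusters.append(MicroCluster(point, t))               # start a new micro-cluster
```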

StreamKM++: Coresets
Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
Solving the problem for the coreset provides an approximate solution for the problem on P.
(k, ε)-coreset
A (k, ε)-coreset S of P is a subset of P such that for each C of size k
(1 − ε) · cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε) · cost(P, C)

StreamKM++: Coresets
Coreset Tree
Choose a leaf node l at random
Choose a new sample point, denoted by q_{t+1}, from P_l according to d²
Based on q_l and q_{t+1}, split P_l into two subclusters and create two child nodes
StreamKM++
Maintain L = ⌈log₂(n/m) + 2⌉ buckets B_0, B_1, ..., B_{L−1}
(A small sketch of the bucket scheme follows below.)
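A sketch of the bucket (merge-and-reduce) bookkeeping only. Here reduce_to_m is a plain random subsample standing in for the coreset-tree construction of the real StreamKM++:

```python
import random

def reduce_to_m(points, m, rng):
    """Stand-in for coreset construction: keep at most m points."""
    return points if len(points) <= m else rng.sample(points, m)

class BucketStream:
    def __init__(self, m, seed=0):
        self.m, self.rng = m, random.Random(seed)
        self.b0 = []            # B0: up to m raw points
        self.buckets = []       # B1, B2, ...: each empty (None) or a summary of m points

    def insert(self, point):
        self.b0.append(point)
        if len(self.b0) < self.m:
            return
        carry, self.b0 = self.b0, []                # B0 is full: push it upwards
        i = 0
        while True:
            if i == len(self.buckets):
                self.buckets.append(carry)
                return
            if self.buckets[i] is None:
                self.buckets[i] = carry
                return
            # two summaries at level i: merge, reduce, and carry to level i + 1
            carry = reduce_to_m(self.buckets[i] + carry, self.m, self.rng)
            self.buckets[i] = None
            i += 1

    def summary(self):
        """Union of all bucket summaries: a small proxy for the whole stream."""
        return list(self.b0) + [p for b in self.buckets if b for p in b]

bs = BucketStream(m=20)
for x in range(1000):
    bs.insert((float(x % 37), float(x % 11)))
print(len(bs.summary()))        # at most m points per non-empty bucket
```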

Frequent Pattern Mining

Frequent Patterns
Suppose D is a dataset of patterns, t ∈ D, and minsup is a constant.
Definition
Support(t): number of patterns in D that are superpatterns of t.
Definition
Pattern t is frequent if Support(t) ≥ minsup.
Frequent Subpattern Problem
Given D and minsup, find all frequent subpatterns of patterns in D.

Pattern Mining
Dataset Example
Document | Patterns
d1 | abce
d2 | cde
d3 | abce
d4 | acde
d5 | abcde
d6 | bcd

Itemset Mining
d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd
Support (documents) | Frequent itemsets
d1,d2,d3,d4,d5,d6 | c
d1,d2,d3,d4,d5 | e, ce
d1,d3,d4,d5 | a, ac, ae, ace
d1,d3,d5,d6 | b, bc
d2,d4,d5,d6 | d, cd
d1,d3,d5 | ab, abc, abe, be, bce, abce
d2,d4,d5 | de, cde
minimal support = 3

Itemset Mining
(same dataset d1..d6 as above)
Support | Frequent itemsets
6 | c
5 | e, ce
4 | a, ac, ae, ace
4 | b, bc
4 | d, cd
3 | ab, abc, abe, be, bce, abce
3 | de, cde

Itemset Mining
(same dataset d1..d6 as above)
Support | Frequent itemsets | Gen | Closed | Max
6 | c | c | c |
5 | e, ce | e | ce |
4 | a, ac, ae, ace | a | ace |
4 | b, bc | b | bc |
4 | d, cd | d | cd |
3 | ab, abc, abe, be, bce, abce | ab, be | abce | abce
3 | de, cde | de | cde | cde

Itemset Mining
(The slides then step through the table to show that the closed patterns, together with their supports, are enough to recover the support of every frequent pattern: for example, Support(e) = Support(ce) = 5 and Support(a) = Support(ace) = 4.)

Closed Patterns
Usually, there are too many frequent patterns. We can compute a smaller set, while keeping the same information.
Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).

Closed Patterns
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).
Definition
A frequent pattern t is closed if none of its proper superpatterns has the same support as it has.
Frequent subpatterns and their supports can be generated from closed patterns.

Maximal Patterns
Definition
A frequent pattern t is maximal if none of its proper superpatterns is frequent.
Frequent subpatterns can be generated from maximal patterns, but not with their support.
All maximal patterns are closed, but not all closed patterns are maximal.
(A brute-force check on the running example follows below.)
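A brute-force check of the running example, which reproduces the Closed and Max columns of the table above (minsup = 3):

```python
from itertools import combinations

docs = ["abce", "cde", "abce", "acde", "abcde", "bcd"]
minsup = 3
items = sorted(set("".join(docs)))

def support(itemset):
    return sum(1 for d in docs if set(itemset) <= set(d))

frequent = {}
for r in range(1, len(items) + 1):
    for t in combinations(items, r):
        if support(t) >= minsup:
            frequent[frozenset(t)] = support(t)

# closed: no proper superpattern with the same support
closed = [t for t in frequent
          if all(frequent[u] < frequent[t] for u in frequent if t < u)]
# maximal: no frequent proper superpattern
maximal = [t for t in frequent if not any(t < u for u in frequent)]

print(sorted("".join(sorted(t)) for t in closed))   # ['abce', 'ace', 'bc', 'c', 'cd', 'cde', 'ce']
print(sorted("".join(sorted(t)) for t in maximal))  # ['abce', 'cde']
```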

Non-streaming frequent itemset miners
Representation:
Horizontal layout
T1: a, b, c
T2: b, c, e
T3: b, d, e
Vertical layout
a: 1 0 0
b: 1 1 1
c: 1 1 0
Search:
Breadth-first (levelwise): Apriori
Depth-first: Eclat, FP-Growth

Mining Patterns over Data Streams
Requirements: fast, use a small amount of memory, and adaptive
Type:
Exact or approximate
Per batch, per transaction
Incremental, sliding window, adaptive
Frequent, closed, maximal patterns

Moment
Computes closed frequent itemsets in a sliding window
Uses a Closed Enumeration Tree
Uses 4 types of nodes:
Closed Nodes
Intermediate Nodes
Unpromising Gateway Nodes
Infrequent Gateway Nodes
Adding transactions: closed itemsets remain closed
Removing transactions: infrequent itemsets remain infrequent

FP-Stream
Mining frequent itemsets at multiple time granularities
Based on FP-Growth
Maintains
a pattern tree
a tilted-time window
Allows answering time-sensitive queries
Keeps more detailed information on recent data
Drawback: time and memory complexity

Tree and Graph Mining: Dealing with Time Changes
Keep a window on recent stream elements
Actually, just its lattice of closed sets!
Keep track of the number of closed patterns in the lattice, N
Use some change detector on N
When change is detected:
Drop the stale part of the window
Update the lattice to reflect this deletion, using the deletion rule
Alternatively, use a sliding window of some fixed size

Summary

A Short Course in Data Stream Mining
Short Course Summary
1. Introduction
2. Classification
3. Ensemble Methods
4. Clustering
5. Frequent Pattern Mining
Open Source Software
1. MOA: http://moa.cms.waikato.ac.nz/
2. SAMOA: http://samoa-project.net/