Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez Orallo @ PAPIs Connect

papisdotio · Mar 14, 2016 · 29 slides

About This Presentation

Beginners in machine learning usually presume that a proper assessment of a predictive model should simply comply with the golden rule of evaluation (split the data into train and test) in order to choose the most accurate model, which will hopefully behave well when deployed into production. However, deployment conditions (class distributions, costs and other aspects of the context) often differ from the testing conditions, so a proper evaluation must take these context changes into account.


Slide Content

MACHINE LEARNING
PERFORMANCE EVALUATION:
TIPS AND PITFALLS
José Hernández-Orallo
DSIC, ETSINF, UPV, [email protected]

OUTLINE
ML evaluation basics: the golden rule
Test vs. deployment. Context change
Cost and data distribution changes
ROC Analysis
Metrics for a range of contexts
Beyond binary classification
Lessons learnt

ML EVALUATION BASICS: THE GOLDEN RULE
Creating ML models is easy.




Creating good ML models is not that easy.
o Especially if we are not crystal clear about the criteria to tell how good our models are!

So, good for what?




TIP: ML models should perform well during deployment.
[Image: a big "TRAIN" button ("Press here").]

ML EVALUATION BASICS: THE GOLDEN RULE
We need performance metrics and evaluation
procedures that best match the deployment
conditions.

Classification, regression, clustering, association
rules, … use different metrics and procedures.

Estimating how good a model will be is crucial:




TIP: Golden rule: never overstate the performance that an ML model is expected to have during deployment because of good performance in optimal "laboratory conditions".

ML EVALUATION BASICS: THE GOLDEN RULE
Caveat: Overfitting and underfitting




o In predictive tasks, the golden rule is simplified to:

TIP: Golden rule for predictive tasks: never use the same examples for training the model and evaluating it.

[Diagram: the data is split into training and test sets; the algorithms learn models from the training set, and the evaluation on the test set selects the best model.]

For instance, the error of a model h against the true labels f on a test set S of n examples:

$\mathrm{error}(h) = \frac{1}{n} \sum_{x \in S} \big( f(x) - h(x) \big)^{2}$

ML EVALUATION BASICS: THE GOLDEN RULE
Caveat: What if there is not much data available?
o Bootstrap or cross-validation.

o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.

TIP: No need to use cross-validation for large datasets.
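A minimal sketch of n-fold cross-validation, assuming scikit-learn; the dataset, model and metric are illustrative choices, not taken from the talk:

# Minimal sketch: 10-fold cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset
model = LogisticRegression(max_iter=5000)    # illustrative model

# The metric is estimated n times (once per fold) and then averaged.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")

# A final model is trained with all the data.
final_model = model.fit(X, y)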

TEST VS. DEPLOYMENT: CONTEXT CHANGE
Is this enough?










Caveat: the simplified golden rule assumes that the context is the same for testing conditions as for deployment conditions.

Context is everything.
[Images contrasting testing conditions (lab) with deployment conditions (production).]

TEST VS. DEPLOYMENT: CONTEXT CHANGE
Contexts change repeatedly...








o Caveat: The evaluation for a context can be very optimistic, or simply wrong, if the deployment context changes.

[Diagram: the model is trained once with training data from context A, and is then deployed on data from contexts B, C, D, ..., each deployment producing its own output.]

TIP: Take context change into account from the start.

TEST VS. DEPLOYMENT: CONTEXT CHANGE
Types of contexts in ML
o Data shift (covariate shift, prior probability shift, concept drift, ...).
  Changes in p(X), p(Y), p(X|Y), p(Y|X), p(X,Y).
o Costs and utility functions.
  Cost matrices, loss functions, reject costs, attribute costs, error tolerance, ...
o Uncertain, missing or noisy information.
  Noise or uncertainty degree, % missing values, missing attribute set, ...
o Representation change, constraints, background knowledge.
  Granularity level, complex aggregates, attribute set, etc.
o Task change.
  Regression cut-offs, bins, number of classes or clusters, quantification, ...

COST AND DATA DISTRIBUTION CHANGES
Classification. Example: 100,000 instances.
o High imbalance (π₀ = Pos/(Pos+Neg) = 0.005, i.e. 500 positives and 99,500 negatives).

Confusion matrices (rows: predicted; columns: actual):

c1           open    close
  OPEN        300      500
  CLOSE       200    99000

c2           open    close
  OPEN          0        0
  CLOSE       500    99500

c3           open    close
  OPEN        400     5400
  CLOSE       100    94100

c1: ERROR = 0.7%
  TPR (sensitivity, recall) = 300 / 500 = 60%
  FNR = 200 / 500 = 40%
  TNR (specificity) = 99000 / 99500 = 99.5%
  FPR = 500 / 99500 = 0.5%
  PPV (precision) = 300 / 800 = 37.5%
  NPV = 99000 / 99200 = 99.8%
  Macro-average = (60 + 99.5) / 2 = 79.75%

c2: ERROR = 0.5%
  TPR = 0 / 500 = 0%
  FNR = 500 / 500 = 100%
  TNR = 99500 / 99500 = 100%
  FPR = 0 / 99500 = 0%
  PPV = 0 / 0 = UNDEFINED
  NPV = 99500 / 100000 = 99.5%
  Macro-average = (0 + 100) / 2 = 50%

c3: ERROR = 5.5%
  TPR = 400 / 500 = 80%
  FNR = 100 / 500 = 20%
  TNR = 94100 / 99500 = 94.6%
  FPR = 5400 / 99500 = 5.4%
  PPV = 400 / 5800 = 6.9%
  NPV = 94100 / 94200 = 99.9%
  Macro-average = (80 + 94.6) / 2 = 87.3%

Which classifier is best?
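The rates quoted above can be recomputed from the confusion matrices. A minimal sketch in plain Python, treating "open" as the positive class, as on the slide:

# Minimal sketch: per-class rates from a binary confusion matrix.
def rates(tp, fn, fp, tn):
    pos, neg = tp + fn, fp + tn
    return {
        "error":    (fn + fp) / (pos + neg),
        "TPR":      tp / pos,                                      # sensitivity, recall
        "FNR":      fn / pos,
        "TNR":      tn / neg,                                      # specificity
        "FPR":      fp / neg,
        "PPV":      tp / (tp + fp) if tp + fp else float("nan"),   # precision
        "NPV":      tn / (tn + fn),
        "macroavg": (tp / pos + tn / neg) / 2,                     # average of TPR and TNR
    }

# Confusion matrices from the slide, as (TP, FN, FP, TN):
for name, cm in {"c1": (300, 200, 500, 99000),
                 "c2": (0, 500, 0, 99500),
                 "c3": (400, 100, 5400, 94100)}.items():
    print(name, {k: round(v, 4) for k, v in rates(*cm).items()})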

COST AND DATA DISTRIBUTION CHANGES
Caveat: Not all errors are equal.
o Example: keeping a valve closed in a nuclear plant when it should be open can provoke an explosion, while opening a valve when it should be closed can provoke a stop.
o Cost matrix (rows: predicted; columns: actual):

              open    close
  OPEN          0€     100€
  CLOSE      2000€       0€

TIP: The best classifier is not the most accurate, but the one with the lowest cost.

COST AND DATA DISTRIBUTION CHANGES
Classification. Example: 100,000 instances.
o High imbalance (π₀ = Pos/(Pos+Neg) = 0.005).

Cost matrix (rows: predicted; columns: actual):

              open    close
  OPEN          0€     100€
  CLOSE      2000€       0€

Confusion matrices (rows: predicted; columns: actual):

c1           open    close
  OPEN        300      500
  CLOSE       200    99000

c2           open    close
  OPEN          0        0
  CLOSE       500    99500

c3           open    close
  OPEN        400     5400
  CLOSE       100    94100

Resulting cost matrices (cell counts × cell costs):

c1           open       close
  OPEN         0€     50,000€
  CLOSE  400,000€          0€
TOTAL COST: 450,000€

c2           open       close
  OPEN         0€          0€
  CLOSE 1,000,000€         0€
TOTAL COST: 1,000,000€

c3           open       close
  OPEN         0€    540,000€
  CLOSE  200,000€          0€
TOTAL COST: 740,000€

TIP: For two classes, the value "slope" (together with each classifier's FNR and FPR) is sufficient to tell which classifier is best. This is the operating condition, context or skew.
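These totals come from multiplying each confusion-matrix cell by the corresponding cost and summing. A minimal sketch, assuming NumPy, with the numbers from the slide:

# Minimal sketch: total deployment cost = sum over cells of (count × cost).
# Matrices are indexed [predicted][actual] with classes (OPEN, CLOSE), as on the slide.
import numpy as np

cost = np.array([[0, 100],        # predicted OPEN:  actual open, actual close
                 [2000, 0]])      # predicted CLOSE: actual open, actual close

confusion = {
    "c1": np.array([[300, 500], [200, 99000]]),
    "c2": np.array([[0, 0], [500, 99500]]),
    "c3": np.array([[400, 5400], [100, 94100]]),
}

for name, cm in confusion.items():
    total = int((cm * cost).sum())    # elementwise product, then sum
    print(f"{name}: total cost = {total:,}€")
# c1: 450,000€   c2: 1,000,000€   c3: 740,000€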

ROC ANALYSIS
The context or skew (the class distribution and the
costs of each error) determines classifier goodness.

o Caveat: in many circumstances, until deployment time, we do not know the class distribution and/or it is difficult to estimate the cost matrix (e.g. a spam filter), but models are usually learned before that.

o SOLUTION: ROC (Receiver Operating Characteristic) analysis.

ROC ANALYSIS
The ROC Space
o Using the normalised terms of the confusion matrix: TPR, FNR, TNR, FPR.

Example (rows: predicted; columns: actual):

             open    close
  OPEN        400    12000
  CLOSE       100    87500

TPR = 400 / 500 = 80%
FNR = 100 / 500 = 20%
TNR = 87500 / 99500 = 87.9%
FPR = 12000 / 99500 = 12.1%

Normalised by column (actual class):

             open    close
  OPEN        0.8    0.121
  CLOSE       0.2    0.879

[ROC space plot: the classifier is a point with x = FPR (false positives) and y = TPR (true positives).]

ROC ANALYSIS
Good and bad classifiers


[Three ROC-space plots (TPR vs. FPR) illustrating the cases below.]

o Good classifier: high TPR, low FPR.
o Bad classifier: low TPR, high FPR.
o Bad classifier (more realistic).

ROC ANALYSIS
The ROC “Curve”: “Continuity”.


[ROC diagram: two classifier points in (FPR, TPR) space and the segment joining them.]

o Given two classifiers:
  We can construct any "intermediate" classifier just by randomly weighting both classifiers (giving more or less weight to one or the other).
  This creates a "continuum" of classifiers between any two classifiers (see the sketch below).
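A minimal sketch of this construction; the classifier points and the weighting are illustrative assumptions:

# Minimal sketch: an "intermediate" classifier between two classifiers A and B.
# With probability w we use A's prediction, otherwise B's; its expected ROC point
# is the convex combination of the two points.
import random

def interpolate(point_a, point_b, w):
    """Expected (FPR, TPR) of the randomly weighted classifier."""
    fpr = w * point_a[0] + (1 - w) * point_b[0]
    tpr = w * point_a[1] + (1 - w) * point_b[1]
    return fpr, tpr

def randomised_predict(classifier_a, classifier_b, x, w):
    """Pick one of the two crisp classifiers at random for each instance."""
    return classifier_a(x) if random.random() < w else classifier_b(x)

# Example: halfway between A = (0.1, 0.6) and B = (0.4, 0.9)
print(interpolate((0.1, 0.6), (0.4, 0.9), 0.5))   # (0.25, 0.75)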

ROC ANALYSIS
The ROC “Curve”: Construction


o Given several classifiers:
  We construct the convex hull of their points (FPR, TPR) as well as the two trivial classifiers (0,0) and (1,1).
  The classifiers below the ROC curve are discarded.
  The best classifier (from those remaining) will be selected at application time.

[ROC diagram: classifier points and their convex hull; the diagonal shows the worst situation possible.]

TIP: We can discard the classifiers below the hull because there is no context (combination of class distribution / cost matrix) for which they could be optimal.
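A minimal sketch of the hull construction in plain Python; the classifier points are illustrative (here, the ROC points of c1, c2 and c3 from the earlier example):

# Minimal sketch: keep only the classifiers on the upper convex hull of the
# ROC points (FPR, TPR), together with the trivial classifiers (0,0) and (1,1).
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:                      # monotone-chain upper hull, left to right
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()                 # pop points that lie on or below the chord
        hull.append(p)
    return hull

classifiers = [(0.005, 0.60), (0.0, 0.0), (0.054, 0.80)]   # c1, c2, c3 from earlier
print(roc_convex_hull(classifiers))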

ROC ANALYSIS
In the context of application, we choose the optimal
classifier from those kept. Example 1:


Operating condition (skew): FNcost = 2, FPcost = 1, Neg/Pos = 4.
slope = (Neg × FPcost) / (Pos × FNcost) = 4 / 2 = 2

[ROC plot (true positive rate vs. false positive rate): an iso-cost line of slope 2 sweeping down from the top-left touches the convex hull at the optimal classifier for this context.]

ROC ANALYSIS
In the context of application, we choose the optimal
classifier from those kept. Example 2:


Operating condition (skew): FNcost = 8, FPcost = 1, Neg/Pos = 4.
slope = (Neg × FPcost) / (Pos × FNcost) = 4 / 8 = 0.5

[ROC plot (true positive rate vs. false positive rate): with slope 0.5 the iso-cost line touches the convex hull at the optimal classifier for this context.]
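Once the hull is known, the optimal point for a given operating condition can be found by maximising TPR − slope·FPR, which is equivalent to minimising expected cost. A minimal sketch, reusing the illustrative hull points from the previous sketch:

# Minimal sketch: pick the hull point that minimises expected cost for a context.
def best_operating_point(hull_points, neg, pos, fp_cost, fn_cost):
    slope = (neg * fp_cost) / (pos * fn_cost)
    return max(hull_points, key=lambda p: p[1] - slope * p[0])   # p = (FPR, TPR)

hull = [(0.0, 0.0), (0.005, 0.60), (0.054, 0.80), (1.0, 1.0)]    # illustrative hull
print(best_operating_point(hull, neg=4, pos=1, fp_cost=1, fn_cost=2))  # slope = 2
print(best_operating_point(hull, neg=4, pos=1, fp_cost=1, fn_cost=8))  # slope = 0.5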

ROC ANALYSIS
Crisp and Soft Classifiers:
o A "hard" or "crisp" classifier predicts a single class from the set of possible classes.
  Caveat: crisp classifiers are not versatile when the context changes.

o A "soft" or "scoring" (probabilistic) classifier predicts a class, but accompanies each prediction with an estimate of its reliability (confidence).
  Most learning methods can be adapted to generate soft classifiers.

o A soft classifier can be converted into a crisp classifier using a threshold.
  Example: "if score > 0.7 then class A, otherwise class B".
  With different thresholds, we have different classifiers, giving more or less relevance to each of the classes (a small sketch follows).

TIP: Soft or scoring classifiers can be reframed to each context.
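A minimal sketch of how one soft classifier yields many crisp classifiers; the scores and labels are made up for illustration:

# Minimal sketch: one soft classifier (its scores), several crisp classifiers (its thresholds).
import numpy as np

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05])

for threshold in (0.3, 0.5, 0.7):            # each threshold is a different crisp classifier
    pred = (scores > threshold).astype(int)  # "if score > threshold then positive"
    tpr = pred[y_true == 1].mean()
    fpr = pred[y_true == 0].mean()
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")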

ROC ANALYSIS
ROC Curve of a Soft Classifier:
o We can consider each threshold as a different classifier and draw them in the ROC space. This generates a curve...

We have a "curve" for just one soft classifier.

[Table (© Tom Fawcett): test instances with their actual classes (p/n) and the predicted classes obtained at different thresholds; each threshold yields one point in ROC space.]
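A minimal sketch of the threshold sweep, assuming scikit-learn; the scores and labels are made up for illustration:

# Minimal sketch: every threshold over the scores of one soft classifier gives a
# point (FPR, TPR); together these points form its ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one point per distinct threshold
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold>{th:.2f}: FPR={f:.2f}, TPR={t:.2f}")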

ROC ANALYSIS
ROC Curve of a soft classifier.




ROC ANALYSIS
ROC Curve of a soft classifier.



[Plot (© Robert Holte) comparing the ROC curves of two classifiers, "insts" and "insts2": in one zone of the ROC space the best classifier is "insts"; in another zone the best classifier is "insts2".]

TIP: We must preserve the classifiers that have at least one "best zone" (dominance) and then behave in the same way as we did for crisp classifiers.

METRICS FOR A RANGE OF CONTEXTS
What if we want to select just one soft classifier?
o The classifier with greatest Area Under the ROC Curve (AUC) is chosen.

TIP: AUC does not consider calibration. If calibration is important, use other metrics, such as the Brier score.

TIP: AUC is useful, but it is always better to draw the curves and choose depending on the operating condition.
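A minimal sketch computing both metrics, assuming scikit-learn; the labels and predicted probabilities are made up for illustration. Note that a monotonic distortion of the scores leaves AUC unchanged but degrades the Brier score:

# Minimal sketch: AUC (ranking quality) vs. Brier score (sensitive to calibration).
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
proba = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05])

print("AUC:        ", round(roc_auc_score(y_true, proba), 3))
print("Brier score:", round(brier_score_loss(y_true, proba), 3))

# Squashing all probabilities towards 1 keeps the ranking (same AUC)
# but miscalibrates the classifier (worse Brier score).
squashed = 0.5 + proba / 2
print("AUC   (squashed):", round(roc_auc_score(y_true, squashed), 3))
print("Brier (squashed):", round(brier_score_loss(y_true, squashed), 3))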

BEYOND BINARY CLASSIFICATION
Cost-sensitive evaluation is perfectly extensible to classification with more than two classes.

Confusion matrix (ERROR; rows: predicted; columns: actual):

            low   medium   high
  low        20        0     13
  medium      5       15      4
  high        4        7     60

Cost matrix (COST; rows: predicted; columns: actual; negative costs are benefits):

            low   medium   high
  low        0€       5€     2€
  medium   200€   -2000€    10€
  high      10€       1€   -15€

Total cost: -29,787€

For regression, we only need a cost function.
o For instance, an asymmetric absolute error.
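A minimal sketch of the multiclass cost computation, assuming NumPy, with the numbers from the slide:

# Minimal sketch: multiclass cost-sensitive evaluation as the elementwise product
# of the confusion matrix and the cost matrix.
import numpy as np

# Rows: predicted (low, medium, high); columns: actual (low, medium, high).
confusion = np.array([[20, 0, 13],
                      [5, 15, 4],
                      [4, 7, 60]])
cost = np.array([[0, 5, 2],
                 [200, -2000, 10],
                 [10, 1, -15]])      # negative costs are benefits

total = int((confusion * cost).sum())
print(f"Total cost: {total}€")       # -29787€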

BEYOND BINARY CLASSIFICATION
ROC analysis for multiclass problems is troublesome.
o Given c classes, there is a c × (c−1) dimensional space.
o Calculating the convex hull is impractical.

The AUC measure has been extended:
o All-pair extension (Hand & Till 2001):

  $\mathrm{AUC}_{HT} = \frac{1}{c(c-1)} \sum_{i=1}^{c} \sum_{j=1,\, j \neq i}^{c} \mathrm{AUC}(i, j)$

o There are other extensions.
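A minimal sketch of an all-pairs multiclass AUC; scikit-learn's one-vs-one option follows the Hand & Till (2001) approach, and the dataset and model here are illustrative assumptions:

# Minimal sketch: all-pairs (one-vs-one) multiclass AUC with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # illustrative 3-class dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
proba = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)

auc_ht = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")
print(f"multiclass AUC (all pairs, one-vs-one): {auc_ht:.3f}")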

BEYOND BINARY CLASSIFICATION
ROC analysis for regression (using shifts).
o The operating condition is the asymmetry factor α. For instance, α = 2/3 means that underpredictions are twice as expensive as overpredictions.
o The area over the curve (AOC) is the error variance. If the model is unbiased, then it is ½ MSE.
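A minimal sketch of an asymmetric absolute error parameterised by α. The exact scaling used in the talk is not shown in the deck; this weighting (α for underpredictions, 1−α for overpredictions) is an assumption that matches the example ratio:

# Minimal sketch: asymmetric absolute error with asymmetry factor alpha.
# alpha = 2/3 makes underpredictions twice as expensive as overpredictions
# (assumed parameterisation, chosen to match the ratio on the slide).
def asymmetric_absolute_error(y_true, y_pred, alpha=2/3):
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        residual = yp - yt
        if residual < 0:                       # underprediction
            total += alpha * (-residual)
        else:                                  # overprediction (or exact)
            total += (1 - alpha) * residual
    return total / len(y_true)

print(asymmetric_absolute_error([10, 10], [8, 12]))   # underprediction weighted 2x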

LESSONS LEARNT
Model evaluation goes well beyond a split or cross-validation plus a single metric (accuracy or MSE).
Models can be generated once but then applied to different contexts / operating conditions.
Drawing models for different operating conditions allows us to determine dominance regions and the optimal threshold to make optimal decisions.
Soft (scoring) models are much more powerful than crisp models. ROC analysis really makes sense for soft models.
Areas under/over the curves are an aggregate of the performance over a range of operating conditions, but they should not replace ROC analysis.

LESSONS LEARNT
We have just seen an example with one kind of context change: cost changes and changes in the output distribution.
Similar approaches exist for other types of context changes:
o Uncertain, missing or noisy information.
o Representation change, constraints, background knowledge.
o Task change.

http://www.reframe-d2k.org/