Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez Orallo @ PAPIs Connect
papisdotio · 29 slides · Mar 14, 2016
About This Presentation
Beginners in machine learning usually presume that a proper assessment of a predictive model should simply comply with the golden rule of evaluation (split the data into train and test) in order to choose the most accurate model, which will hopefully behave well when deployed into production. However, things are more elaborate in the real world. The contexts in which a predictive model is evaluated and deployed can differ significantly, and a model may not cope well with the change, especially if it has been evaluated with a performance metric that is insensitive to these changing contexts. A more comprehensive and reliable view of machine learning evaluation is illustrated with several common pitfalls and the tips that address them, such as the use of probabilistic models, calibration techniques, imbalanced costs and visualisation tools such as ROC analysis.
Jose Hernandez Orallo, Ph.D., is a senior lecturer at Universitat Politecnica de Valencia. His research areas include: Data Mining and Machine Learning, Model Reframing, Inductive Programming, and Intelligence Measurement and Artificial General Intelligence.
Slide Content
MACHINE LEARNING
PERFORMANCE EVALUATION:
TIPS AND PITFALLS
José Hernández-Orallo
DSIC, ETSINF, UPV, [email protected]
OUTLINE
ML evaluation basics: the golden rule
Test vs. deployment. Context change
Cost and data distribution changes
ROC Analysis
Metrics for a range of contexts
Beyond binary classification
Lessons learnt
ML EVALUATION BASICS: THE GOLDEN RULE
Creating ML models is easy.
Creating good ML models is not that easy.
o Especially if we are not crystal clear about the criteria to tell how good our models are!
So, good for what?
TIP: ML models should perform well during deployment.
[Image: a "TRAIN" button labelled "Press here".]
ML EVALUATION BASICS: THE GOLDEN RULE
We need performance metrics and evaluation procedures that best match the deployment conditions.
Classification, regression, clustering, association rules, … use different metrics and procedures.
Estimating how good a model is, is crucial:

TIP: Golden rule: never overstate the performance that an ML model is expected to have during deployment because of good performance in optimal "laboratory conditions".
ML EVALUATION BASICS: THE GOLDEN RULE
Caveat: Overfitting and underfitting
o In predictive tasks, the golden rule is simplified to:

TIP: Golden rule for predictive tasks: never use the same examples for training the model and evaluating it.

[Diagram: the data is split into a training set and a test set; the algorithms build models from the training data, the models are evaluated on the test data, and the best model is selected.]

$\mathrm{error}(h) = \frac{1}{n}\sum_{x \in S}\left(f(x)-h(x)\right)^{2}$
ML EVALUATION BASICS: THE GOLDEN RULE
Caveat: What if there is not much data available?
o Bootstrap or cross-validation.

o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.

TIP: No need to use cross-validation for large datasets.
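This procedure can be sketched in a few lines of Python; the dataset, learner and number of folds below are illustrative assumptions, not taken from the slides:

```python
# A minimal sketch of n-fold cross-validation followed by training the
# final model on all the data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# The metric (here: accuracy) is computed n times, once per fold, then averaged.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("mean accuracy over folds: %.3f" % scores.mean())

# A final model is trained with all the data.
final_model = model.fit(X, y)
```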
TEST VS. DEPLOYMENT: CONTEXT CHANGE
Is this enough?
Caveat: the simplified golden rule assumes that the context of the testing conditions is the same as the context of the deployment conditions.
Context is everything.
[Images: testing conditions (lab) vs. deployment conditions (production).]
TEST VS. DEPLOYMENT: CONTEXT CHANGE
Contexts change repeatedly...
o Caveat: the evaluation for a context can be very optimistic, or simply wrong, if the deployment context changes.

[Diagram: a model is trained on data from Context A and then deployed on data from Contexts B, C, D, …, each deployment context producing its own output.]

TIP: Take context change into account from the start.
TEST VS. DEPLOYMENT: CONTEXT CHANGE
Types of contexts in ML:
o Data shift (covariate, prior probability, concept drift, …). Changes in p(X), p(Y), p(X|Y), p(Y|X), p(X,Y).
o Costs and utility functions. Cost matrices, loss functions, reject costs, attribute costs, error tolerance, …
o Uncertain, missing or noisy information. Noise or uncertainty degree, % of missing values, missing attribute set, …
o Representation change, constraints, background knowledge. Granularity level, complex aggregates, attribute set, etc.
o Task change. Regression cut-offs, bins, number of classes or clusters, quantification, …
COST AND DATA DISTRIBUTION CHANGES
Classification. Example: 100,000 instances.
o High imbalance (π0 = Pos/(Pos+Neg) = 0.005).

Confusion matrices (rows = predicted, columns = actual):

c1:           open    close
    OPEN       300      500
    CLOSE      200    99000

c2:           open    close
    OPEN         0        0
    CLOSE      500    99500

c3:           open    close
    OPEN       400     5400
    CLOSE      100    94100

ERROR (c1): 0.7%
Macro-average (c3): (80 + 94.6) / 2 = 87.3%
Which classifier is best? By error rate, c2 (which never predicts "open") would look best; by macro-average of sensitivity and specificity, c3 looks best.

Related metrics: specificity, sensitivity (recall), precision.
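As a worked check of these numbers, here is a small Python sketch (not from the slides) that recomputes the error rate and the macro-average of sensitivity and specificity for the three confusion matrices:

```python
# Recompute the slide's figures from the three confusion matrices.
import numpy as np

# rows = predicted (OPEN, CLOSE), columns = actual (open, close)
classifiers = {
    "c1": np.array([[300, 500], [200, 99000]]),
    "c2": np.array([[0, 0], [500, 99500]]),
    "c3": np.array([[400, 5400], [100, 94100]]),
}

for name, cm in classifiers.items():
    tp, fp = cm[0, 0], cm[0, 1]
    fn, tn = cm[1, 0], cm[1, 1]
    error = (fp + fn) / cm.sum()
    sensitivity = tp / (tp + fn)        # recall on the positive ("open") class
    specificity = tn / (tn + fp)
    macro_avg = (sensitivity + specificity) / 2
    print(f"{name}: error={error:.1%}  macro-avg={macro_avg:.1%}")
```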
COST AND DATA DISTRIBUTION CHANGES
Caveat: Not all errors are equal.
o Example: keeping a valve closed in a nuclear plant when it should be open can cause an explosion, while opening a valve when it should be closed only causes a stop.
o Cost matrix (rows = predicted, columns = actual):

              open     close
    OPEN        0€      100€
    CLOSE   2,000€        0€

TIP: The best classifier is not the most accurate, but the one with the lowest cost.
COST AND DATA DISTRIBUTION CHANGES
Classification. Example: 100,000 instances.
o High imbalance (π0 = Pos/(Pos+Neg) = 0.005).

Cost matrix (rows = predicted, columns = actual):

              open     close
    OPEN        0€      100€
    CLOSE   2,000€        0€

Confusion matrices (rows = predicted, columns = actual):

c1:           open    close
    OPEN       300      500
    CLOSE      200    99000

c2:           open    close
    OPEN         0        0
    CLOSE      500    99500

c3:           open    close
    OPEN       400     5400
    CLOSE      100    94100

Resulting cost matrices (error counts multiplied by their costs):

c1:           open         close
    OPEN        0€       50,000€
    CLOSE 400,000€            0€       TOTAL COST: 450,000€

c2:           open         close
    OPEN        0€            0€
    CLOSE 1,000,000€          0€       TOTAL COST: 1,000,000€

c3:           open         close
    OPEN        0€      540,000€
    CLOSE 200,000€            0€       TOTAL COST: 740,000€

TIP: For two classes, the value "slope" (together with each classifier's FNR and FPR) is sufficient to tell which classifier is best. This is the operating condition, context or skew.
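The comparison can be reproduced with a few lines of Python; the snippet below is only a sketch, and the iso-cost "slope" formula at the end is an assumption about the usual ROC-analysis definition, not something stated on the slides:

```python
# Total cost is the element-wise product of each confusion matrix with
# the cost matrix (numbers taken from the slides).
import numpy as np

cost = np.array([[0, 100],      # predicted OPEN:  actual open, actual close
                 [2000, 0]])    # predicted CLOSE: actual open, actual close

confusions = {
    "c1": np.array([[300, 500], [200, 99000]]),
    "c2": np.array([[0, 0], [500, 99500]]),
    "c3": np.array([[400, 5400], [100, 94100]]),
}

for name, cm in confusions.items():
    total = (cm * cost).sum()
    print(f"{name}: total cost = {total:,}€")   # 450,000 / 1,000,000 / 740,000

# One common way to summarise the operating condition for two classes is
# the slope of iso-cost lines in ROC space (assumed formula, not from the slides):
pos, neg = 500, 99500
slope = (neg * cost[0, 1]) / (pos * cost[1, 0])   # (Neg · c_FP) / (Pos · c_FN)
print("iso-cost slope:", slope)
```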
ROC ANALYSIS
The context or skew (the class distribution and the cost of each type of error) determines classifier goodness.
o Caveat: in many circumstances, until deployment time, we do not know the class distribution and/or it is difficult to estimate the cost matrix. E.g. a spam filter.
o But models are usually learned before that.
ROC ANALYSIS
The ROC Space
o Using the normalised terms of the confusion matrix: TPR, FNR, TNR, FPR.

[Plot: ROC space, with False Positive Rate on the x-axis and True Positive Rate on the y-axis, both from 0 to 1.]

Confusion matrix (rows = predicted, columns = actual) and its column-normalised version:

             open    close                       open    close
    OPEN      400    12000           OPEN         0.8    0.121
    CLOSE     100    87500           CLOSE        0.2    0.879

TPR = 400 / 500 = 80%
FNR = 100 / 500 = 20%
TNR = 87500 / 99500 = 87.9%
FPR = 12000 / 99500 = 12.1%
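A minimal sketch (not from the slides) of how these four rates are obtained by normalising each column of the confusion matrix:

```python
# Normalise each column by its class total and read off (FPR, TPR).
import numpy as np

# rows = predicted (OPEN, CLOSE), columns = actual (open, close)
cm = np.array([[400, 12000],
               [100, 87500]])

rates = cm / cm.sum(axis=0)     # divide each column by the number of actual open/close
tpr, fpr = rates[0, 0], rates[0, 1]
fnr, tnr = rates[1, 0], rates[1, 1]
print(f"TPR={tpr:.1%} FNR={fnr:.1%} TNR={tnr:.1%} FPR={fpr:.1%}")
# The classifier is the single point (FPR, TPR) ≈ (0.121, 0.8) in ROC space.
```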
ROC ANALYSIS
[Plot: ROC diagram, FPR on the x-axis and TPR on the y-axis, both from 0 to 1.]

o Given two classifiers:
  We can construct any "intermediate" classifier just by randomly weighting both classifiers (giving more or less weight to one or the other).
  This creates a "continuum" of classifiers between any two classifiers.
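A small sketch of this interpolation idea, assuming we describe each classifier only by its (FPR, TPR) point (the points below are illustrative):

```python
# Delegate each prediction to classifier A with probability w and to B
# otherwise; the expected (FPR, TPR) lies on the segment between A and B.
import numpy as np

def mix_point(point_a, point_b, w):
    """Expected (FPR, TPR) of the randomised mixture."""
    return tuple(w * a + (1 - w) * b for a, b in zip(point_a, point_b))

A = (0.1, 0.6)   # (FPR, TPR) of classifier A
B = (0.4, 0.9)   # (FPR, TPR) of classifier B
for w in np.linspace(0, 1, 5):
    print(w, mix_point(A, B, w))
```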
ROC ANALYSIS
The ROC “Curve”: Construction
[Plot: ROC diagram, FPR on the x-axis and TPR on the y-axis, showing several classifiers, the diagonal and their convex hull.]

o Given several classifiers:
  We construct the convex hull of their points (FPR, TPR), together with the two trivial classifiers (0,0) and (1,1).
  The classifiers below the ROC curve are discarded.
  The best classifier (from those remaining) will be selected at application time…
o The diagonal shows the worst situation possible.

TIP: We can discard the classifiers below the convex hull because there is no context (combination of class distribution / cost matrix) for which they could be optimal.
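A sketch of the convex-hull construction using SciPy; the classifier points are illustrative, not taken from the slides:

```python
# Build the ROC convex hull over a set of crisp classifiers and keep only
# the points on its upper boundary (at or above the diagonal).
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([
    [0.0, 0.0], [1.0, 1.0],          # trivial classifiers
    [0.1, 0.6], [0.3, 0.7], [0.4, 0.9], [0.6, 0.4], [0.7, 0.8],
])
hull = ConvexHull(points)
hull_points = {tuple(points[i]) for i in hull.vertices}

# Lower-hull vertices lie below the diagonal (TPR < FPR) and are dropped.
kept = sorted(p for p in hull_points if p[1] >= p[0])
print("candidates kept for deployment:", kept)
```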
ROC ANALYSIS
In the application context, we choose the optimal classifier from those kept. Example 1:
[Plot: the optimal classifier on the ROC convex hull for a given operating condition.]
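A sketch of this selection step, reusing the valve example's class distribution and costs and assuming the candidates are described by their (FPR, TPR) points:

```python
# For a given operating condition, pick the kept classifier with the
# lowest expected cost per instance.
def expected_cost(fpr, tpr, pos=0.005, neg=0.995, c_fn=2000, c_fp=100):
    # cost of missed positives + cost of false alarms
    return pos * (1 - tpr) * c_fn + neg * fpr * c_fp

# (FPR, TPR) of the slide's three classifiers plus the trivial ones
candidates = {
    "c1": (500 / 99500, 300 / 500),
    "c2": (0.0, 0.0),
    "c3": (5400 / 99500, 400 / 500),
    "all-open": (1.0, 1.0),
}
best = min(candidates, key=lambda name: expected_cost(*candidates[name]))
print("optimal classifier for this context:", best)   # c1, at ~4.5€ per instance
```

This agrees with the earlier totals: 4.5€ per instance over 100,000 instances gives c1's 450,000€.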
ROC ANALYSIS
Crisp and Soft Classifiers:
o A "hard" or "crisp" classifier predicts a class from a set of possible classes.
  Caveat: crisp classifiers are not versatile when the context changes.
o A "soft" or "scoring" (probabilistic) classifier predicts a class, but accompanies each prediction with an estimate of its reliability (confidence).
  Most learning methods can be adapted to generate soft classifiers.
o A soft classifier can be converted into a crisp classifier using a threshold.
  Example: "if score > 0.7 then class A, otherwise class B".
  With different thresholds, we have different classifiers, giving more or less relevance to each of the classes.

TIP: Soft or scoring classifiers can be reframed for each context.
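A minimal sketch of how one soft classifier yields a family of crisp classifiers as the threshold moves (the scores are illustrative):

```python
# Convert one soft classifier into several crisp ones by thresholding.
import numpy as np

scores = np.array([0.95, 0.8, 0.7, 0.55, 0.4, 0.3, 0.1])   # estimated P(class A)

for threshold in (0.3, 0.5, 0.7):
    predictions = np.where(scores > threshold, "A", "B")
    print(f"threshold {threshold}: {list(predictions)}")
```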
ROC ANALYSIS
ROC Curve of a Soft Classifier:
o We can consider each threshold as a different classifier and draw them in the ROC space. This generates a curve…
o We have a "curve" for just one soft classifier.

[Table and plot: the actual class of each example and the predicted class at successive thresholds; each threshold yields one (FPR, TPR) point, tracing the ROC curve.]
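A sketch of obtaining the curve from the scores of a single soft classifier with scikit-learn (labels and scores are illustrative):

```python
# Draw the ROC curve of one soft classifier from its scores.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:g}: (FPR, TPR) = ({f:.2f}, {t:.2f})")
```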
METRICS FOR A RANGE OF CONTEXTS
What if we want to select just one soft classifier?
o The classifier with the greatest Area Under the ROC Curve (AUC) is chosen.

TIP: AUC does not consider calibration. If calibration is important, use other metrics, such as the Brier score.
TIP: AUC is useful, but it is always better to draw the curves and choose depending on the operating condition.
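A small sketch of the first tip: two scorers that rank the examples identically have the same AUC, but the Brier score tells them apart when one of them is badly calibrated (data are illustrative):

```python
# Same ranking => same AUC; different calibration => different Brier score.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
calibrated    = np.array([0.90, 0.80, 0.70, 0.30, 0.25, 0.20, 0.15, 0.10])
overconfident = np.array([0.99, 0.98, 0.97, 0.90, 0.85, 0.80, 0.75, 0.70])  # same order

print("AUC  :", roc_auc_score(y_true, calibrated), roc_auc_score(y_true, overconfident))
print("Brier:", brier_score_loss(y_true, calibrated), brier_score_loss(y_true, overconfident))
```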
BEYOND BINARY CLASSIFICATION
Cost-sensitive evaluation is perfectly extensible to classification with more than two classes.
For regression, we only need a cost function.
o For instance, an asymmetric absolute error.

Three-class example (rows = predicted, columns = actual):

ERROR (counts):
                low   medium   high
predicted low    20        0     13
       medium     5       15      4
         high     4        7     60

COST:
                low    medium    high
predicted low    0€        5€      2€
       medium  200€   -2,000€     10€
         high   10€        1€    -15€

Total cost: -29,787€
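A quick check of the total as the element-wise product of the count and cost matrices (numbers from the slide):

```python
# Multiclass total cost: element-wise product of error counts and costs.
# Negative costs are benefits for correct (or harmless) decisions.
import numpy as np

# rows = predicted (low, medium, high), columns = actual (low, medium, high)
counts = np.array([[20, 0, 13],
                   [5, 15, 4],
                   [4, 7, 60]])
cost = np.array([[0, 5, 2],
                 [200, -2000, 10],
                 [10, 1, -15]])

print("total cost:", (counts * cost).sum(), "€")   # -29787
```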
BEYOND BINARY CLASSIFICATION
ROC analysis for multiclass problems is troublesome.
o Given n classes, there is an n(n−1)-dimensional space.
o Calculating the convex hull is impractical.
The AUC measure has been extended:
o All-pair extension (Hand & Till 2001):

$\mathrm{AUC}_{HT} = \frac{1}{c(c-1)} \sum_{i=1}^{c} \sum_{j=1,\, j \neq i}^{c} \mathrm{AUC}(i, j)$

o There are other extensions.
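A sketch of the all-pair extension following the formula above, computing one AUC per ordered pair of classes and averaging; the data are illustrative, and scoring each pair with the class-i probability column is one common choice rather than the only one:

```python
# Average the pairwise AUCs over all ordered pairs of distinct classes.
import numpy as np
from itertools import permutations
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
# scores[k, i] = estimated probability that example k belongs to class i
scores = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.6, 0.2],
                   [0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6],
                   [0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5],
                   [0.4, 0.4, 0.2]])

classes = np.unique(y_true)
c = len(classes)
pair_aucs = []
for i, j in permutations(classes, 2):
    mask = np.isin(y_true, [i, j])            # keep only examples of classes i and j
    auc_ij = roc_auc_score(y_true[mask] == i,  # class i as the "positive" class
                           scores[mask, i])    # scored by the class-i column
    pair_aucs.append(auc_ij)

auc_ht = sum(pair_aucs) / (c * (c - 1))
print("all-pair AUC:", auc_ht)
```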
BEYOND BINARY CLASSIFICATION
ROC analysis for regression (using shifts).
o The operating condition is the asymmetry factor α. For instance, α = 2/3 means that underpredictions are twice as expensive as overpredictions.
o The area over the curve (AOC) is the error variance. If the model is unbiased, then it is ½ MSE.
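A sketch of an asymmetric absolute error; the exact weighting below is an assumption consistent with the slide's reading of α (with α = 2/3, an underprediction costs twice as much as an overprediction of the same size):

```python
# Asymmetric absolute error with asymmetry factor alpha (assumed weighting).
import numpy as np

def asymmetric_absolute_error(y_true, y_pred, alpha=2/3):
    residual = y_true - y_pred                      # positive => underprediction
    weights = np.where(residual > 0, alpha, 1 - alpha)
    return np.mean(weights * np.abs(residual))

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([9.0, 13.0, 9.5, 14.0])
print(asymmetric_absolute_error(y_true, y_pred))             # alpha = 2/3
print(asymmetric_absolute_error(y_true, y_pred, alpha=0.5))  # symmetric case
```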
LESSONS LEARNT
Model evaluation goes far beyond a split or cross-validation plus a metric (accuracy or MSE).
Models can be generated once but then applied to different contexts / operating conditions.
Drawing models for different operating conditions allows us to determine dominance regions and the optimal threshold to make optimal decisions.
Soft (scoring) models are much more powerful than crisp models. ROC analysis really makes sense for soft models.
Areas under/over the curves are an aggregate of the performance over a range of operating conditions, but should not replace ROC analysis.
LESSONS LEARNT
We have just seen examples of one family of context change: changes in costs and output distribution.
Similar approaches exist for other types of context changes:
o Uncertain, missing or noisy information.
o Representation change, constraints, background knowledge.
o Task change.