03a_Regression in Machine learning For BDA.pdf

About This Presentation

Regression is a supervised machine learning approach for predicting numerical target variables.


Slide Content

Statistical Learning for Big Data
Chapter 3.1: Supervised Learning — Regression

Loss Functions for Regression
A. Marx, P. Bürkner © winter term 24/25

REGRESSION LOSSES - L2 / SQUARED ERROR
L(y, f(x)) = 1/2 (y − f(x))^2
Convex, differentiable (gradient poses no problem in loss minimization)
Derivative is proportional to the residual: ∂[1/2 (y − f(x))^2] / ∂f(x) = f(x) − y = −ε
Connection to the Gaussian distribution
Tries to reduce large residuals: if a residual doubles, the loss becomes 4 times as large, hence outliers in y can become a problem
[Figure: "Data & Model" (y vs. x) and "Loss" (L(f(x), y) vs. the residual y − f(x)); two residuals of 1.15 and 6.65 are highlighted]

REGRESSION LOSSES - L2 / SQUARED ERROR
What is the optimal constant prediction c, i.e., the same ŷ for all x?
L(y, f(x)) = 1/2 (y − f(x))^2 = 1/2 (y − c)^2
We search for the c that minimizes the empirical risk:
ĉ = arg min_{c ∈ R} R_emp(c) = arg min_{c ∈ R} (1/n) Σ_{i=1}^n 1/2 (y^(i) − c)^2
We set the derivative of the empirical risk to zero and solve for c:
∂R_emp(c) / ∂c = −(1/n) Σ_{i=1}^n (y^(i) − c) = 0  ⇔  c = (1/n) Σ_{i=1}^n y^(i)
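The result above (under the L2 loss, the optimal constant prediction is the sample mean) can be checked numerically. A minimal sketch, assuming NumPy; the sample values and the grid search are illustrative, not part of the slides.

```python
import numpy as np

# Small synthetic sample (illustrative values, not from the slides).
y = np.array([1.0, 2.5, 3.0, 4.5, 6.0])

def emp_risk_l2(c, y):
    """Empirical risk of the constant prediction c under the L2 loss."""
    return np.mean(0.5 * (y - c) ** 2)

# Evaluate the risk on a fine grid of candidate constants.
grid = np.linspace(y.min(), y.max(), 10_001)
risks = np.array([emp_risk_l2(c, y) for c in grid])
c_hat = grid[np.argmin(risks)]

# The minimizer coincides with the sample mean (up to grid resolution).
print(c_hat, y.mean())
```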

REGRESSION LOSSES - L1 / ABSOLUTE ERROR
L(y, f(x)) = |y − f(x)|
Convex but not differentiable at 0, i.e., when y = f(x)
Optimization becomes harder
f̂(x) = median of y | x
Connection to the Laplace distribution
More robust to outliers in y: large residuals are less influential than for L2
[Figure: "Data & Model" (y vs. x) and "Loss" (L(f(x), y) vs. the residual y − f(x)); two residuals of 1.07 and 2.58 are highlighted]

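The robustness claim can be illustrated directly: the L2-optimal constant (mean) reacts strongly to a single outlier, while the L1-optimal constant (median) barely moves. A minimal sketch, assuming NumPy; the sample and the outlier value are illustrative.

```python
import numpy as np

# Synthetic sample with one large outlier appended (values are illustrative).
y = np.array([1.0, 2.0, 2.5, 3.0, 3.5])
y_outlier = np.append(y, 50.0)

# Optimal constant prediction: mean under L2 loss, median under L1 loss.
print("L2 optimum (mean):  ", y.mean(), "->", y_outlier.mean())
print("L1 optimum (median):", np.median(y), "->", np.median(y_outlier))
# The mean is pulled strongly towards the outlier; the median barely changes.
```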

REGRESSION LOSSES - HUBER LOSS
Huber loss:
L_δ(y, f(x)) = 1/2 (y − f(x))^2            if |y − f(x)| ≤ δ
               δ |y − f(x)| − 1/2 δ^2      otherwise
Piecewise combination of the L1 and L2 losses
Combines advantages of the L1 and L2 losses: convex, differentiable, robust to outliers
[Figure: Huber loss L(y − f(x)) plotted against the residual y − f(x) for residuals in [−2, 2]]
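The piecewise definition translates directly into code. A minimal sketch of the Huber loss as defined above, assuming NumPy; the function name and the test residuals are illustrative.

```python
import numpy as np

def huber_loss(y, f_x, delta=1.0):
    """Huber loss L_delta(y, f(x)): quadratic for small residuals, linear beyond delta."""
    r = np.abs(y - f_x)
    quadratic = 0.5 * r ** 2
    linear = delta * r - 0.5 * delta ** 2
    return np.where(r <= delta, quadratic, linear)

# Small residuals are penalized like L2, large residuals only linearly (like L1).
residuals = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(huber_loss(residuals, 0.0, delta=1.0))
```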

Linear Regression Models

LINEAR REGRESSION: HYPOTHESIS SPACE
We want to predict a numerical target variable by a linear transformation of the features x ∈ R^p.
So with θ ∈ R^p this mapping can be written as:
y = f(x) = θ_0 + θ^T x = θ_0 + θ_1 x_1 + ··· + θ_p x_p
This defines the hypothesis space H as the set of all linear functions in θ:
H = {θ_0 + θ^T x | (θ_0, θ) ∈ R^(p+1)}

LINEAR REGRESSION: HYPOTHESIS SPACE
[Figure: a line y = θ_0 + θ_1 · x with intercept θ_0 = 1 and slope θ_1 = 0.5; increasing x by one unit increases y by the slope θ_1]

LINEAR REGRESSION: HYPOTHESIS SPACE
We assume from now on that θ_0 is included in θ.
Given observed labeled data D, how can we find θ?
This is what we define as parameter estimation. Recall:
Learning = Hypothesis Space + Risk + Optimization
The learner performs this by empirical risk minimization over the hypothesis space H.

LINEAR REGRESSION: RISK
We could measure training error as the sum of squared prediction errors (SSE). This is the risk that corresponds to the L2 loss:
R_emp(θ) = SSE(θ) = Σ_{i=1}^n L(y^(i), f(x^(i))) = Σ_{i=1}^n (y^(i) − θ^T x^(i))^2
[Figure: data points (y vs. x) with vertical residual segments to a fitted regression line]
Minimizing the squared error is computationally much simpler than minimizing the absolute differences (L1 loss).

LINEAR MODEL: OPTIMIZATION
We want to find the parameters θ of the linear model, i.e., an element of the hypothesis space H that fits the data optimally.
So we evaluate different candidates for θ.
A first (random) try, θ = (1.8, 0.3), yields a rather large SSE of 16.85 (Evaluation).
Another try, θ = (1, 0.1), yields an even larger SSE of 24.3 (Evaluation). Therefore, this one is worse in terms of empirical risk. So let us try again...
A further try, θ = (0.5, 0.8), yields a reduced SSE of 10.61 (Evaluation). This is currently the best candidate. Can we find an even better one?
[Figure: the data with each candidate line and its SSE]

LINEAR MODEL: OPTIMIZATION
Since every θ results in a specific value of R_emp(θ), and we try to find arg min_θ R_emp(θ), let's look at what we have so far:
[Figure: the three candidate fits, θ = (1.8, 0.3) with SSE 16.85, θ = (1, 0.1) with SSE 24.3, and θ = (0.5, 0.8) with SSE 10.61, next to the SSE surface over intercept and slope]

LINEAR MODEL: OPTIMIZATION
Instead of guessing, we use optimization to find the best θ:
the optimizer arrives at θ = (−1.7, 1.3) with SSE 5.88, better than all candidates tried so far.
[Figure: the four candidate fits and the SSE surface over intercept and slope]
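The candidate-evaluation loop sketched on these slides is easy to write down: compute R_emp(θ) = SSE(θ) for each candidate and keep the best one. A minimal sketch on synthetic data, assuming NumPy; because the data are synthetic, the SSE values will not match the numbers shown on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 8, size=30)
y = -1.5 + 1.2 * x + rng.normal(scale=0.8, size=30)  # synthetic data, illustrative only

def sse(theta, x, y):
    """Sum of squared errors of the line y = theta[0] + theta[1] * x."""
    return np.sum((y - (theta[0] + theta[1] * x)) ** 2)

# Candidate parameter vectors taken from the slides.
candidates = [(1.8, 0.3), (1.0, 0.1), (0.5, 0.8), (-1.7, 1.3)]
for theta in candidates:
    print(theta, sse(theta, x, y))

best = min(candidates, key=lambda t: sse(t, x, y))
print("best candidate:", best)
```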

LINEAR MODEL: OPTIMIZATION
For L2 regression, we can find this optimal value analytically:
θ̂ = arg min_θ R_emp(θ) = arg min_θ Σ_{i=1}^n (y^(i) − θ^T x^(i))^2 = arg min_θ ‖y − Xθ‖_2^2
where
X = ( 1  x_1^(1)  ...  x_p^(1)
      1  x_1^(2)  ...  x_p^(2)
      ⋮     ⋮      ⋱     ⋮
      1  x_1^(n)  ...  x_p^(n) )
is the n × (p+1) design matrix.
This yields the so-called normal equations for the LM:
∂/∂θ R_emp(θ) = 0  ⇒  θ̂ = (X^T X)^(−1) X^T y
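A minimal sketch of the analytic L2 solution, assuming NumPy and synthetic data; in practice a numerically stable solver such as np.linalg.lstsq is preferable to inverting X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
y = 3.0 + X_raw @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n)  # synthetic data

# Design matrix with a leading column of ones for the intercept theta_0.
X = np.column_stack([np.ones(n), X_raw])

# Normal equations: theta_hat = (X^T X)^{-1} X^T y (solved as a linear system).
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable alternative; same minimizer.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)
print(theta_lstsq)
```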

LINEAR MODEL: OPTIMIZATION (RECAP)
The empirical risk in linear regression using the L2 loss can be denoted by
R_emp(θ) = Σ_{i=1}^n (y^(i) − θ^T x^(i))^2 = ‖y − Xθ‖_2^2 = (y − Xθ)^T (y − Xθ) = y^T y − 2 y^T Xθ + θ^T X^T Xθ
The gradient (w.r.t. θ) of this expression is
−2 X^T y + 2 X^T Xθ
Setting the gradient to zero leads to the equation
X^T y = X^T Xθ
and the well-known analytic solution:
θ̂ = (X^T X)^(−1) X^T y

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS
We could also minimize the L1 loss. This changes the risk and optimization steps:
R_emp(θ) = Σ_{i=1}^n L(y^(i), f(x^(i))) = Σ_{i=1}^n |y^(i) − θ^T x^(i)|   (Risk)
[Figure: "L1 Loss Surface" (sum of absolute errors) and "L2 Loss Surface" (SSE) over intercept and slope]
The L1 loss is harder to optimize, but the model is less sensitive to outliers.
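There is no closed form for the L1 fit, so it is typically computed numerically. A minimal sketch, assuming NumPy and SciPy are available; the data, the injected outlier, and the starting value are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=40)
y[-1] += 60.0  # inject one outlier

X = np.column_stack([np.ones_like(x), x])

# L2 fit: closed-form least squares.
theta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

# L1 fit: minimize the sum of absolute errors numerically.
def sum_abs_errors(theta):
    return np.sum(np.abs(y - X @ theta))

theta_l1 = minimize(sum_abs_errors, x0=theta_l2, method="Nelder-Mead").x

print("L2 fit (pulled by the outlier):", theta_l2)
print("L1 fit (more robust):          ", theta_l1)
```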

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS
The line fitted with L2 is similar to L1:
[Figure: "L1 vs L2 Without Outlier" — y vs. x1 with the L1 and L2 fits]

EXAMPLE: REGRESSION WITH L1 VS L2 LOSS
The line fitted with L2 is pulled towards an outlier (highlighted red):
[Figure: "L1 vs L2 With Outlier" — y vs. x1 with the L1 and L2 fits]

LEARNING WITH LINEAR REGRESSION
Learning = Hypothesis Space + Risk + Optimization
Hypothesis Space: linear functions x^T θ of the features x ∈ X.
Risk: any regression loss function (L2, L1, Huber, etc.).
Optimization: direct analytical solution for the L2 loss, numerical optimization for L1 and others.

Polynomial Regression Models

REGRESSION: POLYNOMIALS
We can make linear regression models much more flexible by using polynomials x_j^d or any other derived features such as transformations sin(x_j) or interactions (x_j · x_k) as additional features.
The optimization and risk of the learner remain the same.
Only the hypothesis space of the learner changes:
Instead of linear functions of the original features,
f(x^(i)) = θ_0 + θ_1 x_1^(i) + θ_2 x_2^(i) + ...
it now includes linear functions of the derived features as well, e.g.
f(x^(i)) = θ_0 + Σ_{k=1}^d θ_1k (x_1^(i))^k + Σ_{k=1}^d θ_2k (x_2^(i))^k + ...
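Because the derived polynomial features keep the model linear in θ, the same least-squares machinery applies unchanged. A minimal sketch for a single feature x, assuming NumPy; the data, the noise level, and the chosen degrees are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.5, 2.0, size=50)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=50)  # synthetic nonlinear data

def poly_design(x, d):
    """Design matrix with the derived features 1, x, x^2, ..., x^d as columns."""
    return np.column_stack([x ** k for k in range(d + 1)])

for d in (1, 5, 25):
    X = poly_design(x, d)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # same L2 optimization as before
    sse = np.sum((y - X @ theta) ** 2)
    print(f"d = {d:2d}: training SSE = {sse:.4f}")  # training SSE shrinks as d grows (overfitting risk)
```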

REGRESSION: POLYNOMIALS
Polynomial regression example
Models of different complexity, i.e., of different polynomial order d, are fitted to the data:
[Figure: data (y vs. x) with the fitted curves f(x) for d = 1 (linear), d = 5, and d = 25]

REGRESSION: POLYNOMIALS
As the degree d grows, the learner has more capacity to learn complex functions of x, but it also increases the danger of overfitting:
The model space H contains sufficiently complex functions so as to approximate the training data arbitrarily well (i.e., the best model learns the data by heart).
However, predictions on new data are not successful because our model has also learnt the random noise that is present in the training data (more on this in Week 7).

REGRESSION: POLYNOMIALS
[Figure: the polynomial fits shown on the training data and on separate test data (y vs. x)]

k-Nearest Neighbors Regression

NEAREST NEIGHBORS: INTUITION
Assume we are given locations on a map and we want to know to which country each location belongs. Suppose that
We do not know the borders of the countries
We know the locations of cities in the different countries
We know which city belongs to which country
Nearest neighbor rule: every location belongs to the same country as the closest (known) city.
k-nearest neighbors rule: vote over the k closest cities (smoother).

K-NEAREST-NEIGHBORS
k-NN generates predictions ŷ for a given x by comparing to the set of the k observations that are closest to x in the training data.
"Closeness" requires a distance or similarity measure.
We denote the distance of x to its k-th closest neighbour as δ_k(x).
The set containing the k closest points x^(i) to x in the training data is called the k-neighborhood N_k(x) of x. That is, with distance d: X × X → R,
N_k(x) := {x^(i) | d(x, x^(i)) ≤ δ_k(x), x ≠ x^(i)}.
[Figure: k-neighborhoods of a query point in the (x1, x2) plane for k = 15, k = 7, and k = 3]

DISTANCE MEASURES
How to calculate distances?
For numerical features (most popular): the Euclidean distance.
Imagine two data points x = (x_1, ..., x_p) and x̃ = (x̃_1, ..., x̃_p) with p features, so x, x̃ ∈ R^p; their Euclidean distance is defined as:
d_Euclidean(x, x̃) = sqrt( Σ_{j=1}^p (x_j − x̃_j)^2 )

DISTANCE MEASURES
Example:
Three data points with two metric features each:
a = (1, 3), b = (4, 5) and c = (7, 8)
Which point is the 1-nearest neighbor of b in terms of the Euclidean distance?
d(b, a) = sqrt((4 − 1)^2 + (5 − 3)^2) = 3.61
d(b, c) = sqrt((4 − 7)^2 + (5 − 8)^2) = 4.24
⇒ a is the 1-nearest neighbor of b.
Alternative distance measures are:
Manhattan distance: d_Manhattan(x, x̃) = Σ_{j=1}^p |x_j − x̃_j|
Mahalanobis distance (introduces covariances in X)
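The example above can be reproduced in a few lines. A minimal sketch assuming NumPy; the helper function names are illustrative.

```python
import numpy as np

def euclidean(x, x_tilde):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((x - x_tilde) ** 2))

def manhattan(x, x_tilde):
    """Manhattan distance between two feature vectors."""
    return np.sum(np.abs(x - x_tilde))

a, b, c = np.array([1, 3]), np.array([4, 5]), np.array([7, 8])

print(round(euclidean(b, a), 2))  # 3.61 -> a is the 1-nearest neighbor of b
print(round(euclidean(b, c), 2))  # 4.24
print(manhattan(b, a), manhattan(b, c))  # 5 and 6 under the Manhattan distance
```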

DISTANCE MEASURES
Comparison between the Euclidean and Manhattan distance measures:
[Figure: two points x and x̃ in a two-dimensional plane with the Manhattan and Euclidean paths between them;
d_Manhattan(x, x̃) = |5 − 1| + |4 − 1| = 7 and d_Euclidean(x, x̃) = sqrt((5 − 1)^2 + (4 − 1)^2) = 5]

DISTANCE MEASURES VS. METRIC
Many distance measures d: X × X → R, e.g., the Euclidean distance, also fulfill the properties required for a metric or distance function.
That is, for x, y, z ∈ X:
d(x, x) = 0
If x ≠ y, then d(x, y) > 0 (positivity)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Note that not all distances used for k-NN fulfill these properties. For example, the cosine similarity does not fulfill the triangle inequality.

DISTANCE MEASURES (WEIGHTS)
Weights can be used to address two problems in distance calculation:
Standardization: two features may have values on a different scale. Many distance formulas would place a higher importance on a feature with higher values, leading to an imbalance. Assigning a higher weight to the lower-valued feature can counteract this effect.
Importance: sometimes one feature has a higher importance (e.g., a more recent measurement). Assigning weights according to the importance of the feature can align the distance measure with known feature importance scores.
For example:
d_weighted-Euclidean(x, x̃) = sqrt( Σ_{j=1}^p w_j (x_j − x̃_j)^2 )
(a special case of the Mahalanobis distance)

K-NN REGRESSION
Predictions for regression (non-weighted and weighted):
ŷ = (1 / |N_k(x)|) Σ_{i: x^(i) ∈ N_k(x)} y^(i)
ŷ = (1 / Σ_{i: x^(i) ∈ N_k(x)} w_i) Σ_{i: x^(i) ∈ N_k(x)} w_i y^(i)
with neighbors weighted according to their distance to x: w_i = 1 / d(x^(i), x)
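A minimal sketch of both prediction rules, assuming NumPy, the Euclidean distance, and synthetic training data; the function name and the query point are illustrative.

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, k=3, weighted=False):
    """Predict y at x_query by averaging over the k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distances
    idx = np.argsort(dists)[:k]                                # indices of the k-neighborhood
    if not weighted:
        return np.mean(y_train[idx])
    w = 1.0 / dists[idx]                                       # w_i = 1 / d(x^(i), x)
    return np.sum(w * y_train[idx]) / np.sum(w)

rng = np.random.default_rng(4)
X_train = rng.uniform(-2, 2, size=(100, 2))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(scale=0.1, size=100)  # synthetic data

x_query = np.array([0.3, -0.8])
print(knn_regress(x_query, X_train, y_train, k=7))
print(knn_regress(x_query, X_train, y_train, k=7, weighted=True))
```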

K-NN SUMMARY
Hypothesis Space: step functions over a partitioning of X.
Hyperparameters: distance measure d(·, ·) on X; neighborhood size k.
Risk: use any loss function for regression (or classification).
Optimization: not applicable/necessary.
But: clever data structures & query algorithms avoid computing all n distances for generating predictions (e.g., K-D trees). Here c > 0 is an arbitrary but fixed constant.
Query efficient: Õ(n + n^(1 + 1/c^2)) space, Õ(n^(1/c^2)) time
Space efficient: Õ(n^(1/c) log^3(n)) space, Õ(n^(1 + 1/c) log^3(n)) time
NB: time ↔ space trade-off

K-NN SUMMARY
k-NN has no optimization step and is a very local model.
We cannot simply use the least-squares loss on the training data for picking k, because we would always pick k = 1.
k-NN makes no assumptions about the underlying data distribution.
Accuracy of k-NN can be severely degraded by the presence of noisy or irrelevant features, or when the feature scales are not consistent with their importance.
k-NN can be used for regression and classification.

SUMMARY
Today we studied regression, a fundamental approach in supervised
learning, where we looked at
Linear regression
Loss functions for regression, e. g., L1-, L2-, and Huber loss
Fundamental optimization of linear regression under the L2-loss
Polynomial regression
k-NN regression
Next, we will study the basics of classification, another supervised
learning approach.

Appendix

GOWER DISTANCE
Categorical variables, missing data and mixed spaces:
The Gower distance d_gower(x, x̃) is a weighted mean of the d_gower(x_j, x̃_j):
d_gower(x, x̃) = ( Σ_{j=1}^p δ_{x_j, x̃_j} · d_gower(x_j, x̃_j) ) / ( Σ_{j=1}^p δ_{x_j, x̃_j} )
δ_{x_j, x̃_j} is 0 or 1. It becomes 0 when the j-th variable is missing in at least one of the observations (x or x̃), or when the variable is asymmetric binary (where "1" is more important/distinctive than "0", e.g., "1" means "color-blind") and both values are zero. Otherwise it is 1.
d_gower(x_j, x̃_j), the j-th variable's contribution to the total distance, is a distance between the values of x_j and x̃_j. For nominal variables the distance is 0 if both values are equal and 1 otherwise. The contribution of other variables is the absolute difference of both values, divided by the total range of that variable.

GOWER DISTANCE
Example of the Gower distance with data on sex and income:
[Table: three observations (index 1, 2, 3) with features sex and income; the incomes are 2340, 2100, and 2680, so the income range is 2680 − 2100 = 580; observations 1 and 2 have different sex values, while the sex of observation 3 is missing]
d_gower(x, x̃) = ( Σ_{j=1}^p δ_{x_j, x̃_j} · d_gower(x_j, x̃_j) ) / ( Σ_{j=1}^p δ_{x_j, x̃_j} )
d_gower(x^(1), x^(2)) = (1 · 1 + 1 · |2340 − 2100| / |2680 − 2100|) / (1 + 1) = (1 + 240/580) / 2 = (1 + 0.414) / 2 = 0.707
d_gower(x^(1), x^(3)) = (0 · 1 + 1 · |2340 − 2680| / |2680 − 2100|) / (0 + 1) = (0 + 340/580) / 1 = 0.586
d_gower(x^(2), x^(3)) = (0 · 1 + 1 · |2100 − 2680| / |2680 − 2100|) / (0 + 1) = (0 + 580/580) / 1 = 1.000
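The three pairwise distances above can be reproduced with a small sketch. Assumptions not explicit on the slide: sex is a nominal feature whose values differ between observations 1 and 2 and is missing for observation 3, the concrete sex labels are placeholders, and the income range is 2680 − 2100 = 580.

```python
# Observations as (sex, income); the sex of observation 3 is assumed missing (None).
obs = [("m", 2340), ("f", 2100), (None, 2680)]   # sex labels "m"/"f" are placeholders
income_range = 2680 - 2100                        # range of the income variable

def gower(a, b):
    num, den = 0.0, 0.0
    # Nominal variable "sex": delta = 0 if the value is missing in either observation,
    # otherwise the contribution is 0 for equal values and 1 for different values.
    if a[0] is not None and b[0] is not None:
        num += 0.0 if a[0] == b[0] else 1.0
        den += 1.0
    # Numeric variable "income": absolute difference divided by its range.
    num += abs(a[1] - b[1]) / income_range
    den += 1.0
    return num / den

print(round(gower(obs[0], obs[1]), 3))  # 0.707
print(round(gower(obs[0], obs[2]), 3))  # 0.586
print(round(gower(obs[1], obs[2]), 3))  # 1.0
```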