Data Science, Chapter 2: Statistical Data Analysis


About This Presentation

This presentation will be helpful for students who are studying the basics of statistical data analysis.


Slide Content

Chapter 2
Statistical Data Analysis

Introduction
●Data Science is an interdisciplinary field that requires a strong understanding of mathematics, statistical reasoning, and computer science.
●Statistics is the science of collecting, analyzing, and interpreting data.
●The data is usually numerical data in large quantities.
●Statistics serves as a foundation for dealing with data and its analysis in data science.
●It provides tools and methods to find structure in data and to give deeper insight into it.
●Data scientists use a combination of statistical formulae and computer algorithms to notice patterns and trends within data.

Steps for processing data
1. Identify the important features in the data
2. Find relationships between features
3. Convert the features into the required format
4. Normalize and scale the data
5. Identify the distribution and nature of the data
6. Perform adjustments in the data
7. Identify the right mathematical approach
8. Verify the results using different accuracy measurement scales

Roles of statistics in Data Science
Data Exploration
Data Cleaning
Data Transformation
Data Visualization
Finding Similarity/Dissimilarity
Model Selection and Evaluation
Hypothesis Testing
Statistical Modeling
Probability Distribution and Estimation

Types of Statistics
●Descriptive Statistics
○Measures of Frequency
○Measures of Central Tendency
○Measures of Dispersion
●Inferential Statistics
○Parameter Estimation
○Hypothesis Testing

Descriptive Statistics
●Provides ways for describing, presenting, summarizing, and organizing data.
●Descriptive statistics summarizes a large amount of data and presents it in a simple and understandable form.
●The summarization is done from a sample of the population using simple parameters such as the mean, median, and standard deviation.
●It helps us to organize, represent, and describe data using tables, graphs, and summary measures.

Types of Descriptive Statistics
●Measures of Frequency
●Measures of Central Tendency
○Mean
○Mode
○Median
●Measures of Dispersion
○Range
○Interquartile Range
○Standard Deviation

Measures of Frequency
●Frequency is a basic statistical quantity in data science.
●It is the number of times a value occurs in the data.
●In a dataset, it analyzes how often a particular data value occurs in a feature.
●The frequency distribution can be tabulated as a frequency chart.

Twenty students were asked how many hours they worked per day. Their responses, in hours, were (sorted):
2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 7

Data Value   Frequency
2            3
3            5
4            3
5            6
6            2
7            1
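As an illustration, the same frequency table can be computed in Python; this is a minimal sketch using the standard-library collections.Counter on the twenty responses above:

from collections import Counter

# Hours worked per day reported by the twenty students
hours = [2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 7]
freq = Counter(hours)  # tallies how many times each value occurs
for value, count in sorted(freq.items()):
    print(value, count)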

Measures of Central Tendency
●An important goal of statistical analysis is to find one value that describes the characteristics of the entire set of data.
●This single value, referred to as a measure of central tendency, describes a whole set of data with a single value that represents the center of its distribution.
●A measure of central tendency is also known as a summary statistic used to represent the center point of the data.

Mean
●The most common and effective numeric measure of the center of a set of data.
●It is the sum of all the observations divided by the sample size.
●The types of mean are:
Arithmetic Mean
Harmonic Mean
Geometric Mean

Arithmetic mean
●It is obtained by adding all the values and then dividing the sum by the total number of values.
●Let x1, x2, x3, …, xN be a set of N values or observations. The arithmetic mean of this set of values is:
x̄ = (x1 + x2 + … + xN) / N

●Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
●We can calculate the mean as
(8 + 3 + 7 + 6 + 9 + 10 + 5 + 7 + 8 + 5) / 10 = 6.8
●The arithmetic mean can also be calculated using the mean() function from the NumPy library.
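A minimal sketch of that calculation with NumPy (np.mean is the function the slide refers to; the list holds the ten quiz marks above):

import numpy as np

marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
print(np.mean(marks))  # 6.8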

Harmonic Mean
●The harmonic mean is used when we want the reciprocal of the average of the reciprocals of the terms in a series. The formula for the harmonic mean is
HM = n / (1/x1 + 1/x2 + 1/x3 + … + 1/xn)
●Example: x = (6, 3, 1, 5, 2)
●HM = 5 / (1/6 + 1/3 + 1/1 + 1/5 + 1/2) = 5 / 2.2 ≈ 2.27

Geometric Mean
●A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the product of their values:
GM = (x1 · x2 · … · xn)^(1/n)
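Both means can be computed with Python's standard statistics module; a small sketch using the example values above (statistics.geometric_mean requires Python 3.8 or later):

import statistics

x = [6, 3, 1, 5, 2]
print(statistics.harmonic_mean(x))   # 5 / 2.2, about 2.27
print(statistics.geometric_mean(x))  # (6*3*1*5*2) ** (1/5), about 2.83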

Median
●It is the middle value of the data.
●It is the value that separates the higher half of a data set from the lower half.
●It splits the data in half and is also called the 50th percentile.
●If the number of elements in the data set is odd, the middle element is the median.
●If the number of elements in the data set is even, the median is the average of the two central elements.
Advantages
Less affected by outliers and skewed data than the mean
Appropriate for skewed data

Mode
●It is the value that occurs most frequently in a dataset.
●It is possible for several different values to have the maximum frequency, which results in more than one mode.
●A dataset with one mode is called unimodal.
●A dataset with two modes is called bimodal.
●A dataset with three modes is called trimodal.

●Advantages
○Can be used for categorical values
○Can be determined for qualitative and quantitative values
○Not affected by extreme values
●Disadvantages
○Not based on all values
○The mode cannot be clearly defined in the case of a multimodal series
○Not applicable for further statistical analysis and algebraic calculation
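A small sketch with the standard statistics module; the two short lists are made-up examples of a unimodal and a bimodal dataset (statistics.multimode requires Python 3.8 or later):

import statistics

unimodal = [2, 3, 3, 3, 4, 5]
bimodal = [2, 2, 2, 3, 3, 3, 4]
print(statistics.mode(unimodal))      # 3, the single most frequent value
print(statistics.multimode(bimodal))  # [2, 3], both modes of a bimodal dataset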

Measures of Dispersion
●Dispersion is the extent to which values in a distribution differ from the average of the distribution.
●Measures of central tendency alone are not sufficient to describe the data.
●Measures of dispersion help us to know the degree of variability in the data and provide a better understanding of the data.
●Measures of dispersion assess the dispersion or spread of numeric data.
●The measures are:
○Range
○Quantiles
○Quartiles
○Percentiles
○Interquartile range

Range
●It is the simplest measure of dispersion. Let x1, x2, …, xn be a set of observations for some numeric attribute X.
●The range of the set is the difference between the largest (max) and the smallest (min) values:
●Range = max − min

Standard Deviation
●It is a measure of how much the data values deviate from the mean value:
σ = √( Σ(x − x̄)² / n )

Exercise: Find the SD for 4, 9, 11, 12, 17, 5, 8, 12, 14 (a sketch of the computation follows).
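A minimal sketch of the exercise with NumPy; note that np.std divides by n by default (the population SD, matching the formula above), while ddof=1 gives the sample SD:

import numpy as np

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]
print(np.std(data))          # population standard deviation, about 3.94
print(np.std(data, ddof=1))  # sample standard deviation, divides by n - 1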

Variance
●Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean.
●Variance = (Standard deviation)²

Interquartile Range
●The interquartile range is a measure of variation which describes how spread out the data is.
●The interquartile range is a measure of variability based on splitting data into quartiles.
●The interquartile range is the difference between the first and third quartiles (Q1 and Q3): IQR = Q3 − Q1.
●Quartiles divide the range of data into four equal parts, demarcated by the three quartiles Q1, Q2, and Q3.
●Consider the following data:
2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32
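A minimal sketch computing the quartiles and IQR for these values with NumPy; np.percentile uses linear interpolation by default, so hand methods based on other quartile conventions may give slightly different numbers:

import numpy as np

data = [2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32]
q1, q3 = np.percentile(data, [25, 75])
print(q1, q3)   # 5.5 and 26.5 with linear interpolation
print(q3 - q1)  # IQR = 21.0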

Python Statistical Functions (from Python's statistics module)

Method Description
statistics.harmonic_mean() Calculates the harmonic mean (central location) of the given data
statistics.mean() Calculates the mean (average) of the given data
statistics.median() Calculates the median (middle value) of the given data
statistics.median_grouped() Calculates the median of grouped continuous data
statistics.median_high() Calculates the high median of the given data
statistics.median_low() Calculates the low median of the given data
statistics.mode() Calculates the mode (central tendency) of the given numeric or nominal data
statistics.pstdev() Calculates the standard deviation from an entire population
statistics.stdev() Calculates the standard deviation from a sample of data
statistics.pvariance() Calculates the variance of an entire population
statistics.variance() Calculates the variance from a sample of data
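A short sketch exercising a few of these functions on made-up data (all are real statistics-module functions):

import statistics

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]
print(statistics.mean(data))      # arithmetic mean
print(statistics.median(data))    # middle value
print(statistics.mode(data))      # most frequent value (12)
print(statistics.pstdev(data))    # population standard deviation
print(statistics.stdev(data))     # sample standard deviation
print(statistics.variance(data))  # sample variance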

Inferential Statistics
•Inferential statistics draws inferences and predictions about a population based on data chosen from the population in question.
•The sample is considered a representative of the entire universe or population.
•Statistical inference mainly deals with two kinds of problems:
Hypothesis testing
Estimation of parameter values

Hypothesis Testing
●One of the most widely used techniques in data analysis is checking whether a stated hypothesis is accepted or rejected; the process is called hypothesis testing.
●Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population.
●There are two hypotheses:
○Null Hypothesis
○Alternative Hypothesis
●The null hypothesis states that there is no difference between groups or no relationship between variables (for example, that private coaching has no effect on students' performance).
●The alternative hypothesis states that there is a relationship between the two variables being studied (one variable has an effect on the other).

Steps for Hypothesis Testing
Four basic steps are followed for hypothesis testing:
Step 1: State the null and alternative hypotheses.
Step 2: Select the appropriate significance level and check the specified test assumptions.
Step 3: Analyze the data by computing the appropriate statistical tests.
Step 4: Interpret the result.
Two conclusions can be inferred:
1. Reject the null hypothesis by showing enough evidence to support the alternative hypothesis.
2. Fail to reject the null hypothesis because there is not enough evidence to support the alternative hypothesis.

Example of Hypothesis
●For example, suppose a biologist believes that a certain fertilizer will cause plants to grow more during a one-month period than they normally do, which is currently 20 inches. To test this, she applies the fertilizer to each of the plants in her laboratory for one month.
●She then performs a hypothesis test using the following hypotheses:
●H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
●HA: μ > 20 inches (the fertilizer will cause the mean plant growth to increase)

              H0 true                H0 false
Rejected      Type I error           Correct decision (√)
Not rejected  Correct decision (√)   Type II error

●For example, suppose a doctor believes that a new drug is able to reduce blood pressure in obese patients. To test this, he may measure the blood pressure of 40 patients before and after using the new drug for one month.
●He then performs a hypothesis test using the following hypotheses:
●H0: μ_after = μ_before (the mean blood pressure is the same before and after using the drug)
●HA: μ_after < μ_before (the mean blood pressure is lower after using the drug)

Hypothesis Tests: Parametric Tests and Non-Parametric Tests
In a parametric test, information about the population is completely known and can be used for statistical inference. Choosing the type of parametric test to apply is a decision-making task.
Steps for a parametric test:
Step 1: State the null and alternative hypotheses.
Step 2: Choose the level of significance.
Step 3: Identify the type of parametric test to be conducted.
Step 4: Find the critical value to decide the acceptance/rejection regions.
Step 5: Consider the sample and compute the parametric test statistic.
Step 6: Compare the obtained value with the critical value to decide whether the null hypothesis is accepted or rejected.

Core Terms Related to Parametric Tests
The null hypothesis and the alternative hypothesis are mutually exclusive.
1. Acceptance and critical regions:
All sets of possible values can be divided into two mutually exclusive groups:
●Acceptance region: values that appear consistent with the null hypothesis.
●Rejection region: values unlikely to occur if the null hypothesis is true.
The value(s) that separate the critical region from the acceptance region are called critical values.

One-Tailed and Two-Tailed Tests
If the hypothesis in the stated problem uses an equals sign, it is a two-tailed test.
If it uses a greater-than or less-than sign, it is a one-tailed test.
Case 1: A government school states that the dropout rate of female students between ages 12 and 18 is 28%. (Two-tailed test)
Case 2: A government school states that the dropout rate of female students between ages 12 and 18 is greater than 28%. (One-tailed test)
Case 3: A government school states that the dropout rate of female students between ages 12 and 18 is less than 28%. (One-tailed test)

Significance Level
It is denoted by α.
It is the probability of the null hypothesis being rejected even when it is true. This is because 100% accuracy is practically not possible when accepting or rejecting a hypothesis.
For example, a significance level of 0.03 indicates that a 3% risk is being taken of concluding that a difference in values exists when there is no actual difference.
Typical values of the significance level are 0.01, 0.05, and 0.1.

Calculated Probability (p-value)
The calculated probability (p-value) states that, when the null hypothesis is true, the statistical summary will be greater than or equal to the actual observed results.
It is the probability of obtaining the observed or more extreme results when the null hypothesis is true.
Some of the widely used hypothesis test types are:
Z-test
T-test
Chi-square test

Types of Hypothesis Testing
●Parametric tests
○One sample: Z-test, t-test
○Two samples:
- Independent samples: Z-test, two-group test
- Paired samples: paired test
●Non-parametric tests
○Chi-square test (one sample or two samples)

Z-Test
●This test is used for comparing the mean of a sample to some hypothesized mean of a given population.
●The statistic for a one-sample z-test is:
z = (X̄ − μ_H0) / (σ_p / √n)
where
μ_H0 = hypothesized population mean
σ_p = population standard deviation
n = sample size

Example
●For a sample of 500 female students with a mean height of 5.4 feet, the task is to find whether it can reasonably be regarded as a sample from a large population with a mean height of 5.6 feet and a standard deviation of 1.45 feet. Let us consider a 5% level of significance to solve the problem.
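A minimal sketch of the computation for this example; the critical value 1.96 is the standard normal cutoff for a two-tailed test at the 5% level:

import math

x_bar, mu, sigma, n = 5.4, 5.6, 1.45, 500
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(z)  # about -3.08; since |z| > 1.96, reject the null hypothesis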

T-Test
●The one-sample t-test is mainly used for determining whether the mean of a sample is statistically different from a known or hypothesized mean of a given population.
●The test variable needs to be continuous.
●The statistic has the same form as the z-test, with the sample standard deviation σ_s in place of the population value:
t = (X̄ − μ_H0) / (σ_s / √n)

Chi-Square Test
●A chi-square test is a test of statistical significance for categorical variables.
●It is used to find the difference between the observed and expected data.
●It can be used to find the correlation between categorical variables in our data.
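A minimal sketch using SciPy's scipy.stats.chi2_contingency (a real SciPy function); the 2x2 contingency table of observed counts is made up for illustration:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = two groups, columns = two categories
observed = np.array([[20, 30],
                     [30, 20]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # reject independence of the variables if p < 0.05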

Z-Test Example

Z-TEST
Import math
Import numpyas np
From numpy.randomimport randn
From statsmodels.stats.weightstatsimport ztest
# Here we are considering random array of 50 numbers having mean=110 sd=15
Mean_1=110
Sd_1=15/math.sqrt(50)
Null_mean=100
data=sd_1*randn(50)+mean_1
#print mean and sd
Print(‘mean=%.2f stdv=%.2f’ % (np.mean(data),np.std(data)))
41

# ztest returns the test statistic and the p-value
t_stat, p_value = ztest(data, value=null_mean, alternative='larger')
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

ANOVA
●Analysis of Variance (ANOVA) is an extension of the t-test. It is used to check whether the means of two or more groups are significantly different from each other.
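A minimal sketch using scipy.stats.f_oneway (SciPy's one-way ANOVA); the three groups of scores are made up:

from scipy.stats import f_oneway

# Hypothetical scores from three groups
group_a = [85, 86, 88, 75, 78, 94]
group_b = [91, 92, 93, 85, 87, 84]
group_c = [79, 78, 88, 94, 92, 85]
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # the means differ significantly if p_value < 0.05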

Two-Sample Parametric Tests
●Independent samples z-test
This test is carried out on two normally distributed but independent populations to compare the means of the samples.
The population variances of both samples are already known.
The size of each sample should be larger than 30.
z = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)
where S1 and S2 are the standard deviations of sample 1 and sample 2, and n1 and n2 are the sample sizes.

Independent Samples t-Test
●This test is carried out to test the statistical difference between:
1. The means of two groups
2. The means of two interventions
3. The means of two change scores

Paired Sample t-Test
●It is carried out to compare two population means for two given samples in which the observations in one sample can be paired with the observations in the other.
●This test is usually used in the case of before-and-after observations on the same subjects (see the sketch below).
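A minimal sketch of both two-sample t-tests with SciPy (scipy.stats.ttest_rel for paired samples, scipy.stats.ttest_ind for independent samples); the before/after blood-pressure values are made up, echoing the drug example earlier:

from scipy.stats import ttest_rel, ttest_ind

# Hypothetical blood pressure of the same eight patients before and after treatment
before = [142, 150, 138, 160, 155, 148, 152, 145]
after = [136, 145, 139, 151, 148, 142, 147, 140]

t_stat, p_value = ttest_rel(before, after)  # paired: observations are matched
print('paired:', t_stat, p_value)

t_stat, p_value = ttest_ind(before, after)  # independent: treats the lists as unrelated groups
print('independent:', t_stat, p_value)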

Non-Parametric Hypothesis Tests
●Information about the population is unknown, and hence no assumption can be made regarding the population.
●They are more suitable for data that can be represented on qualitative scales (nominal or ordinal).
●They cover techniques that do not rely on the data belonging to any particular distribution.
●The distribution of data can be skewed, and the population variance can be non-homogeneous.
●One-sample non-parametric tests:
One-factor Chi-Square
Binomial
Wilcoxon Signed Rank Test
●Two independent samples:
Mann-Whitney Test
Kolmogorov-Smirnov Test
●Two paired samples:
Sign Test
Chi-Square
Wilcoxon Signed Rank Test

Estimation of Parameter Values
●In statistics, estimation (or inference) refers to the task of drawing conclusions about a population based on information provided by a sample.
●This can be done in two ways:
Point estimate
Interval estimate
●Point estimation considers only a single value of a statistic.
●Since point estimation is based on a single random sample, its value will vary when different random samples are taken from the same population.
●A few of the standard point estimation methods are:
Maximum Likelihood Estimator
Minimum Variance Mean Unbiased Estimator
Minimum Mean Squared Error
Best Linear Unbiased Estimator

Interval Estimate
●An interval estimate considers two values between which the population parameter is likely to lie.
●The two values are the lower and upper limits of a confidence interval for the parameter.

Measuring Data Similarity and Dissimilarity
●A similarity measure is a way of measuring how related or close data samples are to each other.
●A dissimilarity measure tells how distinct the data objects are.
●Similarity measures are expressed as numerical values.
●The value gets higher when the data samples are more alike (zero means low similarity and one means very similar).
●Data structures:
The data matrix
The dissimilarity matrix
●Object dissimilarity can be computed for objects described by nominal attributes, binary attributes, numerical attributes, and ordinal attributes.

Proximity Measures for Nominal Attributes
●Nominal means relating to names. The values of a nominal attribute are symbols or names of things.
●Let M be the total number of states of a nominal attribute. The states can be numbered from 1 to M.
●Let m be the number of attributes for which objects i and j are in the same state, and let p be the total number of attributes. Then dissimilarity can be calculated as
d(i, j) = (p − m) / p
and similarity as
s(i, j) = 1 − d(i, j)
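A minimal sketch of this formula; the two objects with three nominal attributes are made up:

def nominal_dissimilarity(obj_i, obj_j):
    # d(i, j) = (p - m) / p, where m counts attributes in the same state
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

obj_1 = ('red', 'circle', 'small')
obj_2 = ('red', 'square', 'small')
d = nominal_dissimilarity(obj_1, obj_2)
print(d, 1 - d)  # dissimilarity 1/3, similarity 2/3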

Proximity Measures for Numeric Data
●Euclidean distance: d(i, j) = √( (xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)² )
●Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xin − xjn|
●Minkowski distance: d(i, j) = ( |xi1 − xj1|^p + |xi2 − xj2|^p + … + |xin − xjn|^p )^(1/p)
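A minimal NumPy sketch of the Minkowski distance; setting p = 1 gives the Manhattan distance and p = 2 the Euclidean distance (the two points are made up):

import numpy as np

def minkowski(x, y, p):
    # (sum of |x_k - y_k| ** p) ** (1 / p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(minkowski(x, y, 1))  # p = 1: Manhattan distance, 7.0
print(minkowski(x, y, 2))  # p = 2: Euclidean distance, 5.0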

SET A
1. Write a Python program to find the maximum and minimum value of a given flattened array.

import numpy as np

ar = np.array([[0, 1], [2, 3]])
print("Original array:")
print(ar)
print("-----------------")
print("Maximum value of the above flattened array:")
print(np.amax(ar))
print("Minimum value of the above flattened array:")
print(np.amin(ar))

2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: use the linalg.norm function from NumPy]

import numpy as np

point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))
# Calculate the Euclidean distance using linalg.norm()
dist = np.linalg.norm(point1 - point2)
print(dist)

3. Create one dataframe of data values. Find out the mean, range, and IQR for this data.

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=["Apple", "Orange", "Banana", "Pear"],
                  index=["Basket1", "Basket2", "Basket3", "Basket4",
                         "Basket5", "Basket6"])
print("\n-----------Calculate Mean -----------\n")
print(df.mean())
print("-----Maximum Value-------")
a = df.max()
print(a)
print("-----Minimum Value-------")
b = df.min()
print(b)
r = a - b
print("-------Range-------")
print(r)
print("-------IQR-------")
# Interquartile range per column: Q3 - Q1
print(df.quantile(0.75) - df.quantile(0.25))

4. Find the sum of the Manhattan distances between all pairs of given points.

def distancesum(x, y, n):
    # Sum the Manhattan distance over every pair of points
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(x[i] - x[j]) + abs(y[i] - y[j])
    return total

# Driver code
x = [-1, 1, 3, 2]
y = [5, 6, 5, 3]
n = len(x)
print(distancesum(x, y, n))

5. Write a NumPy program to compute the histogram of nums against the bins.

import numpy as np
import matplotlib.pyplot as plt

nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("nums: ", nums)
print("bins: ", bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()

6. Create a dataframe for students' information such as name, graduation percentage, and age.
# Display the average age of students and the average graduation percentage.
# Also describe all basic statistics of the data. (Hint: use describe().)

import pandas as pd

stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
             "grade": [78, 69, 65, 90, 45, 89],
             "age": [21, 23, 22, 19, 20, 18]}
df = pd.DataFrame(stud_data)
print(df)
print("------average of graduation percentage-------")
mean_grade = df["grade"].mean()
print(mean_grade)
print("------average age-------")
mean_age = df["age"].mean()
print(mean_age)
print("------Describe basic statistics of data-------")
print(df.describe())

Concept of an Outlier
●An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
●Outlier detection is the process of finding data objects whose behavior differs from expectation.
●Outliers can be caused by measurement or execution errors.
●Examples:
1. Weight vs. height
2. Fraudulent transactions in credit card data

Examples:
●As a real-world example, the average giraffe is about 16 feet tall. However, there have been recent discoveries of two giraffes standing at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers compared to the general giraffe population.
●When going through data analysis, outliers can cause anomalies in the results obtained. This means that they require some special attention and, in some cases, will need to be removed to analyze data effectively.
Here are two main reasons why giving outliers special attention is a necessary aspect of the data analytics process:
1. Outliers may harm the result of an analysis.
2. Outliers, or their behavior, may be exactly the information that a data analyst requires from the analysis.

How to Identify Outliers Using Z-Scores, the Interquartile Range, and Visualizations

●Example: With small datasets it can be easy to spot outliers manually (for example, in the set 28, 26, 21, 24, 78 you can see that 78 is the outlier), but when it comes to large datasets or big data, other tools are required.
●In data analytics, analysts create data visualizations to present data graphically in a meaningful and impactful way and to present their findings to relevant stakeholders. These visualizations can easily show trends, patterns, and outliers from a large set of data in the form of maps, graphs, and charts.

●https://youtu.be/R-P8qEGXnBs

There are eight main causes of outliers:
●Incorrect data entry by humans
●Codes used instead of values
●Sampling errors, or data extracted from the wrong place or mixed with other data
●Unexpected distribution of variables
●Measurement errors caused by the application or system
●Experimental errors in extracting the data, or planning errors
●Intentional dummy outliers inserted to test the detection methods
●Natural deviations in the data (not an error) that indicate fraud or some other anomaly you are trying to detect


Global Outliers
●Global outliers are also called point outliers. Global outliers are the simplest form of outliers.
●When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier.
●In most cases, outlier detection procedures are aimed at determining global outliers. (In the slide's figure, the green data point is the global outlier.)

Contextual Outliers
●Contextual outliers are also known as conditional outliers. These outliers occur when a data object deviates from the other data points because of a specific condition in a given data set.
●There are two types of attributes of data objects: contextual attributes and behavioral attributes.
●Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications.
●For example, a temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season, but it will behave like a normal data point in the context of a summer season. (In the slide's diagram, a green dot representing a low temperature value in June is a contextual outlier, since the same value in December would not be an outlier.)

Collective Outliers
●Collective outliers are groups of data points that collectively deviate significantly from the overall distribution of a dataset.
●Collective outliers may not be outliers when considered individually, but as a group they exhibit unusual behavior.
●Detecting and interpreting collective outliers can be more complex than individual outliers, as the focus is on group behavior rather than individual data points.

Outlier Detection Methods
●Supervised
●Semi-supervised
●Unsupervised

Supervised Methods
●Supervised methods model data normality and abnormality.
●Domain experts examine and label a sample of the underlying data.
●Outlier detection can then be modeled as a classification problem: the task is to learn a classifier that can identify outliers.
●The labeled sample can be used for training and testing.
●In some applications the experts may label just the normal objects, and any objects not matching the model of normal objects are reported as outliers.

Unsupervised Methods
•In various applications, objects labeled as "normal" or "outlier" are not available.
•Therefore, an unsupervised learning approach has to be used.
•Unsupervised outlier detection methods make an implicit assumption that the normal objects are considerably "clustered."
•An unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers.
•Normal objects do not have to fall into one group sharing high similarity. Instead, they can form several groups, where each group has distinct features.

Semi-Supervised Methods
●In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
●We can encounter cases where only a small set of the normal and outlier objects are labeled, while most of the data are unlabeled.
●Semi-supervised outlier detection methods were developed to tackle such cases.
●Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning. For example, when some labeled normal objects are available, we can use them, together with unlabeled objects that are close by, to train a model for normal objects. The model of normal objects is then used to detect outliers: objects not fitting the model of normal objects are classified as outliers.

Statistical Methods
●These are also known as model-based methods.
●Simply starting with a visual analysis of the univariate data, using box plots, scatter plots, whisker plots, etc., can help in finding the extreme values in the data.
●Assuming a normal distribution, calculate the z-score, which tells how many standard deviations (σ) a data point lies from the sample's mean.
●Another way would be to use the interquartile range (IQR) as a criterion, treating as outliers the points that lie more than 1.5 times the IQR below the first quartile or above the third quartile.
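A minimal sketch of both criteria on a made-up sample with one extreme value; the 3-sigma cutoff for z-scores and the 1.5 x IQR fences are common conventions:

import numpy as np

data = np.array([28, 26, 21, 24, 23, 25, 27, 22, 26, 24, 78])

# Z-score criterion: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])  # [78]

# IQR criterion: flag points beyond 1.5 * IQR below Q1 or above Q3
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("IQR outliers:", outliers)  # [78]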

Proximity-Based Methods
●They assume that an object is an outlier if its nearest neighbors are far away in feature space, i.e., the proximity of the object to its neighbors significantly deviates from the proximity of most other objects to their neighbors in the same data set.
●Proximity-based methods are classified into two types: distance-based methods judge a data point based on the distance(s) to its neighbors, while density-based methods determine the degree of outlierness of each data instance based on its local density.