Sanjeev Kumar Dash - Data Mining-2023.ppt

gobeli2850 13 views 238 slides Sep 22, 2024
Data Mining
Ch.Sanjeev Kumar Dash

Definition
•Definition: Data mining is the non-trivial process of extracting interesting, useful, and novel patterns or implicit knowledge from huge amounts of data.

Definition
•Non-trivial process: the knowledge to be extracted is not obvious.
•Implicit knowledge: the knowledge is inbuilt in the data; we have to extract it from the data itself.

•Novelty: the knowledge has to be new, previously unknown knowledge.
•Potentially useful: the knowledge has to be useful, depending on the application.
•The knowledge often takes the form of patterns in the data: some regularity or some kind of structure, found in a huge amount of data.

Which is not Data Mining
•A plain search, like we do in Google or any search engine;
•Query processing in a DBMS;
•Booking a ticket on Indian Railways (example: www.irctc.com), e.g., finding out how many tickets are available on this train on this day.
•Data mining would work on historical data, not on the existing records.

Data Mining
•This entire study is very much
interdisciplinary.
•It borrows from database technology,
statistics, machine learning, pattern
recognition algorithms, cognitive theory etc.

Steps of KDD

KDD Process

Steps of KDD
•1. Data cleaning (to remove noise and inconsistent data)
•2. Data integration (where multiple data sources may be combined)
•3. Data selection (where data relevant to the analysis task are retrieved from the database)
•4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)

•5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
•6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
•7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)

•Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
•The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

What Kinds of Data Can Be Mined?
•Data mining can be applied to any kind of data
as long as the data are meaningful for a target
application.
•The most basic forms of data for mining applications are database data, data warehouse data, and transactional data.

•Data mining can also be applied to other
forms of data .
•For example, data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW.

Types of Datasets

Record Data

Text Data

Graph Data

Ordered Data

Database Data
•A database system, also called a database
management system (DBMS), consists of a
collection of interrelated data, known as a
database, and a set of software programs to
manage and access the data.

•A relational database is a collection of tables,
each of which is assigned a unique name.
•Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).

Data Warehouses
•A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
•Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

•A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales amount).
•A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

Transactional Data
•In general, each record in a transactional database captures a transaction, such as a customer's purchase, a flight booking, or a user's clicks on a web page.
•A transaction typically includes a unique transaction identity number (transID) and a list of the items making up the transaction, such as the items purchased in the transaction.

•A transactional database may have additional
tables, which contain other information
related to the transactions, such as item
description, information about the
salesperson or the branch, and so on.

Data mining functionalities
•We have observed various types of data and information repositories on which data mining can be performed.

Data mining functionalities.
•These include:
•1. Characterization and discrimination,
•2. The mining of frequent patterns, associations, and correlations,
•3. Classification and regression,
•4. Clustering analysis and outlier analysis.
•Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

Characterization
•Data characterization is a summarization of
the general characteristics or features of a
target class of data.
•Data entries can be associated with classes or
concepts.
•For example, in an electronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.

Characterization
•It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms.
•Such descriptions of a class or a concept are called class/concept descriptions.
•These descriptions can be derived using data characterization, by summarizing the data of the class under study (often called the target class).

Example
•A customer relationship manager at
AllElectronics may order the following data
mining task: Summarize the characteristics of
customers who spend more than $5000 a year
at AllElectronics.
•The result is a general profile of these
customers, such as that they are 40 to 50
years old, employed, and have excellent credit
ratings.

Characterization
•The output of data characterization can be
presented in various forms.
•Examples include pie charts, bar charts,
curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.

Data discrimination
•Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
•The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.

Example
•A user may want to compare the general
features of software products with sales that
increased by 10% last year against those with
sales that decreased by at least 30% during
the same period.

Mining Frequent Patterns,
Associations, and Correlations
•Frequent patterns, as the name suggests, are
patterns that occur frequently in data.
•There are many kinds of
•frequent patterns,
•including frequent itemsets,
•frequent subsequences (also known as sequential
patterns),
•and frequent substructures.

•A frequent itemset typically refers to a set of
items that often appear together in a
transactional data set.
•for example, milk and bread, which are
frequently bought together in grocery stores
by many customers.

•A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.

• A substructure can refer to different
structural forms (e.g., graphs, trees, or lattices)
that may be combined with itemsets or
subsequences.
• If a substructure occurs frequently, it is called
a (frequent) structured pattern.
•Mining frequent patterns leads to the
discovery of interesting associations and
correlations within data.

Classification and Regression for
Predictive Analysis
•Classification is the process of finding a
model (or function) that describes and
distinguishes data classes or concepts.
•The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
•The model is used to predict the class label of
objects for which the class label is unknown.

•The derived model may be represented in
various forms, such as classification rules (i.e.,
IF-THEN rules), decision trees, mathematical
•formulae, or neural networks

Types of ML Classification Algorithms:
•Classification Algorithms can be further divided
into the following types:
•Logistic Regression
•K-Nearest Neighbours
•Support Vector Machines
•Kernel SVM
•Naïve Bayes
•Decision Tree Classification
•Random Forest Classification
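Of these algorithms, k-nearest neighbours is simple enough to sketch in a few lines. This is a minimal illustration, not part of the original deck; the training points, labels, and the function name knn_predict are made up for the example.

```python
# Minimal k-nearest-neighbours classifier sketch (k = 3); data are hypothetical.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; returns majority label of k nearest."""
    # Sort training points by squared Euclidean distance to the query point.
    nearest = sorted(train, key=lambda p: (p[0][0] - query[0]) ** 2
                                          + (p[0][1] - query[1]) ** 2)
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))   # "A"
print(knn_predict(train, (8, 7)))   # "B"
```

A query near the cluster of "A" points gets label "A"; one near the "B" cluster gets "B".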

Example of Decision tree

•Classification predicts categorical (discrete, unordered) labels.
•Regression models predict continuous-valued functions.

Regression:
•Regression is a process of finding the correlations between dependent and independent variables.
•It helps in predicting continuous variables, such as the prediction of market trends, prediction of house prices, etc.

Regression
•The task of the Regression algorithm is to find
the mapping function to map the input
variable(x) to the continuous output
variable(y).
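A minimal sketch of what "finding the mapping function" looks like for simple linear regression fitted by least squares. The data points here are hypothetical, chosen so the true relationship is roughly y = 2x.

```python
# Fit y = a*x + b by least squares; data points are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope = covariance(x, y) / variance(x); intercept passes through the means.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(round(a, 2), round(b, 2))   # slope close to 2, intercept close to 0
```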

Types of Regression Algorithm:
•Simple Linear Regression
•Multiple Linear Regression
•Polynomial Regression
•Support Vector Regression
•Decision Tree Regression
•Random Forest Regression

•Example: predicting the amount of revenue that each item will generate during an upcoming sale at an electronics shop, based on the previous sales data.

Cluster Analysis
•Clustering analyzes data objects without consulting class labels.
•In many cases, class labeled data may simply not
exist at the beginning.
•Clustering can be used to generate class labels
for a group of data.
•The objects are clustered or grouped based on
the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.

Outlier Analysis
•A data set may contain objects that do not comply
with the general behavior or model of the data.
•These data objects are outliers.
•Many data mining methods discard outliers as noise
or exceptions.
•However, in some applications (e.g., fraud detection)
the rare events can be more interesting than the more
regularly occurring ones.
•The analysis of outlier data is referred to as outlier
analysis or anomaly mining.

Getting to Know Your Data
•Data Objects
•Data sets are made up of data objects.
•A data object represents an entity.
•For example, in a sales database, the objects may be customers, store items, and sales;
•In a medical database, the objects may be patients;
•In a university database, the objects may be students, professors, and courses.

What is Data?

Data Objects-
•Data objects are typically described by attributes.
•Data objects can also be referred to as samples,
examples, instances, data points, or objects.
• If the data objects are stored in a database, they
are data tuples.
•That is, the rows of a database correspond to
the data objects, and the columns correspond to
the attributes.

What Is an Attribute?
•An attribute is a data field, representing a
characteristic or feature of a data object.
•The nouns attribute, dimension, feature, and variable
are often used interchangeably in the literature.
•The term dimension is commonly used in data
warehousing.
•Machine learning literature tends to use the term
feature, while statisticians prefer the term variable.
•Data mining and database professionals commonly use
the term attribute.

•Each row can be seen as a vector whose components are the individual attribute values.
•These vectors are also sometimes known as object vectors or feature vectors.
•What is the dimension of the vector?
•The dimension of the vector is determined by the number of attributes in the table.

Types of Attribute
•Nominal attribute: nominal means “relating to names.”
•The values of a nominal attribute are symbols
or names of things.
•Each value represents some kind of category,
code, or state, and so nominal attributes are
also referred to as categorical.
•The values do not have any meaningful order.

•For example, values of hair color are black, brown, blond, red, auburn, gray, and white.
•The attribute marital status can take on the values single, married, divorced, and widowed.
•Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.

Binary Attributes
•A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
•Binary attributes are referred to as Boolean if the two states correspond to true and false.

Example
•The attribute medical test is binary, where a
value of 1 means the result of the test for the
patient is positive, while 0 means the result is
negative.

Binary attributes.
•Given the attribute smoker describing a
patient object, 1 indicates that the patient
smokes, while 0 indicates that the patient
does not.

•A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1.

Asymmetric
•A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV.
•By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).

Ordinal Attributes
•An ordinal attribute is an attribute with
possible values that have a meaningful order
or ranking among them, but the magnitude
between successive values is not known.
•For example, drink size is an ordinal attribute with three possible values: small, medium, and large.

•We cannot tell from the values how much bigger, say, a large is than a medium.
•Professional rank: professional ranks can be enumerated in a sequential order, for example, assistant, associate, and full for professors.

Numeric Attributes
•A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
•Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes
•Interval-scaled attributes are measured on a
scale of equal-size units.
•The values of interval-scaled attributes have
order and can be positive, 0, or negative.
•Example- A temperature attribute is interval-
scaled.
•Suppose that we have the outdoor temperature
value for a number of different days.
• By ordering the values, we obtain a ranking of
the objects with respect to temperature.

•Example- Calendar dates are another
example. For instance, the years 2002 and
2010 are eight years apart.

Ratio-Scaled Attributes
•A ratio-scaled attribute is a numeric attribute
with an inherent zero-point.
•That is, if a measurement is ratio-scaled, we
can speak of a value as being a multiple (or
ratio)of another value.
• In addition, the values are ordered, and we
can also compute the difference between
values, as well as the mean, median, and
mode.

Examples of ratio-scaled attributes
•Examples include years of experience (e.g., the objects are employees) and number of words (e.g., the objects are documents).
•Additional examples include attributes to measure weight, height, latitude and longitude.

•Interval scales hold no true zero and can
represent values below zero.
• For example, you can measure temperatures
below 0 degrees Celsius, such as -10 degrees.
Ratio variables, on the other hand, never fall
below zero. Height and weight measure from
0 and above, but never fall below it.

Discrete versus Continuous Attributes
•A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
•The attributes hair color, smoker, medical test, and drink size each have a finite number of values, and so are discrete.

Example
•Note that discrete attributes may have numeric
values, such as 0 and 1 for binary attributes
•or, the values 0 to 110 for the attribute age.
•An attribute is countably infinite if the set of
possible values is infinite but the values can be
put in a one-to-one correspondence with natural
numbers.
•For example, the attribute customer ID is
countably infinite.
•Zip codes are another example.

continuous
•If an attribute is not discrete, it is continuous.
•Continuous values are real numbers, whereas
numeric values can be either integers or real
numbers.
•Continuous attributes are typically
represented as floating-point variables.

Example
•A feature F1 can take the values A, B, C, D, E, F, and represents the grade of students from a college. Which of the following statements is true in this case?
•a. Feature F1 is an example of a nominal variable.
•b. Feature F1 is an example of an ordinal variable.
•c. It doesn't belong to any of the above categories.
•d. Both of these
•(Answer: b; grades have a meaningful order, so F1 is ordinal.)

Measures of Central Tendency
•The central value or the most occurring value
that gives a general idea of the whole data set
is called the Measure of Central Tendency.
•Some of the most commonly used measures
of central tendency are:
•Mean
•Median
•Mode

Example
•Suppose that we have some attribute X, like
salary, which has been recorded for a set of
objects.
•Let x1, x2, ..., xN be the set of N observed values or observations for X.
•If we were to plot the observations for salary,
where would most of the values fall?

Mean
•The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean.
•Let x1, x2, ..., xN be the set of N observed values or observations for X. The mean of this set of values is x̄ = (x1 + x2 + ... + xN)/N.

Arithmetic mean

Mean

Weighted Arithmetic Mean

Example
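The four slides above were images in the original deck; the standard formulas they illustrate are the arithmetic mean and the weighted arithmetic mean (where each w_i is a weight reflecting the significance of the corresponding value):

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
        = \frac{x_1 + x_2 + \cdots + x_N}{N},
\qquad
\bar{x}_w = \frac{\sum_{i=1}^{N} w_i\, x_i}{\sum_{i=1}^{N} w_i}
```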

•A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
•Even a small number of extreme values can corrupt the mean.

Trimmed mean
•Trimmed mean: the mean obtained after chopping off values at the high and low extremes.
•For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean.
•We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

•Let's say, as an example, a figure skating competition
produces the following scores: 6.0, 8.1, 8.3, 9.1, and
9.9.
•The mean for the scores would equal:
•((6.0 + 8.1 + 8.3 + 9.1 + 9.9) / 5) = 8.28
•To trim the mean by a total of 40%, we remove the
lowest 20% and the highest 20% of values, eliminating
the scores of 6.0 and 9.9.
•Next, we calculate the mean based on the calculation:
•(8.1 + 8.3 + 9.1) / 3 = 8.50
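The calculation above can be sketched in Python; the function name trimmed_mean is mine, not from the slides.

```python
# Trimmed mean of the figure-skating scores from the slide.
scores = [6.0, 8.1, 8.3, 9.1, 9.9]

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end."""
    values = sorted(values)
    k = int(len(values) * proportion)      # number of values to drop per end
    kept = values[k:len(values) - k] if k else values
    return sum(kept) / len(kept)

print(trimmed_mean(scores, 0.0))   # plain mean, approx. 8.28
print(trimmed_mean(scores, 0.2))   # 20% trimmed from each end, approx. 8.5
```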

Median.
•Median is the middle value in a set of ordered
data values. It is the value that separates the
higher half of a data set from the lower half.

•Suppose that a given data set of N values for an attribute X is sorted in increasing order.
•If N is odd, then the median is the middle value of the ordered set.
•If N is even, then the median is not unique; it is the two middlemost values and any value in between.
•If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.

•Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
•So median = (52 + 56)/2 = 54.
•Suppose that we had only the first 11 values in the list.
•Given an odd number of values, the median is the middlemost value.
•This is the sixth value in this list, which has a value of $52,000.

•How to calculate the median for an even number of values?
•Example: 9, 8, 5, 6, 3, 4
•Arrange the values in order: 3, 4, 5, 6, 8, 9
•Add the 2 middle values and calculate their mean.
•Median = (5 + 6)/2
•Median = 5.5
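Both cases (odd and even N) can be handled by one small Python function; a minimal sketch:

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                       # odd N: the middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2     # even N: mean of the two middle values

print(median([9, 8, 5, 6, 3, 4]))   # 5.5
print(median([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))   # 54.0
```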

•The median is expensive to compute when we
have a large number of observations.

Median for range of values

•Note: The median class is the class that contains the value located at N/2.

Median for range of values

•L: Lower limit of median class: 11
•W: Width of median class: 9
•N: Total frequency: 60
•C: Cumulative frequency up to median class: 8
•F: Frequency of median class: 25

•Median = L + W[(N/2 - C) / F]
•Median = 11 + 9[(60/2 - 8) / 25]
•Median = 18.92
•We estimate that the median exam score is 18.92.
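The same grouped-data calculation in Python; the variable names follow the slide.

```python
# Median for grouped (binned) data: Median = L + W * ((N/2 - C) / F)
L = 11   # lower limit of the median class
W = 9    # width of the median class
N = 60   # total frequency
C = 8    # cumulative frequency up to the median class
F = 25   # frequency of the median class

median = L + W * ((N / 2 - C) / F)
print(median)   # approx. 18.92
```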

Mode
•The mode is another measure of central
tendency.
•The mode for a set of data is the value that
occurs most frequently in the set.
•For Example,
•In {6, 9, 3, 6, 6, 5, 2, 3}, the Mode is 6 as it
occurs most often.

Types of Mode
•The different types of Mode are Unimodal, Bimodal, Trimodal, and Multimodal. Let us understand each of these Modes.
•Unimodal Mode - A set of data with one Mode is known as a Unimodal Mode.
•For example, the Mode of data set A = {14, 15, 16, 17, 15, 18, 15, 19} is 15, as 15 is the only value that repeats. Hence, it is a Unimodal data set.

•Bimodal Mode - A set of data with two Modes
is known as a Bimodal Mode. This means that
there are two data values that are having the
highest frequencies.
•For example, the Mode of data set A =
{ 8,13,13,14,15,17,17,19} is 13 and 17 because
both 13 and 17 are repeating twice in the
given set. Hence, it is a Bimodal data set.

•Trimodal Mode - A set of data with three Modes
is known as a Trimodal Mode. This means that
there are three data values that are having the
highest frequencies.
•For example, the Mode of data set A = {2, 2, 2, 3,
4, 4, 5, 6, 5,4, 7, 5, 8} is 2, 4, and 5 because all the
three values are repeating thrice in the given set.
•Hence, it is a Trimodal data set.

•Multimodal Mode - A set of data with four or more Modes is known as a Multimodal Mode.
•For example, the Mode of data set A = {100, 80, 80, 95, 95, 100, 90, 90} is 80, 90, 95, and 100, because all four values are repeated twice in the given set. Hence, it is a Multimodal data set.
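All of these cases can be found with one small helper that returns every value tied for the highest frequency; a minimal sketch:

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([6, 9, 3, 6, 6, 5, 2, 3]))          # [6]        (unimodal)
print(modes([8, 13, 13, 14, 15, 17, 17, 19]))   # [13, 17]   (bimodal)
```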

•Data in most real applications are not
symmetric.
•They may instead be either positively skewed,
where the mode occurs at a value that is
smaller than the median or negatively
skewed , where the mode occurs at a value
greater than the median .

What is data skewness?
•When most of the values lie to the left or right of the median, the data is called skewed.
•Data can be in any of the following shapes:
•Symmetric: mean, median and mode are at the same point.
•Positively skewed: most of the values lie to the left, with a long tail to the right of the median.
•Negatively skewed: most of the values lie to the right, with a long tail to the left of the median.

•For a symmetric distribution: mean = median = mode
•For a skewed distribution:
•Positively skewed: mean > median > mode
•Negatively skewed: mode > median > mean
•Mean - mode = 3(mean - median)
•Mode = 3*median - 2*mean
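A quick numeric check of the empirical relation; the mean and median here are hypothetical summary statistics, not from the slides.

```python
# Empirical rule: mode is approximately 3*median - 2*mean.
mean, median = 50.0, 48.0          # hypothetical values, with mean > median
mode_estimate = 3 * median - 2 * mean
print(mode_estimate)               # 44.0, so mode < median < mean: positively skewed
```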

•The midrange can also be used to assess the central tendency of a numeric data set.
•It is the average of the largest and smallest values in the set.
•Midrange: the midrange of the salary data above is (30,000 + 110,000)/2 = $70,000.

Measuring the Dispersion of Data:
•Range, quartiles, variance, standard deviation, and interquartile range.
•Range: let x1, x2, ..., xn be a set of observations for some numeric attribute, X. The range of the set is the difference between the largest (max()) and smallest (min()) values.

What is quartile?
•Suppose that the data for attribute X are
sorted in increasing numeric order.
•We can pick certain data points so as to split the data distribution into equal-size consecutive sets.
•These data points are called quantiles.
•Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.

•The 2-quantile is the data point dividing the lower and
upper halves of the data distribution. It corresponds to
the median.
•The 4-quantiles are the three data points that
•split the data distribution into four equal parts; each
part represents one-fourth of the data distribution.
•They are more commonly referred to as quartiles.
•The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100
•equal-sized consecutive sets.

•The first quartile, denoted by Q1, is the 25th
percentile. It cuts off the lowest 25% of the
data.
•The third quartile, denoted by Q3, is the 75th
percentile—it cuts off the lowest 75% (or
•highest 25%) of the data.
•The second quartile is the 50th percentile. As
the median, it gives the center of the data
distribution.

•The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
•This distance is called the interquartile range (IQR) and is defined as
•IQR = Q3 - Q1

How to find quartiles of an odd-length data set?
•Data = 8, 5, 2, 4, 8, 9, 5
•Step 1: First of all, arrange the values in order.
•Data = 2, 4, 5, 5, 8, 8, 9

•Step 2: For dividing this data into four equal parts, we need three quartiles.
•Q1: Lower quartile
•Q2: Median of the data set
•Q3: Upper quartile

•Step 3: Find the median of the data set and label it as Q2.
•Data = (2, 4, 5), 5, (8, 8, 9)
•Q1: 4 - Lower quartile
•Q2: 5 - Middle quartile
•Q3: 8 - Upper quartile
•Interquartile Range = Q3 - Q1 = 8 - 4 = 4

How to find quartiles of an even-length data set?
•Data = 8, 5, 2, 4, 8, 9, 5, 7
•Step 1: First of all, arrange the values in order.
•Data = 2, 4, 5, 5, 7, 8, 8, 9

•Step 2: For dividing this data into four equal parts, we need three quartiles.
•Q1: Lower quartile
•Q2: Median of the data set
•Q3: Upper quartile
•Step 3: Find the median of the data set and label it as Q2.

•Data = 2, 4, 5, 5, 7, 8, 8, 9
•Minimum: 2
•Q1: (4 + 5)/2 = 4.5 - Lower quartile
•Q2: (5 + 7)/2 = 6 - Middle quartile
•Q3: (8 + 8)/2 = 8 - Upper quartile
•Maximum: 9
•Interquartile Range = Q3 - Q1 = 8 - 4.5 = 3.5
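Both worked examples follow the median-of-halves rule, where the middle value is excluded from both halves when the length is odd. A Python sketch of that rule:

```python
def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method used in the examples above."""
    s = sorted(values)
    n = len(s)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    lower = s[: n // 2]            # excludes the middle value when n is odd
    upper = s[(n + 1) // 2 :]
    return med(lower), med(s), med(upper)

print(quartiles([8, 5, 2, 4, 8, 9, 5]))      # (4, 5, 8)       odd length
print(quartiles([8, 5, 2, 4, 8, 9, 5, 7]))   # (4.5, 6.0, 8.0) even length
```

Note that other conventions (e.g., the interpolation methods used by statistical libraries) can give slightly different quartile values for the same data.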

Five-Number Summary, Boxplots, and
Outliers
•The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations.
•It is written in the order: Minimum, Q1, Median, Q3, Maximum.

•How to Find a Five-Number Summary: Steps
•Step 1: Put your numbers in ascending order (from smallest to largest). For this particular data set, the order is: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
•Step 2: Find the minimum and maximum for your data set. Now that your numbers are in order, this should be easy to spot. In this example, the minimum (the smallest number) is 1 and the maximum (the largest number) is 27.
•Step 3: Find the median. The median is the middle number: here it is 9.

•Step 4: Place parentheses around the numbers above and below the median. (This is not technically necessary, but it makes Q1 and Q3 easier to find.)
•(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
•Step 5: Find Q1 and Q3. Q1 can be thought of as the median of the lower half of the data, and Q3 as the median of the upper half: (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
•Step 6: Write down your summary found in the above steps: minimum = 1, Q1 = 5, median = 9, Q3 = 18, and maximum = 27.
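The six steps above condense to a few lines of Python, using the same median-of-halves convention for Q1 and Q3:

```python
def five_number_summary(values):
    s = sorted(values)
    n = len(s)

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    # minimum, Q1, median, Q3, maximum
    return s[0], med(s[: n // 2]), med(s), med(s[(n + 1) // 2 :]), s[-1]

print(five_number_summary([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))
# (1, 5, 9, 18, 27)
```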

Box plot
•Boxplots are a popular way of visualizing a distribution.
•A boxplot incorporates the five-number summary as follows:
•Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range.
•The median is marked by a line within the box.
•Two lines (called whiskers) outside the box extend to the smallest (minimum) and largest (maximum) observations.

•When the median is in the middle of the box, and
the whiskers are about the same on both sides of
the box, then the distribution is symmetric.
•When the median is closer to the bottom of the
box, and if the whisker is shorter on the lower
end of the box, then the distribution is positively
skewed (skewed right).
•When the median is closer to the top of the box,
and if the whisker is shorter on the upper end of
the box, then the distribution is negatively
skewed (skewed left).

•Boxplots can be computed in O(n log n) time.
•An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

•Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
•(a) What is the mean of the data? What is the median?
•(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).

•(c) What is the midrange of the data?
•(d) Can you
find (roughly) the first quartile
(Q1) and the third quartile (Q3) of the data?
•(e) Give the five-number summary of the data.
•(f) Show a boxplot of the data.

Example
•11,22,20,14,29,8,35,27,13,49,10,24,17
•After sorting
•8,10,11,13,14,17,20,22,24,27,29,35,49
•Q1=(11+13)/2=12, min=8, max=49
•Q2=20
•Q3=(27+29)/2=28
•IQR=28-12=16
•Upper outlier fence = Q3+1.5*IQR = 28+1.5*16 = 52
•Lower outlier fence = Q1-1.5*IQR = 12-1.5*16 = -12
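This quartile and outlier-fence computation can be sketched in Python (quartiles taken as medians of the two halves, median excluded, as above):

```python
# Q1, Q3, IQR, and the 1.5*IQR outlier fences.
def iqr_fences(data):
    s = sorted(data)
    n = len(s)

    def median(vals):
        m = len(vals)
        mid = m // 2
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    q1 = median(s[:n // 2])
    q3 = median(s[(n + 1) // 2:])
    iqr = q3 - q1
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(iqr_fences([11, 22, 20, 14, 29, 8, 35, 27, 13, 49, 10, 24, 17]))
# → (12.0, 28.0, 16.0, -12.0, 52.0)
```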

•18,34,76,29,15,41,46,25,54,38,20,32,43,22
•(15,18,20,22,25,29,32),(34, 38,41,43,46,54,76)
•Q2=(32+34)/2=33
•Q1=22, Q3=43
•IQR=43-22=21
•Q1-1.5*IQR = 22-31.5 = -9.5
•Q3+1.5*IQR = 43+31.5 = 74.5

Variance and Standard Deviation
•Dispersion refers to how the objects are spread out over a range of values.
•Variance and standard deviation are measures of data dispersion.
•They indicate how spread out a data distribution is.
•Variance measures how far each number in the dataset is from the mean.

Standard deviation
•Standard deviation is the square root of the variance, which brings the measure back to the original units of the data.
•A low standard deviation means that the data
observations tend to be very close to the
mean.
•A high standard deviation indicates that
the data are spread out over a large range of
values.
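A minimal sketch of both measures (the sample values are illustrative, not from the slides):

```python
import math

# Population variance: mean of squared deviations from the mean.
# Standard deviation: square root of the variance.
def variance_and_std(data):
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return var, math.sqrt(var)

var, std = variance_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(var, std)
# → 4.0 2.0
```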

Graphic Displays of Basic Statistical
Descriptions of Data
•Basic statistical descriptions of data can be displayed graphically.
•These include quantile plots, quantile-quantile plots, histograms, and scatter plots.
•Such graphs are helpful for the visual
inspection of data, which is useful for data
preprocessing.

Find median

Proximity measure
•How alike or unlike are two objects?
•Applications:
•Clustering
•Outlier analysis
•Nearest neighbour

Types of attribute
•Nominal attributes
•Ordinal attributes
•Binary attributes
•Numerical attributes
•Mixed attributes

Measuring Data Similarity and
Dissimilarity
•Similarity and dissimilarity measures are referred to as measures of proximity.
•Similarity and dissimilarity are related.
•A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike.
•The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.)

•A dissimilarity measure works the opposite
way.
•It returns a value of 0 if the objects are the
same (and therefore, far from being
dissimilar).
•The higher the dissimilarity value, the more
dissimilar the two objects are.

Data Matrix versus Dissimilarity
Matrix
•Data matrix (or object-by-attribute structure):
•This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).

•Suppose that we have n objects (e.g., persons,
items, or courses) described by p attributes (also
called measurements or features, such as age,
height, weight, or gender).
•The objects are x1=(x11,x12,x13,…x1p),
•x2=(x21,x22,….x2p) and so on,
•where xij is the value for object xi of the jth
attribute.
•Each row corresponds to an object

Dissimilarity matrix (or object-by-
object structure):
This structure stores a collection of proximities
that are available for all pairs of n objects.
•It is often represented by an n-by-n table:
•where d(i, j) is the measured dissimilarity or
“difference” between objects i and j.
•In general, d(i, j) is a non-negative number
that is close to 0 when objects i and j are
highly similar or “near” each other
•and becomes larger the more they differ.

•Note:
•d(i, i) = 0; that is, the difference between an object and itself is 0.
•d(i, j) = d(j, i), so the matrix is symmetric.

•Measures of similarity can often be expressed
as a function of measures of dissimilarity.
•For example, for nominal data,
•sim(i, j) = 1 - d(i, j)
•where sim(i, j) is the similarity between objects i and j.

Proximity Measures for Nominal
Attributes
•For example, map color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
•Let the number of states of a nominal
attribute be M.
•The states can be denoted by letters, symbols,
or a set of integers, such as 1, 2, … , M.

•How is dissimilarity computed between objects
described by nominal attributes?
•The dissimilarity between two objects i and j
can be computed based on the ratio of
mismatches:

•where m is the number of matches (i.e., the number of attributes for which i and j are in the same state)
•and p is the total number of attributes describing the objects.

•Weights can be assigned to increase the effect
of m or to assign greater weight to the
matches in attributes having a larger number
of states.

•Here p = 1, the number of nominal attributes.
•m is the number of matches.
•d(2,1) = (p-m)/p = (1-0)/1 = 1
•d(3,1) = (1-0)/1 = 1
•d(4,1) = (1-1)/1 = 0
•d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
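The mismatch-ratio formula d(i, j) = (p - m)/p can be sketched as follows (the map-color values are illustrative):

```python
# Dissimilarity for nominal attributes: fraction of mismatching attributes.
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                   # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matches
    return (p - m) / p

# Single nominal attribute (p = 1), as in the example above.
print(nominal_dissimilarity(["red"], ["yellow"]))  # → 1.0 (mismatch)
print(nominal_dissimilarity(["red"], ["red"]))     # → 0.0 (match)
```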

Dissimilarity matrix


Similarity matrix


Example

Ordinal attribute
•There are three states for test-2: fair, good, and excellent; that is, Mf = 3.
•For step 1, if we replace each value for test-2 by
its rank, the four objects are assigned the ranks 3,
1, 2, and 3, respectively.
•Step 2 normalizes the ranking by mapping rank 1
to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
•For step 3, we can use, say, the Euclidean
distance.

•1. Find the number of states of the ordinal attribute (Mf = 3) and rank the values.
2. Normalize the ranks.
3. Find the distance between the objects.

•zif = (rif - 1)/(Mf - 1)
•fair (rank 1): (1-1)/(3-1) = 0
•good (rank 2): (2-1)/(3-1) = 0.5
•excellent (rank 3): (3-1)/(3-1) = 1
•Manhattan distance = |x1-y1| + |x2-y2|
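The three steps for ordinal attributes can be sketched as follows: rank the states, normalize with z = (r - 1)/(M - 1), then apply a numeric distance to the normalized values:

```python
# Ordered states of the ordinal attribute test-2; M = 3.
states = ["fair", "good", "excellent"]
rank = {s: i + 1 for i, s in enumerate(states)}  # fair→1, good→2, excellent→3
M = len(states)

def normalize(state):
    # Map rank r onto [0, 1]: z = (r - 1) / (M - 1).
    return (rank[state] - 1) / (M - 1)

print(normalize("fair"), normalize("good"), normalize("excellent"))
# → 0.0 0.5 1.0

# Manhattan distance between two objects on this single attribute:
print(abs(normalize("excellent") - normalize("good")))
# → 0.5
```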

Binary attribute
•It can be either symmetric or asymmetric
•A binary attribute is symmetric if both of its
states are equally valuable and carry the same
weight; that is, there is no preference on which
outcome should be coded as 0 or 1. (Example
male and female)
•A binary attribute is asymmetric if the outcomes
of the states are not equally important,
•such as the positive and negative outcomes of a
medical test for HIV

Binary attribute
•We can compute similarity and dissimilarity matrices for both symmetric and asymmetric binary attributes.

•q = m11
•r = m10
•s = m01
•t = m00

•If all binary attributes are thought of as having the same weight, we have the 2×2 contingency table,
•where q is the number of attributes that equal 1 for both
objects i and j,
•r is the number of attributes that equal 1 for object i but
equal 0 for object j,
•s is the number of attributes that equal 0 for object i but
equal 1 for object j,
•and t is the number of attributes that equal 0 for both
objects i and j.
•The total number of attributes is p, where p = q + r + s + t.

Similarity
•For symmetric binary attributes we use the simple matching coefficient.
•Simple matching coefficient (SMC):
•sim(i, j) = (m11+m00)/(m11+m10+m01+m00)

Similarity
•For asymmetric binary attributes we use the Jaccard coefficient.
•sim(i, j) = q/(q+r+s) = m11/(m11+m01+m10)

Dissimilarity
•Dissimilarity matrices can likewise be computed for symmetric and asymmetric binary attributes.

symmetric binary

dissimilarity matrix for asymmetric
binary

asymmetric binary

Example

For dissimilarity symmetric binary

Dissimilarity( Symmetric binary)
•d(jack, jim) = (0+1)/(2+0+1+3) = 1/6
•d(mary, jack) = (1+1)/(1+2+1+2) = 2/6
•d(mary, jim) = (2+1)/(1+2+1+2) = 3/6

Dissimilarity( Symmetric binary)

Binary (similarity)
•1. symmetric
•Symmetric binary (simple matching coefficient):
•sim(i, j) = (m11+m00)/(m11+m10+m01+m00)

•X=1,0,0,0,0,0,0,0,0,0
•Y=0,0,0,0,0,0,1,0,0,1
•SMC(x, y) = (0+7)/(0+1+2+7) = 7/10 = 0.7
•The similarity between x and y is 70%.
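The simple matching coefficient for these two vectors can be sketched as follows:

```python
# Simple matching coefficient: matches (1-1 and 0-0) over all attributes.
def smc(x, y):
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    m00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return (m11 + m00) / (m11 + m10 + m01 + m00)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y))
# → 0.7
```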

Similarity for asymmetric binary
Jaccard coefficient

•Sim(i,j)=q/(q+r+s)=m11/(m11+m01+m10)
•For the asymmetric binary case of the vectors above: sim = 0/(0+1+2) = 0

•A supermarket carries 1000 products.
•C1={sugar, coffee, tea, rice, egg}
•C2={sugar, coffee, bread, biscuit}
•How similar are customer 1 and customer 2?
•m11=2 (items present for both customers): {sugar, coffee}
•m10=3 (items present for customer 1 but not customer 2): {tea, rice, egg}
•m01=2 (items present for customer 2 but not customer 1): {bread, biscuit}

•m00 = items present for neither customer
•= total items - (m11+m10+m01)
•= 1000 - 7 = 993
•Jaccard coefficient = m11/(m11+m10+m01) = 2/(2+3+2) = 2/7 ≈ 0.29
•SMC = (m11+m00)/(m11+m10+m01+m00) = (2+993)/(2+3+2+993) = 0.995
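The supermarket example can be sketched with sets; it shows why Jaccard suits asymmetric data (it ignores the 993 products neither customer bought, while SMC is inflated by them):

```python
# Market-basket similarity for two customers out of 1000 products.
c1 = {"sugar", "coffee", "tea", "rice", "egg"}
c2 = {"sugar", "coffee", "bread", "biscuit"}
total_products = 1000

m11 = len(c1 & c2)                         # bought by both → 2
m10 = len(c1 - c2)                         # only customer 1 → 3
m01 = len(c2 - c1)                         # only customer 2 → 2
m00 = total_products - (m11 + m10 + m01)   # bought by neither → 993

jaccard = m11 / (m11 + m10 + m01)
smc = (m11 + m00) / total_products
print(round(jaccard, 3), round(smc, 3))
# → 0.286 0.995
```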


Mixed attribute

For test1

For test1

For test-2

Numerical data
•Normalize the data so values fall between 0 and 1.
•d(i, j) = |xif - xjf| / (max - min)
•d(2,1)=|22-45|/(64-22)=23/42=0.55
•d(3,1)=|64-45|/(64-22)=19/42=0.45
•d(3,2)=|64-22|/(64-22)=42/42=1.00
•d(4,2)=|28-22|/(64-22)=6/42=0.14
•d(4,3)=|28-64|/(64-22)=36/42=0.86
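A sketch of this range-normalized dissimilarity, assuming the four objects have the attribute values 45, 22, 64, 28 (the slide's table is not shown, so these values are inferred from the calculations above):

```python
# Range-normalized dissimilarity: d(i, j) = |x_i - x_j| / (max - min).
values = [45, 22, 64, 28]          # assumed attribute values for objects 1-4
rng = max(values) - min(values)    # 64 - 22 = 42

def d(i, j):                       # 1-based object indices
    return abs(values[i - 1] - values[j - 1]) / rng

print(round(d(2, 1), 2), round(d(3, 1), 2), round(d(4, 2), 2), round(d(4, 3), 2))
# → 0.55 0.45 0.14 0.86
```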

Final dissimilarity matrix

How to put it all together
•d(i, j) = [ Σf=1..p δij(f) · dij(f) ] / [ Σf=1..p δij(f) ]
•where the indicator δij(f) = 0 if (1) xif or xjf is missing, or (2) xif = xjf = 0 and attribute f is asymmetric binary; otherwise δij(f) = 1.
•d(2,1) = (1*1 + 1*1 + 1*0.55)/3 = 2.55/3 = 0.85
•d(3,1) = (1*1 + 1*0.5 + 1*0.45)/3 = 1.95/3 = 0.65
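The weighted combination can be sketched as follows (the per-attribute dissimilarities are the illustrative values from the slides; the indicator is 1 for every attribute since nothing is missing):

```python
# Combined dissimilarity for mixed attributes: weighted average of the
# per-attribute dissimilarities, with delta = 0 for missing values.
def combined_dissimilarity(per_attr_d, deltas):
    num = sum(d * delta for d, delta in zip(per_attr_d, deltas))
    den = sum(deltas)
    return num / den

# d(2,1): nominal d=1, ordinal d=1, numeric d=0.55, all attributes present.
print(round(combined_dissimilarity([1.0, 1.0, 0.55], [1, 1, 1]), 2))
# → 0.85
```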

Final matrix

Cosine Similarity
•A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
•Thus, each document is an object represented
by what is called a term-frequency vector.

•Term-frequency vectors are typically very long
and sparse (i.e., they have many 0 values).
•Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping.

•The traditional distance measures that we
have studied in this chapter do not work well
for such sparse numeric data.

Cosine similarity
•Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words.

•Let x and y be two vectors for comparison.
Using the cosine measure as a similarity
function, we have

•The measure computes the cosine of the
angle between vectors x and y. A cosine value
of 0 means that the two vectors are at 90
degrees to each other (orthogonal) and have
no match.
•The closer the cosine value to 1, the smaller
the angle and the greater the match between
vectors.

Cosine similarity and distance
•Suppose we have two points p1 and p2.
•If distance between p1 and p2 increases the
similarity decreases.
•If distance between p1 and p2 decreases the
similarity increases.
•1- cosine similarity = cosine distance.

•Cosine similarity measures the similarity between two objects by the angle between them.
•Cosine similarity = cos(θ), where θ is the angle between objects p1 and p2.
•Cosine similarity ranges between -1 and 1.

•The larger the angle between them, the less similar they are.
•The smaller the angle between them, the more similar they are.
•cos 0° = 1: maximum similarity.
•cos 90° = 0: no similarity.
•Suppose the angle between objects p1 and p2 is 45 degrees.

•Cosine similarity = cos 45° ≈ 0.71
•About 71% similarity between p1 and p2.

•cos 90° = 0
•There is no similarity.

The angle is 0 (more similar)

Example
•Cosine similarity between two term-frequency vectors. Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? To compute the cosine similarity between the two vectors,
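The computation for these two vectors can be sketched directly from the definition (dot product divided by the product of the vector lengths):

```python
import math

# Cosine similarity: cos(theta) = (x . y) / (||x|| * ||y||).
def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(x, y), 2))
# → 0.94
```

So x and y are quite similar despite their differing magnitudes, which is exactly why cosine similarity suits sparse term-frequency data.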

Euclidean distance (L2 norm)

Euclidean distance

Manhattan distance (L1 norm)

Minkowski distance

Supremum distance

•Given two objects represented by the tuples (22,
1, 42, 10) and (20, 0, 36, 8):
•(a) Compute the Euclidean distance between the
two objects.
•(b) Compute the Manhattan distance between the
two objects.
•(c) Compute the Minkowski distance between the two objects, using q = 3.

Assignment
•(d) Compute the supremum distance between
the two objects
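The four distances for these tuples can be sketched directly from their definitions:

```python
import math

# Distances between the two assignment tuples.
x = [22, 1, 42, 10]
y = [20, 0, 36, 8]

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # L2 norm
manhattan = sum(abs(a - b) for a, b in zip(x, y))                # L1 norm
minkowski_q3 = sum(abs(a - b) ** 3 for a, b in zip(x, y)) ** (1 / 3)  # q = 3
supremum = max(abs(a - b) for a, b in zip(x, y))                 # L-infinity

print(round(euclidean, 3), manhattan, round(minkowski_q3, 3), supremum)
# → 6.708 11 6.153 6
```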

Data Preprocessing
•Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
• Low-quality data will lead to low-quality
mining results.

Data Preprocessing
•There are several data preprocessing
techniques.
•Data cleaning can be applied to remove noise
and correct inconsistencies in data.
• Data integration merges data from
•multiple sources into a coherent data store
such as a data warehouse.

•Data reduction can reduce data size by, for
instance, aggregating, eliminating redundant
features, or clustering.
•Data transformations (e.g., normalization) may
be applied, where data are scaled to fall within a
smaller range like 0.0 to 1.0.
•This can improve the accuracy and efficiency of
mining algorithms involving distance
measurements.

•Three elements define data quality: accuracy, completeness, and consistency.
•Inaccurate, incomplete, and inconsistent
data are commonplace properties of large
real-world databases and data warehouses.

Factors affecting data quality
•Timeliness also affects data quality.
•If month-end data are not updated in a timely fashion, data quality suffers.

•Two other factors affecting data quality are
believability and interpretability.
•Believability reflects how much the data are trusted by users.
•Interpretability reflects how easily the data are understood.

Example
•Suppose that a database, at one point, had several errors, all of which have since been corrected.
•The past errors, however, had caused many problems for sales department users, and so they no longer trust the data.
•The data also use many accounting codes, which the sales department does not know how to interpret.

Data Cleaning
•Real-world data tend to be incomplete, noisy,
and inconsistent.
•Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Missing Values
•1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification).
•This method is not very effective, unless the tuple contains several attributes with missing values.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as a label like “Unknown” or -∞.

•4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
•5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
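Strategies 3 and 4 can be sketched in a few lines of Python (the data values are illustrative, not from the slides):

```python
# Missing-value handling: global constant vs. attribute mean.
data = [13, 15, None, 19, 20, None, 25]   # None marks a missing value

# Strategy 3: fill with a global constant.
filled_constant = [x if x is not None else "Unknown" for x in data]

# Strategy 4: fill with the attribute mean, computed over known values.
known = [x for x in data if x is not None]
mean = sum(known) / len(known)
filled_mean = [x if x is not None else mean for x in data]

print(filled_mean)
# → [13, 15, 18.4, 19, 20, 18.4, 25]
```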

•https://t4tutorials.com/what-are-quartiles-in-
data-mining/