subhashchandra197
May 09, 2024
Data science
UNIT-I
Data objects and Attribute types
Similarity and Dissimilarity
Types of Data Sets
•Record
•Relational records
•Data matrix, e.g., numerical matrix, crosstabs
•Document data: text documents: term-frequency vector
•Transaction data
•Graph and network
•World Wide Web
•Social or information networks
•Molecular Structures
•Ordered
•Video data: sequence of images
•Temporal data: time-series
•Sequential Data: transaction sequences
•Genetic sequence data
•Spatial, image and multimedia:
•Spatial data: maps
•Image data:
•Video data:

Document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
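As a rough illustration of the document and transaction examples on this slide, the sketch below builds a term-frequency vector over the ten column terms of the document table and stores the five transactions as sets. The function name `term_frequency_vector`, the whitespace tokenization, and the toy sentence are assumptions made for illustration only; they are not part of the slides.

```python
from collections import Counter

# Column terms of the document-term table on the slide.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical toy document (not taken from the slides).
doc = "team play play ball score game game win"
vector = term_frequency_vector(doc)

# Transaction data from the slide: each transaction ID maps to a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain Milk.
milk_count = sum("Milk" in items for items in transactions.values())
```

Most entries of a term-frequency vector are zero for realistic vocabularies, which is the sparsity property mentioned on the next slide.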
Important characteristics of Structured data
•Dimensionality
•Curse of dimensionality
•Sparsity
•Only presence counts
•Resolution
•Patterns depend on the scale
•Distribution
•Centrality and dispersion
Data:
1. Structured
2. Unstructured
3. Semi-structured
4. Quasi-structured
Data Objects
•Data sets are made up of data objects.
•A data object represents an entity.
•Examples:
•sales database: customers, store items, sales
•medical database: patients, treatments
•university database: students, professors, courses
•Also called samples, examples, instances, data points, objects, tuples.
•Data objects are described by attributes.
•Database rows -> data objects; columns -> attributes.
Attributes
•Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
•E.g., customer_ID, name, address
•Types:
•Nominal
•Binary
•Ordinal
•Discrete and continuous
•Numeric: quantitative
•Interval-scaled
•Ratio-scaled
Types of attributes
Nominal Attributes: related to names
Similarity and Dissimilarity
Proximity refers to either a similarity or a dissimilarity measure.
Similarity
Similarity and Dissimilarity Applications:
•Web search
•Computer vision
•Image processing
•Natural language processing
•Clustering and outlier detection
Data Matrix and Dissimilarity Matrix
Data Matrix
Proximity Measure for Nominal Attributes
Example: Nominal attribute
Example: Finding the similarity matrix for a nominal attribute
Example: Find the dissimilarity of the given data
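Proximity Measure for Binary Attributes
•A contingency table for binary data, where q, r, s, and t count the attributes on which objects i and j take each combination of values, and p = q + r + s + t:

                 Object j
                 1       0       sum
Object i   1     q       r       q + r
           0     s       t       s + t
           sum   q + s   r + t   p

•Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
•Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
•Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Note: Jaccard coefficient is the same as “coherence”: $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$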
Proximity Measure for Binary Attributes: Example
•Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

•Gender is a symmetric attribute
•The remaining attributes are asymmetric binary
•Let the values Y and P be 1, and the value N be 0
•Distances computed over the asymmetric binary attributes:
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
Proximity Measure for Binary Attributes: Example: Find Dissimilarity Matrix
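The dissimilarity matrix for the Jack, Mary, and Jim example can be sketched as follows. The sketch assumes the usual contingency counts: q is the number of attributes where both objects are 1, r where only the first is 1, s where only the second is 1, and t where both are 0. The function names and the zero-division guard are illustrative choices, and Gender is left out because the slide treats it as symmetric.

```python
def contingency(a, b):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_binary_distance(a, b):
    """(r + s) / (q + r + s + t): 0-0 matches count as agreement."""
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_distance(a, b):
    """(r + s) / (q + r + s): 0-0 matches are ignored."""
    q, r, s, _ = contingency(a, b)
    total = q + r + s
    # Assumption: two all-zero vectors are treated as distance 0.
    return (r + s) / total if total else 0.0


# Fever, Cough, Test-1 .. Test-4 with Y/P mapped to 1 and N mapped to 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

names = list(patients)
matrix = {
    (a, b): round(asymmetric_binary_distance(patients[a], patients[b]), 2)
    for a in names
    for b in names
}

for a in names:
    print(a, [matrix[(a, b)] for b in names])
```

The off-diagonal entries reproduce the three distances on the slide, and the diagonal is 0 because every object matches itself.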
Proximity Measure for Numeric Attributes: Standardizing Numeric Data
•Z-score: $z = \frac{x - \mu}{\sigma}$
•x: raw score to be standardized, μ: mean of the population, σ: standard deviation
•the distance between the raw score and the population mean in units of the standard deviation
•negative when the raw score is below the mean, “+” when above
•An alternative way: Calculate the mean absolute deviation
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
•standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
•Using mean absolute deviation is more robust than using standard deviation
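Both standardizations on this slide can be compared with a minimal sketch like the one below. The function names and the sample values are assumptions for illustration; the population standard deviation is used for σ.

```python
import statistics


def z_scores_std(values):
    """Standardize with the population mean and population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    m = statistics.fmean(values)
    s = sum(abs(x - m) for x in values) / len(values)
    return [(x - m) / s for x in values]


# Hypothetical attribute values (mean 5, standard deviation 2, mean absolute deviation 1.5).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_std = z_scores_std(data)
z_mad = z_scores_mad(data)
```

Because the deviations are not squared, the mean absolute deviation is inflated less by an extreme value, so that value keeps a larger standardized score and stays easier to spot.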
Distance on Numeric Data: Minkowski Distance
•Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
•Properties
•d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
•d(i, j) = d(j, i) (Symmetry)
•d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
•A distance that satisfies these properties is a metric
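A minimal sketch of the Minkowski distance, together with a spot check of the three metric properties on a few points, might look like this. The function name `minkowski`, the sample points, and the small tolerance in the triangle-inequality check are assumptions for illustration; the check only exercises orders h ≥ 1.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


# Hypothetical 2-dimensional data objects.
i, j, k = (1.0, 2.0), (3.0, 5.0), (2.0, 0.0)

for h in (1, 2, 3):
    # Positive definiteness
    assert minkowski(i, i, h) == 0
    assert minkowski(i, j, h) > 0
    # Symmetry
    assert minkowski(i, j, h) == minkowski(j, i, h)
    # Triangle inequality (tolerance absorbs floating-point rounding)
    assert minkowski(i, j, h) <= minkowski(i, k, h) + minkowski(k, j, h) + 1e-12

d1 = minkowski(i, j, 1)
d2 = minkowski(i, j, 2)
```

A passing check on sample points is not a proof of the properties; it only illustrates what each one states.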
Special Cases of Minkowski Distance
•h = 1: Manhattan (city block, $L_1$ norm) distance
•E.g., the Hamming distance: the number of bits that are different between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
•h = 2: Euclidean ($L_2$ norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
•h → ∞: “supremum” ($L_{max}$ norm, $L_\infty$ norm) distance
•This is the maximum difference between any component (attribute) of the vectors
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$
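The three special cases can be written directly, without the general h-th root. This is a sketch under assumed names (`manhattan`, `euclidean`, `supremum`, `hamming`) and with hypothetical sample vectors; it also shows that, for 0/1 vectors, the Manhattan distance coincides with the Hamming distance.

```python
import math


def manhattan(x, y):
    """h = 1: sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """h = 2: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """h -> infinity: largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions in which two vectors differ."""
    return sum(a != b for a, b in zip(x, y))


# Hypothetical numeric objects and binary vectors.
x1, x2 = (1, 2), (3, 5)
bits_a, bits_b = (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)

print(manhattan(x1, x2), euclidean(x1, x2), supremum(x1, x2))
print(hamming(bits_a, bits_b), manhattan(bits_a, bits_b))
```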
Example: Find the dissimilarity of the given numeric data
Proximity measure for Ordinal attributes
•An ordinal variable can be discrete or continuous
•Order is important, e.g., rank
•Can be treated like interval-scaled
•replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
•map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
•compute the dissimilarity using methods for interval-scaled variables
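The rank-and-rescale procedure for an ordinal attribute can be sketched as below. The state names, their ordering, the four sample objects, and the use of the absolute difference as the interval-scaled distance are assumptions for illustration; the sketch also assumes the attribute has at least two states so that M_f - 1 is nonzero.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m = len(ordered_states)
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with M = 3 ordered states.
states = ["fair", "good", "excellent"]
grades = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(grades, states)

# Dissimilarity matrix: absolute difference of the rescaled values.
dissimilarity = [[abs(a - b) for b in z] for a in z]

for row in dissimilarity:
    print(row)
```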
Example: Ordinal Attributes
Example: Find the dissimilarity matrix of the given ordinal attribute
Proximity measures of Mixed Attributes
•A database may contain all attribute types
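•Nominal, symmetric binary, asymmetric binary, numeric, ordinal
•One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
where the indicator $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$
•f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
•f is numeric: use the normalized distance
•f is ordinal
•Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
•Treat $z_{if}$ as interval-scaled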
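One way to sketch the weighted combination for mixed attribute types is shown below. Several details are assumptions rather than statements from the slides: the numeric distance is normalized by the attribute's range (max minus min), missing values are encoded as `None`, an attribute is skipped (weight 0) when a value is missing or when both values are 0 on an asymmetric binary attribute, and the schema format, function name, and sample objects are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, schema):
    """Average the per-attribute dissimilarities over the attributes that count.

    schema is a list of (kind, param) pairs:
      ("nominal", None), ("symmetric_binary", None), ("asymmetric_binary", None),
      ("numeric", value_range), ("ordinal", number_of_states)
    Ordinal values are given as ranks 1..M.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, (kind, param) in zip(obj_i, obj_j, schema):
        if x is None or y is None:
            continue  # missing value: this attribute gets weight 0
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue  # 0-0 match on an asymmetric attribute gets weight 0
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / param  # assumed normalization by the range
        elif kind == "ordinal":
            z_x = (x - 1) / (param - 1)
            z_y = (y - 1) / (param - 1)
            d = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        weighted_sum += d
        weight_total += 1.0
    if weight_total == 0:
        raise ValueError("no comparable attributes")
    return weighted_sum / weight_total


# Hypothetical objects: (nominal code, ordinal rank out of 3, numeric value).
objects = [("A", 3, 45), ("B", 1, 22), ("C", 2, 64), ("A", 3, 28)]
schema = [("nominal", None), ("ordinal", 3), ("numeric", 64 - 22)]

d21 = mixed_dissimilarity(objects[1], objects[0], schema)
d41 = mixed_dissimilarity(objects[3], objects[0], schema)
```

Objects 1 and 4 agree on the nominal and ordinal attributes, so their dissimilarity comes only from the numeric attribute, averaged over three attributes.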
Example: Find Dissimilarity of Mixed Attributes
Example: Mixed Attributes
Cosine Similarity
•A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
•Other vector objects: gene features in micro-arrays, …
•Applications: information retrieval, biological taxonomy, gene feature mapping, ...
•Cosine measure: If $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
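The cosine measure can be sketched in a few lines. The vectors below are the Document 1 and Document 2 rows of the term-frequency table from the "Types of Data Sets" slide; the function name and the choice to raise an error for zero-length vectors are assumptions for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Term-frequency vectors over: team, coach, play, ball, score,
# game, win, lost, timeout, season.
document_1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
document_2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]

similarity = cosine_similarity(document_1, document_2)
print(round(similarity, 2))
```

The two documents share only the terms "score" and "lost", so the dot product is small relative to the vector lengths and the similarity is close to 0.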