Unit-I Objects,Attributes,Similarity&Dissimilarity.ppt

subhashchandra197 257 views 48 slides May 09, 2024



Slide Content

Data science
UNIT-I
Data objects and Attribute types
Similarity and Dissimilarity
1

Types of Data Sets
•Record
•Relational records
•Data matrix, e.g., numerical matrix, crosstabs
•Document data: text documents: term-frequency vector
•Transaction data
•Graph and network
•World Wide Web
•Social or information networks
•Molecular Structures
•Ordered
•Video data: sequence of images
•Temporal data: time-series
•Sequential Data: transaction sequences
•Genetic sequence data
•Spatial, image and multimedia:
•Spatial data: maps
•Image data:
•Video data:

Document-term matrix (term frequencies):
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Transaction data:
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
2

Important characteristics of Structured data
•Dimensionality
•Curse of dimensionality
•Sparsity
•Only presence counts
•Resolution
•Patterns depend on the scale
•Distribution
•Centrality and dispersion
3
Data: 1. Structured
2. Unstructured
3. Semi-structured
4. Quasi-structured

4

Data Objects
•Data sets are made up of data objects.
•A data object represents an entity.
•Examples:
•sales database: customers, store items, sales
•medical database: patients, treatments
•university database: students, professors, courses
•Also called samples, examples, instances, data points, objects, tuples.
•Data objects are described by attributes.
•Database rows -> data objects; columns -> attributes.
5

Attributes
•Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.
•E.g., customer_ID, name, address
•Types:
•Nominal
•Binary
•Ordinal
•Discrete and continuous
•Numeric: quantitative
•Interval-scaled
•Ratio-scaled
6

Types of attributes
7

8
Nominal Attributes: related to names

9

10

11

Similarity and Dissimilarity
Proximity: refers to a similarity or dissimilarity
12
Similarity

Similarity and Dissimilarity Applications:
•Web search
•Computer vision
•Image processing
•Natural language processing
•Clustering and outlier detection
13

Similarity / Proximity Measures
14
•Nominal attributes
•Binary attributes
•Ordinal attributes
•Numeric attributes: 1. Interval-scaled 2. Ratio-scaled
•Mixed attributes

Data Matrix and Dissimilarity Matrix
15
Data Matrix

Proximity Measure for Nominal Attributes
16

Example: Nominal attribute
17

Example: Nominal attribute
18

Example: Finding Similarity matrix for Nominal attribute
19

Example : find dissimilarity of given data
20

Proximity Measure for Binary Attributes
21

Proximity Measure for Binary Attributes
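•A contingency table for binary data, where q = number of attributes equal to 1 for both objects, r = number equal to 1 for object i and 0 for object j, s = number equal to 0 for object i and 1 for object j, t = number equal to 0 for both:

                 Object j
                 1      0      sum
Object i    1    q      r      q+r
            0    s      t      s+t
          sum    q+s    r+t    p

•Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
•Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
•Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Note: Jaccard coefficient is the same as “coherence”.
22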

Proximity Measure for Binary Attributes: Example
23

Proximity Measure for Binary Attributes
24

•Example
•Gender is a symmetric attribute
•The remaining attributes are asymmetric binary
•Let the values Y and P be 1, and the value N 0
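25
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
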
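The three distances above can be checked with a short Python sketch. It assumes the asymmetric binary distance (r + s) / (q + r + s), where q counts attributes that are 1 in both objects, r those that are 1 only in the first, and s those that are 1 only in the second. The function names and the `patients` dictionary are illustrative, not from the slides; Gender is left out because it is symmetric.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1 in both
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 only in a
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 only in b
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # 0 in both
    return q, r, s, t


def asymmetric_binary_distance(a, b):
    """(r + s) / (q + r + s); 0/0 matches (t) are ignored."""
    q, r, s, _ = binary_counts(a, b)
    denom = q + r + s
    # Assumption: two all-zero vectors are treated as distance 0.
    return (r + s) / denom if denom else 0.0


# Fever, Cough, Test-1 .. Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim": [1, 1, 0, 0, 0, 0],
}

for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_distance(patients[first], patients[second])
    print(first, second, round(d, 2))
```

The printed values should agree with 0.33, 0.67 and 0.75 from the example.
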
Proximity Measure for Binary Attributes: Example: Find Dissimilarity matrix

Proximity Measure for Numeric Attributes: Standardizing Numeric Data
•Z-score: $z = \frac{x - \mu}{\sigma}$
•X: raw score to be standardized, μ: mean of the population, σ: standard deviation
•the distance between the raw score and the population mean in units of the standard deviation
•negative when the raw score is below the mean, “+” when above
•An alternative way: Calculate the mean absolute deviation
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
•standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
•Using mean absolute deviation is more robust than using standard deviation
26
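As a rough illustration of the two standardizations, the sketch below computes z-scores once with the population standard deviation and once with the mean absolute deviation. The function names and the `sample` values are made up for the example, and the code assumes the values are not all identical (otherwise the divisor is zero).

```python
import math


def z_scores_std(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(z_scores_std(sample))
print(z_scores_mad(sample))
```

Because the deviations are not squared in the mean absolute deviation, a single extreme value inflates s_f less than it inflates σ, so outliers keep a larger z-score and stay easier to spot.
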

Data Matrix and Dissimilarity Matrix
27
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Distance on Numeric Data: Minkowski Distance
•Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
•Properties
•d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
•d(i, j) = d(j, i) (Symmetry)
•d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
•A distance that satisfies these properties is a metric
28

Special Cases of Minkowski Distance
•h = 1: Manhattan (city block, $L_1$ norm) distance
•E.g., the Hamming distance: the number of bits that are different between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
•h = 2: Euclidean ($L_2$ norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
•h → ∞: “supremum” ($L_{max}$ norm, $L_\infty$ norm) distance
•This is the maximum difference between any component (attribute) of the vectors
29
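The special cases can be sketched with one small function. The name `minkowski` and the vectors `u` and `v` are illustrative assumptions. The order h = ∞ is handled separately as the maximum component difference, and a large finite h is shown approaching that value.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# Hypothetical binary vectors: with h = 1 the result is the Hamming distance.
u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]

print(minkowski(u, v, 1))         # number of differing bits
print(minkowski(u, v, 2))         # Euclidean distance
print(minkowski(u, v, math.inf))  # largest single-component difference

# A large finite order is already very close to the supremum distance.
print(minkowski((1, 2), (3, 5), 50), minkowski((1, 2), (3, 5), math.inf))
```
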

Example: Minkowski Distance
30
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity Matrices

Manhattan ($L_1$)
L     x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean ($L_2$)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
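The three matrices can be reproduced with a few lines of Python. This is a sketch rather than a reference implementation: `distance` and `dissimilarity_matrix` are illustrative names, and only the lower triangle is built, following the layout of the tables.

```python
import math

# The four points from the example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def distance(p, q, h):
    """Minkowski distance of order h (math.inf selects the supremum distance)."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def dissimilarity_matrix(data, h):
    """Lower-triangular dissimilarity matrix, rounded to two decimals."""
    names = list(data)
    return [
        [round(distance(data[a], data[b], h), 2) for b in names[: i + 1]]
        for i, a in enumerate(names)
    ]


for label, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", math.inf)]:
    print(label)
    for name, row in zip(points, dissimilarity_matrix(points, h)):
        print(name, *row)
```
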

Example : Numeric attribute: Find the distances
31

Example: Numeric Attribute
32

Example: Ratio-Scaled Attributes: Find Dissimilarity matrix
33

Example: find Dissimilarity of given Numeric data
34

Proximity measure for Ordinal attributes
•An ordinal variable can be discrete or continuous
•Order is important, e.g., rank
•Can be treated like interval-scaled
•replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
•map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
•compute the dissimilarity using methods for interval-scaled variables
35
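The rank mapping can be sketched as follows. The level names and the `grades` list are hypothetical values chosen for illustration, and the code assumes the attribute has at least two ordered levels so that M_f − 1 is not zero. After the mapping, the absolute difference serves as the interval-scaled distance for a single attribute.

```python
def ordinal_to_unit_interval(values, ordered_levels):
    """Map ordinal values to z_if = (r_if - 1) / (M_f - 1) in [0, 1]."""
    rank = {level: i + 1 for i, level in enumerate(ordered_levels)}  # r_if in 1..M_f
    m_f = len(ordered_levels)
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal attribute with M_f = 3 levels, lowest first.
levels = ["fair", "good", "excellent"]
grades = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(grades, levels)
print(z)

# Lower-triangular dissimilarity matrix on the mapped values.
matrix = [[abs(a - b) for b in z[: i + 1]] for i, a in enumerate(z)]
for row in matrix:
    print(row)
```
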

Example: Ordinal Attributes
36

Example: Ordinal Attributes
37

Example: Find Dissimilarity matrix of given ordinal attribute
38

Proximity measures of Mixed Attributes
•A database may contain all attribute types
•Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
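•One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
where the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary, and 1 otherwise
•f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
•f is numeric: use the normalized distance
•f is ordinal
•Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
•Treat $z_{if}$ as interval-scaled
39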
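A minimal sketch of the combined formula is given below. Several things here are assumptions made for illustration: the function name, the `kinds` labels, the use of `None` for a missing value, the numeric normalization |x_if − x_jf| / (max − min) with the range passed in by the caller, and ordinal values supplied already mapped to [0, 1]. The two objects are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average the per-attribute distances d over attributes whose indicator is 1."""
    weighted_sum = 0.0  # sum of indicator * d
    indicator_sum = 0.0  # sum of indicators
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue  # missing value: indicator = 0
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # 0/0 match on an asymmetric binary attribute: indicator = 0
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / ranges[f]  # assumed normalization by (max - min)
        elif kind == "ordinal":
            d = abs(a - b)  # values already mapped to z in [0, 1]
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        weighted_sum += d
        indicator_sum += 1.0
    return weighted_sum / indicator_sum if indicator_sum else 0.0


# Hypothetical objects: one nominal, one ordinal (as z), one numeric attribute.
kinds = ["nominal", "ordinal", "numeric"]
ranges = [None, None, 42.0]  # numeric attribute assumed to span 22..64
obj_1 = ("code-A", 1.0, 45)
obj_2 = ("code-B", 0.0, 22)

print(round(mixed_dissimilarity(obj_1, obj_2, kinds, ranges), 2))
```

Skipping an attribute both in the numerator and in the denominator is what keeps a missing value from pulling the average toward 0.
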

40
Example: Find Dissimilarity of Mixed Attributes

41
Example: Find Dissimilarity of Mixed Attributes

42
Example: Find Dissimilarity of Mixed Attributes

Example: Mixed Attributes
43

Example : Mixed Attributes
44

Example: Mixed Attributes
45

Example: Mixed Attributes
46

Cosine Similarity
•A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
•Other vector objects: gene features in micro-arrays, …
•Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
•Cosine measure: If $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates vector dot product, $\|d\|$: the length of vector d
47

Example: Find similarity of documents
•$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates vector dot product, $\|d\|$: the length of vector d
•Ex: Find the similarity between documents 1 and 2.
$d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$
$d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$
$d_1 \cdot d_2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25$
$\|d_1\| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^{0.5} = (42)^{0.5} = 6.481$
$\|d_2\| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^{0.5} = (17)^{0.5} = 4.12$
$\cos(d_1, d_2) = \frac{25}{6.481 \times 4.12} = 0.94$
48
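The worked example can be checked with a short Python sketch. The function name is illustrative, and the code assumes neither vector is all zeros (otherwise the denominator is zero).

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Term-frequency vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```
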