Unit-I Objects,Attributes,Similarity&Dissimilarity.ppt

Data science
UNIT-I
Data objects and Attribute types
Similarity and Dissimilarity

Types of Data Sets
•Record
  •Relational records
  •Data matrix, e.g., numerical matrix, crosstabs
  •Document data: text documents represented as term-frequency vectors
  •Transaction data
•Graph and network
  •World Wide Web
  •Social or information networks
  •Molecular structures
•Ordered
  •Video data: sequence of images
  •Temporal data: time series
  •Sequential data: transaction sequences
  •Genetic sequence data
•Spatial, image and multimedia
  •Spatial data: maps
  •Image data
  •Video data

Document data (term-frequency vectors)
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Transaction data
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
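To make the document-data row concrete, here is a minimal Python sketch that turns a text into a term-frequency vector over the ten terms of the document example. The sample sentence, the whitespace tokenizer, and the name `term_frequency_vector` are assumptions for illustration, not part of the slides.

```python
from collections import Counter

# The ten terms of the slide's document example, used as a fixed vocabulary.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocabulary]


# Hypothetical sentence, not taken from the slides.
doc = "the team lost the game but the coach praised the team"
print(term_frequency_vector(doc))
```

Each document becomes one row of a data matrix, and each vocabulary term becomes one attribute (column).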

Important Characteristics of Structured Data
•Dimensionality
  •Curse of dimensionality
•Sparsity
  •Only presence counts
•Resolution
  •Patterns depend on the scale
•Distribution
  •Centrality and dispersion

Data:
1. Structured
2. Unstructured
3. Semi-structured
4. Quasi-structured

Data Objects
•Data sets are made up of data objects.
•A data object represents an entity.
•Examples:
  •sales database: customers, store items, sales
  •medical database: patients, treatments
  •university database: students, professors, courses
•Also called samples, examples, instances, data points, objects, tuples.
•Data objects are described by attributes.
•Database rows -> data objects; columns -> attributes.

Attributes
•Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  •E.g., customer_ID, name, address
•Types:
  •Nominal
  •Binary
  •Ordinal
  •Discrete and continuous
  •Numeric: quantitative
    •Interval-scaled
    •Ratio-scaled

Types of attributes

Nominal Attributes: related to names

Similarity and Dissimilarity
Proximity refers to either a similarity or a dissimilarity.

Similarity and Dissimilarity Applications:
•Web search
•Computer vision
•Image processing
•Natural language processing
•Clustering and outlier detection

Similarity / Proximity Measures
•Nominal attributes
•Binary attributes
•Ordinal attributes
•Numeric attributes: interval-scaled and ratio-scaled
•Mixed attributes

Data Matrix and Dissimilarity Matrix
Data Matrix

Proximity Measure for Nominal Attributes

Example: Nominal attribute

Example: Finding similarity matrix for nominal attribute

Example: Find dissimilarity of given data

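Proximity Measure for Binary Attributes
•A contingency table for binary data, where q, r, s and t count the attributes that take each combination of values and p = q + r + s + t:

                    Object j
                    1      0      sum
Object i     1      q      r      q + r
             0      s      t      s + t
             sum    q + s  r + t  p

•Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
•Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$
•Jaccard coefficient (similarity measure for asymmetric binary variables):
  $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
Note: the Jaccard coefficient is the same as "coherence":
  $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$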

Proximity Measure for Binary Attributes: Example

•Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

•Gender is a symmetric attribute
•The remaining attributes are asymmetric binary
•Let the values Y and P be 1, and the value N be 0

$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
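The three distances in this example can be checked with a short Python sketch. It assumes the usual asymmetric binary dissimilarity (r + s) / (q + r + s), computed over the six asymmetric attributes only (Gender is left out), with Y and P coded as 1 and N as 0. The function names and the handling of two all-zero vectors are illustrative choices, not part of the slides.

```python
from itertools import combinations


def contingency_counts(x, y):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    pairs = list(zip(x, y))
    q = pairs.count((1, 1))  # both 1
    r = pairs.count((1, 0))  # 1 in x, 0 in y
    s = pairs.count((0, 1))  # 0 in x, 1 in y
    t = pairs.count((0, 0))  # both 0
    return q, r, s, t


def asymmetric_binary_dissimilarity(x, y):
    """(r + s) / (q + r + s): negative matches (t) are ignored."""
    q, r, s, _ = contingency_counts(x, y)
    total = q + r + s
    # Assumption: two all-zero vectors are treated as distance 0.
    return (r + s) / total if total else 0.0


def symmetric_binary_dissimilarity(x, y):
    """(r + s) / (q + r + s + t): both kinds of match count."""
    q, r, s, t = contingency_counts(x, y)
    return (r + s) / (q + r + s + t)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

for a, b in combinations(patients, 2):
    d = asymmetric_binary_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```

Under this measure Jack and Mary are the closest pair, while Jim and Mary are the most dissimilar.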
Proximity Measure for Binary Attributes: Example: Find Dissimilarity Matrix

Proximity Measure for Numeric Attributes: Standardizing Numeric Data
•Z-score:
  $z = \frac{x - \mu}{\sigma}$
  •x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  •the distance between the raw score and the population mean in units of the standard deviation
  •negative when the raw score is below the mean, positive when above
•An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
  where
  $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
•standardized measure (z-score):
  $z_{if} = \frac{x_{if} - m_f}{s_f}$
•Using the mean absolute deviation is more robust to outliers than using the standard deviation
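A small Python sketch of both standardizations, assuming the population standard deviation for the classic z-score and the mean absolute deviation s_f for the alternative. The sample values (including the outlier 40) and the function names are made up for illustration.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """s_f: average absolute distance of the values from their mean m_f."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def z_scores_std(values):
    """Classic z-score: (x - mean) / population standard deviation."""
    m, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - m) / sd for x in values]


def z_scores_mad(values):
    """Alternative z-score: (x - m_f) / s_f with the mean absolute deviation."""
    m, s = mean(values), mean_absolute_deviation(values)
    if s == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - m) / s for x in values]


# Hypothetical attribute values with one outlier (40).
scores = [2, 4, 4, 5, 6, 7, 8, 40]
for x, z_sd, z_md in zip(scores, z_scores_std(scores), z_scores_mad(scores)):
    print(f"x={x:>2}  z(std)={z_sd:6.2f}  z(mad)={z_md:6.2f}")
```

Because deviations are not squared, the mean absolute deviation is never larger than the standard deviation, so the outlier keeps a larger standardized score and stays easier to detect.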



Data Matrix and Dissimilarity Matrix
Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
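The step from an n-by-p data matrix to an n-by-n dissimilarity matrix can be sketched in Python as follows. Because d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle is stored, as in the slide. The variable and function names are illustrative.

```python
from math import sqrt

# Data matrix: one row per object, one column per attribute.
data_matrix = [
    (1, 2),  # x1
    (3, 5),  # x2
    (2, 0),  # x3
    (4, 5),  # x4
]


def euclidean(x, y):
    """Square root of the summed squared attribute differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Row i holds d(i, 0), ..., d(i, i): the lower triangle including the diagonal.
n = len(data_matrix)
dissimilarity = [
    [euclidean(data_matrix[i], data_matrix[j]) for j in range(i + 1)]
    for i in range(n)
]

for i, row in enumerate(dissimilarity, start=1):
    print(f"x{i}", "  ".join(f"{d:.2f}" for d in row))
```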

Distance on Numeric Data: Minkowski Distance
•Minkowski distance: a popular distance measure
  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$
  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
•Properties
  •d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  •d(i, j) = d(j, i) (Symmetry)
  •d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
•A distance that satisfies these properties is a metric
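As a sketch, the Minkowski distance and its three metric properties can be checked numerically in Python. The four sample points and the restriction to h ≥ 1 (where the triangle inequality holds) are assumptions of this example; a numerical check on a few points illustrates the properties but does not prove them.

```python
from itertools import permutations


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    if h < 1:
        raise ValueError("h must be >= 1 for the distance to be a metric")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


# Sample points chosen for illustration.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

for h in (1, 2, 3):
    for i, j, k in permutations(points, 3):
        assert minkowski(i, i, h) == 0                   # d(i, i) = 0
        assert minkowski(i, j, h) > 0                    # d(i, j) > 0 for i != j
        assert minkowski(i, j, h) == minkowski(j, i, h)  # symmetry
        # Triangle inequality, with a tiny tolerance for floating point.
        assert minkowski(i, j, h) <= minkowski(i, k, h) + minkowski(k, j, h) + 1e-12

print("All metric properties hold on the sample points.")
```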

Special Cases of Minkowski Distance
•h = 1: Manhattan (city block, L1 norm) distance
  •E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
•h = 2: Euclidean (L2 norm) distance
  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
•h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  •This is the maximum difference between any component (attribute) of the vectors
  $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
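Two of these special cases can be illustrated with a brief Python sketch: the Minkowski distance approaches the supremum distance as h grows, and on binary vectors the Manhattan distance equals the Hamming distance. The sample vectors are made up for this example.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (1, 2), (3, 5)
for h in (1, 2, 4, 16, 64):
    print(f"h={h:>2}: {minkowski(x, y, h):.4f}")
print(f"supremum: {supremum(x, y)}")

# On binary vectors, the Manhattan distance counts the differing bits.
bits_a = (1, 0, 1, 1, 0)
bits_b = (1, 1, 0, 1, 0)
hamming = sum(a != b for a, b in zip(bits_a, bits_b))
print("Hamming:", hamming, "Manhattan:", minkowski(bits_a, bits_b, 1))
```

The printed distances shrink toward the supremum value as h increases, because the largest attribute difference dominates the sum.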

Example: Minkowski Distance
Data Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
L1     x1     x2     x3     x4
x1     0
x2     5      0
x3     3      6      0
x4     6      1      7      0

Euclidean (L2)
L2     x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0

Supremum (L∞)
L∞     x1     x2     x3     x4
x1     0
x2     3      0
x3     2      5      0
x4     3      1      5      0
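The three dissimilarity matrices of this example can be reproduced with one generic helper that takes the distance function as a parameter. This is a minimal sketch: `lower_triangle` and the other names are illustrative, and `math.dist` (Python 3.8+) supplies the Euclidean distance.

```python
from math import dist

POINTS = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def manhattan(x, y):
    """L1: sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def supremum(x, y):
    """L-infinity: largest absolute attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def lower_triangle(points, distance):
    """Lower-triangular dissimilarity matrix for a given distance function."""
    names = list(points)
    return [
        [distance(points[a], points[b]) for b in names[: i + 1]]
        for i, a in enumerate(names)
    ]


measures = (("Manhattan (L1)", manhattan),
            ("Euclidean (L2)", dist),
            ("Supremum", supremum))
for label, distance in measures:
    print(label)
    for name, row in zip(POINTS, lower_triangle(POINTS, distance)):
        print(f"  {name}", "  ".join(f"{d:5.2f}" for d in row))
```

For every pair, the supremum distance is at most the Euclidean distance, which in turn is at most the Manhattan distance.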

Example: Numeric attribute: Find the distances

Example: Numeric Attribute

Example: Ratio-Scaled Attributes: Find Dissimilarity matrix

Example: Find dissimilarity of given numeric data

Proximity measure for Ordinal attributes
•An ordinal variable can be discrete or continuous
•Order is important, e.g., rank
•Can be treated like interval-scaled
  •replace x_{if} by its rank $r_{if} \in \{1, \dots, M_f\}$
  •map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  •compute the dissimilarity using methods for interval-scaled variables
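The rank mapping can be sketched in Python as follows, taking M_f to be the number of ordered states of the variable. The states (fair < good < excellent) and the four sample values are assumptions chosen for illustration, and the absolute difference of the mapped values stands in for any interval-scaled distance.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with M = 3 ordered states.
states = ["fair", "good", "excellent"]
values = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(values, states)
print("z:", z)

# Lower-triangular dissimilarity matrix using |z_i - z_j|.
for i in range(len(z)):
    print("  ".join(f"{abs(z[i] - z[j]):.1f}" for j in range(i + 1)))
```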

Example: Ordinal Attributes

Example: Find dissimilarity matrix of given ordinal attribute

Proximity measures of Mixed Attributes
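•A database may contain all attribute types
  •Nominal, symmetric binary, asymmetric binary, numeric, ordinal
•One may use a weighted formula to combine their effects
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  •The weight $\delta_{ij}^{(f)}$ is 0 if x_{if} or x_{jf} is missing, or if x_{if} = x_{jf} = 0 and f is asymmetric binary; otherwise it is 1
  •f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  •f is numeric: use the normalized distance $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
  •f is ordinal
    •Compute ranks r_{if} and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    •Treat z_{if} as interval-scaled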
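A minimal Python sketch of the combined measure, under these assumptions: each attribute gets weight 1 unless a value is missing (`None`) or both values are 0 on an asymmetric binary attribute; numeric differences are divided by the attribute's observed range; ordinal values are mapped to [0, 1] through their ranks. The four objects are illustrative data, not taken from the slides, and all names are hypothetical.

```python
ORDERED_STATES = ["fair", "good", "excellent"]  # assumed ordering

# Illustrative objects: (nominal, ordinal, numeric).
OBJECTS = [
    ("code-A", "excellent", 45),
    ("code-B", "fair", 22),
    ("code-C", "good", 64),
    ("code-A", "excellent", 28),
]
KINDS = ("nominal", "ordinal", "numeric")


def ordinal_z(value):
    """(r - 1) / (M - 1); list.index is already r - 1."""
    return ORDERED_STATES.index(value) / (len(ORDERED_STATES) - 1)


def numeric_ranges(objects, kinds):
    """max - min for each numeric attribute, None for the others."""
    ranges = []
    for f, kind in enumerate(kinds):
        if kind == "numeric":
            column = [obj[f] for obj in objects if obj[f] is not None]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-attribute dissimilarities."""
    weighted_sum, weight_total = 0.0, 0
    for a, b, kind, value_range in zip(x, y, kinds, ranges):
        if a is None or b is None:
            continue  # weight 0: missing value
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # weight 0: negative match carries no information
        if kind == "numeric":
            d = abs(a - b) / value_range if value_range else 0.0
        elif kind == "ordinal":
            d = abs(ordinal_z(a) - ordinal_z(b))
        else:  # nominal or binary
            d = 0.0 if a == b else 1.0
        weighted_sum += d
        weight_total += 1
    return weighted_sum / weight_total if weight_total else 0.0


RANGES = numeric_ranges(OBJECTS, KINDS)
for i in range(1, len(OBJECTS)):
    row = [mixed_dissimilarity(OBJECTS[i], OBJECTS[j], KINDS, RANGES)
           for j in range(i)]
    print(f"object {i + 1}:", "  ".join(f"{d:.2f}" for d in row))
```

Objects 1 and 4 agree on the nominal and ordinal attributes, so their dissimilarity comes only from the numeric attribute and is the smallest in the matrix.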

Example: Find Dissimilarity of Mixed Attributes

Example: Mixed Attributes

Cosine Similarity
•A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
•Other vector objects: gene features in micro-arrays, …
•Applications: information retrieval, biological taxonomy, gene feature mapping, ...
•Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
  where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d

Example: Find similarity of documents
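•$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
•Ex: Find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  $d_1 \cdot d_2 = 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25$
  $\|d_1\| = (5 \times 5 + 0 \times 0 + 3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0)^{0.5} = (42)^{0.5} = 6.481$
  $\|d_2\| = (3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 1 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1)^{0.5} = (17)^{0.5} = 4.12$
  $\cos(d_1, d_2) = \frac{25}{6.481 \times 4.12} = 0.94$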
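The worked example can be verified with a few lines of Python. This is a minimal sketch; the function name and the choice to reject zero vectors (for which the cosine is undefined) are illustrative.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(f"cos(d1, d2) = {cosine_similarity(d1, d2):.2f}")
```

A value close to 1 means the two term-frequency vectors point in nearly the same direction, regardless of document length.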
