Data mining Measuring similarity and desimilarity

11,316 views 46 slides Jul 04, 2020
Slide 1
Slide 1 of 46
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46

About This Presentation

Similarity and Desimilarity techniques


Slide Content

Unit III :Measuring Data Similarity and Dissimilarity

What is Data? Collection of data objects and their attributes An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. Attribute is also known as variable, field, characteristic, or feature A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance Attributes Objects

Attribute Values Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers But properties of attribute values can be different ID has no limit but age has a maximum and minimum value

Similarity and Dissimilarity Measures CS 40003: Data Analytics 4 In clustering techniques, similarity (or dissimilarity) is an important measurement. Informally, similarity between two objects (e.g., two images, two documents, two records, etc.) is a numerical measure of the degree to which two objects are alike . The dissimilarity on the other hand, is another alternative (or opposite) measure of the degree to which two objects are different . Both similarity and dissimilarity also termed as proximity . Usually, similarity and dissimilarity are non-negative numbers and may range from zero (highly dissimilar (no similar)) to some finite/infinite value (highly similar (no dissimilar)). Note: Frequently, the term distance is used as a synonym for dissimilarity I n fact, it is used to refer as a special case of dissimilarity.

Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Dissimilarity (e.g., distance) Numerical measure of how different two data objects are Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies Proximity refers to a similarity or dissimilarity 5

Data Matrix and Dissimilarity Matrix Data matrix n data points with p dimensions Two modes Dissimilarity matrix n data points, but registers only the distance A triangular matrix Single mode 6

Attribute Types Nominal: categories, states, or “names of things” Hair_color = { auburn, black, blond, brown, grey, red, white } marital status, occupation, ID numbers, zip codes Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary : both outcomes equally important e.g., gender Asymmetric binary : outcomes not equally important. e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV positive) Ordinal Values have a meaningful order (ranking) but magnitude between successive values is not known. Size = { small, medium, large } , grades, army rankings 7

Numeric Attribute Types Quantity (integer or real-valued) Interval Measured on a scale of equal-sized units Values have order E.g., temperature in C ˚ or F ˚ , calendar dates No true zero-point Ratio Inherent zero-point We can speak of values as being an order of magnitude larger than the unit of measurement (10 K ˚ is twice as high as 5 K ˚ ). e.g., temperature in Kelvin, length, counts, monetary quantities 8

Properties of Attribute Values The type of an attribute depends on which of the following properties it possesses: Distinctness: =  Order: < > Addition: + - Multiplication: * / Nominal attribute: distinctness Ordinal attribute: distinctness & order Interval attribute: distinctness, order & addition Ratio attribute: all 4 properties

Attribute Type Description Examples Operations Nominal The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=,  ) zip codes, employee ID numbers, eye color, sex: { male, female } mode, entropy, contingency correlation,  2 test Ordinal The values of an ordinal attribute provide enough information to order objects. (<, >) hardness of minerals, { good, better, best }, grades, street numbers median, percentiles, rank correlation, run tests, sign tests Interval For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit mean, standard deviation, Pearson's correlation, t and F tests Ratio For ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current geometric mean, harmonic mean, percent variation

Attribute Level Transformation Comments Nominal Any permutation of values If all employee ID numbers were reassigned, would it make any difference? Ordinal An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. An attribute encompassing the notion of good, better best can be represented equally well by the values {1, 2, 3} or by { 0.5, 1, 10}. Interval new_value =a * old_value + b where a and b are constants Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). Ratio new_value = a * old_value Length can be measured in meters or feet.

Discrete vs. Continuous Attributes Discrete Attribute Has only a finite or countably infinite set of values E.g., zip codes, profession, or the set of words in a collection of documents Sometimes, represented as integer variables Note: Binary attributes are a special case of discrete attributes Continuous Attribute Has real numbers as attribute values E.g., temperature, height, or weight Practically, real values can only be measured and represented using a finite number of digits Continuous attributes are typically represented as floating-point variables 12

The following questions fall under the Interval Scale category: What is your family income? What is the temperature in your city? Ratio Scale Examples: The following questions fall under the Ratio Scale category: What is your daughter’s current height? Less than 5 feet. 5 feet 1 inch – 5 feet 5 inches 5 feet 6 inches- 6 feet More than 6 feet What is your weight in kilograms? Less than 50 kilograms 51- 70 kilograms 71- 90 kilograms 91-110 kilograms More than 110 kilograms

Proximity Measure for Nominal Attributes Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute) Method 1 : Simple matching m : # of matches, p : total # of variables Method 2 : Use a large number of binary attributes creating a new binary attribute for each of the M nominal states 15

Proximity Measure for Nominal Attributes

Proximity Measure for Binary Attributes A contingency table for binary data Distance measure for symmetric binary variables: Distance measure for asymmetric binary variables: Jaccard coefficient ( similarity measure for asymmetric binary variables): 18 Note: Jaccard coefficient is the same as “coherence”: Object i Object j

Dissimilarity between Binary Variables Example Gender is a symmetric attribute The remaining attributes are asymmetric binary Let the values Y and P be 1, and the value N 0 19

Standardizing Numeric Data Z-score: X: raw score to be standardized, μ : mean of the population, σ : standard deviation the distance between the raw score and the population mean in units of the standard deviation negative when the raw score is below the mean, “+” when above An alternative way: Calculate the mean absolute deviation where standardized measure ( z-score ): Using mean absolute deviation is more robust than using standard deviation 21

Example Given the following measurements for the variable age: 18 ; 22; 25; 42; 28; 43; 33; 35; 56; 28; standardize the variable by the following: (a) Compute the mean absolute deviation of age. (b) Compute the z-score for the ¯ rst four measurements.

Distance on Numeric Data: Minkowski Distance Minkowski distance : A popular distance measure where i = ( x i1 , x i2 , …, x ip ) and j = ( x j1 , x j2 , …, x jp ) are two p -dimensional data objects, and h is the order (the distance so defined is also called L- h norm) Properties d(i, j) > 0 if i ≠ j , and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j)  d(i, k) + d(k, j) (Triangle Inequality) A distance that satisfies these properties is a metric 24

Special Cases of Minkowski Distance h = 1: Manhattan (city block, L 1 norm) distance E.g., the Hamming distance: the number of bits that are different between two binary vectors h = 2: (L 2 norm) Euclidean distance h   . “supremum” (L max norm, L  norm) distance. This is the maximum difference between any component (attribute) of the vectors 25

Example: Data Matrix and Dissimilarity Matrix 26 Dissimilarity Matrix (with Euclidean Distance) Data Matrix

Weighted Euclidean Distance

Example: Minkowski Distance 28 Dissimilarity Matrices Manhattan (L 1 ) Euclidean (L 2 ) Supremum

Example:

Example:

Example

July 18, 2019 Data Mining: Concepts and Techniques 33 Interval-valued variables Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. Standardize data Calculate the mean absolute deviation: where Calculate the standardized measurement ( z-score ) Using mean absolute deviation is more robust than using standard deviation

Ordinal Variables An ordinal variable can be discrete or continuous Order is important, e.g., rank Can be treated like interval-scaled replace x if by their rank map the range of each variable onto [0, 1] by replacing i -th object in the f -th variable by compute the dissimilarity using methods for interval-scaled variables 34

Attributes of Mixed Type A database may contain all attribute types Nominal, symmetric binary, asymmetric binary, numeric, ordinal One may use a weighted formula to combine their effects f is binary or nominal: d ij (f) = 0 if x if = x jf , or d ij (f) = 1 otherwise f is numeric: use the normalized distance f is ordinal Compute ranks r if and Treat z if as interval-scaled 36

37

Dissimilarity matrix for attribute test1 Dissimilarity matrix for attribute test2 Dissimilarity matrix for attribute test3 Dissimilarity matrix for mixed type of attributes

Dissimilarity matrix for attribute test1 Dissimilarity matrix for attribute test2 Dissimilarity matrix for attribute test3 Dissimilarity matrix for mixed type of attributes

Non-Metric similarity CS 40003: Data Analytics 42 In many applications (such as information retrieval) objects are complex and contains a large number of symbolic entities (such as keywords, phrases, etc.). To measure the distance between complex objects, it is often desirable to introduce a non-metric similarity function. Here, we discuss few such non-metric similarity measurements. Cosine similarity Suppose, x and y denote two vectors representing two complex objects. The cosine similarity denoted as and defined as w here denotes the vector dot product, namely such that and . and denote the Euclidean norms of vector x and y, respectively (essentially the length of vectors x and y ), that is and  

Cosine Similarity A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. Other vector objects: gene features in micro-arrays, … Applications: information retrieval, biologic taxonomy, gene feature mapping, ... Cosine measure: If d 1 and d 2 are two vectors (e.g., term-frequency vectors), then cos( d 1 , d 2 ) = ( d 1  d 2 ) /|| d 1 || || d 2 || , where  indicates vector dot product, || d ||: the length of vector d 43

Example: Cosine Similarity cos( d 1 , d 2 ) = ( d 1  d 2 ) /|| d 1 || || d 2 || , where  indicates vector dot product, || d |: the length of vector d Ex: Find the similarity between documents 1 and 2. d 1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d 2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d 1  d 2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 || d 1 ||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0) 0.5 =(42) 0.5 = 6.481 || d 2 ||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1) 0.5 =(17) 0.5 = 4.12 cos( d 1 , d 2 ) = 0.94 44

Example: Cosine Similarity d 1  = 3 2 0 5 0 0 0 2 0 0  d 2  = 1 0 0 0 0 0 0 1 0 2 < d 1 ,  d 2 > = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||  d 1  || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0) 1/2  = (42)  1/2  = 6.481 ||  d 2  || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)  1/2  = (6) 1/2  = 2.449 cos ( d 1 ,  d 2  ) = 0.3150

Example: D1: “I like this book” D2: “We wrote this book”
Tags