Data Mining and Data Warehousing PPT: Module 2


About This Presentation

Data Mining and Data Warehousing is a 7th-semester BE Computer Science subject at VTU (Visvesvaraya Technological University). This PPT covers the implementation of a data warehouse, including data cubes and the different operations that can be performed on them. It also covers data preprocessing techniques and the different steps involved ...


Slide Content

Data Warehouse Implementation & Data Mining

Contents
- How to compute a data cube efficiently
- How OLAP data can be indexed, using either bitmap or join indices
- How OLAP queries are processed
- Various types of warehouse servers for OLAP processing

Data Cube Computation
Suppose you would like to create a data cube All_Electronics that contains the following: item, city, year, and sales_in_Euro, and to answer the following queries:
- Compute the sum of sales, grouping by item and city
- Compute the sum of sales, grouping by item
- Compute the sum of sales, grouping by city

Data Cube Computation
Dimensions: city, item, year. The total number of data cuboids is 2^3 = 8: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}. In (), the dimensions are not grouped at all. These group-bys form a lattice of cuboids for the data cube; the base cuboid contains all three dimensions.

The Data Cube Operator
An SQL query containing no group-by (e.g., "compute the sum of total sales") is a zero-dimensional operation. An SQL query containing one group-by (e.g., "compute the sum of sales, grouped by city") is a one-dimensional operation. A cube on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.

Data Cube Computation
The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. In an SQL-like syntax, the data cube could be defined in DMQL as:
define cube sales [item, city, year]: sum(sales_in_dollars)
and the sales aggregate cuboids computed as:
compute cube sales
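
To make the cube operator concrete, here is a minimal Python sketch (using pandas; the table contents and column names are illustrative, not taken from the slides) that computes all 2^3 = 8 cuboids of a three-dimensional sales cube by grouping on every subset of the dimensions:

# Sketch: computing every cuboid of a 3-D cube with pandas group-bys.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["I1", "I1", "I2", "I2"],
    "city": ["Toronto", "Chicago", "Toronto", "Chicago"],
    "year": [2023, 2023, 2024, 2024],
    "sales_in_dollars": [100, 80, 120, 60],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        if subset:   # a group-by cuboid
            cuboids[subset] = sales.groupby(list(subset))["sales_in_dollars"].sum()
        else:        # the apex cuboid (): no group-by, the grand total
            cuboids[subset] = sales["sales_in_dollars"].sum()

print(len(cuboids))  # 2^3 = 8 cuboids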

A Data Cube: sales (figure)
A cube over City (New York, Chicago, Toronto, Vancouver) and Item (I1-I6), annotated with base cells, aggregate cells, the item cuboid, the city cuboid, the base cuboid, and the apex cuboid (the grand total, All = 184).

A 3D Data Cube

Efficient Data Cube Implementation
Fast online analytical processing takes minimum time if the aggregates for all the cuboids are precomputed: response time is fast and redundant computation is avoided. However, precomputation of the full cube requires an excessive amount of memory, depending on the number of dimensions and the cardinality of each dimension (the curse of dimensionality).

Efficient Data Cube Implementation
The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of dimensionality.


Efficient Data Cube Implementation
The total number of cuboids that can be generated is T = (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1), where L_i is the number of levels for dimension i (the +1 accounts for the virtual top level "all").

For example, with Time: "day < month < quarter < year" and Location: "street < city < state < country" (L_i = 4 levels for each dimension): T = (4 + 1) × (4 + 1) = 5 × 5 = 25.

With Location: "street < city < state < country" (4 levels) and Item: "item_type" (1 level): T = (4 + 1) × (1 + 1) = 5 × 2 = 10. The possible cuboids are (), {street}, {city}, {state}, {country}, {item_type}, {street, item_type}, {city, item_type}, {state, item_type}, and {country, item_type}.

If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 5^10 ≈ 9.8 × 10^6.

Materialization
There are three choices of data cube materialization:
- No materialization: do not precompute any of the non-base cuboids. This leads to multidimensional aggregation on the fly and is slow.
- Full materialization: precompute all of the cuboids. Queries run very fast, but huge memory is required.
- Partial materialization: selectively compute a proper subset of the cuboids, containing only those cells that satisfy some user-specified criterion.

Partial Materialization
Partial materialization should consider three factors:
- identify the subset of cuboids or subcubes to materialize;
- exploit the materialized cuboids or subcubes during query processing;
- efficiently update the materialized cuboids or subcubes during load and refresh.

Approaches for Cuboid or Subcube Selection
- Iceberg cube: materialize only the cells in a cuboid whose measure value is above a minimum threshold, e.g., count(*) >= min_sup (the iceberg condition).
- Shell cube: precompute the cuboids for only a small number of dimensions (e.g., three to five) of a data cube.
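
A minimal sketch of the iceberg condition, assuming an illustrative pandas DataFrame and threshold; it keeps only the group-by cells whose count reaches min_sup, analogous to HAVING COUNT(*) >= min_sup in SQL:

# Sketch: materializing only iceberg cells of the (item, city) cuboid.
import pandas as pd

tx = pd.DataFrame({"item": ["I1", "I1", "I1", "I2"],
                   "city": ["Toronto", "Toronto", "Chicago", "Chicago"]})
min_sup = 2
counts = tx.groupby(["item", "city"]).size()
iceberg_cells = counts[counts >= min_sup]   # keep only cells above the threshold
print(iceberg_cells)                        # (I1, Toronto) -> 2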

Indexing OLAP Data
- Bitmap index
- Join index

Bitmap Index
A bitmap index is also called a bit array. Example relation:

ID  Name  Category  Stock  Status
1   A     Shoe      True   Approved
2   B     Shoe      True   Approved
3   C     Books     False  Pending
4   D     Books     False  Approved
5   E     Books     False  Rejected
6   F     Toys      True   Rejected

The cardinality of a column is its number of distinct values: here Name has cardinality 6, Category 3, Stock 2, and Status 3. Bitmap indexing suits the low-cardinality columns.

The bitmap index on Status has one bit vector per distinct value (Approved, Pending, Rejected), with one bit per row:

ID  Approved  Pending  Rejected
1   1         0        0
2   1         0        0
3   0         1        0
4   1         0        0
5   0         0        1
6   0         0        1

Bitmap Index
The bitmap index is popular in OLAP products because it allows quick searching in data cubes. In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute's domain. If a given attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index; all other bits for that row are set to 0.
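
The following sketch builds the bit vectors of a bitmap index for the Status attribute of the example above and answers a disjunctive query with a bitwise OR; the NumPy representation is an illustrative choice, not a prescribed implementation:

# Sketch: one bit vector per distinct value; bit i is 1 iff row i holds that value.
import numpy as np

status = np.array(["Approved", "Approved", "Pending",
                   "Approved", "Rejected", "Rejected"])
bitmap = {v: (status == v).astype(np.uint8) for v in np.unique(status)}
# bitmap["Approved"] -> [1 1 0 1 0 0], bitmap["Pending"] -> [0 0 1 0 0 0], ...

# A query like "Approved OR Pending" becomes a fast bitwise OR:
mask = bitmap["Approved"] | bitmap["Pending"]
print(np.nonzero(mask)[0] + 1)   # matching row IDs: [1 2 3 4]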

Bitmap Index Advantages
- Efficient compared with hash and tree indices.
- Useful for low-cardinality domains because the processing time is reduced.
- Leads to reductions in space and I/O.
- Can be adapted for higher-cardinality domains using compression techniques.

Join Index
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast, join indexing registers the joinable rows of two relations from a relational database. In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table. E.g., for a fact table Sales and two dimensions, city and product, a join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city. Join indices can span multiple dimensions (a composite join index).
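
A minimal sketch of a join index as a mapping from each distinct city to the R-IDs of the matching fact-table tuples (the fact rows are illustrative):

# Sketch: a join index from dimension values to fact-table record IDs.
from collections import defaultdict

fact_sales = [("R1", "Toronto"), ("R2", "Chicago"),
              ("R3", "Toronto"), ("R4", "Vancouver")]

join_index = defaultdict(list)
for rid, city in fact_sales:
    join_index[city].append(rid)

print(join_index["Toronto"])   # ['R1', 'R3'] -- no join scan needed at query time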

Efficient Processing of OLAP Queries
- Determine which operations should be performed on the available cuboids, e.g., dice = selection + projection.
- Determine which materialized cuboid(s) should be selected for the OLAP operation: the one with the lowest cost.
Suppose that we define a data cube for AllElectronics of the form "sales_cube [time, item, location]: sum(sales_in_dollars)". The dimension hierarchies used are "day < month < quarter < year" for time, "item_name < brand < type" for item, and "street < city < province_or_state < country" for location.

Efficient Processing of OLAP Queries
Let the query to be processed be on {brand, province_or_state}, with the condition "year = 2004", and let four materialized cuboids be available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state}, where year = 2004
Which should be selected to process the query?

OLAP Server Architectures
- ROLAP (relational OLAP)
- MOLAP (multidimensional OLAP)
- HOLAP (hybrid OLAP)

Relational OLAP (ROLAP)
ROLAP works directly with relational databases, using a relational or extended-relational DBMS to store and manage warehouse data. ROLAP tools do not use precalculated data cubes; instead, they pose the query to the standard relational database. ROLAP tools can therefore ask any question, because the methodology is not limited to the contents of a cube, and they can drill down to the lowest level of detail in the database. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services.

OLAP Server Architectures: Relational OLAP (ROLAP)

Multidimensional OLAP (MOLAP)
MOLAP stores data in optimized multidimensional array storage. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data, so MOLAP tools have a very fast response time: the data cube contains all the possible answers to a given range of questions. MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets; sparse subcubes employ compression technology for efficient storage utilization.

Hybrid OLAP (HOLAP)
A HOLAP server may allow large volumes of detailed data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. HOLAP tools can utilize both precalculated cubes and relational data sources. The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. Microsoft SQL Server 2000 supports a hybrid OLAP server.

Specialized SQL Servers
To meet the growing demand for OLAP processing in relational databases, some database system vendors implement specialized SQL servers. These provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

Data Mining

Contents
- What is data mining?
- Challenges
- Data mining tasks
- Data: types of data
- Data quality
- Data preprocessing
- Measures of similarity and dissimilarity

What Is Data Mining? (Definition)
Data mining is the process of automatically discovering useful information in large data repositories: finding hidden information in a database. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a future observation.

What Is (Not) Data Mining?
Looking up individual records using a database management system, or finding particular Web pages via a query to an Internet search engine, are tasks in the area of information retrieval, not data mining.

Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.

Data Preprocessing
The input data can be stored in a variety of formats; preprocessing transforms the raw input data into an appropriate format for subsequent analysis. It includes:
- fusing data from multiple sources;
- cleaning data to remove noise and duplicate observations;
- selecting records and features that are relevant to the data mining task at hand.

Postprocessing
Postprocessing ensures that only valid and useful results are incorporated into the decision support system (DSS). Visualization allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical testing methods can also be applied during postprocessing to eliminate spurious data mining results.

Challenges That Motivated the Development of Data Mining
- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis

Scalability
Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory.

High Dimensionality
It is now common to encounter data sets with hundreds or thousands of attributes, instead of the handful common a few decades ago. For example, consider a data set that contains measurements of temperature at various locations. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.

Heterogeneous and Complex Data
As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects; techniques developed for mining such complex objects should take into consideration the relationships in the data.

Data Ownership and Distribution
Sometimes, the data needed for an analysis is not stored in one location or owned by one organization; instead, it is geographically distributed among resources belonging to multiple entities, which requires the development of distributed data mining techniques. Key challenges faced by distributed data mining algorithms include:
- how to reduce the amount of communication needed to perform the distributed computation;
- how to effectively consolidate the data mining results obtained from multiple sources;
- how to address data security issues.

Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test paradigm, a process that is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses; consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation.

Data Mining Tasks
- Predictive tasks: predict the value of a particular attribute (the target or dependent variable) based on the values of other attributes (the explanatory or independent variables).
- Descriptive tasks: derive patterns that summarize the underlying relationships in data. Post-processing techniques are used to validate and explain the results.

Data Mining Tasks
- Predictive modeling
- Clustering
- Association rules
- Anomaly detection

Predictive Modeling
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks:
- classification, which is used for discrete target variables (e.g., predicting whether a web user will make a purchase at an online bookstore);
- regression, which is used for continuous target variables (e.g., forecasting the future price of a stock).

Example
Consider the task of predicting the species of a flower based on its characteristics: classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. The data set contains four other attributes: sepal width, sepal length, petal length, and petal width.

Petal width is broken into the categories low, medium, and high, corresponding to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Petal length is likewise broken into categories low, medium, and high, corresponding to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively.

Based on these categories of petal width and length, the following rules can be derived:
- Petal width low and petal length low implies Setosa.
- Petal width medium and petal length medium implies Versicolour.
- Petal width high and petal length high implies Virginica.
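
As a sketch, these rules can be checked against the well-known Iris data bundled with scikit-learn (assuming scikit-learn is available; the tie-breaking choice of classifying by the petal-width bucket alone is ours, not the slides'):

# Sketch: applying the interval rules above to the real Iris data.
from sklearn.datasets import load_iris

iris = load_iris()   # target: 0=setosa, 1=versicolor, 2=virginica
petal_wid = iris.data[:, 3]

def bucket(v, cuts):
    # Map a value to 0 (low), 1 (medium), or 2 (high) given two cut points.
    return 0 if v < cuts[0] else (1 if v < cuts[1] else 2)

correct = sum(int(bucket(pw, (0.75, 1.75)) == y)
              for pw, y in zip(petal_wid, iris.target))
print(f"rule accuracy: {correct / len(iris.target):.2f}")  # most flowers match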

Association Analysis: Definition
Given a set of records, each of which contains some number of items from a given collection, association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets.

Association Analysis: Applications
- Market-basket analysis: rules are used for sales promotion, shelf management, and inventory management.
- Finding groups of genes that have related functionality.
- Identifying Web pages that are accessed together.
- Understanding the relationships between different elements of Earth's climate system.
Example rules discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}.
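
A minimal sketch of how support and confidence would be computed for the rule {Diaper, Milk} --> {Beer} on a toy basket data set (the transactions are illustrative):

# Sketch: support and confidence of an association rule by direct counting.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rule = {"Diaper", "Milk"}, {"Diaper", "Milk", "Beer"}
print("support    =", support(rule))                  # 2/5 = 0.4
print("confidence =", support(rule) / support(lhs))   # 2/3 ≈ 0.67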

Cluster Analysis
Cluster analysis finds groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, while inter-cluster distances are maximized.
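
As an illustration, the following sketch clusters two well-separated point clouds with k-means from scikit-learn (the data and the choice of k-means are illustrative; the slides do not prescribe an algorithm):

# Sketch: grouping points so intra-cluster distances are small.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),    # one group near (0, 0)
                    rng.normal(5, 0.5, (50, 2))])   # another near (5, 5)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(np.bincount(labels))   # roughly 50 points per cluster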

Document Clustering
Clustering has been used to group sets of related documents and customers, to find areas of the ocean that have a significant impact on the Earth's climate, and to compress data.

Deviation/Anomaly/Change Detection
The task of identifying observations whose characteristics are significantly different from the rest of the data; such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. Applications:
1. Credit card fraud detection
2. Network intrusion detection

What Is a Data Set?
A data set is a collection of data objects. An attribute is a property or characteristic of an object (examples: eye color of a person, temperature); it is also known as a variable, field, characteristic, dimension, or feature. A collection of attributes describes an object; an object is also known as a record, point, case, sample, entity, or instance.

Attributes and Measurement

Attributes and Measurement
An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another. For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.

Attributes and Measurement
A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object. The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.

Type of an Attribute
The values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. Distinguish between attributes and attribute values:
- The same attribute can be mapped to different attribute values. Example: height can be measured in feet or in meters.
- Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are both integers, but the properties of the attribute values differ: ID has no limit, while age has a maximum and a minimum value.

Measurement of Length
The way you measure an attribute may not match the attribute's properties. For example, one measurement scale for length preserves both the ordering and additivity properties of length, while another (e.g., rank order) preserves only the ordering property.

The Different Types of Attributes
The following properties (operations) of numbers are typically used to describe attributes:
- Distinctness: =, ≠
- Order: <, >
- Meaningful differences: +, −
- Meaningful ratios: ×, /
A nominal attribute has distinctness; an ordinal attribute has distinctness and order; an interval attribute has distinctness, order, and meaningful differences; a ratio attribute has all four properties/operations.

Types of Attributes
- Nominal. Examples: ID numbers, eye color, zip codes.
- Ordinal. Examples: rankings (e.g., place in a competition), grades, height in {tall, medium, short}.
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio. Examples: height, weight, length, time, counts.

Different Attribute Types

Permissible Transformations That Do Not Change an Attribute's Meaning

Describing Attributes by the Number of Values
- Discrete attribute: has only a finite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
- Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits, so continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes
For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. Consider a data set where each object is a student and each attribute records whether or not the student took a particular course at a university: it is more meaningful and more efficient to focus on the non-zero values. Binary attributes where only non-zero values are important are called asymmetric binary attributes.

Important Characteristics of a Data Set
- Dimensionality (number of attributes): high-dimensional data suffers from the curse of dimensionality, so an important motivation in preprocessing the data is dimensionality reduction.
- Sparsity: only the non-zero values need to be stored and manipulated, which improves computation time and storage.
- Resolution: data can be obtained at different levels of resolution, and the properties of the data often differ at different resolutions. If patterns are too fine or too coarse, they will not be visible.

Types of Data Sets
- Record: data matrix, document data, transaction data
- Graph: World Wide Web, molecular structures
- Ordered: spatial data, temporal data, sequential data, genetic sequence data

Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Transaction Data
A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute.

The Sparse Data Matrix
A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric: only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0/1 entries. Another common example is document data.

Document Data
Each document becomes a "term" vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
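
A short sketch of the term-vector representation using scikit-learn's CountVectorizer, which implements exactly this "component = term count" idea (the documents are illustrative):

# Sketch: turning documents into term-count vectors.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)        # sparse m-by-n document-term matrix
print(vec.get_feature_names_out()) # the terms (one column each)
print(X.toarray())                 # e.g. "the" occurs twice in document 1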

Graph-Based Data
Relationships among objects frequently convey important information; in such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between them and by link properties such as direction and weight.

Graph Data (figures): a generic graph, a molecule (benzene, C6H6), and webpages.

Ordered Data
For some types of data, the attributes have relationships that involve order in time or space. Different types of ordered data:
- Sequential data
- Sequence data
- Time series data
- Spatial data

Sequential Data: Items/Events

Genomic sequence data

Time Series Data

Spatio-Temporal Data (figure): average monthly temperature of land and ocean.

Handling Non-Record Data
Record-oriented techniques can be applied to non-record data by extracting features from the data objects and using these features to create a record corresponding to each object, e.g., for chemical structure data.

Data Quality
Data mining focuses on (1) the detection and correction of data quality problems (data cleaning), and (2) the use of robust algorithms that can tolerate poor data quality.

Measurement and Data Collection Issues
Problems may arise due to human error, limitations of measuring devices, or flaws in the data collection process. Measurement error refers to any problem resulting from the measurement process; for continuous attributes, the numerical difference between the measured and true value is called the error.

Measurement and Data Collection Issues
The term data collection error refers to errors such as missing values, or even entire data objects that may be missing; there may also be spurious or duplicate objects. Examples of data quality problems:
- Noise and outliers
- Missing values
- Duplicate data
- Wrong data

Noise
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.

Artifacts
Data errors may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs. Such deterministic distortions of the data are often referred to as artifacts.

Precision, Bias, and Accuracy
- Precision: the closeness of repeated measurements (of the same quantity) to one another.
- Bias: a systematic variation of the measurements from the quantity being measured.
- Accuracy: the closeness of measurements to the true value of the quantity being measured.

Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
- Case 1: outliers are noise that interferes with data analysis.
- Case 2: outliers are the goal of the analysis (e.g., credit card fraud, intrusion detection).

Missing Values
Reasons for missing values:
- Information is not collected (e.g., people decline to give their age and weight).
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
Handling missing values:
- Eliminate data objects or variables.
- Estimate missing values (example: a time series of temperature).
- Ignore the missing value during analysis.

Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another; this is a major issue when merging data from heterogeneous sources. Examples: the same person with multiple email addresses; two distinct persons with the same name.

Data Quality Issues Related to Applications
- Timeliness
- Relevance
- Knowledge about the data

Data Preprocessing
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Variable transformation

Aggregation
Aggregation combines two or more attributes (or objects) into a single attribute (or object). Purposes:
- Data reduction: reduce the number of attributes or objects.
- Change of scale: cities aggregated into regions, states, countries, etc.; days aggregated into weeks, months, or years.
- More "stable" data: aggregated data tends to have less variability.
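
A minimal sketch of aggregation as a change of scale, rolling illustrative daily values up to monthly means with pandas; the drop in variability shows up in the standard deviations:

# Sketch: days aggregated into months; aggregated data varies less.
import numpy as np
import pandas as pd

daily = pd.Series(np.random.default_rng(0).normal(50, 10, 365),
                  index=pd.date_range("2024-01-01", periods=365, freq="D"))
monthly = daily.resample("MS").mean()   # 365 daily values -> 12 monthly means
print(daily.std(), monthly.std())       # the monthly series is more stable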

Aggregation

Variation of Precipitation in Australia
Example (figures): histograms of the standard deviation of average monthly precipitation and of average yearly precipitation in Australia.

Sampling
Sampling is the main technique employed for data reduction; it is often used both for the preliminary investigation of the data and for the final data analysis. Statisticians often sample because obtaining the entire set of data of interest is too expensive or time-consuming; sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time-consuming.

Sampling …
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same properties (of interest) as the original set of data.

Sample Size (figures): the same data set drawn with 8000, 2000, and 500 points.

Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item.
  - Sampling without replacement: as each item is selected, it is removed from the population.
  - Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.
- Progressive sampling: start with a small sample and increase the sample size until a sample of sufficient size has been obtained.
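
The schemes above can be sketched with pandas as follows (the data frame and fractions are illustrative):

# Sketch: random sampling with/without replacement, and stratified sampling.
import pandas as pd

df = pd.DataFrame({"grp": ["a"] * 90 + ["b"] * 10, "x": range(100)})

no_repl = df.sample(n=20, replace=False, random_state=0)  # without replacement
w_repl  = df.sample(n=20, replace=True,  random_state=0)  # with replacement
# Stratified: draw 10% from each partition so the small group stays represented.
strat = df.groupby("grp").sample(frac=0.1, random_state=0)
print(strat["grp"].value_counts())   # a: 9, b: 1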

Sample Size
What sample size is necessary to get at least one object from each of 10 equal-sized groups?
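
One way to answer this empirically is a small simulation (a sketch; the group count and sample sizes come from the question, the number of trials is arbitrary):

# Sketch: P(sample of size n hits all 10 equal-sized groups), by simulation.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 20, 40, 60):
    hits = sum(len(set(rng.integers(0, 10, n))) == 10 for _ in range(2000))
    print(n, hits / 2000)   # the probability rises toward 1 as n grows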

Dimensionality Reduction
Purpose:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise
Techniques:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Others: supervised and non-linear techniques

Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies, and definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful. (Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points; the relative difference shrinks as dimensionality grows.)

Dimensionality Reduction: PCA
The goal is to find a projection that captures the largest amount of variation in the data.
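
A minimal PCA sketch with scikit-learn on illustrative correlated 2-D data; most of the variation is captured by a single projected axis:

# Sketch: projecting correlated 2-D data onto its first principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.3, size=500)])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)              # 500 x 1 projection
print(pca.explained_variance_ratio_)  # ~0.99: one axis holds most variation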

Feature Subset Selection
Feature subset selection is another way to reduce the dimensionality of data.
- Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
- Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.

Feature Subset Selection Approaches
- Embedded approaches
- Filter approaches
- Wrapper approaches

An Architecture for Feature Subset Selection

Feature Weighting (figure: feature weights ranging from high to low).

Feature Creation

Feature Extraction: Mapping the Data to a New Space (e.g., the Fourier transform).

Discretization Without Using Class Labels Equal interval width approach used to obtain 4 values.

Discretization Without Using Class Labels Equal frequency approach used to obtain 4 values.

Discretization Without Using Class Labels K-means approach to obtain 4 values.
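
The three approaches on the preceding slides can be sketched with scikit-learn's KBinsDiscretizer, whose strategy parameter selects equal width ("uniform"), equal frequency ("quantile"), or k-means binning (the input data is illustrative):

# Sketch: three unsupervised discretizations, each producing 4 values.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.random.default_rng(0).normal(size=(200, 1))
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    codes = disc.fit_transform(x).ravel()
    print(strategy, np.bincount(codes.astype(int)))  # objects per bin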

Binarization

Similarity and Dissimilarity Measures
- Similarity measure: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0, 1].
- Dissimilarity measure: a numerical measure of how different two data objects are; it is lower when objects are more alike. The minimum dissimilarity is often 0, while the upper limit varies.
- Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Dissimilarities Between Data Objects: Distance
Euclidean distance: d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 ), where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the k-th attributes (components) of data objects x and y.

Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance: d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^{1/r}, where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the k-th attributes (components) of data objects x and y.

Euclidean Distance

Minkowski Distance: Examples
- r = 1: city block (Manhattan, taxicab, L1 norm) distance; e.g., the Hamming distance for binary vectors.
- r = 2: Euclidean (L2 norm) distance.
- r → ∞: "supremum" (L_max norm, L∞ norm) distance.
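
A short numeric sketch of the three cases for illustrative points x = (0, 2) and y = (3, 6):

# Sketch: Minkowski distances for r = 1, r = 2, and r -> infinity.
import numpy as np

x, y = np.array([0.0, 2.0]), np.array([3.0, 6.0])

manhattan = np.sum(np.abs(x - y))          # r = 1          -> 7.0
euclidean = np.sqrt(np.sum((x - y) ** 2))  # r = 2          -> 5.0
supremum  = np.max(np.abs(x - y))          # r -> infinity  -> 4.0
print(manhattan, euclidean, supremum)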

Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties, where d(x, y) is the distance (dissimilarity) between points (data objects) x and y:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positive definiteness).
2. d(x, y) = d(y, x) for all x and y (symmetry).
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z (triangle inequality).
A distance that satisfies these properties is a metric.

Non-metric Dissimilarities: Examples
- Set differences: d(A, B) = size(A − B) + size(B − A)
- Time

Similarities Between Data Objects
Similarities also have some well-known properties, where s(x, y) is the similarity between points (data objects) x and y:
1. s(x, y) = 1 (or the maximum similarity) only if x = y (with 0 ≤ s ≤ 1).
2. s(x, y) = s(y, x) for all x and y (symmetry).

Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are called similarity coefficients, and they typically have values between 0 and 1. Let x and y be two objects that consist of n binary attributes, and let f_ab be the number of attributes where x has value a and y has value b. The simple matching coefficient is SMC = (f11 + f00) / (f00 + f01 + f10 + f11).

Jaccard Coefficient
Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that it was not. The Jaccard coefficient counts only matching presences: J = f11 / (f01 + f10 + f11).

SMC versus Jaccard: Example
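
A sketch of the comparison on two illustrative sparse binary vectors; the many shared zeros inflate the SMC, while the Jaccard coefficient ignores them:

# Sketch: SMC vs Jaccard on two binary vectors.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # matching presences
f00 = np.sum((x == 0) & (y == 0))   # matching absences
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)                      # 7/10 = 0.7
jaccard = f11 / (f11 + f10 + f01) if (f11 + f10 + f01) else 0.0  # 0/3  = 0.0
print(smc, jaccard)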

Cosine Similarity
For document vectors x and y, cos(x, y) = (x · y) / (||x|| ||y||): the dot product of the vectors divided by the product of their lengths.

Extended Jaccard Coefficient (Tanimoto)
A variation of the Jaccard coefficient for continuous or count attributes; it reduces to the Jaccard coefficient for binary attributes.

Correlation
Correlation measures the linear relationship between the attributes of the objects.

Visually Evaluating Correlation (figures): scatter plots showing correlations from −1 to 1.

Drawback of Correlation
x = (−3, −2, −1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), so y_i = x_i^2.
mean(x) = 0, mean(y) = 4; std(x) = 2.16, std(y) = 3.74.
corr(x, y) = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0
The correlation is 0 even though y is perfectly (but non-linearly) determined by x.
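
The zero correlation can be verified directly (a one-line check with NumPy):

# Sketch: y = x^2 gives zero linear correlation on symmetric x.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2
print(np.corrcoef(x, y)[0, 1])   # 0.0 despite y being a function of x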

Bregman Divergence

Mahalanobis Distance
The Mahalanobis distance is a generalization of the Euclidean distance that is useful when the attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian.

Comparison of Proximity Measures: Domain of Application
Similarity measures tend to be specific to the type of attribute and data: record data, images, graphs, sequences, 3D protein structures, etc. tend to have different measures. However, one can identify properties that a proximity measure should have: symmetry is a common one, tolerance to noise and outliers is another, and the ability to find more types of patterns is desirable, among many others. Ultimately, the measure must be applicable to the data and produce results that agree with domain knowledge.

Entropy
For a discrete variable taking values with probabilities p_1, ..., p_m, the entropy is H = −Σ_{i=1..m} p_i log2 p_i.

Entropy Examples
For a variable uniformly distributed over m values, the entropy is maximal: H = log2 m. Less balanced distributions have lower entropy.

Entropy for Sample Data: Example

Hair Color  Count  p     −p log2 p
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0.0000
Other       5      0.05  0.2161
Total       100    1.00  1.1540

The maximum entropy for five values is log2 5 = 2.3219.
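
The table's entropy can be recomputed directly (a sketch using the counts above):

# Sketch: entropy of the hair-colour sample; zero-count values contribute 0.
import math

counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
n = sum(counts.values())
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)
print(round(entropy, 4))   # 1.154, well below the maximum log2(5) = 2.3219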

Entropy for Sample Data
For a sample, estimate each probability p_i as the fraction of objects that take value i, and compute H from these estimated probabilities.

Mutual Information
The mutual information of two discrete variables X and Y is I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy.

Mutual Information Example

Student Status  Count  p     −p log2 p
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     −p log2 p
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     −p log2 p
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total           100    1.00   2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624.
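
The same numbers can be recomputed from the joint counts (a sketch):

# Sketch: I(X, Y) = H(X) + H(Y) - H(X, Y) from the joint distribution above.
import math

joint = {("U", "A"): 5, ("U", "B"): 30, ("U", "C"): 10,
         ("G", "A"): 30, ("G", "B"): 20, ("G", "C"): 5}
n = sum(joint.values())

def H(counts):
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

status, grade = {}, {}
for (s, g), c in joint.items():   # marginal counts for each variable
    status[s] = status.get(s, 0) + c
    grade[g] = grade.get(g, 0) + c

mi = H(status.values()) + H(grade.values()) - H(joint.values())
print(round(mi, 4))   # 0.1624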

Maximal Information Coefficient
The maximal information coefficient (MIC) applies mutual information to two continuous variables:
- Consider the possible binnings of the variables into discrete categories, subject to n_X × n_Y ≤ N^0.6, where n_X is the number of values of X, n_Y is the number of values of Y, and N is the number of samples (observations, data objects).
- Compute the mutual information, normalized by log2(min(n_X, n_Y)).
- Take the highest value.