Data mining techniques unit 2

malathieswaran29 · 421 views · 91 slides · Aug 14, 2021

About This Presentation

unit-2 content


Slide Content

DATA MINING TECHNIQUES UNIT-II

Data Objects and Attribute Types What is a data object? What is an attribute? -Observations -Attribute vector (feature vector) -Univariate, bivariate Types of attributes: -Nominal -Binary -Ordinal -Numeric -Discrete vs. Continuous Attributes

Nominal Attributes Relating to names: symbols or names of things Categorical (nominal) Enumerations Example

Binary Attribute What is a binary attribute? What is a Boolean attribute? Symmetric Asymmetric Example

Ordinal Attribute What is an ordinal attribute? Example

Numeric Attributes What are numeric attributes? Interval-scaled attributes Ratio-scaled attributes Example

Discrete vs. Continuous Attributes What are discrete and continuous attributes? Example

Basic Statistical Descriptions of Data -Purpose of basic statistical descriptions of data -Three areas of basic statistical descriptions: 1.Measuring the central tendency: mean, median and mode 2.Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range -Five-number summary, boxplots and outliers -Variance and standard deviation 3.Graphic displays of basic statistical descriptions of data -Quantile plot -Quantile-quantile plot -Histogram -Scatter plots and data correlation

Measuring the Central Tendency: Mean, Median and Mode -Measures the location of the middle or center of a data distribution -Measures of central tendency include the mean, median, mode and midrange -The most common and effective numeric measure is the ARITHMETIC MEAN, which locates the center of a set of data -The mean of a set of N values x1, x2, …, xN is: x̄ = (x1 + x2 + … + xN) / N

Example: Mean

Weighted Arithmetic Mean or Weighted Average
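The two means can be made concrete with a minimal Python sketch; the salary figures and the weighting choice below are hypothetical, purely for illustration:

```python
# Hypothetical salary sample (values in thousands of dollars)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by their count
mean = sum(salaries) / len(salaries)

# Weighted arithmetic mean: each value x_i carries a weight w_i
weights = [1.0] * len(salaries)
weights[-1] = 0.5  # hypothetical choice: down-weight the extreme value 110
weighted_mean = sum(w * x for w, x in zip(weights, salaries)) / sum(weights)
```

Down-weighting the outlier pulls the weighted mean below the plain mean, which is one reason weighted averages are used with skewed data.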

Median -Trimmed mean -For asymmetric (skewed) data, a better measure of the center of the data is the MEDIAN -The median is expensive to compute for a large number of observations -For numeric attributes, its value can easily be approximated -Median interval
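A minimal sketch of the exact median: the middle value of the sorted data, or the average of the two middle values when the count is even.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle
    values when the number of observations is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
```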

Mode -Another measure of central tendency -The value that occurs most frequently in the set -It can be used in qualitative and quantitative attributes -unimodal, bimodal and trimodal -multimodal
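The mode(s) can be sketched as the most frequent value(s); returning all of them makes the unimodal/bimodal/multimodal distinction visible.

```python
from collections import Counter

def modes(values):
    """All values that occur most frequently; one result means the
    data set is unimodal, two bimodal, and so on."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)
```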

Midrange -It is the average of the largest and smallest values in the set -This measure is easy to compute using the SQL aggregate functions MAX() and MIN()
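In Python the midrange is one line, mirroring (MAX(col) + MIN(col)) / 2 in SQL:

```python
def midrange(values):
    """Average of the largest and smallest values in the set."""
    return (max(values) + min(values)) / 2
```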

Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation and Interquartile Range Range, quantiles, quartiles, percentiles and the interquartile range are measures of data dispersion -Range: the difference between the largest (max()) and smallest (min()) values -Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets -Quartiles: each part is one-fourth of the data distribution -Percentiles: the 100-quantiles are commonly referred to as percentiles -Interquartile range: the distance between the first and third quartiles, IQR = Q3 − Q1
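A sketch of IQR = Q3 − Q1; note that several quartile conventions exist, and this one takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data:

```python
def median(values):
    # middle value, or average of the two middle values for even counts
    s = sorted(values)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def iqr(values):
    """Interquartile range IQR = Q3 - Q1, with Q1/Q3 taken as the
    medians of the lower/upper halves (one common convention)."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])   # first quartile
    q3 = median(s[-half:])  # third quartile
    return q3 - q1
```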

Five-Number Summary, Boxplots and Outliers -The five-number summary: Minimum, Q1, Median, Q3, Maximum -Boxplot: a popular way of visualizing a distribution -It incorporates the five-number summary

Variance and Standard Deviation -A low standard deviation means that the data observations tend to be very close to the mean -A high standard deviation indicates that the data are spread out over a large range of values
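The point about spread can be shown with the population variance (average squared deviation from the mean) and its square root, the standard deviation; the data below are hypothetical:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
mean = sum(data) / len(data)

# Population variance: average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)  # standard deviation
```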

Graphic Displays of Basic Statistical Descriptions of Data -Quantile plot: displays a univariate data distribution -It displays all of the data for the given attribute -Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Histograms, Scatter Plots and Data Correlation -Histogram: a chart of bars (poles) summarizing the distribution of a given attribute; the resulting graph is commonly known as a bar chart -Scatter plot: one of the most effective graphical methods for determining the relationship, pattern or trend between two numeric attributes -It allows bivariate data to be represented -Correlation: whether one attribute implies the other -Positive correlation, negative correlation, null correlation

Data Visualization Aims to communicate data clearly and effectively through graphical representation Pixel oriented techniques Geometric projection visualization techniques Icon based visualization techniques Hierarchical visualization techniques Visualizing complex data and relations

Pixel-Oriented Techniques To visualize the value of a dimension, the colour of a pixel reflects the dimension's value The data records can also be ordered in a query-dependent way Filling the window line by line can separate related records; to solve this problem, data records can be laid out along a space-filling curve to fill the window Circle segments: comparison of dimensions

Geometric Projection Visualization Techniques Drawback of pixel-oriented techniques: they cannot help much in understanding the distribution of data in multidimensional space Geometric projection helps users find interesting projections of multidimensional data sets Scatter plot: 2-D data points; a third dimension can be added using different colours or shapes for different data points Scatter-plot matrix technique: for data sets with more than four dimensions Parallel coordinates: can handle higher dimensionality

Icon-Based Visualization Techniques Small icons are used to represent multidimensional data values Two popular icon-based techniques: Chernoff faces and stick figures Chernoff faces: display multidimensional data of up to 18 variables (dimensions) as cartoon faces Viewing large tables of data this way is tedious Asymmetric Chernoff faces: double the number of facial characteristics, allowing up to 36 dimensions to be displayed Stick figures: map multidimensional data to five-piece stick figures

Hierarchical Visualization Techniques These techniques partition all dimensions into subsets (i.e., subspaces); the subspaces are visualized in a hierarchical manner Treemaps: display hierarchical data as a set of nested rectangles

Visualizing Complex Data and Relations For non-numeric data such as text and social networks, visualization techniques are also available Tag cloud: a visualization of the statistics of user-generated tags Tag clouds can be built for a single item or for multiple items A disease influence graph provides a visualization of the correlations between diseases

Measuring data similarity and dissimilarity Measures of proximity Two data structures: the data matrix and dissimilarity matrix Data matrix Vs Dissimilarity matrix Proximity measures for nominal attributes Proximity measures for binary attributes Dissimilarity of numeric data: Minkowski distance Proximity measures for ordinal attributes Dissimilarity for attributes of mixed types Cosine similarity

Data Matrix vs. Dissimilarity Matrix Data matrix (object-by-attribute structure) Dissimilarity matrix (object-by-object structure) sim(i, j) = 1 − d(i, j) The data matrix is a two-mode matrix; the dissimilarity matrix is a one-mode matrix

Proximity Measures for Nominal Attributes The dissimilarity between two objects i and j can be computed based on the ratio of mismatches: d(i, j) = (p − m) / p where m is the number of matches (the number of attributes for which i and j are in the same state) and p is the total number of attributes describing the objects
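The mismatch ratio d(i, j) = (p − m) / p is a few lines of Python; the attribute values below are made up for illustration:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by the same
    p nominal attributes, where m counts matching states."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p
```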

Proximity measures for binary attributes Symmetric binary attributes Asymmetric binary dissimilarity Asymmetric binary similarity

Dissimilarity of numeric data: Minkowski distance
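A sketch of the Minkowski distance of order h; h = 1 gives the Manhattan distance and h = 2 the Euclidean distance:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors:
    (sum |x_i - y_i|^h) ** (1/h)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)
```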

Proximity measures for ordinal attributes

Dissimilarity for attributes of mixed types

Cosine similarity
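Cosine similarity compares the directions of two vectors, cos(x, y) = x·y / (‖x‖·‖y‖); a value near 1 means near-identical direction, 0 means orthogonal:

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```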

Data Preprocessing Need for data preprocessing: data may be -Inaccurate -Incomplete -Inconsistent Data quality also involves timeliness, believability and interpretability

Major Tasks in Data Preprocessing Data cleaning Data integration Data reduction Data transformation

Data Cleaning 1.Missing Values, 2.Noisy Data, 3.Data Cleaning as a Process MISSING VALUES -Ignore the tuple -Fill in the missing value manually -Use a "global constant" to fill in the missing value -Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value -Use the attribute mean or median of all samples belonging to the same class as the given tuple -Use the most probable value to fill in the missing value

2.Noisy Data - (i)Binning, (ii)Regression, (iii)Outlier analysis NOISY DATA Binning: equal-frequency bins; smoothing by bin means; smoothing by bin boundaries Example: sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
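The two smoothing variants can be sketched on the slide's sorted price data, using equal-frequency bins of three values each:

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # the slide's sorted prices

# Equal-frequency partitioning into bins of three values each
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```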

2.Noisy Data - (ii)Regression, (iii)Outlier Analysis REGRESSION: linear regression, multiple regression OUTLIER ANALYSIS

3.Data Cleaning as a Process - (i)Discrepancy Detection, (ii)Data Transformation DISCREPANCY DETECTION Caused by several factors How can we proceed with discrepancy detection? Field overloading Rules to examine the data: -Unique rule -Consecutive rule -Null rule Tools to aid in the discrepancy detection step: data scrubbing tools, data auditing tools

3.Data Cleaning as a Process - (ii)Data Transformation DATA TRANSFORMATION Commercial tools for data transformation: (i)Data migration tools (ii)ETL (Extraction/Transformation/Loading) tools What is Potter's Wheel? What are declarative languages for data cleaning?

Data Integration - 1.Entity Identification Problem, 2.Redundancy and Correlation Analysis, 3.Tuple Duplication Merging of data from multiple data sources Helps to reduce and avoid inconsistencies Helps to improve the accuracy and speed of the data mining process

Data Integration - 1.Entity Identification Problem Customer_id in DB1 vs. Customer_number in DB2: how can the data analyst tell whether they refer to the same attribute? Metadata includes the name, meaning, data type and range of values of each attribute Metadata helps to avoid errors in schema integration Example: pay type coded as H/S in DB1 but as 1/2 in DB2

Data Integration - 2.Redundancy and Correlation Analysis REDUNDANCY (e.g., DOB and AGE) An attribute may be redundant if it can be derived from another attribute or set of attributes CORRELATION ANALYSIS Some redundancies can be detected by correlation analysis Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data Types of analysis: (i)Nominal data: χ² (chi-square) test (ii)Numeric data: correlation coefficient, covariance

Data Integration - 2.Redundancy and Correlation Analysis Nominal attribute: e.g., hair colour (black, brown, red, etc.) or marital status (single, married, divorced, widowed) Numeric attribute: integer or real values

Chi-square distribution table
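A sketch of how the χ² statistic for a 2×2 contingency table is computed before being compared against the chi-square distribution table; the observed counts below are hypothetical:

```python
# Hypothetical 2x2 contingency table of observed counts,
# e.g. rows = gender, columns = preferred reading category
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # total number of tuples

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under independence of the two attributes
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (o - expected) ** 2 / expected
```

A χ² value this large, compared against the table at 1 degree of freedom, would indicate the two attributes are strongly correlated.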

Correlation Coefficient for Numeric Data The correlation between two attributes A and B is evaluated by computing the correlation coefficient (also known as Pearson's product-moment coefficient)

Covariance of numeric data Correlation and covariance are two simple measures for assessing how much two attributes change together.
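Both measures can be sketched together: the population covariance averages the products of deviations from the means, and Pearson's coefficient divides it by the product of the standard deviations. The stock-price vectors in the test are hypothetical.

```python
import math

def covariance(a, b):
    """Population covariance: mean of (x - mean_a)(y - mean_b)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation(a, b):
    """Pearson's product-moment coefficient: covariance divided by
    the product of the standard deviations; ranges from -1 to +1."""
    std_a = math.sqrt(covariance(a, a))  # std dev = sqrt(variance)
    std_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)
```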

Tuple Duplication Duplication should also be detected at the tuple level Denormalized tables are a source of data redundancy Data value conflict detection and resolution E.g., weight, hotel and school attributes coded differently across sources

Data Reduction Why data reduction? Data reduction strategies: 1)Dimensionality reduction: wavelet transforms, Principal Component Analysis (PCA), attribute subset selection 2)Numerosity reduction: parametric (regression and log-linear models); non-parametric (histograms, clustering, sampling, data cube aggregation) 3)Data compression

Dimensionality Reduction What is dimensionality reduction? What is the curse of dimensionality? What is a wavelet transform? 1.Discrete Wavelet Transform (DWT) DWT in data reduction 2.Discrete Fourier Transform (DFT) The DWT is closely related to the DFT, a signal processing technique involving sines and cosines

Difference between DWT and DFT
1. DWT is more accurate; DFT is less accurate
2. DWT requires less space; DFT requires more space
3. There are several families of DWT (Haar-2, Daubechies-4, Daubechies-6, etc.); there is only one DFT

Hierarchical Pyramid Algorithm Procedure for applying a discrete wavelet transform that halves the data at each iteration, resulting in fast computational speed. Method: 1.The length L of the input data vector must be an integral power of 2. 2.Each transform involves applying two functions: (i)one applies data smoothing, (ii)the other performs a weighted difference. 3.The two functions are applied to pairs of data points in X. 4.The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2. 5.Selected values from the data sets obtained in the previous iterations are the wavelet coefficients of the transformed data.
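The pyramid idea can be sketched with an unnormalized Haar transform: each pass replaces pairs with their averages (smoothing) and half-differences (weighted difference), then recurses on the averages. This is an illustrative variant, not necessarily the slides' exact coefficient scheme; here the recursion runs down to a single overall average.

```python
def haar_transform(x):
    """Hierarchical pyramid pass over a vector whose length is an
    integral power of 2; returns the overall average followed by
    the detail (difference) coefficients."""
    assert len(x) & (len(x) - 1) == 0, "length must be a power of 2"
    coeffs = []
    while len(x) > 1:
        smooth = [(a + b) / 2 for a, b in zip(x[::2], x[1::2])]  # smoothing
        detail = [(a - b) / 2 for a, b in zip(x[::2], x[1::2])]  # weighted difference
        coeffs = detail + coeffs  # coarser-level details go in front
        x = smooth                # recurse on the smoothed halves
    return x + coeffs
```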

Matrix A matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients. Matrix must be orthogonal. Multidimensional Data.

Principal Components Analysis (also called the Karhunen-Loeve or K-L method) What is Principal Component Analysis? Procedure for PCA: 1.Normalize the input data 2.Compute k orthonormal vectors 3.The principal components are sorted in order of decreasing "significance" or "strength" 4.The data size can be reduced by eliminating the weaker components

Attribute Subset Selection Another way to reduce dimensionality of data. Irrelevant attributes GOAL: 1.To find a minimum set of attributes 2.Reduces the no. of attributes appearing in the discovered pattern How can we find a ‘good’ subset of the original attributes?

Greedy Methods What is heuristic search? Basic heuristic methods: 1.Stepwise forward selection Initial attribute set: {a1, a2, a3, a4, a5, a6} Initial reduced set: {} => {a1} => {a1, a4} => {a1, a4, a6} (reduced attribute set)

Basic Heuristic Methods 2.Stepwise backward elimination: Initial attribute set: {a1, a2, a3, a4, a5, a6} => {a1, a3, a4, a5, a6} => {a1, a3, a5, a6} => {a1, a3, a6} 3.Combination of forward selection and backward elimination 4.Decision tree induction Discrete transform in attribute subset selection

Numerosity Reduction - (i)Parametric Data Reduction Regression and log-linear models Regression can be used to approximate the given data 1.Linear (simple) regression: data are modelled to fit a straight line, y = wx + b, where x and y are numeric database attributes and w and b are regression coefficients. The coefficients are solved for by the method of least squares. 2.Multiple linear regression: allows a response variable y to be modelled as a linear function of two or more predictor variables, y = b0 + b1x1 + b2x2 Log-linear models: approximate discrete multidimensional probability distributions
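The least-squares solution for simple linear regression has a closed form, w = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², b = ȳ − w·x̄, which is easy to sketch:

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b in y = w*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n  # means of x and y
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b
```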

Numerosity Reduction - (ii)Non-parametric Data Reduction Histograms use binning to approximate data distributions and are a popular form of data reduction (i)Equal-width: the width of each bucket range is uniform (ii)Equal-frequency (or equal-depth): the frequency of each bucket is constant

Histogram-Equal-frequency, Equal-width

Numerosity Reduction - (ii)Non-parametric Data Reduction Clustering: considers data tuples as objects and groups similar objects Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid Sampling: allows a large data set to be represented by a much smaller random data sample (or subset) Simple random sample without replacement (SRSWOR): all tuples are equally likely to be sampled Simple random sample with replacement (SRSWR): similar to SRSWOR, but after a tuple is drawn it is placed back in D so that it may be drawn again Cluster sample: the tuples in D are grouped into M mutually disjoint "clusters" Stratified sample: D is divided into mutually disjoint parts called strata; this helps ensure a representative sample, especially for skewed data
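The two simple random sampling schemes can be sketched with Python's standard library (tuples here stand in for database rows):

```python
import random

def srswor(data, s):
    """Simple random sample without replacement of size s:
    each tuple can be drawn at most once."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample with replacement of size s:
    a drawn tuple is 'placed back' and may be drawn again."""
    return [random.choice(data) for _ in range(s)]
```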

Data cube aggregation

Data Cube Aggregation The cube created at the lowest abstraction level is referred to as the base cuboid. A cube at the highest level of abstraction is the apex cuboid.

Data Compression Transformations are applied so as to obtain a reduced or "compressed" representation of the original data If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless If only an approximation of the original data can be reconstructed, the data reduction is called lossy The dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression

Data Transformation and Data Discretization The data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand Data transformation strategies overview: Smoothing Attribute construction Aggregation Normalization Discretization Concept hierarchy generation for nominal data

Data Transformation by Normalization Min-Max Normalization Z-score Normalization Decimal Scaling

Min-Max Normalization

Min-Max Normalization

Z-score Normalization

Decimal Scaling
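The three normalization strategies can be sketched together; the income figures in the test are hypothetical:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [old_min, old_max]
    linearly onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization: how many standard deviations
    v lies from the mean."""
    return (v - mean) / std

def decimal_scale(v, j):
    """Decimal scaling: divide by 10^j, where j is the smallest
    integer such that max(|v'|) < 1."""
    return v / 10 ** j
```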

Discretization Discretization by binning Discretization by histogram analysis Discretization by cluster, decision tree and correlation analyses

Concept Hierarchy Generation for Nominal Data Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes but not of their partial ordering Specification of only a partial set of attributes