Data Objects and Attribute Types
What is a data object? What is an attribute?
- Observations
- Attribute vector or feature vector
- Univariate, bivariate
Types of attributes:
- Nominal
- Binary
- Ordinal
- Numeric
- Discrete vs. continuous attributes
Nominal Attributes
- Relating to names: symbols or names of things
- Also called categorical attributes; values can be treated as enumerations
- Example
Binary Attributes
- What is a binary attribute? What is a Boolean attribute?
- Symmetric vs. asymmetric binary attributes
- Example
Ordinal Attributes
- What is an ordinal attribute?
- Example
Numeric Attributes
- What are numeric attributes?
- Interval-scaled attributes
- Ratio-scaled attributes
- Example
Discrete vs. Continuous Attributes
- What are discrete and continuous attributes?
- Example
Basic Statistical Descriptions of Data
- Purpose of basic statistical descriptions of data
- Three areas of basic statistical descriptions:
1. Measuring the central tendency: mean, median and mode
2. Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range
   - Five-number summary, boxplots and outliers
   - Variance and standard deviation
3. Graphic displays of basic statistical descriptions of data
   - Quantile plot
   - Quantile-quantile plot
   - Histogram
   - Scatter plots and data correlation
Measuring the Central Tendency: Mean, Median and Mode
- Measures the location of the middle or center of a data distribution.
- Measures of central tendency include the mean, median, mode and midrange.
- The most common and effective numeric measure is the arithmetic mean, which locates the center of a set of data.
- The mean of a set of N values x1, x2, ..., xN is: mean = (x1 + x2 + ... + xN) / N
Example: Mean
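A minimal Python sketch of the arithmetic mean computation; the salary-style values below are illustrative only and are not taken from the course data.

```python
# Illustrative values (e.g., salaries in thousands of dollars).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: (x1 + x2 + ... + xN) / N
mean = sum(values) / len(values)
print(mean)  # 58.0
```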
Weighted Arithmetic Mean or Weighted Average
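For reference, the standard definition of the weighted arithmetic mean, where each value x_i has an associated weight w_i:

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$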
Median
- Trimmed mean
- For asymmetric (skewed) data, a better measure of the center is the median.
- The median is expensive to compute for a large number of observations.
- For grouped numeric attributes, the median can be easily approximated by interpolation within the median interval.
Mode
- Another measure of central tendency
- The value that occurs most frequently in the set
- Can be used for both qualitative and quantitative attributes
- Unimodal, bimodal and trimodal distributions
- Multimodal
Midrange
- The average of the largest and smallest values in the set
- Easy to compute using the SQL aggregate functions MAX() and MIN()
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation and Interquartile Range
Range, quantiles, quartiles, percentiles and the interquartile range are measures of data dispersion.
- Range: the difference between the largest (max()) and smallest (min()) values
- Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
- Quartiles: the 4-quantiles; each part covers one-fourth of the data distribution
- Percentiles: the 100-quantiles are commonly referred to as percentiles
- Interquartile range: the distance between the first and third quartiles, IQR = Q3 - Q1
(a small computational sketch follows below)
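A small sketch using NumPy's percentile function to compute the quartiles, interquartile range and range; the data values are hypothetical.

```python
import numpy as np

# Hypothetical attribute values.
data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(data, [25, 50, 75])  # quartiles
iqr = q3 - q1                                       # IQR = Q3 - Q1
data_range = data.max() - data.min()                # max() - min()
print(q1, median, q3, iqr, data_range)
```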
Five-Number Summary, Boxplots and Outliers
- The five-number summary: Minimum, Q1, Median, Q3, Maximum
- Boxplot: a popular way of visualizing a distribution; it incorporates the five-number summary
Variance and Standard Deviation
- A low standard deviation means that the data observations tend to be very close to the mean.
- A high standard deviation indicates that the data are spread out over a large range of values.
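For reference, the usual definitions of the variance and standard deviation of N observations x_1, ..., x_N with mean x̄:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2, \qquad \sigma = \sqrt{\sigma^2}$$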
Graphic Displays of Basic Statistical Descriptions of Data
- Quantile plot: displays a univariate data distribution; it shows all of the data for the given attribute
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Histograms, Scatter Plots and Data Correlation
- Histogram: a chart of "poles"; when the attribute is nominal, the resulting graph is more commonly known as a bar chart
- Scatter plot: one of the most effective graphical methods for determining the relationship, pattern or trend between two numeric attributes; it represents bivariate data
- Correlation: whether one attribute implies the other
- Positive correlation, negative correlation, null correlation
Data Visualization
Aims to communicate data clearly and effectively through graphical representation.
- Pixel-oriented techniques
- Geometric projection visualization techniques
- Icon-based visualization techniques
- Hierarchical visualization techniques
- Visualizing complex data and relations
Pixel-Oriented Techniques
- A pixel is used to visualize the value of a dimension; the colour of the pixel reflects the dimension's value.
- The data records can also be ordered in a query-dependent way.
- In a simple linear layout, records that are adjacent in the global order may not end up in nearby pixels (e.g., a pixel and the one above it in the window); to solve this problem, the data records can be laid out along a space-filling curve to fill the window.
- Circle segments: facilitate comparisons across dimensions.
Geometric Projection Visualization Techniques
- Drawback of pixel-oriented techniques: they cannot help much in understanding the distribution of data in multidimensional space.
- Geometric projection helps users find interesting projections of multidimensional data sets.
- Scatter plot: displays 2-D data points; a third dimension can be added using different colours or shapes for the data points.
- Scatter-plot matrix technique: used for data sets with more than four dimensions.
- Parallel coordinates: can handle even higher dimensionality.
Icon-Based Visualization Techniques
- Small icons are used to represent multidimensional data values.
- Two popular icon-based techniques: Chernoff faces and stick figures.
- Chernoff faces: display multidimensional data of up to 18 variables (dimensions) as cartoon faces; viewing large tables of data this way is tedious.
- Asymmetric Chernoff faces: double the number of facial characteristics, allowing up to 36 dimensions to be displayed.
- Stick figures: map multidimensional data to five-piece stick figures.
Hierarchical Visualization Techniques
- These techniques partition all dimensions into subsets (i.e., subspaces); the subspaces are visualized in a hierarchical manner.
- Tree maps: display hierarchical data as a set of nested rectangles.
Visualizing Complex Data and Relations
- Visualization techniques are also available for non-numeric data such as text and social networks.
- Tag cloud: a visualization of the statistics of user-generated tags; tag clouds can be built for a single item or for multiple items.
- A disease influence graph provides a visualization of the correlations between diseases.
Measuring Data Similarity and Dissimilarity
- Measures of proximity
- Two data structures: the data matrix and the dissimilarity matrix
- Data matrix vs. dissimilarity matrix
- Proximity measures for nominal attributes
- Proximity measures for binary attributes
- Dissimilarity of numeric data: Minkowski distance
- Proximity measures for ordinal attributes
- Dissimilarity for attributes of mixed types
- Cosine similarity
Data Matrix vs. Dissimilarity Matrix
- Data matrix (object-by-attribute structure): a two-mode matrix
- Dissimilarity matrix (object-by-object structure): a one-mode matrix
- sim(i, j) = 1 - d(i, j)
Proximity Measures for Nominal Attributes
- The dissimilarity between two objects i and j can be computed based on the ratio of mismatches: d(i, j) = (p - m) / p
- m: the number of matches, i.e., the number of attributes for which i and j are in the same state
- p: the total number of attributes describing the objects
(a small sketch follows below)
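A minimal Python sketch of this ratio-of-mismatches measure; the two example objects and their attribute values are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for nominal attributes."""
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching states
    return (p - m) / p

obj1 = ("red", "single", "excellent")
obj2 = ("red", "married", "fair")
print(nominal_dissimilarity(obj1, obj2))  # 2/3, since only one attribute matches
```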
Data Preprocessing
Need for data preprocessing; real-world data are often:
- Inaccurate
- Incomplete
- Inconsistent
Data quality also involves timeliness, believability and interpretability.
Major Tasks in Data Preprocessing
- Data cleaning
- Data integration
- Data reduction
- Data transformation
Data Cleaning: 1. Missing Values, 2. Noisy Data, 3. Data Cleaning as a Process
MISSING VALUES
- Ignore the tuple
- Fill in the missing value manually
- Use a global constant to fill in the missing value
- Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
- Use the attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value to fill in the missing value
(see the sketch below)
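A hedged sketch of one of these options, filling missing values with the attribute mean using pandas; the column name and values are illustrative.

```python
import pandas as pd

# Hypothetical attribute with missing values.
df = pd.DataFrame({"income": [31000, None, 56000, None, 70000]})

# Use a measure of central tendency (here, the mean) to fill in missing values.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```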
2. Noisy Data: (i) Binning, (ii) Regression, (iii) Outlier Analysis
NOISY DATA
Binning
- Equal-frequency bins
- Smoothing by bin means
- Smoothing by bin boundaries
Example: sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34 (worked through in the sketch below)
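A minimal Python sketch of equal-frequency binning on the price data above, with smoothing by bin means and by bin boundaries; partitioning into bins of depth 3 is one reasonable choice, not prescribed by the slide.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]                  # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # 3 bins of depth 3

# Smoothing by bin means: each value is replaced by its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer bin boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```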
2. Noisy Data: (ii) Regression, (iii) Outlier Analysis
REGRESSION
- Linear regression
- Multiple regression
OUTLIER ANALYSIS
3. Data Cleaning as a Process: (i) Discrepancy Detection, (ii) Data Transformation
DISCREPANCY DETECTION
- Caused by several factors
- How can we proceed with discrepancy detection?
- Field overloading
- Rules to examine the data: unique rule, consecutive rule, null rule
- Tools to aid in the discrepancy detection step: data scrubbing tools, data auditing tools
3. Data Cleaning as a Process: (ii) Data Transformation
DATA TRANSFORMATION
Commercial tools for data transformation:
(i) Data migration tools
(ii) ETL (Extraction/Transformation/Loading) tools
What is Potter's Wheel? What are declarative languages for data cleaning?
Data Integration: 1. Entity Identification Problem, 2. Redundancy and Correlation Analysis, 3. Tuple Duplication
- Merging of data from multiple data sources
- Helps to reduce and avoid inconsistencies
- Helps to improve the accuracy and speed of the data mining process
Data Integration: 1. Entity Identification Problem
- Example: customer_id in DB1 vs. customer_number in DB2; how can the data analyst know whether they refer to the same attribute?
- Metadata (including the name, meaning, data type and range of values of each attribute) helps to avoid errors in schema integration.
- Example: a pay_type attribute coded as H/S in DB1 but as 1/2 in DB2.
Data Integration: 2. Redundancy and Correlation Analysis
REDUNDANCY (e.g., date of birth and age)
- An attribute may be redundant if it can be derived from another attribute or set of attributes.
CORRELATION ANALYSIS
- Some redundancies can be detected by correlation analysis.
- Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
- Types of analysis: (i) nominal data: chi-square (χ²) test; (ii) numeric data: correlation coefficient, covariance
Data Integration: 2. Redundancy and Correlation Analysis
- Nominal attributes, e.g., hair_color (black, brown, red, ...), marital_status (single, married, divorced, widowed)
- Numeric attributes: integer or real values
Chi-square distribution table
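A hedged sketch of a chi-square test of independence (χ² = Σ (observed - expected)² / expected) between two nominal attributes using SciPy; the contingency-table counts are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (e.g., gender vs. preferred reading).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
# A very small p-value means the two attributes are strongly correlated,
# so one of them may be redundant.
```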
Correlation Coefficient for Numeric Data
The correlation between two numeric attributes A and B is evaluated by computing the correlation coefficient (also known as Pearson's product-moment coefficient).
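For reference, the usual form of Pearson's product-moment coefficient for n observations of A and B with means Ā, B̄ and standard deviations σ_A, σ_B:

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$$

r_{A,B} lies in [-1, +1]; values greater than 0 indicate positive correlation, 0 indicates no linear correlation, and values less than 0 indicate negative correlation.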
Covariance of Numeric Data
Correlation and covariance are two simple measures for assessing how much two attributes change together.
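For reference, the standard covariance definition and its relation to the correlation coefficient:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}), \qquad r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}$$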
Tuple Duplication; Data Value Conflict Detection and Resolution
- Duplication should also be detected at the tuple level.
- Denormalized tables are a common source of data redundancy.
- Data value conflict detection and resolution: e.g., a weight attribute stored in different units, hotel prices quoted in different currencies, or grading schemes that differ between schools.
Data Reduction
Why data reduction?
Data reduction strategies:
1) Dimensionality reduction
   - Wavelet transforms
   - Principal Component Analysis (PCA)
   - Attribute subset selection
2) Numerosity reduction
   - Parametric: regression and log-linear models
   - Non-parametric: histograms, clustering, sampling, data cube aggregation
3) Data compression
Dimensionality Reduction
- What is dimensionality reduction? What is the curse of dimensionality?
- What is a wavelet transform?
1. Discrete Wavelet Transform (DWT); DWT in data reduction
2. Discrete Fourier Transform (DFT)
- The DWT is closely related to the DFT, a signal-processing technique involving sines and cosines.
Differences between DWT and DFT
1. DWT: gives a more accurate approximation of the original data; DFT: less accurate for the same number of coefficients.
2. DWT: requires less space; DFT: requires more space.
3. DWT: several families exist (Haar-2, Daubechies-4, Daubechies-6, etc.); DFT: there is only one DFT.
Hierarchical Pyramid Algorithm
A procedure for applying a discrete wavelet transform that halves the data at each iteration, resulting in fast computational speed.
Method:
1. The length L of the input data vector must be an integral power of 2.
2. Each transform involves applying two functions: (i) one applies data smoothing, (ii) the other performs a weighted difference.
3. The two functions are applied to pairs of data points in X.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the previous iterations are the wavelet coefficients of the transformed data.
(a sketch follows below)
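A minimal Python sketch of this pyramid idea for the Haar-2 wavelet: at each pass, pairwise averages act as the smoothing function and pairwise half-differences as the weighted difference; the input vector is illustrative.

```python
def haar_transform(x):
    """Hierarchical pyramid Haar-2 DWT; len(x) must be a power of 2."""
    coeffs = []
    while len(x) > 1:
        averages = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]  # smoothing
        details = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]   # weighted difference
        coeffs = details + coeffs   # collect detail (wavelet) coefficients
        x = averages                # recurse on the smoothed, halved data
    return x + coeffs               # overall average followed by the details

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```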
Matrix Formulation
- Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients; the matrix must be orthogonal.
- The technique also extends to multidimensional data.
Principal Components Analysis (also called the Karhunen-Loeve, or K-L, method)
What is Principal Components Analysis?
Procedure for PCA:
1. Normalize the input data.
2. Compute k orthonormal vectors (the principal components).
3. The principal components are sorted in order of decreasing "significance" or "strength".
4. The data size can be reduced by eliminating the weaker components.
(a sketch follows below)
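A hedged sketch of PCA via eigendecomposition of the covariance matrix using NumPy; the toy 2-D data set and the choice k = 1 are illustrative.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)            # 1. normalize (center) the input data
cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)     # 2. orthonormal vectors (principal components)
order = np.argsort(eigvals)[::-1]          # 3. sort by decreasing "strength"
components = eigvecs[:, order]

k = 1                                      # 4. keep only the strongest component(s)
X_reduced = X_centered @ components[:, :k]
print(X_reduced)
```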
Attribute Subset Selection
- Another way to reduce the dimensionality of data: remove irrelevant or redundant attributes.
- GOAL: 1. to find a minimum set of attributes; 2. to reduce the number of attributes appearing in the discovered patterns.
- How can we find a "good" subset of the original attributes?
Greedy (Heuristic) Methods
What is heuristic search?
Basic heuristic methods:
1. Stepwise forward selection
   Initial attribute set: {a1, a2, a3, a4, a5, a6}
   Initial reduced set: {} => {a1} => {a1, a4} => {a1, a4, a6} (reduced attribute set)
Basic Heuristic Methods (continued)
2. Stepwise backward elimination:
   Initial attribute set: {a1, a2, a3, a4, a5, a6} => {a1, a3, a4, a5, a6} => {a1, a3, a5, a6} => {a1, a3, a6}
3. Combination of forward selection and backward elimination
4. Decision tree induction
Discrete transforms in attribute subset selection
(a forward-selection sketch follows below)
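A hedged Python sketch of stepwise forward selection; the evaluation function `score` is a placeholder for whatever attribute-quality measure is used (e.g., a statistical significance test or validation accuracy) and is not part of the original slides.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves the subset's score."""
    selected, remaining = [], list(attributes)
    while remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                      # no remaining attribute improves the subset
        selected.append(best)
        remaining.remove(best)
    return selected
```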
Numerosity Reduction: (i) Parametric Data Reduction (Regression and Log-Linear Models)
Regression can be used to approximate the given data.
1. Linear (simple) regression: the data are modelled to fit a straight line, y = wx + b, where x and y are numeric database attributes and w and b are the regression coefficients. These coefficients are solved for by the method of least squares. (see the sketch below)
2. Multiple linear regression: allows a response variable y to be modelled as a linear function of two or more predictor variables, e.g., y = b0 + b1*x1 + b2*x2.
Log-linear models: approximate discrete multidimensional probability distributions.
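A minimal sketch of simple linear regression y = wx + b solved by least squares with NumPy; the (x, y) pairs are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# np.polyfit with degree 1 returns the least-squares slope w and intercept b.
w, b = np.polyfit(x, y, deg=1)
print(w, b)   # the data can then be summarized by the two coefficients alone
```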
Numerosity Reduction: (ii) Non-Parametric Data Reduction
Histograms use binning to approximate data distributions and are a popular form of data reduction.
(i) Equal-width: the width of each bucket range is uniform.
(ii) Equal-frequency (or equal-depth): the frequency of each bucket is constant.
Histograms: equal-frequency vs. equal-width buckets
Numerosity Reduction: (ii) Non-Parametric Data Reduction (continued)
Clustering
- Considers data tuples as objects and groups similar objects together.
- Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid.
Sampling (see the sketch below)
- Allows a large data set to be represented by a much smaller random data sample (or subset).
- Simple random sample without replacement of size s (SRSWOR): all tuples are equally likely to be sampled.
- Simple random sample with replacement of size s (SRSWR): similar to SRSWOR, but after a tuple is drawn it is placed back in D so that it may be drawn again.
- Cluster sample: the tuples in D are grouped into M mutually disjoint "clusters", and a simple random sample of the clusters is taken.
- Stratified sample: D is divided into mutually disjoint parts called strata and a sample is drawn from each stratum; this helps ensure a representative sample, especially for skewed data.
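A small Python sketch of SRSWOR and SRSWR on a hypothetical data set D of 100 tuples.

```python
import random

D = list(range(1, 101))   # hypothetical data set of 100 tuples
s = 10                    # sample size

srswor = random.sample(D, s)                  # without replacement: no repeats
srswr = [random.choice(D) for _ in range(s)]  # with replacement: repeats possible

print(srswor)
print(srswr)
```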
Data cube aggregation
Data Cube Aggregation The cube created at the lowest abstraction level is referred to as the base cuboid. A cube at the highest level of abstraction is the apex cuboid.
Data Compression
- Transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
- If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
- If only an approximation of the original data can be reconstructed, the data reduction is called lossy.
- Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
Data Transformation and Data Discretization
The data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand.
Data transformation strategies overview:
- Smoothing
- Attribute construction
- Aggregation
- Normalization
- Discretization
- Concept hierarchy generation for nominal data
Data Transformation by Normalization
- Min-max normalization
- Z-score normalization
- Decimal scaling
Min-Max Normalization
Maps a value v of attribute A to v' in a new range [new_min_A, new_max_A]:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Z-Score Normalization
Maps a value v of attribute A using the mean and standard deviation of A:
v' = (v - mean_A) / std_A
Decimal Scaling
Normalizes by moving the decimal point of values of attribute A:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(see the combined sketch below)
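A combined Python sketch of the three normalization methods; the attribute values, the target range [0, 1] for min-max, and the use of the population standard deviation are all illustrative choices.

```python
values = [73600, 54000, 32000, 98000, 61000]   # hypothetical attribute values

min_v, max_v = min(values), max(values)
mean_v = sum(values) / len(values)
std_v = (sum((v - mean_v) ** 2 for v in values) / len(values)) ** 0.5

# Min-max normalization to the new range [0.0, 1.0].
min_max = [(v - min_v) / (max_v - min_v) for v in values]

# Z-score normalization: (v - mean_A) / std_A.
z_score = [(v - mean_v) / std_v for v in values]

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1.
j = len(str(int(max(abs(v) for v in values))))
decimal_scaled = [v / 10 ** j for v in values]

print(min_max, z_score, decimal_scaled, sep="\n")
```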
Discretization
- Discretization by binning
- Discretization by histogram analysis
- Discretization by cluster, decision tree and correlation analyses
Concept Hierarchy Generation for Nominal Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes