DATA REDUCTION STRATEGIES: DATA CUBE AGGREGATION AND ATTRIBUTE SUBSET SELECTION
Why data reduction? A huge amount of data is created every day, and big data platforms are being developed to handle it. Older algorithms perform poorly at this scale. Most data mining algorithms are implemented column-wise. Together, these factors pushed the development of data reduction procedures.
What is data reduction? Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form. It maintains the integrity of the data while reducing it. The time required for data reduction should not outweigh the time saved by mining the reduced data set. Mining the reduced data should produce the same, or nearly the same, results as mining the original data. Data reduction therefore increases the efficiency of data mining.
Data reduction strategies: data cube aggregation; attribute subset selection; dimensionality reduction; numerosity reduction; discretization and concept hierarchy generation.
Data Cube Aggregation: This technique aggregates (combines) data into a simpler form. The data is summarized so that the summary, rather than the detailed records, can be used as the result of the analysis.
Data Cube Aggregation: Example. The gross profit (in dollars) from laptop sales is recorded per state, with a separate table for each country. Aggregating each table gives a single total per country.
Country: USA
States        Gross Profit ($)
Arizona       500
Texas         320
Illinois      430

Country: India
States        Gross Profit ($)
Kerala        245
Tamil Nadu    380
Goa           950

Country: Canada
States        Gross Profit ($)
Alberta       420
Manitoba      200
Ontario       300

Aggregated by country
Country       Gross Profit ($)
USA           1250
India         1575
Canada        920
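The roll-up from state level to country level can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name aggregate_by_country are assumptions made for this sketch, while the profit figures are the ones given in the slide.

```python
# Gross profit ($) per state, one table per country (figures from the slide).
state_profit = {
    "USA": {"Arizona": 500, "Texas": 320, "Illinois": 430},
    "India": {"Kerala": 245, "Tamil Nadu": 380, "Goa": 950},
    "Canada": {"Alberta": 420, "Manitoba": 200, "Ontario": 300},
}


def aggregate_by_country(tables):
    """Roll state-level profit up to one total per country."""
    return {country: sum(states.values()) for country, states in tables.items()}


country_profit = aggregate_by_country(state_profit)
print(country_profit)
```

Nine state-level rows become three country-level rows, which is enough if the analysis only needs profit per country.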
Attribute Subset Selection: From a large set of attributes, a minimal subset is obtained by eliminating irrelevant attributes whose removal has little effect on the data. Mining the reduced data also gives results that are easier to understand.
Methods of attribute subset selection:
Stepwise forward selection: starts with an empty set and adds the most relevant attributes one at a time, ignoring the rest.
Stepwise backward elimination: starts with the full set and removes the irrelevant attributes one at a time, keeping the rest.
Combined forward selection and backward elimination: at each step, selects the best attribute and removes the worst of the remaining attributes.
Decision-tree induction: builds a flowchart-like structure in which each node chooses the best attribute for partitioning the data.
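As one illustration of these methods, stepwise backward elimination can be sketched as a greedy loop. This is a minimal sketch, not a definitive implementation: the function backward_elimination, the scoring function, and the relevance weights are all hypothetical, and the attribute names are borrowed from the voter example in this deck. In practice the score would come from a statistical test or from model accuracy.

```python
def backward_elimination(attributes, score):
    """Start from the full set and repeatedly drop the attribute whose
    removal hurts the score least; stop once any removal lowers the score."""
    selected = list(attributes)
    while selected:
        current = score(selected)
        # Score of the set with each attribute removed in turn.
        trials = [(score([a for a in selected if a != cand]), cand)
                  for cand in selected]
        best_score = max(s for s, _ in trials)
        if best_score < current:
            break  # every possible removal loses information
        worst = next(c for s, c in trials if s == best_score)
        selected.remove(worst)
    return selected


# Hypothetical relevance weights (an assumption for this sketch only).
RELEVANCE = {"Name": 0, "Age": 1, "Gender": 1, "Address": 0, "Phone": 0}


def relevance_score(subset):
    """Hypothetical score: total relevance of the attributes in the subset."""
    return sum(RELEVANCE[a] for a in subset)


reduced = backward_elimination(
    ["Name", "Age", "Gender", "Address", "Phone"], relevance_score
)
print(reduced)
```

With these weights the loop drops Name, Address, and Phone, and stops when only Age and Gender remain, because removing either would lower the score.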
Example: Given a data set, we need to count the male, female, and transgender individuals who are eligible to vote. Initial attribute set = {Name, Age, Gender, Address, Phone}
Forward Selection
Initial attribute set = {Name, Age, Gender, Address, Phone}
Initial reduced set: { } => {Age} => {Age, Gender}
Reduced attribute set = {Age, Gender}
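The sequence of sets above can be reproduced with a small greedy loop. The following is a sketch under stated assumptions: the function forward_selection and the relevance weights are hypothetical, chosen so that only Age and Gender matter for deciding voting eligibility by gender. A real application would replace the score with a measure computed from the data.

```python
def forward_selection(attributes, score):
    """Start from the empty set and repeatedly add the attribute that
    improves the score most; stop when no attribute improves it."""
    selected = []
    history = [list(selected)]  # record each intermediate reduced set
    remaining = list(attributes)
    while remaining:
        current = score(selected)
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= current:
            break  # no remaining attribute adds anything
        selected.append(best)
        remaining.remove(best)
        history.append(list(selected))
    return selected, history


# Hypothetical relevance weights (an assumption for this sketch only).
RELEVANCE = {"Name": 0, "Age": 1, "Gender": 1, "Address": 0, "Phone": 0}


def relevance_score(subset):
    """Hypothetical score: total relevance of the attributes in the subset."""
    return sum(RELEVANCE[a] for a in subset)


reduced, steps = forward_selection(
    ["Name", "Age", "Gender", "Address", "Phone"], relevance_score
)
print(" => ".join("{" + ", ".join(s) + "}" for s in steps))
```

The recorded steps are the empty set, then {Age}, then {Age, Gender}, matching the progression shown in the slide.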