Data Generalization and Summarization

26,241 views 20 slides Mar 06, 2018

About This Presentation

About data generalization and summarization-based characterization


Slide Content

DATA GENERALIZATION AND SUMMARIZATION-BASED CHARACTERIZATION
Presented by R. Ramadevi, I M.Sc. (CS & IT), Nadar Saraswathi College of Arts & Science, Theni.

DATA GENERALIZATION AND SUMMARIZATION-BASED CHARACTERIZATION
Data and objects in databases often contain detailed information at primitive concept levels.
For example, the item relation in a sales database may contain attributes describing low-level item information such as item_id, name, brand, category, supplier, place_made, and price.
Users often want to view such data summarized at a higher conceptual level. This requires an important functionality in data mining: data generalization.

DATA GENERALIZATION
Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.
Approaches to generalizing large data sets fall into two categories:
(1) The data cube (or OLAP) approach
(2) The attribute-oriented induction approach

ATTRIBUTE-ORIENTED INDUCTION
The attribute-oriented induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989.
The data cube approach can be considered a data warehouse-based, precomputation-oriented, materialized-view approach. It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
The attribute-oriented induction approach, by contrast, is a relational database query-oriented, generalization-based, on-line data analysis technique.

ATTRIBUTE-ORIENTED INDUCTION
The two approaches are not strictly separated by on-line versus off-line processing: some aggregations in the data cube can be computed on-line, while off-line precomputation of multidimensional space can speed up attribute-oriented induction as well.
The general idea of attribute-oriented induction is to first collect the task-relevant data using a relational database query, and then perform generalization based on an examination of the number of distinct values of each attribute in the relevant set of data.

EXAMPLE
Specifying a data mining query for characterization with DMQL: Suppose that a user would like to describe the general characteristics of graduate students in the Big_University database, given the attributes name, gender, major, birth_place, birth_date, phone_no, and gpa.
    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, phone_no, gpa
    from student
    where status in "graduate"

TRANSFORMING A DATA MINING QUERY TO A RELATIONAL QUERY
The transformed query is executed against the relational database Big_University_DB and returns the initial working relation, the table on which induction will be performed. The high-level value "graduate" is replaced by the primitive status values it covers.
    use Big_University_DB
    select name, gender, major, birth_place, birth_date, phone_no, gpa
    from student
    where status in ["M.Sc", "M.A", "M.B.A", "Ph.D"]
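The transformation step can be illustrated with a minimal Python sketch using the standard sqlite3 module. It assumes a hypothetical concept hierarchy that maps "graduate" to the four degree values in the query, and a small in-memory student table with made-up rows; none of these names or rows come from the original slides.

```python
import sqlite3

# Hypothetical concept hierarchy for the status attribute (assumed for illustration).
STATUS_HIERARCHY = {"graduate": ["M.Sc", "M.A", "M.B.A", "Ph.D"]}

ATTRIBUTES = ["name", "gender", "major", "birth_place", "birth_date", "phone_no", "gpa"]


def to_relational_query(attributes, table, attr, concept, hierarchy):
    """Expand a high-level concept into primitive values and build a SELECT."""
    leaves = hierarchy[concept]
    placeholders = ", ".join("?" for _ in leaves)
    sql = (
        f"SELECT {', '.join(attributes)} FROM {table} "
        f"WHERE {attr} IN ({placeholders})"
    )
    return sql, leaves


def demo_connection():
    """Create an in-memory student table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (name, gender, major, birth_place, "
        "birth_date, phone_no, gpa, status)"
    )
    rows = [
        ("A. Student", "F", "physics", "Vancouver", "1976-07-08", "555-0101", 3.67, "M.Sc"),
        ("B. Student", "M", "history", "Seattle", "1979-01-15", "555-0102", 3.20, "junior"),
        ("C. Student", "M", "finance", "Montreal", "1975-03-22", "555-0103", 3.85, "Ph.D"),
    ]
    conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def initial_working_relation(conn):
    """Run the transformed query and return the task-relevant tuples."""
    sql, params = to_relational_query(
        ATTRIBUTES, "student", "status", "graduate", STATUS_HIERARCHY
    )
    return conn.execute(sql, params).fetchall()
```

Calling initial_working_relation(demo_connection()) returns only the tuples whose status is one of the graduate degrees; the undergraduate row is filtered out before any induction takes place.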

DATA GENERALIZATION: TWO TYPES
ATTRIBUTE REMOVAL
If there is a large set of distinct values for an attribute of the initial working relation, the attribute should be removed when either of the following holds:
(1) There is no generalization operator on the attribute
(2) Its higher-level concepts are expressed in terms of other attributes

ATTRIBUTE GENERALIZATION
If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.
This corresponds to the generalization rule known as climbing generalization trees in learning from examples, or concept tree ascension.
How large "a large set of distinct values" is can be controlled in two ways:
First technique: attribute generalization threshold control
Second technique: generalized relation threshold control
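The choice between removing, generalizing, and retaining an attribute can be sketched as a small decision function. This is only an illustration of the two rules above under an attribute threshold; the function name, its parameters, and the threshold value used in the test data are assumptions, not part of the original slides.

```python
def choose_operation(values, threshold, has_generalization_operator,
                     expressed_by_other_attributes=False):
    """Decide how attribute-oriented induction treats one attribute.

    values: the attribute's values in the initial working relation
    threshold: assumed attribute generalization threshold (max distinct values)
    """
    distinct = len(set(values))
    if distinct <= threshold:
        # Few distinct values: keep the attribute as it is.
        return "retain"
    if not has_generalization_operator or expressed_by_other_attributes:
        # Many distinct values and no usable way to generalize: drop it.
        return "remove"
    # Many distinct values and a generalization operator exists: climb the hierarchy.
    return "generalize"
```

For instance, a name column with thousands of distinct values and no hierarchy yields "remove", while a major column with a concept hierarchy yields "generalize".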

ATTRIBUTE-ORIENTED INDUCTION
For each attribute of the relation, the generalization proceeds as follows:
1. name: Since there are a large number of distinct values for name and no generalization operation is defined on it, this attribute is removed.
2. gender: Since there are only two distinct values for gender, the attribute is retained and no generalization is performed on it.
3. major: Suppose that a concept hierarchy has been defined that allows the attribute major to be generalized to the values {arts_science, business}.
4. birth_place: This attribute has a large number of distinct values, so it should be generalized. Suppose that a concept hierarchy exists for birth_place, defined as city < province_or_state < country.

ATTRIBUTE-ORIENTED INDUCTION
5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age, and age to age_range.
6. residence: The number of distinct values for number and street will likely be very high, so these low-level parts of the address are removed.
7. phone: This attribute contains too many distinct values and should therefore be removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade point average into numerical intervals like {3.75-4.0, 3.5-3.75, ...}.
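The per-attribute decisions above can be sketched on a single student tuple. The hierarchies below (which majors count as arts_science, which cities belong to which country, five-year age ranges, and a "below 3.5" gpa bucket) are hypothetical stand-ins chosen for illustration; the slides only state that such hierarchies are assumed to exist.

```python
from datetime import date

# Hypothetical concept hierarchies (assumed for illustration only).
MAJOR_HIERARCHY = {"physics": "arts_science", "history": "arts_science", "finance": "business"}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "USA"}


def age_range(birth_date, today):
    """Generalize birth_date to age, then age to a five-year age_range."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def gpa_range(gpa):
    """Group a grade point average into a numerical interval."""
    if gpa >= 3.75:
        return "3.75-4.0"
    if gpa >= 3.5:
        return "3.5-3.75"
    return "below 3.5"


def generalize_student(row, today):
    """Apply the walk-through: name and phone_no are removed, the rest generalized."""
    return {
        "gender": row["gender"],                              # retained as is
        "major": MAJOR_HIERARCHY[row["major"]],               # climb to category
        "birth_country": CITY_TO_COUNTRY[row["birth_place"]],  # city -> country
        "age_range": age_range(row["birth_date"], today),
        "gpa": gpa_range(row["gpa"]),
    }
```

Passing the reference date explicitly keeps the age computation reproducible; a real implementation would more likely take it from the query or the system clock.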

Efficient implementation of attribute-oriented induction
Algorithm: attribute_oriented_induction. Mining generalized characteristics in a relational database, given a user's data mining request.
Input:
(i) DB, a relational database
(ii) DMQuery, a data mining query
(iii) a_list, a list of attributes (containing attributes a_i)
(iv) Gen(a_i), a set of concept hierarchies or generalization operators on attributes a_i
(v) a_gen_thresh(a_i), attribute generalization thresholds for each a_i

OUTPUT & METHOD
Output: P, a prime_generalized_relation.
Method: The method is outlined as follows.
1. W ← get_task_relevant_data(DMQuery, DB); the working relation W holds the task-relevant data.
2. prepare_for_generalization(W):
(a) Scan W and collect the distinct values for each attribute a_i.
(b) For each attribute a_i, determine whether a_i should be removed and, if not, compute its minimum desired level L_i.

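3. P ← generalization(W)
The prime_generalized_relation P is derived by replacing each value v in W with its corresponding generalized value, while accumulating count and computing any other aggregate values. This step can be implemented in either of two ways:
(a) For each generalized tuple, insert the tuple into a sorted prime relation P by a binary search; if the tuple is already in P, increase its count and other aggregate values instead.
(b) Since in most cases the number of distinct values at the prime relation level is small, the prime relation can be coded as an m-dimensional array, where m is the number of attributes in P.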
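The scan and generalization steps can be sketched in a few lines of Python. Note that this sketch swaps in a hash-based Counter for the prime relation rather than the sorted relation with binary search or the array encoding described above, and the function names, sample rows, and generalizer mapping are all assumptions made for illustration.

```python
from collections import Counter


def distinct_values(working_relation):
    """Scan W once and collect the distinct values of each attribute."""
    seen = {}
    for row in working_relation:
        for attr, value in row.items():
            seen.setdefault(attr, set()).add(value)
    return seen


def attribute_oriented_induction(working_relation, generalizers):
    """Derive the prime relation P as a mapping: generalized tuple -> count.

    generalizers maps each retained attribute to a function that lifts a
    value to its desired level; attributes missing from it are removed.
    """
    prime = Counter()
    for row in working_relation:
        key = tuple(gen(row[attr]) for attr, gen in generalizers.items())
        prime[key] += 1  # identical generalized tuples are merged, count accumulated
    return prime


# Made-up working relation and generalizers for a quick check.
SAMPLE_W = [
    {"name": "A", "gender": "F", "major": "physics", "gpa": 3.67},
    {"name": "B", "gender": "F", "major": "history", "gpa": 3.70},
    {"name": "C", "gender": "M", "major": "finance", "gpa": 3.90},
]
SAMPLE_GENERALIZERS = {
    "gender": lambda v: v,
    "major": lambda v: {"physics": "arts_science", "history": "arts_science",
                        "finance": "business"}[v],
    "gpa": lambda v: "3.75-4.0" if v >= 3.75 else "3.5-3.75",
}
```

With the sample data, the two arts_science students collapse into a single generalized tuple with count 2, which is the merging behaviour the algorithm relies on.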

Presentation of the derived generalization
Attribute-oriented induction generates one or a set of generalized descriptions. A generalized relation:

location        item      sales  count
Asia            TV        15     300
Europe          TV        12     250
North America   TV        28     450
Asia            computer  120    1000

A CROSSTAB FOR THE SALES IN 1999

LOCATION \ ITEM   TV              COMPUTER        BOTH_ITEMS
                  sales  count    sales  count    sales  count
Asia              15     300      120    1000     135    1300
Europe            12     250      150    1200     162    1450
North America     28     450      200    1800     228    2250
All regions       55     1000     470    4000     525    5000

The t-weight is an interestingness measure that describes the typicality of each disjunct in the rule.
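A crosstab of this kind can be derived mechanically from a generalized relation by accumulating row and column totals. The sketch below is one possible way to do it; the function name is hypothetical, and the North America computer row is an assumption inferred from the "All regions" totals rather than a row listed in the generalized relation.

```python
from collections import defaultdict

# (location, item, sales, count) tuples from the generalized relation.
GENERALIZED_RELATION = [
    ("Asia", "TV", 15, 300),
    ("Europe", "TV", 12, 250),
    ("North America", "TV", 28, 450),
    ("Asia", "computer", 120, 1000),
    ("Europe", "computer", 150, 1200),
    # Assumed row: inferred from the "All regions" totals.
    ("North America", "computer", 200, 1800),
]


def build_crosstab(rows):
    """Return {location: {item: [sales, count]}} including margin totals."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for location, item, sales, count in rows:
        # Each tuple contributes to its own cell, its row total,
        # its column total, and the grand total.
        for loc in (location, "all_regions"):
            for it in (item, "both_items"):
                table[loc][it][0] += sales
                table[loc][it][1] += count
    return table
```

Here build_crosstab(GENERALIZED_RELATION)["Asia"]["both_items"] evaluates to [135, 1300], the sum of the Asia TV and Asia computer cells.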

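T-weight
Let q_1, ..., q_n be the disjuncts (generalized tuples) that describe the target class. The t-weight for q_a is the percentage of tuples of the target class from the initial working relation that are covered by q_a:
$$\text{t\_weight} = \frac{\text{count}(q_a)}{\sum_{i=1}^{n} \text{count}(q_i)}$$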
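As a small worked example, the t-weight of each location for TV sales can be computed from the TV counts in the crosstab (300, 250, and 450), normalizing each count by the total over all disjuncts. The function name below is an assumption made for illustration.

```python
def t_weights(counts):
    """Map each disjunct to count(q_a) divided by the sum of all counts."""
    total = sum(counts.values())
    return {disjunct: count / total for disjunct, count in counts.items()}


# TV tuple counts per location, taken from the crosstab.
TV_COUNTS = {"Asia": 300, "Europe": 250, "North America": 450}
```

Under these counts, Asia covers 300 of the 1000 TV tuples, so its t-weight is 30%; the weights of all disjuncts always sum to 1.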

BAR CHART REPRESENTATION
[Bar chart: sales for TV, computer, and TV + computer; vertical axis marked 50, 100, 150, 200]

PIE CHART REPRESENTATION
[Pie chart, TV sales: Asia (27.27%), Europe (21.82%), North America (50.91%)]
[Pie chart, computer sales: Asia (25.53%), Europe (31.91%), North America (42.55%)]

THANK YOU!!!