Data Mining Primitives, Languages & Systems

Niloy Sikder, 45 slides, Apr 05, 2019

About This Presentation

Data Mining Primitives, Languages & Systems, presented as an assignment/class lecture.


Slide Content

Advanced Data Mining
Lec-4: Data Mining Primitives, Languages & Systems [Class Presentation]
Presented by Niloy Sikder, ID: MSc 190221
CSE Discipline, Khulna University, Khulna

Mar 6, 2019, CSE, KU

Presentation Outline
- What are the Primitives of Data Mining?
- Task-relevant data
  - Data Warehouse
  - Data Cube
  - Drill-down & Roll-up
  - Data Selection
  - Data Filtering
  - Data Slicing
  - Data Pivoting
  - Dicing
  - Data Grouping
  - Clustering
  - Clustering Methods
- Knowledge type to be mined
  - Data Characterization
  - Statistical Measures
  - AOI
  - Data Discrimination
  - Associations and Correlations
  - Classification
  - Classification methods
  - Prediction
- Background knowledge
  - Concept Hierarchies
- System architectures of data mining
  - Data Mining System Architecture
  - Types of Data Mining Architectures
- Languages of data mining
  - DMQL
  - OLE DB
- Pattern interestingness measures
- Visualization of discovered patterns

Data Mining Primitives

What are the Primitives of Data Mining?
- The set of task-relevant data to be mined
- The kind of knowledge to be mined
- The background knowledge
- Interestingness measures and thresholds for pattern evaluation
- The expected representation for visualizing the discovered patterns

The First Primitive of Data Mining: Task-relevant Data
- The portions of the database, or the set of data, in which the user is interested
Fig. 1: Task-relevant data for specifying a data mining task

Task-relevant Data: Data Warehouse
- A warehouse is a repository of information, usually collected from multiple sources
- Usually resides at a single site
- Constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
Fig. 2: Typical framework of a data warehouse for AllElectronics.

Task-relevant Data: Data Cube
- A multidimensional data structure inside a data warehouse
- Each dimension corresponds to an attribute
- Each cell stores the value of some aggregate measure
Fig. 3: Summarized data for AllElectronics.
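As a rough illustration of the cube idea, the sketch below stores one aggregate value per cell in a plain dictionary keyed by the dimension values. The sales records, the three dimensions (quarter, item type, city), and the `build_cube` helper are all hypothetical; a real warehouse would use a dedicated OLAP engine rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, item_type, city, units_sold)
SALES = [
    ("Q1", "computer", "Vancouver", 10),
    ("Q1", "phone", "Vancouver", 5),
    ("Q1", "computer", "Toronto", 7),
    ("Q2", "computer", "Vancouver", 12),
    ("Q2", "phone", "Toronto", 4),
    ("Q1", "computer", "Vancouver", 3),
]


def build_cube(facts):
    """Aggregate facts into cells keyed by (quarter, item_type, city).

    Each dimension is one attribute; each cell holds the summed measure.
    """
    cube = defaultdict(int)
    for quarter, item, city, units in facts:
        cube[(quarter, item, city)] += units
    return dict(cube)


cube = build_cube(SALES)
for cell, units in sorted(cube.items()):
    print(cell, units)
```

The two Q1 computer sales in Vancouver land in the same cell, so that cell stores their sum rather than two separate records.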

Data Cube: Drill-down & Roll-up
- A presentation of data at different levels of abstraction
- Allow the user to view the data at differing degrees of summarization
Fig. 3: Summarized data resulting from drill-down and roll-up operations on the cube.
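A minimal sketch of the two operations, assuming a hypothetical city-to-country hierarchy and made-up unit counts: rolling up replaces each city with its country and re-aggregates, while drilling down goes back to the city-level cells behind one summarized cell.

```python
from collections import defaultdict

# Hypothetical hierarchy and city-level cells (quarter, city) -> units sold.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}
CITY_SALES = {
    ("Q1", "Vancouver"): 13,
    ("Q1", "Toronto"): 7,
    ("Q1", "Chicago"): 9,
    ("Q2", "Vancouver"): 12,
    ("Q2", "New York"): 6,
}


def roll_up(city_sales, hierarchy):
    """Summarize city-level cells into coarser country-level cells."""
    totals = defaultdict(int)
    for (quarter, city), units in city_sales.items():
        totals[(quarter, hierarchy[city])] += units
    return dict(totals)


def drill_down(city_sales, hierarchy, quarter, country):
    """Return the city-level cells behind one country-level cell."""
    return {
        city: units
        for (q, city), units in city_sales.items()
        if q == quarter and hierarchy[city] == country
    }


country_sales = roll_up(CITY_SALES, CITY_TO_COUNTRY)
q1_canada = drill_down(CITY_SALES, CITY_TO_COUNTRY, "Q1", "Canada")
```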

Task-relevant Data: Data Selection
- The process of retrieving the data relevant to the analysis task from the database
- Data can be specified by condition-based data filtering, or by slicing, pivoting, or dicing a data cube

Data Selection: Data Filtering
- Selective presentation or deliberate manipulation of information to make it more acceptable or favorable to the mining model
- Reduces the noise and errors in raw data
- DSP (digital signal processing): low-pass, high-pass, band-pass, notch, and comb filters; cut-off frequency
- DIP (digital image processing): convolution, Gaussian, bilateral, adaptive, and Coye filters
- Database: various SQL filters

Data Selection: Data Filtering (cont.)
- Grafil (Graph Similarity Filtering) was developed to filter graphs efficiently in large-scale graph databases

Data Selection: Data Slicing
- Selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions

Data Selection: Data Pivoting
- Aggregating over all dimensions except two
- Results in a two-dimensional cross tabulation, reducing the number of dimensions

Data Selection: Dicing
- Selecting a subset of cells by specifying a range of attribute values
- Equivalent to defining a sub-array from the complete array
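The slicing, pivoting, and dicing operations described on these slides can be sketched on a small dictionary-based cube. The cube contents, the dimension names in `DIMS`, and the three helper functions are hypothetical and only meant to show how the selections differ.

```python
from collections import defaultdict

# Hypothetical cube: (quarter, item, city) -> units sold.
DIMS = ("quarter", "item", "city")
CUBE = {
    ("Q1", "computer", "Vancouver"): 13,
    ("Q1", "phone", "Vancouver"): 5,
    ("Q1", "computer", "Toronto"): 7,
    ("Q2", "computer", "Vancouver"): 12,
    ("Q2", "phone", "Toronto"): 4,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, allowed):
    """Dice: keep cells whose values fall in the accepted set per dimension."""
    return {
        key: v
        for key, v in cube.items()
        if all(key[DIMS.index(d)] in values for d, values in allowed.items())
    }


def pivot(cube, row_dim, col_dim):
    """Pivot: aggregate over every dimension except the two named ones."""
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    table = defaultdict(int)
    for key, v in cube.items():
        table[(key[r], key[c])] += v
    return dict(table)


q1_slice = slice_cube(CUBE, "quarter", "Q1")
sub_cube = dice_cube(CUBE, {"item": {"computer"}, "city": {"Vancouver"}})
cross_tab = pivot(CUBE, "item", "city")
```

A slice keeps the remaining dimensions intact, a dice returns a smaller cube of the same shape, and a pivot collapses the cube into a two-dimensional cross tabulation.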

Curse of Dimensionality
- The dimensionality of a data set is the number of attributes that the objects in the data set possess
- High-dimensional data is difficult to analyze and visualize
- Data becomes increasingly sparse in the space that it occupies
- Clustering high-dimensional data is challenging
- Not all dimensions may be relevant
- Increases computational complexity
- Requires more processing power and time
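The sparsity point can be made concrete with a little arithmetic. Assuming each attribute is split into a fixed number of bins (an assumption made only for this illustration), the number of cells grows exponentially with the number of dimensions, so a fixed number of points can occupy only a shrinking fraction of them.

```python
def max_occupied_fraction(n_points, bins_per_dim, n_dims):
    """Upper bound on the fraction of grid cells that can contain a point.

    The grid has bins_per_dim ** n_dims cells, and n_points can fill at
    most n_points of them.
    """
    total_cells = bins_per_dim ** n_dims
    return min(1.0, n_points / total_cells)


# 1,000 points with 10 bins per attribute, for a growing number of attributes.
for dims in (2, 3, 6, 10):
    print(dims, max_occupied_fraction(1000, 10, dims))
```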

Task-relevant Data: Data Grouping
- Clustering is the process of grouping the data into classes or clusters
- Objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters
- Can also be used for outlier detection

Data Grouping: Clustering
Typical requirements of clustering in data mining:
- Scalability
- Ability to deal with different types of attributes/data types
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Incremental clustering and insensitivity to the order of input records
- High dimensionality
- Constraint-based clustering
- Interpretability and usability

Data Grouping: Clustering Methods
- Partitioning methods:
  - k-Means method
  - k-Medoids method
  - CLARANS (for large databases)
- Hierarchical methods:
  - Agglomerative and divisive hierarchical clustering
  - BIRCH
  - ROCK
  - Chameleon
- Density-based methods:
  - DBSCAN
  - OPTICS
  - DENCLUE
- Grid-based methods:
  - STING
  - WaveCluster

Data Grouping: Clustering Methods (cont.)
- Model-based methods:
  - Expectation-Maximization
  - Conceptual clustering
  - Neural network approach
- Clustering high-dimensional data:
  - CLIQUE
  - PROCLUS
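Of the methods listed, k-means is the simplest to sketch. The version below is a bare-bones illustration, not a production implementation: the points and the starting centroids are made up, the starting centroids are passed in explicitly instead of being chosen at random, and an empty cluster simply keeps its previous centroid.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, clusters = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```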

The Second Primitive of Data Mining: Knowledge Types
- It is important to specify the kind of knowledge to be mined, as this determines the data mining function to be performed
- The user can be more specific and provide pattern templates (metarules or metaqueries) that all discovered patterns must match
Fig. 1: Task-relevant data for specifying a data mining task

Knowledge Types: Data Characterization
- A summary of the general characteristics or features of a target class of data
- Summarizes data by replacing relatively low-level values (such as numeric ages) with higher-level concepts (young, middle-aged, and senior)
- Several methods for effective data characterization:
  - Statistical measures
  - Attribute-oriented induction (AOI)
- Output can be presented in pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables

Data Characterization: Statistical Measures
- Central tendency of data: mean, weighted mean, median, mode
- Dispersion of data: range, quartiles, variance, standard deviation
- Graphical representations: histograms, boxplots, quantile plots, scatter plots, scatter-plot matrices
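The central-tendency and dispersion measures listed above are available in Python's standard `statistics` module. The sketch below applies them to a made-up list of salaries (in thousands); the `weighted_mean` helper is a hypothetical convenience function written for this example.

```python
import statistics

# Hypothetical salaries in thousands; any numeric sample would do.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency
mean = statistics.mean(salaries)
median = statistics.median(salaries)
modes = statistics.multimode(salaries)  # every most-frequent value


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Dispersion
value_range = max(salaries) - min(salaries)
q1, q2, q3 = statistics.quantiles(salaries, n=4)  # quartile cut points
variance = statistics.pvariance(salaries)         # population variance
std_dev = statistics.pstdev(salaries)             # population std deviation

print(mean, median, modes, value_range)
```

The sample has two modes, which is why `multimode` is used here instead of `mode`.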

Data Characterization: AOI
- First collects the task-relevant data using a database query
- Then performs generalization based on the number of distinct values of each attribute in the relevant set of data
- Generalization is performed through either attribute removal or attribute generalization
- Aggregation is performed by merging identical generalized tuples and accumulating their respective counts
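The AOI steps can be sketched on a toy relation. Everything in the example is hypothetical: the student records, the major hierarchy, and the age cut-off. The `name` attribute is removed because it has many distinct values and no higher-level concept, `major` and `age` are generalized, and identical generalized tuples are merged with a count.

```python
from collections import Counter

# Hypothetical task-relevant data, as if returned by a database query.
STUDENTS = [
    {"name": "Ann", "major": "CS", "age": 19},
    {"name": "Bob", "major": "Physics", "age": 22},
    {"name": "Cid", "major": "CS", "age": 20},
    {"name": "Dee", "major": "Math", "age": 27},
    {"name": "Eve", "major": "EE", "age": 21},
]

# Hypothetical concept hierarchy for the major attribute.
MAJOR_HIERARCHY = {
    "CS": "Engineering",
    "EE": "Engineering",
    "Physics": "Science",
    "Math": "Science",
}


def generalize_age(age):
    """Attribute generalization: replace a numeric age with a range label."""
    return "under 25" if age < 25 else "25 and over"


def attribute_oriented_induction(rows):
    """Generalize each tuple, then merge identical tuples with counts."""
    generalized = []
    for row in rows:
        # Attribute removal: "name" is dropped from the generalized tuple.
        generalized.append(
            (MAJOR_HIERARCHY[row["major"]], generalize_age(row["age"]))
        )
    return Counter(generalized)


summary = attribute_oriented_induction(STUDENTS)
for generalized_tuple, count in summary.items():
    print(generalized_tuple, count)
```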

Knowledge Types: Data Discrimination
- A comparison of the general features of target class data objects with those of a set of contrasting classes
- The target and contrasting classes can be specified by the user
- They must be comparable, i.e., share similar dimensions and attributes
Data discrimination procedure:
1. Data collection: query processing
2. Dimension relevance analysis: select only the highly relevant dimensions for further analysis
3. Synchronous generalization: results in a prime target class relation
4. Presentation of the derived comparison: tables, graphs, and rules

Knowledge Types: Data Discrimination (cont.)
Example: Compare the general properties of the graduate and undergraduate students at Big University, given the attributes name, gender, major, birth_place, birth_date, residence, phone#, and gpa. This data mining task can be expressed in DMQL as follows:

    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student

Knowledge Types: Associations and Correlations
- Frequent patterns are patterns that occur frequently in data
- A frequent itemset is a set of items that frequently appear together in a transactional data set
- Mining frequent patterns leads to the discovery of interesting associations and correlations within data
Example rules:

    buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
    age(X, "20...29") ^ income(X, "20K...29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
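The support and confidence figures attached to such rules follow the standard definitions: support is the fraction of transactions containing every item in the rule, and confidence is the support of the whole rule divided by the support of its left-hand side. A minimal sketch, using a made-up set of five transactions:

```python
# Hypothetical transactions; each is the set of items bought together.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


rule_support = support({"computer", "software"}, TRANSACTIONS)
rule_confidence = confidence({"computer"}, {"software"}, TRANSACTIONS)
print(rule_support, rule_confidence)
```

In this toy data, two of the five transactions contain both items, and half of the transactions with a computer also contain software.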

Knowledge Types: Associations and Correlations (cont.)
Market Basket Analysis:
Fig. 1: Task-relevant data for specifying a data mining task

Knowledge Types: Classification
- The process of finding a model (or function) that describes and distinguishes data classes or concepts

Knowledge Types: Classification (cont.)

Knowledge Types: Classification Methods
- Classification by decision tree induction: ID3, C4.5, CART
- Bayesian classification
- Rule-based classification
- Classification by back-propagation
- Support vector machines
- Lazy learners (or learning from your neighbors)
- Genetic algorithms
- Ensemble methods: bagging & boosting
- Fuzzy set approaches
- Rough set approach
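As one concrete instance from this list, a lazy learner such as k-nearest neighbors builds no model in advance and classifies a new object by a vote among its closest training examples. The sketch below assumes a made-up training set of (age, income in thousands) pairs labeled with whether the customer bought a computer.

```python
from collections import Counter

# Hypothetical training data: ((age, income_in_thousands), buys_computer)
TRAIN = [
    ((25, 30), "no"),
    ((30, 40), "no"),
    ((45, 80), "yes"),
    ((50, 95), "yes"),
    ((35, 60), "yes"),
    ((22, 20), "no"),
]


def knn_predict(train, query, k=3):
    """Label query by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda example: sum((a - b) ** 2
                                for a, b in zip(example[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (48, 90)))
print(knn_predict(TRAIN, (24, 25)))
```

The attributes are used unscaled here for brevity; in practice they would normally be normalized so that no single attribute dominates the distance.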

Knowledge Types: Prediction Methods
- The process of finding a value or range of an attribute for a given condition from the training dataset
- Linear regression
- Nonlinear regression
- Log-linear models
- Decision tree induction
- Ensemble methods: bagging & boosting
- Forecasting
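Simple linear regression, the first method on this list, fits a line y = slope * x + intercept by least squares and then uses it to predict unseen values. The sketch below uses the closed-form solution on made-up data points.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted value of the dependent attribute at x."""
    return slope * x + intercept


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(predict(6, slope, intercept))
```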

The Third Primitive of Data Mining: Background Knowledge
- Useful to guide the knowledge discovery process and to evaluate patterns

Background Knowledge: Concept Hierarchies
- A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
- Allows data to be mined at multiple levels of abstraction
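A concept hierarchy can be represented as a child-to-parent mapping, so that generalizing a value means following parent links upward. The location hierarchy and the `generalize` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical location hierarchy: city -> province/state -> country.
LOCATION_PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}


def generalize(value, levels_up, parent=LOCATION_PARENT):
    """Climb levels_up steps in the hierarchy, stopping at the top."""
    for _ in range(levels_up):
        if value not in parent:
            break
        value = parent[value]
    return value


print(generalize("Vancouver", 1))
print(generalize("Vancouver", 2))
```

Mining at a higher level of abstraction then amounts to replacing each raw value with its generalized value before counting or aggregating.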

Interestingness Measures and Thresholds for Pattern Evaluation
- May be used to guide the mining process or, after discovery, to evaluate the discovered patterns
- Different kinds of knowledge may have different interestingness measures
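For association rules, a common way to apply such thresholds is to keep only rules whose support and confidence both reach user-chosen minimums. The rule list and threshold values below are made up for illustration.

```python
# Hypothetical discovered rules with their measured support and confidence.
RULES = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "age 20...29 and income 20K...29K => CD player",
     "support": 0.02, "confidence": 0.60},
    {"rule": "printer => paper", "support": 0.005, "confidence": 0.90},
]


def interesting(rules, min_support, min_confidence):
    """Keep rules that meet both user-specified thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]


kept = interesting(RULES, min_support=0.01, min_confidence=0.55)
for r in kept:
    print(r["rule"])
```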

Visualization of Discovered Patterns
- Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms
- Knowledge should be easily understood and directly usable by humans
- This is especially crucial if the data mining system is to be interactive

Data Mining Languages

Data Mining Language: DMQL
DMQL (Data Mining Query Language):
- Based on, and similar to, the Structured Query Language (SQL)
- Can work with both databases and data warehouses
- Can easily be integrated with the relational query language
Example:

    use database AllElectronics_db
    use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
    mine classification as promising_customers
    in relevance to C.age, C.income, I.type, I.place_made, T.branch
    from customer C, item I, transaction T
    where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID
        and C.income >= 40,000 and I.price >= 100
    group by T.cust_ID
    having sum(I.price) >= 1,000
    display as rules

Data Mining Language: OLE DB
Microsoft's OLE DB (Object Linking and Embedding, Database) for Data Mining:
- A major step toward the standardization of data mining language primitives; it aims to become the industry standard
- Adopts many concepts from relational database systems and applies them to the data mining field, providing a standard programming API
- Designed to allow data mining client applications (or data mining consumers) to consume data mining services from various data mining software packages
- Has DMX (Data Mining eXtensions), which is SQL-like, at its core
OLE DB for DM describes an abstraction of the data mining process:
1. Model creation
2. Model training
3. Model prediction and browsing

Data Mining Language: OLE DB (cont.)

Data Mining Language: OLE DB (cont.)
Example:

    create mining model prediction
    (
        customer_ID long key,
        gender text discrete,
        age long discretized(),
        income long continuous,
        profession text discrete,
    )
    using Microsoft_Decision_Trees

Data Mining Systems

Data Mining System Architecture

Types of Data Mining Architectures
- No-coupling data mining:
  - The data mining system does not use any functionality of a database or data warehouse
  - Retrieves data from a particular data source
  - Does not take advantage of any database features
  - Considered a poor architecture, but used for simple data mining applications
- Loose coupling data mining:
  - The system may use some of the functions of a database or data warehouse system
  - Fetches the data from the data repository managed by that system
  - Stores the mining result in a file or in a designated place in a database or data warehouse
  - Does not provide high scalability or high performance

Types of Data Mining Architectures (cont.)
- Semi-tight coupling data mining:
  - The mining system is linked with a database or data warehouse system
  - Uses several features of data warehouse systems
  - Applications include sorting, indexing, and aggregation
  - Efficient implementations of a few data mining primitives can be provided
- Tight coupling data mining:
  - The mining system is fully integrated into a database or data warehouse system
  - The mining subsystem is treated as one functional component of an information system
  - Provides system scalability, high performance, and integrated information
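The practical difference between the coupling levels is where the work happens. The sketch below uses Python's built-in `sqlite3` module with a made-up `sales` table: the loosely coupled version only fetches rows and aggregates them in application code, while the second version pushes the aggregation into the database, in the spirit of semi-tight coupling. It is an illustration of the idea, not a description of any particular data mining product.

```python
import sqlite3
from collections import defaultdict


def make_db():
    """Create an in-memory database with a small hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, units INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("Vancouver", 13), ("Toronto", 7), ("Vancouver", 12)],
    )
    return conn


def totals_loose(conn):
    """Loose coupling: fetch raw rows, aggregate in the mining program."""
    totals = defaultdict(int)
    for city, units in conn.execute("SELECT city, units FROM sales"):
        totals[city] += units
    return dict(totals)


def totals_pushed_down(conn):
    """Semi-tight style: let the database perform the aggregation."""
    rows = conn.execute(
        "SELECT city, SUM(units) FROM sales GROUP BY city"
    ).fetchall()
    return dict(rows)


conn = make_db()
loose_result = totals_loose(conn)
pushed_result = totals_pushed_down(conn)
conn.close()
```

Both functions return the same totals; the second moves less data out of the database, which is where the scalability gain of tighter coupling comes from.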

THANK YOU
ANY QUESTIONS?

References
[1] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition.
[2] Tan, Steinbach, Kumar. Introduction to Data Mining.
[3] https://data-flair.training/blogs/data-mining-architecture/
[4] https://www.tutorialspoint.com/data_mining/dm_systems.htm