Data Mining Primitives, Languages & Systems, presented as an assignment/class lecture.
Added: Apr 05, 2019
Slides: 45 pages
Advanced Data Mining
Lec-4: Data Mining Primitives, Languages & Systems [Class Presentation]
Presented by Niloy Sikder, ID: MSc 190221
CSE Discipline, Khulna University, Khulna
Mar 6, 2019 CSE, KU 1
Presentation Outline
- What are the Primitives of Data Mining?
- Task-relevant data
  - Data Warehouse
  - Data Cube: Drill-down & Roll-up
  - Data Selection: Data Filtering, Data Slicing, Data Pivoting, Dicing
  - Data Grouping: Clustering, Clustering Methods
- Knowledge type to be mined
  - Data Characterization: Statistical Measures, AOI
  - Data Discrimination
  - Associations and Correlations
  - Classification: Classification methods
  - Prediction
- Background knowledge
  - Concept Hierarchies
- System architectures of data mining
  - Data Mining System Architecture
  - Types of Data Mining Architectures
- Languages of data mining
  - DMQL
  - OLE DB
- Pattern interestingness measures
- Visualization of discovered patterns
Data Mining Primitives
What are the Primitives of Data Mining?
- The set of task-relevant data to be mined
- The kind of knowledge to be mined
- The background knowledge
- Interestingness measures and thresholds for pattern evaluation
- The expected representation for visualizing the discovered patterns
The First Primitive of Data Mining: Task-relevant Data
- Portions of the database or the set of data in which the user is interested
Fig. 1: Task-relevant data for specifying a data mining task
Task-relevant Data: Data Warehouse
- A warehouse is a repository of information, usually from multiple sources
- Usually resides at a single site
- Constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
Fig. 2: Typical framework of a data warehouse for AllElectronics.
Task-relevant Data: Data Cube
- A multidimensional data structure inside a data warehouse
- Each dimension corresponds to an attribute
- Each cell stores the value of some aggregate measure
Fig. 3: Summarized data for AllElectronics.
Data Cube: Drill-down & Roll-up
- Present the data at different levels of abstraction
- Allow the user to view the data at differing degrees of summarization
Fig. 3: Summarized data resulting from drill-down and roll-up operations on the cube.
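
A minimal Python sketch of roll-up and drill-down, assuming a tiny hypothetical fact table (the cities, quarters, and unit counts below are invented for illustration and are not taken from the AllElectronics figures). Grouping by fewer or coarser dimensions rolls the data up; grouping by more or finer dimensions drills down.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, city, country, units_sold).
SALES = [
    (2018, "Q1", "Vancouver", "Canada", 605),
    (2018, "Q2", "Vancouver", "Canada", 680),
    (2018, "Q1", "Toronto", "Canada", 1087),
    (2018, "Q2", "Toronto", "Canada", 968),
    (2018, "Q1", "Chicago", "USA", 854),
    (2018, "Q2", "Chicago", "USA", 943),
]

FIELDS = ("year", "quarter", "city", "country")


def aggregate(facts, dims):
    """Sum units_sold grouped by the chosen dimension names."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# Drilled-down view: one cell per (quarter, city).
by_quarter_city = aggregate(SALES, ["quarter", "city"])

# Rolled-up view: cities climb to countries and the quarter dimension is dropped.
by_country = aggregate(SALES, ["country"])
```

Here `by_country` holds one total per country, while `by_quarter_city` keeps the finer-grained cells from which those totals were summed.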
Task-relevant Data: Data Selection
- The process of retrieving the data relevant to the analysis task from the database
- Data can be specified by condition-based data filtering, or by slicing, pivoting, or dicing a data cube

Data Selection: Data Filtering
- Selective presentation or deliberate manipulation of information to make it more acceptable or favorable to the mining model
- Reduces noise and errors in the raw data
- DSP (digital signal processing): low-pass, high-pass, band-pass, notch, comb, cut-off frequency
- DIP (digital image processing): convolution, Gaussian, bilateral, adaptive, Coye
- Database: various SQL filters
Data Selection: Data Filtering (cont.)
- Grafil (Graph Similarity Filtering) was developed to filter graphs efficiently in large-scale graph databases
Data Selection: Data Slicing
- Selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions
Data Selection: Data Pivoting
- Aggregating over all dimensions except two
- Results in a two-dimensional cross tabulation, which reduces the dimensionality
Data Selection: Dicing
- Selecting a subset of cells by specifying a range of attribute values
- Equivalent to defining a sub-array from the complete array
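
The slicing, pivoting, and dicing operations described above can be sketched in Python on a small cube. The cube contents and the function names (`slice_cube`, `dice_cube`, `pivot`) are hypothetical and chosen only for illustration. The cube is stored as a dictionary keyed by (quarter, city, item).

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (quarter, city, item) -> units sold.
CUBE = {
    ("Q1", "Toronto", "phone"): 10,
    ("Q1", "Toronto", "computer"): 20,
    ("Q1", "Chicago", "phone"): 30,
    ("Q2", "Toronto", "phone"): 40,
    ("Q2", "Chicago", "computer"): 50,
    ("Q3", "Chicago", "phone"): 60,
}


def slice_cube(cube, quarter):
    """Slice: fix one dimension (quarter) to a single value."""
    return {k: v for k, v in cube.items() if k[0] == quarter}


def dice_cube(cube, quarters, cities):
    """Dice: keep the sub-array whose coordinates fall in the given value sets."""
    return {k: v for k, v in cube.items() if k[0] in quarters and k[1] in cities}


def pivot(cube, row_dim, col_dim):
    """Pivot: aggregate over every dimension except two (a cross tabulation)."""
    table = defaultdict(int)
    for key, value in cube.items():
        table[(key[row_dim], key[col_dim])] += value
    return dict(table)


q1_slice = slice_cube(CUBE, "Q1")
toronto_h1 = dice_cube(CUBE, {"Q1", "Q2"}, {"Toronto"})
city_by_item = pivot(CUBE, 1, 2)  # rows: city, columns: item
```

In `city_by_item` the quarter dimension has been summed away, leaving a two-dimensional city-by-item table.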
Curse of Dimensionality
- The dimensionality of a data set is the number of attributes that the objects in the data set possess
- High-dimensional data is difficult to analyze and visualize
- Data becomes increasingly sparse in the space that it occupies
- Clustering high-dimensional data is challenging
- Not all of the dimensions may be relevant
- Increases computational complexity
- Requires more processing power and time
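
A small arithmetic sketch of the sparsity point, under an assumed setup: each attribute is split into a fixed number of bins, so the number of grid cells grows exponentially with the number of dimensions while the number of data points stays the same. The function name and the numbers are illustrative assumptions.

```python
def max_occupied_fraction(n_points, bins_per_dim, n_dims):
    """Upper bound on the fraction of grid cells that can contain a point."""
    n_cells = bins_per_dim ** n_dims
    return min(1.0, n_points / n_cells)


# 1000 points, 10 bins per attribute, increasing dimensionality.
fractions = {d: max_occupied_fraction(1000, 10, d) for d in (1, 2, 3, 6)}
```

With three attributes the 1000 points could still fill every cell; with six attributes at most 0.1% of the cells can be occupied.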
Task-relevant Data: Data Grouping
- Clustering is the process of grouping the data into classes or clusters
- Objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters
- Can also be used for outlier detection
Data Grouping: Clustering
Typical requirements of clustering in data mining:
- Scalability
- Ability to deal with different types of attributes/data types
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Incremental clustering and insensitivity to the order of input records
- High dimensionality
- Constraint-based clustering
- Interpretability and usability
Data Grouping: Clustering Methods
- Partitioning methods: k-Means, k-Medoids, CLARANS (for large databases)
- Hierarchical methods: Agglomerative and Divisive Hierarchical Clustering, BIRCH, ROCK, Chameleon
- Density-based methods: DBSCAN, OPTICS, DENCLUE
- Grid-based methods: STING, WaveCluster
Data Grouping: Clustering Methods (cont.)
- Model-based methods: Expectation-Maximization, Conceptual Clustering, Neural Network Approach
- Clustering high-dimensional data: CLIQUE, PROCLUS
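
As one concrete instance of the partitioning methods listed above, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The data points and the starting centroids are assumptions picked so the run is deterministic; practical implementations choose initial centroids more carefully and work in many dimensions.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data with caller-supplied initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


POINTS = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, final_clusters = k_means_1d(POINTS, centroids=[1.0, 2.0])
```

Starting from the poor initial guess of 1.0 and 2.0, the centroids move to the means of the two natural groups within a few iterations.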
The Second Primitive of Data Mining: Knowledge Types
- It is important to specify the kind of knowledge to be mined, as this determines the data mining function to be performed
- The user can be more specific and provide pattern templates (metarules or metaqueries) that all discovered patterns must match
Fig. 1: Task-relevant data for specifying a data mining task
Knowledge Types: Data Characterization
- A summary of the general characteristics or features of a target class of data
- Summarizes data by replacing relatively low-level values (numeric) with higher-level concepts (young, middle-aged, and senior)
- Several methods for effective data characterization:
  - Statistical measures
  - Attribute-oriented induction (AOI)
- Output can be presented in pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables
Data Characterization: Statistical Measures
- Central tendency of data: mean, weighted mean, median, mode
- Dispersion of data: range, quartiles, variance, standard deviation
- Graphical representations: histograms, boxplots, quantile plots, quantile-quantile (q-q) plots, scatter plots, scatter-plot matrices
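
A short sketch of the central-tendency and dispersion measures using Python's standard `statistics` module. The sample values are made up for illustration. Population variance and standard deviation are used here, and `statistics.quantiles` (Python 3.8+) is called with its default method, which is only one of several quartile conventions.

```python
import statistics

VALUES = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical attribute values

# Central tendency
mean = statistics.mean(VALUES)
median = statistics.median(VALUES)
mode = statistics.mode(VALUES)

# Dispersion
value_range = max(VALUES) - min(VALUES)
q1, q2, q3 = statistics.quantiles(VALUES, n=4)  # quartiles, default method
variance = statistics.pvariance(VALUES)  # population variance
std_dev = statistics.pstdev(VALUES)      # population standard deviation
```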
Data Characterization: AOI
- First collects the task-relevant data using a database query
- Then performs generalization based on the number of distinct values of each attribute in the relevant set of data
- Generalization is performed through either attribute removal or attribute generalization
- Aggregation is performed by merging identical generalized tuples and accumulating their respective counts
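
The AOI steps above can be sketched as follows. The student tuples, the city-to-country hierarchy, and the GPA cut-offs are all hypothetical; a real system would derive them from the database and from user-supplied concept hierarchies and thresholds.

```python
from collections import Counter

# Hypothetical task-relevant tuples: (name, major, birth_city, gpa).
STUDENTS = [
    ("Jim", "CS", "Vancouver", 3.67),
    ("Scott", "CS", "Montreal", 3.70),
    ("Laura", "Physics", "Seattle", 3.83),
    ("Ann", "Physics", "Chicago", 3.20),
]

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Montreal": "Canada",
    "Seattle": "USA",
    "Chicago": "USA",
}


def gpa_level(gpa):
    """Attribute generalization for a numeric attribute (assumed cut-offs)."""
    if gpa >= 3.75:
        return "excellent"
    if gpa >= 3.5:
        return "very_good"
    return "good"


def generalize(student):
    # Attribute removal: name has many distinct values and no hierarchy, so drop it.
    _name, major, city, gpa = student
    # Attribute generalization: climb each remaining attribute's hierarchy.
    return (major, CITY_TO_COUNTRY[city], gpa_level(gpa))


# Aggregation: merge identical generalized tuples and accumulate their counts.
generalized_relation = Counter(generalize(s) for s in STUDENTS)
```

The four original tuples collapse into three generalized tuples, one of which carries a count of 2.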
Knowledge Types: Data Discrimination
- A comparison of the general features of target class data objects with those of a set of contrasting classes
- The target and contrasting classes can be specified by the user
- They must be comparable, i.e. share similar dimensions and attributes
- Data discrimination procedure:
  - Data collection: query processing
  - Dimension relevance analysis: select only the highly relevant dimensions for further analysis
  - Synchronous generalization: results in a prime target class relation
  - Presentation of the derived comparison: tables, graphs, and rules
Knowledge Types: Data Discrimination (cont.)
Example: Compare the general properties of the graduate and undergraduate students at Big University, given the attributes name, gender, major, birth_place, birth_date, residence, phone#, and gpa. This data mining task can be expressed in DMQL as follows:

    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student
Knowledge Types: Associations and Correlations
- Frequent patterns are patterns that occur frequently in data
- Mining frequent patterns leads to the discovery of interesting associations and correlations within data
- A frequent itemset is a set of items that frequently appear together in a transactional data set
- Examples:
  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
  age(X, "20...29") ^ income(X, "20K...29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
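
To make the bracketed support and confidence figures concrete, here is a minimal sketch that computes both for a rule A => B over a toy transaction set. The transactions are invented, so the resulting percentages differ from the 1% and 50% quoted in the slide. Support is the fraction of transactions containing both A and B; confidence is that support divided by the support of A.

```python
# Hypothetical transactional data set: one set of items per transaction.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, TRANSACTIONS)
rule_confidence = confidence({"computer"}, {"software"}, TRANSACTIONS)
```

In this toy data, two of the five transactions contain both items, and two of the four computer purchases also include software.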
Knowledge Types: Associations and Correlations (cont.)
Market Basket Analysis:
Fig. 1: Task-relevant data for specifying a data mining task
Knowledge Types: Classification
- The process of finding a model (or function) that describes and distinguishes data classes or concepts
Knowledge Types: Classification (cont.)
Knowledge Types: Classification methods
- Classification by Decision Tree Induction: ID3, C4.5, CART
- Bayesian Classification
- Rule-Based Classification
- Classification by Back-propagation
- Support Vector Machines
- Lazy Learners (or Learning from Your Neighbors)
- Genetic Algorithms
- Ensemble Methods: Bagging & Boosting
- Fuzzy Set Approaches
- Rough Set Approach
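
As an illustration of one entry in the list above, the sketch below implements a k-nearest-neighbor classifier, a typical lazy learner: it builds no model in advance and simply votes among the closest training tuples at query time. The training tuples, the attribute choice (age, income), and the class labels are hypothetical. `math.dist` requires Python 3.8+.

```python
import math
from collections import Counter

# Hypothetical labelled tuples: ((age, income_in_thousands), buys_computer).
TRAINING = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((28, 32), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
]


def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k nearest training tuples."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


older_high_income = knn_classify((47, 82), TRAINING)
younger_low_income = knn_classify((27, 31), TRAINING)
```

In practice the attributes would usually be normalized first so that one attribute with a large numeric range does not dominate the distance.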
Knowledge Types: Prediction Methods
- Prediction is the process of finding a value or range of an attribute for a given condition from the training dataset
- Linear Regression
- Nonlinear Regression
- Log-linear models
- Decision tree induction
- Ensemble Methods: Bagging & Boosting
- Forecasting
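
A minimal sketch of the first method in the list, simple linear regression fitted by ordinary least squares. The experience and salary figures are hypothetical and deliberately lie on an exact line so the fitted coefficients are easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


YEARS_EXPERIENCE = [1, 2, 3, 4, 5]   # hypothetical predictor attribute
SALARY_K = [35, 40, 45, 50, 55]      # hypothetical response attribute (thousands)

slope, intercept = fit_line(YEARS_EXPERIENCE, SALARY_K)
predicted_salary = intercept + slope * 6  # predict for an unseen value
```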
The Third Primitive of Data Mining: Background Knowledge
- Useful to guide the knowledge discovery process and to evaluate patterns
Background Knowledge: Concept Hierarchies
- A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
- Allows data to be mined at multiple levels of abstraction
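
A concept hierarchy can be sketched as a child-to-parent mapping that is climbed one level at a time. The location hierarchy below (city, province, country, all) and the helper names are assumptions made for illustration.

```python
# Hypothetical location hierarchy stored as child -> parent.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Canada": "all",
}


def generalize(concept, levels=1):
    """Climb the hierarchy by the given number of levels (stops at the top)."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)
    return concept


def ancestors(concept):
    """All higher-level concepts of a value, from most to least specific."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


province = generalize("Vancouver")
country = generalize("Vancouver", levels=2)
toronto_chain = ancestors("Toronto")
```

Mining at the province level or the country level then amounts to replacing each city with the result of `generalize` before counting.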
Interestingness Measures and Thresholds for Pattern Evaluation
- May be used to guide the mining process or, after discovery, to evaluate the discovered patterns
- Different kinds of knowledge may have different interestingness measures
Visualization of Discovered Patterns
- Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms
- Knowledge should be easily understood and directly usable by humans
- This is especially crucial if the data mining system is to be interactive
Data Mining Languages
Data Mining Language: DMQL
DMQL (Data Mining Query Language):
- Based on and similar to the Structured Query Language (SQL)
- Can work with databases as well as data warehouses
- Can easily be integrated with the relational query language
Example:

    use database AllElectronics_db
    use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
    mine classification as promising_customers
    in relevance to C.age, C.income, I.type, I.place_made, T.branch
    from customer C, item I, transaction T
    where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID
        and C.income >= 40,000 and I.price >= 100
    group by T.cust_ID
    having sum(I.price) >= 1,000
    display as rules
Data Mining Language: OLE DB
Microsoft's OLE DB (Object Linking and Embedding, Database) for Data Mining:
- A major step toward the standardization of data mining language primitives; it aims to become the industry standard
- Adopts many concepts from relational database systems and applies them to the data mining field, providing a standard programming API
- Designed to allow data mining client applications (or data mining consumers) to consume data mining services from various data mining software packages
- Has DMX (Data Mining eXtensions), an SQL-like language, at its core
- OLE DB for DM describes an abstraction of the data mining process:
  - Model creation
  - Model training
  - Model prediction and browsing
Data Mining Language: OLE DB (cont.)
Data Mining Language: OLE DB (cont.)
Example:

    create mining model prediction
    (
        customer_ID long key,
        gender text discrete,
        age long discretized(),
        income long continuous,
        profession text discrete,
    )
    using Microsoft_Decision_Trees
Data Mining Systems
Data Mining System Architecture
Types of Data Mining Architectures
No-coupling Data Mining:
- The data mining system does not use any functionality of a database or data warehouse
- Retrieves data from particular data sources, such as files
- Does not take any advantage of a database
- Considered a poor architecture, but used for simple data mining applications
Loose Coupling Data Mining:
- The system may use some of the functions of a database and data warehouse system
- Fetches the data from the data repository managed by that system
- Stores the mining result either in a file or in a designated place in a database or data warehouse
- Does not provide high scalability or high performance
Types of Data Mining Architectures (cont.)
Semi-Tight Coupling Data Mining:
- The mining system is linked with a database or data warehouse system
- Uses several features of data warehouse systems, including sorting, indexing, and aggregation
- Efficient implementations of a few data mining primitives can be provided
Tight Coupling Data Mining:
- The mining system is fully integrated into a database or data warehouse system
- The mining subsystem is treated as one functional component of an information system
- Provides system scalability, high performance, and integrated information
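
The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module. This is only an analogy under assumed table names and data: in the loosely coupled path the application fetches raw rows and aggregates them itself, while in the semi-tight path the aggregation primitive is pushed into the database engine.

```python
import sqlite3

# Hypothetical in-memory database standing in for the data repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Toronto", 100), ("Toronto", 250), ("Chicago", 300)],
)

# Loose coupling: fetch raw rows, then aggregate inside the mining application.
totals_loose = {}
for city, amount in conn.execute("SELECT city, amount FROM sales"):
    totals_loose[city] = totals_loose.get(city, 0) + amount

# Semi-tight coupling: the aggregation runs inside the database system.
totals_semi_tight = dict(
    conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city")
)

# Either way, the result can be stored in a designated place in the database.
conn.execute("CREATE TABLE mining_result (city TEXT, total INTEGER)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", totals_loose.items())
stored = dict(conn.execute("SELECT city, total FROM mining_result"))
conn.close()
```

Both paths produce the same totals; the difference lies in where the work is done, which is what determines scalability on large data.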
Thank you. Any questions?