Introduction to data mining Data Preprocessing (Data cleaning, data integration, data reduction, concept hierarchies) Mining Association Rules (Frequent item-sets and Association rules) Clustering Algorithms (Partitioning methods, Hierarchical methods, Density based methods) Classification Web Mining \ Social Network Analysis Course Content
Text Introduction to Data Mining. By P.-N.Tan, M. Steinbach and V. Kumar. Data Mining: Concepts and Techniques. By Jiawei Han and Micheline Kamber. Selected Research Papers Supplementary Material Data Mining: Practical Machine Learning Tools and Techniques. By I.H.Witten and E. Frank, Morgan Kaufmann. Mining of Massive Data Sets. By Anand Rajaram, Jure Leskovec and Jeff Ullman Some textbooks are free to download Textbooks and Readings
The students should have good background in Database Systems Algorithms and data structures Programming Pre-Requisites
What is Data Mining? Knowledge discovery from data
Introduction Data is growing at a phenomenal rate Web data, e‐commerce purchases at department/grocery stores Bank/Credit Card transactions scientific simulations UNCOVER HIDDEN INFORMATION DATA MINING
Data contains value and knowledge Information “hidden” in the data is not readily evident Human analysts take weeks to discover useful information
What is Data Mining Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Exploration & analysis, by automatic or semi‐automatic means, of large quantities of data in order to discover meaningful patterns
What is Data Mining Given lots of data Discover patterns\models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern
Alternative names Data Mining Knowledge Mining Knowledge Discovery in Databases Data Archaeology Data Dredging Database Mining Knowledge Extraction Data Pattern Processing Information Harvesting
People you may know
An algorithm that could cause a lot of grief
Meaningfulness of Analytical results Risk involved in Data Mining is that an analyst can “discover” patterns that are meaningless Statisticians call it Bonferroni’s principle if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap .
Meaningfulness of Analytical results Suggested approach: Human-centered, query-based, focused mining How to measure ? Interestingness
Interes t ingness Objective: based on statistics and structures of patterns, e.g. support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g. unexpectedness, novelty, etc. easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm In t e r estingness measures
Data Mining and related Disciplines Data mining overlaps with : Databases (DB) : Large-scale data, simple queries Machine learning (ML) : Small data, Complex models CS Theory: (Randomized) Algorithms Different cultures: To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data Result is the query answer To a ML person, data-mining is the inference of models Result is the parameters of the model
Data Mining and related Disciplines Emphasis is on scalability of number of features and instances (big data) stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning automation for handling large, complex and heterogeneous data
Database vs Data Mining Database Find all credit applicants with last name of Smith. Identify customers who have purchased more than $10,000 in the last month. Find all customers who have purchased milk Data Mining Find all credit applicants who are poor credit risks. (classification) Identify customers with similar buying habits. (Clustering) Find all items which are frequently purchased with milk. (association rules)
Database Processing vs . Data Mining Processing Query Well defined SQL Query Poorly defined No precise query language Output Precise Subset of database Output Fuzzy Not a subset of database
What is Data Mining? Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,) What is not DM? Look up phone number in phone directory Query a Web search engine for information about “Amazon”
Applications Commercial applications Classification of debt inquiries Segmentation of customer groups Churn analysis Scientific applications Astronomy Medicine research Medical diagnostics
Applications Banking: loan/credit card approval: predict good customers based on old customers Customer relationship management: identify those who are likely to leave for a competitor Targeted marketing: identify likely responders to promotions Fraud detection: telecommunications, finance from an online stream of event identify fraudulent events
Applications Medicine: disease outcome, effectiveness of treatments analyze patient disease history: find relationship between diseases Molecular/Pharmaceutical: identify new drugs Scientific data analysis: identify new galaxies by searching for sub clusters
Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
D ata T ar g et Data Selecti o n K n o w l e dge Pre p r o cessed Data Patter n s Data Mining I n ter p ret a t ion / Evaluation Knowledge Discovery in Databases: Process Preprocessing Data mining: the core of knowledge discovery process. Cleaning and Integration
KDD Process Ex: Web Log Selection: Select log data (dates and locations) to use Preprocessing: Remove identifying URLs Remove error logs Transformation: Sessionize (sort and group) Data Mining: Identify and count patterns Construct data structure Interpretation/Evaluation: Identify and display frequently accessed sequences. Potential User Applications: Cache prediction Personalization
Data Mining Tasks Descriptive methods (Un-Supervised) Find human-interpretable patterns that describe the data Example: Clustering Predictive methods (Supervised) Use some variables to predict unknown or future values of other variables Example: Recommender systems
Data Mining Models and Tasks Descriptive data mining: Describe general properties Predictive data mining: Infer on available data
Classification Classification maps data into predefined groups or classes based on attribute values. (supervised classification) classify students based on final result. classify countries based on climate, or classify cars based on gas mileage Goal: unseen records should be assigned a class as accurately as possible.
Classification Example Tid Refund Marit a l Status T a x a ble Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Refund Marit a l Status Tax a ble Income Cheat No Single 75K ? Yes Married 50K ? No Married 150K ? Yes Divorced 90K ? No Single 40K ? No Married 80K ? 10 T est Set T rain i ng Set Model Learn Classifier
Classification Application Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise.This {buy, don’t buy} decision forms the class attribute . Collect various demographic, lifestyle, and company-interaction related information about all such customers Use this information as input attributes to learn a classifier model.
Clu s ter i ng Clustering groups similar data together into clusters based on attribute values. ( unsupervised classification) The set of data points in each cluster have set of attributes, and a similarity measure among them Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures .
Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized
Clustering Application: Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters .
Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Association Rule Discovery Frequent patterns (or frequent itemsets) What items are frequently purchased together in your Walmart? Produce dependency rules which will predict occurrence of an item based on occurrences of other items in data. TID Items 1 Bread, Coke, Milk 2 Cereal, Bread 3 Cereal, Coke, Diaper, Milk 4 Cereal, Bread, Diaper, Milk 5 Coke, Diaper, Milk Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Cereal}
Association Rule Discovery: Application Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips} Potato Chips as consequent => Can be used to determine what should be done to boost its sales. Bagels in the antecedent => C an be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Outlier Analysis /Anomaly Detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection
Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data