1. Introduction to Data Mining (12).pptx

Introduction to Data Mining Afzaal Hussain Email: [email protected]

Introduction to data mining Data Preprocessing (Data cleaning, data integration, data reduction, concept hierarchies) Mining Association Rules (Frequent item-sets and Association rules) Clustering Algorithms (Partitioning methods, Hierarchical methods, Density based methods) Classification Web Mining \ Social Network Analysis Course Content

Text Introduction to Data Mining. By P.-N.Tan, M. Steinbach and V. Kumar. Data Mining: Concepts and Techniques. By Jiawei Han and Micheline Kamber. Selected Research Papers Supplementary Material Data Mining: Practical Machine Learning Tools and Techniques. By I.H.Witten and E. Frank, Morgan Kaufmann. Mining of Massive Data Sets. By Anand Rajaram, Jure Leskovec and Jeff Ullman Some textbooks are free to download Textbooks and Readings

The students should have good background in Database Systems Algorithms and data structures Programming Pre-Requisites

What is Data Mining? Knowledge discovery from data

Introduction Data is growing at a phenomenal rate Web data, e‐commerce purchases at department/grocery stores Bank/Credit Card transactions scientific simulations UNCOVER HIDDEN INFORMATION DATA MINING

Data contains value and knowledge Information “hidden” in the data is not readily evident Human analysts take weeks to discover useful information

What is Data Mining Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Exploration & analysis, by automatic or semi‐automatic means, of large quantities of data in order to discover meaningful patterns

What is Data Mining Given lots of data Discover patterns\models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern

Alternative names Data Mining Knowledge Mining Knowledge Discovery in Databases Data Archaeology Data Dredging Database Mining Knowledge Extraction Data Pattern Processing Information Harvesting

People you may know

An algorithm that could cause a lot of grief

Meaningfulness of Analytical results Risk involved in Data Mining is that an analyst can “discover” patterns that are meaningless Statisticians call it Bonferroni’s principle if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap .

Meaningfulness of Analytical results Suggested approach: Human-centered, query-based, focused mining How to measure ? Interestingness

Interes t ingness Objective: based on statistics and structures of patterns, e.g. support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g. unexpectedness, novelty, etc. easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm In t e r estingness measures

Data Mining and related Disciplines Data mining overlaps with : Databases (DB) : Large-scale data, simple queries Machine learning (ML) : Small data, Complex models CS Theory: (Randomized) Algorithms Different cultures: To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data Result is the query answer To a ML person, data-mining is the inference of models Result is the parameters of the model

Data Mining and related Disciplines Emphasis is on scalability of number of features and instances (big data) stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning automation for handling large, complex and heterogeneous data

Database vs Data Mining Database Find all credit applicants with last name of Smith. Identify customers who have purchased more than $10,000 in the last month. Find all customers who have purchased milk Data Mining Find all credit applicants who are poor credit risks. (classification) Identify customers with similar buying habits. (Clustering) Find all items which are frequently purchased with milk. (association rules)

Database Processing vs . Data Mining Processing Query Well defined SQL Query Poorly defined No precise query language Output Precise Subset of database Output Fuzzy Not a subset of database

What is Data Mining? Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,) What is not DM? Look up phone number in phone directory Query a Web search engine for information about “Amazon”

Applications Commercial applications Classification of debt inquiries Segmentation of customer groups Churn analysis Scientific applications Astronomy Medicine research Medical diagnostics

Applications Banking: loan/credit card approval: predict good customers based on old customers Customer relationship management: identify those who are likely to leave for a competitor Targeted marketing: identify likely responders to promotions Fraud detection: telecommunications, finance from an online stream of event identify fraudulent events

Applications Medicine: disease outcome, effectiveness of treatments analyze patient disease history: find relationship between diseases Molecular/Pharmaceutical: identify new drugs Scientific data analysis: identify new galaxies by searching for sub clusters

Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

D ata T ar g et Data Selecti o n K n o w l e dge Pre p r o cessed Data Patter n s Data Mining I n ter p ret a t ion / Evaluation Knowledge Discovery in Databases: Process Preprocessing Data mining: the core of knowledge discovery process. Cleaning and Integration

KDD Process Ex: Web Log Selection: Select log data (dates and locations) to use Preprocessing: Remove identifying URLs Remove error logs Transformation: Sessionize (sort and group) Data Mining: Identify and count patterns Construct data structure Interpretation/Evaluation: Identify and display frequently accessed sequences. Potential User Applications: Cache prediction Personalization

Data Mining Tasks Descriptive methods (Un-Supervised) Find human-interpretable patterns that describe the data Example: Clustering Predictive methods (Supervised) Use some variables to predict unknown or future values of other variables Example: Recommender systems

Data Mining Models and Tasks Descriptive data mining: Describe general properties Predictive data mining: Infer on available data

Classification Classification maps data into predefined groups or classes based on attribute values. (supervised classification) classify students based on final result. classify countries based on climate, or classify cars based on gas mileage Goal: unseen records should be assigned a class as accurately as possible.

Classification Example Tid Refund Marit a l Status T a x a ble Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Refund Marit a l Status Tax a ble Income Cheat No Single 75K ? Yes Married 50K ? No Married 150K ? Yes Divorced 90K ? No Single 40K ? No Married 80K ? 10 T est Set T rain i ng Set Model Learn Classifier

Classification Typical methods Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based or pattern-based classification, logistic regression, … Typical applications: Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

Classification Application Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise.This {buy, don’t buy} decision forms the class attribute . Collect various demographic, lifestyle, and company-interaction related information about all such customers Use this information as input attributes to learn a classifier model.

Clu s ter i ng Clustering groups similar data together into clusters based on attribute values. ( unsupervised classification) The set of data points in each cluster have set of attributes, and a similarity measure among them Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures .

Illustrating Clustering Euclidean Distance Based Clustering in 3-D space. Intracluster distances are minimized Intercluster distances are maximized

Clustering Application: Market Segmentation: Goal: subdivide a market into distinct subsets of customers  where any subset may be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters .

Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Association Rule Discovery Frequent patterns (or frequent itemsets)  What items are frequently purchased together in your Walmart? Produce dependency rules which will predict occurrence of an item based on occurrences of other items in data. TID Items 1 Bread, Coke, Milk 2 Cereal, Bread 3 Cereal, Coke, Diaper, Milk 4 Cereal, Bread, Diaper, Milk 5 Coke, Diaper, Milk Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Cereal}

Association Rule Discovery: Application Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips} Potato Chips as consequent => Can be used to determine what should be done to boost its sales. Bagels in the antecedent => C an be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Outlier Analysis /Anomaly Detection Detect significant deviations from normal behavior Applications:  Credit Card Fraud Detection  Network Intrusion Detection

Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data

1. Introduction to Data Mining (12).pptx

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

1. Introduction to Data Mining (12).pptx

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx