September 27, 2024
Data Mining: Concepts and
Techniques 1
Chp-1: Introduction to
Data Mining
Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Kinds of patterns to be mined
Technologies used
Major issues in data mining
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
(1 petabyte = 1,000 terabytes)
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Evolution of Database Technology
1960s:
Data collection, database creation and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Evolution of database system technology
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Decision making
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection & Transformation → Data Mining → Pattern Evaluation]
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, from top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Why Data Mining?—Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis, cross
selling
Risk analysis and management
Forecasting, customer retention, quality control,
competitive analysis
Fraud detection and detection of unusual
patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web
mining
Stream data mining
Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
Where does the data come from? Credit card
transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations
between product sales, and predict based on such
associations
Customer profiling: what types of customers buy
what products (clustering or classification)
Ex. 1: Market Analysis and Management (contd.)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and
variation)
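As a small illustration of statistical summary information, the sketch below computes measures of central tendency and variation with Python's standard `statistics` module. The `spending` values are hypothetical and are not taken from the slides.

```python
import statistics

# Hypothetical yearly spending (in Rs.) for five customers; illustration only.
spending = [12000, 15000, 15000, 18000, 90000]

summary = {
    "mean": statistics.mean(spending),       # central tendency, pulled up by the large value
    "median": statistics.median(spending),   # central tendency, robust to the large value
    "mode": statistics.mode(spending),       # most frequent value
    "range": max(spending) - min(spending),  # variation: spread between extremes
    "stdev": statistics.stdev(spending),     # variation: sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```

The gap between the mean and the median hints that a single large spender skews the data, which is the kind of observation a summary report is meant to surface.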
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data (geographical data)
Multimedia databases
Text databases
The World-Wide Web
Data Mining: On What Kinds of Data?
Mining relational databases
E.g., analyze customer data to predict the credit risk
of new customers based on their income, age, and
previous credit information.
Data warehouses
E.g., sales per item type per branch for the third quarter.
Data are stored to provide information from a historical
perspective, e.g., summarized data for the past 6 to 12
months.
Modeled by a multidimensional data structure called a
data cube.
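The data cube idea can be sketched as a group-by over chosen dimensions. This is a minimal illustration, not a warehouse implementation: the `sales` rows, the branch names, and the `aggregate` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (item_type, branch, quarter, sales_amount).
sales = [
    ("computer", "Delhi", "Q3", 400),
    ("computer", "Mumbai", "Q3", 300),
    ("printer", "Delhi", "Q3", 100),
    ("printer", "Delhi", "Q2", 80),
    ("computer", "Delhi", "Q3", 200),
]

def aggregate(rows, dims):
    """Sum sales over the chosen dimensions (one cell per distinct key)."""
    index = {"item_type": 0, "branch": 1, "quarter": 2}
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[index[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

# Sales per item type per branch for the third quarter.
q3 = [r for r in sales if r[2] == "Q3"]
per_item_branch = aggregate(q3, ["item_type", "branch"])

# Roll up the branch dimension: sales per item type only.
per_item = aggregate(q3, ["item_type"])

print(per_item_branch)
print(per_item)
```

Dropping a dimension from `dims` corresponds to a roll-up, and adding one back corresponds to a drill-down.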
Transactional data
E.g., analyze which items sell well together:
printers are normally purchased together with
computers
Data Mining: On What Kinds of Data?
Kinds of Patterns to be mined
What Kinds of Patterns Can Be Mined?
1) Generalization
2) Association and Correlation Analysis
3) Classification
4) Cluster Analysis
5) Outlier Analysis
Data Mining Function: (1) Generalization
Multidimensional concept description:
Characterization and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., summarize the characteristics
of customers who spend more than Rs. 50,000 a
year at an electronics store
Data characterization is a summarization of the
general characteristics or features of a target class
of data
Data cube technology for computing
OLAP (online analytical processing)
Examples of output forms: pie charts, multidimensional
data (MDD) cubes, bar charts, curves, etc.
Data Mining Function: (1) Generalization (contd.)
Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.
E.g., compare two groups of customers: those who shop
for computer products regularly (more than twice a
month) and those who rarely shop for such
products (fewer than three times a year)
Data cube technology for computing
Drill down on any dimension
Discriminant rules: Discrimination descriptions
expressed in the form of rules
Output forms: same as those for data characterization,
along with discrimination descriptions
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in
your mart? E.g., milk and bread
Association, correlation vs. causality
A typical association rule
computer → software [1%, 50%] (support, confidence)
A confidence of 50% means that if a customer buys a computer, there
is a 50% chance that she will buy software too. A support of 1% means
that 1% of all transactions under analysis show that computer
and software are purchased together
Association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold
and a minimum confidence threshold
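Support and confidence for a rule such as computer → software can be computed directly from transaction data. The sketch below is illustrative only: the `transactions` list, the helper names, and the threshold values are assumptions, not part of the slides.

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also contain the consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    return (support(antecedent | consequent, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

# Rule: computer -> software
rule_support = support({"computer", "software"}, transactions)          # 1 of 4 transactions
rule_confidence = confidence({"computer"}, {"software"}, transactions)  # 1 of 2 computer purchases
print(rule_support, rule_confidence)
```

With these four transactions the rule has 25% support and 50% confidence, so it passes thresholds of (20%, 50%) but is discarded under a 30% minimum support.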
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
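To make "construct a model from training examples, then predict unknown labels" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) on a single feature. The training data, the label names, and the `train_stump` and `predict` helpers are hypothetical; real decision-tree learners use richer splitting criteria.

```python
# Hypothetical training examples: (gas mileage, class label).
training = [(8, "low"), (10, "low"), (12, "low"), (18, "high"), (20, "high"), (25, "high")]

def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples.

    The model predicts `above` for values greater than the threshold and
    `below` otherwise. Assumes at least two distinct feature values.
    """
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split between neighbouring values
        for below in labels:
            for above in labels:
                errors = sum(1 for x, y in examples
                             if (above if x > threshold else below) != y)
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]

def predict(model, x):
    """Apply the learned threshold rule to a new, unlabeled value."""
    threshold, below, above = model
    return above if x > threshold else below

model = train_stump(training)
print(model, predict(model, 22), predict(model, 9))
```

The learned model here is a single rule (mileage above 15.0 is "high", otherwise "low"), which is also an example of a classification model expressed as an IF-THEN rule.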
Various forms of a classification model
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., the class label is
unknown)
Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
Data objects are clustered or grouped based
on the principle of maximizing intraclass
similarity and minimizing interclass similarity
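The slides do not name a clustering algorithm; as one common choice, the sketch below uses plain k-means to group hypothetical house locations. The `houses` coordinates, the starting centers, and the `kmeans` helper are assumptions made for illustration.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(houses, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

Points within each resulting cluster are close to one another (high intraclass similarity) and far from the other cluster (low interclass similarity). No class labels were supplied at any point.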
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
Outlier analysis (anomaly mining)
Outlier: A data object that does not comply
with the general behaviour of the data
Noise or exception? One person’s garbage
could be another person’s treasure
Methods: by-product of clustering or
regression analysis, …
Useful in fraud detection and rare-event analysis
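As a simple stand-in for the methods listed above (it is neither clustering nor regression), the sketch below flags values whose z-score exceeds a threshold. The `amounts` data, the threshold of 2.0, and the `zscore_outliers` helper are assumptions chosen for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical transaction amounts; one is far from the general behaviour.
amounts = [100, 102, 98, 101, 99, 100, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.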
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them
are interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on the user’s beliefs about the data, e.g., a large
earthquake often follows a cluster of small earthquakes.
Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the
uninteresting ones
Generate only the interesting patterns
Technologies Used
As a highly application-driven domain, data mining has
incorporated many techniques from other domains
The interdisciplinary nature of data mining research and
development contributes significantly to the success of data
mining and its extensive applications
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by contributing disciplines: Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, Database Technology]
Data Mining: Confluence of Multiple Disciplines
Statistics
Statistical models are widely used to model data and
data classes.
E.g., statistics can be used to model noise and
missing data.
Machine learning
Computer programs automatically learn to
recognize complex patterns and make intelligent
decisions based on data.
E.g., recognizing handwritten postal codes
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such
as terabytes
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types,
e.g., files in PDF or DOC format
Mining knowledge in multi-dimensional space
Data mining: an interdisciplinary effort (e.g., mining data with
natural-language text)
Pattern evaluation: the interestingness problem
Handling noise, uncertainty, and incompleteness of data
Integration of the discovered knowledge with existing knowledge:
knowledge fusion
Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining (1)
User interaction
Interactive mining (dynamically changing the focus of the search)
Incorporation of background knowledge (constraints, rules)
Presentation and visualization of data mining results
Efficiency and Scalability
Efficiency and scalability of data mining algorithms (run time …
predictable, short, acceptable)
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data (from simple to temporal data
objects)
Mining dynamic, networked, and global data repositories
Major Issues in Data Mining (2)
Data mining and society
Social impacts of data mining (benefit to society)
Privacy-preserving data mining
Invisible data mining (systems have built-in functions … click of
a mouse)
Architecture: Typical Data Mining System
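[Figure: typical data mining system; components listed from the user interface down to the data sources]
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
Knowledge-Base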
Summary
Data mining: Discovering interesting patterns from large amounts
of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining