Outline
•What is Data Mining and KDD?
•Characteristics
•Applications
•Methods
•Packages & Close Relatives
What is Data Mining & KDD?
•“The process of identifying hidden patterns
and relationships within data”
or
•“Data mining helps end users extract useful
business information from large databases”
What’s the Appeal?
•Hidden nuggets of valuable information buried
deep within a mountain of otherwise
unremarkable data
•Pervasive data
•Seek competitive advantage
Process: Knowledge Discovery In Databases
[Figure: KDD process flow. Elements shown: databases; cleaning & integration; data warehouse; selection; collect and transform; data mining (engines, models); discovered patterns; evaluation & presentation; user interface and expert knowledge; domain. Feedback loops are labelled "modify data selection" and "modify methods, parameters".]
Context
•Where you stand on Data Mining depends on
where you sit:
•Business User
•Researcher
•Computer Scientist
Data Mining Might Mean…
•Statistics
•Visualization
•Artificial intelligence
•Machine learning
•Database technology
•Neural networks
•Pattern recognition
•Knowledge-based systems
•Knowledge acquisition
•Information retrieval
•High performance computing
•And so on...
What’s needed?
•Suitable data
•Computing power
•Data mining software
•Skilled operator who knows both the nature of
the data and the software tools
•Reason, theory, or hunch
Typical Applications of Data Mining & KDD
•Marketing
•Market Basket Analysis
•Customer Relationship Management
•New Product Development
Typical Applications of Data Mining & KDD
•Financial Services
•Credit Approval
•Fraud Detection
•Marketing
Typical Applications of Data Mining & KDD
•Health Care
•Epidemiological Analysis - incidence and prevalence
of disease in large populations and detection of the
source and cause of epidemics of infectious disease
•Knowledge for funding
•Policy, programs
Two Basic Approaches
•Supervised
•A dependent or target variable
•Unsupervised
•“Pure Data Mining”
•Fewer assumptions
•Typically used for clustering techniques
Automation
•The ability to aim a tool at some data and push
a button
•Some methods of KDD/Data mining are more
suitable for automation than others
Decision Trees
•Graphical representations of relationships within data
•Excel at Classification & Prediction Models
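Sample of a Decision Tree
[Figure: sample decision tree. The root splits on gender (female / male); lower nodes split on married? (yes / no), age (<65 / >=65), good health? (yes / no), urban? (yes / no), and pet owner? (yes / no); leaves are labelled + or -.]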
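A minimal sketch of how a tree like the sample is used to classify a record. The attribute names are borrowed from the slide, but the branch layout and leaf labels below are hypothetical, invented only for illustration.

```python
# Hypothetical tree: attribute names come from the sample slide, but the
# branch layout and the +/- leaf labels are invented for illustration.
TREE = {
    "attribute": "gender",
    "branches": {
        "female": {
            "attribute": "married",
            "branches": {"yes": "-", "no": "+"},
        },
        "male": {
            "attribute": "age_65_or_over",
            "branches": {
                "yes": {
                    "attribute": "good_health",
                    "branches": {"yes": "-", "no": "+"},
                },
                "no": "-",
            },
        },
    },
}


def classify(tree, record):
    """Walk from the root to a leaf, following the branch matching the record."""
    node = tree
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node  # a leaf label: "+" or "-"


def count_leaves(tree):
    """Count leaves: a rough measure of how hard the tree is to read."""
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in tree["branches"].values())


example = {"gender": "male", "age_65_or_over": "yes", "good_health": "no"}
label = classify(TREE, example)
```

Each prediction is a single root-to-leaf path, which is why small trees are easy to explain; the leaf count grows quickly as more variables are added.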
Decision Trees
•Strengths
•Easily understood and interpreted
•Represent complexity in a compact form
•Handle non-linear data well
•Relatively well suited to automation
•Weaknesses
•Large trees with large numbers of variables become difficult to understand
•Missing data must be appropriately managed in construction and use of the models
Neural Networks
•Derived from Artificial Intelligence Research
•Modelled on the Human Neuron
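As a rough illustration of the neuron analogy, a single artificial neuron can be sketched as a weighted sum of inputs passed through an activation function. The sigmoid activation and the example weights are assumptions chosen for illustration, not taken from the slides.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # squashes the sum into (0, 1)


# Hypothetical inputs and weights, chosen only to show the calculation.
output = neuron([1.0, 0.0], [2.0, -1.0], bias=-2.0)
```

A network chains many such units together; training adjusts the weights and biases, and the individual values are hard to interpret on their own.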
Neural Networks
•Strengths
•Accuracy of prediction
•Robust performance with a wide variety of data types
•Weaknesses
•Prone to overfitting
•Poor clarity of model
Clustering/Nearest Neighbour
•Aim to assign “like” records to a group
•Groups assigned according to some target
variable or criteria
•Nearest neighbour used for prediction
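A minimal sketch of nearest-neighbour prediction, assuming numeric features, Euclidean distance, and a majority vote among the k closest records. The data and labels are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a label by majority vote among the k nearest training records.

    training: list of (features, label) pairs; features are numeric tuples.
    """
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical records: two numeric features and a group label.
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 8.5), "high"),
    ((8.5, 9.5), "high"),
]
prediction = knn_predict(training, (2.0, 2.0), k=3)
```

Real data usually needs preprocessing first (scaling, encoding of categories), because the distance measure treats every feature as a comparable number.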
Clustering/Nearest Neighbour
•Strengths
•Easily understood and interpreted
•Easily implemented in basic situations
•Weaknesses
•Complex data not well suited to automation (much preprocessing required)
Genetic Algorithms/Evolutionary Computing
•Grounded in Darwin – applied using
mathematics
•Require
•a way to represent a solution to a problem
•a way to test the “fitness” of the solution
•Solutions are mathematically “mutated”
•Fittest solutions survive
•Repeat until the solutions converge
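The ingredients listed above can be sketched with a toy problem. This is an illustrative assumption, not a method from the slides: solutions are bit strings, fitness is the number of 1s, and only mutation is used (no crossover).

```python
import random


def fitness(bits):
    """Fitness test: the number of 1s in the candidate solution."""
    return sum(bits)


def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def evolve(length=20, population_size=30, generations=40, seed=0):
    """Toy evolutionary loop: keep the fittest half, refill with mutants."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        survivors = population[:population_size // 2]  # fittest survive
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    best = max(population, key=fitness)
    best_history.append(fitness(best))
    return best, best_history


best, history = evolve()
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next; the run has converged once `history` stops improving.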
Genetic Algorithms/Evolutionary Computing
•Strengths
•Suited to novel problems that are poorly understood
•Suitable where data is dirty or missing
•May be useful where other methods cannot be applied
•Weaknesses
•Not easily automated
•Require creativity in their application
Bayesian Networks
•Based on Bayes’ rule:
•P(a|b) = P(b|a) * P(a) / P(b)
•Can construct networks of linked events, each
with prior probabilities
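A worked example of Bayes' rule, using made-up probabilities. The numbers and the helper name `posterior` are assumptions for illustration; P(b) is obtained from the law of total probability.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, chosen only to show the arithmetic.
p_a = 0.01              # prior probability of event a
p_b_given_a = 0.9       # probability of evidence b when a is true
p_b_given_not_a = 0.05  # probability of evidence b when a is false

# Law of total probability: P(b) = P(b|a)P(a) + P(b|not a)P(not a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
```

With these numbers P(b) is 0.0585 and P(a|b) is roughly 0.15: the evidence raises the probability of a well above its 1% prior, but a remains unlikely because it was rare to begin with.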
Bayesian Network Example
[Figure: Bayesian network for the event "J.R. Shot". Linked nodes: Bobby shot him; Just a dream sequence; Mistress shot him; Wife shot him; Suicide; J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]
Bayesian Networks
•Strengths
•Clarity of the resulting models
•Good precision in predicting
•Easily adapt to new probabilities
•Weaknesses
•Time consuming to construct and maintain
•Poor at predicting rare events
Statistics
•With an outcome or dependent variable:
•Correlations
•ANOVA
•Regression
•Used by themselves or to confirm findings of
another method
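A minimal sketch of two of the techniques named above, Pearson correlation and simple least-squares regression, written in plain Python. The data values are made up for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data with a perfectly linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r = pearson_r(xs, ys)
slope, intercept = least_squares_line(xs, ys)
```

Both calculations assume a linear relationship, which is one of the limitations noted for statistical techniques.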
Statistics
•Strengths
•“Gold Standard” – valid and trusted in scientific circles
•Weaknesses
•Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)
Hybrids
•Techniques used in combination
•Example: use of a genetic algorithm to identify
target variables for inclusion in a neural
network model
Recap
•Data Mining is the core activity or method
within a process of Knowledge Discovery in
Databases
•Done to find useful information in large amounts of data that “conventional” approaches would not uncover
•Variety of methods
•Requires knowledge of the data domain and methods, as well as creativity
Data Mining Packages
•Major vendors of database/data management
products (IBM, SPSS, Oracle PeopleSoft,
SAS, and so on)
•Added as a component of turnkey packages
•May incorporate several methods (SAS
Enterprise Miner)
•Single method (TreeAge Software Inc.: a
dedicated decision tree product)
How to implement?
•Do it yourself (you know the data domain)
•Put a team together (domain and method
specialists)
•Hire a consultant (who knows both your
domain and the tools)
•Vertical markets in data mining
Close Relatives of Data Mining
•On-Line Analytical Processing (OLAP)
•Pivot tables in spreadsheets
•General statistical packages
•Intelligent Data Analysis – comprises the use
of data mining methods in the analysis of
“small” datasets