DataMining and Knowledge Discovery in DB.ppt

AnonymousEImkf6RGdQ 51 views 36 slides Sep 19, 2024

About This Presentation

Data Mining notes


Slide Content

Data Mining and
Knowledge Discovery
in Databases

Outline
•What is Data Mining and KDD?
•Characteristics
•Applications
•Methods
•Packages & Close Relatives

What is Data Mining & KDD?
•“The process of identifying hidden patterns
and relationships within data”
or
•“Data mining helps end users extract useful
business information from large databases”

What’s the Appeal?
•Hidden nuggets of valuable information buried
deep within a mountain of otherwise
unremarkable data
•Pervasive data
•Seek competitive advantage

The Challenge
5102018890521200153945819900000000141988122944882199608162100000010100010000000
1100003111110000000001003130200000000000000202001000000000000000000000000000043
4388888888424243424333012202022200001010010000000441000000001100000000000000000
1000001000000000000000000000000000000000000000000000000019981027510201896060120
0212694096800000015901998090337981199809173100100000100010000000110000320002000
0001000000012399000000000000200222200313100312000000000000000042438888888888424
3424233212121222200000010110000002441000000000100200000000000000000000100000000
0000000000000000000000000000000000000000019981230510201897020320001862692920000
0047091998021356971199802273100000100100010000000001101100000020000100000000021
0110001000000000001000000000000100011000000011100338888222233113233433300000011
0000011101001100102000100000000100000000100000000000000000000000000000000000000
0000000000000000000000000019981221510201899093020052008986730000019410199901127
5981199901263100100010100010000000001111101111122010100000111230010010000001021
0002200000000002000000000000011133438888434242424342423300000011110000010110010
0002441000000000100200000001001010000000100000000000000000000000000000000000000
0000000000019990525510201899122720093540515830000014484199705271797119970610310
0000010110010000000100000311120120000100100101200011110010000110100120000000000
0100000000001010132438888888888224242433100000001002100001110010011230100000010
0000200010000000000110000100000000100000100000000000000001000000000000000001998
1117510201899122720093540515830000014484199705271797219980616310000001011001000
0000110100311111121000100000202210012220220020221222201000000000000000001010011
0032434343213242214242423300210021000011110110000011223100110000010000001000000
0000110000100000000100000100000000000000000000000000000000001998122351020190001

Process: Knowledge Discovery in Databases
[Diagram: KDD process flow. Labels recoverable from the figure:]
•databases → cleaning & integration → data warehouse
•selection; collect and transform
•data mining (data mining engines, models) → discovered patterns
•evaluation & presentation → user interface and expert domain knowledge
•Feedback loops: modify data selection; modify methods, parameters

Context
•Where you stand on Data Mining depends on
where you sit:
•Business User
•Researcher
•Computer Scientist

Data Mining Might Mean…
•Statistics
•Visualization
•Artificial intelligence
•Machine learning
•Database technology
•Neural networks
•Pattern recognition
•Knowledge-based systems
•Knowledge acquisition
•Information retrieval
•High performance computing
•And so on...

What’s needed?
•Suitable data
•Computing power
•Data mining software
•Skilled operator who knows both the nature of
the data and the software tools
•Reason, theory, or hunch

Typical Applications of Data Mining
& KDD
•Marketing
•Market Basket Analysis
•Customer Relationship Management
•New Product Development

Typical Applications of Data Mining
& KDD
•Financial Services
•Credit Approval
•Fraud Detection
•Marketing

Typical Applications of Data Mining
& KDD
•Health Care
•Epidemiological Analysis - incidence and prevalence
of disease in large populations and detection of the
source and cause of epidemics of infectious disease
•Knowledge for funding
•Policy, programs

Two Basic Approaches
•Supervised
•A dependent or target variable
•Unsupervised
•“Pure Data Mining”
•Fewer assumptions
•Typically used for clustering techniques

Automation
•The ability to aim a tool at some data and push
a button
•Some methods of KDD/Data mining are more
suitable for automation than others

Seven Basic Methods:
1.Decision Trees
2.(Artificial) Neural Networks
3.Cluster/Nearest Neighbour
4.Genetic Algorithms/Evolutionary Computing
5.Bayesian Networks
6.Statistics
7.Hybrids

Decision Trees
•Graphical representations of relationships with
data
•Excel at Classification & Prediction Models

Sample of a Decision Tree
[Figure: decision tree. The root node splits on gender (female / male). Lower nodes split on married? (yes / no), age (<65 / >=65), good health? (yes / no), urban? (yes / no) and pet owner? (yes / no). Leaves are labelled + or -.]

Decision Trees
•Strengths
•Easily understood
and interpreted
•Represent complexity
in a compact form
•Handle non-linear
data well
•Relatively well suited
to automation.
•Weaknesses
•Large trees with large
numbers of variables
become difficult to
understand
•Missing data must be
appropriately
managed in
construction and use
of the models
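
As a rough illustration of why decision trees are easy to read and to automate, the sketch below stores a small tree as nested dictionaries and walks it to classify one record. The tree shape, the attribute names (`gender`, `married`, `age_65_or_over`) and the leaf labels are hypothetical; they only loosely echo the attributes in the sample tree and are not a reconstruction of it.

```python
# A hypothetical two-level tree: each internal node asks one question,
# each leaf is a class label ("+" or "-").
TREE = {
    "question": "gender",
    "branches": {
        "female": {"question": "married", "branches": {"yes": "-", "no": "+"}},
        "male": {"question": "age_65_or_over", "branches": {"yes": "+", "no": "-"}},
    },
}


def classify(record, node=TREE):
    """Follow the branch matching the record's answer until a leaf is reached."""
    while isinstance(node, dict):
        answer = record[node["question"]]
        node = node["branches"][answer]
    return node


# Example record (made-up values).
label = classify({"gender": "male", "age_65_or_over": "yes"})
```

A record with a missing attribute raises `KeyError` here, which is one concrete form of the missing-data weakness noted above: the model has to decide what to do when an answer is absent.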

Neural Networks
•Derived from Artificial Intelligence Research
•Modelled on the Human Neuron

Neural Networks
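[Figure: feed-forward network. Input Variables (Age, Gender, Income) connect through Weights to a Hidden Layer, which connects through Weights to a single Prediction. Weight values shown in the figure: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2.]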

Neural Networks
•Strengths
•Accuracy of prediction
•Robust performance
with a wide variety of
data types
•Weaknesses
•Prone to overfitting
•Poor clarity of model

Clustering/Nearest Neighbour
•Aim to assign “like” records to a group
•Groups assigned according to some target
variable or criteria
•Nearest neighbour used for prediction

Clustering/Nearest Neighbour
•Applications:
•Text processing: search engines
•Image processing: radiology/image processing
•Fraud detection: outliers

Clustering/
Nearest Neighbour
•Strengths
•Easily understood
and interpreted
•Easily implemented in
basic situations
•Weaknesses
•Complex data not well
suited to automation
(much preprocessing
required)

Genetic Algorithms/
Evolutionary Computing
•Grounded in Darwin – applied using
mathematics
•Require
•a way to represent a solution to a problem
•a way to test the “fitness” of the solution
•Solutions are mathematically “mutated”
•Fittest solutions survive
•Convergence
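
The ingredients listed above (a representation, a fitness test, mutation, survival of the fittest, convergence) can be shown on a toy problem. In this sketch a solution is a list of bits and fitness is the number of 1s. The problem, the population size, the mutation rate and the keep-the-top-half rule are all illustrative assumptions, and only mutation is used (no crossover).

```python
import random


def fitness(bits):
    """Fitness test: count the 1-bits (a toy stand-in for a real objective)."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=40, rate=0.05, seed=0):
    """Return the best solution found and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        survivors = population[: pop_size // 2]           # fittest half survives
        children = [
            mutate(rng.choice(survivors), rate, rng)      # mutated copies refill the rest
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness), best_per_generation


best, history = evolve()
```

Because the survivors are carried over unchanged, the best fitness in `history` never decreases from one generation to the next. A flattening of that curve is what convergence looks like in practice.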

Genetic Algorithms/
Evolutionary Computing
•Strengths
•Suited to novel
problems that are
poorly understood
•Suitable where data is
dirty or missing
•May be useful where
other methods cannot
be applied
•Weaknesses
•Not easily automated
•Require creativity in
their application

Bayesian Networks
•Based on Bayes’ rule:
•P(a|b) = P(b|a) * P(a) / P(b)
•Can construct networks of linked events, each
with prior probabilities
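
A worked use of Bayes' rule, with made-up numbers: suppose 1% of transactions are fraudulent (the prior), a detector flags 90% of fraudulent transactions, and it also flags 5% of legitimate ones. The function name and all three probabilities are assumptions for illustration only.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b).

    P(b) is expanded over the two cases "a" and "not a".
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# P(fraud | flagged) with the hypothetical rates described above.
p_fraud_given_flag = posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
# 0.009 / 0.0585, roughly 0.154
```

Even with a fairly accurate detector, only about 15% of flagged transactions are fraudulent in this example, because the prior is so low.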

Bayesian Network Example
[Figure: network of linked events centred on "J.R. Shot". Other nodes: Bobby shot him; Just a dream sequence; Mistress shot him; Wife shot him; Suicide; J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]

Bayesian Networks
•Strengths
•Clarity of the resulting
models
•Good precision in
predicting
•Easily adapt to new
probabilities
•Weaknesses
•Time consuming to
construct and
maintain
•Poor at predicting
rare events

Statistics
•With an outcome or dependent variable:
•Correlations
•ANOVA
•Regression
•Used by themselves or to confirm findings of
another method
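
Two of the techniques named above, correlation and regression, can be computed directly from their definitions. The sketch below uses plain Python and made-up paired measurements; the function names and the data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical data: an input variable and a dependent variable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
slope, intercept = least_squares_line(xs, ys)
```

Both functions assume a roughly linear relationship, which is the kind of limitation the Statistics weaknesses slide refers to.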

Statistics
•Strengths
•“Gold Standard” –
valid and trusted in
scientific circles
•Weaknesses
•Limits findings to
those techniques that
are applied and their
associated limitations
(normality, linearity,
and so on)

Hybrids
•Techniques used in combination
•Example: use of a genetic algorithm to identify
target variables for inclusion in a neural
network model

Recap
•Data Mining is the core activity or method
within a process of Knowledge Discovery in
Databases
•Done in order to find useful information in large
amounts of data, which is not possible using
“conventional” approaches
•Variety of methods
•Requires knowledge of the data domain and
methods, as well as creativity

Data Mining Packages
•Major vendors of database/data management
products (IBM, SPSS, Oracle PeopleSoft,
SAS, and so on)
•Added as a component of turnkey packages
•May incorporate several methods (SAS
Enterprise Miner)
•Single method (TreeAge Software Inc.: a
dedicated decision tree product)

How to implement?
•Do it yourself (you know the data domain)
•Put a team together (domain and method
specialists)
•Hire a consultant (who knows both your
domain and the tools)
•Vertical markets in data mining

Close Relatives of Data Mining
•On-Line Analytical Processing (OLAP)
•Pivot tables in spreadsheets
•General statistical packages
•Intelligent Data Analysis – comprises the use
of data mining methods in the analysis of
“small” datasets