ch_1_dm data preprocessing in data mining

PriyankaPatil919748 9 views 37 slides Sep 27, 2024

About This Presentation

Data warehouse and data mining


Slide Content

September 27, 2024
Data Mining: Concepts and
Techniques 1
Chp-1: Introduction to
Data Mining

Chapter 1. Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Kind of patterns to be mined

Technologies used

Major issues in data mining

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes
(1 petabyte = 1,000 terabytes)

Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets

Evolution of Database Technology

1960s:

Data collection, database creation and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

Data mining, data warehousing, multimedia databases, and Web databases

2000s:

Stream data management and mining

Data mining and its applications

Web technology (XML, data integration) and global information systems

Evolution of database system technology

What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.

Decision making

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process

Figure: KDD process flow. Databases -> Data Cleaning and Data Integration ->
Data Warehouse -> Selection & Transformation -> Data Mining -> Pattern Evaluation
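
The KDD steps can be sketched as a chain of small functions. This is only an illustrative sketch: the record layout, the two store lists, and all function names are assumptions made for the example, and "mining" is reduced to simple frequency counting.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select_and_transform(records):
    """Selection and transformation: keep the item name, lower-cased."""
    return [r["item"].lower() for r in records]

def mine(items):
    """Data mining (toy version): count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that reach a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}

# Hypothetical data from two stores
store_a = [{"item": "Milk", "qty": 1}, {"item": "Bread", "qty": None}]
store_b = [{"item": "milk", "qty": 2}, {"item": "Printer", "qty": 1}]

warehouse = integrate(clean(store_a), clean(store_b))
items = select_and_transform(warehouse)
patterns = evaluate(mine(items), min_count=2)
print(patterns)
```

Each function stands in for one stage of the process; a real system would replace every stage with something far more involved.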

Data Mining in Business Intelligence

Figure: layered pyramid. The potential to support business decisions
increases toward the top.

Layers, top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Why Data Mining?—Potential Applications

Data analysis and decision support

Market analysis and management

Target marketing, customer relationship
management (CRM), market basket analysis, cross
selling

Risk analysis and management

Forecasting, customer retention, quality control,
competitive analysis

Fraud detection and detection of unusual
patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web
mining


Stream data mining

Bioinformatics and bio-data analysis

Ex. 1: Market Analysis and Management

Where does the data come from?—Credit card
transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time

Cross-market analysis: find associations/correlations
between product sales, and predict based on such
associations

Customer profiling—What types of customers buy
what products (clustering or classification)

Ex. 1: Market Analysis and Management

Customer requirement analysis

Identify the best products for different groups of customers

Predict what factors will attract new customers

Provision of summary information

Multidimensional summary reports

Statistical summary information (data central tendency and
variation)

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. bio-sequences)

Structure data, graphs, social networks and multi-linked data

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data (geographical data)

Multimedia database

Text databases

The World-Wide Web

Data Mining: On What Kinds of Data?

Mining relational databases

E.g., analyze customer data to predict the credit risk
of new customers based on their income, age, and
previous credit information.

Data Warehouses

Sales per item type per branch for the third quarter.

Data stored to provide information from a historical
perspective, e.g., data summarized over the past 6 to
12 months.

Modeled by a multidimensional data structure called a
data cube.
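
A query such as "sales per item type per branch for the third quarter" can be sketched as one aggregation over sales records. The records, branch names, and the `cube_slice` helper below are hypothetical, chosen only to illustrate the idea of a cell in a data cube.

```python
from collections import defaultdict

# Hypothetical sales records: (item_type, branch, quarter, amount)
sales = [
    ("computer", "Pune", "Q3", 1200),
    ("printer", "Pune", "Q3", 300),
    ("computer", "Mumbai", "Q3", 900),
    ("computer", "Pune", "Q3", 800),
    ("printer", "Mumbai", "Q2", 250),
]

def cube_slice(records, quarter):
    """Sum sales per (item_type, branch) cell for a single quarter."""
    totals = defaultdict(int)
    for item_type, branch, q, amount in records:
        if q == quarter:
            totals[(item_type, branch)] += amount
    return dict(totals)

q3 = cube_slice(sales, "Q3")
for (item_type, branch), total in sorted(q3.items()):
    print(item_type, branch, total)
```

Fixing the quarter corresponds to taking one slice of the cube; the remaining dimensions (item type and branch) index the summarized cells.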


Transactional data

E.g., analyze which items sell well together.

Printers are commonly purchased together with
computers
Data Mining: On What Kinds of Data?

Kinds of Patterns to be mined

What Kinds of Patterns Can Be Mined?
1) Generalization
2) Association and Correlation Analysis
3) Classification
4) Cluster Analysis
5) Outlier Analysis

Data Mining Function: (1) Generalization

Multidimensional concept description:
Characterization and discrimination

Generalize, summarize, and contrast data
characteristics, e.g., summarize the characteristics
of customers who spend more than Rs. 50,000 a
year at an electronics store

Data characterization is a summarization of the
general characteristics or features of a target class
of data

Data cube technology for computing

OLAP (online analytical processing)

Examples of output forms: pie charts, multidimensional data (MDD)
cubes, bar charts, curves, etc.
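
As a rough illustration of data characterization, the sketch below summarizes the customers who spend more than Rs. 50,000 a year. The customer records, field names, and the `characterize` helper are invented for the example.

```python
from statistics import mean

# Hypothetical customer records (amounts in rupees)
customers = [
    {"age": 34, "income": 900000, "annual_spend": 62000},
    {"age": 45, "income": 1200000, "annual_spend": 75000},
    {"age": 23, "income": 300000, "annual_spend": 12000},
    {"age": 51, "income": 1500000, "annual_spend": 58000},
]

def characterize(rows, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in rows if r["annual_spend"] > min_spend]
    if not target:
        return {"count": 0}
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "mean_income": mean(r["income"] for r in target),
    }

summary = characterize(customers, 50000)
print(summary)
```

Running the same helper on a contrasting class (for example, customers below the threshold) and comparing the two summaries would be a simple form of data discrimination.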

Data Mining Function: (1) Generalization
contd.

Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.

E.g., compare two groups of customers: those who shop
for computer products regularly (more than twice a
month) and those who rarely shop for such
products (fewer than three times a year)

Data cube technology for computing

Drill down on any dimension

Discriminant rules: Discrimination descriptions
expressed in the form of rules

Output forms: the same as those of data characterization,
along with discrimination descriptions

Data Mining Function: (2) Association and
Correlation Analysis

Frequent patterns (or frequent itemsets)

What items are frequently purchased together in
your mart? Eg. Milk & bread

Association, correlation vs. causality

A typical association rule

Computer → software [1%, 50%] (support, confidence)

A confidence of 50% means that a customer who buys a computer
has a 50% chance of buying software too. A support of 1% means
that 1% of all transactions under analysis show computer
and software purchased together

Association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold
and a minimum confidence threshold
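
Support and confidence can be computed directly from a list of transactions. The sketch below uses a tiny made-up transaction list, so the numbers differ from the 1% and 50% in the slide; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

def is_interesting(transactions, antecedent, consequent, min_sup, min_conf):
    """A rule is kept only if it meets both thresholds."""
    union = set(antecedent) | set(consequent)
    return (support(transactions, union) >= min_sup
            and confidence(transactions, antecedent, consequent) >= min_conf)

# Hypothetical transactions
transactions = [
    {"computer", "software"},
    {"computer"},
    {"milk", "bread"},
    {"milk", "bread", "software"},
]

sup = support(transactions, {"computer", "software"})
conf = confidence(transactions, {"computer"}, {"software"})
print(sup, conf)
```

Here one of four transactions contains both items (support 0.25), and one of the two computer transactions also contains software (confidence 0.5).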

Data Mining Function: (3) Classification

Classification and label prediction

Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars
based on (gas mileage)

Predict some unknown class labels

Typical methods

Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …

Typical applications:

Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
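
To make "construct a model from training examples, then predict unknown labels" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) on a single numeric attribute. The mileage values, the labels, and the function names are assumptions for illustration; real decision tree learners are considerably more elaborate.

```python
def train_stump(examples):
    """Learn a threshold on one numeric attribute.

    examples: list of (value, label) pairs with at least two distinct values.
    Returns (threshold, label_if_at_or_below, label_if_above), chosen to
    minimize the number of training errors.
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({label for _, label in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left in labels:
            for right in labels:
                errors = sum(
                    1 for v, label in examples
                    if (left if v <= threshold else right) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return threshold, left, right

def predict(model, value):
    """Apply the learned rule to a new value."""
    threshold, left, right = model
    return left if value <= threshold else right

# Hypothetical training data: (gas mileage, mileage class)
training = [(12, "low"), (15, "low"), (18, "low"),
            (28, "high"), (33, "high"), (40, "high")]

model = train_stump(training)
print(model, predict(model, 20), predict(model, 35))
```

The learned model is a single if-then rule, one of the simplest forms a classification model can take.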

Various forms of a classification model

Data Mining Function: (4) Cluster Analysis

Unsupervised learning (i.e., class label is
unknown)

Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns

Data objects are clustered or grouped based
on the principle of maximizing intraclass
similarity and minimizing interclass similarity
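
One common way to group objects so that members of a cluster are close to each other is k-means, used here as an example technique (the slides do not name a specific algorithm). The house coordinates are made up, and the initial centroids are simply the first k points to keep the run deterministic.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y)
houses = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

centroids, clusters = kmeans(houses, k=2)
print(clusters)
```

No class labels are supplied; the two groups emerge purely from the distances between the points.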

Data Mining Function: (4) Cluster Analysis

Data Mining Function: (5) Outlier Analysis

Outlier analysis (anomaly mining)

Outlier: A data object that does not comply
with the general behaviour of the data

Noise or exception? ― One person’s garbage
could be another person’s treasure

Methods: by-product of clustering or
regression analysis, …

Useful in fraud detection, rare events analysis
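
Outliers as a by-product of regression analysis can be sketched by fitting a least-squares line and flagging points whose residual is unusually large. The data, the function names, and the cut-off of two standard deviations are assumptions for this example, not a recommended setting.

```python
from statistics import pstdev

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept.

    Assumes the x values are not all identical.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(points, k=2.0):
    """Return points whose residual exceeds k standard deviations."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    limit = k * pstdev(residuals)
    return [p for p, r in zip(points, residuals) if abs(r) > limit]

# Hypothetical (day, amount) readings with one unusual value
readings = [(0, 0), (1, 2), (2, 4), (3, 20), (4, 8), (5, 10), (6, 12)]

outliers = residual_outliers(readings)
print(outliers)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment the method itself cannot make.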

Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: Not all of them
are interesting

Suggested approach: Human-centered, query-based, focused mining

Interestingness measures

A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.

Subjective: based on the user’s belief in the data, e.g., a large earthquake
often follows a cluster of small earthquakes.

Find All and Only Interesting Patterns?

Find all the interesting patterns: Completeness

Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?

Association vs. classification vs. clustering

Search for only interesting patterns: An optimization problem

Can a data mining system find only the interesting patterns?

Approaches

First generate all the patterns and then filter out the
uninteresting ones

Generate only the interesting patterns

Technologies Used

As a highly application-driven domain, data mining has
incorporated many techniques from other domains

The interdisciplinary nature of data mining research and
development contributes significantly to the success of data
mining and its extensive applications

Data Mining: Confluence of Multiple Disciplines

Figure: Data Mining at the center, surrounded by Machine Learning,
Statistics, Applications, Algorithm, Pattern Recognition,
High-Performance Computing, Visualization, and Database Technology

Data Mining: Confluence of Multiple Disciplines

Statistics

Statistical models are widely used to model data and
data classes.

E.g., we can use statistics to model noise and
missing data.

Machine learning

Computer programs automatically learn to
recognize complex patterns and make intelligent
decisions based on data.

E.g., recognizing handwritten postal codes

Why Confluence of Multiple Disciplines?

Tremendous amount of data

Algorithms must be highly scalable to handle data volumes such as
terabytes of data

High-dimensionality of data

Microarray data may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structure data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications

Major Issues in Data Mining

Mining methodology

Mining different kinds of knowledge from diverse data types,
e.g., files in pdf or doc

Mining knowledge in multi-dimensional space.

Data mining: an interdisciplinary effort (e.g., mining data with
natural language text)

Pattern evaluation: the interestingness problem

Handling noise, uncertainty, and incompleteness of data

Integration of the discovered knowledge with existing knowledge:
knowledge fusion

Pattern evaluation and pattern- or constraint-guided mining

Major Issues in Data Mining (1)

User interaction

Interactive mining (dynamically change the focus of the search)

Incorporation of background knowledge (constraints, rules)

Presentation and visualization of data mining results

Efficiency and Scalability

Efficiency and scalability of data mining algorithms (run time …
predictable, short, acceptable)

Parallel, distributed, stream, and incremental mining methods

Diversity of data types

Handling complex types of data (simple to temporal data
objects)

Mining dynamic, networked, and global data repositories

Major Issues in Data Mining (2)

Data mining and society

Social impacts of data mining (benefit to society)

Privacy-preserving data mining

Invisible data mining (systems have built-in mining functions, used at
the click of a mouse)

Architecture: Typical Data Mining System

Figure: layered architecture, top to bottom:
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data cleaning, integration, and selection
Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories

Knowledge-Base (shown alongside the layers)

Summary

Data mining: Discovering interesting patterns from large amounts
of data

A natural evolution of database technology, in great demand, with
wide applications

A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.

Data mining systems and architectures

Major issues in data mining