Unit 1_data mining and warehousing subject

tomjerryguest 13 views 67 slides Mar 10, 2025


Course: Program Elective-I (Data Mining and Warehousing)

Course Teachers: Course Chairman:
Dr K.Rajeswari Dr Swati Shinde
Dr Avinash Bhute
Module: Knowledge Engineering, Module Coordinator: Dr Mubin Tamboli.

Agenda
Teaching and Examination Scheme
Relevance of the Course and Prerequisites
Course Objectives and Outcomes
Why Data Analytics and Data Mining
Applications and Job opportunity

Teaching & Examination Scheme
Teaching Scheme : Lecture : 2 hrs/week
Planned 30 (Syllabus)
Examination Scheme (100 Marks)
FA = 40 Marks
(Formative Assessment: 10 marks for each unit, evaluated continuously through classroom participation and verified in 2 instances)

Relevance of the Course and Prerequisites
Prerequisites:
•Database Management Systems (DBMS)
•Engineering Mathematics
Relevance of the Course:
•Professional elective course
•Prerequisite course for Machine Learning and Artificial Intelligence
•Teaches data mining techniques for statistical analysis, used to develop effective decision support systems
•Many career opportunities
•Helps students learn and apply preprocessing techniques, data mining functionalities, and post-processing methods for different applications in data mining and data warehousing

Text Books
•Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", third edition, Morgan Kaufmann Publishers, 2012, ISBN 978-0-12-381479-1.
•G. K. Gupta, "Introduction to Data Mining with Case Studies", third edition, PHI Learning Private Limited, Delhi, 2014, ISBN 978-81-203-5002-1.
•William H. Inmon, "Building the Data Warehouse", fourth edition, Wiley, 2005, ISBN 978-0-764-59944-6.

Reference Books
•M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education / Prentice Hall, Upper Saddle River, N.J., 2003.
•Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit", third edition, Wiley, 2013, ISBN-13: 978-1118530801.
•Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", second edition, Morgan Kaufmann Publishers, 2005, ISBN 0-12-088407-0.

Case Study – Canteen Data

DBMS Vs Data Mining
Parameter    DBMS Applications                          Data Mining Applications
Uses         Day-to-day transactions                    Weekly/monthly analysis
Data         Current                                    Historical
Operations   INSERT, UPDATE, DELETE, READ/SELECT        LOAD, READ, ANALYSIS
Users        End users who perform the above            Business analysts, top management,
             operations, such as clerks and operators   managers, executive directors
Examples     ERP system                                 Performance analysis and decision-making system

Course Objectives
To introduce the fundamentals of Data mining and Data Warehousing.
To develop skills to select appropriate multi-dimensional schemas to
design data warehouse model.
To develop skills to identify the appropriateness and need of data mining.
To study and use preprocessing techniques for preparing suitable dataset
for data mining.
To apply data similarity and dissimilarity measures for statistical analysis
To study and apply various methods and algorithms in data mining for
solving real world problems.

Course Outcomes
After learning the course, students will be able to:
1. Use data preprocessing techniques for preparing suitable
dataset for data mining.
2. Select appropriate multi-dimensional schema to design data
warehouse model.
3. Apply data similarity and dissimilarity measures for
statistical analysis.
4. Apply Data Mining functionalities to solve real world
problems.

BUSINESS SCENARIO
•Growing demand
•More vigorous competition
•High customer expectations
•Advancements in technology
•Faster product obsolescence
Its effect
•Reducing sales and market shares
•Decreasing profit margins
•Difficult to survive and grow

Example of Data Analytics

Companies using data analytics
Walmart
Flipkart
Amazon
Accenture
Cigna (American health care organisation)
Rapido (Indian bike rental company, Bangalore)

Why Data Analytics
•Data is being produced in large quantities
•The computing power is available
•The computing power is affordable
•The competitive pressures are strong
•Commercial products are available
•Terabytes – 10^12 bytes – Walmart – 24 Terabytes
•Petabytes – 10^15 bytes – GIS database
•Exabytes – 10^18 bytes – National Medical Record
•Zettabytes – 10^21 bytes – Weather images
•Yottabytes – 10^24 bytes – Intelligence Agency Video

EXAMPLES OF DATA ANALYTICS
9000 stores
More than 100 Countries
10,000 to 1,00,000 Stock Keeping Units (SKUs)
1 Million Transactions Every Hour

EXAMPLES OF DATA ANALYTICS
1800 stores
35 Million Club Customers
1 Billion Items Home Delivered Annually

EXAMPLES OF DATA ANALYTICS
261 Million Subscribers
8 Billion Calls Every Day
Hundreds of Different Call Plans

EXAMPLES OF DATA ANALYTICS
10 Million Transactions on Busy Days

EXAMPLES OF DATA ANALYTICS
100 Million Transactions Per Day

Why Data Analytics and Data Mining
•Companies generate large volumes of data every hour.
•Data may take the form of transactional data, log files, customer data, etc.
•Data is generated rapidly through social media such as Twitter, Facebook and WhatsApp.
•Companies want to use this data to guide business decisions and improve profits; this is where data analytics comes in.

Data Mining
Process of exploring and analysing LARGE
DATASETS to find

Why use data analytics
Improved decision making:
•Speeds up and improves business decisions
•Removes guesswork
More personalisation:
•Understanding customers' needs and interests thoroughly
•Better recommendations for products and services

Why use data analytics (continued)
Efficient operations:
•When the audience's interests are known, no time is wasted posting irrelevant content
Effective content management:
•Helps to optimize campaigns
•Ads can be matched to customers' interests
•Hence improves results
Effective marketing:
•Knowing the customers allows relevant campaigning
•Leads get converted to customers

APPLICATIONS OF DATA ANALYTICS AND JOB OPPORTUNITIES
Data analyst roles in management:
•Marketing
•Finance
•Human Resource
•Operations
•Supply Chain Management
Industries:
•Retail
•Banking
•Telecom etc.

Best Video Links for Beginners
•Machine learning basics: https://youtu.be/ukzFI9rgwfU (8 minutes)
•Data Science basics: https://youtu.be/X3paOmcrTjQ (5 minutes)
•Data Analytics for Beginners: https://www.youtube.com/watch?v=zwasdVPPFFw (1 hour)

THANK YOU!

Data Mining

Introduction to Data Mining
•Gold mining is the process of separating gold from rocks and other materials.
•Similarly, data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
•It involves the exploration and analysis of large quantities of data to discover meaningful patterns and rules hidden in the data.

Introduction to Data Mining
•Data mining is an application of machine learning.
•Machine learning is a type of Artificial Intelligence (AI) that enables computers to learn from data without being explicitly programmed.

Example: Jewelry Shop Sales Analysis
Data set (example attributes): name of jewelry item, design pattern, gender, age
Information:
•No. of items sold today
•No. of items sold on the first day of the month
•No. of items sold at the end of the month
Knowledge / interesting patterns:
•If a ring is purchased, bangles are also purchased.
•If the customer is middle-aged, earring design 1 is purchased frequently and in larger quantities.
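
One common way to quantify a pattern such as "if a ring is purchased, bangles are purchased" is to compute its support and confidence. The sketch below is only an illustration: the transaction list is made up (it is not data from the jewelry shop), and the helper names `support` and `confidence` are hypothetical.

```python
# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"ring", "bangles"},
    {"ring", "bangles", "earring"},
    {"ring"},
    {"earring"},
    {"bangles", "earring"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the fraction that also contain consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"ring", "bangles"}, transactions)
rule_confidence = confidence({"ring"}, {"bangles"}, transactions)
print(f"support(ring, bangles) = {rule_support:.2f}")
print(f"confidence(ring -> bangles) = {rule_confidence:.2f}")
```

With this toy data, ring and bangles appear together in 2 of 5 transactions, and 2 of the 3 ring purchases also include bangles.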

Alternative names
•Knowledge Discovery (mining) from Data (KDD); strictly, data mining is one step of KDD
•Knowledge extraction
•Data/pattern analysis
•Data archeology
•Data dredging
•Information harvesting
•Business intelligence, etc.

Need of Data Mining
•The ability of knowledge workers to make decisions is one of the primary factors that influence the performance and competitive strength of an organization.
•The main purpose of data mining is to provide knowledge workers with knowledge extracted from data, allowing them to make effective and timely decisions.
•It improves the overall quality of the decision-making process.

KDD Process (continued)
Input Data → Data Pre-Processing → Data Mining → Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Architecture of a typical DM System

Knowledge Discovery (KDD) Process
•This is a view from typical database systems and data warehousing communities.
•Data mining plays an essential role in the knowledge discovery process.
Flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Data Mining in KDD

Steps in KDD process

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be
combined)

3. Data selection (where data relevant to the analysis task are
retrieved from the database)

4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)

5. Data mining (an essential process where intelligent methods
are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge, based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user)
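
The seven steps above can be sketched end to end on a toy example. This is a minimal illustration under stated assumptions, not a real mining system: the two record lists, the `clean` helper and the `MIN_SHARE` threshold are all hypothetical, and the "mining" step is reduced to computing each item's share of sales.

```python
from collections import Counter

# Hypothetical raw records from two sources: (item, age_group, quantity).
store_a = [("ring", "middle", 2), ("earring", "middle", 1), ("ring", None, 1)]
store_b = [("earring", "middle", 3), ("bangles", "young", 1), ("earring", "young", -1)]

# 1. Data cleaning: drop records with missing or invalid values.
def clean(records):
    return [r for r in records if r[1] is not None and r[2] > 0]

# 2. Data integration: combine the cleaned sources.
integrated = clean(store_a) + clean(store_b)

# 3. Data selection: keep only records relevant to the task (middle-aged buyers).
selected = [r for r in integrated if r[1] == "middle"]

# 4. Data transformation: aggregate quantity per item.
totals = Counter()
for item, _, qty in selected:
    totals[item] += qty

# 5. Data mining: a deliberately simple pattern, each item's share of total sales.
grand_total = sum(totals.values())
patterns = {item: qty / grand_total for item, qty in totals.items()}

# 6. Pattern evaluation: keep patterns above an assumed interestingness threshold.
MIN_SHARE = 0.5
interesting = {item: share for item, share in patterns.items() if share >= MIN_SHARE}

# 7. Knowledge presentation: report the result to the user.
for item, share in interesting.items():
    print(f"Middle-aged buyers: {item} makes up {share:.0%} of items sold")
```

In practice each step is far more involved; the point of the sketch is only the order in which data flows through the stages.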

Data Mining Task Primitives

Example: Medical Data Mining

Health care and medical data mining often adopts a view from statistics and machine learning:

•Preprocessing of the data (including feature extraction and dimension reduction)

•Classification and/or clustering

•Post-processing for presentation

Example: A Web Mining Framework

Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence
Layers, from top to bottom (the potential to support business decisions increases toward the top):
•Decision Making
•Data Presentation: Visualization Techniques
•Data Mining: Information Discovery
•Data Exploration: Statistical Summary, Querying, and Reporting
•Data Preprocessing/Integration, Data Warehouses
•Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Users, from top to bottom: End User, Business Analyst, Data Analyst, DBA

THANK YOU
https://youtu.be/1NjPTh0Eoeg
KDD Process

Data, Information and Knowledge
Data gathered from different sources cannot be used directly for data mining and decision making.
They need to be processed with appropriate extraction tools and analytical methods capable of transforming them into information and knowledge that decision makers can then use.

Data, Information and Knowledge
Information: Information is the outcome of extraction
and processing activities carried out on data.
Knowledge: Information is transformed into knowledge
when it is used to make decisions and develop the
corresponding actions.

Operational Data and Informational Data
Operational systems (operational data):
1. Systems that help run the enterprise's day-to-day operations (daily transactional data).
2. These are the backbone systems of any enterprise. Because of their importance to the organization, operational systems were almost always the first parts of the enterprise to be computerized.
3. Operational data needs are normally focused on a single area.
4. Examples: OLTP operations or functions such as "order entry", "inventory", "manufacturing", "payroll" and "accounting" systems.
Informational or knowledge-based systems (informational data):
1. Systems that support the planning, forecasting and managing functions within the enterprise.
2. Informational systems deal with analyzing data and making decisions, often major decisions, about how the enterprise will operate, now and in the future.
3. Informational data needs often span a number of different areas and require large amounts of related operational data.
4. Examples: functions such as "marketing planning", "engineering planning" and "financial analysis" require information systems to support them.

Operational Data and Informational Data
                     OLTP (Operational Data)          OLAP (Informational Data)
users                clerk, IT professional           knowledge worker
function             day-to-day operations            decision support
DB design            application-oriented             subject-oriented
data                 current, up-to-date, detailed,   historical, summarized, multidimensional,
                     flat relational, isolated        integrated, consolidated
usage                repetitive                       ad-hoc
access               read/write,                      lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction        complex query
# records accessed   tens                             millions
# users              thousands                        hundreds
DB size              100MB-GB                         100GB-TB
metric               transaction throughput           query throughput, response

DBMS, OLAP, and Data Mining
Task
•DBMS: Extraction of detailed and summary data
•OLAP (Data warehouse): Summaries, trends and forecasts
•Data Mining: Knowledge discovery of hidden patterns and insights
Type of result
•DBMS: Information
•OLAP (Data warehouse): Analysis
•Data Mining: Insight and prediction
Method
•DBMS: Deduction (ask the question, verify with data)
•OLAP (Data warehouse): Multidimensional data modeling, aggregation, statistics
•Data Mining: Induction (build the model, apply it to new data, get the result)
Example question
•DBMS: Who purchased mutual funds in the last 3 years?
•OLAP (Data warehouse): What is the average income of mutual fund buyers by region by year?
•Data Mining: Who will buy a mutual fund in the next 6 months and why?

Example of DBMS, OLAP and Data Mining: Weather Data: DBMS
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no

Example of DBMS, OLAP and Data Mining:
Weather Data
By querying a DBMS containing the above table we can answer questions like:
•What was the temperature on the sunny days? {85, 80, 72, 69, 75}
•On which days was the humidity less than 75? {6, 7, 9, 11}
•On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
•On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
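
The four questions above can be tried out against an in-memory database. The sketch below uses Python's standard `sqlite3` module as a stand-in for the DBMS; the table name `weather`, the 0/1 encoding of `windy`, and the helper `column` are assumptions made for the illustration.

```python
import sqlite3

# Weather table from the slide: (day, outlook, temperature, humidity, windy, play).
rows = [
    (1, "sunny", 85, 85, 0, "no"), (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"), (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"), (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"), (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"), (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"), (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"), (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", rows)

def column(sql):
    """Run a query and return the first column of every row as a list."""
    return [r[0] for r in conn.execute(sql)]

sunny_temps = column("SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day")
low_humidity = column("SELECT day FROM weather WHERE humidity < 75 ORDER BY day")
warm_days = column("SELECT day FROM weather WHERE temperature > 70 ORDER BY day")
warm_and_dry = column(
    "SELECT day FROM weather WHERE temperature > 70 AND humidity < 75 ORDER BY day"
)
print(sunny_temps, low_humidity, warm_days, warm_and_dry)
```

Each query only retrieves facts that are already stored, which is the deductive "ask the question, verify with data" style attributed to a DBMS.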

Example of DBMS, OLAP and Data Mining:
Weather Data
OLAP:
•Using OLAP we can create a multidimensional model of our data (data cube).
•For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the play counts as yes / no; the overall total is 9 (Y) / 5 (N).
         sunny   rainy   overcast
Week 1   0 / 2   2 / 1   2 / 0
Week 2   2 / 1   1 / 1   2 / 0
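
The cube above can be rebuilt by aggregating the weather table along the time and outlook dimensions. The sketch below assumes that Week 1 covers days 1 to 7 and Week 2 covers days 8 to 14; the slides do not state this, but it reproduces the counts shown. The name `cube` is hypothetical.

```python
from collections import defaultdict

# (day, outlook, play) taken from the weather table.
records = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

# cube[(week, outlook)] = [count of play=yes, count of play=no]
cube = defaultdict(lambda: [0, 0])
for day, outlook, play in records:
    week = 1 if day <= 7 else 2  # assumption: days 1-7 form week 1
    cube[(week, outlook)][0 if play == "yes" else 1] += 1

for week in (1, 2):
    cells = [
        f"{outlook}: {cube[(week, outlook)][0]} / {cube[(week, outlook)][1]}"
        for outlook in ("sunny", "rainy", "overcast")
    ]
    print(f"Week {week}  " + "   ".join(cells))

# Rolling up over both dimensions gives the overall total.
total_yes = sum(cell[0] for cell in cube.values())
total_no = sum(cell[1] for cell in cube.values())
print(f"Total: {total_yes} (Y) / {total_no} (N)")
```

Unlike the plain queries, the result here is a summary: individual days disappear and only aggregated counts per cell remain.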

Example of DBMS, OLAP and Data
Mining: Weather Data
Data Mining:
•Using the ID3 algorithm we can produce the following decision tree:
•outlook = sunny
  –humidity = high: play = no
  –humidity = normal: play = yes
•outlook = overcast: play = yes
•outlook = rainy
  –windy = true: play = no
  –windy = false: play = yes
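
ID3 picks the attribute to split on by information gain, that is, the reduction in entropy of the class label. The sketch below computes the gain for `outlook` and `windy` on the weather table and then checks the tree above against all 14 days. It is not a full ID3 implementation, and the humidity cut-off of 75 separating "high" from "normal" is an assumption, since the slides do not give one.

```python
from collections import Counter
from math import log2

# (outlook, humidity, windy, play) from the weather table.
data = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"), ("rainy", 91, True, "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy reduction obtained by splitting rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gain_outlook = information_gain(data, 0)
gain_windy = information_gain(data, 2)
print(f"gain(outlook) = {gain_outlook:.3f}, gain(windy) = {gain_windy:.3f}")

def predict(outlook, humidity, windy, high_humidity=75):
    """The tree from the slide; the humidity cut-off is an assumption."""
    if outlook == "sunny":
        return "no" if humidity > high_humidity else "yes"
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"  # outlook == "rainy"

correct = sum(predict(o, h, w) == play for o, h, w, play in data)
print(f"tree classifies {correct} of {len(data)} days correctly")
```

The gain for `outlook` is clearly larger than for `windy`, which is consistent with `outlook` sitting at the root of the tree. The resulting model can also be applied to days that are not in the table, which is the inductive step that distinguishes data mining from querying.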

THANK YOU

Attributes types in data mining
Input Data → Data Pre-Processing → Data Mining → Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
Attribute types: about 80% of the time is spent preparing the dataset.

Attributes types in data mining
What is an attribute?
An attribute is a property of an object; it represents a feature of that object.
Example: RollNo, Name, and Result are attributes of the object student.
RollNo  Name   Result
1       Ali    Pass
2       Akram  Fail

Types Of Attributes

Nominal Attribute
Nominal data:
Nominal values are names or labels (categories), not numeric quantities.
Example:
Attribute         Value
Categorical data  Lecturer, Assistant Professor, Professor
States            New, Pending, Working, Complete, Finish
Colors            Black, Brown, White, Red

Binary Attribute
Binary data:
Binary data have only two values/states.
Example:
Attribute     Value
HIV detected  Yes, No
Result        Pass, Fail
Binary attributes are of two types:
1) Symmetric binary
2) Asymmetric binary

Symmetric data:
Both values are equally important.
Example:
Attribute  Value
Gender     Male, Female
Asymmetric data:
The two values are not equally important.
Example:
Attribute     Value
HIV detected  Yes, No
Result        Pass, Fail

Ordinal Attribute
Ordinal data:
All values have a meaningful order.
Example:
Attribute               Value
Grade                   A, B, C, D, F
BPS (Basic pay scale)   16, 17, 18
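
The practical difference between ordinal and nominal attributes is which operations make sense on them. The sketch below is a minimal illustration in plain Python: the names `GRADE_ORDER`, `grade_rank` and `better_grade` are hypothetical, and the sample lists are made up.

```python
# Ordinal attribute: the values carry a rank, so comparisons are meaningful.
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # lowest to highest
grade_rank = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}

def better_grade(g1, g2):
    """Return the higher of two grades according to GRADE_ORDER."""
    return g1 if grade_rank[g1] >= grade_rank[g2] else g2

grades = ["B", "F", "A", "C"]
sorted_grades = sorted(grades, key=grade_rank.get)

# Nominal attribute: only equality tests and counts make sense.
colors = ["Black", "Red", "Black", "White"]
color_counts = {c: colors.count(c) for c in set(colors)}

print(sorted_grades)
print(better_grade("B", "D"))
print(color_counts["Black"])
```

Sorting grades by rank is meaningful, whereas sorting colors alphabetically would say nothing about the colors themselves.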

Discrete Attribute
Discrete Data:
Discrete data take a finite (or countable) set of distinct values. The values can be numerical or categorical.
Example:
Attribute    Value
Profession   Teacher, Business Man, Peon etc.
Postal Code  42200, 42300 etc.

Continuous Attribute
Continuous data:
•Continuous data can take infinitely many values within a range; for example, there are many numbers between 1 and 2.
•Continuous data are typically stored as floating-point numbers.
Example:
Attribute  Value
Height     5.4…, 6.5…, etc.
Weight     50.09…, etc.