Data Visualization Lecture for Masters of Data Science Students


Lec-01
DATA MINING / DATA VISUALIZATION / INFORMATION RETRIEVAL
Dr. Muhammad Munwar Iqbal

Duaa

Marks Distribution
Quizzes + Assignments: 20
Research Article: 20
Midterm: 20
Final: 60 (40+20)
Total: 100

Outline
•Define data mining
•Data mining vs. databases
•Basic data mining tasks
•Data mining development
•Data mining issues
•Data mining techniques
Goal: Provide an overview of data mining.

Introduction
•Data is produced at a phenomenal rate
•Our ability to store has grown
•Users expect more sophisticated information
•How?
UNCOVER HIDDEN INFORMATION
DATA MINING

We are data rich, but information poor

Why Data Mining
•Credit ratings/targeted marketing:
–Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
–Identify likely responders to sales promotions
•Fraud detection
–Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular
customer?
•Customer relationship management:
–Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
Data Mining helps extract such information.

Data mining
•Process of semi-automatically analyzing
large databases to find patterns that are:
–valid: hold on new data with some certainty
–novel: non-obvious to the system
–useful: should be possible to act on the item
–understandable: humans should be able to
interpret the pattern
•Also known as Knowledge Discovery in
Databases (KDD)

Data Mining—What’s in a Name?
Data Mining
Knowledge Mining
Knowledge Discovery
in Databases
Data Archaeology
Data Dredging
Database Mining
Knowledge Extraction
Data Pattern Processing
Information Harvesting
Siftware
The process of discovering meaningful new correlations, patterns, and trends by sifting
through large amounts of stored data, using pattern recognition technologies and
statistical and mathematical techniques

Integration of Multiple Technologies
Data mining draws on multiple technologies:
•Machine Learning
•Database Management
•Artificial Intelligence
•Statistics
•Visualization
•Algorithms

Applications
•Banking: loan/credit card approval
–predict good customers based on old customers
•Customer relationship management:
–identify those who are likely to leave for a competitor.
•Targeted marketing:
–identify likely responders to promotions
•Fraud detection: telecommunications, financial
transactions
–from an online stream of events, identify fraudulent events
•Manufacturing and production:
–automatically adjust knobs when a process parameter changes

Applications (continued)
•Medicine: disease outcome, effectiveness of treatments
–analyze patient disease history: find relationship between diseases
•Molecular/Pharmaceutical: identify new drugs
•Scientific data analysis:
–identify new galaxies by searching for sub clusters
•Web site/store design and promotion:
–find affinity of visitor to pages and modify layout

Data Mining
•Objective: Fit data to a model
•Potential Result: Higher-level meta information that may
not be obvious when looking at raw data
•Similar terms
–Exploratory data analysis
–Data driven discovery
–Deductive learning

Data Mining Algorithm
•Objective: Fit Data to a Model
–Descriptive
–Predictive
•Preferential Questions
–Which technique to choose?
•ARM/Classification/Clustering
•Answer: depends on what you want to do with the data
–Search Strategy – Technique to search the data
•Interface? Query Language?
•Efficiency

Database Processing vs. Data Mining Processing
•Database processing
–Query: well defined
–Query language: SQL
–Output: precise, a subset of the database
•Data mining processing
–Query: poorly defined
–No precise query language
–Output: fuzzy, not a subset of the database

Query Examples
•Database
–Find all customers who have purchased milk.
–Find all credit applicants with last name of Smith.
–Identify customers who have purchased more than $10,000 in the last month.
•Data Mining
–Find all items which are frequently purchased with milk. (association rules)
–Find all credit applicants who are poor credit risks. (classification)
–Identify customers with similar buying habits. (clustering)

Data Mining Models and Tasks

Basic Data Mining Tasks
•Classification maps data into predefined groups or
classes
–Supervised learning
–Pattern recognition
–Prediction
•Regression is used to map a data item to a real
valued prediction variable.
•Clustering groups similar data together into
clusters.
–Unsupervised learning
–Segmentation
–Partitioning
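The three basic tasks above can be illustrated with a short sketch. It is not part of the slides; it assumes scikit-learn and NumPy are available and uses a synthetic two-attribute dataset purely for illustration.

```python
# A minimal sketch of classification, regression, and clustering (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # 100 items, 2 attributes

# Classification: map data into predefined groups/classes (supervised learning).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)      # two predefined classes
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
print("predicted classes:", clf.predict(X[:5]))

# Regression: map a data item to a real-valued prediction variable.
y_real = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_real)
print("predicted values:", reg.predict(X[:5]))

# Clustering: group similar data together (unsupervised, no class labels used).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels[:5])
```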

Basic Data Mining Tasks (cont’d)
•Summarization maps data into subsets with associated simple
descriptions.
–Characterization
–Generalization
•Link Analysis uncovers relationships among data.
–Affinity Analysis
–Association Rules
–Sequential Analysis determines sequential patterns.

Ex: Time Series Analysis
•Time series include the continuous monitoring of a person's heart rate, hourly readings of air temperature, daily closing price of a company stock, monthly rainfall data, and yearly sales figures.
•Time series analysis is generally used when there are 50 or more data points in a series.
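A small sketch of what working with such a series might look like, assuming pandas and NumPy; the daily closing prices below are synthetic, not data from the lecture.

```python
# Sketch: a synthetic daily closing-price series with a smoothed trend and monthly means.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=120, freq="D")   # well over 50 data points
prices = pd.Series(100 + np.cumsum(np.random.default_rng(1).normal(size=120)), index=dates)

trend = prices.rolling(window=7).mean()      # 7-day moving average smooths daily noise
monthly = prices.resample("MS").mean()       # aggregate to monthly averages

print(trend.tail(3))
print(monthly)
```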

Data Mining vs. KDD
•Knowledge Discovery in Databases (KDD): process of
finding useful information and patterns in data.
•Data Mining: Use of algorithms to extract the information
and patterns derived by the KDD process.

Knowledge Discovery Process
•Data mining is the core of the knowledge discovery process.
•(Figure) The process runs from the source databases through data cleaning and data integration, producing preprocessed data; selection yields the task-relevant data; data transformation prepares it for data mining; and the mined patterns finally undergo knowledge interpretation.

KDD
Data collection & cleaning: (to remove noise and inconsistent data)
Data integration: (where multiple data sources may be combined)
Data selection (where data relevant to the analysis task are retrieved
from the database)
Data transformation: (where data are transformed or consolidated
into forms appropriate for mining by performing summary or
aggregation operations, for instance)
Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge
to the user)
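As a rough illustration of how these steps might look in code, here is a hypothetical sketch using pandas and scikit-learn. The table contents, column names, and the choice of clustering as the mining step are all assumptions made for this example, not part of the original material.

```python
# Hypothetical end-to-end sketch of the KDD steps with pandas; all names and values are invented.
import pandas as pd
from sklearn.cluster import KMeans

# Data collection & cleaning: assemble raw records and drop incomplete ones.
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "amount":      [120.0, 80.0, 15.0, None, 300.0, None],
    "date":        pd.to_datetime(["2024-01-03"] * 6),
}).dropna(subset=["amount"])

# Data integration: combine multiple sources on a shared key.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "North"]})
data = sales.merge(customers, on="customer_id", how="inner")

# Data selection + transformation: keep relevant attributes and aggregate per customer.
profile = data[["customer_id", "amount"]].groupby("customer_id").agg(
    total_spent=("amount", "sum"), n_purchases=("amount", "count"))

# Data mining: here, cluster customers by spending profile (an assumed choice).
profile["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profile)

# Pattern evaluation / knowledge presentation: inspect the resulting segments.
print(profile)
```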

KDD Process Ex: Web Log
•Data Collection: User data
•Selection:
–Select log data (dates and locations) to use
•Preprocessing:
– Remove identifying URLs
– Remove error logs
•Transformation:
–Sessionize (sort and group)
•Data Mining:
–Identify and count patterns
–Construct data structure
•Interpretation/Evaluation:
–Identify and display frequently accessed sequences.
•Potential User Applications:
–Cache prediction
–Personalization
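The sessionization (transformation) step could look roughly like the sketch below, assuming pandas; the log columns and the 30-minute inactivity threshold are assumptions for illustration.

```python
# Sketch: sessionize a web log by sorting per user and splitting on 30-minute gaps.
import pandas as pd

log = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 11:30",
        "2024-01-01 09:00", "2024-01-01 09:10",
    ]),
    "url": ["/home", "/cart", "/home", "/home", "/search"],
})

log = log.sort_values(["user_id", "timestamp"])           # sort ...
gap = log.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
log["session_id"] = gap.groupby(log["user_id"]).cumsum()  # ... and group into sessions

# Count per-session URL sequences (the "identify and count patterns" step).
sequences = log.groupby(["user_id", "session_id"])["url"].apply(tuple)
print(sequences.value_counts())
```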

Knowledge Discovery Process flow, according to CRISP-DM
•Business Understanding + Data Understanding + Data Preparation: about 80% of the time
•Modeling (applying the mining algorithm): about 20%
•Monitoring

CRISP-DM Phases and Tasks
•Business Understanding
–Determine Business Objectives: Background; Business Objectives; Business Success Criteria
–Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
–Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
–Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
•Data Understanding
–Collect Initial Data: Initial Data Collection Report
–Describe Data: Data Description Report
–Explore Data: Data Exploration Report
–Verify Data Quality: Data Quality Report
•Data Preparation
–Data Set: Data Set Description
–Select Data: Rationale for Inclusion / Exclusion
–Clean Data: Data Cleaning Report
–Construct Data: Derived Attributes; Generated Records
–Integrate Data: Merged Data
–Format Data: Reformatted Data
•Modeling
–Select Modeling Technique: Modeling Technique; Modeling Assumptions
–Generate Test Design: Test Design
–Build Model: Parameter Settings; Models; Model Description
–Assess Model: Model Assessment; Revised Parameter Settings
•Evaluation
–Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
–Review Process: Review of Process
–Determine Next Steps: List of Possible Actions; Decision
•Deployment
–Plan Deployment: Deployment Plan
–Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
–Produce Final Report: Final Report; Final Presentation
–Review Project: Experience Documentation

Data Mining Development
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Neural Networks
•Decision Tree Algorithms
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
HIGH PERFORMANCE
DATA MINING

Multi-Dimensional View of Data Mining
•Data to be mined
–Relational, data warehouse, transactional, stream, object-oriented/relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
•Knowledge to be mined
–Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
–Multiple/integrated functions and mining at multiple levels
•Techniques utilized
–Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization,
etc.
•Applications adapted
–Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, Web mining, etc.

KDD Issues
•Human Interaction
•Overfitting
•Outliers
•Interpretation
•Visualization
•Large Datasets
•High Dimensionality

KDD Issues (cont’d)
•Multimedia Data
•Missing Data
•Irrelevant Data
•Noisy Data
•Changing Data
•Integration
•Application

Social Implications of DM
•Privacy
•Profiling
•Unauthorized use

Data Mining Metrics
•Usefulness
•Return on Investment (ROI)
•Accuracy
•Space/Time

Database Perspective on Data
Mining
•Scalability
•Real World Data
•Updates
•Ease of Use

Examples of Large Datasets
•Government: IRS, NGA, …
•Large corporations
–WALMART: 20M transactions per day
–MOBIL: 100 TB geological databases
–AT&T 300 M calls per day
–Credit card companies
•Scientific
–NASA, EOS project: 50 GB per hour
–Environmental datasets

How Data Mining is used
1. Identify the problem
2. Use data mining techniques to
transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process
1. Understand the domain
2. Create a dataset:
–Select the interesting attributes
–Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2

Origins of Data Mining
•Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
•Must address:
–Enormity of data
–High dimensionality of data
–Heterogeneous, distributed nature of data
(Figure: data mining at the overlap of AI/machine learning, statistics, and database systems)

Data Mining Tasks
1. Classification: learning a function that maps an item into
one of a set of predefined classes
2. Regression: learning a function that maps an item to a
real value
3. Clustering: identify a set of groups of similar items

Data Mining Tasks
4. Dependencies and associations:
identify significant dependencies between data attributes
5. Summarization: find a compact description of the dataset
or a subset of the dataset

Data Mining Methods
1. Decision Tree Classifiers:
Used for modeling, classification
2. Association Rules:
Used to find associations between sets of
attributes
3. Sequential patterns:
Used to find temporal associations in time series
4. Hierarchical clustering:
used to group customers, web users, etc. (see the sketch below)
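A short sketch of hierarchical clustering with SciPy, grouping a handful of hypothetical customer profiles; the data, the Ward linkage, and the two-group cut are all assumed choices.

```python
# Sketch: agglomerative (hierarchical) clustering of hypothetical customer profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row: (annual spend in $1000s, visits per month) for one customer.
customers = np.array([
    [1.0, 2], [1.2, 3], [0.8, 2],      # low spenders
    [8.5, 10], [9.0, 12], [8.8, 11],   # frequent high spenders
])

Z = linkage(customers, method="ward")            # build the merge tree bottom-up
groups = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
print(groups)
```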

Why Data Preprocessing?
•Data in the real world is dirty
–incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
–noisy: containing errors or outliers
–inconsistent: containing discrepancies in codes or names
•No quality data, no quality mining results!
–Quality decisions must be based on quality data
–Data warehouse needs consistent integration of quality data
–Required for both OLAP and Data Mining!

Why can Data be Incomplete?
•Attributes of interest are not available (e.g., customer
information for sales transaction data)
•Data were not considered important at the time of
transactions, so they were not recorded!
•Data not recorded because of misunderstanding or
malfunctions
•Data may have been recorded and later deleted!
•Missing/unknown values for some data

Data Cleaning
•Data cleaning tasks
–Fill in missing values
–Identify outliers and smooth out noisy data
–Correct inconsistent data
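A brief sketch of these three cleaning tasks with pandas and NumPy; the column names, the fill strategies, and the percentile thresholds are assumptions chosen only to illustrate the ideas.

```python
# Sketch: fill missing values, smooth outliers, and fix inconsistent codes/names.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 37, 29, 120],           # a missing value and an outlier
    "income": [40_000, 52_000, np.nan, 61_000, 58_000],
    "city":   ["Taxila", "taxila", "TAXILA", "Lahore", "Lahore"],
})

# Fill in missing values (here: column mean / median).
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Identify outliers and smooth noisy data (here: clip to the 5th-95th percentile).
low, high = df["age"].quantile([0.05, 0.95])
df["age"] = df["age"].clip(low, high)

# Correct inconsistent codes/names.
df["city"] = df["city"].str.title()

print(df)
```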

Classification: Definition
•Given a collection of records (training set)
–Each record contains a set of attributes, one of the attributes is
the class.
•Find a model for class attribute as a function of
the values of other attributes.
•Goal: previously unseen records should be
assigned a class as accurately as possible.
–A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.

Classification Example
Training Set (Home Owner and Marital Status are categorical, Taxable Income is continuous, Default is the class):
Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Test Set (class unknown):
Home Owner  Marital Status  Taxable Income  Default
No          Single          75K             ?
Yes         Married         50K             ?
No          Married         150K            ?
Yes         Divorced        90K             ?
No          Single          40K             ?
No          Married         80K             ?

A classifier is learned from the Training Set to produce a Model, which is then applied to the Test Set.

Example of a Decision Tree
Training Data: the same 10-record table as above (Home Owner and Marital Status categorical, Taxable Income continuous, Default is the class).
Model: Decision Tree, with Home Owner, Marital Status, and Taxable Income as splitting attributes:
•Home Owner = Yes → NO
•Home Owner = No:
–Marital Status = Married → NO
–Marital Status = Single or Divorced:
•Taxable Income < 80K → NO
•Taxable Income > 80K → YES

Another Example of Decision Tree
The same training data, with Marital Status as the first splitting attribute:
•Marital Status = Married → NO
•Marital Status = Single or Divorced:
–Home Owner = Yes → NO
–Home Owner = No:
•Taxable Income < 80K → NO
•Taxable Income > 80K → YES
There could be more than one tree that fits the same data!
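For concreteness, here is a hedged sketch of learning such a tree from the 10-record training table above, assuming scikit-learn. The one-hot encoding and entropy criterion are my choices, and the tree scikit-learn learns may differ in detail from the two trees drawn on the slides.

```python
# Sketch: fit a decision tree to the 10-record loan/default table shown above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "HomeOwner": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Default": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical attributes; the continuous attribute stays numeric.
X = pd.get_dummies(train[["HomeOwner", "MaritalStatus", "TaxableIncome"]])
y = train["Default"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Classify a previously unseen record (a row from the test set shown earlier).
new = pd.DataFrame([{"HomeOwner": "No", "MaritalStatus": "Single", "TaxableIncome": 75}])
new_X = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_X))
```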

Classification: Application 1
•Direct Marketing
–Goal: Reduce cost of mailing by targeting a set of consumers likely
to buy a new cell-phone product.
–Approach:
•Use the data for a similar product introduced before.
•We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
•Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
–Type of business, where they stay, how much they earn, etc.
•Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
•Fraud Detection
–Goal: Predict fraudulent cases in credit card transactions.
–Approach:
•Use credit card transactions and the information on its
account-holder as attributes.
–When does a customer buy, what does he buy, how often
does he pay on time, etc.
•Label past transactions as fraud or fair transactions. This
forms the class attribute.
•Learn a model for the class of the transactions.
•Use this model to detect fraud by observing credit card
transactions on an account.

Clustering Definition
•Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
–Data points in one cluster are more similar to one another.
–Data points in separate clusters are less similar to one another.
•Similarity Measures:
–Euclidean Distance if attributes are continuous.
–Other Problem-specific Measures.
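A minimal sketch of Euclidean-distance clustering with scikit-learn; the 3-D points are synthetic and k = 3 is an assumed choice.

```python
# Sketch: k-means on synthetic 3-D points; similarity is Euclidean distance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs in 3-D space, offset from each other.
points = np.vstack([
    rng.normal(loc=(0, 0, 0), scale=0.5, size=(50, 3)),
    rng.normal(loc=(5, 5, 0), scale=0.5, size=(50, 3)),
    rng.normal(loc=(0, 5, 5), scale=0.5, size=(50, 3)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(km.labels_))
print("within-cluster sum of squares (intracluster spread):", km.inertia_)
```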

Illustrating Clustering
•Euclidean distance based clustering in 3-D space
•Intracluster distances are minimized
•Intercluster distances are maximized

Clustering: Application 1
•Market Segmentation:
–Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
–Approach:
•Collect different attributes of customers based on their
geographical and lifestyle related information.
•Find clusters of similar customers.
•Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.

Clustering: Application 2
•Document Clustering:
–Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
–Approach: To identify frequently occurring terms in each document. Form
a similarity measure based on the frequencies of different terms. Use it to
cluster.
–Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.
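A small sketch of this approach with scikit-learn, using a few made-up documents; TF-IDF weighting and k-means are assumed choices, one reasonable way to turn term frequencies into a similarity measure for clustering.

```python
# Sketch: term-frequency-based document clustering with TF-IDF + k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stocks fell as the bank reported quarterly losses",
    "the central bank raised interest rates again",
    "the team won the championship after a late goal",
    "injury forces the striker to miss the final match",
]

# Frequencies of (filtered) terms form the similarity measure.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # finance vs. sports documents should land in different clusters
```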

Illustrating Document Clustering
•Clustering Points: 3204 Articles of Los Angeles Times.
•Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

Association Rule Discovery: Definition
•Given a set of records, each of which contains some number of items from a given collection:
–Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
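The two discovered rules can be checked directly by counting support and confidence over the five transactions above. The pure-Python sketch below is illustrative only, not an efficient association-rule algorithm such as Apriori.

```python
# Sketch: compute support and confidence for the rules over the 5 transactions above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Milk"}), confidence({"Milk"}, {"Coke"}))                      # {Milk} --> {Coke}
print(support({"Diaper", "Milk"}), confidence({"Diaper", "Milk"}, {"Beer"}))  # {Diaper, Milk} --> {Beer}
```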

Numerosity Reduction:
Reduce the volume of data
•Parametric methods
–Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
•Non-parametric methods
–Do not assume models
–Major families: histograms, clustering, sampling
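A brief sketch contrasting the two families with NumPy on a synthetic series; the linear model, bin count, and sample size are assumed choices.

```python
# Sketch: parametric vs. non-parametric numerosity reduction on a synthetic series.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10_000, dtype=float)
y = 2.5 * x + 7.0 + rng.normal(scale=5.0, size=x.size)    # ~10,000 raw values

# Parametric: assume a linear model, keep only its two parameters, discard the data.
slope, intercept = np.polyfit(x, y, deg=1)
print("stored parameters:", slope, intercept)

# Non-parametric: histograms, clustering, sampling.
counts, edges = np.histogram(y, bins=20)         # 20 bin counts instead of 10,000 values
sample = rng.choice(y, size=100, replace=False)  # or keep a small random sample
print("histogram counts:", counts[:5], "... sample size:", sample.size)
```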

Thanks