big data and data warehouse unit 1 for college

CHOLMALUAL 15 views 41 slides May 26, 2024

About This Presentation

data warehousing introduction


Slide Content

1
Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

2
Chapter 1. Introduction
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kinds of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technologies Are Used?
• What Kinds of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary

3
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  • Data collection and data availability
    • Automated data collection tools, database systems, Web, computerized society
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

4
Evolution of Sciences
• Before 1600: empirical science
• 1600-1950s: theoretical science
  • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
• 1950s-1990s: computational science
  • Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  • Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now: data science
  • The flood of data from new scientific instruments and simulations
  • The ability to economically store and manage petabytes of data online
  • The Internet and computing Grid that make all these archives universally accessible
  • Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

5
Evolution of Database Technology
• 1960s:
  • Data collection, database creation, IMS and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems

7
What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  • Data mining: a misnomer?
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything “data mining”?
  • Simple search and query processing
  • (Deductive) expert systems

8
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
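
The stages named on this slide can be sketched as a chain of small functions. This is only a toy illustration: the records, the field names, and the helper functions (`clean`, `integrate`, `select`, `mine`, `evaluate`) are hypothetical, and the "mining" step is just a frequency count.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": 1, "item": "milk"}, {"customer": 2, "item": None}]
store_b = [{"customer": 3, "item": "milk"}, {"customer": 4, "item": "bread"}]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]


def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]


def mine(values):
    """Data mining (toy version): count how often each value occurs."""
    return Counter(values)


def evaluate(counts, min_count):
    """Pattern evaluation: keep only patterns that are frequent enough."""
    return {value: n for value, n in counts.items() if n >= min_count}


warehouse = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(select(warehouse, "item")), min_count=2)
print(patterns)  # {'milk': 2}
```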

9
Example: A Web Mining Framework
• Web mining usually involves
  • Data cleaning
  • Data integration from multiple sources
  • Warehousing the data
  • Data cube construction
  • Data selection for data mining
  • Data mining
  • Presentation of the mining results
  • Patterns and knowledge to be used or stored in a knowledge base

10
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
Layers, from top to bottom:
• Decision making
• Data presentation: visualization techniques
• Data mining: information discovery
• Data exploration: statistical summary, querying, and reporting
• Data preprocessing/integration, data warehouses
• Data sources: paper, files, Web documents, scientific experiments, database systems
Roles, from top to bottom: end user, business analyst, data analyst, DBA

11
Example: Mining vs. Data Exploration
• Business intelligence view
  • Warehouse, data cube, reporting but not much mining
• Business objects vs. data mining tools
• Supply chain example: tools
• Data presentation
• Exploration

12
KDD Process: A Typical View from ML and Statistics
• This is a view from typical machine learning and statistics communities
[Figure: input data → data pre-processing → data mining → post-processing]
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

13
Example: Medical Data Mining
• Health care and medical data mining often adopt this view from statistics and machine learning
  • Preprocessing of the data (including feature extraction and dimension reduction)
  • Classification and/or clustering processes
  • Post-processing for presentation

15
Multi-Dimensional View of Data Mining
• Data to be mined
  • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
• Knowledge to be mined (or: data mining functions)
  • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  • Descriptive vs. predictive data mining
  • Multiple/integrated functions and mining at multiple levels
• Techniques utilized
  • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
• Applications adapted
  • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

17
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structured data, graphs, social networks, and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia databases
  • Text databases
  • The World-Wide Web

19
Data Mining Function: (1) Generalization
• Information integration and data warehouse construction
  • Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
  • Scalable methods for computing (i.e., materializing) multidimensional aggregates
  • OLAP (online analytical processing)
• Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
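
To make "materializing multidimensional aggregates" concrete, here is a minimal sketch that sums a measure for every subset of the dimensions, producing one cuboid per subset. The fact table, the dimension names, and the function `materialize_cube` are hypothetical. This brute-force version is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall_mm)
facts = [
    ("north", "summer", 10),
    ("north", "winter", 30),
    ("south", "summer", 5),
    ("south", "winter", 15),
]
DIMENSIONS = ("region", "season")


def materialize_cube(rows, dimensions):
    """Sum the measure (last column) for every subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


cube = materialize_cube(facts, DIMENSIONS)
print(cube[()])           # {(): 60}
print(cube[("region",)])  # {('north',): 40, ('south',): 20}
```

With d dimensions this builds 2^d cuboids, which is why scalable cube computation matters for real warehouses.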

20
Data Mining Function: (2) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
  • What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
  • A typical association rule
    • Diaper → Beer [0.5%, 75%] (support, confidence)
  • Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?
• How to use such patterns for classification, clustering, and other applications?
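
The support and confidence of a rule such as Diaper → Beer can be computed directly from transaction counts. The sketch below uses a handful of hypothetical baskets, not the data behind the slide's figures. It also computes lift, one common correlation measure, to show that association and correlation are separate questions.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's support; about 1.0 suggests independence."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


print(support({"diaper", "beer"}, transactions))       # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # 0.75
print(lift({"diaper"}, {"beer"}, transactions) > 1)    # True
```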

21
Data Mining Function: (3) Classification
• Classification and label prediction
  • Construct models (functions) based on some training examples
  • Describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  • Predict some unknown class labels
• Typical methods
  • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications:
  • Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
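
As a minimal illustration of building a model from training examples, the sketch below trains a decision stump (a one-level decision tree) that classifies cars by gas mileage. The training data, the labels, and the function names are hypothetical. For brevity the stump assumes that the "high" class lies above the threshold.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(15, "low"), (18, "low"), (22, "low"), (30, "high"), (35, "high"), (40, "high")]


def predict(threshold, mpg):
    """Label a car by comparing its mileage with the learned threshold."""
    return "high" if mpg > threshold else "low"


def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, x) != label for x, label in examples)

    return min(candidates, key=errors)


threshold = train_stump(training)
print(threshold)               # 26.0
print(predict(threshold, 33))  # high
```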

22
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: maximizing intra-class similarity & minimizing interclass similarity
• Many methods and applications
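
One widely used clustering method is k-means. The slide does not name it, so it is used here purely as an example. The sketch below runs it on one-dimensional data with fixed starting centers so the result is reproducible; the house prices are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands).
prices = [100, 110, 120, 400, 410, 420]
centers, clusters = kmeans_1d(prices, centers=[100, 420])
print(centers)   # [110.0, 410.0]
print(clusters)  # [[100, 110, 120], [400, 410, 420]]
```

Points within a cluster end up close to their own center and far from the other one, which is the intra-class versus interclass principle stated above.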

23
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Noise or exception? One person’s garbage could be another person’s treasure
  • Methods: by-product of clustering or regression analysis, …
  • Useful in fraud detection, rare events analysis
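
The idea of outliers as a by-product of regression can be sketched as follows: fit a least-squares line, then flag the points whose residuals are unusually large. The cutoff rule (k times the median absolute residual) and the data are assumptions made for this illustration, not a method prescribed by the slide.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def residual_outliers(xs, ys, k=3.0):
    """Indices whose absolute residual exceeds k times the median absolute residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    cutoff = k * median(residuals)
    return [i for i, r in enumerate(residuals) if r > cutoff]


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40  # inject one point that breaks the linear trend
print(residual_outliers(xs, ys))  # [5]
```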

24
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis
  • Trend, time-series, and deviation analysis: e.g., regression and value prediction
  • Sequential pattern mining
    • e.g., first buy digital camera, then buy large SD memory cards
  • Periodicity analysis
  • Motifs and biological sequence analysis
    • Approximate and consecutive motifs
  • Similarity-based analysis
• Mining data streams
  • Ordered, time-varying, potentially infinite data streams
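
A sequential pattern such as "digital camera, then SD card" is supported by a customer only if the items appear in that order. The sketch below measures that support over hypothetical purchase histories; the function names are illustrative only.

```python
# Hypothetical purchase histories, one time-ordered list per customer.
histories = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]


def contains_in_order(history, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence of history."""
    remaining = iter(history)
    # Each membership test consumes the iterator up to the match,
    # so later items must appear after earlier ones.
    return all(item in remaining for item in pattern)


def sequence_support(all_histories, pattern):
    """Fraction of histories that contain the pattern in order."""
    return sum(contains_in_order(h, pattern) for h in all_histories) / len(all_histories)


print(sequence_support(histories, ["digital camera", "SD card"]))  # 0.5
```

The second customer bought both items but in the opposite order, so that history does not count toward the pattern.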

25
Structure and Network Analysis
• Graph mining
  • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
• Information network analysis
  • Social networks: actors (objects, nodes) and relationships (edges)
    • e.g., author networks in CS, terrorist networks
  • Multiple heterogeneous networks
    • A person could be in multiple information networks: friends, family, classmates, …
  • Links carry a lot of semantic information: link mining
• Web mining
  • The Web is a big information network: from PageRank to Google
  • Analysis of Web information networks
    • Web community discovery, opinion mining, usage mining, …
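
PageRank, mentioned above, scores pages by the link structure alone. The following is a minimal power-iteration sketch on a hypothetical three-page web. The damping factor of 0.85 and the handling of pages without out-links are common conventions assumed here, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # A page with no out-links spreads its rank over all pages.
            targets = outgoing or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B both link to C, and C links back to A.
web = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # C
```

Page B has no incoming links and ends up with the lowest score, while C, which every other page links to, ranks highest.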

26
Evaluation of Knowledge
• Is all mined knowledge interesting?
  • One can mine a tremendous amount of “patterns” and knowledge
  • Some may fit only a certain dimension space (time, location, …)
  • Some may not be representative, may be transient, …
• Evaluation of mined knowledge → directly mine only interesting knowledge?
  • Descriptive vs. predictive
  • Coverage
  • Typicality vs. novelty
  • Accuracy
  • Timeliness
  • …

28
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology]

29
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
  • Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
  • A microarray may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks, and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text, and Web data
  • Software programs, scientific simulations
• New and sophisticated applications

31
Applications of Data Mining
• Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

33
Major Issues in Data Mining (1)
• Mining methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: an interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results

34
Major Issues in Data Mining (2)
• Efficiency and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
  • Handling complex types of data
  • Mining dynamic, networked, and global data repositories
• Data mining and society
  • Social impacts of data mining
  • Privacy-preserving data mining
  • Invisible data mining

36
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
  • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  • Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
  • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007

37
Conferences and Journals on Data Mining
• KDD conferences
  • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  • SIAM Data Mining Conf. (SDM)
  • (IEEE) Int. Conf. on Data Mining (ICDM)
  • European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  • Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
  • DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
  • Web and IR conferences: WWW, SIGIR, WSDM
  • ML conferences: ICML, NIPS
  • PR conferences: CVPR
• Journals
  • Data Mining and Knowledge Discovery (DAMI or DMKD)
  • IEEE Trans. on Knowledge and Data Eng. (TKDE)
  • KDD Explorations
  • ACM Trans. on KDD

38
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD: CD-ROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  • Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology, CD-ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & machine learning
  • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
• Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information Systems
• Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
• Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. Visualization and Computer Graphics, etc.

40
Summary
• Data mining: discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining

41
Recommended Reference Books
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag, 2009
• B. Liu. Web Data Mining. Springer, 2006
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005