Data Mining Chapter for University Students

hossainsafari4 18 views 30 slides Aug 01, 2024


Data Mining

Prerequisite
Knowledge of
Discrete Mathematics & Probability Theory
Data Structures and Algorithms

Teaching Materials
Text Book
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann, ISBN 1-55860-489-8.
Reference Books
Data Mining and Analysis: Fundamental Concepts and Algorithms, by M. J. Zaki and Wagner Meira Jr., Cambridge University Press.
Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley, ISBN 0-321-32136-7.
Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7.

Topics
Introduction
Data Pre-processing
Association Rule Mining
Classification Techniques (Supervised Learning)
Clustering Techniques (Unsupervised Learning)
Semi-Supervised Learning
Applications
    Social network analysis
    Opinion mining and sentiment analysis
    Recommender systems and collaborative filtering

Data
Data is the Latin plural of datum.
Used to represent unprocessed facts and figures without any added interpretation or analysis.
Generally associated with some entity and often viewed as the lowest level of abstraction from which information and knowledge are derived.
Data may be unstructured, semi-structured, or structured.
Example: The price of petrol is Rs. 90 per liter.

Information
Information is interpreted (processed) data so that it has meaning for the user.
"The price of petrol has risen from Rs. 70 to Rs. 90 per liter" is information for a person who tracks petrol prices.
Data becomes information when it is processed for some purpose and adds value for the recipient.
A set of raw sales figures is data.
A sales report (chart plotting, trend analysis) is information.

Knowledge
Knowledge is a fluid mix of information, experience, and insight that may benefit the individual or the organization.
"When petrol prices go up by Rs. 20 per liter, it is likely that bus fare will rise by 25%" is knowledge.
The boundaries between data, information, and knowledge are fuzzy.
What is data to one person is information to someone else.

Summarized View
Data: as in databases, spreadsheets, text files…
Information: processed data
Knowledge: a fluid mix of information, experience, and insight
OR, knowledge is meta-information about the patterns hidden in the data
The patterns must be discovered automatically!

Data Categories & Mining Terminologies
Data are stored in documents (a file)
Unstructured: a file stored on your PC (Text Mining)
Semi-structured: a web page stored on WWW (Web Mining)
Structured: a database (Data Mining)

What is Data Mining?
An important step of the KDD (Knowledge Discovery from Databases) process
Data mining is the automatic extraction of interesting knowledge (rules, regularities, patterns, constraints) from large data sources, e.g., databases, texts, web, images, etc.
Identified patterns must be:
• Valid, novel (non-trivial), potentially useful, and understandable

The KDD Process

Data Mining Objectives
Identification of data as a source of useful information
Automatic extraction of valid and novel patterns from the data source
Use of discovered patterns for competitive advantage when working in a business environment

Why Data Mining?
Data explosion (information overload) problem
We are drowning in data, but starving for knowledge!
Data, data everywhere, nor any drop of insight!
(Water, water everywhere, nor any drop to drink)
Explosive growth of data: from terabytes to petabytes
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other data repositories

Why Data Mining? Cont…
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, …
Computing power is not an issue.
Data mining tools are available.
The competitive pressure is very strong.
Almost every company is doing (or has to do) it.

Why Is Data Mining Important?
Digitization of businesses produces huge amounts of data
How to make the best use of data?
Knowledge discovered from data can be used for competitive advantage.
E-businesses are generating huge datasets
Online retailers (e.g., amazon.com) are largely driven by data mining.
Web search engines are information retrieval (text mining) and data mining companies

Why is Data Mining Necessary?
Make use of your data assets (knowledge-based economy)
There is a big gap between stored data and knowledge
The transition won't occur automatically.
Many interesting things can't be found using database queries
Which customers are likely to buy my products?
Why were sales down after demonetization?
Which items should be recommended to a person purchasing a computer?

Data Mining: On what kind of data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced DB and Data Repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
Web database

Data Mining Functionalities: Characterization (1)
A data mining process that aims to find rules that describe the properties of a concept.
Standard form:
If concept then characteristics
C=1 → A=1 & B=3 (Support: 25%, i.e., the rule is true for 25% of the records)
C=1 → A=1 & B=4 (Support: 17%)
C=1 → A=0 & B=2 (Support: 16%)
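
As a rough illustration of the rule form above, the following Python sketch counts which characteristics occur with a concept and reports their support. The records are made up for illustration, and support is taken here as the fraction of all records for which both sides of the rule hold, which is one reading of the definition on this slide.

```python
from collections import Counter

# Hypothetical records (A, B, C); the values are illustrative only.
records = [
    (1, 3, 1), (1, 3, 1), (1, 4, 1), (0, 2, 1),
    (1, 3, 0), (0, 2, 0), (2, 0, 0), (1, 4, 0),
]

def characterize(records, concept=1):
    """Return rules 'C=concept -> A=a & B=b' with their support.

    Support is the fraction of all records that satisfy both the
    concept and the characteristics (an assumption of this sketch).
    """
    counts = Counter((a, b) for a, b, c in records if c == concept)
    total = len(records)
    return {chars: n / total for chars, n in counts.most_common()}

rules = characterize(records)
for (a, b), support in rules.items():
    print(f"C=1 -> A={a} & B={b} (Support: {support:.0%})")
```

Rules with low support would normally be dropped by a minimum-support threshold, which this sketch leaves out.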

Data Mining Functionalities: Discrimination (2)
A data mining process that aims to find rules that allow us to discriminate the objects (records) belonging to a given concept (one class) from the rest of the records (classes)
Standard form:
If characteristics then concept
A=0 & B=1 → C=1 (Support: 33%, Confidence: 83%)
Confidence: The conditional probability of the concept given the characteristics
A=2 & B=0 → C=1 (27%, 80%)
A=1 & B=1 → C=1 (12%, 76%)
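
The two measures attached to each discrimination rule can be sketched in Python as follows. The function name `support_and_confidence` and the sample records are hypothetical; confidence follows the definition above, that is, the conditional probability of the concept given the characteristics.

```python
def support_and_confidence(records, characteristics, concept):
    """Evaluate the rule 'characteristics -> concept' on a list of dicts.

    support    = fraction of all records matching both sides
    confidence = P(concept | characteristics)
    """
    def matches(record, conditions):
        return all(record[k] == v for k, v in conditions.items())

    covered = [r for r in records if matches(r, characteristics)]
    both = [r for r in covered if matches(r, concept)]
    support = len(both) / len(records)
    confidence = len(both) / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical records; the values are illustrative only.
records = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 0},
    {"A": 2, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]

support, confidence = support_and_confidence(records, {"A": 0, "B": 1}, {"C": 1})
print(f"A=0 & B=1 -> C=1 (Support: {support:.0%}, Confidence: {confidence:.0%})")
```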

Data Mining Functionalities: Classification and Prediction (3)
Finding models (rules) that describe (characterize) and/or distinguish (discriminate) classes or concepts for future prediction.
Classify countries based on climate (characteristics)
Classify cars based on gas mileage and use it to predict the classification of a new car
Presentation:
Decision Tree
Classification Rules
Neural Network
Bayes Network
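
To make the car example concrete, here is a minimal Python sketch of the simplest possible decision tree: a single threshold on gas mileage (a "decision stump"). The training data, the class labels "low" and "high", and the function names are assumptions for illustration; a real decision tree learner handles many features and splits recursively.

```python
def train_stump(samples):
    """Learn a one-level decision tree (one threshold) on one numeric feature.

    samples: list of (value, label) pairs with exactly two distinct labels
    and at least two distinct values.
    Returns (threshold, label_below, label_above) with the fewest training errors.
    """
    values = sorted({value for value, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below, above in (labels, labels[::-1]):
            errors = sum(
                (below if value <= threshold else above) != label
                for value, label in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above

def predict(model, value):
    """Classify a new value with the learned threshold."""
    threshold, below, above = model
    return below if value <= threshold else above

# Hypothetical training data: (miles per gallon, mileage class).
cars = [(15, "low"), (18, "low"), (21, "low"), (30, "high"), (34, "high"), (40, "high")]
model = train_stump(cars)
print(predict(model, 36))  # class predicted for a new car
```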

Data Mining Functionalities: Prediction (statistical) (4)
A data mining process to predict some unknown or missing numerical values.
Output space: continuous
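
One common way to predict a continuous value is a least-squares line fit. The sketch below is only an illustration of that idea, not a method prescribed by these slides; the data and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: engine size (litres) vs. fuel use (litres per 100 km).
engine_size = [1.0, 1.5, 2.0, 2.5, 3.0]
fuel_use = [5.0, 6.0, 7.0, 8.0, 9.0]

slope, intercept = fit_line(engine_size, fuel_use)
predicted = slope * 2.2 + intercept  # continuous output for an unseen input
print(round(predicted, 2))
```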

Data Mining Functionalities: Association Analysis (5)
A data mining process that aims to identify patterns (aka frequent itemsets) in data
For example:
Buy(X, Printer) → Buy(X, Cartridge)
Buy(X, Bread) ∧ Buy(X, Butter) → Buy(X, Milk)
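
The example rules above can be checked against a small basket dataset. The following Python sketch counts itemsets by brute force and computes rule confidence; the transactions are invented for illustration, and the counting is deliberately naive (it is not the Apriori algorithm, which prunes candidates level by level).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; item names follow the examples above.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Printer", "Cartridge"},
    {"Printer", "Cartridge", "Bread"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset up to max_size and keep those meeting min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[frozenset(itemset)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def confidence(itemsets, antecedent, consequent):
    """Confidence of antecedent -> consequent = support(both) / support(antecedent)."""
    both = frozenset(antecedent | consequent)
    return itemsets[both] / itemsets[frozenset(antecedent)]

itemsets = frequent_itemsets(transactions, min_support=0.4)
print(confidence(itemsets, {"Printer"}, {"Cartridge"}))
print(confidence(itemsets, {"Bread", "Butter"}, {"Milk"}))
```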

Data Mining Functionalities: Cluster Analysis (6)
Unsupervised learning
Aims to group data to form new classes
Cluster houses to find distribution patterns
Basic principle: Maximizing the intra-class similarity and minimizing the inter-class similarity
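
A minimal Python sketch of one clustering method, k-means, on one-dimensional data. The choice of k-means, the house prices, and the initial centers are all assumptions for illustration; the slides do not name a specific algorithm here.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on one-dimensional data with given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable, so stop early
        centers = new_centers
    return centers, clusters

# Hypothetical house prices (in thousands); values are illustrative only.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers, clusters)
```

Points within a cluster end up close to their own center (high intra-class similarity) and far from the other center (low inter-class similarity).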

Data Mining Functionalities: Outlier Analysis (7)
Outlier: A data object that does not comply with the general behavior of the data
It can be considered as noise or exception, but is quite useful in fraud detection, rare events analysis, etc.
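
One simple way to flag such objects is the modified z-score, based on the median and the median absolute deviation (MAD). The sketch below is one possible approach rather than the method of these slides; the transaction amounts are hypothetical, and the constant 0.6745 and the cutoff 3.5 are the commonly quoted defaults for this score.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not distort it.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against in this simple sketch
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical transaction amounts; one of them is unusually large.
amounts = [50, 52, 48, 51, 49, 50, 500]
print(mad_outliers(amounts))
```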

Major issues in Data Mining
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining methods

Major issues in Data Mining (cont…)
Handling relational and complex types of data
Mining information from heterogeneous databases and global information systems (WWW)
Application of discovered knowledge
    Domain-specific data mining tools
    Intelligent query answering
    Process control and decision making
Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and privacy

Data Mining Applications (1)
Target marketing, customer relation management, market basket analysis, cross selling
Forecasting, customer retention, quality control, competitive analysis
Text mining (newsgroup, email, documents) and Web analysis
Intelligent query answering
Buying patterns
Decision support
Fraud detection

Data Mining Applications (2)
Scientific applications
Network failure detection
Controller design
Geographic Information Systems
Genome and bioinformatics
Intelligent robots