What is Data Mining?
An important step of the KDD (Knowledge Discovery from Databases) process
Data mining is the automatic extraction of interesting knowledge (rules, regularities, patterns, constraints) from large data sources, e.g., databases, texts, the web, images, etc.
Identified patterns must be:
• Valid, novel (non‐trivial), potentially useful, and understandable
The KDD Process
Data Mining Objectives
Identification of data as a source of useful information
Automatic extraction of valid and novel patterns from the data source
Use of discovered patterns for competitive advantage when working in a business environment
Why Data Mining?
Data Explosion (Information Overload) problem
We are drowning in data, but starving for knowledge!
Data, data everywhere, nor any drop of insight!
(Water, water everywhere, nor any drop to drink)
Explosive growth of data: from terabytes to petabytes
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other data repositories
Data Mining: On what kind of data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced DB and Data Repositories
Object‐oriented and object‐relational databases
Spatial databases
Time‐series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
Web database
Data Mining Functionalities: Characterization (1)
A data mining process that aims to find rules that describe the properties of a concept.
Standard form:
If concept then characteristics
C=1 → A=1 & B=3 (Support: 25%, i.e., the rule holds for 25% of the records)
C=1 → A=1 & B=4 (Support: 17%)
C=1 → A=0 & B=2 (Support: 16%)
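The characterization rules above can be sketched in a few lines of Python. This is only an illustration: the records are made-up toy data, the function name `characterize` is hypothetical, and it assumes that support means the fraction of all records in which both the concept and the characteristics hold (the slide does not pin down the denominator).

```python
from collections import Counter

# Hypothetical toy records; the attribute names A, B, C follow the rules above.
records = [
    {"A": 1, "B": 3, "C": 1},
    {"A": 1, "B": 3, "C": 1},
    {"A": 1, "B": 4, "C": 1},
    {"A": 0, "B": 2, "C": 1},
    {"A": 0, "B": 2, "C": 0},
    {"A": 1, "B": 3, "C": 0},
    {"A": 2, "B": 0, "C": 0},
    {"A": 1, "B": 4, "C": 0},
]


def characterize(records, concept_attr, concept_value):
    """Return {characteristics: support} for rules 'concept -> characteristics'.

    Assumption: support is the fraction of ALL records in which both the
    concept and the characteristics hold.
    """
    total = len(records)
    # Count each combination of non-concept attribute values among the
    # records that belong to the concept.
    counts = Counter(
        tuple(sorted((k, v) for k, v in r.items() if k != concept_attr))
        for r in records
        if r[concept_attr] == concept_value
    )
    return {chars: n / total for chars, n in counts.most_common()}


rules = characterize(records, "C", 1)
for chars, support in rules.items():
    rhs = " & ".join(f"{k}={v}" for k, v in chars)
    print(f"C=1 -> {rhs} (Support: {support:.1%})")
```

Each printed line has the same shape as the rules on the slide, with the most frequent characteristics listed first.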
Data Mining Functionalities: Discrimination (2)
A data mining process that aims to find rules that allow us to discriminate the objects (records) belonging to a given concept (one class) from the rest of the records (other classes).
Standard form:
If characteristics then concept
A=0 & B=1 → C=1 (Support: 33%, Confidence: 83%)
Confidence: The conditional probability of the concept given the characteristics
A=2 & B=0 → C=1 (27%, 80%)
A=1 & B=1 → C=1 (12%, 76%)
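As a rough sketch of how the two numbers attached to a discrimination rule could be computed, the Python below evaluates one rule on made-up toy records. The helper name `rule_stats` is hypothetical. Confidence follows the definition on the slide; support is assumed here to be the fraction of all records that satisfy both the characteristics and the concept.

```python
def rule_stats(records, characteristics, concept):
    """Return (support, confidence) for the rule 'characteristics -> concept'.

    Assumptions: support = P(characteristics and concept) over all records;
    confidence = P(concept | characteristics).
    """
    def matches(record, conditions):
        return all(record[k] == v for k, v in conditions.items())

    n_chars = sum(matches(r, characteristics) for r in records)
    n_both = sum(
        matches(r, characteristics) and matches(r, concept) for r in records
    )
    support = n_both / len(records)
    # Guard against rules whose characteristics never occur.
    confidence = n_both / n_chars if n_chars else 0.0
    return support, confidence


# Hypothetical toy records; the attribute names A, B, C follow the rules above.
records = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 0, "B": 1, "C": 0},
    {"A": 2, "B": 0, "C": 1},
    {"A": 2, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 0},
    {"A": 2, "B": 1, "C": 0},
]

support, confidence = rule_stats(records, {"A": 0, "B": 1}, {"C": 1})
print(f"A=0 & B=1 -> C=1 (Support: {support:.0%}, Confidence: {confidence:.0%})")
```

In this toy data, four records have A=0 and B=1, and three of them belong to the concept C=1, so a high confidence indicates that the characteristics separate the concept well from the other records.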
Data Mining Functionalities: Outlier Analysis (7)
Outlier: A data object that does not comply with the general behavior of the data
It can be considered as noise or exception, but is quite useful in fraud detection, rare events analysis, etc.
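One simple way to flag such objects in a single numeric attribute is the interquartile-range rule (Tukey's fences). The slides do not prescribe a particular method, so the sketch below is just one common choice. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample amounts are assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    This is the interquartile-range rule, chosen here only as an example
    of an outlier test; k=1.5 is the conventional default.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 10, 95, 12, 13]
outliers = iqr_outliers(amounts)
print(outliers)
```

In a fraud-detection setting, the flagged values would be candidates for closer inspection rather than noise to be discarded.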
Major issues in Data Mining
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining methods
Major issues in Data Mining (cont…)
Handling relational and complex types of data
Mining information from heterogeneous databases and global information systems (WWW)
Application of discovered knowledge
Domain‐specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
Protection of data security, integrity, and privacy