20IT501_DWDM_U5.ppt

520 views 103 slides Aug 02, 2023


20IT501 – Data Warehousing and Data Mining
III Year / VI Semester

UNIT V - ADVANCED TRENDS
Time Series and Sequence Data in Transactional Databases - Sentiment Analysis - Descriptive Mining of Complex Data Objects - Spatial Databases - Multimedia Databases - Text Mining - World Wide Web - Applications and Trends in Data Mining - Case studies involving classification and clustering

Mining Time-Series Data
A time-series database consists of sequences of values or events obtained over repeated measurements of time.
The values are typically measured at equal time intervals (e.g., hourly, daily, weekly).
Time-series databases are popular in many applications, such as stock market analysis and economic and sales forecasting.

Mining Time-Series Data
With the growing deployment of large numbers of sensors, telemetry devices, and other on-line data collection tools, the amount of time-series data is increasing rapidly, often on the order of gigabytes per day (such as in stock trading) or even per minute (such as from NASA space programs).

Mining Time-Series Data
Trend Analysis
There are two goals in time-series analysis:
modeling time series (i.e., to gain insight into the mechanisms or underlying forces that generate the time series), and
forecasting time series (i.e., to predict the future values of the time-series variables).

Mining Time-Series Data
Trend Analysis – Components:
Trend or long-term movements: These indicate the general direction in which a time-series graph is moving over a long interval of time. This movement is displayed by a trend curve, or a trend line.

Mining Time-Series Data
Trend Analysis – Components:
Cyclic movements or cyclic variations: These refer to the cycles, that is, the long-term oscillations about a trend line or curve, which may or may not be periodic.

Mining Time-Series Data
Trend Analysis – Components:
Seasonal movements or seasonal variations: These are systematic or calendar related.
Examples include events that recur annually, such as the sudden increase in sales of chocolates and flowers before Valentine’s Day or of department store items before Christmas.
The observed increase in water consumption in summer due to warm weather is another example.

Mining Time-Series Data
Trend Analysis – Components:
Irregular or random movements: These characterize the sporadic motion of time series due to random or chance events, such as labor disputes, floods, or announced personnel changes within companies.

Mining Time-Series Data
Trend Analysis:
Regression analysis has been a popular tool for modeling time series and for finding trends and outliers in such data sets.
Time-series modeling is also referred to as the decomposition of a time series into these four basic movements (trend T, cyclic C, seasonal S, and irregular I).
The time-series variable Y can be modeled as either the product of the four variables (i.e., Y = T × C × S × I) or their sum.

Mining Time-Series Data
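Trend Analysis:
A common method for determining trend is to calculate a moving average of order n as the following sequence of arithmetic means:
(y1 + y2 + ... + yn) / n, (y2 + y3 + ... + y(n+1)) / n, (y3 + y4 + ... + y(n+2)) / n, ...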
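
The moving average of order n can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `moving_average` and the sample series are chosen here for the example.

```python
def moving_average(values, n):
    """Return the moving average of order n as a list of arithmetic means."""
    if n <= 0 or n > len(values):
        raise ValueError("order n must be between 1 and len(values)")
    # One mean per window of n consecutive values.
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


series = [3, 7, 2, 0, 4, 5, 9, 7, 2]   # illustrative data
print(moving_average(series, 3))        # smoothed series, shorter by n - 1 values
```

A larger order n smooths more aggressively but loses more values at the ends of the series.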

Mining Time-Series Data
Similarity Search in Time-Series Analysis
Unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence.
Given a set of time-series sequences, S, there are two types of similarity searches: subsequence matching and whole sequence matching.

Mining Time-Series Data
Similarity Search in Time-Series Analysis
Subsequence matching finds the sequences in S that contain subsequences that are similar to a given query sequence x, while whole sequence matching finds a set of sequences in S that are similar to each other (as a whole).
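
Subsequence matching can be sketched as a brute-force sliding-window search. The slides do not name a distance measure; Euclidean distance and the tolerance `eps` are assumptions made for this illustration, and practical systems would normally add indexing or dimensionality reduction.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_matches(sequences, query, eps):
    """Return (sequence index, offset) pairs whose window lies within eps of query."""
    hits = []
    m = len(query)
    for idx, seq in enumerate(sequences):
        for off in range(len(seq) - m + 1):
            if euclidean(seq[off:off + m], query) <= eps:
                hits.append((idx, off))
    return hits


S = [[1, 2, 3, 4, 5], [10, 10, 10, 10], [0, 2.1, 3.0, 3.9]]  # illustrative data
print(subsequence_matches(S, [2, 3, 4], eps=0.5))
```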

Mining Sequence Patterns in
Transactional Databases
A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time.
Typical examples include customer shopping sequences, Web clickstreams, biological sequences, and sequences of events in science and engineering and in natural and social developments.

Mining Sequence Patterns in
Transactional Databases
Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.
An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.”
For retail data, sequential patterns are useful for shelf placement and promotions.

Mining Sequence Patterns in
Transactional Databases
Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold min_sup, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose occurrence frequency in the set of sequences is no less than min_sup.
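
The definition above can be made concrete with a small support counter. This is a sketch under assumptions: sequences are modeled as lists of Python sets (one set per event), and the four-sequence database is illustrative data, not taken from the slides.

```python
def is_subsequence(pattern, sequence):
    """True if each event of pattern is a subset of some event of sequence, in order."""
    pos = 0
    for event in pattern:
        # Greedily find the earliest remaining event that contains this pattern event.
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
min_sup = 2
pattern = [{"a", "b"}, {"c"}]
print(support(pattern, database) >= min_sup)  # is the pattern frequent?
```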

Mining Sequence Patterns in
Transactional Databases
This model of sequential pattern mining is an abstraction of customer-shopping sequence analysis.

Mining Sequence Patterns in
Transactional Databases
Scalable Methods for Mining Sequential Patterns:
GSP adopts a candidate generate-and-test approach using horizontal data format (where the data are represented as <sequence ID : sequence of itemsets>, as usual, where each itemset is an event).

Mining Sequence Patterns in
Transactional Databases
Scalable Methods for Mining Sequential Patterns:
SPADE adopts a candidate generate-and-test approach using vertical data format (where the data are represented as <itemset : (sequence ID, event ID)>).
The vertical data format can be obtained by transforming a horizontally formatted sequence database in just one scan.
PrefixSpan is a pattern-growth method, which does not require candidate generation.

Mining Sequence Patterns in
Transactional Databases
GSP: Generalized Sequential Patterns
It is an extension of Agrawal and Srikant's seminal algorithm for frequent itemset mining, known as Apriori.
GSP uses the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach.

Mining Sequence Patterns in
Transactional Databases
GSP: Generalized Sequential Patterns
In the first scan of the database, it finds all of the frequent items, that is, those with minimum support.
Each such item yields a 1-event frequent sequence consisting of that item.
Each subsequent pass starts with a seed set of sequential patterns: the set of sequential patterns found in the previous pass.

Mining Sequence Patterns in
Transactional Databases
GSP: Generalized Sequential Patterns
This seed set is used to generate new potentially frequent patterns, called candidate sequences.
Each candidate sequence contains one more item than the seed sequential pattern from which it was generated (where each event in the pattern may contain one or multiple items).

Mining Sequence Patterns in
Transactional Databases
GSP: Generalized Sequential Patterns
Although GSP reduces the search space, it typically needs to scan the database multiple times.
It is also likely to generate a huge set of candidate sequences, especially when mining long sequences.
There is a need for more efficient mining methods.

Mining Sequence Patterns in
Transactional Databases
SPADE: (Sequential PAttern Discovery using Equivalent classes)
It is an Apriori-based sequential pattern mining algorithm that uses vertical data format.
As with GSP, SPADE requires one scan to find the frequent 1-sequences.

Mining Sequence Patterns in
Transactional Databases
SPADE: (Sequential PAttern Discovery using Equivalent classes)
To find candidate 2-sequences, we join all pairs of single items if they are frequent (therein, it applies the Apriori property), if they share the same sequence identifier, and if their event identifiers follow a sequential ordering.

Mining Sequence Patterns in
Transactional Databases
SPADE: (Sequential PAttern Discovery using Equivalent classes)
The first item in the pair must occur as an event before the second item, where both occur in the same sequence.
Similarly, we can grow the length of itemsets from length 2 to length 3, and so on.

Mining Sequence Patterns in
Transactional Databases
SPADE: (Sequential PAttern Discovery using Equivalent classes)
The use of vertical data format, with the creation of ID lists, reduces scans of the sequence database.
The ID lists carry the information necessary to find the support of candidates. As the length of a frequent sequence increases, the size of its ID list decreases, resulting in very fast joins.
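
The vertical format and the ID-list join can be sketched as follows. This is a simplified illustration rather than the full SPADE algorithm: it handles only single items, keeps only the event ID of the last item in each joined entry, and omits equivalence classes. The function names and the sample database are assumptions made for the example.

```python
from collections import defaultdict


def to_vertical(database):
    """Map each item to its ID list of (sequence ID, event ID) pairs in one scan."""
    id_lists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, event in enumerate(sequence, start=1):
            for item in event:
                id_lists[item].append((sid, eid))
    return id_lists


def temporal_join(list_a, list_b):
    """ID list of the 2-sequence <a b>: b occurs after a within the same sequence."""
    joined = set()
    for sid_a, eid_a in list_a:
        for sid_b, eid_b in list_b:
            if sid_a == sid_b and eid_a < eid_b:
                joined.add((sid_b, eid_b))
    return sorted(joined)


def support(id_list):
    """Support is the number of distinct sequence IDs in the ID list."""
    return len({sid for sid, _ in id_list})


database = {1: [{"a"}, {"b"}, {"a", "c"}], 2: [{"b"}, {"a"}], 3: [{"a", "b"}, {"b"}]}
vertical = to_vertical(database)
ab = temporal_join(vertical["a"], vertical["b"])
print(ab, support(ab))
```

Because the support of a candidate is read directly from its ID list, no further scan of the original sequence database is needed after the vertical format has been built.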

Mining Sequence Patterns in
Transactional Databases
Mining Multidimensional, Multilevel Sequential Patterns
Sequence identifiers (representing individual customers, for example) and sequence items (such as products bought) are often associated with additional pieces of information.
Mining multidimensional, multilevel sequential patterns is the discovery of interesting patterns in such a broad dimensional space, at different levels of detail.

Mining Sequence Patterns in
Transactional Databases
Constraint-Based Mining of Sequential Patterns:
Constraint-based mining incorporates user-specified constraints to reduce the search space and derive only patterns that are of interest to the user.

Mining Sequence Patterns in
Transactional Databases
Constraint-Based Mining of Sequential Patterns:
First, constraints can be related to the duration, T, of a sequence.
The duration may be the maximal or minimal length of the sequence in the database, or a user-specified duration related to time, such as the year 2005.
Sequential pattern mining can then be confined to the data within the specified duration, T.

Mining Sequence Patterns in
Transactional Databases
Constraint-Based Mining of Sequential Patterns:
Constraints related to a specific duration, such as a particular year, are considered succinct constraints.
A constraint is succinct if we can enumerate all and only those sequences that are guaranteed to satisfy the constraint, even before support counting begins.
Suppose, here, T = 2005. By selecting the data for that year, all of the sequences guaranteed to satisfy the constraint can be enumerated before mining begins.

Mining Sequence Patterns in
Transactional Databases
Constraint-Based Mining of Sequential Patterns:
Finally, a user can specify constraints on the kinds of sequential patterns by providing “pattern templates”.
A serial episode is a set of events that occurs in a total order, whereas a parallel episode is a set of events whose occurrence ordering is insignificant.

Mining Sequence Patterns in
Transactional Databases
Periodicity Analysis for Time-Related Sequence Data:
Periodicity analysis is the mining of periodic patterns, that is, the search for recurring patterns in time-related sequence data.
Periodicity analysis can be applied to many important areas. For example, seasons, tides, planet trajectories, daily power consumption, daily traffic patterns, and weekly TV programs all present certain periodic patterns.

Mining Sequence Patterns in
Transactional Databases
Periodicity Analysis for Time-Related Sequence Data:
A full periodic pattern is a pattern where every point in time contributes (precisely or approximately) to the cyclic behavior of a time-related sequence.
For example, all of the days in the year approximately contribute to the season cycle of the year.

Mining Sequence Patterns in
Transactional Databases
Periodicity Analysis for Time-Related Sequence Data:
A partial periodic pattern specifies the periodic behavior of a time-related sequence at some but not all of the points in time.
For example, Sandy reads the New York Times from 7:00 to 7:30 every weekday morning, but her activities at other times do not have much regularity.

Sentiment Analysis
Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative, or neutral.
Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.

Sentiment Analysis
Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Sentiment Analysis
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level, that is, deciding whether the opinion expressed in a document, a sentence, or an entity feature/aspect is positive, negative, or neutral.
Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as enjoyment, anger, disgust, sadness, fear, and surprise.

Sentiment Analysis
A different method for determining sentiment is the use of a scaling system whereby words commonly associated with a negative, neutral, or positive sentiment are given an associated number on a −10 to +10 scale (most negative up to most positive) or simply from 0 to a positive upper limit such as +4.
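
A scaling system of this kind can be sketched with a small word-score lexicon. The words, their scores on the −10 to +10 scale, and the averaging rule are all hypothetical choices made for this illustration; real lexicons are far larger and usually handle negation and intensifiers.

```python
# Hypothetical lexicon: word -> score on a -10 (most negative) to +10 (most positive) scale.
LEXICON = {"excellent": 8, "good": 5, "okay": 0, "bad": -5, "terrible": -9}


def sentiment_score(text, lexicon=LEXICON):
    """Average the scores of the lexicon words found in the text (0.0 if none)."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def polarity(text):
    """Map the averaged score to a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("The battery is excellent, but the screen is bad."))
```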

Sentiment Analysis
Subjectivity/objectivity identification:
This task is commonly defined as classifying a given text (usually a sentence) into one of two classes: objective or subjective.
The subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences (e.g., a news article quoting people's opinions).

Sentiment Analysis
Subjectivity/objectivity identification:
Subjective and objective identification is an emerging subtask of sentiment analysis that uses syntactic and semantic features together with machine learning to decide whether a sentence or document states facts or opinions.
The term objective refers to text that carries factual information.

Sentiment Analysis
Subjectivity/objectivity identification:
Example of an objective sentence:
“To be elected President of India, a candidate must be at least thirty-five years of age.”
The term subjective describes text that contains non-factual information in various forms, such as personal opinions, judgments, and predictions.

Sentiment Analysis
Subjectivity/objectivity identification:
Example of a subjective sentence:
“We Indians need to elect a Prime Minister who is mature and who is able to make wise decisions.”

Descriptive Mining of
Complex Data Objects
Application requirements have motivated the design and development of object-relational and object-oriented database systems.
These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies.

Descriptive Mining of
Complex Data Objects
Each object in a class is associated with:
an object identifier,
a set of attributes that may contain sophisticated data structures, and
a set of methods.

Descriptive Mining of
Complex Data Objects
Generalization of Structured Data
An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and modeling complex structure-valued data, such as set-valued and list-valued data and data with nested structures.

Descriptive Mining of
Complex Data Objects
Set-valued data can be generalized by:
generalization of each value in the set to its corresponding higher-level concept, or
derivation of the general behavior of the set.
Descriptive Mining of
Complex Data Objects
Generalization of Structured Data
A set-valued attribute may be generalized to a set-valued or a single-valued attribute.
A single-valued attribute may be generalized to a set-valued attribute if the values form a lattice or “hierarchy” or if the generalization follows different paths.

Ex. 1: Generalization of a set-valued attribute.
•Suppose that the hobby of a person is a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, SimCity}.
•This set can be generalized to a set of high-level concepts, such as {sports, music, computer games}.
•A count can be associated with each generalized value to indicate how many elements are generalized to that value, as in {sports(3), music(1), computer games(1)}, where sports(3) indicates three kinds of sports, and so on.
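
The hobby example can be reproduced with a concept hierarchy stored as a dictionary. This is a minimal sketch; the mapping `CONCEPT` encodes the one-level hierarchy implied by the example and is otherwise an assumption.

```python
from collections import Counter

# Assumed one-level concept hierarchy: low-level value -> higher-level concept.
CONCEPT = {
    "tennis": "sports",
    "hockey": "sports",
    "soccer": "sports",
    "violin": "music",
    "SimCity": "computer games",
}


def generalize_set(values, hierarchy=CONCEPT):
    """Generalize each value and count how many values map to each concept."""
    return Counter(hierarchy[v] for v in values)


hobbies = {"tennis", "hockey", "soccer", "violin", "SimCity"}
print(generalize_set(hobbies))
```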

Descriptive Mining of
Complex Data Objects
Generalization of Structured Data
List-valued attributes and sequence-valued attributes can be generalized in a manner similar to that for set-valued attributes, except that the order of the elements in the list or sequence should be preserved in the generalization.

•Example 2: Generalization of list-valued attributes:
•Consider the following list or sequence of data for a person’s education record:
“((B.Sc. in Electrical Engineering, U.B.C., Dec., 1998),
(M.Sc. in Computer Engineering, U. Maryland, May, 2001),
(Ph.D. in Computer Science, UCLA, Aug., 2005))”
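
One way to generalize such a list while preserving its order is to drop the less important attributes of each tuple. Which attributes to drop (here the field of study and the month) is an assumption made for this sketch, as is the tuple layout.

```python
# Education record as an ordered list of (degree, field, school, month, year) tuples.
education = [
    ("B.Sc.", "Electrical Engineering", "U.B.C.", "Dec.", 1998),
    ("M.Sc.", "Computer Engineering", "U. Maryland", "May", 2001),
    ("Ph.D.", "Computer Science", "UCLA", "Aug.", 2005),
]


def generalize_list(records):
    """Keep degree, school, and year of each entry; the list order is unchanged."""
    return [(degree, school, year) for degree, _field, school, _month, year in records]


print(generalize_list(education))
```

A coarser generalization could keep only the most important entry, for example the last (highest) degree.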

Descriptive Mining of
Complex Data Objects
Generalization of Structured Data
A complex structure-valued attribute may contain sets, tuples, lists, trees, records, and their combinations, where one structure may be nested in another at any level.

Descriptive Mining of
Complex Data Objects
Generalization of Structured Data
In general, a structure-valued attribute can be generalized in several ways, such as:
generalizing each attribute in the structure while maintaining the shape of the structure,
flattening the structure and generalizing the flattened structure,
summarizing the low-level structures by high-level concepts or aggregation, and
returning the type or an overview of the structure.

Descriptive Mining of
Complex Data Objects
Aggregation and Approximation in Spatial and Multimedia Data Generalization
Aggregation and approximation are another important means of generalization.
They are especially useful for generalizing attributes with large sets of values, complex structures, and spatial or multimedia data.

Descriptive Mining of
Complex Data Objects
Aggregation and Approximation in Spatial and Multimedia Data Generalization
Let’s take spatial data as an example.
We would like to generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage.

Descriptive Mining of
Complex Data Objects
Aggregation and Approximation in Spatial and Multimedia Data Generalization
A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and other forms of audio/video information.
Multimedia data are typically stored as sequences of bytes with variable lengths, and segments of data are linked together or indexed in a multidimensional way for easy reference.

Descriptive Mining of
Complex Data Objects
Aggregation and Approximation in Spatial and Multimedia Data Generalization
Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data.
For an image, the size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image can be extracted by aggregation and/or approximation.

Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
Such mining demands an integration of data mining with spatial database technologies.

Spatial Data Mining
A crucial challenge in spatial data mining is the development of efficient spatial data mining techniques, given the huge amount of spatial data and the complexity of spatial data types and spatial access methods.
The term geostatistics is often associated with continuous geographic space.

Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.

Spatial Data Mining
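Spatial Data Cube Construction and Spatial OLAP – Types of Dimensions:
A nonspatial dimension contains only nonspatial data, e.g., “hot” for temperature and “wet” for precipitation.
A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization, starting at a certain high level, becomes nonspatial.
A spatial-to-spatial dimension is a dimension whose primitive-level data and all generalizations are spatial, e.g., regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.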

Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP – Types of Measure:
Numerical measure: contains only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, and so on.

Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP – Types of Measure:
Spatial measure: a collection of pointers to spatial objects. For example, the regions with the same range of temperature and precipitation are grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.
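
A spatial measure of this kind can be sketched by grouping region identifiers into cells keyed by temperature band and precipitation label. The 5-degree band width follows the ranges mentioned in these slides; the region identifiers, readings, and function names are hypothetical.

```python
from collections import defaultdict


def temperature_band(celsius, width=5):
    """Return the (low, high) band containing the temperature, e.g. (0, 5)."""
    low = int(celsius // width) * width
    return (low, low + width)


def build_spatial_measure(regions):
    """Group region IDs (the 'pointers') by (temperature band, precipitation) cell."""
    cells = defaultdict(list)
    for region_id, celsius, precipitation in regions:
        cells[(temperature_band(celsius), precipitation)].append(region_id)
    return dict(cells)


regions = [("R1", 3.2, "wet"), ("R2", 4.9, "wet"), ("R3", 7.0, "dry")]  # hypothetical
print(build_spatial_measure(regions))
```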

Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP – Dimension Hierarchies:
region name dimension: probe location < district < city < region < province
time dimension: hour < day < month < season

Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages.

Multimedia Data Mining
Similarity Search in Multimedia Data
Two main families of multimedia retrieval systems are considered:
description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation;
content-based retrieval systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

Multimedia Data Mining
Similarity Search in Multimedia Data – Approaches:
Color histogram–based signature: two images with similar color composition but that contain very different shapes or textures may be identified as similar, although they could be completely unrelated semantically.

Multimedia Data Mining
Similarity Search in Multimedia Data – Approaches:
Multifeature composed signature: color histogram, shape, image topology, and texture. The extracted image features are stored as metadata, and images are indexed based on such metadata.

Multimedia Data Mining
Similarity Search in Multimedia Data – Approaches:
Wavelet-based signature: Wavelets capture shape, texture, and image topology information in a single unified framework.
Wavelet-based signature with region-based granularity: the computation and comparison of signatures are at the granularity of regions, not the entire image.

Multimedia Data Mining
Multidimensional Analysis of Multimedia Data
A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.
MultiMediaMiner is a multimedia data mining system prototype that extends the DBMiner system by handling multimedia data.

Multimedia Data Mining
Multidimensional Analysis of Multimedia Data
Each image contains two descriptors: a feature descriptor and a layout descriptor.
The original image is not stored directly in the database; only its descriptors are stored.

Multimedia Data Mining
Multidimensional Analysis of Multimedia Data
The description information encompasses fields like image file name, image URL, and image type (e.g., gif, tiff, jpeg, mpeg, bmp, avi).
The feature descriptor is a set of vectors for each visual characteristic. The main vectors are a color vector containing the color histogram quantized to 512 colors (8 × 8 × 8 for R × G × B), an MFC (Most Frequent Color) vector, and an MFO (Most Frequent Orientation) vector.
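
The 512-color quantization can be sketched by dividing each 8-bit channel into 8 equal bins. This is an illustrative sketch only: it assumes pixels are given as (R, G, B) tuples with values 0-255, and the single most frequent bin it returns is a simplification, not the MultiMediaMiner MFC vector itself.

```python
def color_histogram(pixels, bins_per_channel=8):
    """Count pixels per quantized color; 8 bins per channel gives 8 * 8 * 8 = 512 bins."""
    step = 256 // bins_per_channel
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    return hist


def most_frequent_bin(hist):
    """Index of the histogram bin with the highest count."""
    return max(range(len(hist)), key=hist.__getitem__)


pixels = [(255, 0, 0), (250, 5, 3), (0, 0, 255)]  # two reddish pixels, one blue
hist = color_histogram(pixels)
print(len(hist), most_frequent_bin(hist))
```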

Multimedia Data Mining
Multidimensional Analysis of Multimedia Data
The layout descriptor contains a color layout vector and an edge layout vector. Regardless of their original size, all images are assigned an 8 × 8 grid.
The most frequent color for each of the 64 cells is stored in the color layout vector, and the number of edges for each orientation in each of the cells is stored in the edge layout vector.
Other sizes of grids, like 4 × 4, 2 × 2, and 1 × 1, can easily be derived.

Multimedia Data Mining
Multidimensional Analysis of Multimedia Data
A multimedia data cube can have many dimensions:
the size of the image or video in bytes;
the width and height of the frames (or pictures), constituting two dimensions;
the date on which the image or video was created (or last modified);
the format type of the image or video;
the frame sequence duration in seconds;
the image or video Internet domain;
the Internet domain of pages referencing the image or video (parent URL);
the keywords;
a color dimension;
an edge-orientation dimension.

Multimedia Data Mining
Classification and Prediction Analysis of Multimedia Data
Data preprocessing is important when mining image data and can include data cleaning, data transformation, and feature extraction.
Aside from standard methods used in pattern recognition, such as edge detection and Hough transformations, other techniques can be explored, such as the decomposition of images to eigenvectors or the adoption of probabilistic models to deal with uncertainty.

Multimedia Data Mining
Audio and Video Data Mining
MPEG-k (developed by MPEG: Moving Picture Experts Group) and JPEG are typical video and image compression schemes.
MPEG-7, formally named “Multimedia Content Description Interface,” is a standard for describing multimedia content data.
It supports some degree of interpretation of the information meaning, which can be passed on to, or accessed by, a device or a computer.

Multimedia Data Mining
Audio and Video Data Mining
The audiovisual data description in MPEG-7 includes still pictures, video, graphics, audio, speech, three-dimensional models, and information about how these data elements are combined in the multimedia presentation.

Text Mining
Text databases (or document databases) consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages.

Text Mining
Text Data Analysis and Information Retrieval
Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents.
Applications: on-line library catalog systems, on-line document management systems, and the more recently developed Web search engines.

Text Mining
Text Data Analysis and Information Retrieval:
A typical information retrieval problem is to locate relevant documents in a document collection based on a user’s query, which is often some keywords describing an information need.
For a short-term (ad hoc) information need, the user takes the initiative to “pull” the relevant information out from the collection.
For a long-term information need, the system may “push” any newly arrived information item to a user if the item is judged as being relevant to the user’s information need.

Text Mining
Basic Measures for Text Retrieval: Precision and Recall:
Precision: This is the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Text Mining
Basic Measures for Text Retrieval: Precision and Recall:
Recall: This is the percentage of documents that are relevant to the query and were, in fact, retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Text Mining
Basic Measures for Text Retrieval: Precision and Recall:
An information retrieval system often needs to trade off recall for precision or vice versa. One commonly used trade-off is the F-score, which is defined as the harmonic mean of recall and precision:
F-score = (2 × recall × precision) / (recall + precision)
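
Precision, recall, and the F-score can be computed directly from the sets of relevant and retrieved document identifiers. This is a minimal sketch; the function name and the convention of returning 0.0 for empty sets are assumptions made here.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for sets of document IDs."""
    hits = len(relevant & retrieved)              # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f_score


relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6}
print(precision_recall_f(relevant, retrieved))
```

Because it is a harmonic mean, the F-score stays low unless precision and recall are both reasonably high.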

Text Mining
Text Retrieval Methods - Boolean retrieval model
A document is represented by a set of keywords, and a user provides a Boolean expression of keywords, such as “car and repair shops,” “tea or coffee,” or “database systems but not Oracle.”
The retrieval system takes such a Boolean query and returns the documents that satisfy the Boolean expression.
The Boolean retrieval method generally works well only when the user knows a lot about the document collection.
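
Boolean retrieval over keyword sets can be sketched with Python set operations. Expressing the query as a predicate function is a design choice for this illustration (a real system would parse the query string), and the document collection is hypothetical.

```python
def boolean_retrieve(documents, predicate):
    """Return the IDs of documents whose keyword set satisfies the Boolean predicate."""
    return sorted(doc_id for doc_id, keywords in documents.items() if predicate(keywords))


# Hypothetical collection: document ID -> set of keywords.
documents = {
    1: {"car", "repair", "shops"},
    2: {"tea", "cups"},
    3: {"database", "systems", "oracle"},
    4: {"database", "systems", "postgresql"},
}

tea_or_coffee = boolean_retrieve(documents, lambda kw: "tea" in kw or "coffee" in kw)
db_not_oracle = boolean_retrieve(
    documents, lambda kw: {"database", "systems"} <= kw and "oracle" not in kw
)
print(tea_or_coffee, db_not_oracle)
```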

Text Mining
Text Retrieval Methods - Vector Space model
The idea is to represent a document and a query both as vectors in a high-dimensional space corresponding to all the keywords, and to use an appropriate similarity measure to compute the similarity between the query vector and the document vector.
The similarity values can then be used for ranking documents.

Text Mining
Text Retrieval Methods - Vector Space model
The first step in most retrieval systems is to identify keywords for representing documents, a preprocessing step often called tokenization.
A stop list is a set of words that are deemed “irrelevant.” For example, a, the, of, for, with, and so on are stop words: they are excluded even though they may appear frequently.

Text Mining
Text Retrieval Methods - Vector Space model
A group of different words may share the same word stem.
A text retrieval system needs to identify groups of words where the words in a group are small syntactic variants of one another, and collect only the common word stem per group.

Text Mining
Text Retrieval Methods - Vector Space model
Let the term frequency be the number of occurrences of term t in the document d, that is, freq(d, t).

Text Mining
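Text Retrieval Methods - Vector Space model
A representative metric is the cosine measure, defined as follows.
Let v1 and v2 be two document vectors. Their cosine similarity is defined as
sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
where v1 · v2 is the inner product of the two vectors and |v| is the length (Euclidean norm) of v.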
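
The cosine measure over raw term-frequency vectors can be sketched as follows. Whitespace tokenization and the absence of stop-word removal, stemming, and term weighting are simplifications assumed for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector: term -> freq(d, t), using whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(v1, v2):
    """Inner product of the vectors divided by the product of their lengths."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


query = term_vector("data mining")
doc = term_vector("mining of text data and web data")
print(round(cosine(query, doc), 3))
```

Ranking then amounts to sorting the documents by their cosine similarity to the query vector, highest first.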

Text Mining
Text Indexing Techniques - inverted index:
An inverted index is an index structure that maintains two hash-indexed or B+-tree indexed tables: document table and term table, where
document table consists of a set of document records, each containing two fields: doc id and posting list, where posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure.

Text Mining
Text Indexing Techniques - inverted index:
An inverted index is an index structure that maintains two hash-indexed or B+-tree indexed tables: document table and term table, where
term table consists of a set of term records, each containing two fields: term id and posting list, where posting list specifies a list of document identifiers in which the term appears.
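
The two tables of an inverted index can be sketched with Python dictionaries, which are hash-indexed. In this illustration the document table's posting lists are sorted alphabetically rather than by a relevance measure, and terms are used directly as keys instead of term IDs; both are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Return (document_table, term_table) for a dict of doc ID -> text."""
    document_table = {}                 # doc ID -> list of terms in the document
    term_table = defaultdict(list)      # term -> list of doc IDs containing the term
    for doc_id, text in documents.items():
        terms = sorted(set(text.lower().split()))
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].append(doc_id)
    return document_table, term_table


def search_all(term_table, query_terms):
    """Doc IDs containing every query term (intersection of posting lists)."""
    postings = [set(term_table.get(t, [])) for t in query_terms]
    return set.intersection(*postings) if postings else set()


docs = {1: "data mining", 2: "text mining", 3: "data warehouse"}
document_table, term_table = build_inverted_index(docs)
print(term_table["mining"], search_all(term_table, ["data", "mining"]))
```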

Text Mining
Text Indexing Techniques - signature file:
A signature file is a file that stores a signature record for each document in the database.
Each signature has a fixed size of b bits representing terms.
Each bit of a document signature is initialized to 0. A bit is set to 1 if the term it represents appears in the document.
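
A b-bit document signature can be sketched with an integer used as a bit field. The slides do not say how terms are mapped to bits; hashing each term with CRC32 modulo b is an assumption made here. Since several terms can share a bit, a set bit only means the term may be present, so matches must be verified against the document.

```python
import zlib


def term_bit(term, b):
    """Map a term to one of b bit positions (assumed mapping: CRC32 modulo b)."""
    return zlib.crc32(term.encode("utf-8")) % b


def signature(terms, b=16):
    """Start from all-zero bits and set the bit of every term in the document."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term, b)
    return sig


def may_contain(doc_signature, term, b=16):
    """True if the term's bit is set; false positives are possible, false negatives are not."""
    return bool((doc_signature >> term_bit(term, b)) & 1)


sig = signature(["data", "mining"])
print(format(sig, "016b"), may_contain(sig, "data"))
```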

Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services.
The Web also contains a rich and dynamic collection of hyperlink information and Web page access and usage information, providing rich sources for data mining.

Mining the Web Page Layout Structure:
•Web pages are also regarded as semi-structured data. The basic structure of a Web page is its DOM (Document Object Model) structure.
•The DOM structure of a Web page is a tree structure, where every HTML tag in the page corresponds to a node in the DOM tree. The Web page can be segmented by some predefined structural tags. Useful tags include <P> (paragraph), <TABLE> (table), <UL> (list), <H1> to <H6> (heading), etc. Thus the DOM structure can be used to facilitate information extraction.
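
Segmenting a page by structural tags can be sketched with the standard library's `html.parser`. This is an illustrative sketch that collects the text of each top-level block tag; the class name `BlockSegmenter` is chosen here, and nested block tags of the same kind are not handled.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "table", "ul", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockSegmenter(HTMLParser):
    """Collect (tag, text) segments for the predefined structural tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._open = None      # block tag currently being collected, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in BLOCK_TAGS and self._open is None:
            self._open, self._text = tag, []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.segments.append((tag, text))
            self._open = None


def segment_page(html):
    parser = BlockSegmenter()
    parser.feed(html)
    return parser.segments


page = "<html><body><H1>News</H1><p>First <b>story</b>.</p><ul><li>a</li> <li>b</li></ul></body></html>"
print(segment_page(page))
```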

APPLICATIONS AND TRENDS
IN DATA MINING
•Financial Data Analysis
•Retail Industry
•Telecommunication Industry
•Biological Data Analysis
•Other Scientific Applications
•Intrusion Detection

•Financial Data Analysis
–Design and construction of data warehouses for multidimensional data analysis and data mining.
–Loan payment prediction and customer credit policy analysis.
–Classification and clustering of customers for targeted marketing.
–Detection of money laundering and other financial crimes.

•Retail Industry
–Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services.

•Telecommunication Industry
–Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, and Web data transmission.

•Biological Data Analysis
–There has been vast growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics.

•Other Scientific Applications
–The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences and astronomy.

•Intrusion Detection
–Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the Internet and the availability of tools and tricks for intruding on and attacking networks have prompted intrusion detection to become a critical component of network administration.