20IT501 – Data Warehousing and Data Mining
III Year / VI Semester
UNIT V - ADVANCED TRENDS
Time Series and Sequence Data in Transactional
Databases - Sentiment analysis - Descriptive Mining
of Complex Data Objects - Spatial Databases -
Multimedia Databases - Text mining - World Wide
Web - Applications and Trends in Data Mining -
Case studies involving classification and clustering
Mining Time-Series Data
A time-series database consists of sequences of values or events obtained over repeated measurements of time. The values are typically measured at equal time intervals (e.g., hourly, daily, weekly).
Time-series databases are popular in many applications, such as stock market analysis and economic and sales forecasting.
With the growing deployment of a large number of sensors, telemetry devices, and other on-line data collection tools, the amount of time-series data is increasing rapidly, often on the order of gigabytes per day (such as in stock trading) or even per minute (such as from NASA space programs).
Trend Analysis
There are two goals in time-series analysis:
• modeling time series (i.e., to gain insight into the mechanisms or underlying forces that generate the time series), and
• forecasting time series (i.e., to predict the future values of the time-series variables).
Trend Analysis – Components:
• Trend or long-term movements: These indicate the general direction in which a time-series graph is moving over a long interval of time. This movement is displayed by a trend curve, or a trend line.
• Cyclic movements or cyclic variations: These refer to the cycles, that is, the long-term oscillations about a trend line or curve, which may or may not be periodic.
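One simple way to make a trend curve visible is to smooth the series with a moving average. The slides do not name a specific smoothing method, so the sketch below is only an illustration under that assumption; the function name `moving_average`, the window size, and the sample values are hypothetical.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` for the given window size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]      # the `window` most recent values
        averages.append(sum(chunk) / window)
    return averages


# Hypothetical measurements taken at equal time intervals.
series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
trend = moving_average(series, window=3)
print(trend)
```

A larger window smooths out more of the short-term variation, at the cost of losing values at the ends of the series.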
Descriptive Mining of
Complex Data Objects
Application requirements have motivated the design and development of object-relational and object-oriented database systems.
These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies.
Each object in a class is associated with
• an object identifier,
• a set of attributes that may contain sophisticated data structures, and
• a set of methods.
Generalization of Structured Data
An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and modeling complex structure-valued data, such as set- and list-valued data and data with nested structures.
Set-valued data can be generalized by
• generalization of each value in the set to its corresponding higher-level concept, or
• derivation of the general behavior of the set, such as the number of elements in the set.
A set-valued attribute may be generalized to a set-valued or a single-valued attribute.
A single-valued attribute may be generalized to a set-valued attribute if the values form a lattice or “hierarchy” or if the generalization follows different paths.
Ex. 1: Generalization of a set-valued attribute.
• Suppose that the hobby of a person is a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, SimCity}.
• This set can be generalized to a set of high-level concepts, such as {sports, music, computer games}.
• A count can be associated with each generalized value to indicate how many elements are generalized to that value, as in {sports(3), music(1), computer games(1)}, where sports(3) indicates three kinds of sports, and so on.
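The hobby example can be sketched in a few lines of Python. The concept hierarchy `CONCEPT_OF` and the function name `generalize_set` are assumptions made for illustration; a real system would take the hierarchy from the database schema or from a domain expert.

```python
from collections import Counter

# Hypothetical concept hierarchy: each hobby maps to a higher-level concept.
CONCEPT_OF = {
    "tennis": "sports",
    "hockey": "sports",
    "soccer": "sports",
    "violin": "music",
    "SimCity": "computer games",
}


def generalize_set(values, hierarchy):
    """Map each value to its higher-level concept and count the members."""
    return Counter(hierarchy[value] for value in values)


hobbies = {"tennis", "hockey", "soccer", "violin", "SimCity"}
generalized = generalize_set(hobbies, CONCEPT_OF)
print(dict(generalized))
```

The keys of the result form the generalized set {sports, music, computer games}, and the counts give the annotated form sports(3), music(1), computer games(1).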
List-valued attributes and sequence-valued attributes can be generalized in a manner similar to that for set-valued attributes, except that the order of the elements in the list or sequence should be preserved in the generalization.
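For a list or sequence, the same mapping can be applied element by element so that positions are kept. The sketch below assumes a hypothetical purchase sequence and item hierarchy (`ITEM_CONCEPT`); neither comes from the slides.

```python
# Hypothetical concept hierarchy for the items of a purchase sequence.
ITEM_CONCEPT = {
    "milk": "dairy",
    "cheese": "dairy",
    "apple": "fruit",
    "pear": "fruit",
}


def generalize_sequence(items, hierarchy):
    """Replace each element by its higher-level concept, keeping the order."""
    return [hierarchy[item] for item in items]


purchases = ["milk", "apple", "cheese", "pear"]
generalized_purchases = generalize_sequence(purchases, ITEM_CONCEPT)
print(generalized_purchases)
```

Unlike the set case, duplicates are not merged here, because merging would lose the position of each element.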
Spatial Data Mining
A crucial challenge to spatial data mining is the exploration of efficient spatial data mining techniques, due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.
The term geostatistics is often associated with continuous geographic space.
Spatial Data Cube Construction and Spatial OLAP
A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
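Spatial Data Cube Construction and Spatial OLAP – Types of Dimensions:
• A nonspatial dimension contains only nonspatial data, for example “hot” for temperature and “wet” for precipitation.
• A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization, starting at some higher level, becomes nonspatial.
• A spatial-to-spatial dimension is a dimension whose primitive-level data and all higher-level generalizations are spatial, for example regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.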
Spatial Data Cube Construction and Spatial OLAP – Types of Measure:
• Numerical measure: contains only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, and so on.
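A roll-up over a numerical measure amounts to grouping the fact rows by a coarser key and summing the measure. The sketch below assumes a hypothetical fact table of monthly revenue per region; the row layout, the region and county names, and the function name `roll_up` are all illustrative.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, county, year, month, revenue).
facts = [
    ("R1", "County A", 2023, 1, 100.0),
    ("R1", "County A", 2023, 2, 150.0),
    ("R2", "County A", 2023, 1, 200.0),
    ("R3", "County B", 2023, 1, 50.0),
    ("R3", "County B", 2024, 1, 75.0),
]


def roll_up(rows, key):
    """Sum the revenue measure (last field) over groups defined by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)


revenue_by_year = roll_up(facts, key=lambda row: row[2])
revenue_by_county = roll_up(facts, key=lambda row: row[1])
print(revenue_by_year)
print(revenue_by_county)
```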
• Spatial measure: a collection of pointers to spatial objects. For example, the regions with the same range of temperature and precipitation will be grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.
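A spatial measure can be sketched by grouping region identifiers, which stand in for pointers to spatial objects, into cells keyed by generalized temperature and precipitation. The sample regions, the 5-degree bands, and the wet/dry threshold below are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical regions: (region id, temperature in Celsius, precipitation in mm).
regions = [
    ("region_1", 3.0, 120.0),
    ("region_2", 4.5, 140.0),
    ("region_3", 7.0, 30.0),
    ("region_4", 8.5, 130.0),
]


def temperature_range(temperature):
    """Generalize a temperature to a 5-degree band such as '0-5'."""
    low = int(temperature // 5) * 5
    return f"{low}-{low + 5}"


def precipitation_level(precipitation):
    """Hypothetical two-level generalization of precipitation."""
    return "wet" if precipitation >= 100.0 else "dry"


# Each cell holds the spatial measure: a collection of region identifiers.
cells = defaultdict(list)
for region_id, temperature, precipitation in regions:
    cell = (temperature_range(temperature), precipitation_level(precipitation))
    cells[cell].append(region_id)

print(dict(cells))
```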
Spatial Data Cube Construction and Spatial OLAP – Concept hierarchies for dimensions:
• region name dimension: probe location < district < city < region < province
• time dimension: hour < day < month < season
Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages.
Similarity Search in Multimedia Data – Approaches:
• Color histogram–based signature: the signature records only the color composition of an image, so two images with similar color composition but that contain very different shapes or textures may be identified as similar, although they could be completely unrelated semantically.
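The weakness of a color-only signature is easy to reproduce. The sketch below compares two tiny images with histogram intersection, which is one common way to compare histograms and is an assumption here, since the slides do not name a comparison measure. The images and function names are hypothetical.

```python
from collections import Counter


def color_histogram(pixels):
    """Return the normalized color histogram of an image given as a pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: count / total for color, count in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the sum of the bin-wise minimums of two histograms."""
    colors = set(h1) | set(h2)
    return sum(min(h1.get(c, 0.0), h2.get(c, 0.0)) for c in colors)


# Two hypothetical 2x2 images, flattened row by row: same colors, different layout.
red, blue = (255, 0, 0), (0, 0, 255)
image_a = [red, red, blue, blue]   # red top row, blue bottom row
image_b = [red, blue, blue, red]   # checkerboard arrangement

similarity = histogram_intersection(color_histogram(image_a),
                                    color_histogram(image_b))
print(similarity)
```

Both images are half red and half blue, so the signatures are identical even though the shapes differ, which is the false match described above.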
• Multifeature composed signature: the signature combines color histogram, shape, image topology, and texture. The extracted image features are stored as metadata, and images are indexed based on such metadata.
• Wavelet-based signature: wavelets capture shape, texture, and image topology information in a single unified framework.
• Wavelet-based signature with region-based granularity: the computation and comparison of signatures are at the granularity of regions, not the entire image.
Multidimensional Analysis of Multimedia Data
A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.
MultiMediaMiner is a multimedia data mining system prototype that extends the DBMiner system by handling multimedia data.
Each image contains two descriptors: a feature descriptor and a layout descriptor.
The original image is not stored directly in the database; only its descriptors are stored.
The description information encompasses fields like image file name, image URL, and image type (e.g., gif, tiff, jpeg, mpeg, bmp, avi).
The feature descriptor is a set of vectors for each visual characteristic. The main vectors are a color vector containing the color histogram quantized to 512 colors (8 × 8 × 8 for R × G × B), an MFC (Most Frequent Color) vector, and an MFO (Most Frequent Orientation) vector.
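Quantizing to 8 × 8 × 8 = 512 colors can be done by keeping the top three bits of each 8-bit channel. The sketch below is one plausible way to build such a color vector and is not taken from MultiMediaMiner itself; the bin numbering and the function names, including the simplified `most_frequent_colors` stand-in for an MFC vector, are assumptions.

```python
def quantize(pixel):
    """Map an 8-bit (r, g, b) pixel to one of 8 * 8 * 8 = 512 color bins."""
    r, g, b = pixel
    # Keeping the top 3 bits of each channel gives 8 levels per channel.
    return (r >> 5) * 64 + (g >> 5) * 8 + (b >> 5)


def color_vector(pixels):
    """Return a 512-entry histogram of quantized colors."""
    vector = [0] * 512
    for pixel in pixels:
        vector[quantize(pixel)] += 1
    return vector


def most_frequent_colors(vector, k):
    """Return the indices of the k most populated bins (simplified MFC stand-in)."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return ranked[:k]


# Hypothetical image: two near-red pixels, one blue, one green.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 255, 0)]
vector = color_vector(pixels)
print(most_frequent_colors(vector, 1))
```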
Text Mining
Text Retrieval Methods - Vector Space model
The vector space model represents a document and a query both as vectors in a high-dimensional space corresponding to all the keywords, and uses an appropriate similarity measure to compute the similarity between the query vector and the document vector.
The similarity values can then be used for ranking documents.
The first step in most retrieval systems is to identify keywords for representing documents, a preprocessing step often called tokenization.
A stop list is a set of words that are deemed “irrelevant.” For example, a, the, of, for, with, and so on are stop words, even though they may appear frequently.
A group of different words may share the same word stem.
A text retrieval system needs to identify groups of words where the words in a group are small syntactic variants of one another and collect only the common word stem per group.
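The three preprocessing steps (tokenization, stop-word removal, and stemming) can be chained as in the sketch below. The stop list is the short one from the example above, and the suffix-stripping rule is a deliberately crude stand-in for a real stemming algorithm such as Porter's; both are assumptions for illustration only.

```python
import re

# Hypothetical stop list; real systems use much longer lists.
STOP_WORDS = {"a", "the", "of", "for", "with"}

# Crude suffix stripping, a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text):
    """Tokenize, drop stop words, and reduce each remaining word to its stem."""
    return [stem(word) for word in tokenize(text) if word not in STOP_WORDS]


result = keywords("Indexing of the indexed documents with indexes")
print(result)
```

The variants indexing, indexed, and indexes all collapse to the single stem index, so only one keyword per group needs to be stored.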
Let the term frequency be the number of occurrences of term t in the document d, that is, freq(d, t).
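A representative metric is the cosine measure, defined as follows.
Let v1 and v2 be two document vectors. Their cosine similarity is defined as
$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
where v1 · v2 is the standard vector dot product and |v1| is the Euclidean length of v1, that is, the square root of v1 · v1.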
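The cosine measure can be computed directly on term-frequency vectors. The sketch below stores each vector as a sparse mapping from term to freq(d, t); the sample document, the query, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Build a sparse term-frequency vector: term -> number of occurrences."""
    return Counter(tokens)


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


doc = term_frequencies(["data", "mining", "data", "warehouse"])
query = term_frequencies(["data", "mining"])
score = cosine_similarity(doc, query)
print(round(score, 4))
```

Here the dot product is 2·1 + 1·1 = 3, the vector lengths are √6 and √2, and the score is 3/√12, roughly 0.866.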
Text Indexing Techniques - inverted index:
An inverted index is an index structure that maintains two hash-indexed or B+-tree-indexed tables, a document table and a term table, where
• the document table consists of a set of document records, each containing two fields: doc id and posting list, where the posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure.
• the term table consists of a set of term records, each containing two fields: term id and posting list, where the posting list specifies a list of document identifiers in which the term appears.
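Both tables can be sketched with Python dictionaries, which play the role of the hash-indexed tables. The sample documents and function names are hypothetical, and terms in the document table are sorted alphabetically only as a placeholder for a real relevance measure.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build the term table and document table from {doc_id: [terms]}."""
    term_table = defaultdict(list)   # term -> posting list of doc ids
    document_table = {}              # doc id -> posting list of terms
    for doc_id in sorted(documents):
        # Alphabetical order is a placeholder for a relevance ordering.
        document_table[doc_id] = sorted(set(documents[doc_id]))
        for term in document_table[doc_id]:
            term_table[term].append(doc_id)
    return dict(term_table), document_table


def docs_with_all(terms, term_table):
    """Return the ids of documents that contain every query term."""
    postings = [set(term_table.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: ["data", "mining", "text"],
    2: ["text", "retrieval"],
    3: ["data", "warehouse"],
}
term_table, document_table = build_inverted_index(docs)
print(term_table["data"])
print(docs_with_all(["data", "text"], term_table))
```

Answering a multi-term query reduces to intersecting the posting lists of its terms, which is why the term table is the structure consulted at query time.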
Text Indexing Techniques - signature file:
A signature file is a file that stores a signature record for each document in the database.
Each signature has a fixed size of b bits representing terms.
Each bit of a document signature is initialized to 0. A bit is set to 1 if the term it represents appears in the document.
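A signature can be held as an integer bit mask. The sketch below assumes a hypothetical four-term vocabulary with one bit per term (b = 4); with a larger vocabulary several terms would have to share a bit, which this simplified example does not model.

```python
# Hypothetical vocabulary: each term is assigned one bit position (b = 4).
BIT_OF = {"data": 0, "mining": 1, "text": 2, "retrieval": 3}


def signature(terms):
    """Start from all-zero bits and set the bit of every known term to 1."""
    sig = 0
    for term in terms:
        if term in BIT_OF:
            sig |= 1 << BIT_OF[term]
    return sig


# The signature file: one signature record per document id.
signature_file = {
    1: signature(["data", "mining"]),
    2: signature(["text", "retrieval", "data"]),
}


def matching_docs(query_terms):
    """Return documents whose signature has every query bit set."""
    query_sig = signature(query_terms)
    return [doc_id for doc_id, sig in signature_file.items()
            if sig & query_sig == query_sig]


print(format(signature_file[1], "04b"))
print(matching_docs(["text", "data"]))
```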
Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services.
The Web also contains a rich and dynamic collection of hyperlink information and Web page access and usage information, providing rich sources for data mining.
APPLICATIONS AND TRENDS
IN DATA MINING
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection