Lecture #14: Clustering in Queries

ifraghaffar859 · 19 slides · Jul 22, 2024


Clustering in Queries
by
Dr Wareesa Sharif

Clustering
Clustering algorithms group a set of documents into subsets or clusters. The goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters. Clustering is the most common form of unsupervised learning. In clustering, it is the distribution and makeup of the data that determine cluster membership.

Flat Clustering
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.

Hard Clustering, Soft Clustering
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm.
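The distinction can be made concrete with a small sketch. The centroids, the document vector, and the softmax over negative distances are illustrative assumptions, not part of the slides:

```python
import math

# Two hypothetical cluster centroids and one document, as 2-D term vectors
# (toy values; real document vectors are high-dimensional).
centroids = [(1.0, 0.0), (0.0, 1.0)]
doc = (0.8, 0.4)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dists = [euclidean(doc, c) for c in centroids]

# Hard assignment: the document belongs to exactly one cluster.
hard = dists.index(min(dists))

# Soft assignment: a distribution over all clusters, here obtained by
# a softmax over negative distances (one common choice among several).
exps = [math.exp(-d) for d in dists]
soft = [e / sum(exps) for e in exps]

print(hard)  # index of the single nearest centroid
print(soft)  # fractional memberships summing to 1
```

The hard assignment discards how close the runner-up cluster was; the soft assignment keeps that information as fractional membership.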

Search result clustering
By search results we mean the documents that were returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. This is particularly useful if a search term has different word senses.

Scatter-Gather
Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found. An example of a user session in Scatter-Gather: a collection of New York Times news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of these into a smaller collection, International Stories, and performs another scattering operation. This process repeats until a small cluster with relevant documents is found.

We can define the goal in hard flat clustering as follows. Given:
(i) a set of documents D = {d1, . . . , dN},
(ii) a desired number of clusters K, and
(iii) an objective function that evaluates the quality of a clustering,
we want to compute an assignment γ : D → {1, . . . , K} that minimizes (or, in other cases, maximizes) the objective function. In most cases, we also demand that γ is surjective, i.e., that none of the K clusters is empty. The objective function is often defined in terms of similarity or distance between documents.
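A minimal K-means sketch makes this objective concrete: it alternately assigns each document to its nearest centroid and recomputes centroids, locally minimizing the sum of squared document-to-centroid distances. The toy documents and the deterministic initialization below are illustrative assumptions:

```python
def kmeans(docs, k, iters=10):
    """Minimal K-means sketch: alternates an assignment step and an
    update step to (locally) minimize the sum of squared distances."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Simple deterministic initialization for the sketch; real
    # implementations use random restarts or K-means++.
    centroids = [docs[0], docs[-1]] if k == 2 else list(docs[:k])
    for _ in range(iters):
        # Assignment step: gamma(d) = index of the nearest centroid.
        clusters = [[] for _ in range(k)]
        for d in docs:
            clusters[min(range(k), key=lambda i: dist2(d, centroids[i]))].append(d)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return [min(range(k), key=lambda i: dist2(d, centroids[i])) for d in docs]

# Six toy 2-D "documents" forming two obvious groups (hypothetical values).
docs = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
assignment = kmeans(docs, k=2)
print(assignment)  # -> [0, 0, 0, 1, 1, 1]
```

The returned list is exactly an assignment γ : D → {1, . . . , K} (here 0-indexed), and no cluster ends up empty.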

For documents, the type of similarity we want is usually topic similarity, or high values on the same dimensions in the vector space model. For example, documents about China have high values on dimensions like Chinese, Beijing, and Mao, whereas documents about the UK tend to have high values for London, Britain, and Queen. We approximate topic similarity with cosine similarity or Euclidean distance in vector space. If we intend to capture similarity of a type other than topic, for example, similarity of language, then a different representation may be appropriate. When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters of English documents (in which "the" occurs frequently and "la" infrequently) and French documents (in which "the" occurs infrequently and "la" frequently).
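As a sketch of approximating topic similarity with cosine similarity, using hypothetical term counts over a few of the dimensions mentioned above:

```python
import math

# Toy term-frequency vectors over the dimensions (chinese, beijing, london).
# The counts are hypothetical, purely for illustration.
doc_china1 = (5, 3, 0)
doc_china2 = (4, 2, 1)
doc_uk     = (0, 0, 6)

def cosine(a, b):
    """Cosine similarity: dot product of the vectors divided by the
    product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(doc_china1, doc_china2))  # high: same topic, shared dimensions
print(cosine(doc_china1, doc_uk))      # 0.0: no dimensions in common
```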

Exhaustive and Non-Exhaustive Clustering
Some researchers distinguish between exhaustive clusterings, which assign each document to a cluster, and non-exhaustive clusterings, in which some documents will be assigned to no cluster. Non-exhaustive clusterings in which each document is a member of either no cluster or one cluster are called exclusive.

Evaluation of clustering
Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity (documents from different clusters are dissimilar). This is an internal criterion for the quality of a clustering.

Cardinality
A difficult issue in clustering is determining the number of clusters, or cardinality, of a clustering, which we denote by K. Often K is nothing more than a good guess based on experience or domain knowledge. But for K-means, we will also introduce a heuristic method for choosing K and an attempt to incorporate the selection of K into the objective function. Sometimes the application puts constraints on the range of K.
For example, the Scatter-Gather interface could not display more than about K = 10 clusters per layer because of the size and resolution of computer monitors in the early 1990s.
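One common heuristic of this kind is to run K-means for several values of K and look for the "knee" where the residual sum of squares (RSS) stops dropping sharply. A sketch, with 1-D toy data and a simplistic deterministic initialization (both illustrative assumptions):

```python
def rss(points, centroids):
    """Residual sum of squares: each point's squared distance
    to its nearest centroid, summed over all points."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

def kmeans_1d(points, k, iters=10):
    """Tiny 1-D K-means, just enough to compute centroids for the sketch."""
    centroids = list(points[:k])  # deterministic init for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]  # two obvious 1-D groups

for k in (1, 2, 3):
    print(k, round(rss(points, kmeans_1d(points, k)), 3))
# RSS drops sharply from K=1 to K=2, then flattens: K=2 is the "knee".
```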

Purity
To compute purity, each cluster is assigned to the class that is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N:

purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|

where Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes. We interpret ωk as the set of documents in ωk and cj as the set of documents in cj.
Normalised Mutual Information
High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
A measure that allows us to make this tradeoff is normalised mutual information, or NMI:

NMI(Ω, C) = I(Ω; C) / [(H(Ω) + H(C)) / 2]

where I is the mutual information between clusters and classes and H is entropy.
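Both measures can be sketched in a few lines. The cluster assignments and gold classes below are hypothetical:

```python
import math
from collections import Counter

# Cluster ids and gold classes for 6 hypothetical documents.
clusters = [0, 0, 0, 1, 1, 1]
classes  = ['A', 'A', 'B', 'B', 'B', 'B']
N = len(clusters)

def purity(clusters, classes):
    # Each cluster contributes the count of its most frequent class.
    total = 0
    for k in set(clusters):
        members = [c for cl, c in zip(clusters, classes) if cl == k]
        total += Counter(members).most_common(1)[0][1]
    return total / N

def entropy(labels):
    return -sum((n / len(labels)) * math.log(n / len(labels))
                for n in Counter(labels).values())

def nmi(clusters, classes):
    # I(Omega; C) normalised by (H(Omega) + H(C)) / 2.
    p_k, p_c = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    mi = sum((n / N) * math.log((n / N) / ((p_k[k] / N) * (p_c[c] / N)))
             for (k, c), n in joint.items())
    return mi / ((entropy(clusters) + entropy(classes)) / 2)

print(purity(clusters, classes))  # 5/6: one document of cluster 0 is "misfiled"
print(nmi(clusters, classes))     # strictly between 0 and 1 here
```

Note that splitting every document into its own cluster would drive purity to 1 but would be penalised by the H(Ω) term in the NMI denominator.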

Thanks for Listening