Dimensionality Reduction with Clustering

RibenzRijal, Jun 13, 2024


“Comparative study of dimensionality reduction techniques in combination with clustering algorithms for high-dimensional data analysis.”

Ribench Rijal
CDCSIT, TU
Roll no.: 09
Symbol no.: 146/071

Contents
• Introduction
• Problem Statement
• Objectives
• Dimensionality Reduction Techniques
• Clustering Algorithms
• Datasets
• Evaluating Parameters
• Implementation
• Results
• Conclusion
• Limitations and Future Recommendations
• References

Introduction: Dimensionality Reduction
• In this data-driven world, data is growing exponentially, and so is its complexity.
• Higher-dimensional data brings increased computational complexity and difficulties in visualization and interpretation.
• Dimensionality reduction transforms data from a higher-dimensional space into a lower-dimensional representation.
• It improves efficiency and reduces noise and redundant information.

Introduction: Clustering
• Groups similar data points together.
• A classic example of unsupervised learning.
• Used for pattern recognition, data exploration, outlier detection, etc.
• Algorithms may be distance-based, density-based, or graph-based.

Problem Statement
• Many different clustering and dimensionality reduction methods are in use.
• How does dimensionality reduction impact the performance of clustering algorithms?
• Which reduction technique suits which type of clustering algorithm?
• A thorough study is needed to answer these questions and suggest best practices.

Objectives
• To compare and evaluate the performance of different dimensionality reduction techniques in combination with clustering algorithms for high-dimensional data analysis.
• To assess the impact of dimensionality reduction on the quality of clustering results, using measures such as clustering accuracy, compactness, and separation.

Dimensionality Reduction Techniques
Component-based:
• PCA
• Factor Analysis
• Independent Component Analysis, and so on
Projection-based:
• ISOMAP
• t-SNE
• UMAP, and so on

Dimensionality Reduction Technique: PCA
• A linear technique that finds orthogonal projections capturing maximum variance. The new, uncorrelated variables are called principal components.
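
A minimal sketch of applying PCA with scikit-learn (the library listed later under Implementation); n_components=2 is an assumption for visualization, not a value from the slides:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

# Load the Wine data used in the study: 178 instances, 13 features
X, y = load_wine(return_X_y=True)

pca = PCA(n_components=2)             # assumed target dimensionality
X_reduced = pca.fit_transform(X)      # project onto the top principal components
print(pca.explained_variance_ratio_)  # fraction of variance each component captures
```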

Dimensionality Reduction Technique: t-SNE
• t-Distributed Stochastic Neighbor Embedding.
• Minimizes the divergence between probability distributions over point similarities in the high- and low-dimensional spaces.
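
A corresponding sketch with scikit-learn's TSNE; perplexity and random_state are illustrative defaults, not settings from the study:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)

# perplexity sets the effective neighborhood size; 30 is scikit-learn's default
tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
X_embedded = tsne.fit_transform(X)  # minimizes the KL divergence between the
                                    # high- and low-dimensional similarity distributions
```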

Dimensionality Reduction Technique: UMAP
• Uniform Manifold Approximation and Projection.
• Preserves both local and global structure.
• Models the data as a topological graph and optimizes a weighted cross-entropy loss.
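
A sketch assuming the third-party umap-learn package (UMAP is not part of scikit-learn); the parameter values are the library defaults, shown for illustration:

```python
from sklearn.datasets import load_iris
import umap  # the umap-learn package, assumed to be installed

X, _ = load_iris(return_X_y=True)

# n_neighbors trades local vs. global structure; min_dist controls how tightly
# points pack together in the embedding
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_embedded = reducer.fit_transform(X)
```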

Clustering Algorithms: DBSCAN
• Density-Based Spatial Clustering of Applications with Noise.
• Groups together data points that are close to each other.
• Does not require a predefined number of clusters.
• Robust to noise and outliers.
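
A minimal sketch with scikit-learn's DBSCAN; eps and min_samples here are the library defaults, chosen for illustration rather than taken from the study:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# eps (neighborhood radius) and min_samples normally need tuning per dataset
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# There is no n_clusters argument: the cluster count emerges from the density
# structure, and points labeled -1 are treated as noise rather than forced
# into a cluster.
```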

Clustering Algorithms: BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies.
• Highly scalable.
• Builds a CF (clustering feature) tree and clusters data points based on this initial tree.
• The global clusters are then refined for a better result.
• Efficient in both time and space.
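
A minimal sketch with scikit-learn's Birch; the threshold and branching_factor shown are the library defaults, not parameters reported in the study:

```python
from sklearn.cluster import Birch
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)

# threshold and branching_factor govern how the CF tree is built
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)  # builds the CF tree, then refines the global clusters
```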

Clustering Algorithms: Spectral
• Graph-based.
• Constructs a similarity graph and uses its eigenvalues and eigenvectors.
• Clustering is based on the spectral properties of the graph.
• Can handle complex data structures.
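
A hedged sketch using scikit-learn's SpectralClustering; n_clusters=3 and the RBF affinity are illustrative assumptions:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# affinity='rbf' builds the similarity graph; the eigenvectors of its Laplacian
# embed the points, and a final k-means step assigns the labels
sc = SpectralClustering(n_clusters=3, affinity='rbf', random_state=42)
labels = sc.fit_predict(X)
```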

Datasets
Wine Dataset
• Widely used public dataset.
• 13 features (dimensions), 178 instances.
• Moderate size.
• Known ground truth.
Iris Dataset
• Data on 3 iris flower species.
• 4 features, 150 instances.
• Small size.
• Balanced class distribution.
• Distinct clusters.
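
Both datasets are bundled with scikit-learn (the References cite the UCI originals), so a plausible way to load them:

```python
from sklearn.datasets import load_iris, load_wine

X_iris, y_iris = load_iris(return_X_y=True)  # 150 instances x 4 features, 3 balanced classes
X_wine, y_wine = load_wine(return_X_y=True)  # 178 instances x 13 features, 3 classes
```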

Evaluating Parameters
• Cluster accuracy
• Cluster purity
• Computation time
• Silhouette score
  • Ranges from -1 to 1; higher is better.
  • Indicates how distinct clusters are from neighboring clusters.
• Rand index
  • Ranges from 0 to 1; higher is better.
  • Measures the correctness of data point assignments to clusters w.r.t. the ground truth.
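
A sketch of computing both scores with scikit-learn; the Birch run is only an illustrative stand-in for whichever clustering is being evaluated:

```python
from sklearn.cluster import Birch
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score, silhouette_score

X, y = load_iris(return_X_y=True)
labels = Birch(n_clusters=3).fit_predict(X)

print(silhouette_score(X, labels))  # compactness/separation, range [-1, 1]
print(rand_score(y, labels))        # agreement with the ground truth, range [0, 1]
```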

Evaluating Parameters
[Figure: definitions/illustrations of the Silhouette Score and the Rand Index]
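
The original figure is not preserved, but the standard textbook definitions behind the two metrics are:

```latex
% Silhouette score of a point i, where a(i) is the mean intra-cluster distance
% and b(i) the mean distance to the nearest other cluster:
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

% Rand index, where a counts pairs grouped together in both partitions,
% b counts pairs separated in both, and n is the number of points:
RI = \frac{a + b}{\binom{n}{2}}
```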

Implementation
• Python programming.
• Various Python libraries: numpy, scikit-learn, matplotlib, pandas, etc.
• Tools: Jupyter Notebook, PyCharm, etc.
• Operating systems: macOS, Windows.
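
How one reduction-plus-clustering pipeline might have been wired together with these libraries; this is a plausible reconstruction, not the study's actual code:

```python
from sklearn.cluster import Birch
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import rand_score, silhouette_score

# One PCA + BIRCH run, end to end
X, y = load_wine(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
labels = Birch(n_clusters=3).fit_predict(X_2d)

print("silhouette:", silhouette_score(X_2d, labels))
print("rand index:", rand_score(y, labels))
```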

Results: Execution Time
[Bar chart: time for dimensionality reduction (ms, axis 0-800) on the Iris and Wine datasets, for PCA, t-SNE, and UMAP]

Results: Execution Time
• Time taken to cluster the data (milliseconds):

| Reduction | Iris: BIRCH | Iris: Spectral | Iris: DBSCAN | Wine: BIRCH | Wine: Spectral | Wine: DBSCAN |
|-----------|-------------|----------------|--------------|-------------|----------------|--------------|
| None      | 1.51        | 271.79         | 3.93         | 4.3         | 300.09         | 1.43         |
| PCA       | 0.98        | 310.01         | 2.6          | 2.97        | 1075.5         | 0.4          |
| t-SNE     | 3.31        | 317.13         | 2.41         | 6.64        | 283.82         | 0.42         |
| UMAP      | 1.86        | 274.16         | 3.83         | 2.14        | 301.06         | 1.38         |
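
The slides do not show how the timings were collected; one straightforward way such millisecond measurements could be produced:

```python
import time

from sklearn.cluster import Birch
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

start = time.perf_counter()
Birch(n_clusters=3).fit_predict(X)
elapsed_ms = (time.perf_counter() - start) * 1000  # milliseconds, matching the table's unit
print(f"{elapsed_ms:.2f} ms")
```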

Results: Rand Indices (Purity)
[Bar charts: Iris Rand Index w.r.t. GT (None, PCA, t-SNE, UMAP; axis 0-0.8) and Iris Rand Index w.r.t. IC (PCA, t-SNE, UMAP; axis 0-1.2); series: BIRCH, Spectral, DBSCAN]

Results: Rand Indices (Purity)
[Bar charts: Wine Rand Index w.r.t. IC (PCA, t-SNE, UMAP; axis 0-1.2) and Wine Rand Index w.r.t. GT (None, PCA, t-SNE, UMAP; axis 0-0.45); series: BIRCH, Spectral, DBSCAN]

Results: Silhouette Score (Compactness)
[Bar chart: Silhouette scores for PCA, t-SNE, and UMAP on the Iris and Wine datasets (axis -1.5 to 2.5); series: BIRCH, Spectral, DBSCAN]

Results: Clusters (Iris Data)
[Figure: cluster visualizations on the Iris data]

Results: Clusters (Wine Data)
[Figure: cluster visualizations on the Wine data]

Results: Key Findings
• PCA is the fastest among the selected techniques and performs very well on large datasets.
• UMAP combined with the Spectral or BIRCH algorithms produced more distinct and compact clusters.
• UMAP is a slight improvement over t-SNE in terms of speed and robustness of the clusters.
• The BIRCH algorithm consistently performs better after reduction.
• UMAP improves cluster quality.
• In some cases, t-SNE and UMAP even improve cluster purity compared with clustering the data without reducing its dimensions.
• The UMAP-BIRCH combination is better for smaller datasets, while the PCA-BIRCH combination is more practical for larger datasets.

Conclusion
• This study provides valuable insights into the fusion of dimensionality reduction techniques and clustering algorithms for high-dimensional data analysis.
• The UMAP-BIRCH combination excels on smaller datasets, exhibiting high clustering quality, while the PCA-BIRCH combination emerges as a pragmatic choice for larger datasets due to its computational efficiency.

Limitations and Future Recommendations
• Small data sizes: a wider range of datasets can be selected.
• Data normalization was not considered: this preprocessing step can be included in further studies.
• Only tabular data was used: studying how the techniques behave on other data modalities would extend the applicability to a wider range of real-world scenarios.
• Future research could systematically evaluate different dimensionality settings to identify the most informative representation for clustering, leading to more informed decisions in practical applications.
• A wider range of dimensionality reduction techniques and clustering algorithms can be covered.

References
• C. C. Aggarwal, Data Mining: The Textbook, Springer, 2015.
• Z. Zhang, "Dimensionality Reduction: A Comparative Review," Big Data Clustering: Algorithms, Applications, and Challenges, pp. 71-113, 2016.
• W. Wu, S. Ma and F. Zhou, "A Review on Dimensionality Reduction of High-Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 12, pp. 2341-2357, 2019.
• A. Asif, C. Zheng, Z. Teng and S. Xie, "Dimensionality Reduction Techniques for High-Dimensional Data Visualization," Applied Sciences, vol. 11, no. 12, p. 5645, 2021.
• R. Fisher, "Iris - UCI Machine Learning Repository," [Online]. Available: https://archive.ics.uci.edu/dataset/53/iris. [Accessed 29 June 2023].
• S. Aeberhard and M. Forina, "Wine - UCI Machine Learning Repository," [Online]. Available: https://archive.ics.uci.edu/dataset/109/wine. [Accessed 22 June 2023].

Thank You!