Types of data in cluster analysis and Constraints

keerthunarasu 16 views 20 slides Sep 14, 2025


Types of data in cluster analysis, constraint-based cluster analysis, and outlier analysis
Presented by N. Siva Keerthana, M.Sc. Computer Science, Department of Computer Science, Nadar Saraswathi College of Arts and Science

Cluster analysis
Cluster analysis in data mining is an unsupervised learning technique that groups similar data points into clusters based on shared characteristics. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group, so clustering may be considered a form of data compression.

Requirements
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects, but a large database may contain millions of objects.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data involving only two or three dimensions.

Interpretability: Users expect clustering results to be interpretable, comprehensible, and usable. Clustering may need to be tied to specific semantic interpretations and applications.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. Such a clustering has to respect practical constraints, for example obstacles such as rivers and highways.

Types of data in cluster analysis
Cluster analysis works with several types of data, primarily interval-scaled data (continuous numerical values), binary data (yes/no or 0/1), and categorical data (nominal and ordinal, each handled differently). The variable types are:
Interval-scaled variables
Binary variables
Categorical variables
Ordinal variables
Ratio-scaled variables

Interval-scaled variables
An interval-scaled variable is numerical data in which the order of values matters and the intervals (distances) between consecutive values are equal and meaningful, but which lacks a true, absolute zero point. You can add and subtract values to find differences (e.g., one temperature is 10 degrees hotter than another). You cannot meaningfully multiply or divide them, because a value of zero on an interval scale is arbitrary rather than indicating a complete absence of the measured quantity.
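The temperature example can be checked with a short sketch. It assumes the standard Celsius-to-Fahrenheit conversion; the helper name `celsius_to_fahrenheit` and the sample values are illustrative only. Differences survive a change of unit (up to the constant scale factor 9/5), while ratios do not, which is why "twice as hot" has no meaning on an interval scale.

```python
def celsius_to_fahrenheit(c):
    """Standard unit conversion between two interval scales."""
    return c * 9.0 / 5.0 + 32.0

# Two illustrative temperatures (assumed values, not from the slides).
a_c, b_c = 10.0, 20.0
a_f, b_f = celsius_to_fahrenheit(a_c), celsius_to_fahrenheit(b_c)

# Differences are meaningful: they only change by the fixed factor 9/5.
diff_c = b_c - a_c          # 10 degrees Celsius
diff_f = b_f - a_f          # 18 degrees Fahrenheit = 10 * 9/5

# Ratios are not meaningful: they depend on where the arbitrary zero sits.
ratio_c = b_c / a_c         # 2.0 in Celsius
ratio_f = b_f / a_f         # 68 / 50 in Fahrenheit, not 2.0

print(diff_c, diff_f, ratio_c, ratio_f)
```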

diagram

Binary variables
A binary variable is a categorical variable that can take only two possible values, often represented as 0 or 1, True or False, or "yes" or "no". These variables are used in many fields to indicate the presence or absence of a characteristic, and they form the basis of binary classification and digital computing.
Applications: They are used in machine learning (for tasks like image recognition), medical diagnosis, and other analytical contexts where a choice between two options is needed.
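The slides do not say how two objects described by binary variables are compared. One common choice, shown here as a sketch rather than as the presenter's method, is the simple matching dissimilarity for symmetric variables and the Jaccard dissimilarity for asymmetric ones (where a shared 0 carries no information). The function name `binary_dissimilarity` and the sample vectors are hypothetical.

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two equal-length 0/1 vectors.

    symmetric=True  -> simple matching: mismatches / all variables
    symmetric=False -> Jaccard: mismatches / variables that are not 0 in both
    """
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    mismatches = len(x) - both_one - both_zero
    if symmetric:
        return mismatches / len(x)
    denominator = len(x) - both_zero
    # If every variable is 0 in both objects, treat them as identical.
    return mismatches / denominator if denominator else 0.0

# Hypothetical presence/absence records for two patients.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(binary_dissimilarity(patient_a, patient_b, symmetric=True))
print(binary_dissimilarity(patient_a, patient_b, symmetric=False))
```

The asymmetric form is usually preferred when a 1 is rare and informative (for example a positive test result), since agreeing on 0 should not make two objects look similar.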

Categorical variables
A categorical variable represents qualitative data and takes one of a limited number of distinct categories or labels rather than a numerical value. These variables describe a "quality" or "characteristic" and are further classified as either nominal (no inherent order, like eye color) or ordinal (can be logically ranked or ordered, like shoe size).
Types of categorical variables: nominal variables and ordinal variables.
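As a sketch of how the nominal/ordinal distinction affects clustering, the code below uses two common conventions that the slides do not spell out: nominal variables are compared by the fraction of mismatching attributes, and ordinal values are replaced by their rank scaled to the range 0 to 1 so that they can be treated like interval-scaled values. All names and sample data are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Fraction of nominal attributes on which two objects disagree."""
    if len(x) != len(y) or not x:
        raise ValueError("objects must have the same non-zero number of attributes")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return (len(x) - matches) / len(x)

def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value to [0, 1] using its rank among the ordered levels."""
    if len(ordered_levels) < 2:
        raise ValueError("need at least two levels")
    rank = ordered_levels.index(value)  # 0-based rank; raises if value is unknown
    return rank / (len(ordered_levels) - 1)

# Nominal: eye color, hair color, blood group (no order among the labels).
person_a = ("blue", "brown", "A")
person_b = ("blue", "black", "A")
print(nominal_dissimilarity(person_a, person_b))

# Ordinal: the order of the levels matters, the gaps between them are unknown.
levels = ["low", "medium", "high"]
print([ordinal_to_unit_interval(v, levels) for v in levels])
```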

Constraint-based cluster analysis
Constraint-based cluster analysis uses user-defined constraints to guide the clustering process. These constraints may specify relationships between data points, such as which points should or should not be in the same cluster. In healthcare, for example, clustering patient data might take into account both genetic factors and lifestyle choices: the constraints specify that patients with similar genetic backgrounds should be grouped together, while lifestyle choices are used to refine the clusters.

diagram

Constraints on individual objects
We can specify constraints on the objects to be clustered. These constraints confine the set of objects to be clustered. They can easily be handled by preprocessing (e.g., performing a selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.
Constraints on the selection of clustering parameters
A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself.
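The preprocessing idea for constraints on individual objects can be sketched with Python's built-in `sqlite3` module. The table `customers`, its columns, and the rows are made up for illustration; the point is only that the constraint is applied by a `SELECT ... WHERE` before any clustering happens.

```python
import sqlite3

# Hypothetical customer table held in an in-memory database.
rows = [
    (1, 2.0, 3.0, "north"),
    (2, 8.5, 1.0, "south"),
    (3, 2.5, 3.5, "north"),
    (4, 9.0, 1.5, "south"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, x REAL, y REAL, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# The object-level constraint: only customers in the 'north' region are clustered.
selected = conn.execute(
    "SELECT id, x, y FROM customers WHERE region = ? ORDER BY id",
    ("north",),
).fetchall()
conn.close()

# 'selected' can now be passed to any ordinary (unconstrained) clustering algorithm.
print(selected)
```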

User-constrained cluster analysis
A user may like to specify desired features of the resulting clusters, which can strongly influence the clustering process. For example: each station must serve a minimum of 100 high-value customers, and each station must serve a minimum of 5,000 ordinary customers. Constraint-based clustering takes such constraints into account during the clustering procedure.
Semi-supervised cluster analysis
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This can take the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a constrained clustering process is known as semi-supervised clustering.
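Pairwise constraints are often written as "must-link" pairs (same cluster) and "cannot-link" pairs (different clusters); these terms are an assumption here, since the slides only describe the idea. The sketch below does not perform clustering. It only counts how many constraints a given cluster labeling violates, which is the kind of check a semi-supervised algorithm would apply while assigning points. The function name and sample data are hypothetical.

```python
def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints violated by a cluster assignment.

    labels      -- labels[i] is the cluster id of object i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    """
    broken_must = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken_cannot = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken_must + broken_cannot

# Four objects assigned to two clusters.
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]   # (1, 2) is violated: clusters 0 and 1
cannot_link = [(2, 3)]         # violated: both objects are in cluster 1

print(count_violations(labels, must_link, cannot_link))
```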

Outlier analysis
Outlier analysis is the process of identifying and examining data points that deviate significantly from the expected patterns or norms within a dataset. An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors, but an outlier should not automatically be dismissed as noise or error. Instead, it is suspected of not being generated by the same mechanism as the rest of the data objects. The analysis of outlier data is referred to as outlier analysis or outlier mining.

diagram

Statistical distribution-based outlier detection
Statistical distribution-based outlier detection assumes a distribution model for the data and identifies as outliers the data points that deviate strongly from it, for example points that lie far from the mean relative to the spread of the data. Methods include the Z-score, which measures how many standard deviations a point is from the mean, and the interquartile range (IQR), which uses quartiles to define a range outside of which points are treated as outliers.
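Both methods can be sketched with Python's standard `statistics` module. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than values given in the slides, and the sample data are made up. Note that on a small sample a single extreme value inflates the standard deviation, so the Z-score of that value stays below 3; the example therefore also calls the function with a lower threshold.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []                    # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious reading.
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(zscore_outliers(data))                 # default threshold of 3
print(zscore_outliers(data, threshold=2.5))  # lower threshold for a small sample
print(iqr_outliers(data))
```

The IQR rule depends only on the quartiles, so it is less affected by the extreme value it is trying to detect than the Z-score is.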

diagram

Distance-based outlier detection
Distance-based outlier detection identifies a data point as an outlier if it has too few neighboring points within a specified radius or distance threshold. Distance-based methods scale better to multi-dimensional space and can be computed more efficiently than statistical-based methods. Identifying distance-based outliers is an important and useful data mining activity. Distance-based methods usually depend on a multi-dimensional index, which is used to retrieve the neighborhood of each object.
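The definition above can be written down directly as a brute-force sketch. It compares every pair of points, so it costs quadratic time; the multi-dimensional index mentioned in the slides would replace the inner neighbor search in a real implementation. The function name, the parameters `radius` and `min_neighbors`, and the sample points are assumptions for illustration.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Return points with fewer than `min_neighbors` other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius  # Euclidean distance
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

# Four points close together and one far away (made-up data).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (8.0, 8.0)]

print(distance_based_outliers(points, radius=1.0, min_neighbors=2))
```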

diagram

THANK YOU