PPT_Module_2_suruchi presentation notes



Slide Content

MODULE-2: EXPLORATORY DATA ANALYSIS AND THE DATA SCIENCE PROCESS

WHAT IS EDA? The analysis of datasets using various numerical methods and graphical tools: exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange structures. It facilitates discovering the unexpected as well as confirming the expected. Another definition: an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical).


AIM OF EDA
- Maximize insight into a dataset
- Uncover underlying structure
- Extract important variables
- Detect outliers and anomalies
- Test underlying assumptions
- Develop valid models
- Determine optimal factor settings (Xs)

AIM OF EDA The goal of EDA is to explore data open-mindedly. Tukey: EDA is detective work; unless the detective finds the clues, the judge or jury has nothing to consider. Here, the judge or jury is confirmatory data analysis. Tukey: confirmatory data analysis goes further, assessing the strength of the evidence. With EDA, we examine the data and try to understand the meaning of the variables and what the abbreviations stand for.

Exploratory vs Confirmatory Data Analysis
EDA: no hypothesis at first; generates hypotheses; uses (mostly) graphical methods.
CDA: starts with a hypothesis; tests the null hypothesis; uses statistical models.

STEPS OF EDA
- Generate good research questions.
- Data restructuring: you may need to make new variables from the existing ones (e.g., rates or percentages instead of two raw variables; dummy variables for categorical variables).
- Based on the research questions, use appropriate graphical tools and obtain descriptive statistics.
- Try to understand the data structure, relationships, anomalies and unexpected behaviors.
- Try to identify confounding variables, interaction relations and multicollinearity, if any.
- Handle missing observations.
- Decide on the need for transformation (of the response and/or explanatory variables).
- Decide on the hypotheses based on your research questions.

AFTER EDA Confirmatory Data Analysis: verify the hypotheses by statistical analysis, draw conclusions and present your results clearly.

Classification of EDA* Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate). Non-graphical methods generally involve calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way. Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA. *Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf

EXAMPLE 1 Data from the Places Rated Almanac (Boyer and Savageau, 1985): 9 variables for 329 metropolitan areas in the USA
- Climate mildness
- Housing cost
- Health care and environment
- Crime
- Transportation supply
- Educational opportunities and effort
- Arts and culture facilities
- Recreational opportunities
- Personal economic outlook
plus the latitude and longitude of each city.
Questions: How is climate related to location? Are there clusters in the data (excluding location)? Are nearby cities similar? Is there any relation between economic outlook and crime? What else?

EXAMPLE 2 In breast cancer research, the main questions of interest might be: Does any treatment method result in a higher survival rate? Can a particular treatment be suggested to a woman with specific characteristics? Is there any difference between patients in terms of survival rates (e.g., are white women more likely to survive than black women at the same stage of disease)?

EXAMPLE 3 In a project investigating the well-being of teenagers after an economic hardship, the main questions can be: Is there a positive (and significant) effect of economic problems on distress? Which other factors are most related to the distress of teenagers (e.g., age, gender, ...)?

EXAMPLE 4* New cancer cases in the U.S. based on a cancer registry. The rows in the registry are called observations; they correspond to individuals. The columns are variables or data fields; they correspond to attributes of the individuals. *https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf

Examples of Variables
Identifier(s): patient number; visit number or measurement date (if measured more than once).
Attributes at study start (baseline): enrollment date; demographics (age, BMI, etc.); prior disease history, labs, etc.; assigned treatment or intervention group; outcome variable.
Attributes measured at subsequent times: any variables that may change over time; outcome variable.

Data Types and Measurement Scales Variables may be one of several types, and have a defined set of valid values. Two main classes of variables are:
Continuous variables (quantitative, numeric). Continuous data can be rounded or "binned" to create categorical data.
Categorical variables (discrete, qualitative). Some categorical variables (e.g. counts) are sometimes treated as continuous.

Categorical Data
Unordered categorical data (nominal):
- 2 possible values (binary or dichotomous), e.g. gender, alive/dead, yes/no.
- More than 2 possible values with no order to the categories, e.g. marital status, religion, country of birth, race.
Ordered categorical data (ordinal):
- Ratings or preferences
- Cancer stage
- Quality of life scales, the National Cancer Institute's NCI Common Toxicity Criteria (severity grades 1-5)
- Number of copies of a recessive gene (0, 1 or 2)

EDA Part 2: Summarizing Data With Tables and Plots Examine the entire data set using basic techniques before starting a formal statistical analysis: familiarize yourself with the data, find possible errors and anomalies, and examine the distribution of values for each variable.

Summarizing Variables
Categorical variables: frequency tables (how many observations in each category?), relative frequency tables (percent in each category), bar charts and other plots.
Continuous variables: bin the observations (create categories, e.g. 0-10, 11-20, etc.) and then treat them as ordered categorical, or use plots specific to continuous variables.
The goal for both categorical and continuous data is data reduction while preserving/extracting key information about the process under investigation.

Categorical Data Summaries - Tables Cancer site is a variable taking 5 values. Is it categorical or continuous? Ordered or unordered?

Frequency Table: categories with counts. Relative Frequency Table: percentage in each category.

Graphing a Frequency Table - Bar Chart: plot the number of observations in each category.

Continuous Data - Tables Example: ages of 10 adult leukemia patients: 35, 40, 52, 27, 31, 42, 43, 28, 50, 35. One option is to group these ages into decades and create a categorical age variable.

We can then create a frequency table for this new categorical age variable.
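
A minimal pandas sketch of this binning step, using the ten ages from the slide (the decade bin edges and labels are chosen here for illustration):

```python
import pandas as pd

# Ages of the 10 adult leukemia patients from the slide
ages = pd.Series([35, 40, 52, 27, 31, 42, 43, 28, 50, 35], name="age")

# Bin the continuous ages into decades to create an ordered categorical variable
age_decade = pd.cut(ages, bins=[20, 30, 40, 50, 60],
                    labels=["21-30", "31-40", "41-50", "51-60"])

# Frequency and relative frequency tables
freq = age_decade.value_counts(sort=False)
rel_freq = age_decade.value_counts(sort=False, normalize=True) * 100
print(pd.DataFrame({"count": freq, "percent": rel_freq}))
```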

Continuous data - plots A histogram is a bar chart constructed using the frequencies or relative frequencies of a grouped (or "binned") continuous variable. It discards some information (the exact values), retaining only the frequencies in each "bin".

Age histogram of 10 adult leukemia patients.
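
A matching matplotlib sketch of such a histogram (bin edges again chosen for illustration):

```python
import matplotlib.pyplot as plt

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

# Histogram of the 10 ages using decade-wide bins
plt.hist(ages, bins=[20, 30, 40, 50, 60], edgecolor="black")
plt.xlabel("Age (years)")
plt.ylabel("Frequency")
plt.title("Age histogram of 10 adult leukemia patients")
plt.show()
```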

Data Science Process The data science process typically involves several key steps:
- Problem Formulation: understanding the business problem or question that needs to be addressed and formulating it into a clear and well-defined data science problem.
- Data Acquisition: gathering the relevant data required to solve the problem. This data can come from various sources such as databases, APIs, files and web scraping. Ensuring data quality and addressing issues such as missing values or inconsistencies is important in this step.
- Data Preparation and Exploratory Data Analysis (EDA): described on the next slide.

Data Science Process … contd.
- Data Preparation: involves cleaning, preprocessing, and transforming the raw data into a format suitable for analysis. Tasks in this step may include handling missing values, encoding categorical variables, scaling numerical features, and splitting the data into training and testing sets.
- Exploratory Data Analysis (EDA): the process of analyzing and visualizing data to gain insights and understand patterns, trends, and relationships within the data (statistics, data visualization, and correlation analysis).
- Feature Engineering: involves creating new features or transforming existing features to improve the performance of machine learning models (feature scaling, dimensionality reduction, or creating interaction terms).
- Model Selection and Training: appropriate machine learning or statistical models are selected based on the nature of the problem and the characteristics of the data. The selected models are then trained on the training data using appropriate algorithms and techniques.

Data Science Process … contd.
- Model Evaluation: once the models are trained, they need to be evaluated to assess their performance and generalization ability, using evaluation metrics such as accuracy, precision, recall, F1-score, or ROC-AUC.
- Model Tuning: optimizing the hyperparameters of the selected models to improve their performance further. Grid search, random search and Bayesian optimization are commonly used for hyperparameter tuning (a grid-search sketch follows below).
- Deployment and Monitoring: once a satisfactory model is obtained, it can be deployed into production to make predictions on new data.
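
As an illustration of the tuning step, the sketch below runs a grid search with scikit-learn; the estimator, parameter grid, and synthetic data are assumptions made for the example, not taken from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data stands in for a real problem
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Grid search over the number of neighbors, scored by cross-validated accuracy
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_test, y_test))
```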

Data Science Process … contd. Documentation and Communication : Throughout the entire data science process, documentation plays a crucial role in maintaining transparency, reproducibility, and knowledge sharing. Results, insights, methodologies, and findings should be well-documented and effectively communicated to stakeholders.

Tutorial Content
- Overview of Python libraries for data scientists
- Reading data; selecting and filtering the data; data manipulation, sorting, grouping, rearranging
- Plotting the data
- Descriptive statistics
- Inferential statistics

Python Libraries for Data Science Many popular Python toolboxes/libraries: NumPy, SciPy, Pandas, Scikit-learn. Visualization libraries: matplotlib, Seaborn, and many more …

Python Libraries for Data Science NumPy: introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects; provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance; many other Python libraries are built on NumPy. Link: http://www.numpy.org/

Python Libraries for Data Science SciPy: collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more; part of the SciPy Stack; built on NumPy. Link: https://www.scipy.org/scipylib/

Python Libraries for Data Science Pandas: adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R); provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.; allows handling missing data. Link: http://pandas.pydata.org/

Python Libraries for Data Science Scikit-learn: provides machine learning algorithms: classification, regression, clustering, model validation etc.; built on NumPy, SciPy and matplotlib. Link: http://scikit-learn.org/

Python Libraries for Data Science matplotlib: Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats; a set of functionalities similar to those of MATLAB; line plots, scatter plots, bar charts, histograms, pie charts etc.; relatively low-level, so some effort is needed to create advanced visualizations. Link: https://matplotlib.org/

Python Libraries for Data Science Seaborn: based on matplotlib; provides a high-level interface for drawing attractive statistical graphics; similar (in style) to the popular ggplot2 library in R. Link: https://seaborn.pydata.org/

Login to the Shared Computing Cluster Use your SCC login information if you have an SCC account. If you are using a tutorial account, see the info on the blackboard. Note: your password will not be displayed while you enter it.

Selecting Python Version on the SCC
# view available python versions on the SCC
[scc1 ~] module avail python
# load python 3 version
[scc1 ~] module load python/3.6.2

Download tutorial notebook
# On the Shared Computing Cluster
[scc1 ~] cp /project/scv/examples/python/data_analysis/dataScience.ipynb .
# On a local computer save the link:
http://rcs.bu.edu/examples/python/data_analysis/dataScience.ipynb

Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook

Loading Python Libraries
In [ ]:
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
Press Shift+Enter to execute the Jupyter cell.

Reading data using pandas
In [ ]:
#Read csv file
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")
There are a number of pandas commands to read other data formats:
pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5', 'df')
Note: the command above has many optional arguments to fine-tune the data import process.

Exploring data frames
In [3]:
#List first 5 records
df.head()
Out[3]:

Data Frame data types
- object (native Python type: string) - the most general dtype; assigned to a column if the column has mixed types (numbers and strings).
- int64 (native Python type: int) - numeric values; 64 refers to the memory allocated to hold the value.
- float64 (native Python type: float) - numeric values with decimals; if a column contains numbers and NaNs (see below), pandas will default to float64, in case a missing value has a decimal.
- datetime64, timedelta[ns] (no native equivalent, but see the datetime module in Python's standard library) - values meant to hold time data; look into these for time series experiments.

Data Frame data types
In [4]:
#Check a particular column type
df['salary'].dtype
Out[4]: dtype('int64')
In [5]:
#Check types for all the columns
df.dtypes
Out[5]:
rank          object
discipline    object
phd            int64
service        int64
sex           object
salary         int64
dtype: object

Data Frames attributes Python objects have attributes and methods.
df.attribute - description
dtypes - list the types of the columns
columns - list the column names
axes - list the row labels and column names
ndim - number of dimensions
size - number of elements
shape - return a tuple representing the dimensionality
values - numpy representation of the data

Data Frames methods
df.method() - description
head([n]), tail([n]) - first/last n rows
describe() - generate descriptive statistics (for numeric columns only)
max(), min() - return max/min values for all numeric columns
mean(), median() - return mean/median values for all numeric columns
std() - standard deviation
sample([n]) - returns a random sample of the data frame
dropna() - drop all the records with missing values
Unlike attributes, Python methods have parentheses. All attributes and methods can be listed with the dir() function: dir(df)

Selecting a column in a Data Frame
Method 1: subset the data frame using the column name: df['sex']
Method 2: use the column name as an attribute: df.sex
Note: there is an attribute rank for pandas data frames, so to select a column with the name "rank" we should use method 1.

Data Frames groupby method Using the "group by" method we can: split the data into groups based on some criteria; calculate statistics (or apply a function) to each group. Similar to the dplyr package in R.
In [ ]:
#Group data using rank
df_rank = df.groupby(['rank'])
In [ ]:
#Calculate mean value for each numeric column per each group
df_rank.mean()

Data Frames groupby method Once the groupby object is created, we can calculate various statistics for each group:
In [ ]:
#Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()
Note: if single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object. When double brackets are used, the output is a Data Frame.

Data Frames groupby method groupby performance notes: no grouping/splitting occurs until it's needed - creating the groupby object only verifies that you have passed a valid mapping; by default the group keys are sorted during the groupby operation, so you may want to pass sort=False for a potential speedup:
In [ ]:
#Calculate mean salary for each professor rank:
df.groupby(['rank'], sort=False)[['salary']].mean()

Data Frame: filtering To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, to subset the rows in which the salary value is greater than $120K:
In [ ]:
#Select rows with salary greater than 120,000:
df_sub = df[ df['salary'] > 120000 ]
In [ ]:
#Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
Any Boolean operator can be used to subset the data: > greater; < less; == equal; >= greater or equal; <= less or equal; != not equal.

Data Frames: Slicing There are a number of ways to subset the Data Frame: one or more columns; one or more rows; a subset of rows and columns. Rows and columns can be selected by their position or label.

Data Frames: Slicing When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame):
In [ ]:
#Select column salary:
df['salary']
When we need to select more than one column and/or make the output a DataFrame, we should use double brackets:
In [ ]:
#Select columns rank and salary:
df[['rank', 'salary']]

Data Frames: Selecting rows If we need to select a range of rows, we can specify the range using ":":
In [ ]:
#Select rows by their position:
df[10:20]
Notice that the first row has position 0, and the last value in the range is omitted: for the 0:10 range the first 10 rows are returned, with positions starting at 0 and ending at 9.

Data Frames: method loc If we need to select a range of rows using their labels, we can use the method loc:
In [ ]:
#Select rows by their labels:
df_sub.loc[10:20, ['rank', 'sex', 'salary']]
Out[ ]:

Data Frames: method iloc If we need to select a range of rows and/or columns using their positions, we can use the method iloc:
In [ ]:
#Select rows and columns by their positions:
df_sub.iloc[10:20, [0, 3, 4, 5]]
Out[ ]:

Data Frames: method iloc (summary)
df.iloc[0]              # first row of a data frame
df.iloc[i]              # (i+1)th row
df.iloc[-1]             # last row
df.iloc[:, 0]           # first column
df.iloc[:, -1]          # last column
df.iloc[0:7]            # first 7 rows
df.iloc[:, 0:2]         # first 2 columns
df.iloc[1:3, 0:2]       # second through third rows and first 2 columns
df.iloc[[0, 5], [1, 3]] # 1st and 6th rows and 2nd and 4th columns

Data Frames: Sorting We can sort the data by a value in a column. By default the sorting occurs in ascending order and a new data frame is returned.
In [ ]:
# Create a new data frame from the original sorted by the column service
df_sorted = df.sort_values(by='service')
df_sorted.head()
Out[ ]:

Data Frames: Sorting We can sort the data using 2 or more columns:
In [ ]:
df_sorted = df.sort_values(by=['service', 'salary'], ascending=[True, False])
df_sorted.head(10)
Out[ ]:

Missing Values Missing values are marked as NaN.
In [ ]:
# Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")
In [ ]:
# Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()
Out[ ]:

Missing Values There are a number of methods to deal with missing values in the data frame:
dropna() - drop missing observations
dropna(how='all') - drop observations where all cells are NA
dropna(axis=1, how='all') - drop a column if all its values are missing
dropna(thresh=5) - drop rows that contain fewer than 5 non-missing values
fillna(0) - replace missing values with zeros
isnull() - returns True if the value is missing
notnull() - returns True for non-missing values

Missing Values
- When summing the data, missing values are skipped (treated as zero).
- If all values are missing, the sum is NaN in older pandas versions (recent versions return 0 unless min_count is set).
- cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays.
- Missing values in the GroupBy method are excluded (just like in R).
- Many descriptive statistics methods have a skipna option to control whether missing data should be excluded. This value is set to True by default (unlike R).
A small sketch of these behaviors follows.
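
A small pandas sketch of these behaviors; the toy data frame below is an assumption for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, 6.0]})

print(df["a"].sum())              # 4.0 - NaN is skipped (skipna=True by default)
print(df["a"].sum(skipna=False))  # nan - missing values propagate when skipna=False
print(df["a"].cumsum())           # NaN is ignored but kept in place in the result
print(df.dropna())                # drop rows containing any missing value
print(df.fillna(0))               # replace missing values with zeros
```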

Aggregation Functions in Pandas Aggregation - computing a summary statistic about each group, e.g. compute group sums or means, or compute group sizes/counts. Common aggregation functions: min, max; count, sum, prod; mean, median, mode, mad; std, var.

Aggregation Functions in Pandas The agg() method is useful when multiple statistics are computed per column:
In [ ]:
flights[['dep_delay', 'arr_delay']].agg(['min', 'mean', 'max'])
Out[ ]:

Basic Descriptive Statistics
describe - basic statistics (count, mean, std, min, quantiles, max)
min, max - minimum and maximum values
mean, median, mode - arithmetic average, median and mode
var, std - variance and standard deviation
sem - standard error of the mean
skew - sample skewness
kurt - kurtosis

Regression Regression models the relationship between a dependent variable and one or more independent variables. Types: Simple Regression and Multiple Regression. Simple regression: the value of Y changes corresponding to the value of a single X (Y <- X). Multiple regression: Y changes corresponding to the values of two or more predictors X1, X2, … (Y <- X1, X2, …). Regression can also be linear or non-linear.

Regression …contd Basic outline of how linear regression works with numerical data:
- Data Collection
- Data Exploration and Visualization
- Model Building: build the linear regression model. The simple linear regression equation for one independent variable is Y = β0 + β1·X + ε
- Parameter Estimation
- Model Evaluation
- Prediction
- Interpretation

Regression …contd Step 1: Data Collection. Example: let's say we have collected data on the number of hours studied (X) and the corresponding test scores (Y) for a group of students.

Regression …contd Step 2: Data Exploration and Visualization - e.g. summary statistics and a scatter plot of hours studied against test scores (a sketch follows below).
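
A sketch of Steps 1-2 with hypothetical hours-studied/test-score values (the numbers are illustrative, not taken from the slides):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hours-studied / test-score data
data = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 74, 79, 83],
})

# Step 2: explore and visualize the relationship between X (hours) and Y (score)
print(data.describe())
plt.scatter(data["hours"], data["score"])
plt.xlabel("Hours studied (X)")
plt.ylabel("Test score (Y)")
plt.title("Hours studied vs. test score")
plt.show()
```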

Regression …contd Step 3: Model Building - fit the simple linear regression model Y = β0 + β1·X + ε introduced above.

Regression …contd Step 4: Parameter Estimation - use the Ordinary Least Squares (OLS) method to estimate the coefficients β0 and β1. Step 5: Model Evaluation - after fitting the model, evaluate its performance using metrics such as R² (coefficient of determination), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), etc. Step 6: Prediction - after the model is built, use it to predict test scores for new values of hours studied. Step 7: Interpretation - interpret the coefficients to understand the relationship between hours studied and test scores. Proceed using Python, as outlined on the following slides (an OLS sketch follows below).
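
A minimal NumPy sketch of Steps 4-5 using the closed-form OLS estimates β1 = Σ(x-x̄)(y-ȳ)/Σ(x-x̄)² and β0 = ȳ - β1·x̄, again on the hypothetical data:

```python
import numpy as np

# Hypothetical data: hours studied (X) and test scores (Y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74, 79, 83], dtype=float)

# Ordinary least squares estimates for Y = b0 + b1*X + error
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Step 5: evaluate the fit with R^2 and MSE
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
mse = ss_res / len(y)
print(f"b0={b0:.2f}, b1={b1:.2f}, R^2={r2:.3f}, MSE={mse:.2f}")
```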

Regression …contd Step 1: Import Libraries and Load Data

Regression …contd Step 2: Split Data into Training and Testing Sets Step 3: Train the Linear Regression Model Step 4: Make Predictions

Regression …contd Step 5: Evaluate the Model Step 6: Visualize the Model - this will plot the test data points and the regression line predicted by our model (a combined sketch of Steps 1-6 follows below).
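
A combined scikit-learn sketch of Steps 1-6; the data values, split ratio and random_state are assumptions made for the example:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 1: load (here: hypothetical) hours-studied / test-score data
X = np.arange(1, 13, dtype=float).reshape(-1, 1)
y = np.array([50, 53, 57, 60, 64, 67, 71, 74, 78, 81, 85, 88], dtype=float)

# Step 2: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Step 3: train the linear regression model
model = LinearRegression().fit(X_train, y_train)

# Step 4: make predictions on the test set
y_pred = model.predict(X_test)

# Step 5: evaluate the model
print("R^2 :", r2_score(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))

# Step 6: visualize the test points and the fitted regression line
plt.scatter(X_test, y_test, label="test data")
line_x = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(line_x, model.predict(line_x), color="red", label="regression line")
plt.xlabel("Hours studied")
plt.ylabel("Test score")
plt.legend()
plt.show()
```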

KNN For Regression Step 1: Import Libraries and Load Data

KNN For Regression Step 2: Split Data into Training and Testing Sets Step 3: Train the kNN Regressor Step 4: Make Predictions

KNN For Regression Step 5: Evaluate the Model Step 6: Visualize the Model - we can visualize the regression line along with the test data (a sketch follows below).
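
A scikit-learn sketch of kNN regression covering these steps; the synthetic data and the choice k=5 are assumptions for the example:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Step 1: load (here: synthetic) one-feature regression data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y = np.sin(X).ravel() * 10 + rng.normal(0, 1, 80)

# Step 2: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 3: train the kNN regressor (k=5 is an arbitrary choice here)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Step 4: make predictions
y_pred = knn.predict(X_test)

# Step 5: evaluate the model
print("R^2 :", r2_score(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))

# Step 6: visualize the fitted curve along with the test data
grid = np.linspace(0, 10, 500).reshape(-1, 1)
plt.scatter(X_test, y_test, label="test data")
plt.plot(grid, knn.predict(grid), color="red", label="kNN prediction")
plt.legend()
plt.show()
```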

KNN for Classification Step 1: Import Libraries and Load Data

KNN for Classification Step 2: Train the kNN Classifier Step 3: Make Predictions Step 4: Evaluate the Model

KNN for Classification Step 5: Visualize the Results (Optional) - this will plot the decision boundary along with the training data points colored by their class labels (a sketch follows below).
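
A scikit-learn sketch of kNN classification including the optional decision-boundary plot; the synthetic blobs data and k=5 are assumptions for the example:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: load (here: synthetic) two-feature, two-class data
X, y = make_blobs(n_samples=200, centers=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Step 2: train the kNN classifier (k=5 chosen arbitrarily for the sketch)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Step 3: make predictions, Step 4: evaluate
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Step 5 (optional): visualize the decision boundary with the training points
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")
plt.title("kNN decision boundary (training data)")
plt.show()
```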

Linear Regression and k-Nearest Neighbors (kNN) are generally poor choices for filtering spam due to various reasons:
Linear Regression:
- Assumption of Linearity: linear regression assumes a linear relationship between the independent and dependent variables. In spam filtering, however, the relationship between the features (e.g., words in an email) and the target variable (spam or not spam) is unlikely to be linear. Spam emails often contain complex patterns and combinations of words, which are difficult to capture with a simple linear model.
- Sensitivity to Outliers: linear regression is sensitive to outliers in the data. Spam filtering datasets may contain outliers or noisy data points, which can significantly affect the coefficients of the linear model and lead to poor performance.
- Limited Complexity: linear regression is a simple model that cannot capture complex relationships between features and the target variable. Spam filtering often requires a more sophisticated model that can identify subtle patterns and interactions among features.
k-Nearest Neighbors (kNN):
- High Computational Cost: kNN requires storing all training data points in memory, making it computationally expensive and memory-intensive, especially for large datasets. In spam filtering, where the number of emails can be enormous, the computational cost of kNN becomes prohibitive.
- Curse of Dimensionality: kNN performance degrades as the number of features (dimensions) increases. In spam filtering, emails may contain a large number of features (e.g., words, metadata), leading to the curse-of-dimensionality problem. As the dimensionality increases, the distance between data points becomes less meaningful, making it challenging for kNN to accurately classify emails.

Why Linear Regression and kNN are poor choices for filtering spam …contd
- Sensitive to Irrelevant Features: kNN considers all features equally, regardless of their relevance to the classification task. In spam filtering, some features may be irrelevant or even misleading (e.g., email timestamps), yet kNN treats them with the same importance as relevant features, leading to suboptimal performance.
- Imbalanced Data: kNN can struggle with imbalanced datasets, where one class (e.g., spam) is significantly more prevalent than the other class (e.g., non-spam). Since kNN relies on the majority class in the neighborhood, it may be biased towards the dominant class and misclassify minority-class instances.
Better Alternatives: for spam filtering, more advanced machine learning techniques are often preferred, such as:
- Naive Bayes: suitable for text classification tasks like spam filtering; Naive Bayes assumes independence between features and can handle high-dimensional data efficiently.
- Support Vector Machines (SVM): effective for binary classification tasks like spam filtering; SVMs can handle high-dimensional data while avoiding overfitting.
- Ensemble Methods (e.g., Random Forest, Gradient Boosting): ensemble methods combine multiple weak learners to improve predictive performance and robustness against overfitting.
These methods offer better performance and scalability compared to linear regression and kNN for spam filtering tasks. Additionally, specialized techniques such as deep learning with neural networks can also be explored for more advanced spam filtering systems.

KNN Numerical From the given data set, determine whether the point (x, y) = (170, 57) belongs to the Underweight or Normal weight class.

Solution: In this approach we use the Euclidean distance formula d = sqrt((x2-x1)² + (y2-y1)²), with n (number of records) = 9 and assuming a K value of 3.
d1: x1=167, y1=51 and x2=170, y2=57 -> d1 = sqrt((170-167)² + (57-51)²) = 6.7
d2: x1=183, y1=56 and x2=170, y2=57 -> d2 = sqrt((170-183)² + (57-56)²) ≈ 13
d3: x1=176, y1=69 and x2=170, y2=57 -> d3 = sqrt((170-176)² + (57-69)²) = 13.4

d4: x1=173, y1=64 and x2=170, y2=57 -> d4 = sqrt((170-173)² + (57-64)²) = 7.6
d5: x1=172, y1=65 and x2=170, y2=57 -> d5 = sqrt((170-172)² + (57-65)²) = 8.2
d6: x1=174, y1=56 and x2=170, y2=57 -> d6 = sqrt((170-174)² + (57-56)²) = 4.1
d7: x1=169, y1=58 and x2=170, y2=57 -> d7 = sqrt((170-169)² + (57-58)²) = 1.414

d8: x1=173, y1=57 and x2=170, y2=57 -> d8 = sqrt((170-173)² + (57-57)²) = 3
d9: x1=170, y1=55 and x2=170, y2=57 -> d9 = sqrt((170-170)² + (57-55)²) = 2
With K = 3, the nearest neighbours of (170, 57) are d7 = 1.414, d9 = 2 and d8 = 3, and the majority class among these three records decides the classification.
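
A short script that reproduces the distance calculations above and performs the K=3 vote. The class labels are assumptions (the original data table is not reproduced in these notes), so the final prediction is only illustrative:

```python
from math import sqrt

# (height, weight) points reconstructed from the distance calculations above;
# class labels are ASSUMED for illustration only - the original table is not in these notes
data = [
    ((167, 51), "Underweight"),
    ((183, 56), "Normal"),
    ((176, 69), "Normal"),
    ((173, 64), "Normal"),
    ((172, 65), "Normal"),
    ((174, 56), "Underweight"),
    ((169, 58), "Normal"),
    ((173, 57), "Normal"),
    ((170, 55), "Normal"),
]
query = (170, 57)

# Euclidean distance of every record to the query point
dists = [(sqrt((query[0] - x) ** 2 + (query[1] - y) ** 2), label) for (x, y), label in data]

# Take the K=3 nearest neighbours and vote on the class
k = 3
nearest = sorted(dists)[:k]
votes = {}
for _, label in nearest:
    votes[label] = votes.get(label, 0) + 1
print("3 nearest:", [(round(d, 2), label) for d, label in nearest])
print("predicted class:", max(votes, key=votes.get))
```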

KMeans Apply the K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77) up to two iterations and show the clusters. Initially choose the first two objects as initial centroids. Solution: Given, number of clusters to be created (K) = 2, say c1 and c2, and number of iterations = 2, with the data points listed above.

Kmeans … contd. Taking the first two objects as initial centroids: centroid for the first cluster c1 = (185, 72); centroid for the second cluster c2 = (170, 56).

KMEANS …contd. Assigning each point to the nearest centroid, the resulting clusters after the first iteration are: cluster 1 = {(185, 72), (179, 68), (182, 72), (188, 77)} and cluster 2 = {(170, 56), (168, 60)}.

KMEANS …contd. Iteration 2: recalculating the centroid of each cluster gives c1 = (183.5, 72.25) and c2 = (169, 58). Now, again calculating the distance of each point to these new centroids:

KMEANS …contd. Representing the above information in tabular form (distances rounded to two decimals):
Point      d to c1   d to c2   Assigned cluster
(185, 72)    1.52     21.26        c1
(170, 56)   21.13      2.24        c2
(168, 60)   19.76      2.24        c2
(179, 68)    6.19     14.14        c1
(182, 72)    1.52     19.10        c1
(188, 77)    6.54     26.87        c1

KMEANS …contd. The resulting clusters after the second iteration are unchanged: cluster 1 = {(185, 72), (179, 68), (182, 72), (188, 77)} with centroid (183.5, 72.25), and cluster 2 = {(170, 56), (168, 60)} with centroid (169, 58). A short sketch that reproduces these two iterations follows.
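
A short NumPy sketch that reproduces the two iterations of this worked example:

```python
import numpy as np

points = np.array([(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)], dtype=float)
centroids = points[:2].copy()  # initial centroids: the first two objects

for iteration in (1, 2):
    # Assign each point to the nearest centroid (Euclidean distance)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its cluster
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    print(f"iteration {iteration}:")
    for k in range(2):
        print(f"  cluster {k + 1}: {points[labels == k].tolist()}"
              f"  updated centroid -> {centroids[k].round(2).tolist()}")
```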