I performed PCA, a widely used ML technique, on the airline dataset; one flagged issue is that Age is a demographic variable and should therefore be excluded from the analysis.
LET’S UNDERSTAND THE DATA!
Data set size: 25,976 rows, 24 columns.
Variables of interest (16 chosen): Age, Inflight wifi service, Departure/Arrival time convenient, Cleanliness, Gate location, Online boarding, Seat comfort, Checkin service, Inflight entertainment, Ease of Online booking, On-board service, Leg room service, Baggage handling, Food and drink, Inflight service, Satisfaction.
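A minimal sketch of how the subset can be built; the file name and column spellings are assumptions based on the public airline passenger satisfaction dataset, not confirmed by the original analysis:

```python
import pandas as pd

# Hypothetical file name; adjust to the actual dataset path
df = pd.read_csv("airline_passenger_satisfaction.csv")
print(df.shape)  # expected: (25976, 24)

# The 16 variables of interest (column names assumed to match the dataset)
cols = [
    "Age",
    "Inflight wifi service", "Departure/Arrival time convenient",
    "Cleanliness", "Gate location", "Online boarding", "Seat comfort",
    "Checkin service", "Inflight entertainment", "Ease of Online booking",
    "On-board service", "Leg room service", "Baggage handling",
    "Food and drink", "Inflight service", "satisfaction",
]
data = df[cols]
```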
STEP 1: EDA (EXPLORATORY DATA ANALYSIS)
Converting satisfaction levels to binary values, creating a subset DataFrame, and plotting a correlation heatmap.
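A sketch of this EDA step, continuing from the `data` subset above; the satisfaction label text is an assumption based on the public dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Convert satisfaction levels to binary values (assumed label spellings)
data = data.copy()
data["satisfaction"] = data["satisfaction"].map(
    {"neutral or dissatisfied": 0, "satisfied": 1}
)

# Correlation heatmap of the subset DataFrame
plt.figure(figsize=(12, 10))
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```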
STEP 2: Bartlett’s Test of Sphericity (Adequacy Test)
Hypotheses:
H0: The observed variables in the dataset are not correlated, and therefore the correlation matrix is an identity matrix (spherical).
vs
H1: The observed variables in the dataset are correlated, and the correlation matrix is not an identity matrix (non-spherical).
Statistic: Chi-square = 154473.615, p-value = 0.0. The p-value is smaller than any conventional significance level, so we reject the null hypothesis.
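The test can be run with the `factor_analyzer` package; a sketch, assuming the rating variables (without the target) are gathered in `X`:

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

X = data.drop(columns=["satisfaction"])  # rating variables only
chi_square, p_value = calculate_bartlett_sphericity(X)
print(f"Chi-square: {chi_square:.3f}, p-value: {p_value}")
```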
Kaiser-Meyer-Olkin (KMO) Test
H0: The observed variables in the dataset are not suitable for structure detection, i.e. the partial correlations are close to zero.
vs
H1: The observed variables in the dataset are suitable for structure detection, i.e. the partial correlations are significantly different from zero.
KMO value = 0.7786. A KMO value above 0.5 is generally considered acceptable for factor analysis, so the data are adequate for structure detection.
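A corresponding sketch using `factor_analyzer`:

```python
from factor_analyzer.factor_analyzer import calculate_kmo

kmo_per_variable, kmo_total = calculate_kmo(X)
print(f"Overall KMO: {kmo_total:.4f}")  # reported value: approximately 0.7786
```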
STEP 3: PCA Factor Extraction
We computed the eigenvalues and the individual variance explained by each component.
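One way to compute these, sketched with scikit-learn; standardizing the ratings first is an assumption about the preprocessing:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the ratings, then inspect eigenvalues and variance explained
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

for i, (ev, r) in enumerate(
    zip(pca.explained_variance_, pca.explained_variance_ratio_), start=1
):
    print(f"PC{i}: eigenvalue = {ev:.3f}, variance explained = {r:.1%}")
```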
OBTAINING THE LOADING MATRIX
Using equamax rotation for factor extraction, we derived 6 factors instead of 4, as this resulted in a more meaningful interpretation.
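A sketch of the extraction with the `factor_analyzer` package; fitting directly on the raw rating columns is an assumption:

```python
from factor_analyzer import FactorAnalyzer
import pandas as pd

# Extract 6 factors with an equamax rotation and inspect the loading matrix
fa = FactorAnalyzer(n_factors=6, rotation="equamax")
fa.fit(X)

loadings = pd.DataFrame(
    fa.loadings_, index=X.columns,
    columns=[f"Factor{i}" for i in range(1, 7)]
)
print(loadings.round(3))
```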
Rotation Matrix
The factor rotation matrix is a mathematical transformation applied to the factor loadings matrix to achieve a simpler and more interpretable factor structure.
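After fitting, the rotation matrix itself can be inspected; the attribute name below is assumed from the `factor_analyzer` package:

```python
# 6 x 6 matrix that maps the unrotated loadings to the rotated ones
print(fa.rotation_matrix_)
```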
COMMUNALITIES & SPECIFIC VARIANCE
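A sketch of how these can be pulled from the fitted model above:

```python
import pandas as pd

# Communality: variance of a variable explained by the common factors;
# specific (unique) variance is the remainder
summary = pd.DataFrame(
    {
        "communality": fa.get_communalities(),
        "specific_variance": fa.get_uniquenesses(),
    },
    index=X.columns,
)
print(summary.round(3))
```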
OBTAINING FACTOR SCORES
After completing the factor analysis, we obtained the factor scores for each observation.
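A sketch, continuing from the fitted `fa` model:

```python
# Factor scores for every observation, later used as classifier features
factor_scores = pd.DataFrame(
    fa.transform(X), columns=[f"Factor{i}" for i in range(1, 7)]
)
print(factor_scores.head())
```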
STEP 4: Classification Method
We employed logistic regression in the classification process. We transformed the satisfaction ratings into binary outputs: 0 (neutral or dissatisfied) and 1 (satisfied).
Logistic Regression using Scikit-learn:
1. Data splitting: split into X (factor scores) and y.
2. Classifier initialization: initialize logistic regression.
3. Hyperparameter tuning: use GridSearchCV.
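A sketch of these three steps; the train/test split ratio and the hyperparameter grid are assumptions, not the exact values used in the original analysis:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# 1. Data splitting: X = factor scores, y = binary satisfaction
y = data["satisfaction"]
X_train, X_test, y_train, y_test = train_test_split(
    factor_scores, y, test_size=0.2, random_state=42
)

# 2. Classifier initialization
log_reg = LogisticRegression(max_iter=1000)

# 3. Hyperparameter tuning with GridSearchCV (grid is illustrative)
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
model = grid.best_estimator_
```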
Accuracy: 0.8058 (train), 0.7979 (test). [R-squared value and confusion matrix shown as figures.]
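The metrics can be reproduced along these lines (a sketch; exact numbers depend on the split and grid assumed above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print(confusion_matrix(y_test, model.predict(X_test)))
```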
Prediction: generated predictions for the first 20 observations using the trained model.
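A sketch; taking the first 20 rows of the test set is an assumption about which 20 values were predicted:

```python
# Predicted classes for the first 20 test observations
print(model.predict(X_test.iloc[:20]))
```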
CONCLUSION
EDA successfully uncovered valuable insights, giving a better understanding of the dataset. PCA efficiently reduced the dimensionality of the data, capturing its essential features and improving interpretability for more efficient analysis and modeling. Logistic regression proved to be a valuable and interpretable model for this binary classification task. Drawing inferences from the analysis adds a crucial layer of understanding, translating data patterns into meaningful insights for informed decision-making.