R204585L. RMABIKA. Customer Churn Prediction Presentation 2.pptx

Cynthia Mabika · 18 slides · Oct 16, 2024

Slide Content

Hybrid Ensemble Learning Classifier for Dealing with Imbalanced Data
Student Name: Ruvarashe Cynthia Mabika
Registration Number: R204585L
Supervisor: Mr. G. Mbizo
Degree Program: Bachelor of Science Honors Data Science and Informatics

Introduction
Customer churn (also known as customer attrition) occurs when a customer stops using a company's products or services. Churn affects profitability, especially in industries where revenues depend heavily on subscriptions (e.g. banks, telephone and internet service providers). Churn analysis is therefore essential: it can help a business identify problems in its services and make strategic decisions that lead to higher customer satisfaction and retention, and researchers and practitioners now try to build effective churn prediction models (Sharma et al., 2020). However, customer churn datasets typically have a skewed class proportion (the number of non-churners is usually much higher than the number of churners), which makes it hard for machine learning models to achieve high prediction performance. Resampling methods can balance the class proportions to improve classification performance on an imbalanced dataset; the most popular is the Synthetic Minority Oversampling Technique (SMOTE). Recently, hybrid resampling methods have been proposed as a more effective way to handle imbalanced data, but few studies have applied them to customer churn prediction. In previous studies on customer churn prediction, the most frequently used classification algorithms are Logistic Regression, KNN, and Decision Tree (Pamina et al., 2019). Recent studies show that ensemble learning methods achieve high performance in classification problems, yet only a few studies have applied these algorithms to customer churn prediction.
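SMOTE's core idea can be sketched in a few lines of pure Python: synthetic minority samples are created by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration on toy data, not the imbalanced-learn implementation a study like this would use in practice:

```python
import random

def smote_oversample(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest minority neighbours.
    A simplified sketch of SMOTE, not the imbalanced-learn implementation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy imbalanced data: 2 churners vs 6 non-churners
churners = [(1.0, 1.0), (1.4, 0.8)]
new_points = smote_oversample(churners, n_new=4)
print(len(churners) + len(new_points))  # 6: minority class now matches the majority
```

Hybrid methods such as SMOTE-Tomek Links and SMOTE-ENN add a cleaning step after this oversampling, removing borderline or noisy samples from the enlarged dataset.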

Problem Statement
Predicting customer churn accurately is crucial for organizations to proactively implement retention strategies and reduce revenue loss. However, traditional churn prediction models often encounter limitations in handling imbalanced datasets and achieving high predictive performance. The objective of this project is to develop an advanced customer churn prediction system by leveraging hybrid resampling and ensemble learning techniques. The primary goal is to enhance the accuracy and robustness of churn prediction models, specifically addressing the challenges associated with imbalanced datasets and traditional modeling techniques.

Research Gap
While hybrid resampling techniques have shown promise in addressing class imbalance in churn prediction, few studies systematically compare and evaluate different hybrid resampling methods. Likewise, although ensemble learning has been widely used across machine learning domains, its application to customer churn prediction remains relatively scarce. By addressing these gaps, this study aims to advance the existing knowledge in the field of customer churn prediction by incorporating hybrid resampling techniques, ensemble learning approaches and feature importance analysis.

Research Questions
1. What are the most effective hybrid resampling techniques for addressing imbalanced datasets in customer churn prediction?
2. How does the ensemble learning approach enhance customer churn prediction compared to individual models?
3. How does the hybrid resampling and ensemble learning approach compare to traditional churn prediction models?

Research Objectives
1. To compare and evaluate the effectiveness of different hybrid resampling techniques.
2. To assess the performance of ensemble learning in improving churn prediction accuracy compared to individual models.
3. To compare the performance of the hybrid resampling and ensemble learning approach with traditional churn prediction models in terms of accuracy, precision, recall, and F1-score.
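The metrics named in the objectives all derive from confusion-matrix counts. As a minimal sketch (the counts below are hypothetical, purely for illustration):

```python
def churn_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts,
    treating 'churner' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts: 120 churners caught, 30 false alarms,
# 60 churners missed, 490 non-churners correctly identified
acc, prec, rec, f1 = churn_metrics(tp=120, fp=30, fn=60, tn=490)
print(round(prec, 2), round(rec, 2), round(f1, 2))  # 0.8 0.67 0.73
```

On imbalanced churn data, recall on the churner class and F1 are usually more informative than raw accuracy, which a model can inflate by always predicting the majority class.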

Literature Review
Numerous studies have been conducted on the prediction of customer churn in the telecommunication sector, and these studies have gathered much evidence on the effectiveness of machine learning in predicting customer churn (Ahmad et al., 2019). This unit provides a summary of some of the related works.

2022, Takuma K. et al.
Algorithms used: Logistic Regression, SVM, Random Forest, XGBoost, LightGBM and CatBoost; SMOTE, SMOTE-Tomek Links and SMOTE-ENN were used to resample the data.
Results: Model performance was evaluated by accuracy, recall, precision, F1-score and ROC-AUC. Boosting algorithms outperformed the traditional classification algorithms and Random Forest; XGBoost combined with SMOTE-Tomek Links achieved the highest F1-score. These results suggest that Boosting algorithms performed better than traditional algorithms on ROC-AUC, although the hybrid resampling methods did not necessarily improve model performance.
Identified gap: The available studies simply used undersampling and oversampling techniques to handle data imbalance; hybrid ensemble learning classifiers were not used effectively to improve model accuracy. This study uses SMOTE, SMOTE-ENN and SMOTE-Tomek Links to create new datasets, builds ensemble models on them, and evaluates each model's performance to see how customer churn datasets behave under hybrid ensemble techniques.

2022, Wagh et al.
Algorithms used: Random Forest and Decision Tree, with upsampling and ENN for resampling.
Results: The Random Forest classifier produces better results than the Decision Tree classifier, predicting churn with an overall accuracy of 99%; the classification report shows a precision of 99%, a recall of 99% and an accuracy of 99.09%.
Identified gap: As above.

2022, Fujo S.W. et al.
Algorithms used: Deep-BP-ANN; to solve the imbalance issue, Random Oversampling was used to balance both datasets.
Results: In predicting customer churn, their findings outperformed the ML techniques XGBoost, Logistic Regression, Naïve Bayes and KNN.

2022, Makurumidze L. et al.
Algorithms used: Gradient Boosting, Random Forest, AdaBoost and Decision Tree.
Results: Gradient Boosting and Random Forest performed best of the four on a publicly available bank dataset; Random Forest was then used for implementation.
Identified gap: As above.

2020, Afifah R. et al.
Algorithms used: Naïve Bayes with and without SMOTE; SMOTE was used to resample the data.
Results: Model performance was evaluated by accuracy, recall, precision and F1-score; Naïve Bayes with SMOTE outperformed the version without SMOTE.
Identified gap: The available studies simply used undersampling and oversampling techniques to handle data imbalance; hybrid ensemble learning classifiers were not used effectively to improve model accuracy. This study uses SMOTE, SMOTE-ENN and SMOTE-Tomek Links to create new datasets, builds ensemble models on them, and evaluates each model's performance to see how customer churn datasets behave under hybrid ensemble techniques.

2023, Teuku A. et al.
Algorithms used: XGBoost, Bernoulli Naïve Bayes and Decision Tree.
Results: XGBoost achieved the highest accuracy and F1-score, 81.95% and 74.76% respectively; in efficiency, Naïve Bayes and Decision Tree outperformed XGBoost with AUC of 0.7469 and 0.7468 respectively.
Identified gap: As above.

2023, Lewlisa S. et al.
Algorithms used: CNN, Gradient Boosting, Random Forest, AdaBoost, KNN, ANN, ERT and XGBoost.
Results: CNN and ANN performed better, with accuracies of 99% and 98% respectively.

2023, Harini T. et al.
Algorithms used: Logistic Regression, Naïve Bayes and Decision Tree.
Identified gap: As above.

2023, Boyuan Z.
Algorithms used: Logistic Regression, Decision Tree, KNN, Gaussian Naïve Bayes, CNN, Random Forest, XGBoost and LightGBM; SMOTE was used to resample the data.
Results: Model performance was evaluated by accuracy, recall, precision, F1-score and ROC-AUC; Random Forest outperformed all the other algorithms with an accuracy of 79.6%.
Identified gap: The available studies simply used undersampling and oversampling techniques to handle data imbalance; hybrid ensemble learning classifiers were not used effectively to improve model accuracy. This study uses SMOTE, SMOTE-ENN and SMOTE-Tomek Links to create new datasets, builds ensemble models on them, and evaluates each model's performance to see how customer churn datasets behave under hybrid ensemble techniques.

2022, Shiyunyang Z.
Algorithms used: Random Forest and Decision Tree.
Results: The Random Forest classifier predicts churn with an overall accuracy of 99%; the classification report shows an accuracy of 91%.
Identified gap: As above.

2023, Nayem T.
Algorithms used: KNN, SVM, LR, RF, AdaBoost, LightGBM, GradientBoost and XGBoost.
Results: XGBoost outperformed all the other algorithms with an accuracy of 95.74%.

Functionality

Methodology
This study will employ the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. The following steps will be taken to achieve the research objectives:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
The CRISP-DM methodology is iterative, meaning that it allows for revisiting previous phases as new insights or challenges arise during the project.

Methodology
Business understanding: This phase focuses on understanding the objectives and requirements of the project, i.e. what the project really wants to accomplish. It also involves selecting technologies and tools and defining detailed plans for each project phase.
Data understanding: This subsection briefly discusses the dataset used, the publicly available Telcom-IBM dataset. The dataset contains 7,043 instances and 21 features. The variable named "Churn" is binary (Yes or No) and will be used as the label in the following analysis. The dataset contains 1,869 churners and 5,174 non-churners; the percentage of churners is 26.54%, so the dataset can be regarded as imbalanced.
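The degree of imbalance follows directly from the class counts stated above; a quick sketch (variable names are illustrative):

```python
# Class counts reported for the Telcom-IBM churn dataset
churners, non_churners = 1_869, 5_174

total = churners + non_churners
churn_rate = 100 * churners / total          # minority-class percentage
imbalance_ratio = non_churners / churners    # majority-to-minority ratio

print(total)                      # 7043 instances
print(round(churn_rate, 2))       # 26.54 percent churners
print(round(imbalance_ratio, 2))  # 2.77 non-churners per churner
```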

Methodology
Data preparation and data analysis: Out of the 21 features, "SeniorCitizen" and "tenure" are integer-type data, and "MonthlyCharges" is float-type. The remaining 18 variables are object-type data. Since the objective of this study is to develop models with high prediction performance, rather than to reveal the causal relationship between predictors and outcome, the analysis will include all these variables. However, variables with no useful information for prediction will be dropped to improve model parsimony.
Modeling: In this subsection, the machine learning training process is described: selecting appropriate modeling techniques, algorithms, and tools. Various machine-learning techniques were selected for experimental evaluation, among them KNN, SVM, LR, AdaBoost, LGBM, GradientBoosting, and XGBoost. These models were selected based on their popularity in the literature and their ability to handle large datasets.
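The ensemble idea behind combining such models can be sketched as a hard-voting combiner: each base classifier casts one vote and the majority label wins. The three rule-based "learners" below are hypothetical stand-ins for trained models, with invented thresholds, purely to illustrate the voting mechanism:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Hard-voting ensemble: each base classifier casts one vote for a
    label, and the label with the most votes wins."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical rule-based learners on (tenure_months, monthly_charge);
# 1 = predicted churner, 0 = predicted non-churner. Thresholds are invented.
short_tenure   = lambda x: 1 if x[0] < 12 else 0
high_charge    = lambda x: 1 if x[1] > 80.0 else 0
new_and_costly = lambda x: 1 if x[0] < 6 or x[1] > 100.0 else 0

ensemble = [short_tenure, high_charge, new_and_costly]
print(majority_vote(ensemble, (3, 95.0)))   # 1: all three vote churn
print(majority_vote(ensemble, (48, 45.0)))  # 0: all three vote stay
```

Boosting methods such as AdaBoost, LGBM and XGBoost go further than simple voting: they train base learners sequentially, each one weighted toward the examples its predecessors misclassified.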

Methodology
Evaluation: This study evaluates model performance by accuracy, F1-score, precision, recall and ROC-AUC score.
Deployment: This phase involves planning and implementing the deployment strategy; creating reports, visualizations, or dashboards for presenting the results; and documenting the final project report, including the deployed models and recommendations.
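The ROC-AUC score used in the evaluation can be sketched without any library via its rank interpretation: the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The scores below are hypothetical:

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count as 0.5), which
    equals the normalised Mann-Whitney U statistic."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical churn probabilities for 2 churners (1) and 2 non-churners (0)
y_true  = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.2]
print(roc_auc(y_true, y_score))  # 0.75
```

Unlike accuracy, this statistic is insensitive to the class proportions, which makes it a natural companion metric on imbalanced churn data.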

Progress Made
17/10/2023: Presented the project proposal to my supervisor. Review: the proposal was approved.
20/11/2023: Downloaded the Telcom-IBM dataset from Kaggle and shared it with my supervisor. Review: he agreed that I can use that dataset.
27/11/2023: Started data preprocessing.
13/12/2023: Documented Chapters 1, 2 and part of 3 and shared them with my supervisor. Review: he proposed I research different big data methodologies that can be used for customer churn prediction projects.

Conclusion
The scope of this thesis is to explore the effectiveness of hybrid resampling together with ensemble learning techniques in predicting customer churn in the telecom sector. The results of this research will help telecom companies identify customers who are likely to churn and take appropriate measures to retain them.

References
Fujo, S.W., Subramanian, S. and Khder, M.A., 2022. Customer churn prediction in telecommunication industry using deep learning. Information Sciences Letters, 11(1), p.24.
Kimura, T., 2022. Customer churn prediction with hybrid resampling and ensemble learning. Journal of Management Information & Decision Sciences, 25(1).
Makurumidze, L., Manjoro, W.S. and Makondo, W., 2022. Implementing Random Forest to Predict Churn.
Pamina, J., Raja, B., SathyaBama, S., Soundarya, S., Sruthi, M.S., Kiruthika, S., Aiswaryadevi, V.J. and Priyanka, G., 2019. An effective classifier for predicting churn in telecommunication. Journal of Advanced Research in Dynamical & Control Systems, 11(01-Special Issue). Available at SSRN: https://ssrn.com/abstract=3399937
Wagh, S.K. and Wagh, K.S. Customer churn prediction in telecom sector using machine learning techniques. Available at SSRN 4158415.