Enhancing E-Commerce Efficiency: Predicting Delivery Times with Machine Learning
jadavvineet73
Sep 17, 2024
About This Presentation
In the fast-paced world of e-commerce, timely delivery is crucial for customer satisfaction and operational efficiency. Our project, "Enhancing E-Commerce Efficiency: Predicting Delivery Times with Machine Learning," aims to revolutionize the way delivery times are estimated and managed. For more information, visit: https://bostoninstituteofanalytics.org/
Size: 1.24 MB
Language: en
Slides: 27 pages
Slide Content
E-commerce Product Delivery Prediction Made by: Akshay Jambekar
Agenda
- Project Overview
- Dataset Overview
- Data Pre-Processing: Label Encoding, Normalization
- EDA
- Model Building and Evaluation
- Hyperparameter Tuning in Machine Learning
- Data Augmentation
- Observation and Conclusion
PROJECT OVERVIEW
E-commerce Product Delivery Prediction: a classification task.
- Objective: Enhance delivery predictions for an international e-commerce company specializing in electronics.
- Approach: Leverage machine learning models to predict on-time delivery of products.
- Impact: Improve customer satisfaction, optimize logistics operations, and gain insights into the factors affecting delivery performance.
Dataset Overview
Our dataset consists of 10,999 records with the following attributes:
- Customer ID: unique identifier for customers
- Warehouse Block: warehouse section (A, B, C, D, F)
- Mode of Shipment: shipping method (Ship, Flight, Road)
- Customer Care Calls: number of shipment inquiry calls
- Customer Rating: customer rating (1 = worst, 5 = best)
- Cost of Product: product cost in US dollars
- Prior Purchases: number of previous purchases
- Product Importance: product importance (Low, Medium, High)
- Gender: customer gender (Male, Female)
- Discount Offered: discount on the product
- Weight in Grams: product weight in grams
- Reached on Time (Y/N): target variable (1 = not on time, 0 = on time)
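To make the schema concrete, here is a small hand-made sample in pandas. The rows and values are invented for illustration; the column names are assumed to follow the snake_case form the later slides use (e.g. Weight_in_gms, Reached.on.Time_Y.N), not confirmed by the deck.

```python
import pandas as pd

# Hypothetical sample rows mirroring the dataset's 12 attributes.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Warehouse_block": ["A", "F", "D"],
    "Mode_of_Shipment": ["Flight", "Ship", "Road"],
    "Customer_care_calls": [4, 3, 2],
    "Customer_rating": [2, 5, 3],
    "Cost_of_the_Product": [177, 216, 183],
    "Prior_purchases": [3, 2, 4],
    "Product_importance": ["low", "medium", "high"],
    "Gender": ["F", "M", "F"],
    "Discount_offered": [44, 10, 5],
    "Weight_in_gms": [1233, 4567, 3088],
    "Reached.on.Time_Y.N": [1, 0, 0],
})
print(df.shape)  # (3, 12)
```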
Data Pre-Processing: Label Encoding for Categorical Attributes
Label Encoding converts categorical values into numerical codes that machine learning models can use.
- Warehouse_block: A = 0, B = 1, C = 2, D = 3, F = 4
- Mode_of_Shipment: Flight = 0, Road = 1, Ship = 2
- Product_importance: High = 0, Low = 1, Medium = 2
- Gender: F = 0, M = 1
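A minimal sketch of this encoding, assuming categories are assigned integers in alphabetical order (which reproduces the mappings on the slide; the deck does not say which library was used, so this uses plain pandas rather than, say, sklearn's LabelEncoder):

```python
import pandas as pd

df = pd.DataFrame({
    "Warehouse_block": ["A", "B", "C", "D", "F"],
    "Mode_of_Shipment": ["Flight", "Road", "Ship", "Ship", "Flight"],
    "Product_importance": ["high", "low", "medium", "low", "high"],
    "Gender": ["F", "M", "F", "M", "F"],
})

# Map each category to an integer in sorted (alphabetical) order,
# e.g. A=0 ... F=4, Flight=0, Road=1, Ship=2.
for col in df.columns:
    categories = sorted(df[col].unique())
    df[col] = df[col].map({c: i for i, c in enumerate(categories)})

print(df["Warehouse_block"].tolist())  # [0, 1, 2, 3, 4]
```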
Data Pre-Processing: Min-Max Scaling
Min-Max Scaling normalizes the numerical features of the dataset by transforming each value to the range [0, 1], which helps improve the performance of machine learning models.
Formula: X_scaled = (X - X_min) / (X_max - X_min)
where X_scaled is the scaled value, X is the original value, X_min is the minimum value in the dataset, and X_max is the maximum value in the dataset.
Numeric columns scaled: Customer_care_calls, Customer_rating, Cost_of_the_Product, Prior_purchases, Discount_offered, Weight_in_gms.
This transformation puts all of these features on the same scale, eliminating biases due to different ranges.
Data Pre-Processing: Min-Max Scaling vs. Standardization
Min-Max Scaling: a normalization technique that scales data to a fixed range, typically [0, 1], ensuring all features contribute equally to model performance.
Formula: X_scaled = (X - X_min) / (X_max - X_min), where X is the original feature value, X_min is the minimum value of the feature, and X_max is the maximum.
Standardization: a scaling technique that transforms data to have a mean of 0 and a standard deviation of 1, likewise ensuring each feature contributes equally.
Formula: z = (X - μ) / σ, where X is the original feature value, μ is the mean of the feature, and σ is the standard deviation.
EDA: Interpretation of the correlation matrix
- Discount_offered has a fairly strong positive correlation (0.40) with the target Reached on Time (Y/N).
- Weight_in_gms has a negative correlation (-0.27) with the target.
- Discount_offered and Weight_in_gms are negatively correlated (-0.38).
- Customer_care_calls and Weight_in_gms are negatively correlated (-0.28).
- Customer_care_calls and Cost_of_the_Product are positively correlated (0.32).
- Prior_purchases and Customer_care_calls are slightly positively correlated.
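A correlation matrix like the one interpreted above is a one-liner with pandas. The data below is synthetic, built so that late deliveries tend to get larger discounts, loosely echoing the 0.40 correlation reported on the slide; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
discount = rng.uniform(1, 65, n)
# Synthetic target: larger discounts make a late delivery (1) more likely.
late = (discount + rng.normal(0, 20, n) > 35).astype(int)
df = pd.DataFrame({"Discount_offered": discount, "Reached.on.Time_Y.N": late})

# Pairwise Pearson correlations of all numeric columns.
corr = df.corr()
print(round(corr.loc["Discount_offered", "Reached.on.Time_Y.N"], 2))
```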
EDA: From the above plots we can conclude the following:
- Warehouse block F has more records than any other warehouse block.
- In the mode-of-shipment column, Ship clearly delivers most of the products to customers.
- Most customers call the customer care center 3 or 4 times.
- Customer ratings do not vary much.
EDA: From these plots we can conclude the following:
- Most customers have 3 prior purchases.
- Most products are of low importance.
- The Gender column does not have much variance.
- More products fail to reach on time than arrive on time.
EDA: On-time delivery by warehouse block
- Warehouse F: the majority of products did not reach on time (higher red bars). This warehouse has the highest overall count of deliveries, with a significant number of late deliveries.
- Warehouse D: similar numbers of on-time and late deliveries, but slightly more late ones.
- Warehouses A, B, and C: in all three blocks there are more late deliveries than on-time deliveries. The pattern suggests that on-time delivery is a general issue across these blocks, though they handle fewer shipments than Warehouse F.
EDA: Gender. Both genders show similar numbers of on-time and late deliveries, with slightly more late deliveries for both males and females.
EDA: On-time delivery by product importance
- Low importance: a significant number of products were delivered on time (purple bar), but even more did not reach on time (red bar), indicating a higher failure rate for low-importance products.
- Medium importance: as with low importance, most products did not reach on time, and the gap between late and on-time deliveries is more pronounced; this category has a higher count of delayed deliveries.
- High importance: the majority of deliveries reached on time, with fewer late deliveries, suggesting better performance for high-importance products.
EDA: Mode of shipment. Most shipments made by ship do not reach on time. For shipments by flight and road, a significant number reached on time, but there were also notable instances where they did not.
EDA: Customer care calls. A higher number of customer care calls (particularly 3 or 4) is associated with a higher likelihood of deliveries not reaching on time. Fewer calls (2) and more than 6 calls show a more balanced distribution, but there is still a slight trend towards late deliveries.
EDA: Prior purchases. Customers with 2 to 4 prior purchases experience a higher likelihood of late deliveries. As the number of prior purchases increases beyond 5, the outcomes start to balance, though there is still a slight tendency for deliveries to be late.
EDA: Discount offered
- Higher discounts and late deliveries: there is a clear trend where higher discounts are associated with deliveries that did not reach on time. The larger variability in discounts for late deliveries suggests that offering higher discounts may be related to logistical challenges or delays.
- Lower discounts and on-time deliveries: deliveries that reached on time are associated with consistently lower discounts, with very little variation in the discount offered.
EDA: Weight in grams. For deliveries that were on time (Reached.on.Time_Y.N = 0), the weight distribution is relatively compact, mostly between 4000 and 5000 grams, with a few outliers at the lower end. For deliveries that were not on time (Reached.on.Time_Y.N = 1), the weight distribution is much wider, ranging from approximately 2000 to 4500 grams.
Model Building and Evaluation
We split the data 70/30: 70% for training and 30% for testing.
We trained models using the following ML algorithms:
- Logistic Regression
- KNN Classification
- Random Forest
- Support Vector Machine
- Gradient Boosting Classification
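The split-and-compare loop above can be sketched with scikit-learn. The features and target here are synthetic stand-ins (the real dataset has 10,999 rows), so the printed accuracies will not match the deck's 67-69% figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Synthetic stand-in data: 500 rows, 6 numeric features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + rng.normal(0, 1, 500) > 0).astype(int)

# 70/30 train/test split, as described on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.2f}")
```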
Model Comparison - Baseline. The graph compares Logistic Regression, KNN, Random Forest, SVM, and Gradient Boosting as baseline models without any optimization. Random Forest, SVM, and Gradient Boosting each achieve 67% accuracy.
Hyperparameter Tuning in Machine Learning
Definition: hyperparameters are configuration settings that control the learning process of machine learning models; unlike model parameters, they are not learned from the data.
Importance: hyperparameters significantly influence model performance. Proper tuning can improve accuracy and generalization and reduce overfitting.
Methods for hyperparameter tuning:
- Grid Search: exhaustive search over a manually specified subset of hyperparameters.
- Random Search: random combinations of hyperparameters are tried.
- Bayesian Optimization: a probabilistic model selects the next set of hyperparameters based on prior performance.
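Grid search, the method the deck uses, looks like this with scikit-learn's GridSearchCV. The data and parameter grid are illustrative assumptions; the slides do not list the actual grid that was searched.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search runs quickly.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Hypothetical grid: 2 x 2 = 4 candidate settings, each scored by 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```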
Model Comparison - Hyperparameter Tuning. The graph compares Logistic Regression, KNN, Random Forest, SVM, and Gradient Boosting after optimization via grid-search hyperparameter tuning. Gradient Boosting and Random Forest give the best accuracy, 69%.
Data Augmentation
Problem: imbalanced data. Imbalance in the target variable can lead to biased model performance; in a classification problem, one class may dominate the others.
SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples by interpolating between existing minority-class instances.
- Advantage: introduces new, varied data points rather than simply duplicating existing ones, reducing the risk of overfitting.
- Use case: effective for balancing datasets in classification tasks.
- Outcome: improved balance in the target variable and enhanced model performance, by addressing class imbalance and reducing bias towards the majority class.
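The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the algorithm's full specification; in practice one would use imbalanced-learn's SMOTE class. The helper name and toy minority points below are invented.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Minimal sketch of SMOTE's core idea: synthesize a new minority sample
    by interpolating between a random minority point and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                       # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Four hypothetical minority-class points in the unit square.
minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
synthetic = smote_like(minority, n_new=4)
print(synthetic.shape)  # (4, 2)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority class's region rather than being exact duplicates.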
Model Comparison - Data Augmentation. The graph compares Logistic Regression, KNN, Random Forest, SVM, and Gradient Boosting on the augmented data. There is no significant difference after data augmentation; the Gradient Boosting classifier gives the best accuracy, 69%.
Observation and Conclusion
Weight_in_gms, Cost_of_the_Product, and Discount_offered are the features that contribute most to predicting whether a product is delivered on time. The highest test accuracy observed is 69%, using Gradient Boosting.