How to Perform Exploratory Data Analysis Using Python.pptx
opinafees
10 slides
Sep 10, 2024
About This Presentation
Master Exploratory Data Analysis with Python: A Comprehensive Guide to Retail Data Insights
Dive into the world of data with Abdullah Al Nafees' expert-led presentation on Exploratory Data Analysis (EDA) using Python, tailored for learners pursuing project completion on Coursera. This session from B'Deshi Research Lab provides an insightful exploration into the 'Online Retail dataset', which comprises 5,000 meticulously chosen records that reveal critical data handling and analysis techniques.
Key Learning Outcomes:
1. Effective Data Management: Learn to manage and clean data efficiently, focusing on crucial aspects like handling missing values, particularly in 'CustomerID' and 'Description' columns, to ensure data integrity.
2. Statistical Analysis and Visualization: Utilize Python's robust libraries to compute descriptive statistics and create insightful visualizations. Understand data distributions through detailed statistical summaries and explore the relationship between variables such as 'Quantity' and 'Unit Price' using scatter plots and box plots.
3. Geographical Insights: Analyze transaction distributions across countries with a specific spotlight on the UK market, understanding regional trends and customer behaviors.
4. Practical Application: Equip yourself with practical skills for real-world data application, preparing for advanced techniques in predictive modelling.
Slide Content
"How to Perform Exploratory Data Analysis Using Python" Presenter a : Abdullah Al Nafees Affiliations: a Sylhet Engineering College (SEC), School of Applied Sciences & Technology, Shahjalal University of Science and Technology, Tilagarh , Alurtol Road, Sylhet 3100, Bangladesh. For Contact: [email protected] Purpose: Coursera Project Completion Portfolio - Perform exploratory data analysis on retail data with Python
Introduction to EDA
What is Exploratory Data Analysis (EDA)?
EDA is the practice of summarizing and visualizing a dataset to understand its structure, distributions, and data-quality problems before any formal modelling. It relies on descriptive statistics and plots to surface patterns, relationships between variables, and values that need cleaning.
Objectives of EDA
Identify anomalies and outliers.
Dataset Description
Dataset: Online Retail dataset.
Number of records analysed: 5,000 (sample).
Key columns:
InvoiceNo: Transaction ID.
StockCode: Product ID.
Quantity: Number of units sold.
UnitPrice: Price per unit.
CustomerID: Customer identifier.
Country: Customer's country.
The dataset contains both numerical and categorical data.
Use Python's Pandas library to load and inspect the dataset.
    import pandas as pd
    # Load dataset
    df = pd.read_excel('Online Retail.xlsx')
    # Basic information about the dataset
    df.info()
    # Display first few rows of the dataset
    df.head()
Missing Values
Handling Missing Values
Missing values were identified in the Description and CustomerID columns.
Description column: 12 missing values.
CustomerID column: missing in a significant number of rows (over 1,200 entries).
Action taken: rows with missing CustomerID were removed for more accurate analysis.
    # Check for missing values in the dataset
    df.isnull().sum()
    # Drop rows with missing CustomerID
    df_cleaned = df.dropna(subset=['CustomerID'])
    # Confirm that missing values are handled
    df_cleaned.isnull().sum()
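To make the missing-value step concrete, here is a minimal sketch on a tiny made-up table. The four rows below are hypothetical stand-ins for the Online Retail sample, not real records; only the column names come from the slides. It reports both the count and the share of missing values per column, then drops rows without a CustomerID.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Online Retail sample.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["WHITE MUG", None, "RED LAMP", "BLUE BOWL"],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "missing": df.isnull().sum(),
    "share": df.isnull().mean(),
})

# Keep only rows that have a CustomerID.
df_cleaned = df.dropna(subset=["CustomerID"])
rows_dropped = len(df) - len(df_cleaned)

print(missing)
print(f"Rows dropped: {rows_dropped}")
```

Looking at the share as well as the count helps judge how much data a drop will cost before committing to it.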
Descriptive Statistics
Descriptive Statistics of Numerical Data
Use descriptive statistics to get insights into Quantity and Unit Price.
Quantity:
Mean: 11.33 units per transaction.
Max: 2,880 units. Some values are negative, indicating returns.
Standard deviation: 166.3, indicating high variability.
Unit Price:
Mean: £3.18 per unit.
Max: £295, showing a wide range in product pricing.
Min: £0.03, likely discounted items or very small products.
    # Get summary statistics for numerical columns
    df_cleaned[['Quantity', 'UnitPrice']].describe()
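Because negative quantities are read as returns, they pull the mean down and inflate the spread. One possible way to see this, sketched here on hypothetical values (the numbers are invented for illustration and the split into sales and returns is an assumption, not a step from the slides), is to summarize sales rows separately:

```python
import pandas as pd

# Hypothetical values; negative Quantity marks a return.
df_cleaned = pd.DataFrame({
    "Quantity": [6, 12, -2, 48, -1, 3],
    "UnitPrice": [2.55, 1.25, 4.95, 0.85, 12.75, 3.39],
})

# Split rows into sales and returns by the sign of Quantity.
is_return = df_cleaned["Quantity"] < 0
sales = df_cleaned[~is_return]
returns = df_cleaned[is_return]

# Summary statistics for sales only.
print(sales[["Quantity", "UnitPrice"]].describe())
print(f"Return rows: {len(returns)}")
```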
Country-wise Transaction Distribution
Top Countries by Transaction Count
Analyze the number of transactions per country.
The United Kingdom accounts for the majority of transactions (3,632).
Norway, Germany, EIRE, and France also contribute to sales but on a much smaller scale.
    # Count transactions by country
    country_sales = df_cleaned['Country'].value_counts()
    # Show top 10 countries
    country_sales.head(10)
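Note that value_counts() counts rows. If several rows can share one InvoiceNo, the number of distinct invoices per country may differ from the row count. The sketch below uses hypothetical rows (invented invoice numbers, real column names) to show row counts, row shares, and distinct invoices side by side.

```python
import pandas as pd

# Hypothetical rows; several rows share one InvoiceNo.
df_cleaned = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A3", "A3", "A4", "B1", "B1", "C1"],
    "Country": ["United Kingdom"] * 7 + ["Germany"] * 2 + ["France"],
})

# Rows per country, as counts and as a share of all rows.
row_counts = df_cleaned["Country"].value_counts()
row_share = df_cleaned["Country"].value_counts(normalize=True)

# Distinct invoices per country.
invoice_counts = (
    df_cleaned.groupby("Country")["InvoiceNo"]
    .nunique()
    .sort_values(ascending=False)
)

print(row_counts)
print(row_share)
print(invoice_counts)
```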
Quantity vs Unit Price
Relationship between Quantity and Unit Price
Use a scatter plot to visualize the relationship between Quantity and Unit Price.
Significant variance in Quantity and UnitPrice across transactions.
Some outliers with unusually high quantities or prices, indicating special orders or returns.
    import matplotlib.pyplot as plt
    # Scatter plot of Quantity vs UnitPrice
    plt.scatter(df_cleaned['Quantity'], df_cleaned['UnitPrice'])
    plt.xlabel('Quantity')
    plt.ylabel('UnitPrice')
    plt.title('Quantity vs Unit Price')
    plt.show()
Outliers and Anomalies
Identifying Outliers and Anomalies
Identify outliers in Quantity and UnitPrice using a box plot.
Outliers were detected in both Quantity and UnitPrice.
Negative quantities indicate product returns.
Extremely high prices represent expensive products.
Further cleaning may be needed to handle outliers.
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Box plot for Quantity
    sns.boxplot(x=df_cleaned['Quantity'])
    plt.title('Box Plot of Quantity')
    plt.show()
    # Box plot for UnitPrice
    sns.boxplot(x=df_cleaned['UnitPrice'])
    plt.title('Box Plot of UnitPrice')
    plt.show()
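A box plot with default settings draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to list the flagged rows. This is a minimal sketch: the helper name iqr_outliers and the sample values are hypothetical, and the 1.5 multiplier is the conventional default rather than a choice stated in the slides.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical quantities, including one large order and one large return.
quantity = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 200, -150], name="Quantity")

mask = iqr_outliers(quantity)
print(quantity[mask])
```

Whether flagged rows are dropped, capped, or kept depends on the analysis; the mask only identifies them.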
Conclusion and Next Steps
Summary of Findings
Present a clean summary with next steps.
The majority of sales come from the UK, with significant variability in product prices and quantities sold.
The dataset contains missing values and outliers that require further cleaning.
Next steps could include:
Investigating the reasons for outliers.
Performing feature engineering for predictive modelling.
    # Further analysis and potential feature engineering
    # assign() returns a new DataFrame rather than modifying the filtered one in place
    df_cleaned = df_cleaned.assign(TotalPrice=df_cleaned['Quantity'] * df_cleaned['UnitPrice'])
    # Additional summary after feature engineering
    df_cleaned[['Quantity', 'UnitPrice', 'TotalPrice']].describe()
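As one illustration of where the TotalPrice feature could lead, the sketch below totals it per country. The rows are hypothetical and the per-country aggregation is an assumed follow-up, not a step taken in the slides.

```python
import pandas as pd

# Hypothetical rows using the column names from the slides.
df_cleaned = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
    "Quantity": [2, 3, -1, 10],
    "UnitPrice": [2.5, 1.0, 2.5, 0.5],
})

# Line-item value; returns (negative Quantity) produce negative TotalPrice.
df_features = df_cleaned.assign(
    TotalPrice=df_cleaned["Quantity"] * df_cleaned["UnitPrice"]
)

# Net value per country, largest first.
revenue_by_country = (
    df_features.groupby("Country")["TotalPrice"]
    .sum()
    .sort_values(ascending=False)
)

print(revenue_by_country)
```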