Data Analysis and Statistical Techniques.pptx

SeanMontanaOmondi 0 views 50 slides Oct 11, 2025


TITLE: RESEARCH METHODOLOGY AND ANALYTICS TRAINING COURSE
PRESENTED BY: MR. DARWIN MONG'ARE
DATE: 17th SEPT. 2025

Data Analysis and Statistical Techniques

The purpose: to answer the research question and to help identify trends and relationships among the variables.

Step 1 - Ask: Define the Problem
Objective: Understand the business problem and stakeholder needs
Key Questions: What problem are we trying to solve? What does success look like? Who are the stakeholders?
Example:
Problem: E-commerce company wants to reduce cart abandonment
Success: Identify top 3 reasons for abandonment & recommend solutions
Stakeholders: Marketing Director, UX Team, CFO

Step 2 - Prepare: Data Collection
Objective: Gather relevant data from various sources
Data Types:
First-party: Company's website analytics, sales data
Second-party: Partner data (payment processor abandonment rates)
Third-party: Industry reports on e-commerce trends
Example Sources: Google Analytics, CRM database, customer surveys, payment processor reports
Storage: SQL database or Excel/Google Sheets

Step 3 - Process: Data Cleaning
Objective: Ensure data quality and remove bias
Common Issues & Solutions:
Missing values: Remove or impute (e.g., average order value)
Duplicates: Identify and remove identical records
Formatting: Standardize dates, categories, currencies
Bias check: Ensure all customer segments are represented
Example:
Remove test transactions from production data
Standardize state abbreviations (CA vs. California)
Check for international customer representation

Example Data: Raw Transaction Data (Before Cleaning)

Transaction ID | Customer ID | Order Value (KES) | County         | Country | Payment Method | Date
TX001          | CUST001     | 1500              | Nairobi        | Kenya   | M-Pesa         | 02/01/2023
TX002          | CUST002     | missing           | Nairobi County | Kenya   | M-Pesa         | 2023-01-03
TX003          | CUST001     | 1500              | Nairobi        | Kenya   | M-Pesa         | 02/01/2023
TX004          | test_user   |                   | Test County    | Kenya   | N/A            | 05-01-2023
TX005          | CUST003     | 3000              | Kisumu         | Kenya   | Airtel Money   | 06/01/2023
TX006          | CUST004     | 4500              | Kampala        | Uganda  | M-Pesa         | 07/01/2023

Example Data: Cleaned Transaction Data (After Cleaning)

Transaction ID | Customer ID | Order Value (KES) | County  | Country | Payment Method | Date
TX001          | CUST001     | 1500              | Nairobi | Kenya   | M-Pesa         | 2023-01-02
TX002          | CUST002     | 3000              | Nairobi | Kenya   | M-Pesa         | 2023-01-03
TX005          | CUST003     | 3000              | Kisumu  | Kenya   | Airtel Money   | 2023-01-06
TX006          | CUST004     | 4500              | Kampala | Uganda  | M-Pesa         | 2023-01-07
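The cleaning steps above can be sketched with pandas (named later in the deck as an analysis library). The DataFrame mirrors the example table; the column names are illustrative, and date standardization is omitted for brevity.

```python
import pandas as pd

# Raw transactions mirroring the slide's example table (names are illustrative).
raw = pd.DataFrame({
    "transaction_id": ["TX001", "TX002", "TX003", "TX004", "TX005", "TX006"],
    "customer_id":    ["CUST001", "CUST002", "CUST001", "test_user", "CUST003", "CUST004"],
    "order_value":    [1500, None, 1500, None, 3000, 4500],
    "county":         ["Nairobi", "Nairobi County", "Nairobi", "Test County", "Kisumu", "Kampala"],
    "payment_method": ["M-Pesa", "M-Pesa", "M-Pesa", "N/A", "Airtel Money", "M-Pesa"],
})

# 1. Remove test transactions from production data.
clean = raw[raw["customer_id"] != "test_user"].copy()

# 2. Drop duplicate records: TX003 repeats TX001's details under a new ID.
clean = clean.drop_duplicates(subset=["customer_id", "order_value", "county", "payment_method"])

# 3. Standardize category labels ("Nairobi County" -> "Nairobi").
clean["county"] = clean["county"].str.replace(" County", "", regex=False)

# 4. Impute the missing order value with the mean of the remaining orders.
clean["order_value"] = clean["order_value"].fillna(clean["order_value"].mean())
```

Imputing TX002 with the mean of the remaining orders (3000 KES) reproduces the cleaned table above.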

Step 4 - Analyze: Data Analysis
Objective: Explore data, identify patterns and relationships
Tools & Techniques:
Excel: Pivot tables, VLOOKUP, basic statistics
SQL: Querying, joining tables, aggregations
Python/R: Statistical testing, predictive modeling
Example Analysis:
Calculate abandonment rate by traffic source
Identify average time between cart add and abandonment
Correlation between page load time and abandonment

Example Scenario: E-commerce Cart Abandonment Analysis (Kenya)

Sample Raw Data:

Session ID | User ID | Traffic Source | Page Load Time (sec) | Cart Added Time | Abandoned Time | Completed Purchase | Amount (KES)
S001       | U001    | Google Ads     | 3.2                  | 10:00 AM        | 10:03 AM       | No                 |
S002       | U002    | Organic Search | 2.5                  | 11:00 AM        | N/A            | Yes                | 2,500
S003       | U003    | Instagram      | 6.8                  | 1:00 PM         | 1:01 PM        | No                 |
S004       | U004    | Google Ads     | 4.0                  | 3:00 PM         | 3:05 PM        | No                 |
S005       | U005    | Instagram      | 2.0                  | 5:00 PM         | N/A            | Yes                | 3,200

Analysis Examples: Abandonment Rate by Traffic Source
Goal: Identify which sources bring users who abandon carts most often.
Method: Group by traffic source and calculate:
Abandonment Rate = (# of sessions with abandoned cart) / (total sessions)

Result:
Traffic Source | Total Sessions | Abandoned Sessions | Abandonment Rate
Google Ads     | 2              | 2                  | 100%
Organic Search | 1              | 0                  | 0%
Instagram      | 2              | 1                  | 50%

Insight: Google Ads has the highest cart abandonment rate (100%), suggesting possible targeting or landing page issues.
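With the session data in a pandas DataFrame (an assumption; the slide does not prescribe a tool), the grouped abandonment rate is a short sketch:

```python
import pandas as pd

# Sessions from the slide's sample scenario (column names are illustrative).
sessions = pd.DataFrame({
    "session_id":     ["S001", "S002", "S003", "S004", "S005"],
    "traffic_source": ["Google Ads", "Organic Search", "Instagram", "Google Ads", "Instagram"],
    "abandoned":      [True, False, True, True, False],
})

# Abandonment rate = abandoned sessions / total sessions, per traffic source.
# The mean of a boolean column is exactly that fraction.
rates = sessions.groupby("traffic_source")["abandoned"].mean().mul(100)
print(rates)  # Google Ads 100%, Instagram 50%, Organic Search 0%
```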

Analysis Examples: Average Time Between Cart Add and Abandonment
Goal: Understand user patience by measuring how quickly users abandon carts after adding items.
Method: Time Difference = Abandoned Time – Cart Added Time

Result:
Session ID | Time Before Abandonment
S001       | 3 minutes
S003       | 1 minute
S004       | 5 minutes

Average time to abandonment: (3 + 1 + 5) / 3 = 3 minutes

Insight: Most users abandon within a very short time window; consider adding nudges or popups 1–2 minutes after cart add.
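The same calculation can be done programmatically; here is a pandas sketch over the three abandoned sessions (timestamps are the slide's, parsed as times of day):

```python
import pandas as pd

# Cart-add and abandonment timestamps for the abandoned sessions.
events = pd.DataFrame({
    "session_id": ["S001", "S003", "S004"],
    "cart_added": pd.to_datetime(["10:00", "13:00", "15:00"]),
    "abandoned":  pd.to_datetime(["10:03", "13:01", "15:05"]),
})

# Time Difference = Abandoned Time - Cart Added Time, in minutes.
events["minutes_to_abandon"] = (
    (events["abandoned"] - events["cart_added"]).dt.total_seconds() / 60
)
avg = events["minutes_to_abandon"].mean()  # (3 + 1 + 5) / 3 = 3.0 minutes
```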

Step 5 - Share: Data Visualization
Objective: Communicate insights effectively
Tool Options:
Tableau: Interactive dashboards
Power BI: Business reporting
Python/R: Custom visualizations (ggplot, matplotlib)
Excel: Charts and graphs
Example Visualizations:
Funnel chart showing drop-off points
Heatmap of abandonment by time/day
Comparative charts by device type
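As one small example of the Python option, a matplotlib bar chart of abandonment rate by traffic source (the numbers are the illustrative ones from this scenario; matplotlib is assumed to be installed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Abandonment rates per traffic source from the earlier analysis example.
sources = ["Google Ads", "Organic Search", "Instagram"]
rates = [100, 0, 50]

fig, ax = plt.subplots()
ax.bar(sources, rates)
ax.set_ylabel("Abandonment rate (%)")
ax.set_title("Cart abandonment by traffic source")
fig.savefig("abandonment_by_source.png")
```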

Step 6 - Act: Presenting Insights
Objective: Drive decision-making with data storytelling
Effective Presentation Includes:
Clear narrative structure (problem → analysis → insight → recommendation)
Visualizations tailored to the audience
Actionable recommendations
Limitations and next steps
Example Recommendations:
Optimize checkout page load speed (technical)
Implement exit-intent popup with discount (marketing)
Simplify checkout form (UX)
Projected impact: 15% reduction in abandonment

Kinds of Data Analysis
1. Descriptive Statistics
2. Inferential Statistics

Descriptive Analysis
Refers to the description of the data from a particular sample; hence any conclusion must refer only to the sample. In other words, these statistics summarize the data and describe sample characteristics.
Descriptive Statistics are numerical values obtained from the sample that give meaning to the data collected.

Classification of Descriptive Analysis
a. Frequency Distribution
A systematic arrangement of numeric values from the lowest to the highest or highest to lowest.
Formula: Σf = N
Where:
Σ = the sum of
f = frequency
N = sample size
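A frequency distribution is easy to illustrate in Python with the standard library (the scores are made up for this sketch); note that the frequencies sum to N:

```python
from collections import Counter

# Illustrative scores; a frequency distribution tallies how often each value occurs.
scores = [250, 350, 350, 500, 500, 500, 650, 750]
freq = Counter(scores)

# Σf = N: the frequencies sum to the sample size.
assert sum(freq.values()) == len(scores)

for value in sorted(freq):        # arranged from lowest to highest
    print(value, freq[value])
```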

Classification of Descriptive Analysis
b. Measure of Central Tendency
A statistical index that describes the average of a set of values.
Kinds of Averages:
Mode – the numerical value in a distribution that occurs most frequently
Median – an index of the average position in a distribution of numbers
Mean – the point on the score scale that is equal to the sum of the scores divided by the number of scores

Classification of Descriptive Analysis
Formula (Mean):
X̄ = ∑X / N
Where:
X̄ = the mean
∑X = the sum of the scores
N = the number of cases
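The three averages can be computed with Python's standard statistics module (the sample scores are illustrative):

```python
import statistics

scores = [4, 8, 6, 5, 3, 8, 9]  # illustrative sample

mean = statistics.mean(scores)      # sum of scores / number of scores (X̄ = ΣX / N)
median = statistics.median(scores)  # middle value of the sorted scores
mode = statistics.mode(scores)      # most frequently occurring value
```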

Classification of Descriptive Analysis c. Measure of Variability Statistics that concern the degree to which the scores in a distribution are different from or similar to each other

Classification of Descriptive Analysis
Commonly used measures of Variability
1. Range
The distance between the highest score and the lowest score in the distribution.
Example: The range for learning centre A is 500 (750 – 250) and the range for learning centre B is 300 (650 – 350).

Classification of Descriptive Analysis
Commonly used measures of Variability
2. Standard Deviation
The most commonly used measure of variability; it indicates the average degree to which scores deviate from the mean.
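A quick Python illustration using the standard statistics module (the sample is invented, and chosen so the population standard deviation comes out to exactly 2):

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample, mean = 5

# Population standard deviation: how far scores deviate from the mean on average.
sigma = statistics.pstdev(scores)   # 2.0 for this data

# Sample standard deviation divides by N - 1 instead of N, so it is slightly larger.
s = statistics.stdev(scores)
```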

Classification of Descriptive Analysis
d. Bivariate Descriptive Statistics
Derived from the simultaneous analysis of two variables to examine the relationship between them.
Commonly used Bivariate Descriptive Analyses:
a. Contingency Tables
Essentially a two-dimensional frequency distribution in which the frequencies of two variables are cross-tabulated.
b. Correlation
The most common method of describing the relationship between two measures.
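Both bivariate techniques can be sketched with pandas (assumed available; all data below are illustrative):

```python
import pandas as pd

# Two categorical variables for a contingency table (illustrative data).
df = pd.DataFrame({
    "smoker":  ["yes", "yes", "no", "no", "yes", "no"],
    "disease": ["yes", "no", "no", "no", "yes", "yes"],
})

# Contingency table: frequencies of the two variables cross-tabulated.
table = pd.crosstab(df["smoker"], df["disease"])
print(table)

# Correlation between two numeric measures (Pearson's r).
hours = pd.Series([1, 2, 3, 4, 5])
score = pd.Series([52, 55, 61, 64, 70])
r = hours.corr(score)  # close to +1: strong positive relationship
```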

2. Inferential Analysis
The use of statistical tests, either to test for significant relationships among variables or to find statistical support for hypotheses.
Inferential Statistics are numerical values that enable the researcher to draw conclusions about a population based on the characteristics of a population sample. This is based on the laws of probability.

2. Inferential Analysis
Inferential Statistics
Parameter – a characteristic of a population
Statistic – a characteristic of a sample
It is not possible to study the whole population, so we study a sample and make predictions or statements about the population from our findings.

Level of Significance
An important factor in determining how representative the sample is of the population and the degree to which chance affects the findings.
The level of significance is a numerical value selected by the researcher before data collection to indicate the probability of erroneous findings being accepted as true. This value is typically set at 0.01 or 0.05 (Massey, 1991).

Common methods
Hypothesis testing (t-tests, chi-square tests, ANOVA)
Confidence intervals
Regression analysis
Correlation analysis

1. Hypothesis-Testing Procedures
The outcome of the study may retain, revise, or reject the hypothesis; this determines the acceptability of the hypothesis and of the theory from which it was derived.
t-test: used to examine the difference between the means of two independent groups.
Example: Is the average height of men different from that of women?
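A t-test sketch using scipy.stats (the height samples are invented for illustration):

```python
from scipy import stats

# Heights (cm) of two independent groups (illustrative samples).
men   = [175, 178, 171, 180, 174, 177]
women = [163, 166, 160, 168, 165, 162]

# Independent-samples t-test on the difference between the two group means.
t_stat, p_value = stats.ttest_ind(men, women)
# If p_value < 0.05 (the chosen level of significance), the difference is significant.
```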

2. Analysis of Variance (ANOVA)
Used to test the significance of differences between the means of two or more groups.
Example: Does fertilizer type affect plant growth differently across four groups?
3. Chi-square
Used to test hypotheses about the proportion of elements that fall into the various cells of a contingency table.
Example: Is there an association between smoking status and lung disease?
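Both tests are available in scipy.stats; the group data and contingency counts below are illustrative:

```python
from scipy import stats

# One-way ANOVA: plant growth (cm) under three fertilizer types (illustrative).
a = [20, 22, 19, 21]
b = [28, 30, 27, 29]
c = [24, 23, 25, 24]
f_stat, p_anova = stats.f_oneway(a, b, c)

# Chi-square test of independence on a 2x2 contingency table
# (rows: smoker yes/no; columns: disease yes/no; illustrative counts).
observed = [[30, 10], [15, 45]]
chi2, p_chi, dof, expected = stats.chi2_contingency(observed)
# Small p-values suggest the group means differ (ANOVA) and that
# smoking status and disease are associated (chi-square).
```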

2. Confidence Intervals
Provide a range of values within which we expect the population parameter (such as a mean) to lie, with a certain level of confidence (usually 95%).
Example:
Sample average height: 170 cm
95% Confidence Interval: 168 cm to 172 cm
Interpretation: We are 95% confident that the true population average height is between 168 and 172 cm.
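A 95% confidence interval for a mean can be computed from the t distribution; the height sample below is invented (and chosen so the sample mean is 170 cm):

```python
import statistics
from scipy import stats

# Sample of heights (cm), illustrative.
heights = [168, 172, 169, 171, 170, 173, 167, 170]

n = len(heights)
mean = statistics.mean(heights)               # 170.0
sem = statistics.stdev(heights) / n ** 0.5    # standard error of the mean

# 95% confidence interval using the t distribution (n - 1 degrees of freedom).
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
# Interpretation: we are 95% confident the population mean lies in (low, high).
```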

3. Regression Analysis
Used to understand the relationship between a dependent variable and one or more independent variables.
Types:
Simple regression: one predictor variable
Example: Hours studied predicting exam score.
Multiple regression: multiple predictors
Example: Hours studied, attendance, and sleep predicting exam score.
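A simple-regression sketch with scipy.stats.linregress (the hours and scores are invented):

```python
from scipy import stats

# Hours studied vs exam score (illustrative data for simple regression).
hours  = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

# Fit score = intercept + slope * hours.
result = stats.linregress(hours, scores)

# The fitted line can then predict the score for an unseen value.
predicted_7h = result.intercept + result.slope * 7
```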

4. Correlation Analysis
Measures the strength and direction of the linear relationship between two variables.
Correlation coefficient (r):
+1: Perfect positive correlation
0: No correlation
-1: Perfect negative correlation
Example: Correlation between time spent exercising and cardiovascular health.
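Pearson's r with NumPy (the exercise and fitness numbers are illustrative):

```python
import numpy as np

# Weekly exercise hours vs a cardiovascular fitness score (illustrative).
exercise = np.array([0, 2, 3, 5, 6, 8])
fitness  = np.array([55, 60, 63, 70, 72, 80])

# Pearson's r lies between -1 and +1; the off-diagonal entry of the
# correlation matrix is the coefficient for the pair.
r = np.corrcoef(exercise, fitness)[0, 1]
```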

Data cleaning and preprocessing techniques

Data Cleaning Data cleaning, also known as data cleansing or scrubbing, is the process of identifying and correcting errors or inconsistencies in datasets. It involves handling missing values, removing duplicates, correcting inaccuracies, and ensuring data consistency. The goal is to improve the quality and reliability of the data, making it suitable for analysis.

Data Preprocessing Data preprocessing involves a broader set of activities that prepare raw data for analysis. It includes cleaning, but also encompasses tasks such as feature scaling, handling categorical variables, data transformation, and splitting data into training and testing sets. The purpose is to make the data more suitable for machine learning algorithms and statistical analysis.
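A dependency-free sketch of three preprocessing tasks named above, using min-max scaling and one-hot encoding as concrete choices (all values and field names are illustrative):

```python
import random

# Illustrative rows: one numeric feature and one categorical feature.
rows = [
    {"order_value": 1500, "payment": "M-Pesa"},
    {"order_value": 3000, "payment": "Airtel Money"},
    {"order_value": 4500, "payment": "M-Pesa"},
    {"order_value": 2500, "payment": "Card"},
]

# Feature scaling (min-max): map order_value into the [0, 1] range.
values = [r["order_value"] for r in rows]
lo, hi = min(values), max(values)
for r in rows:
    r["order_value_scaled"] = (r["order_value"] - lo) / (hi - lo)

# Handling categorical variables: one-hot encode the payment method.
categories = sorted({r["payment"] for r in rows})
for r in rows:
    for cat in categories:
        r[f"payment_{cat}"] = int(r["payment"] == cat)

# Train/test split: shuffle, then hold out 25% of rows for evaluation.
random.seed(42)
shuffled = random.sample(rows, k=len(rows))
split = int(len(shuffled) * 0.75)
train, test = shuffled[:split], shuffled[split:]
```

In practice libraries such as scikit-learn provide these steps ready-made; the point here is only to show what each transformation does.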

Statistical Software

Statistical Software
Specialized tools for managing, analyzing, and visualizing numerical data. They provide an interface (graphical or code-based) to execute complex statistical operations efficiently and accurately, eliminating the need for error-prone, time-consuming manual calculations.
Core Purpose: to transform raw data into meaningful insights and support data-driven decision-making.

Key Uses of Statistical Software
Data Management: import, clean, filter, and merge large datasets.
Descriptive Statistics: summarize data (mean, median, standard deviation, frequencies).
Data Visualization: create charts, graphs, and plots (histograms, scatter plots).
Inferential Statistics: hypothesis testing (t-tests, ANOVA), correlation & regression analysis, advanced modeling (factor analysis, cluster analysis).
Reporting: generate publication-ready tables and reports.

Introduction to SPSS (IBM SPSS Statistics ) A widely-used software package known for its user-friendly, point-and-click interface. Origin: Developed for social sciences, now used in healthcare, marketing, education, and government. Key Strength: Makes advanced statistical analysis accessible to non-programmers.

SPSS Workflow & Procedure
1. Define Variables: in Variable View, set names, types, and labels.
2. Enter/Import Data: use Data View or import from Excel/CSV.
3. Analyze Data: use the Analyze menu for procedures (e.g., Descriptives, T-Tests, Regression).
4. View Output: results appear in the Output Viewer window.
5. Save: Data (.sav file) and Output (.spv file).

Introduction to Stata A powerful, integrated package popular in economics, biomedicine, and political science. Primary Interface: Command-line driven, but includes a GUI. Known for strong data management. Key Strength: Excellence in advanced econometrics, reproducibility, and publication-quality graphics.

Key Features of Stata
Reproducibility: analysis is scripted using .do files, ensuring full documentation and replication.
Extensibility: users can write custom commands (ado-files).
Active Community: a large community contributes new statistical methods and provides support.
Powerful Graphics: high-quality, customizable graphs for publications.

Introduction to Python for Statistics A general-purpose programming language, not just statistical software. How it works: Uses open-source libraries (packages) for data analysis. Key Strength: Unmatched flexibility, integration, and automation for data science and machine learning.

Key Python Libraries for Data Analysis
Pandas: data manipulation and analysis with DataFrame objects.
NumPy: foundation for numerical computation (arrays, matrices).
SciPy & statsmodels: advanced statistical testing, modeling, and analysis.
Matplotlib & Seaborn: creating static, animated, and interactive visualizations.

How to Choose the Right Tool?
Choose SPSS if: you are a student or researcher; you need to perform standard analyses quickly; you have no coding experience.
Choose Stata if: you are an academic researcher (e.g., in economics); you need advanced econometrics and reproducibility; you value a strong balance between power and ease of use.
Choose Python if: you aim for a career in data science or programming; you need custom analyses, automation, or machine learning; flexibility and integration with other systems are critical.

THE END THANK YOU!