BA_Module_2.pptx: College Module (Business Analytics)

sourabh715771 7 views 36 slides Oct 30, 2025



Slide Content

Module 2: Descriptive Analytics
By Dr. Shreya Mathur, Assistant Professor, FoM, VGU, Jaipur
www.vgu.ac.in

INTRODUCTION

In today's fast-paced and competitive business landscape, data has become one of the most valuable assets for organizations. The ability to collect, analyze, and interpret data effectively can provide a significant advantage in decision-making. Among the various types of data analytics, descriptive analytics plays a fundamental role by helping organizations summarize historical data to gain meaningful insights. By focusing on "what has happened," descriptive analytics lays the groundwork for more advanced techniques such as predictive and prescriptive analytics.

Descriptive analytics is the process of analyzing historical data to identify patterns, trends, and key metrics. It uses statistical methods and data visualization techniques to transform raw data into meaningful summaries that are easy to interpret. Unlike predictive analytics, which forecasts future outcomes, or prescriptive analytics, which suggests actions to achieve desired results, descriptive analytics focuses solely on providing a clear understanding of past performance.

Key tools and techniques used in descriptive analytics include:
- Data aggregation: combining data from multiple sources to create a unified view.
- Data visualization: using charts, graphs, and dashboards to represent data visually.
- Statistical summaries: calculating measures such as mean, median, mode, standard deviation, and percentages to describe the data distribution.
- Trend analysis: identifying patterns or shifts over time.

By leveraging these tools, businesses can gain insights into their operations, customer behavior, and market dynamics.
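The statistical summaries listed above can be sketched with Python's standard library. The sales figures here are invented purely for illustration:

```python
import statistics

# Hypothetical monthly sales figures (illustrative only, not from the module)
sales = [120, 135, 150, 150, 160, 175, 300]

summary = {
    "mean": statistics.mean(sales),      # arithmetic average
    "median": statistics.median(sales),  # middle value when sorted
    "mode": statistics.mode(sales),      # most frequent value
    "stdev": statistics.stdev(sales),    # sample standard deviation
}
print(summary)
```

Note how the single large value (300) pulls the mean (170) above the median (150), previewing the outlier discussion later in the module.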

INTRODUCTION

Descriptive analytics plays a significant role in research by summarizing and interpreting large datasets, helping researchers identify patterns, trends, and relationships within the data. It provides the foundation for further analysis, such as predictive or inferential analytics, and supports evidence-based conclusions.

Data summarization: Descriptive analytics helps researchers summarize complex datasets into meaningful insights using statistical measures such as:
- Central tendency: mean, median, mode.
- Dispersion: range, variance, standard deviation.
- Frequencies: counts and percentages.
Example: In a study examining student performance, researchers calculate the average test scores, identify the most frequent grades, and measure score variability across different schools.

Trend analysis: Researchers use descriptive analytics to identify trends and patterns over time, allowing them to observe changes and gain insight into recurring behaviors.
Example: A public health study may analyze historical data on disease outbreaks to observe seasonal patterns, such as increased flu cases during winter months.

Data visualization: Descriptive analytics leverages visual tools such as bar charts, line graphs, pie charts, and heatmaps to make data easier to interpret. Visual representations help communicate findings effectively to stakeholders and other researchers.
Example: In climate research, temperature changes over decades can be visualized with line graphs, revealing long-term warming trends.

INTRODUCTION

Demographic analysis: Descriptive analytics is widely used to analyze demographic data, such as age, gender, income, or geographic distribution, across many fields of research.
Example: A market research study for a new product analyzes customer demographics to determine the target audience based on age group, income level, and location.

Comparative analysis: Researchers use descriptive analytics to compare different groups or variables within a dataset to identify disparities or variations.
Example: In an educational study, researchers might compare test scores between students in urban and rural schools to examine disparities in access to quality education.

Identifying outliers and anomalies: Descriptive analytics helps detect unusual data points that might indicate errors or noteworthy phenomena.
Example: In a financial research study, sudden spikes in transaction volumes might be flagged as potential fraud or unusual market behaviour.

Measures of Central Tendency: Mean

The mean (commonly known as the average) is a measure of central tendency that summarizes a dataset by dividing the sum of all values by the number of values. It provides a single representative value that describes the entire dataset.

Formula: Mean = (sum of all values) / (number of values)

When to use the mean:
- The data is symmetrical: when the dataset contains no extreme outliers and the values are evenly distributed, the mean accurately represents the central tendency.
  Example: In a class of 30 students, test scores (e.g., 70, 75, 80, 85, 90) can be averaged to determine overall class performance.
- The data is numerical and continuous: for quantitative data measured on a scale (e.g., height, weight, income).
  Example: A company calculates the mean monthly sales revenue for the past year to evaluate overall performance.
- When comparing groups: the mean allows comparisons between groups of data, provided the distributions are similar.
  Example: Comparing the mean annual salary of employees in two different departments.
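The formula above applied to the class-score example is a one-liner:

```python
# Test scores from the example above
scores = [70, 75, 80, 85, 90]

# Mean = sum of all values / number of values
mean = sum(scores) / len(scores)
print(mean)  # 80.0
```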

Measures of Central Tendency: Mean

When not to use the mean:
- Presence of outliers: extreme values can skew the mean so that it no longer represents the central tendency.
  Example: For the salaries $40,000, $50,000, $55,000, $60,000, and $1,000,000, the mean is pulled far upward by the $1,000,000 outlier, making it a poor representative of most salaries.
  Alternative: use the median, which is not affected by outliers.
- Skewed data: when the data is not symmetrically distributed (heavily right- or left-skewed), the mean can misrepresent the dataset.
  Example: Analyzing household income in a city where a small percentage of households earn significantly more than the rest.
  Alternative: the median is better suited to skewed data.
- Categorical data: the mean cannot be applied to qualitative or categorical data, such as survey responses ("Very Satisfied," "Neutral," "Dissatisfied").
  Example: Analyzing the average satisfaction level in customer feedback data.
  Alternative: use the mode to identify the most frequent category.
- Small sample sizes: with very few observations, the mean may not provide meaningful insight.
  Example: If a dataset has only 3 values (e.g., 1, 2, 100), the mean is not representative of the majority.

When to Use the Mean (and When Not To)

When to use the mean: A manufacturing company tracks the daily production output of widgets over a month; the outputs for 30 days are recorded as 200, 210, 195, 205, 200, and so on. Summing all outputs and dividing by 30 gives the average daily production, a quick summary of overall performance.

When not to use the mean: A tech company analyzes employee salaries across the organization. The dataset includes 100 employees earning $50,000 to $100,000 annually and 5 executives earning $1,000,000 each. The mean salary would be heavily skewed upward by the executives' high salaries, giving a misleading impression of what most employees earn.
Alternative: use the median salary to better represent the central tendency of employee earnings.

Measures of Central Tendency: Median

The median is the middle value in a dataset when the numbers are arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle one; if it has an even number of values, the median is the average of the two middle numbers. Unlike the mean, the median is not affected by outliers or skewed data, making it a reliable measure of central tendency in those situations.
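The odd/even rule above can be checked directly with the standard library:

```python
import statistics

# Odd number of values: the median is the single middle value
print(statistics.median([10, 20, 30]))      # 20

# Even number of values: the median averages the two middle values
print(statistics.median([40, 10, 30, 20]))  # 25.0
```

`statistics.median` sorts internally, so the second list need not be pre-sorted.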

Where the Median Works Best

- Skewed data distributions: when the data is heavily skewed (right or left), the median gives a better representation of the central tendency.
  Example: Household income in a city, where a few high-income households can inflate the mean, making the median the more accurate measure.
- Data with outliers: in datasets with extreme values (very high or very low), the median remains unaffected, unlike the mean.
  Example: A group of employees earning $30,000 to $50,000 annually with one executive earning $1,000,000; the median salary reflects typical earnings better than the mean.
- Ordinal data: when the data has an order but no numerical scale, such as survey ratings ("Strongly Agree," "Agree," "Neutral"), the median can indicate the central response.
  Example: Analyzing customer satisfaction survey results to find the typical level of satisfaction.

When Not to Use the Median

- Symmetrical data: when the data is symmetrically distributed without outliers, the mean provides a more precise and mathematically sound measure of central tendency.
  Example: Measuring the average test scores of a class where all scores cluster around the middle.
- Small sample sizes: in small datasets, the median might not be representative because it relies on only one or two middle values.
  Example: In a dataset with just 3 values (10, 15, 50), the median is 15, which does not reflect the high variability in the data.
- Quantitative analysis: if detailed calculations or further statistical analysis (such as standard deviation) is required, the mean is preferred because it uses all data points.
  Example: Calculating the average daily temperature over a month for climate research.

Examples

When to use the median: A real estate company analyzes house prices in a metropolitan city: $150,000, $175,000, $200,000, $225,000, $1,500,000. The median ($200,000) summarizes the central tendency better than the mean ($450,000), which is skewed by the luxury home priced at $1,500,000.

When not to use the median: An educator evaluates the test scores of 10 students: 80, 85, 87, 89, 90, 91, 92, 93, 94, 95. Since the scores are symmetrically distributed, the mean (89.6) is a more precise measure of overall performance than the median (90.5); the median ignores the variation in scores at the upper and lower ends.
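The real-estate figures above reproduce exactly in code:

```python
import statistics

# House prices from the example above
prices = [150_000, 175_000, 200_000, 225_000, 1_500_000]

print(statistics.mean(prices))    # skewed upward by the luxury home
print(statistics.median(prices))  # the "typical" price, unaffected by it
```

The mean comes out at 450,000 against a median of 200,000, matching the slide's numbers.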


Definition of Outliers in Research

In research, outliers are data points that differ significantly from the other observations in a dataset. These values are either much higher or much lower than the rest of the data and may arise from variability in the data, experimental errors, or unique factors. Outliers can heavily influence statistical analysis, leading to skewed results or misinterpretations if not handled properly.

Types of outliers:
- Univariate outliers: unusual values in a single variable.
  Example: In a dataset of student test scores ranging from 50-90, a score of 5 is a univariate outlier.
- Multivariate outliers: unusual combinations of values across multiple variables.
  Example: A person with a high salary but extremely low work experience in an employee dataset.
- Global outliers: data points that are extreme compared to the entire dataset.
  Example: A house priced at $5,000,000 in a neighborhood where most homes are priced around $200,000.
- Contextual (conditional) outliers: values that are outliers only under specific conditions or contexts.
  Example: A temperature of 25°C might be typical in summer but an outlier in winter.

How to Identify Outliers
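The slide above names the topic but its detail (likely a figure) did not survive extraction. One common identification method, assumed here since the slide's own method is not shown, is the 1.5 × IQR rule used in box plots:

```python
import statistics

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

# Test scores mostly in the 50-90 range, plus the score of 5
# from the univariate-outlier example on the previous slide
print(iqr_outliers([5, 50, 55, 60, 65, 70, 75, 80, 85, 90]))  # [5]
```

Another common rule, mentioned later in the standard-deviation slides, flags values more than 2 or 3 standard deviations from the mean.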

Examples

- Healthcare research: a hospital collects data on patient recovery times after surgery.
  Outlier: one patient took 200 days to recover, while most took 10-30 days.
  Action: investigate the patient's medical history; if a pre-existing condition explains it, the value may be valid but require separate analysis.
- Market research: a company analyzes monthly sales data across stores.
  Outlier: one store reported 10x the sales of the others.
  Action: check whether this was due to a promotional event or a reporting error; if legitimate, the store's strategy could offer insights for other locations.
- Education research: a study evaluates students' scores on a standardized exam.
  Outlier: a student scored 0 despite attending the exam.
  Action: investigate whether the student left the test incomplete or faced other challenges; this could reveal gaps in the examination process.
- Environmental research: monitoring daily rainfall in a region.
  Outlier: one day reported 500 mm of rainfall, while the average is 50 mm.
  Action: cross-check with weather reports to determine whether it was a genuine extreme event or a measurement error.

Outliers

Outliers are extreme values that differ significantly from the rest of the dataset. They can distort statistical analyses, but they can also provide valuable insights. How to handle an outlier depends on its cause, the context, and the research goals: use appropriate statistical and visual methods to identify and analyze outliers, then decide how to treat them.

Descriptive Statistics: Mode

The mode is the value that appears most frequently in a dataset. It is a measure of central tendency that is particularly useful for identifying the most common or popular item, category, or occurrence. Unlike the mean or median, the mode can be used for both numerical and categorical data, making it versatile across research scenarios.

Key characteristics of the mode:
- Usability with categorical data: the mode is the only measure of central tendency that can be applied to qualitative (non-numeric) data.
  Example: The most popular color of cars sold in a city.
- Robust to outliers: the mode is not influenced by extreme values in the dataset.
- Multiple modes: a dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).

Mode in Research

1. Analyzing categorical data: the mode identifies the most frequent category or characteristic in a dataset.
   Example: In a survey on customer preferences for mobile phone brands, the mode reveals the most preferred brand.
2. Describing the most common outcome: used when the research question focuses on identifying the most common occurrence or behavior.
   Example: In education research, the mode can identify the subject most frequently chosen by students.
3. Understanding frequency in discrete data: the mode determines the most frequent numerical value, especially for discrete data.
   Example: In healthcare, the mode might identify the most common number of clinic visits a patient makes in a year.
4. Highlighting trends in consumer behavior: businesses use the mode to analyze trends and preferences.
   Example: Retailers might use the mode to find the most commonly sold product size or the most purchased product during a sale.
5. Studying nominal data: for datasets involving non-numeric labels or categories, the mode provides valuable insight.
   Example: A political survey might use the mode to determine the most frequently supported political party.
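Because the mode works on non-numeric data, the brand-preference example above can be sketched directly (the brand names are invented placeholders):

```python
import statistics

# Hypothetical survey responses on preferred phone brand (categorical data)
responses = ["BrandA", "BrandB", "BrandA", "BrandC", "BrandA", "BrandB"]

print(statistics.mode(responses))       # the single most frequent category
print(statistics.multimode(responses))  # all tied modes, for bimodal/multimodal data
```

Here "BrandA" appears three times, so it is the mode; `multimode` would return several values if categories tied for first place.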

Standard Deviation

Standard deviation (SD) is a measure of the dispersion or spread of data points in a dataset. In research and analytics, it quantifies how much individual data points differ from the mean (average) of the dataset. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation means that the data points are spread over a wider range of values.
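The low-versus-high contrast above is easy to see with two invented datasets sharing the same mean:

```python
import statistics

# Two hypothetical datasets with the same mean (100) but different spread
consistent = [98, 100, 102, 100, 100]
volatile = [60, 140, 100, 80, 120]

print(statistics.stdev(consistent))  # low SD: values hug the mean
print(statistics.stdev(volatile))    # high SD: values spread widely
```

Both means are 100, yet the standard deviations differ by more than an order of magnitude, which is exactly the distinction the definition draws.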

Applications in Research and Analytics

- Data consistency: in experiments, a low standard deviation suggests that measurements or observations are consistent and reliable.
- Risk and uncertainty: in business or finance analytics, a high standard deviation often indicates greater risk or uncertainty in outcomes or predictions.
- Comparing datasets: standard deviation helps compare the variability of different datasets; a researcher might compare the spread of test scores between two groups to determine which has more variability.
- Outlier detection: values more than 2 or 3 standard deviations from the mean can often be considered outliers.
- Control in experiments: in controlled experiments, researchers use standard deviation to measure the precision of measurements and ensure that experimental conditions are stable.

When to Use Standard Deviation

- Quantifying variability or dispersion: use standard deviation when you need to measure how spread out the data is around the mean.
  Example: Comparing the performance consistency of students in two different classes.
- Risk assessment: in finance and investment analysis, standard deviation is used to evaluate the volatility or risk of an asset or portfolio; a higher standard deviation indicates higher risk.
  Example: When analyzing stock price fluctuations, a high standard deviation signals a volatile market, while a low standard deviation indicates stability.
- Quality control in manufacturing: use standard deviation to assess product consistency and process stability by monitoring how much individual products deviate from the desired specification.
  Example: In a factory producing tablets, a smaller standard deviation in weight means more consistent products.
- Comparing datasets: standard deviation can compare the consistency of two or more datasets or experimental conditions.
  Example: Comparing the test scores of two classes to determine which has more consistent performance.
- Statistical testing (e.g., hypothesis testing): in inferential statistics, standard deviation is used to calculate the standard error, which is central to hypothesis testing and confidence intervals.
  Example: When conducting t-tests or z-tests, standard deviation helps determine whether observed differences are statistically significant.

When Not to Use Standard Deviation

- When the data is highly skewed or has outliers: standard deviation can be heavily influenced by extreme values or skewed distributions, and may not accurately represent the spread of the majority of the data.
  Alternative: consider the interquartile range (IQR) or median absolute deviation (MAD), which are more robust measures of spread in the presence of outliers.
  Example: If you are studying the income distribution of a population where most people earn modest salaries but a small percentage earn extremely high wages, the standard deviation will be inflated and will not represent typical incomes.
- When the data is categorical (nominal or ordinal): standard deviation applies only to quantitative data; it cannot be calculated for categorical data, which lacks a meaningful numeric scale.
  Alternative: for categorical data, use measures such as the mode, frequency distributions, or chi-square tests.
  Example: If you are studying the most popular fruits in a market (apple, banana, orange), standard deviation cannot be used because the data is non-numeric.
- When the data is non-normal (not bell-shaped): standard deviation is most easily interpreted for roughly normal data; for heavily skewed or multimodal (multiple-peak) distributions, its interpretation may be misleading.
  Alternative: use skewness or kurtosis to better understand the shape of the distribution.
  Example: For income distributions or house prices, which are often skewed, the standard deviation may not convey central tendency or variability well; a log transformation or the median may provide better insight.

When Not to Use Standard Deviation

- When the dataset is too small: standard deviation is less reliable for very small samples, which are more susceptible to random variability; the estimate may not be a stable or accurate reflection of the population's true spread.
  Alternative: for small samples, the standard error of the mean or confidence intervals can better convey precision.
  Example: With only 3 data points, the standard deviation can fluctuate significantly even if the sample does not truly represent the population.
- When the data is expressed in different units or scales: standard deviation can be misleading when comparing datasets measured in different units (e.g., weight in kilograms vs. height in meters); standardize the data first.
  Alternative: use z-scores (standardized scores) to compare datasets on different scales.
  Example: Comparing the monthly revenue of two companies operating in different currencies; convert to the same unit of measurement before comparing standard deviations.

Standard Error

Standard error (SE) is a statistical measure that quantifies the variability of a sample statistic, such as the sample mean, when estimating a population parameter. It is the standard deviation of the sampling distribution of that statistic. In simpler terms, it indicates how much the sample statistic is likely to vary from the true population value.

What it tells you:
- Precision of an estimate: a smaller standard error suggests that the sample statistic is a more precise estimate of the population parameter.
- Reliability of results: a smaller standard error generally indicates that the results of a study are more reliable and more likely to be replicated.
- Uncertainty in estimation: the standard error quantifies the uncertainty in using a sample statistic to estimate a population parameter.
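For the sample mean, the standard error is the sample standard deviation divided by the square root of the sample size. A minimal sketch, using invented numbers, also shows why more data gives a more precise estimate:

```python
import math
import statistics

def standard_error(sample):
    """SE of the mean: sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small = [10, 12, 14, 16, 18]
large = small * 4  # same spread of values, four times as many observations

print(standard_error(small))
print(standard_error(large))  # smaller: more data, more precise estimate
```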


Standard Error

The standard error affects research studies in several ways:

1. Sample size: the standard error is inversely proportional to the square root of the sample size, so a larger sample reduces the standard error and increases the precision of the estimate. For example, a survey of 1,000 people is likely to produce a more accurate estimate of public opinion than a survey of 100 people.
2. Confidence intervals: the standard error is used to calculate confidence intervals, which give a range of values within which the true population parameter is likely to fall. The wider the confidence interval, the less precise the estimate. Researchers use confidence intervals to judge the statistical significance of their findings.
3. Statistical tests: the standard error appears in many statistical tests, such as t-tests and ANOVA, to determine whether there is a significant difference between groups. A larger standard error reduces the power of the test, making it less likely to detect a true difference between groups.
4. Publication bias: publication bias occurs when studies with positive results are more likely to be published than studies with negative results. The standard error can help detect publication bias via a funnel plot, which plots effect size against standard error: a symmetrical funnel plot suggests no publication bias, while an asymmetrical one suggests that smaller studies with larger standard errors are missing.

Interpretation of Standard Error in Analytics

Precision of the estimate: a smaller SE means the sample mean is a more precise estimate of the population mean; the smaller the standard error, the closer the sample mean is likely to be to the true population mean. A larger SE means the sample mean is less precise, with more potential variability from the true population mean.

Influence of sample size: a larger sample size (n) gives a smaller standard error, so the estimate becomes more reliable as the sample grows; this is why larger samples are preferred in research and analytics. A smaller sample size gives a larger standard error, reflecting greater uncertainty about the population parameter.

Confidence intervals: the standard error is central to calculating confidence intervals (CI). A confidence interval provides a range within which the true population parameter (such as the mean) is likely to lie; the smaller the standard error, the narrower the confidence interval, indicating greater precision. The formula is:

    CI = mean ± (Z × SE)

where Z is the Z-value for the desired confidence level (e.g., 1.96 for 95% confidence).

Sampling distribution: the standard error measures the spread of sample means in the sampling distribution, i.e., the distribution of a sample statistic (such as the mean) over all possible samples of the same size drawn from the population. A smaller SE means the sampling distribution is tighter around the population mean.
Example: If you take many random samples of the same size from a population, the sample means will vary; the SE tells you how much they vary on average.

Comparison across samples: the standard error helps compare the variability of different samples. Given samples from different groups or time periods, the one with the smaller standard error gives more confidence that its mean is close to the true population mean.

Test statistics (hypothesis testing): SE is used in hypothesis testing, especially when calculating z-scores or t-scores, to determine whether an observed sample mean differs significantly from the population mean. In a t-test, the t-statistic is:

    t = (sample mean − population mean) / (SE of the sample mean)

A smaller SE leads to a larger t-value, which may indicate statistical significance.
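The confidence-interval formula CI = mean ± (Z × SE) can be sketched with an invented sample; the 1.96 multiplier for 95% confidence comes from the slide above:

```python
import math
import statistics

# Hypothetical sample of measurements
sample = [50, 52, 54, 56, 58]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))

# CI = mean ± (Z × SE), with Z = 1.96 for 95% confidence
z = 1.96
ci = (mean - z * se, mean + z * se)
print(ci)  # an interval centered on the sample mean of 54
```

A larger sample would shrink the SE and therefore narrow this interval, which is the precision argument the slide makes.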

Range and Inter-quartile Range (IQR) in Analytics

In data analysis, both the range and the interquartile range (IQR) measure the variability or spread of data. They help describe how data points are distributed across different values.

1. Range
Definition: the range is the simplest measure of variability; it is the difference between the maximum and minimum values in a dataset:

    Range = maximum value − minimum value

Interpretation: the range gives an idea of how spread out the data is. A large range indicates that the data points span a wide set of values, while a small range suggests they are clustered closely together.
Example: For the dataset [10, 12, 15, 22, 25, 30, 35], the maximum value is 35 and the minimum is 10, so the range = 35 − 10 = 25.
When to use: use the range when you want a quick sense of the total span of the data; it is most useful for datasets with no extreme outliers.
When not to use: the range is heavily affected by outliers or extreme values, so it is unsuitable for datasets with significant anomalies or non-normal distributions, where it can give a misleading picture of variability.

Interquartile Range (IQR)

Definition: the interquartile range (IQR) is the range between the first quartile (Q1) and the third quartile (Q3) of the dataset, i.e., the spread of the middle 50% of the data:

    IQR = Q3 − Q1

where Q1 (first quartile) is the median of the lower half of the data and Q3 (third quartile) is the median of the upper half.

Interpretation: the IQR captures the spread of the central data points and is less sensitive to outliers than the range. It is often used to describe the distribution of data in the presence of skewness or outliers.

Example: For the dataset [10, 12, 15, 22, 25, 30, 35] (already in ascending order):
- Median (Q2) = 22
- Q1 = median of [10, 12, 15] = 12
- Q3 = median of [25, 30, 35] = 30
- IQR = Q3 − Q1 = 30 − 12 = 18
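The worked example above can be reproduced with `statistics.quantiles`, whose default "exclusive" method happens to give the same quartiles as the median-of-halves calculation for this dataset (quartile conventions differ in general):

```python
import statistics

data = [10, 12, 15, 22, 25, 30, 35]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
print(q1, q2, q3)             # quartiles: Q1=12, Q2=22, Q3=30
print(q3 - q1)                # IQR = 18
print(max(data) - min(data))  # range = 25, as in the previous slide
```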

Interquartile Range (IQR)

When to use: use the IQR when you want a measure of spread that is robust to outliers; it focuses on the middle 50% of the data and is often used to understand variability when outliers are present. It also underpins box plots, where the IQR forms the "box" and the whiskers extend to the most extreme values within 1.5 × IQR of Q1 and Q3.
When not to use: the IQR may not fully describe the distribution when the data is highly skewed or has extreme values at both ends. If you are interested in the spread of all data points, including outliers, the IQR will not give the full picture, and the range may be preferable.

Normal Distribution in Research: Usage, Interpretation, and Application

The normal distribution (also called the Gaussian distribution or bell curve) is a probability distribution that is symmetric around the mean, with most data points clustering around the center and fewer appearing at the extremes. In a normal distribution the mean, median, and mode are equal.

Characteristics of the normal distribution:
- Symmetry: the distribution is perfectly symmetrical around the mean.
- Bell-shaped curve: most data points are concentrated around the mean, with fewer in the tails.
- Mean = median = mode: the measures of central tendency coincide.
- Defined by the mean (µ) and standard deviation (σ): the curve is completely determined by its mean (center) and standard deviation (spread).
- Empirical rule (68-95-99.7 rule): 68% of the data falls within 1 standard deviation (σ) of the mean, 95% within 2σ, and 99.7% within 3σ.

How Is the Normal Distribution Used in Research?
The normal distribution is widely used across research fields, including the social sciences, medicine, business analytics, and engineering, because it occurs naturally in much real-world data. Key areas of application:
1. Statistical inference and hypothesis testing: Many statistical tests (e.g., t-tests, ANOVA, Z-tests) assume normality. Researchers test whether a dataset follows a normal distribution before applying these methods; if it does, parametric tests (which rely on normality) can be used to draw reliable conclusions.
2. Standardization and Z-scores: Data is often standardized using Z-scores, which measure how far a data point is from the mean in units of standard deviations. Example: if a test has a mean of 70 and a standard deviation of 10, a score of 80 has a Z-score of (80 − 70)/10 = 1, meaning it is one standard deviation above the mean.
3. Probability calculations: Researchers use the normal distribution to calculate probabilities and make predictions about populations. Example: if IQ scores are normally distributed with μ = 100 and σ = 15, what is the probability that a randomly chosen individual has an IQ above 130?
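Both worked questions above can be answered with a few lines of Python. This is an illustrative sketch using the slides' own numbers; the survival function P(X > x) is written with the standard library's complementary error function:

```python
import math

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def normal_sf(x, mu, sigma):
    """P(X > x) for a normal distribution, via math.erfc."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

# Slide example: test score 80 with mean 70 and sd 10
print(z_score(80, 70, 10))                # 1.0, one sd above the mean

# Slide question: P(IQ > 130) with mu = 100, sigma = 15
print(round(normal_sf(130, 100, 15), 4))  # 0.0228
```

An IQ of 130 sits two standard deviations above the mean, so by the empirical rule only about 2.3% of people score higher, which the exact calculation confirms.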

How Is the Normal Distribution Used in Research? (continued)
4. Quality control and process optimization: In manufacturing and business analytics, the normal distribution is used in quality control to detect defects and variation in production processes. Example: if a factory produces light bulbs with a mean lifespan of 1,000 hours and a standard deviation of 50 hours, quality analysts can predict how many bulbs will last beyond 1,100 hours.
5. Regression analysis and machine learning: Many regression models assume normally distributed residuals, and in machine learning normality assumptions inform model selection and validation.
6. Risk assessment and decision making: In finance, the normal distribution is used to model stock returns, assess risk, and support investment decisions. In medicine, it helps describe the distribution of blood pressure, cholesterol levels, and other health indicators.
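The light-bulb question in point 4 is a direct application of the survival probability. A minimal sketch using the slide's figures (mean 1,000 h, sd 50 h), assuming lifespans are normally distributed:

```python
import math

def survival_prob(x, mu, sigma):
    """P(X > x) under a normal model, via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

# Fraction of bulbs expected to last beyond 1,100 hours:
p = survival_prob(1100, 1000, 50)
print(f"P(lifespan > 1100 h) = {p:.4f}")   # about 0.0228
print(f"expected per 10,000 bulbs: about {p * 10_000:.0f}")
```

Since 1,100 hours is two standard deviations above the mean, roughly 2.3% of bulbs (about 228 in every 10,000) would be expected to last that long.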

Interpretation of the Normal Distribution
1. Understanding population behavior: A normal distribution helps estimate the likelihood of different outcomes in a population. Example: if height follows a normal distribution, we can estimate what percentage of people are taller than a given height.
2. Identifying outliers: Data points that fall far from the mean (beyond 3σ) are considered outliers, which may indicate measurement errors or anomalies. Example: a student scoring 30% on an exam where the class average is 80% (with σ = 5) may need further analysis.
3. Setting thresholds and confidence intervals: The normal distribution is used to set confidence intervals (a 95% interval covers values within ±1.96σ of the mean). Example: if glucose levels in a population have a mean of 100 mg/dL and σ = 10, the central 95% of values lie between roughly 80.4 and 119.6 mg/dL (often rounded to 80 to 120 mg/dL).
4. Evaluating standard scores: Standard scores (Z-scores) allow comparison across different datasets. Example: if SAT scores (mean = 1,050, σ = 100) and ACT scores (mean = 21, σ = 4.7) are both normally distributed, Z-scores make performance on the two exams directly comparable.
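Points 3 and 4 can be sketched numerically. This illustration uses the slides' parameters; the specific SAT score of 1150 and ACT score of 26 are hypothetical values chosen to show the comparison:

```python
# 95% interval for glucose: mu = 100, sigma = 10, critical value z* = 1.96
mu, sigma, z_star = 100, 10, 1.96
lo, hi = mu - z_star * sigma, mu + z_star * sigma
print(f"95% interval: {lo:.1f} to {hi:.1f} mg/dL")   # 80.4 to 119.6

# Comparing standard scores across tests (SAT vs ACT, slide parameters;
# the individual scores below are hypothetical):
sat_z = (1150 - 1050) / 100
act_z = (26 - 21) / 4.7
print(f"SAT z = {sat_z:.2f}, ACT z = {act_z:.2f}")   # 1.00 vs 1.06
```

Even though 1150 and 26 live on completely different scales, the Z-scores show the two performances are nearly identical, with the ACT taker fractionally ahead.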

When to Use the Normal Distribution?
✅ Use the normal distribution when:
- The data is continuous (e.g., height, weight, test scores, IQ).
- The dataset is symmetric with no significant skewness.
- You need to apply parametric tests (t-tests, ANOVA, regression).
- The Central Limit Theorem (CLT) applies: even if the population is not normal, sample means tend to be normally distributed when n > 30.
❌ Do not use the normal distribution when:
- The data is skewed or has significant outliers.
- The dataset is small (n < 30), unless normality has been confirmed.
- You are working with categorical data (e.g., gender, colors, yes/no responses).
- The variance changes significantly over time (heteroscedasticity).
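The CLT claim in the checklist can be demonstrated with a short simulation. As an illustrative sketch (the exponential population and sample sizes are choices for this demo, not from the slides): we repeatedly draw samples of size 40 from a strongly skewed exponential distribution and look at the distribution of the sample means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the demo is reproducible

# Population: exponential with rate 1 (mean 1, sd 1) - clearly not normal.
# Draw 2,000 samples of size n = 40 and record each sample's mean.
n, trials = 40, 2000
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

m = statistics.fmean(sample_means)
s = statistics.stdev(sample_means)
print(f"mean of sample means: {m:.3f}")   # close to the population mean, 1.0
print(f"sd of sample means:   {s:.3f}")   # close to sigma/sqrt(n) = 1/sqrt(40) ~ 0.158
```

The sample means cluster tightly and symmetrically around the population mean, with spread shrinking as sigma/sqrt(n), which is exactly what the CLT predicts and why n > 30 is the usual rule of thumb.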

THANK YOU