statistics for data science, python and Machine learning
Size: 8.1 MB
Language: en
Added: Jun 27, 2024
Slides: 71 pages
Slide Content
Statistics for Data Science
Shekhar S Babar
What is Statistics?
Uses of Statistics in Business Decision Making
- Quantitative information about production, sales, purchases, finance, etc. can be obtained. This helps business owners formulate suitable policies.
- Using time series analysis, which is based on statistical methods, a business can forecast future activity.
- A large part of modern business is now organized around systems of statistical analysis and control.
Population vs Sample
A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn't always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, or organisms.
Population vs Sample: Examples
- Population: advertisements for IT jobs in India. Sample: the top 50 search results for advertisements for IT jobs in India on June 1, 2022.
- Population: songs from the Indian Idol contest. Sample: winning songs from the Indian Idol contest.
- Population: undergraduate students in India. Sample: 300 undergraduate students from three top universities who volunteer for your psychology research study.
- Population: all countries of the world. Sample: countries with published data available on birth rates and GDP since 2000.
Types of Statistics
1. Descriptive Statistics: Descriptive statistics describe a population or sample through numerical calculations, graphs, or tables. They summarize the data in hand without drawing conclusions beyond it.
2. Inferential Statistics: Inferential statistics make inferences and predictions about a population based on a sample of data taken from that population. They generalize from the sample to the larger dataset and apply probability to draw conclusions. They are used to analyze and interpret results and to explain what the descriptive statistics mean. Inferential statistics are closely associated with hypothesis testing, which checks whether the evidence is strong enough to reject the null hypothesis.
Common inferential methods:
- One-sample test of difference / one-sample hypothesis test
- Confidence interval
- Contingency tables and chi-square statistic
- t-test or ANOVA
- Pearson correlation
- Bivariate regression
- Multivariate regression
MEAN - MEDIAN - MODE (MEASURES OF CENTRAL TENDENCY)

SR NO  STUDENT  AGE
1      SHEKHAR  32
2      SAGAR    36
3      SURESH   34
4      ARATI    28
5      ARYA     30
6      ABIR     28
7      PRAVIN   35

MEAN = (32 + 36 + 34 + 28 + 30 + 28 + 35) / 7 = 223 / 7 = 31.86

Sorted by age:

SR NO  STUDENT  AGE
1      ARATI    28
2      ABIR     28
3      ARYA     30
4      SHEKHAR  32
5      SURESH   34
6      PRAVIN   35
7      SAGAR    36

MEDIAN = 32 (the middle value of the sorted ages)
MODE = 28 (the most frequent age)
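The three measures above can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative and not part of the original slides.

```python
import statistics

# Ages from the student table
ages = [32, 36, 34, 28, 30, 28, 35]

mean_age = statistics.mean(ages)      # sum of values divided by their count
median_age = statistics.median(ages)  # middle value after sorting
mode_age = statistics.mode(ages)      # most frequent value

print(round(mean_age, 2), median_age, mode_age)  # 31.86 32 28
```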
STANDARD DEVIATION
A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
RANGE - VARIANCE - DISPERSION (MEASURES OF DISPERSION/VARIABILITY)
Range: a measure of how far apart the values in a sample or data set are spread.
Range = Maximum value - Minimum value
Variance: describes how much a random variable differs from its expected value. It is computed as the average of the squared deviations from the mean, and the standard deviation is its square root.
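As a rough illustration of these definitions, the sketch below computes the range, variance, and standard deviation of the ages from the student table. It assumes the population formulas (dividing by n); `statistics.variance` and `statistics.stdev` would give the sample versions (dividing by n - 1).

```python
import statistics

# Ages from the student table
ages = [32, 36, 34, 28, 30, 28, 35]

value_range = max(ages) - min(ages)         # maximum minus minimum
pop_variance = statistics.pvariance(ages)   # mean of squared deviations from the mean
pop_std_dev = statistics.pstdev(ages)       # square root of the population variance

print(value_range, round(pop_variance, 2), round(pop_std_dev, 2))
```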
Data
"Torture the data and it will confess to anything." - Ronald Coase, Nobel Prize in Economics
Importance of Data
- Data helps in making better decisions.
- Data helps in solving problems by finding the reasons for underperformance.
- Data helps one evaluate performance.
- Data helps one improve processes.
- Data helps one understand consumers and the market.
Types of Data / Variables
Types of Data: Qualitative Data / Categorical Data
Qualitative or categorical data describes the object under consideration using a finite set of categories, e.g. Gender, Marital Status, Cloth Size, Grade.
Nominal Data: A nominal scale classifies data into several distinct categories in which no ranking is implied. For example: Gender, Marital Status.
Ordinal Data: An ordinal scale classifies data into distinct categories in which a ranking is implied. For example:
Cloth Size: Small - Medium - Large
Grades: A - B - C - D

Name     Gender  Marital Status  Weight (kg)  Number of Children  Cloth Size  Grade
Shekhar  Male    Married         62.1         1                   Medium      A
Sagar    Male    Married         58.6         2                   Small       C
Sakshi   Female  Unmarried       53           (blank)             Small       B
Pravin   Male    Unmarried       72.4         (blank)             Large       A
Riya     Female  Married         60           2                   Medium      B
Types of Data: Numerical Data / Quantitative Data
Numerical data can be further classified into two categories:
Discrete Data: data that takes distinct, countable numerical values, for example Number of Children, Defects per Hour.
Continuous Data: data that can take any numerical value within a range, for example Weight, Voltage.
Advanced Level Data Classification
Nominal scale: A nominal scale classifies data into several distinct categories in which no ranking is implied. For example: Gender, Marital Status.
Ordinal scale: An ordinal scale classifies data into distinct categories in which a ranking is implied. For example:
Faculty rank: Professor, Associate Professor, Assistant Professor
Student grades: A, B, C, D, E, F
Interval scale: An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity, but the measurements do not have a true zero point. For example: Temperature in Fahrenheit and Celsius, Years.
Ratio scale: A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Hence, all arithmetic operations, including ratios, are meaningful on ratio scale data. For example: Weight, Age, Salary.
Independent vs Dependent vs Control Variables
Example study: effect of advertising on sales
- Independent variables: variables you manipulate in order to affect the outcome of an experiment. Example: advertising expenditure.
- Dependent variables: variables that represent the outcome of the experiment. Example: sales amount.
- Control variables: variables that are held constant throughout the experiment. Example: salary of the employee.
Range Example
Your data set is the ages of 8 participants.

Participant  1   2   3   4   5   6   7   8
Age          37  19  31  29  21  26  33  36

First, order the values from low to high to identify the lowest value (L) and the highest value (H).

Age  19  21  26  29  31  33  36  37

Then subtract the lowest from the highest value.
R = H - L
R = 37 - 19 = 18
The range of our data set is 18 years.
Arithmetic Mean vs Geometric Mean

Year               2000  2004  2008  2012  2016
Voter turnout (%)  50.3  55.7  57.1  54.9  55.5

The geometric mean is best for reporting average inflation, percentage change, and growth rates. Because these types of data are expressed as fractions, the geometric mean is more accurate for them than the arithmetic mean. While the arithmetic mean is appropriate for values that are independent of each other (e.g., test scores), the geometric mean is more appropriate for dependent values, percentages, fractions, or widely ranging data.
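To make the difference concrete, here is a minimal sketch comparing the two means on yearly growth factors. The three growth factors are hypothetical values chosen for illustration; they do not come from the slides.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10%, -5%, +20%
growth_factors = [1.10, 0.95, 1.20]

arithmetic = statistics.mean(growth_factors)
geometric = statistics.geometric_mean(growth_factors)  # n-th root of the product

# Compounding the geometric mean over 3 years reproduces the total growth
total_growth = math.prod(growth_factors)
print(round(arithmetic, 4), round(geometric, 4), round(total_growth, 4))
```

The geometric mean is the constant yearly factor that would produce the same total growth, which is why it suits compounding quantities.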
Harmonic Mean
The harmonic mean is a numerical average calculated by dividing the number of observations, or entries in the series, by the sum of the reciprocals of each number in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
Example: the harmonic mean of 1, 4, and 4 is 3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2.
Example: two firms. One has a market capitalization of $100 billion and earnings of $4 billion (P/E of 25). The other has a market capitalization of $1 billion and earnings of $4 million (P/E of 250). In an index made of the two stocks, with 10% invested in the first and 90% invested in the second, the P/E ratio of the index is……
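The two examples above can be worked through in a short sketch. The weighted harmonic mean formula used for the index, 1 / sum(w_i / x_i), is an assumption about how the slide intends the P/E ratios to be combined; the weighted arithmetic mean is shown only for comparison.

```python
import statistics

# Harmonic mean of 1, 4 and 4: n divided by the sum of reciprocals
hm = statistics.harmonic_mean([1, 4, 4])

# Two-stock index: portfolio weights and P/E ratios from the example
weights = [0.10, 0.90]
pe_ratios = [25, 250]

# Weighted arithmetic mean, for comparison
weighted_am = sum(w * pe for w, pe in zip(weights, pe_ratios))

# Weighted harmonic mean (assumed formula): 1 / sum(w_i / x_i)
weighted_hm = 1 / sum(w / pe for w, pe in zip(weights, pe_ratios))

print(hm, round(weighted_am, 1), round(weighted_hm, 1))
```

The harmonic version is much lower than the arithmetic one because it gives more weight to the smaller P/E ratio.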
Harmonic Mean
The harmonic mean is best used for fractions such as rates or multiples. Harmonic means are used in finance to average data like price multiples such as the price-to-earnings (P/E) ratio. Market technicians may also use harmonic means to identify patterns such as Fibonacci sequences.
The harmonic mean necessarily includes all the entries in a series, and it gives more weight to smaller values. It can be calculated for a series that includes negative values, but not for a series that contains a zero, because the reciprocal of zero is undefined.
Frequency Tables
A frequency table is a way to present data. The data are counted and ordered to summarize larger sets of data. With a frequency table you can analyze the way the data is distributed across different values. Frequency means the number of times a value appears in the data. A table can quickly show us how many times each value appears.
Example: age of the 934 Nobel Prize winners up until the year 2020.

Age Interval  Frequency
10-19         1
20-29         2
30-39         48
40-49         158
50-59         236
60-69         262
70-79         174
80-89         50
90-99         3
Relative frequency means the number of times a value appears in the data compared to the total amount. A percentage is a relative frequency.

Age Interval  Relative Frequency
10-19         0.11%
20-29         0.21%
30-39         5.14%
40-49         16.92%
50-59         25.27%
60-69         28.05%
70-79         18.63%
80-89         5.35%
90-99         0.32%

Cumulative frequency counts up to a particular value. Here are the cumulative frequencies of ages of Nobel Prize winners. Now we can see how many winners have been younger than a certain age.

Age               Cumulative Frequency
Younger than 20   1
Younger than 30   3
Younger than 40   51
Younger than 50   209
Younger than 60   445
Younger than 70   707
Younger than 80   881
Younger than 90   931
Younger than 100  934
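The relative and cumulative frequencies can be derived from the raw counts with a few lines of Python. This is a minimal sketch using the Nobel Prize age counts from the frequency table; the variable names are illustrative.

```python
from itertools import accumulate

intervals = ["10-19", "20-29", "30-39", "40-49", "50-59",
             "60-69", "70-79", "80-89", "90-99"]
frequencies = [1, 2, 48, 158, 236, 262, 174, 50, 3]

total = sum(frequencies)                        # 934 winners in total
relative = [f / total for f in frequencies]     # share of the total per interval
cumulative = list(accumulate(frequencies))      # running total of the counts

for label, f, r, c in zip(intervals, frequencies, relative, cumulative):
    print(f"{label:>6} {f:>4} {r:7.2%} {c:>4}")
```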
Five Number Summary for Analysis
Data: 10, 11, 14, 16, 16, 19, 23, 26, 30, 32
Min = 10
Max = 32
Median = (16 + 19) / 2 = 17.5
Q1 (25th percentile) = 14, the median of the lower half
Q3 (75th percentile) = 26, the median of the upper half
IQR = Q3 - Q1 = 26 - 14 = 12
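A minimal sketch of the five-number summary follows. It assumes the "median of each half" convention for quartiles, which reproduces the values on this slide; other quartile conventions (including the defaults of some libraries) can give slightly different Q1 and Q3.

```python
import statistics

data = sorted([10, 11, 14, 16, 16, 19, 23, 26, 30, 32])

half = len(data) // 2
lower = data[:half]                                            # values below the median
upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]  # values above the median

minimum, maximum = data[0], data[-1]
median = statistics.median(data)
q1 = statistics.median(lower)
q3 = statistics.median(upper)
iqr = q3 - q1

print(minimum, q1, median, q3, maximum, iqr)  # 10 14 17.5 26 32 12
```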
Boxplots / Box-and-Whisker Plots
A boxplot is a plot that shows the five-number summary of a dataset. In this example (plant heights in inches), the five-number summary is:
The minimum = 10
The first quartile = 12
The median = (13 + 14) / 2 = 13.5
The third quartile = 19
The maximum = 25
Why are boxplots useful? They help us visualize five important descriptive statistics of a dataset: minimum, lower quartile, median, upper quartile, and maximum. For example:
1. How tall is the tallest plant? [25 inches, the maximum]
2. What percent of plants are taller than 19 inches? [25%, since 19 is the third quartile]
IQR - Outlier Detection
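The slide's own rule is not recoverable from the text, so the sketch below assumes the commonly used 1.5 x IQR convention: a value is flagged as an outlier if it lies below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR. The function name `iqr_outliers` and the extra value 60 are hypothetical, added only for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (median-of-halves quartiles)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    q1, q3 = statistics.median(lower), statistics.median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Data from the five-number summary slide plus one hypothetical extreme value (60)
values = [10, 11, 14, 16, 16, 19, 23, 26, 30, 32, 60]
outliers = iqr_outliers(values)
print(outliers)  # [60]
```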
What is a Bar Graph?
A bar graph is a pictorial representation of data in the form of vertical or horizontal rectangular bars of uniform width, where the length of each bar is proportional to the value it represents. A bar graph is used to compare two or more categories. It consists of two or more parallel vertical (or horizontal) rectangles.
Types of Bar Charts
Vertical (or Horizontal) Grouped Bar Charts: The grouped bar chart is also referred to as the clustered bar chart. It shows discrete values for two or more categorical variables side by side.
Vertical (or Horizontal) Stacked Bar Charts: The stacked bar chart is also known as the composite bar chart.
Histogram
A histogram is a graphical representation of a grouped frequency distribution with continuous classes. It is an area diagram and can be defined as a set of rectangles with bases along the intervals between class boundaries and with areas proportional to the frequencies in the corresponding classes.

Lifetime (in hours)  Number of lamps
300 - 400            14
400 - 500            56
500 - 600            60
600 - 700            86
700 - 800            74
800 - 900            62
900 - 1000           48
Histogram shapes: bell-shaped, bimodal, skewed right, skewed left, uniform.
Histogram vs Bar Graph
Pie Chart
A pie chart is a type of graph that represents data in a circular form. The slices of the pie show the relative size of the data, and it is a type of pictorial representation of data. A pie chart requires a list of categorical variables and numerical variables. Here, the term "pie" represents the whole, and the "slices" represent the parts of the whole.
Density Curve
A density curve is a curve on a graph that represents the distribution of values in a dataset. The most famous density curve is the bell-shaped curve that represents the normal distribution.
- A density curve gives us a good idea of the "shape" of a distribution, including whether it has one or more "peaks" of frequently occurring values and whether it is skewed to the left or the right.
- A density curve lets us see where the mean and the median of a distribution are located.
- A density curve lets us see what percentage of observations in a dataset fall between different values.
- If a density curve is left skewed, then the mean is less than the median.
- If a density curve is right skewed, then the mean is greater than the median.
- If a density curve has no skew, then the mean is equal to the median.
KDE Plot
A KDE (Kernel Density Estimate) plot is used for visualizing the probability density of a continuous variable. It depicts the probability density at different values of the variable. Several samples can also be drawn on a single graph, which makes comparisons easier.
Scatter Plot
Scatterplots are used to display the relationship between two variables.
Hypothesis Testing
What is a hypothesis? A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.
What is the null hypothesis? The default assumption that there is no effect or no difference, i.e. that everything is treated as equal and similar.
What is the p-value? The probability of observing results at least as extreme as the data, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.
Significance of the p-value (at the usual 0.05 level):
- p < 0.05: strong evidence against the null hypothesis; reject the null hypothesis.
- p > 0.05: weak evidence against the null hypothesis; fail to reject the null hypothesis.
- p = 0.05: marginal; could go either way.
Skewness and Kurtosis
Skewness is a statistical measure of the asymmetry of a probability distribution. It quantifies the degree to which the data deviate from a symmetrical distribution such as the normal distribution (bell curve). Positive skewness indicates a longer tail on the right side of the distribution, while negative skewness indicates a longer tail on the left side. Skewness helps in understanding the shape of a dataset and its outliers.
If the values of a specific independent variable (feature) are skewed, then, depending on the model, the skewness may violate model assumptions or make feature importance harder to interpret.
In a skewed data set, typical values fall between the first quartile (Q1) and the third quartile (Q3).
Positively Skewed or Right-Skewed (Positive Skewness)
A positively skewed or right-skewed distribution has a long right tail. For positively skewed distributions, the best-known transformation is the log transformation, which replaces each value in the dataset with its natural logarithm.
Negatively Skewed or Left-Skewed (Negative Skewness)
A negatively skewed or left-skewed distribution has a long left tail; it is the mirror image of a positively skewed distribution.
Pearson's Coefficients of Skewness
Pearson's first coefficient of skewness is useful when the data have a strong mode. If the data have a weak mode or several modes, Pearson's first coefficient is not preferred, and Pearson's second coefficient may be superior because it does not rely on the mode.
Rule of thumb:
- If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
- If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are slightly skewed.
- If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are extremely skewed.
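The slide does not show the formulas, so this sketch assumes the standard textbook definitions: first coefficient = (mean - mode) / standard deviation, and second coefficient = 3 x (mean - median) / standard deviation. It uses the student ages from the mean/median/mode slide and the sample standard deviation, which is also an assumption.

```python
import statistics

# Ages from the student table
ages = [32, 36, 34, 28, 30, 28, 35]

mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)
std_dev = statistics.stdev(ages)  # sample standard deviation (assumed choice)

pearson_first = (mean - mode) / std_dev          # relies on the mode
pearson_second = 3 * (mean - median) / std_dev   # relies on the median instead

print(round(pearson_first, 3), round(pearson_second, 3))
```

Here the two coefficients disagree in sign, because the mode (28) is supported by only two observations. This is the kind of weak-mode situation in which the second coefficient is usually preferred.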
Kurtosis
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal distribution.
- Distributions with medium kurtosis (medium tails) are mesokurtic.
- Distributions with low kurtosis (thin tails) are platykurtic.
- Distributions with high kurtosis (fat tails) are leptokurtic.
In the kurtosis formula, n is the sample size, xi are the observations of the variable, and x_bar is the mean of the variable.
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level of risk for an investment because it indicates high probabilities of extremely large and extremely small returns. A small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.
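The exact formula on the slide is not recoverable from the text. As one common definition (an assumption here), kurtosis can be computed as the fourth central moment divided by the squared second central moment, and excess kurtosis subtracts 3, the kurtosis of a normal distribution. The function name and sample values are illustrative.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n  # second central moment (variance)
    m4 = sum((x - x_bar) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values have thin tails, so the result is negative (platykurtic)
flat_sample = [1, 2, 3, 4, 5]
print(round(excess_kurtosis(flat_sample), 2))  # -1.3
```

Statistical packages often apply small-sample bias corrections, so their results may differ slightly from this plain moment-based version.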
What Is Probability?
Probability is a measure of how likely an event is to occur. Whenever we're unsure about the outcome of an event, we can talk about the probabilities of certain outcomes, i.e. how likely they are. For example, if there is a 60% chance of rain tomorrow, then the probability of rain is 60%.
What Is a Probability Distribution?
A probability distribution is a statistical function that describes all the possible values of a random variable and their probabilities. It can be represented as a table or an equation that links every outcome of a statistical experiment with its probability of occurrence.
The possible values are bounded by the minimum and maximum possible values, but where a value falls on the probability distribution depends on a number of factors, including the mean (average), standard deviation, skewness, and kurtosis of the distribution.
Notation:
- X denotes the random variable.
- P(X) denotes the probability of X.
- P(X = x) is the probability that the random variable X is equal to a particular value, denoted by x.
Types of Probability Distributions
- Discrete probability distributions
- Continuous probability distributions
Example: a farmer weighs 100 random eggs and describes their frequency distribution using a histogram. We can get a rough idea of the probability of different egg sizes directly from this frequency distribution. For example, she can see that there's a high probability of an egg being around 1.9 oz., and a low probability of an egg being bigger than 2.1 oz.
Discrete Probability Distributions
A discrete probability distribution is a probability distribution of a categorical or discrete variable. Discrete probability distributions only include the probabilities of values that are possible. In other words, a discrete probability distribution doesn't include any values with a probability of zero. For example, a probability distribution of dice rolls doesn't include 2.5, since it's not a possible outcome of a dice roll. The probabilities of all possible values in a discrete probability distribution add up to one.
Probability Mass Functions (PMF)
A probability mass function (PMF) is a mathematical function that describes a discrete probability distribution. It gives the probability of every possible value of a variable. A probability mass function can be represented as an equation or as a graph.
Example: the number of sweaters owned per person in the United States follows a Poisson distribution. You can have two sweaters or 10 sweaters, but you can't have 3.8 sweaters. The probability that a person owns zero sweaters is 0.05, the probability that they own one sweater is 0.15, and so on. If you add together the probabilities for every possible number of sweaters a person can own, the total is exactly 1.
Continuous Probability Distributions
A continuous probability distribution is the probability distribution of a continuous variable. A continuous variable can have any value between its lowest and highest values. Therefore, continuous probability distributions include every number in the variable's range.
The probability that a continuous variable will have any specific value is so infinitesimally small that it's considered to be zero. However, the probability that a value will fall within a certain interval of values within its range is greater than zero.
Probability Density Functions (PDF)
A probability density function (PDF) is a mathematical function that describes a continuous probability distribution. It provides the probability density of each value of a variable, which can be greater than one.
In graph form, a probability density function is a curve. You can determine the probability that a value will fall within a certain interval by calculating the area under the curve within that interval. The area under the whole curve is always exactly one because it's certain (i.e., a probability of one) that an observation will fall somewhere in the variable's range.
Example: Probability Density Function
The egg weights follow a normal distribution, which is described by its probability density function. The probability of an egg being exactly 2 oz. is zero. The probability that an egg is within a certain weight interval, such as between 1.98 and 2.04 oz., is greater than zero and can be represented in the graph of the probability density function as a shaded region under the curve.
How to Find the Expected Value and Standard Deviation?
Expected value (EV): a predicted value of a variable, calculated as the sum of all possible values, each multiplied by the probability of its occurrence. In general, it is the long-run average result over many repeated trials of a statistical experiment, not necessarily the single most likely outcome.

No. of Sweaters  Probability
2                0.6
3                0.2
4                0.2

Worked example (using the probabilities 0.2, 0.5 and 0.3): what is the expected value of sweaters per person?

Sweaters (x)  Probability (P(x))  x * P(x)
2             0.2                 2 * 0.2 = 0.4
3             0.5                 3 * 0.5 = 1.5
4             0.3                 4 * 0.3 = 1.2

E(x) = 0.4 + 1.5 + 1.2
E(x) = 3.1 sweaters

Sweaters (x)  Probability (P(x))  x - E(x)         [x - E(x)]^2 * P(x)
2             0.2                 2 - 3.1 = -1.1   (-1.1)^2 * 0.2 = 0.242
3             0.5                 3 - 3.1 = -0.1   (-0.1)^2 * 0.5 = 0.005
4             0.3                 4 - 3.1 = 0.9    (0.9)^2 * 0.3 = 0.243

σ = √(0.242 + 0.005 + 0.243)
σ = √(0.49)
σ = 0.7 sweaters
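The worked calculation can be reproduced with a short sketch. It uses the probabilities from the worked example (0.2, 0.5, 0.3); the dictionary name `pmf` is illustrative.

```python
import math

# Number of sweaters -> probability, from the worked example
pmf = {2: 0.2, 3: 0.5, 4: 0.3}

# Expected value: sum of each value times its probability
expected = sum(x * p for x, p in pmf.items())

# Variance: probability-weighted squared deviations from the expected value
variance = sum((x - expected) ** 2 * p for x, p in pmf.items())
std_dev = math.sqrt(variance)

print(round(expected, 2), round(variance, 2), round(std_dev, 2))  # 3.1 0.49 0.7
```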
Common Discrete Probability Distributions
- Binomial: describes variables with two possible outcomes. It's the probability distribution of the number of successes in n trials with p probability of success. Example: the number of times a coin lands on heads when you toss it five times.
- Discrete uniform: describes events that have equal probabilities. Example: the suit of a randomly drawn playing card.
- Poisson: describes count data. It gives the probability of an event happening k number of times within a given interval of time or space. Example: the number of text messages received per day.
Common Continuous Probability Distributions
- Normal distribution: describes data with values that become less probable the farther they are from the mean, with a bell-shaped probability density function. Example: SAT scores.
- Continuous uniform: describes data for which equal-sized intervals have equal probability. Example: the amount of time cars wait at a red light.
- Log-normal: describes right-skewed data. It's the probability distribution of a random variable whose logarithm is normally distributed. Example: the average body weight of different mammal species.
- Exponential: describes data that has higher probabilities for small values than large values. It's the probability distribution of time between independent events. Example: time between earthquakes.
Binomial Distribution Criteria
- The number of trials must be fixed. You can only work out the probability of a given number of successes when there is a finite number of trials.
- Every trial is independent. No trial should affect the outcome of the next trial.
- The probability of success stays the same in every trial.
- Each trial has only two possible outcomes, a success or a failure.
Examples: if the WHO introduces a new cure for a disease, it either cures the disease or it does not. If you buy a lottery ticket, you either win money or you do not. In other words, any situation in which each trial ends in a success or a failure can be modeled with the binomial distribution. Note that two possible outcomes do not have to be equally likely.
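Binomial Distribution - Formula
P(X = k) = C(n, k) * p^k * q^(n - k), where n is the number of trials, k is the number of successes, p is the probability of success in one trial, q = 1 - p is the probability of failure, and C(n, k) = n! / (k! (n - k)!).
Mean and Variance of a Binomial Distribution
Mean (µ) = np
Variance (σ^2) = npq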
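As an illustration, the sketch below evaluates the standard binomial probability mass function for the coin example mentioned earlier (five tosses of a fair coin) and checks the mean and variance formulas. The helper name `binomial_pmf` is hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.5  # five tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))                     # should equal n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should equal n * p * q

print(pmf[2], mean, variance)  # 0.3125 2.5 1.25
```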
Poisson Distribution Formula
The Poisson distribution formula is used to find the probability of a given number of events that happen independently and discretely over a fixed period of time, when the mean rate of occurrence is constant over time.
P(X = k) = (λ^k * e^(-λ)) / k!, for k = 0, 1, 2, ..., where λ is the mean number of events in the period.
Conditions:
- The events are independent.
- The average number of events in the given period of time is known and constant.
- No two events can occur at exactly the same time.
Properties: mean = variance = λ. The standard deviation is always equal to the square root of the mean μ.
Applications of the Poisson Distribution
- To count the number of defects in a finished product
- To count the number of deaths in a country from a disease or natural calamity
- To count the number of infected plants in a field
- To count the number of bacteria in organisms or radioactive decays in atoms
- To model waiting times between events (through the related exponential distribution)
Poisson Example
A hospital baby unit has an average of 4 births per week. A week is chosen at random.
Case 1: probability that exactly 4 babies are born this week
Case 2: probability of fewer than 4 births this week
Case 3: probability of 7 or more births this week
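The three cases can be sketched with the standard Poisson probability mass function, P(X = k) = λ^k e^(-λ) / k!, using λ = 4 births per week. The helper name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4  # average births per week

p_exactly_4 = poisson_pmf(4, lam)                              # Case 1: P(X = 4)
p_fewer_than_4 = sum(poisson_pmf(k, lam) for k in range(4))    # Case 2: P(X <= 3)
p_7_or_more = 1 - sum(poisson_pmf(k, lam) for k in range(7))   # Case 3: 1 - P(X <= 6)

print(round(p_exactly_4, 4), round(p_fewer_than_4, 4), round(p_7_or_more, 4))
```

Case 3 uses the complement because summing the infinite upper tail directly is not practical.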
Uniform Distribution
A uniform distribution is a distribution in which all outcomes are equally likely, so the probability is constant across the range. The continuous uniform distribution is also known as the rectangular distribution. It has two parameters, a and b: a = minimum and b = maximum. The distribution is written as U(a, b).
Normal Distribution
In a normal distribution, data is symmetrically distributed with no skew. When plotted on a graph, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.
- The mean, median and mode are exactly the same.
- The distribution is symmetric about the mean: half the values fall below the mean and half above the mean.
- The distribution can be described by two values: the mean and the standard deviation.
Standard Normal Distribution
The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1. Every normal distribution is a version of the standard normal distribution that's been stretched or squeezed and moved horizontally right or left.
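The "stretch and shift" relationship can be sketched with the z-score, z = (x - mean) / standard deviation, using `statistics.NormalDist` from the Python standard library. The mean of 2.0 oz. and standard deviation of 0.05 oz. are hypothetical values chosen for illustration; they are not given in the slides.

```python
from statistics import NormalDist

# Hypothetical egg-weight distribution (parameters are assumptions)
mu, sigma = 2.0, 0.05
eggs = NormalDist(mu, sigma)
standard = NormalDist()  # mean 0, standard deviation 1

x = 2.04
z = (x - mu) / sigma  # number of standard deviations x lies above the mean

p_direct = eggs.cdf(x)      # P(weight <= 2.04) from the original distribution
p_via_z = standard.cdf(z)   # same probability from the standard normal

print(round(z, 2), round(p_direct, 4), round(p_via_z, 4))
```

Both probabilities agree, which is why tables and software for the standard normal distribution can serve any normal distribution after conversion to z-scores.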