Data Analytics, Engineering, & Science
- Buzzwords that every decision maker either wants to or is forced to look into
- Data-driven decision making is hard
  - Needs the right data, fitting tools, skilled analysts, & a supportive environment
- Data analysts
- Domain experts
- Tool experts
- Useful insights

Objectives
- Given a dataset, to train you to
  - Ask the right questions
  - Identify the right tool(s)
  - Derive the right answers/insights
- We take a data-driven approach
  - First, try to derive a set of questions based on the data available for the analysis
  - Explore potential techniques to support answering those questions while using the available data
  - Derive the right answers by interpreting processed data & visualizations

Descriptive, Predictive, & Prescriptive Analytics
- Descriptive analytics
  - Use data aggregation & data mining techniques to provide insight into the past & answer: “What has happened?”
- Predictive analytics
  - Use statistical models & forecasting techniques to understand the future & answer: “What could happen?”
- Prescriptive analytics
  - Use optimization & simulation algorithms to advise on possible outcomes & answer: “What should we do?”
Source: https://halobi.com/2014/10/descriptive-predictive-and-prescriptive-analytics-explained/

Example
- Descriptive analytics
  - Wal-Mart found that on Friday afternoons, young American males who buy diapers also tend to buy beer
  - Potential sales of each item can increase if they are kept close to each other
- Predictive analytics
  - Demand for diapers could increase in mid to late summer, as more babies are expected to be born in the USA
  - Make sure expectant mothers are informed of their diaper choices through advertising, & that production & supply are ready to meet the extra demand
  - Increased sales
- Prescriptive analytics
  - When to start advertising & when to give discounts?
  - Helps us understand the most effective dates & discount percentages that increase not only sales but also profit

Tools can help reduce difficulty
Source: IBM
Review of Basic Statistics & Probability
Populations & Samples
- Population
  - All items of interest for a particular decision or investigation
  - E.g., all Gmail users, all subscribers to Netflix
- Sample
  - A subset of the population
  - E.g., all Google Apps for Education users, list of customers who rented a comedy from Netflix in the past year
- The purpose of sampling is to obtain sufficient information to draw a valid inference about a population

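
As a rough illustration of the population/sample distinction, the sketch below draws a simple random sample and compares its mean with the population mean. The population of 1,000 numbers, the sample size, and the choice of simple random sampling are all assumptions made for the example; the slide does not prescribe a sampling method.

```python
import random
import statistics

# Hypothetical population: 1,000 values (e.g., minutes watched per subscriber).
population = list(range(1, 1001))

# Simple random sample without replacement.
# The seed is fixed only so that the example is repeatable.
rng = random.Random(42)
sample = rng.sample(population, k=100)

population_mean = statistics.mean(population)  # known only if we see everyone
sample_mean = statistics.mean(sample)          # estimate from the subset

print(f"population mean = {population_mean}, sample mean = {sample_mean}")
```

The sample mean will generally differ from the population mean; how close it gets depends on the sample size and on how the sample was drawn.
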
Sample Space & Events
- Sample space
  - All possible outcomes of an experiment
  - E.g., flipping a coin: {H, T}
  - E.g., rolling a die: {1, 2, 3, 4, 5, 6}
- Event
  - Any subset of the sample space
  - E.g., {H}, {T}, {H, T}, {1}, or {2, 4, 6}

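
Sample spaces and events map naturally onto sets. The sketch below models the die example and, as an added assumption not stated on the slide, treats all outcomes as equally likely so that an event's probability is its size divided by the size of the sample space. The function name `probability` is illustrative.

```python
from fractions import Fraction

# Sample space for rolling one six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# An event is any subset of the sample space, e.g. "roll an even number".
even = {2, 4, 6}

def probability(event, space):
    """Probability of an event, ASSUMING all outcomes are equally likely."""
    if not event <= space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(space))

p_even = probability(even, sample_space)
print(p_even)  # 1/2
```
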
Random Variable
- Variable whose value is subject to variations due to chance
- Discrete random variables
  - E.g., outcome of a coin toss or a die roll
- Continuous random variables
  - E.g., stock value, voltage of a sensor

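Measures of Location
- Mean
  - Population mean: \( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \)
  - Sample mean: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
- Median
  - Middle value of the data when sorted from least to greatest
- Mode
  - Observation that occurs most often
- Midrange
  - Average of the greatest & least values = (max + min)/2
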
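
The four measures of location can be computed with Python's standard library, as in this minimal sketch. The data values are made up for illustration; the midrange is computed as the average of the greatest and least values.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent observation
midrange = (max(data) + min(data)) / 2  # average of greatest & least values

print(mean, median, mode, midrange)  # 6 5 4 6.5
```
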
Probability Distribution/Mass Function
Measures of Dispersion
- Dispersion
  - Refers to the degree of variation in data
- Range
  - Difference between max & min values
- Interquartile Range (IQR)
  - Difference between the 3rd & 1st quartiles
- Variance
  - Average of squared deviations from the mean
- Standard Deviation (STD)
  - Square root of the variance

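
A short sketch of these dispersion measures using the standard library follows. The data are hypothetical. Two details are assumptions rather than anything stated on the slide: quartiles are taken with the default method of `statistics.quantiles` (other quartile conventions give slightly different IQRs), and both the population form (divide by N) and the sample form (divide by n - 1) of the variance are shown.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

data_range = max(data) - min(data)

# Quartiles with the default ("exclusive") method; conventions differ.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by n - 1
population_std = statistics.pstdev(data)          # sqrt of population variance
sample_std = statistics.stdev(data)               # sqrt of sample variance

print(data_range, iqr, population_variance, sample_variance)
```
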
Measures of Dispersion (Cont.)
- z-score
  - The standard score is the number of STDs an observation is above/below the mean
- For many data sets encountered in practice (roughly bell-shaped):
  - ~68% of observations fall within 1 STD of the mean
  - ~95% fall within 2 STDs
  - ~99.7% fall within 3 STDs

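
The sketch below computes z-scores as (observation - mean) / STD for some made-up data, then checks where the 68/95/99.7 figures come from. The second part assumes an exactly normal distribution, which real data only approximate; the use of the population STD is also a choice made for the example.

```python
import statistics
from statistics import NormalDist

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations
mean = statistics.fmean(data)
std = statistics.pstdev(data)  # population STD; use stdev() for a sample

# z-score: how many STDs each observation lies above (+) or below (-) the mean.
z_scores = [(x - mean) / std for x in data]

# For an exactly normal distribution, the share of values within k STDs.
standard_normal = NormalDist(mu=0, sigma=1)
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

print([round(z, 2) for z in z_scores])
print({k: round(p, 4) for k, p in within.items()})
```
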
Measures of Dispersion (Cont.)
- Coefficient of Variation (CV)
  - A relative measure of dispersion: CV = STD / mean
  - Return to risk = 1/CV

Exercise
- Mean & STD of closing stock prices:
  - Intel (INTC): Mean = $18.81, STD = $0.50
  - General Electric (GE): Mean = $16.19, STD = $0.35
- Which stock has the higher risk of investment?

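
One way to approach the exercise is to compare the stocks by coefficient of variation, assuming the usual definition CV = STD / mean and reading a larger CV as higher relative risk. The sketch uses only the figures given in the exercise; the variable names are illustrative.

```python
# (mean, STD) of closing prices in dollars, as given in the exercise.
stocks = {
    "INTC": (18.81, 0.50),
    "GE": (16.19, 0.35),
}

# Coefficient of variation: dispersion relative to the mean.
cv = {name: std / mean for name, (mean, std) in stocks.items()}

# Return to risk is the reciprocal of the CV.
return_to_risk = {name: 1 / value for name, value in cv.items()}

# Under this measure, the stock with the larger CV carries more relative risk.
riskier = max(cv, key=cv.get)

for name in stocks:
    print(f"{name}: CV = {cv[name]:.4f}, return to risk = {return_to_risk[name]:.1f}")
print("Higher relative risk:", riskier)
```

By this measure Intel comes out slightly riskier (CV of about 0.027 versus about 0.022 for GE).
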
Measures of Dispersion (Cont.)
- Percentiles
  - The value below which a given percentage of the observations in a group fall
Source: www.mathsisfun.com/data/percentiles.html

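
Several conventions exist for computing percentiles. The sketch below uses the nearest-rank method, chosen here only because it is simple; it is not necessarily the convention the slide's source uses. The function name and data are hypothetical.

```python
import math

def percentile_nearest_rank(data, p):
    """Smallest observation with at least p% of the data at or below it."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted data
    return ordered[rank - 1]

scores = [15, 20, 35, 40, 50]  # hypothetical observations

p30 = percentile_nearest_rank(scores, 30)
p50 = percentile_nearest_rank(scores, 50)
p100 = percentile_nearest_rank(scores, 100)
print(p30, p50, p100)  # 20 35 50
```
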
Measures of Shape
- Skewness
  - Describes lack of symmetry
- Coefficient of Skewness (CS)
  - CS < 0 for left-skewed data
  - CS > 0 for right-skewed data
  - |CS| > 1 suggests a high degree of skewness
  - 0.5 ≤ |CS| ≤ 1 suggests moderate skewness
  - |CS| < 0.5 suggests relative symmetry

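
The slide's formula for CS did not survive, so the sketch below assumes the common moment-based population form: the average cubed deviation from the mean divided by the cube of the STD. The function names and data sets are hypothetical; `describe_skewness` simply applies the thresholds listed above.

```python
import statistics

def coefficient_of_skewness(data):
    """Moment-based CS (assumed form): mean cubed deviation / STD**3."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    if std == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum((x - mean) ** 3 for x in data) / len(data) / std ** 3

def describe_skewness(cs):
    """Apply the rule-of-thumb thresholds for |CS|."""
    if abs(cs) > 1:
        return "high"
    if abs(cs) >= 0.5:
        return "moderate"
    return "relatively symmetric"

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

cs_symmetric = coefficient_of_skewness(symmetric)
cs_right = coefficient_of_skewness(right_skewed)
print(cs_symmetric, describe_skewness(cs_symmetric))
print(round(cs_right, 2), describe_skewness(cs_right))
```
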
Measures of Shape (Cont.)
- Kurtosis
  - Refers to peakedness or flatness
- Coefficient of Kurtosis (CK)
  - CK < 3 indicates the data is somewhat flat, with a wide degree of dispersion
  - CK > 3 indicates the data is somewhat peaked, with less dispersion

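
As with skewness, the slide's formula for CK is missing, so this sketch assumes the common moment-based population form: the average fourth-power deviation from the mean divided by the squared variance. Under that form a normal distribution has CK = 3, which is consistent with the threshold above. The function name and data sets are hypothetical.

```python
import statistics

def coefficient_of_kurtosis(data):
    """Moment-based CK (assumed form): mean 4th-power deviation / variance**2."""
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_moment = sum((x - mean) ** 4 for x in data) / len(data)
    return fourth_moment / variance ** 2

flat = [1, 2, 3, 4, 5]       # evenly spread values
peaked = [0] * 8 + [-5, 5]   # most values at the center, two far out

ck_flat = coefficient_of_kurtosis(flat)
ck_peaked = coefficient_of_kurtosis(peaked)
print(round(ck_flat, 2), round(ck_peaked, 2))  # 1.7 5.0
```
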
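Measures of Association
- Covariance
  - Measure of linear association between 2 variables, X & Y
  - Population: \( \mathrm{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y) \)
  - Sample: \( \mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \)
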
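
A minimal sketch of the population and sample covariance, using the standard definitions (average product of deviations, divided by N for a population or n - 1 for a sample). The function name, its `sample` flag, and the data are hypothetical.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired observations; divides by n - 1 if sample, else N."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

x = [1, 2, 3, 4, 5]  # hypothetical paired observations
y = [2, 4, 5, 4, 5]

population_cov = covariance(x, y, sample=False)
sample_cov = covariance(x, y, sample=True)
print(population_cov, sample_cov)  # 1.2 1.5
```

A positive value indicates that X and Y tend to move together; the magnitude depends on the units of both variables.
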
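Measures of Association
- Correlation
  - Measure of linear association between 2 variables, X & Y
- Correlation coefficient
  - Doesn’t depend upon units of measurement (unlike covariance)
  - Population: \( \rho_{xy} = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} \), using the population covariance & STDs
  - Sample: \( r_{xy} = \frac{\mathrm{cov}(X, Y)}{s_x s_y} \), using the sample covariance & STDs
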
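
The sketch below computes the sample correlation coefficient as covariance divided by the product of the two STDs, and shows that changing units (hours to minutes) rescales the covariance but leaves the correlation unchanged. The data and function names are hypothetical.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance of paired observations (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """r = cov(x, y) / (s_x * s_y), where s**2 is the sample variance."""
    var_x = sample_covariance(xs, xs)
    var_y = sample_covariance(ys, ys)
    return sample_covariance(xs, ys) / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]         # hypothetical study time in hours
scores = [52, 55, 61, 70, 72]   # hypothetical test scores
minutes = [60 * h for h in hours]

print(sample_covariance(hours, scores), sample_covariance(minutes, scores))
r_hours = sample_correlation(hours, scores)
r_minutes = sample_correlation(minutes, scores)
print(round(r_hours, 4), round(r_minutes, 4))
```
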
Measures of Association
Outliers
- Mean & range are sensitive to outliers
- No standard definition of what constitutes an outlier
- Possible methods to identify outliers are:
  - z-scores greater than +3 or less than -3
  - Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
  - Mild outliers are between 1.5*IQR & 3*IQR to the left of Q1 or right of Q3

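
The IQR-based rule can be sketched as follows. The data and the function name are hypothetical, and quartiles are computed with the default method of `statistics.quantiles`, which is an assumption: a different quartile convention can move the fences slightly. The z-score rule is not shown here.

```python
import statistics

def classify_outliers(data):
    """Return (mild, extreme) outliers using the 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

    mild, extreme = [], []
    for x in data:
        if x < extreme_low or x > extreme_high:
            extreme.append(x)   # beyond 3*IQR from Q1 or Q3
        elif x < mild_low or x > mild_high:
            mild.append(x)      # between 1.5*IQR and 3*IQR from Q1 or Q3
    return mild, extreme

data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 25, 40]  # hypothetical
mild, extreme = classify_outliers(data)
print(mild, extreme)  # [25] [40]
```
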