Data Analytics - Content
DESCRIPTIVE ANALYTICS: Introduction to Data Analytics, Descriptive Statistics, Probability Distributions. Inferential Statistics: Point and Interval Estimations, Hypothesis Testing.
PREDICTIVE ANALYTICS: ANOVA, Regression, Logistic Regression, Neural Networks, Forecasting.
PRESCRIPTIVE ANALYTICS (Optimization): Linear Programming, Decision Tree Analysis, Multi-Criteria Decision Making.
TEXTBOOKS
- Competing on Analytics: The New Science of Winning by Jeanne G. Harris and Thomas H. Davenport
- Probability and Statistics in Engineering by William W. Hines and Douglas C. Montgomery
- Applied Multivariate Statistical Analysis by Richard A. Johnson and Dean W. Wichern
- Forecasting: Methods and Applications by Steven C. Wheelwright and Spyros G. Makridakis
- Operations Research: An Introduction by Hamdy A. Taha
- Biostatistics: A Foundation for Analysis in the Health Sciences by W. Daniel
- An Introduction to Statistical Methods and Data Analysis by R. Ott
Reference Books
- Analytics at Work: Smarter Decisions, Better Results by Thomas H. Davenport
- Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport
- Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel
- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
Course Structure
- Introduction to Analytics
- Descriptive statistics & exploratory data analysis: visualization, summarizing data
- Looking at probability distributions and relationships
- Inferential statistics: the idea of populations and samples, and how we use them to make inferences; the single-sample case; two or more samples; ANOVA
Introduction to Data Analytics
What is science? Science is the systematic refinement of the process of finding rules from past data to predict the future, so that we can decide on actions that get people what they want. Finding these rules requires many tools, and the mathematics of statistics is one of them.
Data and statistics: Statistics is the science of learning from data. Data is a collection of facts, such as numbers, words, measurements, observations, or even just descriptions of things, which essentially represent some information. It helps to think of data as the 'values' of quantitative and qualitative variables. Data or measurements that have not been organised, summarised, or otherwise manipulated are called raw data.
Introduction to Data Analytics
What is data analytics? Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions from that information. Data analytics is used in many industries to allow companies and organizations to make better business decisions, and in the sciences to verify or disprove existing models or theories. In business, data analytics is the practice of using data to drive business strategy and performance. It includes a range of approaches and solutions, from looking backward to evaluate what happened in the past, to looking forward for scenario planning and predictive modelling.
Introduction to Data Analytics
Types of data analysis: The analysis of data is generally divided into exploratory data analysis (EDA), where new features in the data are discovered, and confirmatory data analysis (CDA), where existing hypotheses are tested (confirmed or refuted). Qualitative data analysis (QDA) is used in the social sciences to draw conclusions from non-numerical data such as words, photographs, or video. In information technology, the term has a special meaning in the context of IT audits: when the controls for an organization's information systems, operations, and processes are examined, data analysis is used to determine whether the systems in place effectively protect data, operate efficiently, and succeed in accomplishing the organization's overall goals.
Introduction to Data Analytics
Analytics: By analytics, we mean the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions. The analytics may be input for human decisions or may drive fully automated decisions. Analytics is a subset of what has come to be called business intelligence (BI): a set of technologies and processes that use data to understand and analyze business performance. As Figure 1 suggests, business intelligence includes both data access and reporting, and analytics. Each of these approaches addresses a range of questions about an organization's business activities; the questions that analytics can answer represent the higher-value and more proactive end of this spectrum.
[Figure 1: Business intelligence as a spectrum of questions, from data access and reporting to analytics]
Introduction to Data Analytics
Analytics: In principle, analytics could be performed using paper, pencil, and perhaps a slide rule, but any sane person using analytics today would employ information technology. The range of analytical software runs from relatively simple statistical and optimization tools in spreadsheets (Excel being the primary example, of course), to statistical software packages (e.g., Minitab), to complex business intelligence suites (SAS, Cognos, Business Objects), predictive industry applications (Fair Isaac), and the reporting and analytical modules of major enterprise systems (SAP and Oracle). Good analytical capabilities also require good information management capabilities to integrate, extract, transform, and access business transaction data.
Introduction to Data Analytics
Data analytics and data mining: Data analytics is distinguished from data mining by the scope, purpose, and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based solely on what is already known by the researcher. Big data is any collection of data sets so large and complex that it is difficult to analyze using traditional data processing applications. Big data analytics is conceptually the same thing, except that the data is less well behaved and may require special attention and infrastructure to manage and turn into something useful.
Importance of Data Analytics
Information for information's sake isn't good if it doesn't provide actionable, forward-looking insights. Jeanne Harris, senior executive at Accenture Institute for High Performance, has stressed the significance of analytics professionals by saying, "...data is useless without the skill to analyze it." The speed of today's business has increased manifold in the last two decades, and the trend is accelerating. Prescriptive analytics helps finance leaders and their companies react quickly and shape desired outcomes proactively. Analytics matters because it supports making the right decisions for your business: the practice of analytics is all about supporting decision making by providing the relevant facts that will allow you to make a better decision. This is when data that is hardly appreciated, sitting on technology that is hardly seen, comes together with mathematics that humans can hardly compute and allows you to make decisions on a scale that can hardly be believed.
Importance of Data Analytics
Don't get me wrong: intuition and experience are invaluable. However, they should be taken together with facts, and analytics is all about facts. Proficiency in data analytics also pays off in customer relationship management (CRM) applications. Data analytics is a simple form of analysis that we often use in our daily lives, and the same seems true for organisations that use data analytics solutions when they encounter trouble of any kind. Business analytics enables business owners, strategic marketing professionals, and business managers to analyse and understand business opportunities; the analysis is also used for positioning products in the market. In fact, the importance of data analytics cannot be compared to that of other business tools: it belongs to the business intelligence family and is the only member that assists a business in converting heaps of gathered raw data into useful business information that can drive business decisions.
Soaring Demand for Analytics Professionals
There are more job opportunities in Big Data management and analytics than in previous years, and many IT professionals are prepared to invest time and money in training. The job trend graph for Big Data analytics shows a growing trend, and as a result there is a steady increase in the number of job opportunities. The current demand for qualified data professionals is just the beginning. Srikanth Velamakanni, the Bangalore-based co-founder and CEO of California-headquartered Fractal Analytics, states: "In the next few years, the size of the analytics market will evolve to at least one-third of the global IT market from the current one-tenth."
Soaring Demand for Analytics Professionals
In a study by QuinStreet Inc., it was found that the trend of implementing Big Data analytics is accelerating and is considered a high priority among U.S. businesses. A majority of organizations are in the process of implementing it or actively planning to add this capability within the next two years. The top five industries hiring big-data-related expertise are Professional, Scientific and Technical Services (30%), Information Technologies (19%), Manufacturing (18%), Finance and Insurance (10%), and Retail Trade (8%). The top three U.S. big data employment markets are San Jose-Sunnyvale-Santa Clara, CA; San Francisco-Oakland-Fremont, CA; and New York-Northern New Jersey-Long Island.
THE FUTURE OF DATA ANALYTICS: Making the Impossible Possible?
A global KPMG survey showed that organizations cannot yet reap all the benefits of data analytics due to data quality issues and a lack of capable resources. In 30 years' time, developments in data analytics itself could solve this issue, making many current professions in the sector obsolete. The impossible will become possible, and this may well lead to an autonomous decision-making process. Data analytics is expected to radically change the way we live and do business in the future. Already today we use analytics in our technology devices for many decisions in our lives: not only how to drive from A to B and avoid traffic jams, but also to identify waste in business processes with the help of Lean Six Sigma optimization projects. Although organizations are taking steps to turn data into insights, global surveys show that they are still struggling with data quality and with finding the right resources to turn these insights into true value and become more data-driven. Expectations are that data analytics will make the impossible possible, but we are still in the early stages of the data era. Basically, every company is currently investing in data analytics capabilities to keep up with known or unknown developments and competition.
THE FUTURE OF DATA ANALYTICS: Making the Impossible Possible?
The known data analytics development cycle is described in stages: from descriptive (what happened), to diagnostic (why did it happen), to discovery (what can we learn from it), to predictive (what is likely to happen), and, finally, to prescriptive analytics (what action is best to take). In general, organizations currently find themselves in the diagnostic and discovery stages. Another way of looking at this is that data analytics initially "supported" the decision-making process, but is now enabling "better" decisions than we can make on our own. What comes to mind here are cases where analytics is applied to combine multiple data sources, resulting in new and better insights, for example combining sales, location, and weather data to understand sales increases for certain stores and to improve the replenishment process. If it turns out in the future that a decision-making process based on data analytics produces better results, the step to "automated" decision-making will be small (e.g., artificial intelligence); examples are the Autopilot update in the Tesla Model S, or the Google car. We will probably also witness a change in job types, or even the complete disappearance of many current jobs. A study from Oxford University showed that about 47% of total US employment is at risk, and, as may be expected, the work of an accountant has a 94% chance of being computerized within 20 years.
THE FUTURE OF DATA ANALYTICS: Making the Impossible Possible?
All devices will be connected, will exchange data within the "Internet of Things," and will deliver enormous sets of data. Sensor data such as location, weather, health, error messages, and machine data will enable diagnostic and predictive analytics capabilities. We will be able to predict when machines will break down and plan maintenance repairs before it happens. Not only will this be cheaper, as you do not have to exchange supplies when it is not yet needed, but you can also increase uptime. Furthermore, it will become easier and more user-friendly to link all sorts of data from various sources with each other and get insights on a real-time basis. You "google" your analytical question and you get your answers; Siri or Cortana can already do this at a basic level, acting as your personal assistant. Just imagine that such a user interface could solve all your business questions; perhaps we will finally solve the business-IT alignment challenges, and may not even need Excel anymore in 30 years' time.
Descriptive statistics
Who is a data scientist? A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
What are data? Data is essentially a number (or text, or symbol) which represents some information. It is helpful to think of data as 'values' of quantitative and qualitative variables.
What is a population? The average person thinks of a population as a collection of entities, usually people. A population of entities is the largest collection of entities in which one has an interest at a particular time. Population data are the largest collection of values of the random variable in which one has an interest at a particular time.
What is a simple random sample? A sample is a part of a population. If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sample is called a simple random sample.
What is statistical inference? Statistical inference is the procedure by which one reaches a conclusion about a population on the basis of information contained in a sample that has been drawn from that population.
Data Types
What are the variable types?
Numerical or quantitative: based on measurement; some forms of arithmetic operations make sense on these variables.
- Continuous: any value within an interval is possible; nothing prohibits intermediate values.
- Discrete: can only take on a certain number of values.
Categorical or qualitative: always discrete; essentially represents some characteristic.
- Nominal: a nominal variable has two or more categories, but there is no intrinsic ordering to the categories. Ex. gender, blood group, race, marital status.
- Ordinal: an ordinal variable is similar to a nominal variable, except that there is a clear ordering of the categories. Even though we can order these from lowest to highest, the spacing between the values may not be the same across levels. Ex. terror alert colours: G-O-Y-R.
- Interval: an interval variable is similar to an ordinal variable, except that the intervals between the values are equally spaced. Ex. temperature (°C).
Data type Why does it matter whether a variable is categorical, ordinal or interval? Statistical computations and analyses assume that the variables have a specific level of measurement. For example, it would not make sense to compute an average hair colour. An average of a categorical variable does not make much sense because there is no intrinsic ordering of the levels of the categories. Moreover, if you tried to compute the average educational experience, you would also obtain a nonsensical result. Because the spacing between the four levels of educational experience is very uneven, the meaning of this average would be very questionable. In short, an average requires a variable to be interval. Sometimes you have variables that are "in-between" ordinal and interval, for example, a five-point Likert scale with values "strongly agree", "agree", "neutral", "disagree" and "strongly disagree". If we cannot be sure that the intervals between each of these five values are the same, then we would not be able to say that this is an interval variable, but we would say that it is an ordinal variable. However, in order to be able to use statistics that assume the variable is interval, we will assume that the intervals are equal.
Descriptive Statistics
When analysing data, such as the marks achieved by 100 students for a piece of coursework, it is possible to use both descriptive and inferential statistics in your analysis of their marks. Typically, in most research conducted on groups of people, you will use both descriptive and inferential statistics to analyse your results and draw conclusions. Descriptive statistics quantitatively describe the data: they are a way to say something meaningful about the data set, making statements based on the data, about the data, derived from the data. They do not allow us to draw conclusions beyond the data; you cannot use them to make generalizations about the population from which the data came.
Descriptive Statistics
Descriptive statistics is a means to describe, show, and summarize the basic features of a dataset found in a given study, presented in a summary that describes the data sample and its measurements. It helps analysts to understand the data better. There are four major types of descriptive statistics:
- Measures of frequency: count, percent, frequency
- Measures of central tendency: mean, median, and mode
- Measures of dispersion or variation: range, variance, standard deviation
- Measures of position: percentile ranks, quartile ranks
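As an illustration, here is a minimal Python sketch (using only the standard library and a made-up list of marks) that computes one example of each of the four types listed above:

```python
# A minimal sketch, with hypothetical data, of the four types of
# descriptive statistics: frequency, central tendency, dispersion, position.
import statistics
from collections import Counter

scores = [62, 75, 75, 68, 90, 75, 81, 68, 55, 90]  # hypothetical marks

# Measures of frequency: count, frequency, percent
freq = Counter(scores)
percent = {value: 100 * count / len(scores) for value, count in freq.items()}

# Measures of central tendency: mean, median, mode
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion: range, sample variance, sample standard deviation
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # divides by n - 1
std_dev = statistics.stdev(scores)

# Measures of position: quartiles (and hence percentile ranks)
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(mean, median, mode, value_range, variance, std_dev, q1, q3)
```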
Inferential Statistics
Properties of populations, such as the mean or standard deviation, are called parameters, as they represent the whole population (i.e., everybody you are interested in). Often, however, you do not have access to the whole population you are interested in investigating, but only to a limited number of data points instead, called a sample. Properties of samples, such as the mean or standard deviation, are NOT called parameters, but statistics. Inferential statistics are techniques that allow us to use these samples to make generalizations about the populations from which the samples were drawn. It is therefore important that the sample accurately represents the population; the process of achieving this is called sampling. Inferential statistics arise out of the fact that sampling naturally incurs sampling error, and thus a sample is not expected to perfectly represent the population. The methods of inferential statistics are (1) the estimation of parameters and (2) the testing of statistical hypotheses.
Descriptive Statistics
When we use descriptive statistics, it is useful to summarize our group of data using a combination of tabulated description (i.e., tables), graphical description (i.e., graphs and charts), and statistical commentary (i.e., a discussion of the results). Descriptive statistics quantitatively describe the data through:
- Visualization techniques: graphical representation, tabular representation
- Summary statistics
Descriptive Statistics
A graphical display is a useful visual device for communicating the information contained in a data set. It shows the variability of the data and the symmetry of the distribution.
Graphical Display: Histogram & Frequency Polygon
We may display a frequency distribution (or a relative frequency distribution) graphically in the form of a histogram, which is a special type of bar graph.
[Table: ordered array and frequency distribution of the ages of 189 subjects who participated in a study on smoking cessation]
Histogram & Frequency Polygon
A frequency distribution can also be portrayed graphically by means of a frequency polygon, which is a special kind of line graph. To draw a frequency polygon, we first place a dot above the midpoint of each class interval represented on the horizontal axis of the graph. The height of a given dot above the horizontal axis corresponds to the frequency of the relevant class interval. Connecting the dots by straight lines produces the frequency polygon.
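The sketch below shows how a histogram with an overlaid frequency polygon can be drawn with matplotlib; since the 189-subject age data set is not reproduced here, it generates hypothetical ages instead:

```python
# A sketch of a histogram with an overlaid frequency polygon, using
# hypothetical age data in place of the 189-subject study data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ages = rng.normal(loc=45, scale=10, size=189).round()  # hypothetical ages

counts, edges, _ = plt.hist(ages, bins=10, edgecolor="black", alpha=0.5)

# Frequency polygon: a dot above each class midpoint at the class
# frequency, connected by straight lines.
midpoints = (edges[:-1] + edges[1:]) / 2
plt.plot(midpoints, counts, marker="o")

plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
```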
Stem-and-Leaf Display
A stem-and-leaf display is a graphical device for representing quantitative data sets; it bears a strong resemblance to a histogram and serves the same purpose. Like a histogram, a stem-and-leaf display provides information regarding the range of the data set, shows the location of the highest concentration of measurements, and reveals the presence or absence of symmetry. An advantage of the stem-and-leaf display over the histogram is that it preserves the information contained in the individual measurements; such information is lost when measurements are assigned to the class intervals of a histogram. To construct a stem-and-leaf display, we partition each measurement into two parts: the stem and the leaf. The stem consists of one or more of the initial digits of the measurement, and the leaf is composed of one or more of the remaining digits.
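A stem-and-leaf display is simple enough to build by hand; the following sketch (made-up two-digit measurements) uses the tens digit as the stem and the units digit as the leaf:

```python
# A minimal stem-and-leaf display for two-digit measurements: the tens
# digit is the stem, the units digit the leaf (hypothetical data).
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 72, 72, 75, 76, 81, 84, 88]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```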
Descriptive Statistics
Graphical representation of multiple variables:
- Scatter plots: two quantitative variables; a scatter plot does a good job of capturing the relationship between them.
- Boxplots: side-by-side boxplots compare one quantitative variable across the levels of one categorical variable.
Contingency Table
A contingency table (crosstab, or two-way frequency table) is a tabular mechanism with at least two rows and two columns, used in statistics to present categorical data in terms of frequency counts, i.e., to summarize the relationship between several categorical variables. The table displays sample values in relation to two different variables that may be dependent or contingent on one another, and it makes conditional probabilities easy to determine. Example: a study of speeding violations and drivers who use cell phones produced the following fictional data:

                        Speeding violation    No speeding violation    Total
                        in the last year      in the last year
Cell phone user                 25                    280               305
Not a cell phone user           45                    405               450
Total                           70                    685               755
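Using the fictional numbers in the table, a short sketch can recover the joint, marginal, and conditional probabilities, for example P(speeding | cell phone user) = 25/305:

```python
# Joint, marginal, and conditional probabilities from the
# speeding/cell-phone contingency table above.
table = {
    ("cell", "speeding"): 25, ("cell", "none"): 280,
    ("no_cell", "speeding"): 45, ("no_cell", "none"): 405,
}
total = sum(table.values())                    # 755

p_cell = (table[("cell", "speeding")] + table[("cell", "none")]) / total
p_speeding = (table[("cell", "speeding")] + table[("no_cell", "speeding")]) / total
p_joint = table[("cell", "speeding")] / total  # joint probability

# Conditional probability: P(speeding | cell user) = P(joint) / P(cell)
p_speeding_given_cell = p_joint / p_cell
print(round(p_speeding_given_cell, 4))         # 25/305, about 0.082
```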
Summary Statistics: Numbers that Describe Data
In many instances, we need to summarize the data by means of a single number called a descriptive measure. It may be computed from the data of a sample or the data of a population. A descriptive measure computed from the data of a sample is called a statistic, while one computed from the data of a population is called a parameter. Several types of descriptive measures can be computed from a set of data. The major summary statistics are measures of central tendency, measures of dispersion, and measures of the shape of the distribution (skewness and kurtosis).
Measures of Central Tendency
A measure of central tendency tells us what a fairly central or typical value is; it conveys information regarding the average value of a set of values. The three most commonly used measures of central tendency are the mean, the median, and the mode.
Population mean: \(\mu = \frac{1}{N}\sum_{i=1}^{N} x_i\); sample mean: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\).
Median: the central value of the ordered data. Mode: the most frequently occurring value.
When do we want to use the mean, the median, and the mode?
Choosing between mean and median: good outliers and bad outliers. An outlier is an observation whose value x either exceeds the value of the third quartile by a magnitude greater than 1.5(IQR) or is less than the value of the first quartile by a magnitude greater than 1.5(IQR). That is, an observation with x > Q3 + 1.5(IQR) or x < Q1 - 1.5(IQR) is called an outlier. Bad outliers are errors; they do not provide a realistic picture of the story. With good outliers, the story is in the outliers. The mode is more useful with nominal variables and with multimodal distributions.
Measures of Dispersion
How does the data deviate from the central value? Do the values deviate a lot from the centre? A measure of dispersion conveys information regarding the amount of variability present in a set of data. If all the values are the same, there is no dispersion.
- Range = Max - Min
- Interquartile range: IQR = Q3 - Q1
- Variance: population \(\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2\); sample \(s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2\)
- Standard deviation: \(\sigma\) and \(s\)
- Coefficient of variation: \(CV = \frac{s}{\bar{x}} \times 100\%\)
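For a small hypothetical sample, the following sketch computes each dispersion measure, making the n - 1 divisor of the sample variance explicit:

```python
# Dispersion measures for a hypothetical sample, showing the n - 1
# divisor used for the sample variance.
import math
import statistics

x = [12.0, 15.0, 9.0, 11.0, 14.0, 13.0]

value_range = max(x) - min(x)
q1, _, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1

mean = statistics.mean(x)
sample_var = sum((xi - mean) ** 2 for xi in x) / (len(x) - 1)
sample_sd = math.sqrt(sample_var)

# Coefficient of variation: the SD expressed relative to the mean (in %).
cv = 100 * sample_sd / mean
print(value_range, iqr, sample_var, sample_sd, cv)
```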
Skewness
Data distributions may be classified on the basis of whether they are symmetric or asymmetric. If a distribution is symmetric, the left half of its graph (histogram or frequency polygon) will be a mirror image of its right half. When the left half and the right half of the graph are not mirror images of each other, the distribution is asymmetric and is said to be skewed. If the graph of a distribution extends further to the right than to the left, that is, if it has a long tail to the right, we say that the distribution is skewed to the right, or positively skewed. If the graph extends further to the left than to the right, with a long tail to the left, the distribution is skewed to the left, or negatively skewed. A distribution will be skewed to the right (positively skewed) if its mean is greater than its mode, and skewed to the left (negatively skewed) if its mean is less than its mode.
Skewness
Skewness is the third standardized central moment, \(\text{Skewness} = E[(X-\mu)^3]/\sigma^3\). A value of skewness > 0 indicates positive skewness and a value of skewness < 0 indicates negative skewness.
Kurtosis
Kurtosis is a measure of the combined weight of the tails relative to the rest of the distribution; it is all about the tails of the distribution. A distribution that, in comparison with a normal distribution, has a smaller proportion of its observations in its tails exhibits a flattened appearance and is said to be platykurtic. Conversely, a distribution with an excessive proportion of observations in its tails exhibits a more peaked appearance and is said to be leptokurtic. A normal, or bell-shaped, distribution is said to be mesokurtic. Kurtosis is the fourth standardized central moment for the probability model, \(\text{Kurt} = E[(X-\mu)^4]/\sigma^4\); for a normal distribution it equals 3.
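The following sketch computes skewness and kurtosis directly from their moment definitions for a made-up, right-skewed sample (using the population standard deviation in the denominators):

```python
# Skewness (third standardized moment) and kurtosis (fourth standardized
# moment) computed from their definitions, for hypothetical data.
import math

x = [2, 3, 3, 4, 4, 4, 5, 5, 9, 14]  # hypothetical, right-skewed

n = len(x)
mean = sum(x) / n
sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)  # population SD

skewness = sum((xi - mean) ** 3 for xi in x) / (n * sd ** 3)
kurtosis = sum((xi - mean) ** 4 for xi in x) / (n * sd ** 4)

print(skewness)      # > 0: skewed to the right
print(kurtosis - 3)  # excess kurtosis relative to the normal (mesokurtic = 3)
```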
Probability Distributions
A probability distribution is a way of describing data; it is a fairly comprehensive description rather than a summary. A random variable is a variable whose value is subject to variation due to randomness. The mathematical function describing this randomness (the probabilities for the set of possible values a random variable can take) is called a probability distribution. Until fairly recently, statisticians and mathematicians thought of probability only as an objective phenomenon derived from objective processes. The concept of objective probability may be categorized into (1) classical, or a priori, probability and (2) the relative frequency, or a posteriori, concept of probability.
Different Concepts of Probability
Classical concept of probability: The classical treatment of probability dates back to the 17th century and the work of two mathematicians, Pascal and Fermat. Probabilities are calculated by processes of abstract reasoning: if an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait E, the probability of the occurrence of E is equal to m/N.
Relative frequency concept of probability: If some process is repeated a large number of times, n, and if some resulting event with the characteristic E occurs m times, the relative frequency of occurrence of E, m/n, will be approximately equal to the probability of E; this is only an estimate.
Subjective concept of probability: This view holds that probability measures the confidence that a particular individual has in the truth of a particular proposition.
Bayesian Methods of Probability
Bayesian methods are an example of subjective probability, since they take into consideration the degree of belief that one has in the chance that an event will occur, whereas probabilities based on classical or relative frequency concepts are designed to allow decisions to be made solely on the basis of collected data. Bayesian methods make use of what are known as prior probabilities and posterior probabilities. The prior probability of an event is a probability based on prior knowledge, prior experience, or results derived from prior data collection activity. The posterior probability of an event is a probability obtained by using new information to update or revise a prior probability. As more data are gathered, more is likely to be known about the "true" probability of the event under consideration.
Elementary Properties of Probability
Given some process (or experiment) with n mutually exclusive outcomes (called events) E1, E2, ..., En:
1. The probability of any event Ei is assigned a nonnegative number: \(P(E_i) \ge 0\).
2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1: \(\sum_{i=1}^{n} P(E_i) = 1\).
3. For any two mutually exclusive events Ei and Ej, the probability of the occurrence of either Ei or Ej is equal to the sum of their individual probabilities: \(P(E_i \cup E_j) = P(E_i) + P(E_j)\).
Conditional Probability and Joint Probability
Conditional probability: On occasion, the set of "all possible outcomes" may constitute a subset of the total group; in other words, the size of the group of interest may be reduced by conditions not applicable to the total group. When probabilities are calculated with a subset of the total group as the denominator, the result is a conditional probability. The conditional probability of A given B is equal to the probability of A ∩ B divided by the probability of B, provided the probability of B is not zero: \(P(A \mid B) = P(A \cap B)/P(B)\), \(P(B) \neq 0\).
Joint probability: Sometimes we want to find the probability that a subject picked at random from a group of subjects possesses two characteristics at the same time. Such a probability is referred to as a joint probability, \(P(A \cap B)\).
Marginal Probability: Given some variable that can be broken down into m categories designated by A1, A2, ..., Ai, ..., Am and another jointly occurring variable that is broken down into n categories designated by B1, B2, ..., Bj, ..., Bn, the marginal probability of Ai, P(Ai), is equal to the sum of the joint probabilities of Ai with all the categories of B. That is, \(P(A_i) = \sum_{j=1}^{n} P(A_i \cap B_j)\) for all values of j.
Multiplication Rule of Probability: A probability may be computed from other probabilities. For example, a joint probability may be computed as the product of an appropriate marginal probability and an appropriate conditional probability: \(P(A \cap B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)\). This relationship is known as the multiplication rule of probability.
Addition Rule of Probability: Given two events A and B, the probability that event A, or event B, or both occur is equal to the probability that event A occurs, plus the probability that event B occurs, minus the probability that the events occur simultaneously: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
Independent Events: The probability of event A is the same regardless of whether or not B occurs; in this situation, \(P(A \mid B) = P(A)\). In such cases we say that A and B are independent events. The multiplication rule for two independent events may then be written as \(P(A \cap B) = P(A)\,P(B)\).
Complementary Events: The probability of an event A is equal to 1 minus the probability of its complement, which is written \(\bar{A}\): \(P(\bar{A}) = 1 - P(A)\).
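These rules can be checked numerically against the speeding/cell-phone contingency table from earlier:

```python
# Checking the addition rule, an independence test, and the complement
# rule on the fictional speeding/cell-phone table.
import math

p_cell, p_speeding = 305 / 755, 70 / 755
p_joint = 25 / 755                       # P(cell user AND speeding)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = p_cell + p_speeding - p_joint

# Independence check: A and B are independent iff P(A and B) = P(A)P(B);
# here the check fails, so the events are not (approximately) independent.
independent = math.isclose(p_joint, p_cell * p_speeding, rel_tol=0.05)

# Complement: P(no speeding violation) = 1 - P(speeding violation)
p_no_speeding = 1 - p_speeding
print(round(p_union, 4), independent, round(p_no_speeding, 4))
```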
BAYES' THEOREM
A false positive results when a test indicates a positive status when the true status is negative. A false negative results when a test indicates a negative status when the true status is positive. The following questions must be answered in order to evaluate the usefulness of test results and symptom status in determining whether or not a subject has some disease:
1. Given that a subject has the disease, what is the probability of a positive test result (or the presence of a symptom)?
2. Given that a subject does not have the disease, what is the probability of a negative test result (or the absence of a symptom)?
3. Given a positive screening test (or the presence of a symptom), what is the probability that the subject has the disease?
4. Given a negative screening test result (or the absence of a symptom), what is the probability that the subject does not have the disease?
The sensitivity of a test (or symptom) is the probability of a positive test result (or presence of the symptom) given the presence of the disease: \(P(T \mid D)\). The specificity of a test (or symptom) is the probability of a negative test result (or absence of the symptom) given the absence of the disease: \(P(\bar{T} \mid \bar{D})\). The predictive value positive of a screening test (or symptom) is the probability that a subject has the disease given that the subject has a positive screening test result: \(P(D \mid T)\). The predictive value negative of a screening test (or symptom) is the probability that a subject does not have the disease, given that the subject has a negative screening test result (or does not have the symptom): \(P(\bar{D} \mid \bar{T})\).
BAYES' THEOREM
To obtain the predictive value estimates, we make use of Bayes' theorem. Event T is the result of a subject's being classified as positive on a screening test. A subject classified as positive may have the disease or may not have the disease. Therefore, the occurrence of T is the result of a subject having the disease and being positive, or not having the disease and being positive. These two events are mutually exclusive, and consequently, by the addition rule, \(P(T) = P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D})\). Bayes' theorem then gives
\[ P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid \bar{D})\,P(\bar{D})}. \]
The numerator is equal to the sensitivity times the rate (prevalence) of the disease, and the denominator is equal to the sensitivity times the rate of the disease, plus the term 1 minus the specificity times the term 1 minus the rate of the disease. Thus, we see that the predictive value positive can be calculated from knowledge of the sensitivity, the specificity, and the rate of the disease. This answers Question 3 (given a positive screening test, what is the probability that the subject has the disease?).
BAYES' THEOREM (continued)
To answer Question 4 (given a negative screening test result, what is the probability that the subject does not have the disease?) we follow a similar line of reasoning to arrive at the corresponding statement of Bayes' theorem:
\[ P(\bar{D} \mid \bar{T}) = \frac{P(\bar{T} \mid \bar{D})\,P(\bar{D})}{P(\bar{T} \mid \bar{D})\,P(\bar{D}) + P(\bar{T} \mid D)\,P(D)}. \]
This equation allows us to compute an estimate of the probability that a subject who is negative on the test (or has no symptom) does not have the disease, which is the predictive value negative of a screening test or symptom.
Example
A medical research team wished to evaluate a proposed screening test for Alzheimer's disease. The test was given to a random sample of 450 patients with Alzheimer's disease and an independent random sample of 500 patients without symptoms of the disease. The two samples were drawn from populations of subjects who were 65 years of age or older. The results are as follows: [table of test results by disease status not reproduced]. Using the results of the study, compute the predictive value positive of the test. Assume that 11.3 percent of the population aged 65 and over have Alzheimer's disease.
Solution
Using these data, we estimate the sensitivity of the test as the proportion of the 450 Alzheimer's patients who test positive, and the specificity as the proportion of the 500 disease-free subjects who test negative (the counts come from the results table above). We then use the results of the study to compute the predictive value positive of the test, that is, to estimate the probability that a subject who is positive on the test has Alzheimer's disease, by substituting the estimated sensitivity and specificity and the prevalence of 0.113 into Bayes' theorem. Similarly, for the predictive value negative of the test, we substitute the same quantities into the second form of Bayes' theorem given above.
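Since the cell counts of the results table are not reproduced above, the sketch below uses assumed illustrative values for the sensitivity and specificity (0.97 and 0.99, not the study's actual estimates) together with the stated prevalence of 0.113, to show how the two Bayes' theorem formulas are evaluated:

```python
# A sketch of the Bayes' theorem screening-test calculation. The
# sensitivity and specificity below are assumed illustrative values,
# not the elided numbers from the study table.
def predictive_values(sensitivity, specificity, prevalence):
    # PV+ = P(D|T) = sens*prev / (sens*prev + (1 - spec)*(1 - prev))
    pv_pos = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )
    # PV- = P(not D | not T) = spec*(1-prev) / (spec*(1-prev) + (1-sens)*prev)
    pv_neg = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    )
    return pv_pos, pv_neg

print(predictive_values(sensitivity=0.97, specificity=0.99, prevalence=0.113))
```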
Discrete Probability Distributions
The relationship between the values of a random variable and the probabilities of their occurrence may be summarized by means of a device called a probability distribution, which may be expressed in the form of a table, graph, or formula. The set of possible outcomes determines whether the random variable is discrete or continuous. The probability distribution of a discrete random variable is a table, graph, formula, or other device used to specify all possible values of the variable along with their respective probabilities. The cumulative probability for xi is written as \(F(x_i) = P(X \le x_i)\) and gives the probability that X is less than or equal to the specified value xi. The graph of a cumulative probability distribution is called an ogive.
Binomial Probability Distribution
The binomial distribution is derived from a process known as a Bernoulli trial, named in honour of the Swiss mathematician James Bernoulli. When a random process or experiment, called a trial, can result in only one of two mutually exclusive outcomes, such as dead or alive, sick or well, full-term or premature, the trial is called a Bernoulli trial. The probabilities of the two outcomes need not be equal.
Bernoulli process: A sequence of Bernoulli trials forms a Bernoulli process under the following conditions:
1. Each trial results in one of two possible, mutually exclusive, outcomes. One of the possible outcomes is denoted (arbitrarily) as a success, and the other as a failure.
2. The probability of success, denoted by p, remains constant from trial to trial. The probability of failure, 1 - p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.
Binomial Probability Distribution
The probability of obtaining exactly x successes in n trials is
\[ f(x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, 2, \ldots, n. \]
The parameters of the binomial distribution are n and p. It applies when sampling from an infinite population, or from a finite population with replacement, or when n is small relative to N (normally N ≥ 10n).
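A direct translation of the formula into Python (with a hypothetical n = 10 and p = 0.3) confirms that the probabilities sum to 1:

```python
# The binomial pmf f(x) = C(n, x) * p**x * q**(n - x), for a
# hypothetical example with n = 10 trials and success probability p = 0.3.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3
print(binom_pmf(3, n, p))                             # P(X = 3)
print(sum(binom_pmf(x, n, p) for x in range(n + 1)))  # sums to 1
```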
Poisson Probability Distribution
This discrete distribution is named for the French mathematician Simeon Denis Poisson, who is generally credited with publishing its derivation in 1837. If x is the number of occurrences of some random event in an interval of time or space (or some volume of matter), the probability that x will occur is given by
\[ f(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots \]
where \(\lambda\) is called the parameter of the distribution and is the average number of occurrences of the random event in the interval (or volume).
Poisson Probability Distribution What is a Poisson Process? A random process having the following properties is known as the Poisson process. 1. The occurrences of the events are independent. The occurrence of an event in an interval of space or time has no effect on the probability of a second occurrence of the event in the same, or any other, interval. 2. Theoretically, an infinite number of occurrences of the event must be possible in the interval. 3. The probability of the single occurrence of the event in a given interval is proportional to the length of the interval. 4. In any infinitesimally small portion of the interval, the probability of more than one occurrence of the event is negligible. An interesting feature of the Poisson distribution is the fact that the mean and variance are equal. It is employed as a model when counts are made of events or entities that are distributed at random in space or time.
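A short sketch of the Poisson pmf, with an assumed rate of λ = 4 events per interval:

```python
# The Poisson pmf f(x) = exp(-lam) * lam**x / x!, for a hypothetical
# rate of lam = 4 events per interval.
from math import exp, factorial

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 4
print(poisson_pmf(2, lam))                             # P(X = 2)
print(1 - sum(poisson_pmf(x, lam) for x in range(3)))  # P(X >= 3)
```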
Geometric Probability Distribution
The geometric distribution is a counterpart of the Bernoulli process. Consider a sequence of trials, where each trial has only two possible outcomes (designated failure and success) and the probability of success is assumed to be the same for each trial. In such a sequence, the geometric distribution is useful to model the number of failures before the first success, since the experiment can run for an indefinite number of trials until a success occurs, unlike the binomial distribution, which has a set number of trials. The distribution gives the probability that there are zero failures before the first success, one failure before the first success, two failures before the first success, and so on: for example, the number of times you need to toss a coin before getting the first (or next) head. Example: a patient is waiting for a suitably matching kidney donor for a transplant. If the probability that a randomly selected donor is a suitable match is p = 0.1, what is the expected number of donors who will be tested before a matching donor is found? In the alternative formulation, where X is the number of trials up to and including the first success, the expected value is E(X) = 1/p = 1/0.1 = 10.
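A quick simulation (hypothetical, with 100,000 runs) confirms the expected value of 10 donors:

```python
# Simulating the kidney-donor example: with match probability p = 0.1,
# the expected number of donors tested up to and including the first
# match is 1/p = 10.
import random

def trials_until_success(p, rng):
    count = 1
    while rng.random() >= p:  # keep testing until a match is found
        count += 1
    return count

rng = random.Random(0)
runs = [trials_until_success(0.1, rng) for _ in range(100_000)]
print(sum(runs) / len(runs))  # close to 1/0.1 = 10
```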
Uniform Probability Distribution (Discrete)
The discrete uniform distribution is a symmetric probability distribution wherein a finite number of values are equally likely to be observed; each of the n values has probability 1/n. Another way of saying "discrete uniform distribution" would be "a known, finite number of outcomes equally likely to happen"; it is also known as the "equally likely outcomes" distribution. A simple example of the discrete uniform distribution is throwing a fair die. The possible values are 1, 2, 3, 4, 5, 6, and each time the die is thrown the probability of a given score is 1/6.
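The fair-die example in code:

```python
# The fair-die example: each of the n = 6 outcomes has probability 1/6,
# so the pmf is flat and the probabilities sum to 1.
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / len(outcomes) for x in outcomes}
expected_value = sum(x * p for x, p in pmf.items())
print(pmf[3], sum(pmf.values()), expected_value)  # 0.1667, 1.0, 3.5
```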
Continuous Probability Distributions
Definition: A nonnegative function f(x) is called a probability distribution (sometimes called a probability density function) of the continuous random variable X if the total area bounded by its curve and the x-axis is equal to 1, and if the subarea under the curve bounded by the curve, the x-axis, and perpendiculars erected at any two points a and b gives the probability that X is between the points a and b. The total area under the curve is equal to one; the probability of any specific value of the random variable is zero; and the relative frequency of occurrence of values between any two points on the x-axis is equal to the area bounded by the curve, the x-axis, and perpendicular lines erected at those two points.
Normal Distribution
The formula for this distribution was first published by Abraham De Moivre. The distribution is frequently called the Gaussian distribution in recognition of the contributions of Carl Friedrich Gauss. It has two parameters: the mean, 𝝻, and the standard deviation, 𝞂.
Characteristics of the normal distribution:
1. It is symmetrical about its mean, 𝝻.
2. The mean, the median, and the mode are all equal.
3. The total area under the curve above the x-axis is one square unit.
The area between 𝝻 - 𝞂 and 𝝻 + 𝞂 is ≈ 68% of the total area; between 𝝻 - 2𝞂 and 𝝻 + 2𝞂, ≈ 95%; and between 𝝻 - 3𝞂 and 𝝻 + 3𝞂, ≈ 99.7%.
Normal Distribution
The normal distribution is completely determined by the parameters 𝝻 and 𝞂. Different values of 𝝻 shift the graph of the distribution along the x-axis, and different values of 𝞂 determine the degree of flatness or peakedness of the graph. Because of these characteristics, 𝝻 is often referred to as a location parameter and 𝞂 as a shape parameter.
Standard normal distribution: the most important member of the normal distribution family, sometimes called the unit normal distribution, with mean = 0 and SD = 1.
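The 68-95-99.7 areas quoted earlier can be verified from the standard normal CDF, written here with math.erf so no external packages are needed:

```python
# Verifying the 68-95-99.7 rule with the standard normal CDF,
# Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
from math import erf, sqrt

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    area = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"P(mu - {k}*sigma < X < mu + {k}*sigma) = {area:.4f}")
```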
Continuous Uniform Probability Distribution
The continuous uniform distribution, or rectangular distribution, is a family of symmetric probability distributions. The distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds, defined by the parameters a and b, which are the minimum and maximum values. The probability density function of the continuous uniform distribution is
\[ f(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise.} \end{cases} \]
The density is portrayed as a rectangle where b - a is the base and 1/(b - a) is the height. As the distance between a and b increases, the density at any particular value within the distribution boundaries decreases; since the probability density function integrates to 1, the height of the density decreases as the base length increases.
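A tiny sketch of the rectangle picture: the height is 1/(b - a), and height times base is always 1:

```python
# The continuous uniform density f(x) = 1/(b - a) on [a, b]: as the
# base b - a grows, the height shrinks so the total area stays 1.
def uniform_pdf(x, a, b):
    return 1 / (b - a) if a <= x <= b else 0.0

a, b = 2.0, 10.0
print(uniform_pdf(5.0, a, b))            # height = 1/8 = 0.125
print(uniform_pdf(5.0, a, b) * (b - a))  # area of the rectangle = 1.0
```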