Descriptive Analysis
Descriptive analysis is a type of data analysis that describes, displays, or summarizes data points so that patterns in the data become visible. It identifies patterns and relationships using both recent and historical data.
Descriptive Analytics
Descriptive analytics is the examination of data or content, usually performed manually, to answer the question “What happened?” (or “What is happening?”). It is characterized by traditional business intelligence (BI) and visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives.
Data Visualization
Data visualization is the representation of data through common graphics such as charts, plots, infographics, and even animations.
Tools for Data Visualization
Pie charts
Bar charts
Histograms
Gantt charts
Heat maps
Box-and-whisker plots
Waterfall charts
Area charts
Scatter plots
Infographics
Maps
DATA VISUALIZATION TOOLS FOR BUSINESS
Microsoft Excel
Google Charts
Tableau can integrate with hundreds of sources to import data and output dozens of visualization types, from charts to maps and more. Owned by Salesforce, Tableau boasts millions of users and community members, and it is widely used at the enterprise level.
Datawrapper is a tool that, like Google Charts, is used to generate charts, maps, and other graphics for use online.
Infogram is another popular option that can be used to generate charts, reports, and maps.
Data Queries
A query is a specific request for information from a database. In robust database systems in particular, queries make it easier to spot high-level trends or to edit large quantities of data.
Sorting and filtering are essential tasks for managing and manipulating large data sets. Sorting organizes data in a specific order, such as alphabetically or numerically, while filtering extracts specific information based on certain criteria.
In SQL, sorting is done with the ORDER BY clause, which specifies the column or columns to sort by. Filtering is done with the WHERE clause, which specifies the conditions a row must meet to be included in the result set.
Sorting and filtering can be combined to build powerful queries that extract exactly the information you need. More advanced techniques such as multiple criteria, wildcards, and subqueries can refine the results further.
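The clauses described above can be tried out with a small sketch using Python's built-in sqlite3 module. The table name, column names, and values are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Widget", 120.0),
        ("South", "Gadget", 75.5),
        ("North", "Gadget", 200.0),
        ("East", "Widget", 50.0),
    ],
)

# Filter with WHERE, then sort the surviving rows with ORDER BY.
rows = conn.execute(
    "SELECT region, product, amount FROM sales "
    "WHERE amount > ? ORDER BY amount DESC",
    (60,),
).fetchall()

# Wildcard filter: LIKE 'W%' keeps products whose name starts with W.
widgets = conn.execute(
    "SELECT region FROM sales WHERE product LIKE 'W%' ORDER BY region"
).fetchall()

# Subquery: keep rows whose amount exceeds the table-wide average.
above_avg = conn.execute(
    "SELECT region, amount FROM sales "
    "WHERE amount > (SELECT AVG(amount) FROM sales) ORDER BY amount"
).fetchall()
conn.close()

print(rows)
print(widgets)
print(above_avg)
```

Passing the threshold as a `?` parameter rather than pasting it into the SQL string keeps the query safe when the value comes from user input.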
Probability can do more than give the likelihood of a single event; it can summarize the likelihood of all possible outcomes. A quantity of interest in probability is called a random variable, and the relationship between each possible outcome of a random variable and its probability is called a probability distribution.
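As a small illustration, the sketch below builds the probability distribution of one random variable. The choice of variable (the sum of two fair six-sided dice) is an assumption made for the example, not something taken from the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Random variable X: the sum of two fair six-sided dice (illustrative choice).
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)

# Probability distribution: each possible value of X mapped to its probability.
distribution = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)

# The probabilities of all possible outcomes add up to 1.
total = sum(distribution.values())
```

Exact fractions are used instead of floats so that the probabilities sum to exactly 1.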
When you conduct research about a group of people, it is rarely possible to collect data from every person in that group. Instead, you select a sample. The sample is the group of individuals who will actually participate in the research. To draw valid conclusions from your results, you have to decide carefully how you will select a sample that is representative of the group as a whole. This is called a sampling method.
There are two primary types of sampling methods that you can use in your research:
Probability sampling involves random selection, allowing you to make strong statistical inferences about the whole group.
Non-probability sampling involves non-random selection based on convenience or other criteria, allowing you to easily collect data.
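The difference between the two types can be sketched in a few lines of Python. The population of 100 participant IDs is hypothetical; simple random sampling stands in for probability sampling and "take the first ten" stands in for convenience sampling.

```python
import random

# Hypothetical population of 100 participant IDs.
population = list(range(1, 101))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Probability sampling: a simple random sample, every member equally likely.
random_sample = rng.sample(population, k=10)

# Non-probability sampling: a convenience sample of whoever is first at hand.
convenience_sample = population[:10]

print(sorted(random_sample))
print(convenience_sample)
```

The convenience sample only ever contains IDs 1 to 10, so any trait that correlates with ID order would bias the results, while the random sample can draw from the whole population.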
Measures of Location
A fundamental task in many statistical analyses is to estimate a location parameter for the distribution; i.e., to find a typical or central value that best describes the data.
Definition of Location: The first step is to define what we mean by a typical value. For univariate data, there are three common definitions:
Mean - the mean is the sum of the data points divided by the number of data points. That is,
$$\bar{Y} = \sum_{i=1}^{N} Y_i / N$$
The mean is the value most commonly referred to as the average. We will use the term average as a synonym for the mean and the term typical value to refer generically to measures of location.
Median - the median is the value of the point which has half the data smaller than that point and half the data larger than that point. That is, if $Y_1, Y_2, \ldots, Y_N$ is a random sample sorted from smallest value to largest value, then the median is defined as:
$$\tilde{Y} = Y_{(N+1)/2} \quad \text{if } N \text{ is odd}$$
$$\tilde{Y} = \left(Y_{N/2} + Y_{(N/2)+1}\right)/2 \quad \text{if } N \text{ is even}$$
Mode - the mode is the value of the random sample that occurs with the greatest frequency. It is not necessarily unique. The mode is typically used in a qualitative fashion. For example, there may be a single dominant hump in the data, or perhaps two or more smaller humps. This is usually evident from a histogram of the data.
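The three definitions above can be checked with Python's standard statistics module. The data values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 8]  # illustrative values, N = 6 (even)

# Mean: sum of the data points divided by N, here 28 / 6.
mean = statistics.mean(data)

# Median for even N: average of the two middle sorted values, (3 + 5) / 2.
median = statistics.median(data)

# Median for odd N: the single middle value of the sorted sample.
median_odd = statistics.median([2, 3, 5])

# multimode returns every most-frequent value, since the mode need not be unique.
modes = statistics.multimode(data)

print(mean, median, median_odd, modes)
```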
Measures of Dispersion
Measures of dispersion are non-negative real numbers that gauge the spread of data about a central value. They indicate how stretched or squeezed the data is, that is, how homogeneous or heterogeneous it is. A measure of dispersion is 0 when all data points in a data set are the same, and its value increases as the variability of the data increases.
Range: Given a data set, the range is the difference between the maximum value and the minimum value.
Variance: The average squared deviation from the mean of the data set is known as the variance. This measure of dispersion describes the spread of the data about the mean.
Standard Deviation: The square root of the variance gives the standard deviation. Thus, the standard deviation also measures the variation of the data about the mean.
Mean Deviation: The mean deviation is the average of the absolute deviations of the data from a central point. This central point can be the mean, median, or mode.
Quartile Deviation: The quartile deviation is half of the difference between the third quartile and the first quartile of a data set.
Relative Measures of Dispersion
If separate data sets have different units and need to be compared, relative measures of dispersion are used. These measures are expressed as ratios or percentages, which makes them unitless. Some of the relative measures of dispersion are given below:
Coefficient of Range: The ratio of the difference between the highest and lowest values in a data set to the sum of the highest and lowest values.
Coefficient of Variation: The ratio of the standard deviation to the mean of the data set, expressed as a percentage.
Coefficient of Mean Deviation: The ratio of the mean deviation to the value of the central point from which it is calculated.
Coefficient of Quartile Deviation: The ratio of the difference between the third quartile and the first quartile to the sum of the third and first quartiles.
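A minimal sketch of these measures with Python's statistics module follows. The data values are illustrative. Two assumptions to note: the variance uses the "average squared deviation" definition given above (the population variance, dividing by N), and the quartiles use the default method of `statistics.quantiles`; other quartile conventions can give slightly different values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

mean = statistics.mean(data)
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean (divides by N).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Mean deviation, taken here about the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

# Quartiles with the module's default ('exclusive') method; an assumption.
q1, _, q3 = statistics.quantiles(data, n=4)
quartile_deviation = (q3 - q1) / 2

# Relative (unitless) measures.
coeff_range = (max(data) - min(data)) / (max(data) + min(data))
coeff_variation = std_dev / mean * 100  # as a percentage
coeff_mean_deviation = mean_deviation / mean
coeff_quartile_deviation = (q3 - q1) / (q3 + q1)

print(data_range, variance, std_dev, mean_deviation, quartile_deviation)
print(coeff_range, coeff_variation, coeff_mean_deviation, coeff_quartile_deviation)
```

If the data is a sample rather than a whole population, `statistics.variance` and `statistics.stdev` (which divide by N - 1) are the usual choice.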
Hypothesis Testing
Hypothesis testing is a statistical tool used to determine whether the results of an experiment are meaningful. It involves setting up a null hypothesis and an alternative hypothesis. These two hypotheses are always mutually exclusive: if the null hypothesis is true, the alternative hypothesis is false, and vice versa. An example of hypothesis testing is a test to check whether a new medicine treats a disease more effectively.
Null Hypothesis
The null hypothesis is a concise mathematical statement that there is no difference between two possibilities; in other words, there is no difference between certain characteristics of the data. This hypothesis assumes that the outcomes of an experiment are due to chance alone. It is denoted H0.
Alternative Hypothesis
The alternative hypothesis is the alternative to the null hypothesis. It states that the observations of an experiment are due to some real effect, that is, that there is a statistically significant difference between two possible outcomes. It is denoted H1.
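One way to make "due to chance alone" concrete is a permutation test, sketched below for the medicine example. The permutation test is a technique chosen for this illustration, not one named in these notes, and the recovery times are invented. Under H0 the group labels carry no information, so shuffling them shows how often chance alone produces a difference as large as the observed one.

```python
import random
import statistics

# Hypothetical recovery times in days; the numbers are made up.
control = [12, 14, 13, 15, 14, 13]
treated = [9, 10, 8, 11, 10, 9]

# H0: the medicine makes no difference, so group labels are interchangeable.
# H1: treated patients recover faster (smaller mean recovery time).
observed = statistics.mean(control) - statistics.mean(treated)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pooled = control + treated
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    if diff >= observed:
        extreme += 1

# Share of label shufflings at least as extreme as the observed difference.
p_value = extreme / n_perm
reject_h0 = p_value < 0.05  # 0.05 is a conventional significance level

print(observed, p_value, reject_h0)
```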
Chi-Square Test
The chi-square test is a hypothesis testing method used to check whether the variables in a population are independent. It is used when the test statistic is chi-squared distributed. The chi-square statistic is denoted by $\chi^2$, and its formula is:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed (actual) value and $E_i$ is the expected value.
One Tailed Hypothesis Testing
One tailed hypothesis testing is done when the rejection region lies in only one direction. It is also known as directional hypothesis testing because the effect is tested in one direction only. This type of testing is further classified into the right tailed test and the left tailed test.
Right Tailed Hypothesis Testing
The right tailed test is also known as the upper tail test. It is used to check whether the population parameter is greater than some value. The null and alternative hypotheses for this test are:
H0: The population parameter is ≤ some value
H1: The population parameter is > some value
The null hypothesis is rejected if the test statistic is greater than the critical value.
Left Tailed Hypothesis Testing
The left tailed test is also known as the lower tail test. It is used to check whether the population parameter is less than some value. The hypotheses for this test are:
H0: The population parameter is ≥ some value
H1: The population parameter is < some value
The null hypothesis is rejected if the test statistic is less than the critical value.
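The chi-square formula can be worked through by hand for a small contingency table, as sketched below. The observed counts are hypothetical. The expected counts assume independence (row total times column total divided by the grand total), and the comparison against the critical value is a right tailed test.

```python
# Hypothetical 2x2 contingency table of observed counts O_i.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi^2 = sum over all cells of (O_i - E_i)^2 / E_i
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1), which is 1 here.
# 3.841 is the tabulated 5% critical value of chi-square with 1 degree of freedom.
critical_value = 3.841
reject_h0 = chi_square > critical_value  # right tailed rejection rule

print(expected, chi_square, reject_h0)
```

Here the statistic is well above the critical value, so the hypothesis that the two variables are independent would be rejected at the 5% level.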
Analysis of Variance (ANOVA)
In a test of hypothesis, the hypothesis is based on available information and the investigator's belief about the population parameters. The specific test considered here is called analysis of variance (ANOVA); it is a test of hypothesis appropriate for comparing the means of a continuous variable in two or more independent comparison groups.
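A minimal sketch of a one-way ANOVA computed by hand is given below, assuming the standard F statistic (between-group mean square divided by within-group mean square). The notes do not spell out this formula, and the group data are invented for illustration.

```python
import statistics

# Hypothetical measurements of one continuous variable in three independent groups.
groups = [
    [4, 5, 6],
    [6, 7, 8],
    [9, 10, 11],
]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total number of observations
grand_mean = statistics.mean([x for g in groups for x in g])
group_means = [statistics.mean(g) for g in groups]

# Between-group sum of squares: how far each group mean lies from the grand mean.
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

# Within-group sum of squares: spread of the observations around their own group mean.
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print(ss_between, ss_within, f_stat)
```

H0 here is that all group means are equal. A large F value means the group means differ by more than the within-group scatter would explain; the decision is made by comparing F with the critical value of the F distribution with (k - 1, N - k) degrees of freedom.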