DAM_3Unit.pptx

PraneshDeshmukh 21 views 42 slides Sep 14, 2024

DA&M Unit3
Descriptive Statistics
Counts & Specific Values
Measure of Central Tendency
Measure of Spread
Measure of Distribution Shape
Statistical Indices
Moments
Key functions
Measures of Complexity & Model Selection
Measures of location of dispersion

What is Descriptive Statistics? Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main features and characteristics without making any generalizations or inferences to a larger population. The primary goal of descriptive statistics is to provide a clear and concise summary of the data, enabling researchers or analysts to gain insights and understand patterns, trends, and distributions within the dataset.

What is Descriptive Statistics? This summary typically includes measures such as central tendency (e.g., mean, median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution (e.g., skewness, kurtosis). Descriptive statistics also involves a graphical representation of data through charts, graphs, and tables, which can further aid in visualizing and interpreting the information. Common graphical techniques include histograms, bar charts, pie charts, scatter plots, and box plots.

By employing descriptive statistics, researchers can effectively summarize and communicate the key characteristics of a dataset, facilitating a better understanding of the data and providing a foundation for further statistical analysis or decision-making processes.

Counts & Specific Values Let {xi} be a set of n data values, and let {X1, X2, …, Xn} be the set of these values arranged in ascending order. Then a series of very simple measures can be readily computed (some of which are available as SQL database language commands). Note that the data may be integer or real, and in some instances purely nominal: for example, the data might represent a finite set of classes, such as classified land use types in a remotely sensed image. As with statistical data generally, the initial analysis of datasets and computation of basic measures is a fundamental first step in the process of statistical analysis.

Counts & Specific Values In many instances data will have been loaded into tables, either within standard statistical software packages or in SQL-compatible databases. We have included details of the SQL commands and functions (where available as standard in most implementations) for the measures listed.

Count The number of data values in a set. Count({xi})=n. In SQL this is implemented as the aggregate function COUNT().

Top m, Bottom m The set of the largest (smallest) m values from an ordered set, {X1, X2, …, Xn}. Top m{xi}={Xn‑m+1, …, Xn‑1, Xn}. Bottom m{xi}={X1, X2, …, Xm‑1, Xm}. May be generated via the SQL TOP clause, which yields results as a number of values or as a percentage of the total records; this equates to numerical Top and Bottom if the selected data column is numeric and sorted. For the first and last records in a sorted list, the SQL FIRST() and LAST() functions can be used. Note that TOP, FIRST() and LAST() are not available in every SQL dialect; some databases use ORDER BY with LIMIT instead.
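As a minimal sketch of the Top m and Bottom m definitions, the following Python snippet sorts the values and slices the ordered list. The function names top_m and bottom_m and the sample data are illustrative assumptions, not part of the slides.

```python
def top_m(values, m):
    """Return the m largest values, {X(n-m+1), ..., Xn}, in ascending order."""
    ordered = sorted(values)
    return ordered[len(ordered) - m:] if m > 0 else []


def bottom_m(values, m):
    """Return the m smallest values, {X1, ..., Xm}, in ascending order."""
    ordered = sorted(values)
    return ordered[:m] if m > 0 else []


# Hypothetical sample data, chosen only for illustration.
data = [7, 3, 9, 1, 5, 8]
print(top_m(data, 2))     # [8, 9]
print(bottom_m(data, 2))  # [1, 3]
```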

Variety The number of distinct, i.e. different, data values in a set. Some software packages refer to the variety as diversity, which should not be confused with information theoretic and other diversity measures. In SQL this is implemented with the DISTINCT keyword, for example COUNT(DISTINCT column).

Majority The most common, i.e. most frequent, data values in a set. Similar to the mode, but often applied to subsets of the data, for example all data lying within a particular range of values or the local neighborhood of a sample point. For general datasets the term should only be applied to cases where a given class is 50% or more of the total. In SQL, this requires calculating the mode.

How to Find Most Frequent Value in a Column? To calculate the mode (most frequent value) in SQL, use the GROUP BY clause together with the COUNT function to count the occurrences of each unique value in a column. Suppose you have a table named sales with a column named product_category, and you want to find the mode (the most frequent product category).

Step 1: Count the occurrences of each unique value in the product_category column using the GROUP BY clause and the COUNT function, naming the count category_count. The result set has one row per product category together with its count.

Step 2: Find the mode by selecting the row(s) with the highest count. Use the ORDER BY clause to sort the result set by category_count in descending order, and the LIMIT clause to restrict the result to the top row(s). The result is the mode (the most frequent product category) along with its count.
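The two steps can be sketched with Python's built-in sqlite3 module. The table name sales and the columns product_category and category_count come from the text; the sample rows and the use of an in-memory SQLite database are assumptions made only so the queries can run.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product_category TEXT)"
)
sample_rows = [("Books",), ("Toys",), ("Books",), ("Garden",), ("Books",), ("Toys",)]
conn.executemany("INSERT INTO sales (product_category) VALUES (?)", sample_rows)

# Step 1: count the occurrences of each distinct category.
counts = conn.execute(
    "SELECT product_category, COUNT(*) AS category_count "
    "FROM sales "
    "GROUP BY product_category "
    "ORDER BY product_category"
).fetchall()
print(counts)  # [('Books', 3), ('Garden', 1), ('Toys', 2)]

# Step 2: sort by the count in descending order and keep the top row.
mode_row = conn.execute(
    "SELECT product_category, COUNT(*) AS category_count "
    "FROM sales "
    "GROUP BY product_category "
    "ORDER BY category_count DESC "
    "LIMIT 1"
).fetchone()
print(mode_row)  # ('Books', 3)
```

With LIMIT 1, only one row is returned even if several categories tie for the highest count, so a tie needs a larger limit or a comparison against the maximum count.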

Minority The least common i.e. least frequently occurring data values in a set. Often applied to subsets of the data, for example all data lying within a particular range of values or local neighborhood of a sample point

Maximum, Max The maximum value of a set of values. May not be unique. Max{xi}=Xn. In SQL this is implemented as the aggregate function MAX(). R: max(x)

Minimum, Min The minimum value of a set of values. May not be unique. Min{xi}=X1. In SQL this is implemented as the aggregate function MIN(). R: min(x)

Sum The sum of a set of data values. In SQL this is implemented as the aggregate function SUM(). R: sum(x)

Average The arithmetic mean of a set of numeric data values. In SQL this is implemented as the aggregate function AVG(). R: mean(x)
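The SQL aggregate functions listed above can be tried together in one query. This is a sketch using Python's sqlite3 module; the table name readings, the column name value and the sample numbers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
# Hypothetical data: six values, one of them repeated.
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(v,) for v in [4, 8, 8, 15, 16, 23]],
)

count, variety, minimum, maximum, total, average = conn.execute(
    "SELECT COUNT(value), "           # Count
    "       COUNT(DISTINCT value), "  # Variety
    "       MIN(value), "             # Minimum
    "       MAX(value), "             # Maximum
    "       SUM(value), "             # Sum
    "       AVG(value) "              # Average (arithmetic mean)
    "FROM readings"
).fetchone()

print(count, variety, minimum, maximum, total, average)
```

Here the count is 6 but the variety is 5, because the value 8 occurs twice.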

Measure of Central Tendency In statistics, a measure of central tendency is a descriptive summary of a dataset: a single value that reflects the centre of the data distribution. It summarizes the dataset as a whole and provides no information about individual data values. The central tendency of a dataset can be described using several statistical measures.

Definition The central tendency is a statistical measure that represents the entire distribution or dataset by a single value. It aims to provide an accurate description of all the data in the distribution.

Measures of Central Tendency

Mean The mean represents the average value of the dataset. It is calculated as the sum of all the values in the dataset divided by the number of values. Unless stated otherwise, "mean" refers to the arithmetic mean. Other types of mean include the geometric, harmonic and weighted means.

Group Activity 2: PPT, 14/03/2024, 10:15-11:15
https://byjus.com/maths/geometric-mean/ 26, 35, 32, 14
https://byjus.com/jee/harmonic-mean/ 9, 8, 24, 37
https://byjus.com/weighted-mean-formula/ 46, 48, 39, 12
Mean (power) 41, 33, 29, 31, 49
Trim mean/Olympic mean/truncated mean 59, 60, 50, 58
Winsorized mean 19, 22, 7, 47
Mean (circular) 2, 1, 3, 62
Mid-range 61, 63, 10, 15

If all the values in the dataset are the same, then the geometric, arithmetic and harmonic means are all equal. If there is variability in the data, the three means differ. The arithmetic mean is straightforward to calculate: mean = (x1 + x2 + … + xn) / n. Histograms of symmetric continuous data and of skewed continuous data show where the mean falls in each case.

In a symmetric data distribution, the mean is located at the centre. In a skewed continuous data distribution, the extreme values in the extended tail pull the mean away from the centre. The mean is therefore recommended mainly for symmetric distributions.
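Both observations above can be checked numerically. The sketch below uses Python's standard statistics module; the helper three_means and all sample datasets are illustrative assumptions.

```python
from statistics import geometric_mean, harmonic_mean, mean, median


def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return mean(values), geometric_mean(values), harmonic_mean(values)


# Identical values: the three means coincide (up to floating-point rounding).
constant = [4, 4, 4, 4]
print(three_means(constant))

# Variable values: arithmetic >= geometric >= harmonic.
varied = [2, 4, 8]
print(three_means(varied))

# Symmetric data: the mean sits at the centre, together with the median.
symmetric = [1, 2, 3, 4, 5]
print(mean(symmetric), median(symmetric))

# Skewed data: one extreme value in the tail pulls the mean away,
# while the median stays put.
skewed = [1, 2, 3, 4, 50]
print(mean(skewed), median(skewed))
```

For the skewed sample the mean is 12 while the median remains 3, which is why the median is usually preferred for skewed distributions.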

Median The median is the middle value of the dataset when the dataset is arranged in ascending or descending order. When the dataset contains an even number of values, the median is found by taking the mean of the middle two values.

Consider the following dataset with an odd number of observations, arranged in descending order: 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2. Here 12 is the middle or median number, with 6 values above it and 6 values below it.

Now consider another example with an even number of observations, arranged in descending order: 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17. The two middle values are 29 and 27. Their mean is (27+29)/2 = 28. Therefore, the median of the given data distribution is 28.
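The two worked examples can be reproduced in a few lines. This is a sketch: median_by_hand is a hypothetical helper that spells out the odd/even rule, and statistics.median from the Python standard library is used as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even_data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

print(median_by_hand(odd_data), median(odd_data))    # 12 12
print(median_by_hand(even_data), median(even_data))  # 28.0 28.0
```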

Mode The mode represents the most frequently occurring value in the dataset. Sometimes the dataset may contain multiple modes, and in some cases it does not contain any mode at all. Consider the dataset 5, 4, 2, 3, 2, 1, 5, 4, 5. Since the mode represents the most common value, the mode of this dataset is 5, which occurs three times.
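A short sketch of the mode calculation, using collections.Counter and statistics.multimode from the Python standard library. The first dataset is the one from the slide; the bimodal and no-repeat examples are added assumptions for illustration.

```python
from collections import Counter
from statistics import multimode

data = [5, 4, 2, 3, 2, 1, 5, 4, 5]

# Frequency table: value -> number of occurrences.
frequencies = Counter(data)
print(frequencies.most_common(1))  # [(5, 3)]

# multimode returns every value that shares the highest frequency.
print(multimode(data))          # [5]
print(multimode([1, 1, 2, 2]))  # [1, 2]  (two modes)
print(multimode([1, 2, 3]))     # [1, 2, 3]
```

In the last case every value occurs once, so all values tie; this is the situation the text describes as a dataset with no mode.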

The measure of central tendency is selected based on the properties of the data. If you have a symmetrical distribution of continuous data, all three measures of central tendency hold good, but most of the time the analyst uses the mean because it involves all the values in the distribution or dataset. If you have a skewed distribution, the best measure of central tendency is the median. If you have ordinal data, the median and the mode are both good choices. If you have categorical data, the mode is the best choice.

Measures of Central Tendency and Dispersion A measure of central tendency is a number used to represent the center or middle of a set of data values. The three commonly used measures of central tendency are the mean, median, and mode. A statistic that tells us how the data values are dispersed or spread out is called a measure of dispersion. A simple measure of dispersion is the range, which is the difference between the highest and lowest data values. Another measure of dispersion is the standard deviation, which represents the typical difference (or deviation) between a data value and the mean.
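The range and the standard deviation can be computed as follows. This is a sketch with hypothetical sample data; the text does not say whether the population or the sample standard deviation is meant, so both are shown (pstdev divides by n, stdev by n-1).

```python
from statistics import mean, pstdev, stdev

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # highest minus lowest value
population_sd = pstdev(data)        # square root of the mean squared deviation
sample_sd = stdev(data)             # same, but dividing by n - 1

print(mean(data), data_range, population_sd, sample_sd)
```

For this dataset the squared deviations from the mean sum to 32, so the population standard deviation is the square root of 32/8, i.e. 2.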

Measure of Spread The simplest measure of the spread of a distribution is the range, that is, the difference between the largest and smallest values recorded. However, this provides only very limited information regarding the pattern of spread, and several other measures are used in conjunction with, or in preference to, the range. Amongst these are the so-called five number summary values (the minimum and maximum values, the median or middle value, and the upper and lower quartile values) and the variance (the mean squared deviation of observations from the mean).

Measure of Spread If a sample dataset is arranged in size order, from smallest to largest, then the five number summary values are often computed and displayed graphically using so-called box plots (see further, below). Box plots (or box-whisker plots) are a form of exploratory data analysis (EDA) provided in many data analysis and graphing packages. Together with distribution plots and scatter plots they provide one of the three main ways in which statistical data are examined graphically. Because box plots are less familiar to many, and of particular use in examining outliers, they are described in some detail here (see figure, below).

Box plot The box plots in this diagram are for a set of radioactivity observations made at 1008 sites in Germany on one particular day in 2004, with some minor modifications for the purposes of this plotting exercise. The plot on the left (Column 1) is a summary representation of readings made at 200 of the sites. The plot on the right (Column 2) shows data from a further 808 locations and their readings. Side-by-side box plots provide a quick way of comparing the pattern of spread of two or more distributions.

A box plot consists of a number of distinct elements. The example in the diagram above was generated using the MATLAB Statistics Toolbox, and the definitions below apply to this particular implementation:
The lower and upper lines of the "box" in the center of the plot window are the 25th and 75th percentiles of the sample (the lower quartile and the upper quartile).
The distance between the top and bottom of the box is the inter-quartile range (IQR).
The line in the middle of the box is the sample median. If the median is not centered in the box, this is an indication of skewness.
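The numbers behind the box elements described above can be computed without drawing the plot. The sketch below uses statistics.quantiles from the Python standard library with method="inclusive"; this interpolation rule is an assumption, and other packages (including the implementation used for the figure) may place the quartiles slightly differently. The sample data are hypothetical.

```python
from statistics import quantiles

# Hypothetical sample, already in ascending order.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # height of the box

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
print(iqr)                  # 4.0
```

Here the median (5.0) lies exactly halfway between the quartiles, so the median line would be centered in the box, as expected for symmetric data.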

Range The range is simply the difference between the maximum and minimum values of a set. Thus Range{xi}=Xn‑X1. With a sample of size n, the mean range is simply Range/n. R: range(x) returns the minimum and maximum values, so the difference is diff(range(x))

Lower quartile (25%), LQ, Q1 In an ordered set, 25% of data items are less than or equal to the upper bound of this range. For a continuous distribution the LQ or Q1 is the set of values from 0% to 25% (0.25) obtained from the cumulative distribution of the values or function. The treatment of cases where n is even or odd, and where i runs from 1 to n or from 0 to n, varies between implementations. LQ={X1, …, X(n+1)/4}. If Q2 is the median of a set of data items, Q1 is the median of the values from the minimum up to and including Q2. The R function quantile(x) provides the minimum, maximum, median and lower and upper quartiles.

Upper quartile (75%), UQ, Q3 In an ordered set, 75% of data items are less than or equal to the upper bound of this range. For a continuous distribution the UQ is the set of values from 75% (0.75) to 100% obtained from the cumulative distribution of the values or function. The treatment of cases where n is even or odd, and where i runs from 1 to n or from 0 to n, varies between implementations. UQ={X3(n+1)/4, …, Xn}. If Q2 is the median of a set of data items, Q3 is the median of the values from the maximum down to and including Q2.

Inter-quartile range, IQR, Q3-Q1 The difference between the lower and upper quartile values, hence covering the middle 50% of the distribution. The inter-quartile range can be obtained by taking the median of the dataset, then finding the median of the upper and lower halves of the set. The IQR is then the difference between these two secondary medians. IQR=UQ-LQ=Q3-Q1. The IQR is a robust measure of spread as it is unaffected by outliers in the upper and lower tails of a sample.
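The median-of-halves procedure described above can be sketched as follows. The helper iqr_by_halves and the sample data are hypothetical; the choice to include the middle item in both halves when n is odd follows the "up to and including Q2" wording used for the quartiles, and other conventions exist.

```python
from statistics import median


def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2         # includes the middle item when n is odd
    lower = ordered[:half]      # minimum up to (and including) the median
    upper = ordered[n - half:]  # median (inclusive) up to the maximum
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print(iqr_by_halves(data))          # (3, 7, 4)
print(iqr_by_halves(with_outlier))  # (3, 7, 4)

# The range, by contrast, is dominated by the single extreme value.
print(max(data) - min(data), max(with_outlier) - min(with_outlier))  # 8 899
```

Replacing the largest value with an extreme outlier leaves the IQR unchanged while the range grows from 8 to 899, which illustrates why the IQR is called a robust measure of spread.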

Trim-range, TR, t