computer science lecture 01 descriptive stat

ssuser6c723c 1 views 88 slides Sep 28, 2025


Descriptive Statistics

Books

What is Statistics?
- …… a form of knowledge - a mode of arranging and stating facts which belong to various sciences (Lond. and Westn. Rev., 1838)
- Science dealing with the collection, analysis, interpretation, and presentation of masses of numerical data (Webster dictionary, 1966)
- Science of collecting and analysing numerical data (Oxford dictionary, 1996)

Population Vs Sample
An investigation will typically focus on a well-defined collection of objects constituting a population of interest. When the desired information is available for all objects in the population, we have what is called a census. Constraints on time, money, and other scarce resources usually make a census impractical or infeasible. Instead, a subset of the population, called a sample, is selected in some prescribed manner.

Parameter Vs Statistic
A descriptive measure of the population is called a parameter. A descriptive measure of the sample is called a statistic.

Descriptive statistics Inferential statistics Branches of statistics

Descriptive statistics
If a business analyst is using data gathered on a group to describe and reach conclusions about the same group, the statistics are called descriptive statistics. Example: an instructor produces statistics to summarize a class's examination efforts and uses those statistics to reach conclusions about that class only.

Inferential statistics
If a researcher gathers data from a sample and uses the statistics generated to reach conclusions about the population from which the sample was taken, the statistics are called inferential statistics. Example: pharmaceutical research.

BITS Pilani Pilani Campus Terminologies

Variable
A variable is any characteristic whose value may change from one object to another in the population, e.g. the age of the patients, the number of visits to a particular website, etc.

Types of variable
Categorical/qualitative variables: Take category or label values and place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive.
Quantitative variables: Take numerical values and represent some kind of measurement.

Example: Indian census data 2010

     State     Zip code   Family size   Annual income
1    U.P       201001     5             10,00,000
2    Delhi     110092     10            25,00,000
3    Gurgaon   122503     12            40,00,000
4    Delhi     110091     4             8,00,000
5    U.P       201003     2             2,00,000
6    Gurgaon   122004     1             5,00,000

Data Sets
A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.
- A univariate data set consists of observations on a single variable.
- Bivariate data sets have observations made on two variables.
- Multivariate data arises when observations are made on more than one variable (so bivariate is a special case of multivariate).

Data Measurement
Nominal level: The lowest level of data measurement. Numbers representing nominal-level data can be used only to classify or categorize. Example: employee ID.

Ordinal level: In addition to nominal-level capabilities, ordinal data can be used to rank or order objects. The categories of an ordinal variable show order, but not the magnitude of the difference between two adjacent points. Example: a supervisor can rank the productivity of employees from 1 to 5.

Interval level: The distances between consecutive numbers have meaning and the data are always numerical. The distance between pairs of consecutive numbers is assumed to be equal. Zero is just another point on the scale and does not mean the absence of the phenomenon. Example: temperature in Fahrenheit.

Ratio level: Has the same properties as interval data, but ratio data have an absolute zero, and the ratio of two numbers is meaningful. Examples: height, weight, time, volume, production cycle time, etc. For instance, we know that someone who is forty years old is twice as old as someone who is twenty years old. There is a meaningful zero point; that is, it is possible to have the absence of age.

Exercise: Healthcare industry
The following types of questions are sometimes asked in a survey. What level of data measurement will each question produce?
1. How long ago were you released from the hospital?
2. Which type of unit were you in for most of your stay? (Intensive care / Maternity care / Surgical unit)
3. How serious was your condition when you were first admitted to the hospital? (Critical / Moderate / Minor)

Data Visualization

Pie charts and bar charts for categorical data
Refer to the BMI data. Both the pie chart and the bar chart help us visualize the distribution of a categorical variable.

Histograms for quantitative data
Here are the scores in mathematics for 15 students: 78, 58, 65, 71, 57, 74, 79, 75, 87, 92, 81, 69, 66, 43, 63

Score       Count
[40-50]     1
[50-60]     2
[60-70]     4
[70-80]     5
[80-90]     2
[90-100]    1

Ques. What percentage of students earned less than a grade of 70 on the exam?
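The frequency table and the question above can be checked with a short sketch. It assumes each bin is 10 points wide and that a score belongs to the bin whose lower edge is its tens digit times 10; the variable names are illustrative and not from the slides.

```python
from collections import Counter

scores = [78, 58, 65, 71, 57, 74, 79, 75, 87, 92, 81, 69, 66, 43, 63]

# Assumption: a score s falls in the bin starting at (s // 10) * 10,
# e.g. 78 is counted in the 70-80 bin.
bin_counts = Counter((s // 10) * 10 for s in scores)

for lower in sorted(bin_counts):
    print(f"[{lower}-{lower + 10}] {bin_counts[lower]}")

# Share of students who scored below 70
below_70 = sum(1 for s in scores if s < 70)
percent_below_70 = 100 * below_70 / len(scores)
print(f"{below_70} of {len(scores)} students scored below 70 "
      f"({percent_below_70:.1f}%)")
```

Counting the first three bins of the table (1 + 2 + 4) gives the same 7 students out of 15.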

Example
A survey was conducted to see how many video calls people made daily. The results are displayed in the table below:

Number of calls made   Frequency
1 - 3                  10
4 - 7                  7
8 - 11                 4
12 - 15                1
16 - 19                1

Ques 1. Make a bar chart, then tell how many of the people surveyed make fewer than 4 video calls daily.
Ques 2. How many people were surveyed?

When describing the shape of a distribution, we should consider: Symmetry/skewness of the distribution. Peakedness (modality) — the number of peaks (modes) the distribution has. Shape of histograms

Symmetric and single peaked distribution https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/describing-distributions/

Symmetric and double peaked distribution https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/describing-distributions/

Symmetric and flat distribution https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/describing-distributions/

Right skewed distribution https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/describing-distributions/

Left skewed distribution https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/describing-distributions/

Center
The center of the distribution is its midpoint: the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
Figure: Histogram of the energy consumption data. Image: Probability and Statistics for Engineering and the Sciences by Devore.

Spread
The spread (also called variability) of the distribution can be described by the approximate range covered by the data. From looking at the histogram, we can approximate the smallest observation (min) and the largest observation (max), and thus approximate the range.
Figure: Histogram of the energy consumption data. Image: Probability and Statistics for Engineering and the Sciences by Devore.

Stem and Leaf Plot
The stemplot (also called stem and leaf plot) is another graphical display of the distribution of quantitative data. Separate each data point into a stem and leaf, as follows:
- The leaf is the right-most digit.
- The stem is everything except the right-most digit.
So, if the data point is 54, then 5 is the stem and 4 is the leaf. If the data point is 5.35, then 5.3 is the stem and 5 is the leaf.
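A minimal sketch of the stem/leaf split described above, assuming non-negative whole-number data. The function name `stemplot` is hypothetical, and the sample values are the 15 mathematics scores from the histogram slide. Stems with no observations are simply omitted here, which a hand-drawn stemplot would normally still show.

```python
from collections import defaultdict

def stemplot(values):
    """Group non-negative integers by stem (all but the last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf is the right-most digit
        rows[stem].append(leaf)
    return {stem: rows[stem] for stem in sorted(rows)}

scores = [78, 58, 65, 71, 57, 74, 79, 75, 87, 92, 81, 69, 66, 43, 63]
plot = stemplot(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```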

Example. Image: Probability and Statistics for Engineering and the Sciences by Devore.

Used to summarize a quantitative variable graphically. The dotplot, like the stemplot, shows each observation, but displays it with a dot rather than with its actual value. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. Dotplot

Example: A dot plot of 50 random values from 0 to 9. Image: Wikipedia

Measures of Location/Central Tendency: Ungrouped Data

Ungrouped Data
Data presented as individual values without being categorized into intervals.
The systolic blood pressure (SBP) readings of 10 patients (in mmHg): 110, 118, 125, 132, 140, 145, 148, 150, 155, 160
This data is not classified into groups or intervals and is presented as individual values.
Used when the sample size is small or when precise values are needed.

Grouped Data
Data organized into frequency groups or intervals for better analysis.
The same blood pressure data grouped into intervals:

SBP Range (mmHg)   Number of Patients
110 - 120          2
121 - 130          1
131 - 140          2
141 - 150          3
151 - 160          2

Grouping helps identify patterns and distributions in larger datasets.
Commonly used in medical statistics, epidemiology, and clinical research.


Measures of Central Tendency: Understanding Mean, Median, and Mode
- Mean: The average of all values.
- Median: The middle value when data is ordered.
- Mode: The most frequently occurring value.

Calculating the Mean
Example: SBP readings (mmHg) from 10 patients: 110, 118, 125, 132, 140, 145, 148, 150, 155, 160
Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1383}{10} = 138.3$ mmHg

Median
The median M is the midpoint of the ordered distribution.
Steps:
1. Order the data from smallest to largest.
2. Consider whether n, the number of observations, is even or odd.
- If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the $(n+1)/2$ spot in the ordered list.
- If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the $n/2$ and $(n/2)+1$ spots in the ordered list.

Median with Even and Odd Data Points
Odd number of data points: BP readings: 110, 118, 125, 132, 140. Median = 125 mmHg (middle value).
Even number of data points: BP readings: 110, 118, 125, 132, 140, 145. Median = (125 + 132)/2 = 128.5 mmHg.

Calculating the Mode
Mode = the most frequently occurring value in the dataset.
Ordered data: 110, 118, 125, 132, 132, 140, 145, 148, 150, 155, 160
Since 132 appears the maximum number of times, 132 is the mode of this dataset.
Ordered data: 110, 118, 125, 132, 132, 140, 145, 148, 148, 150, 155, 160
Both 132 and 148 are modes of this data, so it is a bimodal dataset.
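The three measures for the SBP examples can be reproduced with Python's standard `statistics` module. This is only a sketch for checking the slide arithmetic; the variable names are illustrative.

```python
import statistics

sbp = [110, 118, 125, 132, 140, 145, 148, 150, 155, 160]

mean_sbp = statistics.mean(sbp)      # 1383 / 10
median_sbp = statistics.median(sbp)  # average of the 5th and 6th ordered values

# multimode returns every value tied for the highest frequency
one_mode = statistics.multimode(
    [110, 118, 125, 132, 132, 140, 145, 148, 150, 155, 160])
two_modes = statistics.multimode(
    [110, 118, 125, 132, 132, 140, 145, 148, 148, 150, 155, 160])

print(mean_sbp, median_sbp, one_mode, two_modes)
```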

Medical Interpretation
- The Mean (138.3 mmHg) represents the average blood pressure.
- The Median (142.5 mmHg) is useful when data has outliers.
- The Mode helps identify the most common value in large datasets.
Doctors use these measures to understand blood pressure trends and diagnose hypertension.

Summary & Key Takeaways
- Mean gives the overall average.
- Median helps when data has extreme values.
- Mode is useful for identifying common occurrences.
- These measures help in analyzing medical data for better decision-making.

When to Use Mean?
- Data is symmetrically distributed (no extreme values).
- Best for large datasets without outliers.
- Commonly used in medical research for patient averages (e.g., average heart rate, BP).

Pitfall of Mean Calculation
9 patients have glucose levels (mg/dL): 80, 85, 90, 95, 100, 105, 110, 115, 600
Mean $= \frac{1380}{9} \approx 153.3$ mg/dL (misleading)
- The 600 mg/dL reading is an outlier, making the mean misleading.
- Doctors might misinterpret this as overall high glucose levels.

When to Use Median?
- Data has outliers or a skewed distribution.
- Best for small sample sizes or non-normal distributions.
- Often used in income data, hospital stay durations, and patient recovery times.

Relevance of Mode
Most Common Blood Type in a Population
- Blood types in a sample: O, A, B, O, AB, A, O, O, A, B
- Mode = O (most frequently occurring type)
Most Prescribed Painkiller
- Painkiller prescriptions in a hospital: Paracetamol, Ibuprofen, Paracetamol, Aspirin, Paracetamol
- Mode = Paracetamol (most commonly prescribed drug)

Impact of Outliers on Mean vs. Median & Mode
Example: Hospital stay durations (in days): 2, 3, 4, 5, 6, 50 (outlier)
Mean = 70/6 ≈ 11.7 days (misleading)
Median = middle value = (4 + 5)/2 = 4.5 days (resistant to outliers)
Mode = most frequent value, if any appears more than once (here no value repeats, so there is no mode)
Conclusion:
- The mean is affected by outliers, which can distort interpretation.
- The median and mode are not influenced by the outlier and provide a better representation in skewed data.
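To see how much the single 50-day stay moves each measure, the sketch below computes the mean and median with and without it. Dropping the last value is done purely for illustration, not as a recommended way to handle outliers.

```python
import statistics

stays = [2, 3, 4, 5, 6, 50]   # 50 is the outlier
without_outlier = stays[:-1]  # illustration only: the same data minus the 50

mean_with = statistics.mean(stays)
median_with = statistics.median(stays)
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```

The mean moves from 4 to roughly 11.7 days, while the median only shifts from 4 to 4.5.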

Comparing the Mean and the Median: Interpretations
- For symmetric distributions with no outliers: mean is approximately equal to median.
- For skewed right distributions and/or datasets with high outliers: mean > median.
- For skewed left distributions and/or datasets with low outliers: mean < median.

Mean
The mean is the sum of the observations divided by the number of observations.
11 6 7 5 2 8 11 12 15
Ques 1. Mean is? 77/9 ≈ 8.56

When to Use the Mean Sampling stability is desired. Other measures are to be computed such as standard deviation, coefficient of variation and skewness Mean

Example
11 6 7 5 2 8 11 12 15
Ques 1. Median is?
Ordered data: 2 5 6 7 8 11 11 12 15
Location is (9+1)/2 = 5th element, so the median is 8.

Mode
The mode is the most commonly occurring value in a distribution.
11 6 7 5 2 8 11 12 15
Ques 1. Mode is? The mode will be 11.
Ques 2. What kind of distribution is formed by the data from the above 9 students? Unimodal.

The mean is very sensitive to outliers, while the median is resistant to outliers? TRUE or FALSE? Use the below data to analyze Data set A → 54 55 56 68 70 71 73 Data set B → 54 55 56 68 70 71 730 Comparing the Mean and the Median
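One way to work through this question is to compute both measures for the two data sets, which differ only in their last value. A small sketch using the standard `statistics` module:

```python
import statistics

set_a = [54, 55, 56, 68, 70, 71, 73]
set_b = [54, 55, 56, 68, 70, 71, 730]  # same data, last value replaced by 730

mean_a, median_a = statistics.mean(set_a), statistics.median(set_a)
mean_b, median_b = statistics.mean(set_b), statistics.median(set_b)

print(f"A: mean = {mean_a:.2f}, median = {median_a}")
print(f"B: mean = {mean_b:.2f}, median = {median_b}")
```

The median stays at 68 in both sets, while the mean rises from about 63.86 to about 157.71, so the statement is TRUE.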

Quartiles in statistics are values that divide your data into quarters. However, quartiles aren’t shaped like pizza slices; Instead they divide your data into four segments according to where the numbers fall on the number line. Quartiles

There are three quartiles: the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3). The first quartile (lower quartile, QL), is equal to the 25th percentile of the data. The second (middle) quartile or median of a data set is equal to the 50th percentile of the data The third quartile, called upper quartile (QU), is equal to the 75th percentile of the data. Quartiles

What are Quartiles? Quartiles divide data into four equal parts. • Q1 (First Quartile) - 25% of data falls below this value. • Q2 (Median) - 50% of data falls below this value. • Q3 (Third Quartile) - 75% of data falls below this value.

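Example
Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49
Q2 will be the (n+1)/2 = 12/2 = 6th element, i.e. 40
Q1 will be (15+36)/2 = 25.5
Q3 will be (42+43)/2 = 42.5
(Q1 and Q3 here are the medians of the lower and upper halves, with the median 40 counted in both halves.)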
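Several quartile conventions exist, and they can give different answers for the same data. The sketch below assumes the convention that reproduces the values in this example: split the ordered data at the median and, when the count is odd, keep the median in both halves. The function name `quartiles_inclusive` is hypothetical.

```python
import statistics

def quartiles_inclusive(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half.

    Assumption: for an odd number of values the median belongs to both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half (median included when n is odd)
    lower, upper = data[:half], data[n - half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
q1, q2, q3 = quartiles_inclusive(data)
print(q1, q2, q3)
```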

The central box spans from Q1 to Q3. A line in the box marks the median Lines go from the edges of the box to the smallest and largest observations that are not classified as outliers. Steps

Why Use Quartiles in Medicine? • Helps compare treatment effectiveness. • Identifies variability in patient responses. • Detects outliers that indicate unusual recovery patterns.

Comparing Two Treatments: Recovery Times
• Study on 20 patients using two medications (A & B).
• Recovery times measured in days:
Treatment A: 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 14, 15, 16, 17, 18, 20, 21, 22, 24
Treatment B: 3, 4, 5, 5, 6, 7, 8, 9, 9, 10, 11, 11, 12, 13, 14, 14, 15, 16, 17, 18
• Quartiles help determine which treatment is more effective.

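Quartiles for Each Treatment
• Treatment A: Q1 = 9.5 days, Median (Q2) = 13.5 days, Q3 = 17.5 days
• Treatment B: Q1 = 6.5 days, Median (Q2) = 10.5 days, Q3 = 14 days
(With 20 values per treatment, the median is the average of the 10th and 11th ordered values, and Q1 and Q3 are the medians of the lower and upper ten values.)

Box Plot Comparison of Treatments
Treatment B has a lower median recovery time, suggesting better effectiveness.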

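What do we learn from Box Plots
• Treatment B has a lower median recovery time, meaning most patients recover faster.
• Treatment A shows higher variability (recovery times span 5 to 24 days, versus 3 to 18 days for Treatment B), meaning recovery times are less predictable.
• Treatment A's longest recovery times extend further (up to 24 days), although no value in either group falls outside the 1.5(IQR) limits for suspected outliers.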

Clinical Decision: Which Treatment is Better?
• Doctors prefer Treatment B because:
- Faster recovery (lower median and Q3).
- Less variability in recovery times.
- Fewer extreme recovery delays.
• Treatment A may still be useful for specific patient cases.

Why Quartiles Matter in Drug Research • Used in clinical trials to compare drug effectiveness. • Helps identify best treatments for different patient groups. • Supports evidence-based decision-making in medicine.

Measures of Location/Central Tendency: Grouped Data

Let's assume we have data on blood pressure reduction (mmHg) after a treatment for a group of patients.

Blood Pressure Reduction (mmHg)   Frequency (f)
10 - 19                           5
20 - 29                           8
30 - 39                           15
40 - 49                           10
50 - 59                           7

- On average, how much did a patient's blood pressure reduce after the treatment?
- What is the typical blood pressure reduction experienced by a patient, such that half of the patients had a greater reduction and half had a lower reduction?
- What is the most frequently observed blood pressure reduction among patients?

Mean - Blood Pressure Reduction
Formula: $\bar{x} = \frac{\sum f x}{\sum f}$ ($x$ is the midpoint of the class)
Steps:
1. Compute the midpoint ($x$) for each class.
2. Multiply the midpoint by the frequency ($f \cdot x$).
3. Sum up and divide by the total frequency.

Blood Pressure Reduction (mmHg)   Frequency (f)   Midpoint (x)   f*x
10 - 19                           5               14.5           72.5
20 - 29                           8               24.5           196
30 - 39                           15              34.5           517.5
40 - 49                           10              44.5           445
50 - 59                           7               54.5           381.5
Total                             45                             1612.5

Mean $= \frac{1612.5}{45} \approx 35.83$ mmHg
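The three steps for the grouped mean can be sketched in a few lines. The tuple layout `(lower limit, upper limit, frequency)` is an illustrative choice, not something taken from the slides.

```python
# (lower limit, upper limit, frequency) for each class
classes = [(10, 19, 5), (20, 29, 8), (30, 39, 15), (40, 49, 10), (50, 59, 7)]

total_f = sum(f for _, _, f in classes)
# Midpoint of a class is (lower + upper) / 2; weight it by the frequency
total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

grouped_mean = total_fx / total_f
print(f"sum f = {total_f}, sum f*x = {total_fx}, mean = {grouped_mean:.2f} mmHg")
```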

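Median - Blood Pressure Reduction

Class Interval   Frequency (f)   Cumulative Frequency (CF)
10 - 19          5               5
20 - 29          8               13
30 - 39          15              28
40 - 49          10              38
50 - 59          7               45

Using the usual interpolation formula for grouped data, Median $= L + \frac{N/2 - CF}{f} \times h$, where $L$ is the lower boundary of the median class, $CF$ is the cumulative frequency before it, $f$ is its frequency, and $h$ is the class width. Here $N/2 = 22.5$, so the median class is 30 - 39:
Median $= 29.5 + \frac{22.5 - 13}{15} \times 10 \approx 35.83$ mmHg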
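As a hedged sketch, the grouped median can be computed with the common textbook interpolation formula, Median = L + ((N/2 - CF) / f) * h. The formula on the original slide did not survive extraction, so this formula, the class boundaries at 0.5 below each lower limit, and all names below are assumptions.

```python
# (lower limit, upper limit, frequency) for each class
classes = [(10, 19, 5), (20, 29, 8), (30, 39, 15), (40, 49, 10), (50, 59, 7)]
width = 10  # assumed class width (boundaries 9.5, 19.5, 29.5, ...)

n = sum(f for _, _, f in classes)
cum_before = 0  # cumulative frequency before the current class
for lo, hi, f in classes:
    if cum_before + f >= n / 2:  # first class whose CF reaches N/2
        lower_boundary = lo - 0.5
        grouped_median = lower_boundary + (n / 2 - cum_before) / f * width
        break
    cum_before += f

print(f"median class starts at {lower_boundary}, median = {grouped_median:.2f} mmHg")
```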

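Mode Calculation
1. Identify the modal class (highest frequency class).
2. Apply the formula Mode $= L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$, using:
- $L$ = lower boundary of modal class
- $f_1$ = frequency of modal class
- $f_0$ = frequency of class before modal class
- $f_2$ = frequency of class after modal class
- $h$ = class width

Blood Pressure Reduction (mmHg)   Frequency (f)
10 - 19                           5
20 - 29                           8
30 - 39                           15
40 - 49                           10
50 - 59                           7

The modal class is 30 - 39, so Mode $= 29.5 + \frac{15 - 8}{2(15) - 8 - 10} \times 10 \approx 35.33$ mmHg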
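The quantities listed above match the standard grouped-mode formula, Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h. The sketch below applies it to the table; the symbol names and the boundary convention (0.5 below the lower class limit) are assumptions, since the slide's own notation was lost.

```python
lower_limits = [10, 20, 30, 40, 50]
freqs = [5, 8, 15, 10, 7]
h = 10  # assumed class width

i = freqs.index(max(freqs))                      # position of the modal class
f1 = freqs[i]                                    # frequency of modal class
f0 = freqs[i - 1] if i > 0 else 0                # class before (0 if none)
f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # class after (0 if none)
L = lower_limits[i] - 0.5                        # lower boundary of modal class

grouped_mode = L + (f1 - f0) / (2 * f1 - f0 - f2) * h
print(f"modal class starts at {L}, mode = {grouped_mode:.2f} mmHg")
```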

Doctor's Conclusion & Summary
• Treatment B is more effective based on quartiles and box plots.
• Doctors use quartiles to make data-driven treatment decisions.
• Quartiles help improve patient care by identifying the best treatment options.

Percentile
A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls.
Formula: i = (P/100) * N
- i = percentile location
- P = the percentile of interest
- N = number of observations in the dataset
If i is a whole number, the Pth percentile is the average of the values at locations i and i+1.
If i is not a whole number, the Pth percentile is located at the whole number part of i+1.

Example
Ques. What is the 80th percentile of 1240 numbers?
Percentile location = (80/100) * 1240 = 992
So the answer is the average of the terms at locations 992 and 993.
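The location rule above can be sketched as a small function. It assumes P is a whole number strictly between 0 and 100 and that locations are counted from 1 in the ordered data; the function name and the stand-in data set (the integers 1 to 1240) are illustrative only.

```python
def percentile(values, p):
    """Pth percentile using the location rule i = (P/100) * N.

    Assumes p is an integer with 0 < p < 100.
    """
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(p * n, 100)  # i = whole + remainder/100
    if remainder == 0:
        # i is a whole number: average the values at locations i and i+1
        return (data[whole - 1] + data[whole]) / 2
    # i is not whole: take the value at the whole-number part of i + 1
    return data[whole]

numbers = range(1, 1241)  # 1240 stand-in observations
p80 = percentile(numbers, 80)
print(p80)
```

With this stand-in data the values at locations 992 and 993 are 992 and 993 themselves, so the result is their average.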

Measures of variability: Ungrouped Data

Range
The range is exactly the distance between the smallest data point (min) and the largest one (max).
Here are the number of hours that 9 students spend on social media on a typical day: 11 6 7 5 2 8 11 12 15
Ques. Range for the above case is?
Ordered data: 2 5 6 7 8 11 11 12 15
Range is 15 - 2 = 13

Inter-Quartile Range (IQR)
While the range quantifies the variability by looking at the range covered by ALL the data, the IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data. The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 - Q1
The IQR should be used as a measure of spread of a distribution only when the median is used as a measure of center.

Example
Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49
Q2 will be the (n+1)/2 = 12/2 = 6th element, i.e. 40
Q1 will be (15+36)/2 = 25.5
Q3 will be (42+43)/2 = 42.5
Ques. The IQR will be? Q3 - Q1 = 42.5 - 25.5 = 17

The IQR is used as the basis for a rule of thumb for identifying outliers. An observation is considered a suspected outlier if it is: below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) Using the IQR to Detect Outliers
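A minimal sketch of the 1.5(IQR) rule, applied to the glucose readings from the earlier slide on the pitfall of the mean. Quartiles are taken from `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions and an assumption here; for this data it gives Q1 = 90 and Q3 = 110. The helper name `iqr_fences` is hypothetical.

```python
import statistics

def iqr_fences(q1, q3):
    """Lower and upper limits beyond which a value is a suspected outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

glucose = [80, 85, 90, 95, 100, 105, 110, 115, 600]

# Assumption: "inclusive" quartile convention from the standard library
q1, _, q3 = statistics.quantiles(glucose, n=4, method="inclusive")
low, high = iqr_fences(q1, q3)
suspected = [x for x in glucose if x < low or x > high]

print(f"Q1 = {q1}, Q3 = {q3}, limits = ({low}, {high}), suspected = {suspected}")
```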

The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five- number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. Boxplot

The central box spans from Q1 to Q3. A line in the box marks the median Lines go from the edges of the box to the smallest and largest observations that are not classified as outliers. Steps Image: google

Boxplots are most useful when presented side- by- side for comparing and contrasting distributions from two or more groups. Side by side Box plot Actors: min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76 Actresses: min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80 Image: https://bolt.mph.ufl.edu/

Variance
It is the average of squared deviations about the arithmetic mean for a set of numbers. Image: google

Standard Deviation
The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical) distance between a data point and the mean. It is the square root of the variance.

Example
Here are the number of hours that 9 students spend on social media on a typical day: 11 6 7 5 2 8 11 12 15
Mean = 77/9 ≈ 8.56 (truncated to 8.55 in the calculations below)
Deviations from the mean:
11 - 8.55, 6 - 8.55, 7 - 8.55, 5 - 8.55, 2 - 8.55, 8 - 8.55, 11 - 8.55, 12 - 8.55, 15 - 8.55
= 2.45, -2.55, -1.55, -3.55, -6.55, -0.55, 2.45, 3.45, 6.45
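The example can be carried through to the variance and standard deviation with a short sketch. It uses the unrounded mean 77/9, so the deviations differ slightly from the two-decimal values above. Following the slide's definition (average of squared deviations), the main result divides by n; the sample version that divides by n - 1 is shown only for comparison and is not taken from the slides.

```python
import statistics

hours = [11, 6, 7, 5, 2, 8, 11, 12, 15]

mean_hours = sum(hours) / len(hours)           # 77 / 9
deviations = [x - mean_hours for x in hours]   # these sum to (almost exactly) 0
sum_sq = sum(d * d for d in deviations)

pop_variance = sum_sq / len(hours)             # average squared deviation
pop_sd = pop_variance ** 0.5
sample_variance = sum_sq / (len(hours) - 1)    # for comparison only

print(f"variance = {pop_variance:.3f}, standard deviation = {pop_sd:.3f}")
print(f"sample variance (n - 1) = {sample_variance:.3f}")
```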