Data Analysis technique, data collection, data analysis
EktaJolly
About This Presentation
Introduction to data analysis
Added: Dec 09, 2023
Slides: 95 pages
Slide Content
Data Analysis: The data, after collection, have to be processed and analysed in accordance with the outline laid down for the purpose at the time of developing the research plan. This is essential for a scientific study and for ensuring that we have all relevant data for making the contemplated comparisons and analyses. Processing implies editing, coding, classification and tabulation of the collected data so that they are amenable to analysis. The term analysis refers to the computation of certain measures, along with searching for patterns of relationship that exist among the data groups.
Editing: The process of examining the collected raw data to detect errors and omissions and to correct these when possible. It involves a careful scrutiny of the completed questionnaires and/or schedules. It ensures that the data are accurate, consistent with other facts gathered, uniformly entered, as complete as possible, and well arranged to facilitate coding and tabulation. With regard to the points or stages at which editing should be done, one can talk of field editing and central editing. Field editing consists in the review of the reporting forms by the investigator for completing (translating or rewriting) what has been written in abbreviated and/or illegible form at the time of recording the respondents' responses. This type of editing is necessary because individual writing styles can often be difficult for others to decipher.
Central editing should take place when all forms or schedules have been completed and returned to the office. It implies thorough editing by a single editor, or by a team of editors in the case of a large inquiry. The editor(s) may correct the obvious errors. In the case of inappropriate or missing replies, the editor can sometimes determine the proper answer by reviewing the other information in the schedule, and the respondent can also be contacted for clarification.
Editors must keep several points in view while performing their work: they should be familiar with the instructions given to the interviewers and coders as well as with the editing instructions supplied; when crossing out an original entry for one reason or another, they should draw just a single line through it so that it remains legible; they must make entries (if any) on the form in some distinctive colour, and in a standardised form; they should initial all answers which they change or supply; and the editor's initials and the date of editing should be placed on each completed form or schedule.
Coding: It refers to the process of assigning numerals or other symbols to answers so that responses can be put into a limited number of categories or classes appropriate to the research problem. The classes must possess the characteristic of exhaustiveness (there must be a class for every response) and also that of mutual exclusivity, which means that a specific answer can be placed in one and only one class. Another rule to be observed is that of unidimensionality, by which is meant that every class is defined in terms of only one concept. Through coding, the many replies may be reduced to a small number of classes which contain the critical information required for analysis.
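As an illustration of the coding step, the short Python sketch below maps closed-ended responses to numerals; the response categories and code numbers are invented for the example and are not taken from the slides.

```python
# Hypothetical code book for a five-point agreement scale (illustrative only).
code_book = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

raw_responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

# Exhaustiveness: every response falls into some class.
# Mutual exclusivity: each response receives exactly one code.
coded = [code_book[r] for r in raw_responses]
print(coded)  # [4, 3, 5, 4, 2]
```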
Classification: Most research studies result in a large volume of raw data which must be reduced into homogeneous groups if we are to get meaningful relationships. This fact necessitates classification of data, which is the process of arranging data in groups or classes on the basis of common characteristics. Data having a common characteristic are placed in one class, and in this way the entire data get divided into a number of groups or classes. Classification can be of the following two types, depending upon the nature of the phenomenon involved: according to attributes, or according to class intervals.
Tabulation: When a mass of data has been assembled, it becomes necessary for the researcher to arrange it in some kind of concise and logical order. This procedure is referred to as tabulation; thus, tabulation is the process of summarising raw data and displaying it in compact form (i.e., in the form of statistical tables) for further analysis. In a broader sense, tabulation is an orderly arrangement of data in columns and rows. Tabulation is essential for the following reasons: it conserves space and reduces explanatory and descriptive statements to a minimum; it facilitates the process of comparison; it facilitates the summation of items and the detection of errors and omissions; and it provides a basis for various statistical computations.
Generally Accepted Principles of Tabulation: Every table should have a clear, concise and adequate title so as to make the table intelligible without reference to the text, and this title should always be placed just above the body of the table. Every table should be given a distinct number to facilitate easy reference. The column headings (captions) and the row headings (stubs) of the table should be clear and brief. The units of measurement under each heading or sub-heading must always be indicated. Explanatory footnotes, if any, concerning the table should be placed directly beneath the table, along with the reference symbols used in the table. The source or sources from which the data in the table have been obtained must be indicated just below the table. Usually the columns are separated from one another by lines, which make the table more readable and attractive.
Lines are always drawn at the top and bottom of the table and below the captions. There should be thick lines to separate the data under one class from the data under another class, and the lines separating the sub-divisions of the classes should be comparatively thin. The columns may be numbered to facilitate reference. Those columns whose data are to be compared should be kept side by side; similarly, percentages and/or averages must be kept close to the data. It is generally considered better to approximate figures before tabulation, as this reduces unnecessary detail in the table itself. In order to emphasise the relative significance of certain categories, different kinds of type, spacing and indentation may be used.
It is important that all column figures be properly aligned; decimal points and (+) or (–) signs should be in perfect alignment. Abbreviations should be avoided to the extent possible, and ditto marks should not be used in the table. Miscellaneous and exceptional items, if any, should usually be placed in the last row of the table. A table should be made as logical, clear, accurate and simple as possible; if the data happen to be very large, they should not be crowded into a single table, for that would make the table unwieldy and inconvenient. Totals of rows should normally be placed in the extreme right column, and totals of columns should be placed at the bottom. The arrangement of the categories in a table may be chronological, geographical, alphabetical or according to magnitude, to facilitate comparison.
Elements/Types of Analysis: By analysis we mean the computation of certain indices or measures, along with searching for patterns of relationship that exist among the data groups. It involves estimating the values of unknown parameters and testing hypotheses for drawing inferences. Analysis may therefore be categorised as descriptive analysis and inferential analysis (inferential analysis is often known as statistical analysis). Descriptive analysis is largely the study of distributions of one variable; this sort of analysis may be in respect of one variable (unidimensional analysis), two variables (bivariate analysis) or more than two variables (multivariate analysis). We may as well talk of correlation analysis and causal analysis.
Correlation analysis studies the joint variation of two or more variables for determining the amount of correlation between them. Causal analysis is concerned with the study of how one or more variables affect changes in another variable; it is thus a study of the functional relationships existing between two or more variables, and this analysis can be termed regression analysis. Causal analysis is considered relatively more important in experimental researches. In modern times, with the availability of computer facilities, there has been a rapid development of multivariate analysis, which may be defined as "all statistical methods which simultaneously analyse more than two variables".
Multivariate analysis: Multiple regression analysis is adopted when the researcher has one dependent variable which is presumed to be a function of two or more independent variables; the objective is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables. Multiple discriminant analysis is appropriate when the researcher has a single dependent variable that cannot be measured but can be classified into two or more groups on the basis of some attribute; the object is to predict an entity's possibility of belonging to a particular group based on several predictor variables. Multivariate analysis of variance (or multi-ANOVA) is an extension of two-way ANOVA, wherein the ratio of among-group variance to within-group variance is worked out on a set of variables.
Canonical analysis: This analysis can be used in the case of both measurable and non-measurable variables for the purpose of simultaneously predicting a set of dependent variables from their joint covariance with a set of independent variables. Inferential analysis is concerned with the various tests of significance for testing hypotheses, in order to determine with what validity data can be said to indicate some conclusion or conclusions.
Statistics in Research: The role of statistics in research is to function as a tool in designing research, analysing its data and drawing conclusions therefrom. Clearly the science of statistics cannot be ignored by any research worker, even though he may not have occasion to use statistical methods in all their details and ramifications. The important statistical measures are: measures of central tendency (statistical averages), measures of dispersion, measures of asymmetry (skewness), measures of relationship, and other measures.
Measures of central tendency (or statistical averages) tell us the point about which items have a tendency to cluster. Mean, median and mode are the most popular averages.
Median: Arrange the numbers in numerical order and count how many numbers there are. If the count is odd, divide it by 2 and round up to get the position of the median. If the count is even, divide it by 2, go to the number in that position and average it with the number in the next higher position to get the median. Mode: To find the mode, or modal value, it is best to put the numbers in order and then count how many of each number there are; the number that appears most often is the mode.
Find the mean, median, and mode for the following list of values: 13, 18, 13, 14, 13, 16, 14, 21, 13. Mean = 15; Median = 14; Mode = 13. For the list 1, 2, 4, 7: Mean = 3.5; Median = (2+4)/2 = 3; Mode: none (no value repeats). G.M. (geometric mean) and H.M. (harmonic mean) are other averages.
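The worked example can be checked with Python's standard statistics module; this is just a quick sketch, not part of the original slides.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(statistics.mean(values))    # 15  (sum 135 over 9 items)
print(statistics.median(values))  # 14  (middle value after sorting)
print(statistics.mode(values))    # 13  (most frequent value)
```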
Measures of Dispersion: An average can represent a series only as well as a single figure can; it fails to give any idea about the scatter of the values in the series around the true value of the average. In order to measure this scatter, statistical devices called measures of dispersion are calculated. Important measures of dispersion are the range, the mean deviation and the standard deviation. https://geographyfieldwork.com/DataPresentationScatterGraphs.htm#
Range: It is the simplest possible measure of dispersion and is defined as the difference between the values of the extreme items of a series: Range = highest value of an item in the series − lowest value of an item in the series. It gives an idea of the variability very quickly, but the drawback is that the range is affected very greatly by fluctuations of sampling; its value is never stable, being based on only two values of the variable. As such, the range is mostly used as a rough measure of variability and is not considered an appropriate measure in serious research studies.
Mean deviation: It is the average of the differences of the values of items from some average of the series; in calculating the mean deviation we ignore the minus signs of deviations while taking their total. Standard deviation: It is the most widely used measure of dispersion of a series and is commonly denoted by the symbol sigma ($\sigma$). The standard deviation is defined as the square root of the average of the squared deviations of the individual items of a series from their arithmetic average.
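The three dispersion measures can be computed directly from their definitions; the sketch below reuses the illustrative data from the earlier averages example and follows the slide's definitions (mean deviation about the arithmetic mean, standard deviation as the root of the average squared deviation).

```python
import math

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]
n = len(values)
mean = sum(values) / n  # 15.0

value_range = max(values) - min(values)                        # 21 - 13 = 8
mean_deviation = sum(abs(v - mean) for v in values) / n        # signs of deviations ignored
std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # root of mean squared deviation

print(value_range, mean_deviation, std_dev)  # 8, ~2.22, ~2.67
```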
Measures of Asymmetry: When the distribution of items in a series happens to be perfectly symmetrical, we obtain a bell-shaped curve for the distribution.
Such a curve is known as the normal curve, and the corresponding distribution as the normal distribution. It is a perfectly bell-shaped curve in which the values of the mean, median and mode coincide, and skewness is altogether absent. If the curve is distorted (whether on the right side or on the left side), we have an asymmetrical distribution, which indicates that there is skewness. If the curve is distorted on the right side, we have positive skewness, but when the curve is distorted towards the left, we have negative skewness.
Skewness is, thus, a measure of asymmetry and shows the manner in which the items are clustered around the average
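The slides do not give a formula for skewness, so as an added illustration the sketch below uses one common summary, Karl Pearson's (second) coefficient of skewness, which compares the mean and median relative to the standard deviation; a positive value indicates a longer right tail.

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]  # reusing the earlier illustrative data

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.pstdev(data)  # population standard deviation

# Pearson's second coefficient of skewness: 3 * (mean - median) / sd
skewness = 3 * (mean - median) / sd
print(skewness)  # ~1.1, positive: the items 18 and 21 pull the right tail out
```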
Measures of Relationship: The statistical measures discussed so far are in the context of a univariate population, i.e., measurement of only one variable. If for every measurement of a variable X there is a corresponding value of a second variable Y, the resulting pairs of values are called a bivariate population; similarly the data can be multivariate. There are several methods of determining the relationship between variables, but no method can tell us for certain that a correlation is indicative of a causal relationship.
Two types of questions arise in bivariate or multivariate populations: Does there exist association or correlation between the two (or more) variables? If yes, of what degree? Is there any cause-and-effect relationship between the two variables? If yes, of what degree and in which direction? The first question is answered by the use of correlation techniques and the second by the technique of regression.
There are several methods of applying the two techniques, but the important ones are as under. In the case of a bivariate population: correlation can be studied through cross tabulation, Charles Spearman's coefficient of correlation, or Karl Pearson's coefficient of correlation, whereas the cause-and-effect relationship can be studied through simple regression equations. In the case of a multivariate population: correlation can be studied through the coefficient of multiple correlation or the coefficient of partial correlation, whereas the cause-and-effect relationship can be studied through multiple regression.
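As a sketch of the bivariate techniques named above, the snippet below computes Karl Pearson's and Charles Spearman's coefficients with SciPy; the x/y values reuse the seven-point data set from the straight-line fitting example later in the deck.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5])

pearson_r, _ = stats.pearsonr(x, y)      # degree of linear association
spearman_rho, _ = stats.spearmanr(x, y)  # rank (monotonic) association

print(pearson_r, spearman_rho)
```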
Simple Regression Analysis: Regression is the determination of a statistical relationship between two or more variables. In simple regression we have only two variables: one variable (defined as independent) is the cause of the behaviour of the other (defined as the dependent variable). Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y. The basic relationship between X and Y is given by $\hat{Y} = a + bX$, where $\hat{Y}$ denotes the estimated value of Y for a given value of X.
The generally used method to find the 'best' fit that a straight line of this kind can give is the least-squares method.
Least-squares curve fitting method (figure). Source: S. S. Shashtri, Introductory Methods of Numerical Analysis, PHI Learning, New Delhi, 2012.
A sigmoid function is a mathematical function having a characteristic "S"-shaped curve, or sigmoid curve. A common example of a sigmoid function is the logistic function, defined by the formula $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.
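A minimal sketch of the logistic function (this is the standard textbook definition, added here only for illustration):

```python
import math

def sigmoid(x):
    """Logistic function: an S-shaped curve rising from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-2), sigmoid(0), sigmoid(2))  # ~0.119, 0.5, ~0.881
```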
Definition: Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. It is a statistical technique used to derive coefficient values for equations that express the value of one (dependent) variable as a function of another (independent) variable. https://www2.slideshare.net/shopnohinami/curve-fitting-53775511?from_action=save
What is curve fitting: Curve fitting is the process of constructing a curve, or mathematical function, that lies as close as possible to the series of data points. By curve fitting we can mathematically construct the functional relationship between the observed quantity and the parameter values, etc. It is highly effective for mathematically modelling some natural processes. https://www2.slideshare.net/shopnohinami/curve-fitting-53775511?from_action=save
Interpolation & Curve fitting In many application areas, one is faced with the test of describing data, often measured, with an analytic function. There are two approaches to this problem:- 1. In Interpolation, the data is assumed to be correct and what is desired is some way to descibe what happens between the data points. 2. The other approach is called curve fitting or regression, one looks for some smooth curve that ``best fits'' the data, but does not necessarily pass through any data points. In many application areas, one is faced with the test of describing data, often measured, with an analytic function . There are two approaches to this problem • In Interpolation, the data is assumed to be correct and what is desired is some way to describe what happens between the data points • The other approach is called curve fitting or regression , one looks for some smooth curve that `` best fits'' the data, but does not necessarily pass through any data points
Curve fitting: There are two general approaches. Least-squares regression: the data exhibit a significant degree of scatter, and the strategy is to derive a single curve that represents the general trend of the data. Interpolation: the data are very precise, and the strategy is to pass a curve, or a series of curves, through each of the points.
General approach for curve fitting
Engineering applications of the curve-fitting technique: In engineering, two types of applications are encountered. Trend analysis: predicting values of the dependent variable; this may include extrapolation beyond the data points or interpolation between data points. Hypothesis testing: comparing an existing mathematical model with measured data.
Data scatter (figure): scatter plots illustrating positive correlation and no correlation.
Mathematical Background: Variance — representation of spread by the square of the standard deviation, $S_y^2 = \dfrac{\sum (y_i - \bar{y})^2}{n-1} = \dfrac{\sum y_i^2 - (\sum y_i)^2/n}{n-1}$. Coefficient of variation — quantifies the spread of the data relative to the mean, $c.v. = \dfrac{S_y}{\bar{y}} \times 100\%$.
Least square method
Linear Regression: Criteria for a "Best" Fit. One criterion is to minimise the sum of the residual errors, $\min \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)$, but positive and negative errors can cancel (e.g., $e_1 = -e_2$).
A second criterion is to minimise the sum of the absolute residuals, $\min \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|$.
A third (minimax) criterion is to minimise the largest residual, $\min \max_i |e_i|$ with $e_i = y_i - a_0 - a_1 x_i$.
Linear curve fitting (straight line): Given a set of data points $(x_i, f(x_i))$, find a curve that best captures the general trend, where $g(x)$ is the approximating function. Here we try to fit a straight line through the data.
Linear Regression: Least-Squares Fit. Minimise the sum of the squares of the residuals between the measured y and the y computed from the model: $S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_{i,\mathrm{measured}} - y_{i,\mathrm{model}})^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$. This criterion yields a unique line for a given set of data.
The coefficients $a_0$ and $a_1$ that minimise $S_r$ must satisfy $\dfrac{\partial S_r}{\partial a_0} = 0$ and $\dfrac{\partial S_r}{\partial a_1} = 0$.
Determination of $a_0$ and $a_1$: $\dfrac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i) = 0$ and $\dfrac{\partial S_r}{\partial a_1} = -2 \sum (y_i - a_0 - a_1 x_i)\, x_i = 0$, which give the normal equations $n a_0 + a_1 \sum x_i = \sum y_i$ and $a_0 \sum x_i + a_1 \sum x_i^2 = \sum x_i y_i$ — two equations in two unknowns that can be solved simultaneously.
Solving these gives $a_1 = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$ and $a_0 = \bar{y} - a_1 \bar{x}$.
Error Quantification of Linear Regression: the sum of the squares of the residuals around the regression line is $S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$, and the total sum of the squares around the mean for the dependent variable y is $S_t = \sum (y_i - \bar{y})^2$.
Example: The table below gives the temperature T (in °C) and the resistance R (in Ω) of a circuit. If R = a_0 + a_1 T, find the values of a_0 and a_1.
T: 10, 20, 30, 40, 50, 60
R: 20.1, 20.2, 20.4, 20.6, 20.8, 21.0
Solution: The normal equations are 6a_0 + 210a_1 = 123.1 and 210a_0 + 9100a_1 = 4341, giving a_0 = 19.867 and a_1 = 0.01857, so g(T) = 19.867 + 0.01857 T.
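The solution can be checked numerically with the least-squares formulas for a_0 and a_1 (NumPy's polyfit gives the same line); a quick sketch:

```python
import numpy as np

T = np.array([10, 20, 30, 40, 50, 60], dtype=float)
R = np.array([20.1, 20.2, 20.4, 20.6, 20.8, 21.0])

n = len(T)
a1 = (n * np.sum(T * R) - np.sum(T) * np.sum(R)) / (n * np.sum(T**2) - np.sum(T)**2)
a0 = R.mean() - a1 * T.mean()

print(a0, a1)               # ~19.867, ~0.018571
print(np.polyfit(T, R, 1))  # same line: slope first, then intercept
```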
Least-Squares Fit of a Straight Line: Example. Fit a straight line to the x and y values in the following table:
x_i: 1, 2, 3, 4, 5, 6, 7
y_i: 0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5
x_i y_i: 0.5, 5.0, 6.0, 16.0, 17.5, 36.0, 38.5
x_i^2: 1, 4, 9, 16, 25, 36, 49
Sums: $\sum x_i = 28$, $\sum y_i = 24$, $\sum x_i y_i = 119.5$, $\sum x_i^2 = 140$; $\bar{x} = 28/7 = 4$, $\bar{y} = 24/7 = 3.428571$.
$a_1 = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2} = \dfrac{7 \times 119.5 - 28 \times 24}{7 \times 140 - 28^2} = 0.8392857$; $a_0 = \bar{y} - a_1 \bar{x} = 3.428571 - 0.8392857 \times 4 = 0.07142857$; hence y = 0.07142857 + 0.8392857 x.
Error analysis: $S_r = \sum e_i^2 = 2.9911$ and $S_t = \sum (y_i - \bar{y})^2 = 22.7143$, so $r^2 = \dfrac{S_t - S_r}{S_t} = 0.868$ and $r = 0.932$.
The standard deviation (which quantifies the spread around the mean) is $s_y = \sqrt{\dfrac{S_t}{n-1}} = \sqrt{\dfrac{22.7143}{7-1}} = 1.9457$, and the standard error of estimate (which quantifies the spread around the regression line) is $s_{y/x} = \sqrt{\dfrac{S_r}{n-2}} = \sqrt{\dfrac{2.9911}{7-2}} = 0.7735$.
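The error measures for this seven-point example can be reproduced with a short script; a sketch using the formulas above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5])

a1, a0 = np.polyfit(x, y, 1)       # slope ~0.8393, intercept ~0.0714
residuals = y - (a0 + a1 * x)

Sr = np.sum(residuals**2)          # ~2.9911, scatter around the regression line
St = np.sum((y - y.mean())**2)     # ~22.7143, scatter around the mean
r2 = (St - Sr) / St                # ~0.868

sy = np.sqrt(St / (len(x) - 1))    # standard deviation, ~1.9457
syx = np.sqrt(Sr / (len(x) - 2))   # standard error of estimate, ~0.7735

print(Sr, St, r2, sy, syx)
```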
Linearization of Nonlinear Relationships: The relationship between the dependent and independent variables is assumed to be linear; however, a few types of nonlinear functions can be transformed into linear regression problems: the exponential equation, the power equation, and the saturation-growth-rate equation.
1. The exponential equation: $y = a_1 e^{b_1 x}$; taking logarithms, $\ln y = \ln a_1 + b_1 x$, which has the linear form $y^* = a_0 + a_1 x$ with $y^* = \ln y$.
2. The power equation: $y = a_2 x^{b_2}$; taking logarithms, $\log y = \log a_2 + b_2 \log x$, which has the linear form $y^* = a_0 + a_1 x^*$ with $y^* = \log y$ and $x^* = \log x$.
3. The saturation-growth-rate equation: $y = a_3 \dfrac{x}{b_3 + x}$; inverting, $\dfrac{1}{y} = \dfrac{1}{a_3} + \dfrac{b_3}{a_3} \cdot \dfrac{1}{x}$, which has the linear form $y^* = a_0 + a_1 x^*$ with $y^* = 1/y$, $x^* = 1/x$, $a_0 = 1/a_3$ and $a_1 = b_3/a_3$.
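The transform-then-fit idea can be sketched in a few lines for the exponential case: take logarithms, fit a straight line, then map the intercept and slope back to a and b. The data reuse the exponential-function example that appears later in the deck.

```python
import numpy as np

# Exponential model y = a * exp(b * x): fit ln(y) = ln(a) + b*x as a straight line.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.5, 4.5, 6.0, 8.5, 11.0])

b, ln_a = np.polyfit(x, np.log(y), 1)  # slope = b, intercept = ln(a)
a = np.exp(ln_a)

print(a, b)  # fitted constants; the fitted curve is y_hat = a * exp(b * x)
```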
Example: Fit the equation $y = a_2 x^{b_2}$ to the data in the following table. Taking logarithms, $\log y = \log a_2 + b_2 \log x$; letting $Y^* = \log y$, $X^* = \log x$, $a_0 = \log a_2$ and $a_1 = b_2$, this becomes $Y^* = a_0 + a_1 X^*$.
x_i: 1, 2, 3, 4, 5
y_i: 0.5, 1.7, 3.4, 5.7, 8.4
X*_i = log x_i: 0.0000, 0.3010, 0.4771, 0.6021, 0.6990 (sum 2.079)
Y*_i = log y_i: -0.3010, 0.2304, 0.5315, 0.7559, 0.9243 (sum 2.141)
X*_i Y*_i: 0.0000, 0.0694, 0.2536, 0.4551, 0.6460 (sum 1.424)
X*_i^2: 0.0000, 0.0906, 0.2276, 0.3625, 0.4886 (sum 1.169)
$a_1 = \dfrac{n \sum X^* Y^* - \sum X^* \sum Y^*}{n \sum X^{*2} - (\sum X^*)^2} = \dfrac{5 \times 1.424 - 2.079 \times 2.141}{5 \times 1.169 - 2.079^2} = 1.75$; $a_0 = \bar{Y}^* - a_1 \bar{X}^* = 0.4282 - 1.75 \times 0.41584 = -0.300$.
Hence $\log y = -0.300 + 1.75 \log x$, i.e., $y \approx 0.5\, x^{1.75}$.
Polynomial Regression: Some engineering data are poorly represented by a straight line; for these cases a curve is better suited to fit the data. The least-squares method can readily be extended to fit the data to higher-order polynomials.
Polynomial Regression (cont’d) A parabola is preferable
Polynomial Regression (cont'd): A second-order polynomial (quadratic) is defined by $y = a_0 + a_1 x + a_2 x^2 + e$. The residual between the model and the data is $e_i = y_i - a_0 - a_1 x_i - a_2 x_i^2$, and the sum of squares of the residuals is $S_r = \sum e_i^2 = \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2$.
A 3×3 system of equations (the normal equations) must be solved to determine the coefficients of the polynomial: $n a_0 + (\sum x_i) a_1 + (\sum x_i^2) a_2 = \sum y_i$; $(\sum x_i) a_0 + (\sum x_i^2) a_1 + (\sum x_i^3) a_2 = \sum x_i y_i$; $(\sum x_i^2) a_0 + (\sum x_i^3) a_1 + (\sum x_i^4) a_2 = \sum x_i^2 y_i$. The standard error is $s_{y/x} = \sqrt{\dfrac{S_r}{n-3}}$ and the coefficient of determination is $r^2 = \dfrac{S_t - S_r}{S_t}$.
In general, the mth-order polynomial is $y = a_0 + a_1 x + a_2 x^2 + \dots + a_m x^m + e$; a system of $(m+1) \times (m+1)$ linear equations must be solved to determine its coefficients, the standard error is $s_{y/x} = \sqrt{\dfrac{S_r}{n-(m+1)}}$, and the coefficient of determination is again $r^2 = \dfrac{S_t - S_r}{S_t}$.
Polynomial Regression – Example: Fit a second-order polynomial to the data:
x_i: 0, 1, 2, 3, 4, 5
y_i: 2.1, 7.7, 13.6, 27.2, 40.9, 61.1
Sums: $\sum x_i = 15$, $\sum y_i = 152.6$, $\sum x_i^2 = 55$, $\sum x_i^3 = 225$, $\sum x_i^4 = 979$, $\sum x_i y_i = 585.6$, $\sum x_i^2 y_i = 2488.8$; n = 6, $\bar{x} = 2.5$, $\bar{y} = 25.433$.
Second-order polynomial example ($y = a_0 + a_1 x + a_2 x^2$):
x_i: 1, 2, 4, 6, 8
f_i: 4, 11, 19, 26, 30
x_i^2: 1, 4, 16, 36, 64; x_i^3: 1, 8, 64, 216, 512; x_i^4: 1, 16, 256, 1296, 4096; f_i x_i: 4, 22, 76, 156, 240; f_i x_i^2: 4, 44, 304, 936, 1920; fitted g(x): 4.505, 10.15, 19.43, 26.03, 29.95
Sums: $\sum x_i = 21$, $\sum f_i = 90$, $\sum x_i^2 = 121$, $\sum x_i^3 = 801$, $\sum x_i^4 = 5665$, $\sum f_i x_i = 498$, $\sum f_i x_i^2 = 3208$.
The normal equations are 5a_0 + 21a_1 + 121a_2 = 90; 21a_0 + 121a_1 + 801a_2 = 498; 121a_0 + 801a_1 + 5665a_2 = 3208. Solving gives a_0 = −1.81, a_1 = 6.65, a_2 = −0.335, so the required equation is g(x) = −1.81 + 6.65x − 0.335x².
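The 3×3 normal-equation system above can be solved directly, and NumPy's polyfit returns the same parabola; a sketch verifying the worked numbers:

```python
import numpy as np

x = np.array([1, 2, 4, 6, 8], dtype=float)
f = np.array([4, 11, 19, 26, 30], dtype=float)

# Normal equations for g(x) = a0 + a1*x + a2*x^2
A = np.array([[len(x),       x.sum(),      (x**2).sum()],
              [x.sum(),      (x**2).sum(), (x**3).sum()],
              [(x**2).sum(), (x**3).sum(), (x**4).sum()]])
b = np.array([f.sum(), (f * x).sum(), (f * x**2).sum()])

a0, a1, a2 = np.linalg.solve(A, b)
print(a0, a1, a2)           # approximately -1.8, 6.65, -0.335
print(np.polyfit(x, f, 2))  # same coefficients, highest power first
```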
Exponential function example: Fit y = a e^{bx} to the data:
x: 1, 2, 3, 4, 5
y: 1.5, 4.5, 6, 8.5, 11
Solution: $y = a e^{bx}$, so $\ln y = \ln a + bx$, which is of the form $Y = a_0 + a_1 X$ with $Y = \ln y$, $X = x$, $a_0 = \ln a$ and $a_1 = b$.
Polynomial Regression – Example (cont'd): Error analysis for the first polynomial example (x_i = 0…5):
x_i: 0, 1, 2, 3, 4, 5
y_i: 2.1, 7.7, 13.6, 27.2, 40.9, 61.1
y_model: 2.4786, 6.6986, 14.64, 26.303, 41.687, 60.793
e_i^2: 0.14332, 1.00286, 1.08158, 0.80491, 0.61951, 0.09439 (sum S_r = 3.74657)
(y_i − ȳ)^2: 544.42889, 314.45929, 140.01989, 3.12229, 239.22809, 1272.13489 (sum S_t = 2513.39)
The standard error of estimate is $s_{y/x} = \sqrt{\dfrac{3.74657}{6-3}} = 1.12$, and the coefficient of determination is $r^2 = \dfrac{2513.39 - 3.74657}{2513.39} = 0.99851$, so r = 0.99925.