Data analysis for business and economics education

FuleaAmena2 11 views 73 slides Jun 02, 2024


CHAPTER SEVEN: DATA ANALYSIS

7.1. Types of Data for Analysis

Before analyzing the data for your research, it is important to know the type of data you have at hand, because the type of data determines the technique you can use. The following figure summarizes the types of data used in research.

@researchaid 22-12-2022

[Figure: types of data used in research]

7.1.1. Data can be divided into two distinct groups: categorical and numerical

A. Categorical data
These are data that cannot be measured numerically as quantities. The variables are defined by the classes or categories into which an individual member falls. Categorical data can be further subdivided into:

1. Nominal data: values that cannot be measured numerically or ranked. Rather, these data simply count the number of occurrences in each category of a variable.

Examples of nominal variables:
  Where a person lives (AA, Adama, B/Dar, etc.)
  Gender (male, female)
  Nationality (American, Ethiopian, Chinese)
  Ethnicity (Oromo, Amhara, Tigre, Gurage, ...)

2. Ranked/ordinal data: values that can be ranked in order.

Examples of ordinal data:
  Education (elementary school, high school, college diploma, college degree, Master's, PhD)
  Agreement (strongly disagree, disagree, neutral, agree, strongly agree)
  Rating (excellent, good, fair, poor)
  Frequency (always, often, sometimes, never)
  Grading (A, B, C, D, F)
  Any other scale ("On a scale of 1 to 5", "1 to 7", ...)

Descriptive data with only two categories are known as dichotomous data. E.g. gender can be divided into female and male, or a question can have a "yes" or "no" response.

B. Numerical data
Numerical data, sometimes termed "quantifiable", are those whose values are measured or counted numerically as quantities. Numerical data can be analysed using a far wider range of statistics than categorical data.

There are two possible ways of subdividing numerical data: into interval or ratio data, and into continuous or discrete data.

A.1. Interval data: values on an interval scale can meaningfully be added and subtracted, but not multiplied and divided.
The Celsius temperature scale is a good example of an interval scale. Although the difference between, say, 20°C and 30°C is 10°C, it does not mean that 30°C is one and a half times as warm as 20°C. This is because 0°C does not represent a true zero: it does not mean that there is no temperature.
Similarly, a person whose IQ is 90 is not twice as clever as one with an IQ of 45.

A.2. Ratio data: data for which you can calculate the relative difference or ratio between any two data values for a variable.
E.g. if a multinational company makes a profit of $300,000,000 in one year and $600,000,000 the following year, we can say that profits have doubled.
Ratio data are the ultimate nirvana when it comes to measurement, scaling, and statistical analysis.

B.1. Continuous data: values that can theoretically take any value (sometimes within a restricted range), provided that you can measure them accurately enough.
Examples: age in years, weight, blood pressure readings, temperature, etc.

B.2. Discrete data: can, by contrast, be measured precisely. Each case takes one of a finite number of values from a scale that measures changes in discrete units. These data are often whole numbers (integers).
Examples:
  Number of mobile telephones manufactured
  Number of customers served
  Number of respondents interviewed

Coding the Data

Coding is the process of translating information gathered from questionnaires or other sources into something that can be analyzed.
It involves assigning a value to the information given; often the value is given a label.
Coding can make data more consistent.
Example: Question = Sex; Answers = Male, Female, M, or F. Coding will avoid such inconsistencies.

Coding Systems

Common coding systems (code and label) for dichotomous variables:
  0 = No, 1 = Yes (1 = value assigned, Yes = label of value)
  OR: 1 = No, 2 = Yes
  1 = M, 2 = F
When you assign a value, you must also make it clear what that value means. In the first example above 1 = Yes, but in the second example 1 = No. As long as it is clear how the data are coded, either is fine.
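As a minimal sketch of how a code book removes the Male/Female/M/F inconsistency, the snippet below maps raw answers to numeric codes. The names SEX_CODES, SEX_LABELS and code_sex are hypothetical, and the 1 = M, 2 = F scheme is just one of the acceptable codings mentioned above.

```python
# Hypothetical code book for the variable SEX (1 = male, 2 = female).
SEX_CODES = {"male": 1, "m": 1, "female": 2, "f": 2}
SEX_LABELS = {1: "Male", 2: "Female"}


def code_sex(raw_answer):
    """Return the numeric code for a raw answer, or None if it is not recognised."""
    return SEX_CODES.get(raw_answer.strip().lower())


# Illustrative raw answers with inconsistent spelling and spacing.
raw_answers = ["Male", "F", "female", "M", "m "]
coded = [code_sex(answer) for answer in raw_answers]
labels = [SEX_LABELS[code] for code in coded]
print(coded, labels)
```

Keeping the codes and the labels in one place is what makes it clear what each value means.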

Coding: Nominal Variables

For coding nominal variables, order makes no difference.
Example: variable RESIDENCE
  1 = Northeast
  2 = South
  3 = Northwest
  4 = Midwest
  5 = Southwest
Order does not matter; no ordered value is associated with each response.

Coding: Ordinal Variables

The coding process is similar to that for other categorical variables.
Example: variable EDUCATION, possible coding:
  0 = Did not graduate from high school
  1 = High school graduate
  2 = Some college or post-high school education
  3 = College graduate
It could also be coded in reverse order (0 = college graduate, 3 = did not graduate from high school).

Coding: Continuous Variables

Creating categories from a continuous variable (e.g. age) is common. You may break a continuous variable down into chosen categories by creating an ordinal categorical variable.
Example: variable AGE
  1 = 0-9 years old
  2 = 10-19 years old
  3 = 20-39 years old
  4 = 40-59 years old
  5 = 60 years or older
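The recoding of AGE into an ordinal variable can be sketched as follows, assuming the categories are 0-9, 10-19, 20-39, 40-59 and 60 or older. The function name code_age and the sample ages are illustrative only.

```python
def code_age(age_years):
    """Map an age in whole years to the ordinal codes 1-5 of the AGE example."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years <= 9:
        return 1
    if age_years <= 19:
        return 2
    if age_years <= 39:
        return 3
    if age_years <= 59:
        return 4
    return 5


ages = [4, 15, 27, 45, 60, 83]  # illustrative values, not from the source
age_codes = [code_age(age) for age in ages]
print(age_codes)
```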

Entering Data

Once your data have been coded, you can enter them into the computer. Increasingly, data analysis software contains algorithms that check the data for obvious errors as they are entered. Despite this, it is essential that you take considerable care to ensure that your data are entered correctly.

Checking for Errors

No matter how carefully you code and subsequently enter data, there will always be some errors. The main methods to check data for errors are as follows:
Look for illegitimate codes. In any coding scheme, only certain numbers are allocated; other numbers are, therefore, errors. Common errors are the inclusion of the letter O instead of zero, the letters l or I instead of 1, and the number 7 instead of 1.

Outliers? (really high or low numbers)
  Example: Age = 110 (really 10 or 11?)
Missing values?
  Did the person not give an answer?
  Was the answer accidentally not entered into the database?
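The three checks (illegitimate codes, outliers, missing values) can be automated along the following lines. This is a sketch under stated assumptions: the code book, the plausible age range of 0 to 105, and the records are all invented for illustration.

```python
VALID_SEX_CODES = {1, 2}   # hypothetical code book: 1 = male, 2 = female
AGE_RANGE = (0, 105)       # assumed plausible range, chosen for illustration

records = [                # illustrative records, not from the source
    {"id": 1, "sex": 1, "age": 34},
    {"id": 2, "sex": 7, "age": 41},      # 7 is not an allocated code
    {"id": 3, "sex": 2, "age": 110},     # possible outlier
    {"id": 4, "sex": None, "age": 29},   # missing value
]


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in ("sex", "age"):
        if record[field] is None:
            problems.append(f"{field}: missing")
    if record["sex"] is not None and record["sex"] not in VALID_SEX_CODES:
        problems.append("sex: illegitimate code")
    age = record["age"]
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        problems.append("age: outside plausible range")
    return problems


report = {record["id"]: check_record(record) for record in records}
for record_id, problems in report.items():
    print(record_id, problems)
```

A flagged value is only a candidate error; it still has to be checked against the original questionnaire.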

Coding Tip

Though you do not code until the data are gathered, you should think about how you are going to code while designing your questionnaire, before you gather any data. This will help you to collect the data in a format you can use.
Use of a code book is recommended, to avoid the need to remember the codes you assigned.

7.2. Types of Data Analysis

Data analysis is the process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
Data analysis can be carried out using:
  Descriptive statistics
  Inferential statistics
Descriptive statistics are used to describe, summarize, or explain a given set of data, whereas inferential statistics are used to infer characteristics of a population from a sample.

A. Measures of Central Tendency

These describe a single variable in terms of the applicable unit of analysis. The three most frequently used measures of central tendency are:
  Mode
  Median
  Mean

1. Mode

The mode can be defined as the most frequently occurring value in a group of observations.
If the scores for a given sample distribution are:
  32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
then the mode is 39, because a score of 39 occurs three times, more than any other score.
The mode is a very good measure for ascertaining the location of a distribution in the case of nominal data.

2. Median

The median is defined as the middle value in an ordered arrangement of observations. It is often used to summarize the location of a distribution. Further, the median can be used with ordinal, interval, or ratio measurements.
If the scores for a given sample distribution are:
  32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
then the median is (38 + 39) / 2 = 38.5.

3. Mean

The arithmetic mean is the most commonly used and accepted measure of central tendency. It should be used in the case of interval or ratio data.
If the scores for a given sample distribution are:
  32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
then the mean of the distribution is:
  (32 + 32 + 35 + 36 + 37 + 38 + 38 + 39 + 39 + 39 + 40 + 40 + 42 + 45) / 14 = 38
The mid-mean, geometric mean, and mid-range are other types of means.
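The three measures can be checked for the 14 scores with Python's standard statistics module. This is only a quick verification sketch; the variable names are arbitrary.

```python
import statistics

scores = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

mode = statistics.mode(scores)      # most frequently occurring value
median = statistics.median(scores)  # average of the two middle values (n is even)
mean = statistics.mean(scores)      # sum of the scores divided by 14

print("mode:", mode, "median:", median, "mean:", mean)
```

The results agree with the worked values above: a mode of 39, a median of 38.5 and a mean of 38.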

B. Measures of Dispersion

The measure of dispersion is as important as the measure of location for data description. Whenever researchers describe the measure of location, they should also specify the spread of the distribution, which is measured by a measure of dispersion.
Statistics measuring variability and dispersion are:
  o Range
  o Variance
  o Standard deviation

1. Range

The range is the difference between the highest and lowest value. It is based solely on the extreme values, so it cannot truly reveal the body of the measurements.

2. Variance

The variance is another measure of variability. For a sample it is expressed by the formula
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Because the deviations are squared, the variance makes the deviation look much larger than it actually is. To remove the effect of squaring, we take the square root of the averaged squared deviations, which gives the standard deviation.

3. Standard Deviation

The standard deviation provides the best measure of dispersion for interval/ratio measurements and is the most widely used statistical measure after the mean.
The standard deviation for a sample is calculated by the following formula:
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Formula to compute the standard deviation for a population:
  $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

Example

The owner of a restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected customers and notes the following amounts:
  44, 50, 38, 96, 42, 47, 40, 39, 46, 50
He calculates the mean by adding the observations and dividing by 10, which gives a mean of 49.2.
Below is the table for getting the standard deviation:

  X      X - 49.2    (X - 49.2)^2
  44       -5.2          27.04
  50        0.8           0.64
  38      -11.2         125.44
  96       46.8        2190.24
  42       -7.2          51.84
  47       -2.2           4.84
  40       -9.2          84.64
  39      -10.2         104.04
  46       -3.2          10.24
  50        0.8           0.64
  Total                2,599.60

Dividing the total of the squared deviations by n - 1 = 9 gives a variance of approximately 289, and the standard deviation is the square root of 289 = 17.
The mean for this example was 49.2 and the standard deviation was about 17. We have:
  49.2 - 17 = 32.2
  49.2 + 17 = 66.2
What this means is that most of the patrons probably spend between 32.20 and 66.20.
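The restaurant calculation can be reproduced step by step. The sketch below assumes the ten observations are 44, 50, 38, 96, 42, 47, 40, 39, 46 and 50, which are the values consistent with the stated mean of 49.2, and it uses the sample (n - 1) form of the variance.

```python
import math
import statistics

# Assumed observations (consistent with the stated mean of 49.2).
spending = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean = statistics.mean(spending)
squared_deviations = [(x - mean) ** 2 for x in spending]
sum_sq = sum(squared_deviations)

variance = sum_sq / (len(spending) - 1)   # sample variance, n - 1 in the denominator
sd = math.sqrt(variance)

# Interval of one standard deviation either side of the mean.
lower, upper = mean - sd, mean + sd
print(round(variance, 2), round(sd, 2), round(lower, 2), round(upper, 2))
```

The hand-built variance can be cross-checked against `statistics.variance(spending)`, which uses the same n - 1 denominator.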

Inferential Statistics

Inferential statistics infer from the sample to the population.
They determine the probability of characteristics of the population based on the characteristics of your sample.
They help assess the strength of the relationship between your independent (causal) variables and your dependent (effect) variables.

A. t-tests

There are three types of t-test: the one-sample t-test, the independent-samples t-test, and the dependent (paired) samples t-test.

A.1. One-sample t-test
The one-sample t-test is used to compare the mean of a single sample with the population mean.
E.g. MoFED wants to know if the per capita income of Amhara Regional State is the same as the national average.
The objective is to decide whether to retain the null hypothesis
  $H_0: \mu = \mu_0$
or to reject the null hypothesis in favour of the alternative hypothesis
  $H_a: \mu \neq \mu_0$ (μ is significantly different from μ0).

Example on One-Sample t-test

A business school claims in its advertisement that the average salary of its graduates in a particular year is on a par with the average salary offered at the top five business schools (800,000 per annum). A sample of 20 graduates was selected; their salaries are shown below.

  Employee   Salary     Employee   Salary
   1         750.00     11         567.00
   2         600.00     12         677.00
   3         600.00     13         700.00
   4         650.00     14         690.00
   5         860.00     15         688.00
   6         960.00     16         569.00
   7         100.00     17         788.00
   8         880.00     18         654.00
   9         890.00     19         489.00
  10         960.00     20         487.00

Test the validity of the claim made by the business school.
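A minimal sketch of the one-sample t-test on these 20 salaries follows. It assumes the salaries are recorded in thousands, so that the claimed mean of 800,000 corresponds to 800, and it compares the statistic with the tabulated two-tailed 5% critical value of t for 19 degrees of freedom (2.093) instead of computing a p value.

```python
import math
import statistics

salaries = [750, 600, 600, 650, 860, 960, 100, 880, 890, 960,
            567, 677, 700, 690, 688, 569, 788, 654, 489, 487]
claimed_mean = 800    # assumption: salaries are in thousands, so 800,000 -> 800
t_critical = 2.093    # two-tailed 5% critical value of t with 19 df (t table)

n = len(salaries)
sample_mean = statistics.mean(salaries)
sample_sd = statistics.stdev(salaries)        # sample SD, n - 1 denominator
standard_error = sample_sd / math.sqrt(n)

t_stat = (sample_mean - claimed_mean) / standard_error
reject_h0 = abs(t_stat) > t_critical

print("mean:", sample_mean, "t:", round(t_stat, 3), "reject H0:", reject_h0)
```

Statistical packages report a p value directly; the critical-value comparison used here leads to the same decision at the 5% level.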

A.2. Independent (unpaired) samples t-test

Here we are interested in comparing two populations using a random sample from each.
E.g. if MoFED wants to know whether the per capita income of Oromia Regional State is the same as that of SNNP, taking Adama and Hawasa as random samples, the unpaired t-test should be used.
Or a market researcher may want to know in which of two territories his product will be able to penetrate more.

Example on Independent-Samples t-test

The National Bank of Ethiopia wants to know whether private banks outperform government-owned banks. Performance was measured by return on assets (ROA); the data for 10 privately owned and 12 state-owned banks are shown below (ownership: 1 = private, 2 = state owned).

  Bank   Ownership   ROA      Bank   Ownership   ROA
   1     1           .34      12     2           .22
   2     1           .46      13     2           .88
   3     1           .44      14     2           .48
   4     1           .68      15     2           .59
   5     1           .70      16     2           .75
   6     1           .32      17     2           .33
   7     1           .25      18     2           .28
   8     1           .33      19     2           .53
   9     1           .56      20     2           .33
  10     1           .66      21     2           .72
  11     2           .66      22     2           .77
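The comparison of the two groups of banks can be sketched with a pooled-variance (Student) t statistic. Two assumptions are made here: ownership code 1 is taken to mean private and code 2 state owned, and bank 11 is read as a state-owned bank with ROA .66, which is an interpretation of a damaged row. The function name pooled_t is hypothetical.

```python
import math
import statistics

# Assumed grouping: code 1 = private (10 banks), code 2 = state owned (12 banks).
private_roa = [0.34, 0.46, 0.44, 0.68, 0.70, 0.32, 0.25, 0.33, 0.56, 0.66]
state_roa = [0.66, 0.22, 0.88, 0.48, 0.59, 0.75, 0.33, 0.28, 0.53, 0.33, 0.72, 0.77]


def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) assuming equal population variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t, n_a + n_b - 2


t_stat, df = pooled_t(private_roa, state_roa)
print("t:", round(t_stat, 3), "df:", df)
```

The equal-variance form is a design choice for brevity; when the two variances differ noticeably, Welch's version of the test is the usual alternative.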

A.3. Dependent (paired) samples t-test

This is used in experiments where the observations are made on the same sample at two different times.
E.g. the HR manager wants to know if a particular training program had any impact in increasing the motivation level of the employees.
So, if D represents the difference between observations, the hypotheses are:
  $H_0: D = 0$ (the difference between the two observations is 0)
  $H_a: D \neq 0$ (the difference is not 0)
If the p value associated with t is low (< 0.05), there is evidence to reject the null hypothesis.

Example on Dependent-Samples t-test

The human resource department claims that the training it provided to its workers greatly enhanced their efficiency. Efficiency was measured by the number of products produced each day. The data were collected for some periods before and after the training was provided to the employees, and are shown below.

  Employee   Before   After     Employee   Before   After
   1         41       44        11         46       39
   2         35       36        12         42       40
   3         40       48        13         37       36
   4         50       47        14         34       39
   5         39       40        15         38       50
   6         45       52        16         42       46
   7         35       35        17         46       49
   8         36       51        18         39       42
   9         44       46        19         40       51
  10         40       55        20         45       37
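For the training data, the paired t statistic is simply a one-sample t-test on the differences D = after - before. The sketch below computes it. The critical value 2.093 (two-tailed, 5%, 19 degrees of freedom) is taken from a t table and used in place of a p value.

```python
import math
import statistics

before = [41, 35, 40, 50, 39, 45, 35, 36, 44, 40,
          46, 42, 37, 34, 38, 42, 46, 39, 40, 45]
after = [44, 36, 48, 47, 40, 52, 35, 51, 46, 55,
         39, 40, 36, 39, 50, 46, 49, 42, 51, 37]

# Difference for each employee (positive = more products after training).
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)
t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

t_critical = 2.093   # two-tailed 5% critical value of t with 19 df (t table)
reject_h0 = abs(t_stat) > t_critical
print("mean difference:", mean_diff, "t:", round(t_stat, 3), "reject H0:", reject_h0)
```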

Relationships between Variables

Measures of relationship help researchers to know the nature, direction, and significance of the relationship between two variables in the study.
Often in practical situations, researchers are interested in describing associations between variables. They try to ascertain how two variables are related to each other, that is, whether a change in one affects the other.
The measures of association depend on the nature of the data, and the association could be positive, negative, or neutral.

When defining a relationship, researchers can define one variable as a function of another variable. They can then assess whether a change in one variable results in a change in the other variable to ascertain the relationship.
It is important to point out that the relationship between two variables could be a simple association or it could be a causal relationship.

7.2.1.1. Relation between Two Nominal Variables: the χ² Test

Chi-square (χ²) is one of the most popular methods for testing hypotheses on discrete data (nominal or ordinal). Descriptive statistics such as the mean are meaningless for such data; the only useful summary statistics are frequencies and percentages.
There are three different types of chi-square test:
  1. Chi-square test for goodness of fit
  2. Chi-square test for homogeneity
  3. Chi-square test of independence

The chi-square test for goodness of fit determines whether the sample under investigation has been drawn from a population that follows some specific distribution.
The test for homogeneity investigates whether several populations are homogeneous with respect to a particular characteristic.
These two are not very common in business research. Hence, the chi-square test of independence is given due attention in this chapter.

Chi-square Test of Independence

This is used to test the hypothesis that two categorical variables are independent of each other.
The following are examples of business research where the chi-square test of independence can be applied:
1. You want to see whether the performance of a firm (categorized as loss, breakeven, and profit) is dependent on the location of the firm (low, middle, or high income country).

2. An organization's researcher wants to determine if the satisfaction level (on a scale of 1 to 3) of the employees of a firm is dependent on their placement (local or international) in the firm.
3. You want to assess whether motivation towards the job is dependent on the gender of the respondent.
4. Is viewing a television advertisement of a product (yes/no) related to buying that particular product (buy/not buy)?

The test compares the expected frequencies (based on probabilities) with the observed frequencies, and the χ² statistic is obtained by the formula:
  $\chi^2 = \sum \frac{(O - E)^2}{E}$
where O is the observed frequency and E is the expected frequency in each cell.
If the calculated chi-square value is greater than the tabulated (critical) chi-square value, the null hypothesis is rejected and we can conclude that there is a significant association between the two variables.

Example on Chi-square Test of Independence

A researcher wants to know whether the performance of firms is independent of their location. She developed a measure of performance on a scale from 1 to 3, with 1 representing loss, 2 breakeven, and 3 profit. The location of the firm was put into one of two categories, 1 representing low and middle income countries and 2 representing high income countries. The data on these two variables are shown below.

  Firm  Performance  Location    Firm  Performance  Location    Firm  Performance  Location
   1    1            3           11    3            3           21    2            3
   2    1            3           12    3            1           22    2            3
   3    1            3           13    2            1           23    3            1
   4    1            1           14    3            2           24    3            1
   5    2            1           15    2            3           25    3            1
   6    2            3           16    3            1           26    1            2
   7    2            2           17    1            3           27    2            3
   8    1            2           18    1            3           28    3            1
   9    1            2           19    1            3           29    1            2
  10    3            3           20    2            2           30    3            1
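To make the mechanics of the test concrete, the sketch below computes the χ² statistic for a small cross-tabulation of location (2 rows) by performance (3 columns). The counts are hypothetical and chosen only for illustration; they are not tabulated from the firm data. The critical value 5.991 is the tabulated 5% value of chi-square for 2 degrees of freedom.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical counts: rows = location (1, 2); columns = loss, breakeven, profit.
observed = [[8, 4, 3],
            [3, 4, 8]]
chi2, df = chi_square_independence(observed)

critical_value = 5.991          # 5% critical value of chi-square with 2 df
reject_h0 = chi2 > critical_value
print(round(chi2, 3), df, reject_h0)
```

With these illustrative counts the calculated value is below the critical value, so independence would not be rejected.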

7.2.1.2. Correlation Analysis

Correlation is one of the most widely used measures of association between two or more variables. Measures of correlation are employed to explore the presence or absence of a correlation between the variables.
The correlation coefficient describes:
  the direction of the correlation, that is, whether it is positive or negative, and
  the strength of the correlation, that is, whether an existing correlation is strong or weak.

Though there are various measures of correlation for nominal or ordinal data, the Pearson product-moment correlation coefficient is a measure of linear association between two interval or ratio variables.
The measure, represented by the letter r, varies from -1 to +1. A zero correlation indicates that there is no linear relationship between the variables.

Significance of Correlations

It is imperative to assess whether the identified relationship between variables is statistically significant, that is, whether the correlation actually exists in the population.
A significance test can easily be done by comparing the computer-generated p value with the predetermined significance level, which in most cases is .05. If the p value is less than 0.05, we can assume that the correlation is significant and that it reflects a true population characteristic.
If researchers want to control for the other variables when several variables are correlated, they can use "partial correlation", which controls for the other variables.

Example on Correlation Analysis

The researcher wants to know whether firm performance, measured by return on assets (ROA), is related to firm age, ownership, and branch size. The data for ten firms are shown below. Examine the relation between the variables using bivariate and partial correlation.

  Firm   Age   Ownership   Branch Size   ROA
   1      5    2           53            .46
   2      8    1           66            .33
   3      9    2           39            .25
   4     13    1           51            .55
   5     16    2           40            .53
   6     12    2           33            .63
   7      4    2           21            .72
   8     11    2           80            .71
   9     18    1           72            .31
  10     15    1           25            .45
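A sketch of both calculations for the ten firms is given below: the bivariate Pearson correlation between age and ROA, and the first-order partial correlation between age and ROA controlling for branch size. Choosing branch size as the control variable is an assumption made for illustration, and the function names are hypothetical.

```python
import math

age = [5, 8, 9, 13, 16, 12, 4, 11, 18, 15]
branch_size = [53, 66, 39, 51, 40, 33, 21, 80, 72, 25]
roa = [0.46, 0.33, 0.25, 0.55, 0.53, 0.63, 0.72, 0.71, 0.31, 0.45]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_age_roa = pearson_r(age, roa)
r_age_roa_given_size = partial_r(age, roa, branch_size)
print(round(r_age_roa, 3), round(r_age_roa_given_size, 3))
```

Comparing the two coefficients shows how much of the bivariate association remains once the control variable is held constant.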

7.2.1.3. Regression Analysis

Regression is one of the most frequently used techniques in social research. Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
The most common form of regression is linear regression, where the dependent variable is related to the independent variable in a linear way.

The linear regression equation takes the following form (bivariate regression):
  $Y = \beta_0 + \beta_1 X + \varepsilon$
Variables:
  X = independent variable (we provide this)
  Y = dependent variable (we observe this)
Parameters:
  β0 = Y-intercept
  β1 = slope
  ε = error term
Note: β1 indicates the change in the dependent variable for every unit change in the independent variable.

For example, the marketing manager wants to know if sales depend on factors such as advertising spend, number of products introduced, number of sales personnel, etc.

Regression coefficient
The regression coefficient is the measure of how strongly the predictor (the independent variable, IDV) predicts the dependent variable (DV).
There are two types of regression coefficients:
  Unstandardized coefficients
  Standardized coefficients (beta values)
The unstandardized coefficients can be used in the equation as coefficients of the different independent variables, along with the constant term, to predict the value of the dependent variable.
  o Difference in "Y" per unit change in "X"

The standardized coefficient (beta) is measured in standard deviations, i.e. the difference in "Y" in standard deviations per standard deviation difference in "X".
A beta value of 2 associated with a particular independent variable indicates that a change of 1 standard deviation in that independent variable will result in a change of 2 standard deviations in the dependent variable.
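The difference between the unstandardized and the standardized coefficient can be seen in a small bivariate fit. The sketch below uses ordinary least squares on five hypothetical (x, y) pairs, invented purely for illustration, and rescales the slope by the ratio of the standard deviations to obtain beta.

```python
import math

# Hypothetical data, e.g. x = advertising spend, y = sales (arbitrary units).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

slope = s_xy / s_xx                    # unstandardized coefficient (b1)
intercept = mean_y - slope * mean_x    # constant term (b0)
beta = slope * math.sqrt(s_xx / s_yy)  # standardized coefficient: slope * SD(x) / SD(y)
r_squared = s_xy ** 2 / (s_xx * s_yy)  # proportion of variance in y explained


def predict(new_x):
    """Predicted y for a given x, using the unstandardized coefficients."""
    return intercept + slope * new_x


print(slope, intercept, round(beta, 4), r_squared, predict(6))
```

In the bivariate case the standardized coefficient equals the Pearson correlation between x and y, which is why its square here matches R square.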

R Values

R represents the correlation between the observed values and the predicted values (based on the regression equation obtained) of the dependent variable. It is used to measure the fitness of the model used for the research.
R square is the square of R and gives the proportion of variance in the dependent variable accounted for by the set of independent variables chosen for the model.
However, the R square value tends to be influenced by the number of independent variables and by the number of cases. Therefore the adjusted R square, which takes these things into account, provides more accurate information about the fitness of the model.
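One common form of the adjustment is $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where n is the number of cases and k the number of independent variables. The sketch below applies it; the source does not state which formula it has in mind, and the input values are hypothetical.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R square for a model with n_cases observations and n_predictors IDVs."""
    residual_df = n_cases - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r_squared) * (n_cases - 1) / residual_df


# Hypothetical model: R square of 0.50 from 30 cases and 3 independent variables.
adjusted = adjusted_r_squared(0.50, n_cases=30, n_predictors=3)
print(round(adjusted, 4))
```

The penalty grows as predictors are added without a matching increase in cases, so the adjusted value does not exceed R square when at least one predictor is present.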

Assumption Tests of Linear Regression

1. Linearity
Firstly, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects.
The linearity assumption can best be tested with scatter plots.
[Figure: scatter plots of a linear and a non-linear relationship]

2. Normality
Linear regression analysis also requires all variables to be multivariate normal. This assumption can best be checked with a histogram and a fitted normal curve, or with a Q-Q plot.
Normality can also be checked with a goodness-of-fit test, e.g. the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p value is greater than 0.05, there is no reason to reject the null hypothesis, and we conclude that the data are consistent with a normal distribution.
When the data are not normally distributed, a non-linear transformation (for example a log transformation) may correct the problem, although such a transformation can introduce the effects of multicollinearity.

3. Multicollinearity
Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other.

How to test for the presence of multicollinearity?
Tolerance: the tolerance measures the influence of one independent variable on all other independent variables. It is calculated with an initial linear regression analysis and is defined as T = 1 - R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
Variance inflation factor (VIF = 1/T): measures how much the variance of the estimated regression coefficients is inflated (overstated) compared with when the predictor variables are not linearly related. With VIF > 10 there is an indication that multicollinearity is present; with VIF > 100 there is certainly multicollinearity in the sample.
Hence, tolerance should be > 0.1, or VIF should be < 10.
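The two diagnostics are linked by VIF = 1/T, so either can be derived from the R² of the first-step regression of one independent variable on the others. The sketch below assumes the cut-offs T < 0.1 (possible) and T < 0.01 (certain), equivalent to VIF > 10 and VIF > 100. The function names and example R² values are hypothetical.

```python
def tolerance(r_squared):
    """Tolerance T = 1 - R^2 from regressing one IDV on all the other IDVs."""
    return 1 - r_squared


def vif(r_squared):
    """Variance inflation factor, the reciprocal of the tolerance."""
    return 1 / tolerance(r_squared)


def multicollinearity_flag(r_squared):
    """Classify using the assumed cut-offs T < 0.01 (certain) and T < 0.1 (possible)."""
    t = tolerance(r_squared)
    if t < 0.01:
        return "certain"
    if t < 0.1:
        return "possible"
    return "none indicated"


# Hypothetical first-step R^2 values for three independent variables.
for name, r2 in [("age", 0.50), ("branch_size", 0.95), ("assets", 0.995)]:
    print(name, round(tolerance(r2), 3), round(vif(r2), 1), multicollinearity_flag(r2))
```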

4. Autocorrelation
Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other, in other words when the value of y(x+1) is not independent of the value of y(x).
This typically occurs, for instance, in stock prices, where the price is not independent of the previous price.

While a scatter plot allows you to check for autocorrelation, you can test the linear regression model for autocorrelation with the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated.
While d can assume values between 0 and 4, values around 2 indicate no autocorrelation problem. As a rule of thumb, values of 1.5 < d < 2.5 show that there is no autocorrelation problem in the data.
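The statistic itself is $d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, where $e_t$ are the regression residuals in their time order. A minimal sketch follows; the residuals are invented for illustration, and the 1.5 to 2.5 band is the rule of thumb quoted above.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Illustrative residuals from a fitted regression, in time order.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
d = durbin_watson(residuals)
no_problem = 1.5 < d < 2.5   # rule-of-thumb band for "no autocorrelation problem"
print(round(d, 3), no_problem)
```

Values of d near 0 point to positive autocorrelation and values near 4 to negative autocorrelation.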

5. Homoscedasticity
The last assumption that linear regression analysis makes is homoscedasticity, that is, the error variance around the predicted scores is the same for all predicted values.
A scatter plot is a good way to check whether homoscedasticity (equal spread of the error terms along the regression line) holds.

6. Model Fitness
Linear regression requires the model to fit the data. This can be tested using the F statistic or the R values.

Design Issues in Regression

Ratio of cases to independent variables (IDVs): as a rule of thumb, the sample size should be at least 50 + 8n for testing the multiple correlation and at least 104 + n for testing individual predictors, where n is the number of independent variables.
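The rule of thumb is easy to tabulate. The helper below is a hypothetical convenience function that returns both minimum sample sizes for a given number of independent variables.

```python
def minimum_cases(n_predictors):
    """Rule-of-thumb minimum sample sizes for a regression with n_predictors IDVs."""
    return {
        "multiple_correlation": 50 + 8 * n_predictors,   # testing the overall model
        "individual_predictors": 104 + n_predictors,     # testing each predictor
    }


for k in (1, 3, 5):
    print(k, minimum_cases(k))
```

If both kinds of test are planned, the larger of the two figures is the one to plan for.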

7.2.2. Multivariate Analysis

In many real-life situations it becomes necessary to analyse the relationship among three or more variables, which has led to the popularity of multivariate statistics. Multivariate statistical techniques look at the pattern of relationships between several variables simultaneously.
The following section deals with categories of multivariate analysis techniques.

7.2.2.1. Multiple Linear Regression
In simple regression there is one dependent variable and one independent variable, whereas in multiple regression there is one dependent variable and many independent variables.

It examines the relationship between a single metric dependent variable and two or more metric independent variables. Assumptions of normality and linearity should be checked before using multiple regression.
  $y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
where y is the dependent variable, x1, x2, ..., xk are the independent variables, a is the Y-intercept, and b1, b2, ..., bk are the regression coefficients.

Other Types of Multivariate Analysis
7.2.2.2. Multivariate Analysis of Variance (MANOVA)
MANOVA is similar to ANOVA, with the difference that analysis of variance (ANOVA) tests the mean differences of more than two groups on one dependent variable, whereas MANOVA tests mean differences among groups across several dependent variables simultaneously. In MANOVA, two or more dependent variables, and possibly two or more independent variables, are present at the same time.

7.2.2.3. Multivariate Analysis of Covariance (MANCOVA)
Multivariate analysis of covariance (MANCOVA) is similar to MANOVA; the only difference is the addition of interval-level independent variables as ‘covariates’. These covariates act as control variables, which reduce the error in the model and help ensure the best fit. MANCOVA analyses the mean differences among groups for a linear combination of dependent variables after adjusting for the covariate. E.g. testing the difference in output by age group after adjusting for educational qualification.

7.2.2.4. Logistic Regression / Logit Model
Logistic regression is similar to regression analysis from a practical point of view. Both methods produce a prediction equation, and in both the regression coefficients measure the predictive capacity of the independent variables. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous. That is, logistic regression is used when the dependent variable is non-metric.

E.g. Researchers might be interested in examining the relationship between chewing tobacco and throat cancer. The independent variable is the decision to chew tobacco (to chew tobacco or not to chew tobacco), and the dependent variable is whether the person has throat cancer. In this case-control design, researchers have two levels in the independent variable (to chew tobacco/not to chew tobacco) and two levels in the dependent variable (throat cancer/no throat cancer).
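For a design like the one above, with one two-level independent variable and a two-level dependent variable, the fitted logistic regression can be written down directly from a 2x2 table: the slope equals the log of the odds ratio and the intercept equals the log odds in the unexposed group. The sketch below shows this with invented counts; the numbers and function names are hypothetical and carry no empirical meaning.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical counts, for illustration only:
# chewers: 30 with cancer, 70 without; non-chewers: 10 with cancer, 90 without.
ratio = odds_ratio(30, 70, 10, 90)

# Logit model: log(odds of cancer) = intercept + slope * chews (chews is 0 or 1).
intercept = math.log(10 / 90)  # log odds among non-chewers
slope = math.log(ratio)        # log odds ratio


def predicted_probability(chews):
    """Predicted probability of the outcome from the logit model."""
    z = intercept + slope * chews
    return 1 / (1 + math.exp(-z))


print(round(ratio, 3))
print(round(predicted_probability(0), 3), round(predicted_probability(1), 3))
```

The predicted probabilities reproduce the observed proportions in each group (10 of 100 and 30 of 100), which is what a logit model with a single binary predictor does.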

7.2.2.5. Conjoint Analysis
Conjoint analysis is a statistical technique used in marketing research to determine how people value the different attributes (features, functions, benefits) that make up an individual product or service. It is used to determine which combination of a limited number of attributes is most influential on respondent choice or decision making. E.g. In marketing, conjoint analysis is used to understand how consumers develop preferences for products or services based on features such as price, quality or delivery speed. It is carried out with some form of multiple regression analysis.

A.4. Comparing means of more than two populations
Analysis of Variance (ANOVA) is the typical method used to compare the means of more than two populations. If the variances computed (between groups and within groups) do not differ significantly, one can believe that all the group means come from the same sampling distribution of means and there is no reason to claim that the group means differ. The F statistic is used to reach this decision.

E.g. A marketing manager wants to investigate the impact of different discount schemes on the sales of three major brands of edible oil. The board of directors of a bank wants to know if the effectiveness of its corporate governance practice varies depending on the gender diversity, independence and industry experience of the board. An instructor wants to investigate if the performance of students is affected by their age, gender, and background.
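The comparison of group means above rests on the F statistic, the ratio of the between-group variance to the within-group variance. A minimal sketch of a one-way ANOVA F computation follows; the function name and the three groups of values are hypothetical and stand in for, say, sales under three discount schemes.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: mean square between groups
    divided by mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared deviation of
    # each group mean from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical observations for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 3))
```

The resulting F value is then compared with the critical value of the F distribution for the corresponding degrees of freedom (here 2 and 6) to decide whether the group means differ.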

5.2.3.3.6. Errors in Hypothesis Testing
Type I and Type II Errors
While testing a hypothesis, if we reject it when it should be accepted, it amounts to a type I error. On the other hand, accepting a hypothesis when it should be rejected amounts to a type II error. For a fixed sample size, an attempt to reduce one type of error will increase the other. Therefore, the sample size should be increased to reduce both errors.

Significance Level (P-value)
It is the criterion used to accept or reject the null hypothesis. The significance level is the probability of concluding (incorrectly) that there is a difference in your samples when no true difference exists. The p-value is the statistic calculated by comparing the distribution of the given sample data with an expected distribution (normal, F, t etc). E.g. With a significance level of .05, there is only a 5% chance of wrongly concluding that the populations are different when no true difference exists; the null hypothesis is rejected when the p-value falls below .05.

Establish Critical or Rejection Region
The risk/probability of rejecting H0 when it is true
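As one way to picture a rejection region, the sketch below finds the two-tailed critical value for a chosen risk of rejecting H0 when it is true, and checks whether a test statistic falls in the region. It assumes a z test based on the standard normal distribution, which the slide does not specify; the function names and the example statistics are hypothetical.

```python
from statistics import NormalDist


def two_tailed_critical_z(alpha):
    """Critical z value such that the two tails beyond +/- z together hold
    probability alpha under the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def in_rejection_region(z_statistic, alpha=0.05):
    """True when the test statistic lies beyond the critical value in either tail."""
    return abs(z_statistic) > two_tailed_critical_z(alpha)


critical = two_tailed_critical_z(0.05)
print(round(critical, 2))          # about 1.96
print(in_rejection_region(2.3))    # beyond the critical value: reject H0
print(in_rejection_region(1.2))    # inside the acceptance region: do not reject H0
```

A smaller alpha pushes the critical value outward and shrinks the rejection region, lowering the risk of a type I error.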

End of Chapter 7