Data Types and Distributions Tonya Esterhuizen Biostatistics Unit, Centre for Evidence Based Health Care
Introduction to Statistics “There are three kinds of lies: lies, damn lies and statistics” Benjamin Disraeli “Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” Aaron Levenstein
Statistics - What Does It Mean ? Data Numerical observations Quantitative information
Statistics - What Does It Mean ? Number of doctors in the different provinces in South Africa Birth weight of babies born at a given hospital in a given year
Statistics - What Does It Mean ? The number of diabetics in Cape Town who have had an amputation The prevalence rate of HIV per 1000 population in Western Cape The creatine concentration in mg per litre in a 24 h urine sample
Statistics - What Does It Mean ? Discipline or science of managing uncertainties in decision processes
Statistics - What Does It Mean ? using the scientific methods of collecting, processing, reducing, presenting, analysing and interpreting data
Statistics - What Does It Mean ? and of making inferences and drawing conclusions from numerical data
Main Uses of Statistical Methods Collection of data in the best possible way Description of the characteristics of a group or situation Analysing data and drawing conclusions
Collection of data in the best way Using a suitable and appropriate method for selecting subjects for a study, to minimise the role of uncertainty Designing valid data collection instruments such as questionnaires and schedules
Collection of data in the best way Organising data collection procedures for clinical, laboratory and epidemiological research to minimise the chances of errors, e.g. by standardising definitions and equipment and training data gatherers
Description of the characteristics of a group or situation Data presentation in tables or graphs Calculating summary measures such as averages, which can adequately represent the structure of the data set
Analysing data and drawing conclusions This involves analytical techniques and the use of probability concepts in drawing conclusions
Uses of statistical concepts and methods in health science Handling of variation Diagnosis of patients' ailments and communities' health problems Prediction of likely outcomes of an intervention programme in a community or of treatment of individual patients
Uses of statistical concepts and methods in health science Selection of appropriate intervention for a patient or a community Public health, health administration and planning Planning, conducting, analysing, interpreting and reporting of medical research
Handling of variation Variation in a characteristic occurs when its value changes from subject to subject. Within the same subject it can also change from time to time, from instrument to instrument, or from observer to observer
Handling of variation Requires appropriate methods to: summarise a characteristic for a group of patients or for a community; decide on the normal or average value of a characteristic; compare two groups of patients with respect to a particular characteristic
Diagnosis of patients' ailments Explicit statistical methods are available for ordering disease categories according to their probabilities of being the correct diagnosis Changing medicine from an art to a science?
Prediction of likely outcome of treatment Prognosis - An outcome is predicted when the chances of its occurrence are high and the associated uncertainty is low Achieved by keeping records of the characteristics prior to treatment, the treatment and its outcome and analysing them
Selection of appropriate intervention This is based on: experience gained with similar patients who received the intervention; reports of clinical trials or experiments on the efficacy of different drugs or treatments (Rx); objective assessment of previous experience
Public Health, health planning Use of data relating to health and illness in the population to make a community diagnosis
Public Health, health planning Requires knowledge of: population characteristics (age, sex); health profile of the population in terms of disease risk factors; factors affecting population dynamics; data on births, deaths, migration
Descriptive Statistics Get a feel for the data Assess the quality of the data Types of variables Summary statistics Distribution Graphical representation
Types of Variables Quantitative Continuous : temperature, height, weight Discrete : number of headaches/week Categorical Ordinal : severity of pain Nominal : sex, blood group Binary : yes or no
Types of Variables The type of variable influences the type of analysis that is possible with the data. It is therefore important to be able to define your variable types so that the most appropriate statistical tests are chosen.
Types of Data Distributions Two of the most common in medical statistics: Normal (Z) distribution (continuous data) Binomial distribution (binary categorical data)
Normal and Skewed Distributions Symmetrical mean, mode, median unimodal
Normal and Skewed Distributions Positively skewed (tail to the right) Negatively skewed (tail to the left)
Normal and Skewed Distributions Bimodal
Normal and Skewed Distributions Mode Median Mean More on these summary measures in next session
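As a rough illustration of how the mode, median and mean separate in a positively skewed distribution, the sketch below uses a small made-up sample (the values are hypothetical and not taken from these slides) and Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical, made-up sample with a long right-hand tail (positively skewed).
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

sample_mode = mode(sample)      # most frequent value
sample_median = median(sample)  # middle of the ordered values
sample_mean = mean(sample)      # pulled towards the long tail

print("mode:", sample_mode, "median:", sample_median, "mean:", sample_mean)
```

In this sample the mode is the smallest of the three and the mean the largest, which is the ordering usually seen with a right-hand tail. It is a tendency rather than a rule that holds for every data set.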
Binomial Distribution: Percentages and proportions In a survey of attitudes to statistics, 6 out of 100 people say they enjoy the subject. The percentage enjoying statistics is (6/100) x 100 = 6%. The proportion enjoying statistics is 6/100 = 0.06. In this session we will sometimes use percentages and sometimes proportions.
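The survey arithmetic can be reproduced in a few lines. This is only a sketch; the variable names are chosen here for illustration.

```python
# Figures from the survey example: 6 of 100 respondents enjoy statistics.
enjoy = 6
surveyed = 100

proportion = enjoy / surveyed   # a value between 0 and 1
percentage = proportion * 100   # the same quantity on a 0 to 100 scale

print(f"proportion = {proportion:.2f}, percentage = {percentage:.0f}%")
```

The proportion and the percentage carry the same information, so converting between them is just a multiplication or division by 100.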
Binomial distribution The number of subjects in a sample with a given characteristic follows the binomial distribution. The shape of this distribution varies with the population proportion p and the size of the sample. With small samples, the distribution is symmetrical only if p is 0.5.
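To see how the shape depends on p, the following sketch computes binomial probabilities from the standard formula and checks for symmetry. The sample size n = 10, the two values of p, and the helper names `binomial_pmf` and `is_symmetric` are assumptions made for illustration only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k subjects with the characteristic out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def is_symmetric(n, p, tol=1e-12):
    """True if P(X = k) equals P(X = n - k) for every k from 0 to n."""
    return all(
        abs(binomial_pmf(k, n, p) - binomial_pmf(n - k, n, p)) < tol
        for k in range(n + 1)
    )


n = 10  # hypothetical small sample
for p in (0.5, 0.1):
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    print(f"p = {p}: symmetric = {is_symmetric(n, p)}")
    print([round(x, 3) for x in probs])
```

With p = 0.5 the probabilities mirror each other around n/2. With p = 0.1 most of the probability sits at low counts and the distribution has a tail to the right.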
Normal approximation to the Binomial As the sample size (n) becomes larger, the shape of the distribution becomes roughly Normal, whatever the value of p. A rule of thumb is that you can use the Normal approximation if both np and n(1-p) are greater than 5, e.g. if n = 20 and p = 0.3, then np = 6 and n(1-p) = 14. Since both exceed 5, we can use the Normal approximation.
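The rule of thumb is easy to express as a small check. The function name `normal_approx_ok` and the default `threshold` parameter are hypothetical, introduced here only to illustrate the rule stated on the slide.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: both np and n(1-p) must be greater than the threshold."""
    expected_with = n * p            # expected number with the characteristic
    expected_without = n * (1 - p)   # expected number without it
    return expected_with > threshold and expected_without > threshold


# Slide example: n = 20, p = 0.3 gives np = 6 and n(1-p) = 14.
print(normal_approx_ok(20, 0.3))

# A contrasting, made-up case: n = 20, p = 0.1 gives np = 2.
print(normal_approx_ok(20, 0.1))
```

The first call satisfies the rule and the second does not, because np is only 2.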
Normal approximation to the Binomial The Normal approximation can be used for both confidence intervals and hypothesis tests (covered in session 3)
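As a preview only, here is one common way the Normal approximation is used to put a confidence interval around a proportion. The slides do not specify a formula, so the simple form p ± z × sqrt(p(1-p)/n), the 95% level, and the function name `normal_approx_ci` are all assumptions made for this sketch. It uses the earlier survey figures of 6 out of 100.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_ci(successes, n, level=0.95):
    """Approximate confidence interval for a proportion (assumed simple form)."""
    p_hat = successes / n
    # Standard Normal quantile for a two-sided interval (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


low, high = normal_approx_ci(6, 100)
print(f"95% CI for the proportion: {low:.3f} to {high:.3f}")
```

Here np = 6 and n(1-p) = 94, so the rule of thumb is met, though only just on the np side. With proportions this close to 0, the approximation should be treated with some caution.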