Introduction to Statistics
Virgílio Gavicho Uarrota
(Agronomist, M.Sc., Ph.D.)
https://www.r-project.org/
"Statistics" is an inclusive term
describing the...
Collection
Organization
Analysis
Interpretation
of DATA to answer questions and make
decisions
Definition
Statistics and data mining
•Statistics
The science of data analysis;
Collects, organizes, and analyzes data, and draws
conclusions from them
•Data mining
Extracts information that is not immediately
apparent from the data;
Builds statistical models
Observations
•The raw material that researchers
work with;
•Numbers;
•Data;
•Common characteristic = "Variability"
•Variables
Measurement scales/data types
Nominal Scale
•Classified by quality (attribute) rather than numerical scale
•Male, Female
•Eye color: brown, green, or blue
•Genetic phenotype
Ordinal Scale
•Have relative differences
•Consist of ordering or ranking the differences
•Size of 5 cell types might be 1,2,3,4,5 to denote their magnitudes
relative to each other
•Likert scale rating one's overall health on a scale of 1-5
•Exact measurements are not provided
Interval Scale
•Equal intervals between units
•Arbitrary zero point; i.e. negative numbers are allowed
»Degrees Fahrenheit; sea level
Ratio Scale
•Interval scale with absolute zero (no negative numbers); i.e. zero indicates
complete absence of the characteristic
»Height; weight; age
Discrete variables
•Variables that can take on only certain values
•1, 2, 3 - integers only
Continuous variables
•1.2, 2.4, 7.888 - can take on any real value
•Nominal data are discrete
•Ordinal, interval, and ratio data can be discrete or continuous
•Scale of measurement determines descriptive statistics and other analysis
options
•Nominal: frequency (%) only... mode
•Ordinal: frequency (%)... median and range
•Interval: addition and subtraction... mean and variance
•Ratio: all mathematical and statistical procedures are allowed
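As a rough illustration of how the scale drives the choice of summary, the sketch below uses a hypothetical helper, `describe`, that is not part of the original material; the example values are made up. It assumes ordinal categories are coded as integers.

```python
from collections import Counter
from statistics import mean, median_low, multimode, variance


def describe(values, scale):
    """Return summaries suited to a measurement scale (hypothetical helper)."""
    if scale == "nominal":
        # Only counting is meaningful: frequencies and the mode.
        return {"frequencies": dict(Counter(values)), "mode": multimode(values)}
    if scale == "ordinal":
        # median_low always returns an observed category, never an average of two ranks.
        return {"median": median_low(values), "range": (min(values), max(values))}
    if scale in ("interval", "ratio"):
        # Differences are meaningful, so mean and (sample) variance apply.
        return {"mean": mean(values), "variance": variance(values)}
    raise ValueError(f"unknown scale: {scale}")


eye_colors = describe(["brown", "blue", "brown", "green"], "nominal")
health = describe([1, 3, 3, 4, 5, 2], "ordinal")
temps_f = describe([68.0, 70.0, 72.0], "interval")
print(eye_colors, health, temps_f, sep="\n")
```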
Populations and samples
•Population or universe = all possible values of
a variable
•Sample = part of the population
Must be representative in order to draw sound
inferences
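One common way to aim for a representative sample is simple random sampling. The sketch below assumes a made-up population of 1,000 plant heights (the numbers and the seed are illustrative only) and draws a sample of 50 without replacement.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 1,000 plants.
population = [rng.gauss(150, 20) for _ in range(1000)]

# Simple random sample without replacement: each plant is equally likely to be chosen.
sample = rng.sample(population, 50)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(round(pop_mean, 1), round(sample_mean, 1))
```

The sample mean will usually land close to, but not exactly on, the population mean; that gap is what inference has to account for.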
Presentation, summarization, and characterization of
data
-Tables, graphs, histograms, diagrams, etc.
Measures of central tendency and dispersion
Measures of central tendency
-Mean, median, mode.
Measures of dispersion
-Variance
-Standard deviation
-Standard error of the mean
-Coefficient of variation
-Quartiles, deciles, percentiles
Measures of Central Tendency
•Mean
•Median
•Mode
•No single measure is the "best" summary for
all purposes
Disadvantages of the mean
Cannot be used for nominal, ordinal, or qualitative data
What is the mean gender for a group of 16 females and 4 males?
It can be affected strongly by outliers
Example: in Whoville, there are 10 people who make $10,000 a year and one
person who earns $1,000,000
The mean = $100,000, which does not reflect the salary of anyone in the sample
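The Whoville arithmetic can be checked with a short sketch using Python's standard `statistics` module (the variable names are illustrative):

```python
from statistics import mean

# Ten people earning $10,000 and one person earning $1,000,000.
incomes = [10_000] * 10 + [1_000_000]

mean_income = mean(incomes)  # (10 * 10,000 + 1,000,000) / 11
print(mean_income)           # 100000

# Nobody in the sample actually earns the mean income.
print(mean_income in incomes)  # False
```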
Median
Identifies midpoint of data when they are sorted from highest to lowest
•If there is an even number of observations, take the average of the two middle values
Advantages of Median
•Works on ordered data as well as quantitative data
•Is largely unaffected by outliers
•In prior income example wherein mean = $100,000, the median =
$10,000, which better reflects the typical salary from that sample
•Like the mean, the median is unique: a sample has only one
Disadvantage of Median
•Not affected by changes in data away from center
•If the highest income in Whoville drops to $25,000, median income
remains the same
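A minimal sketch of both points, reusing the Whoville incomes (the second list, with the top income lowered to $25,000, is the scenario described above; the even-length list is an added illustration):

```python
from statistics import median

incomes = [10_000] * 10 + [1_000_000]
lowered = [10_000] * 10 + [25_000]   # highest income drops to $25,000

median_income = median(incomes)      # middle (6th) of 11 sorted values
median_lowered = median(lowered)
print(median_income, median_lowered)  # 10000 10000

# With an even number of observations, the two middle values are averaged.
even_median = median([1, 2, 3, 4])
print(even_median)  # 2.5
```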
Mode
•The most common value
Advantage
•Works with categorical data
Disadvantages
•May not be unique; e.g. bi-modal or multi-modal
•May not lie near the center
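Both disadvantages can be seen in a small sketch with made-up data, using `statistics.multimode`, which returns every most-common value:

```python
from statistics import median, multimode

# Categorical data: the mode works, but here it is not unique (bi-modal).
eye_colors = ["brown", "green", "brown", "blue", "green"]
color_modes = multimode(eye_colors)
print(color_modes)  # ['brown', 'green']

# Numeric data: the mode can sit far from the center.
values = [1, 1, 1, 5, 6, 7, 8, 9, 10]
value_modes = multimode(values)
print(value_modes, median(values))  # [1] 6
```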
Dispersion measures
•Variation in values
•When the dispersion is large, the values are widely scattered; when it is
small they are tightly clustered
•The width of diagrams such as scatterplots, box plots, stem and leaf plots is
greater for samples with more dispersion and vice versa
•There are several measures of dispersion
•Variance is a fundamental measure of dispersion underlying many statistical
procedures
•Standard deviation is likely the most common measure of dispersion used as
a descriptive statistic
•These measures indicate to what degree the individual observations of a
data set are dispersed or 'spread out' around their mean
•In manufacturing or measurement, high precision is associated with low
dispersion
•Common measures include:
•Range
•Quantiles and interquartile range
•Variance and standard deviation
•Standard error (of the mean)
•Coefficient of variation
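To make "tightly clustered" versus "widely scattered" concrete, the sketch below compares two made-up samples with the same mean. It assumes the sample (n - 1) definitions of variance and standard deviation, which is what `statistics.variance` and `statistics.stdev` compute; the source does not say which definition is intended.

```python
import math
from statistics import mean, stdev, variance

tight = [9, 10, 10, 11]   # values close to the mean
wide = [2, 8, 12, 18]     # same mean, much more spread

print(mean(tight), mean(wide))  # 10 10

var_tight, var_wide = variance(tight), variance(wide)
sd_tight, sd_wide = stdev(tight), stdev(wide)
print(round(var_tight, 2), round(var_wide, 2))
print(round(sd_tight, 2), round(sd_wide, 2))

# The standard deviation is the square root of the variance.
assert math.isclose(sd_wide, math.sqrt(var_wide))
```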
Range
•The range of a sample (or a data set) describes the
difference between the largest and the smallest
observed value for a given variable
•A lot of information is ignored when computing the range
since only the largest and the smallest values are
considered
•The range is greatly influenced by the presence of just
one unusually large or small value in the sample (outlier)
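A short sketch with made-up values shows how a single outlier dominates the range:

```python
values = [12, 15, 14, 13, 16]
data_range = max(values) - min(values)
print(data_range)  # 4

# Adding one unusually large observation changes the range drastically.
with_outlier = values + [60]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)  # 48
```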
Quantiles
•Quartiles: the 25th, 50th, and 75th percentiles of the data
•50th = median
•25th = lower quartile
•75th = upper quartile
•Order the data, then select
•Useful for defining groups based on rankings
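As a sketch, the quartile cut points can be computed with `statistics.quantiles`. The data are made up, and the function's default "exclusive" method is assumed; other software may use slightly different interpolation rules and return slightly different values.

```python
from statistics import median, quantiles

data = [7, 1, 11, 4, 9, 2, 6, 10, 3, 8, 5]  # the integers 1..11, unordered

# quantiles() sorts the data internally; n=4 requests the three quartile cut points.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)  # 3.0 6.0 9.0

# The 50th percentile is the median.
print(q2 == median(data))  # True
```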
Interquartile Range (IQR)
The difference between the 25th and 75th
percentiles (1st and 3rd quartiles)
•Contains the central 50% of observations
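The sketch below computes the IQR for made-up data and contrasts it with the range when one value becomes extreme. It assumes the default method of `statistics.quantiles`; other quartile conventions can give slightly different numbers.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


data = list(range(1, 12))                 # 1, 2, ..., 11
with_outlier = list(range(1, 11)) + [1000]  # largest value replaced by 1000

iqr_plain = iqr(data)
iqr_outlier = iqr(with_outlier)
print(iqr_plain, iqr_outlier)  # 6.0 6.0

# The range, by contrast, reacts strongly to the extreme value.
print(max(data) - min(data), max(with_outlier) - min(with_outlier))  # 10 999
```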
Standard Error of the Mean
•A measure of how much the means vary from
sample to sample when taken from the same
distribution
•It estimates the standard deviation of the
sampling distribution of the mean
•Quantifies how precisely you can estimate the
true population mean
•SEM = SD/√n
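A minimal sketch of the formula on a made-up sample, assuming SD is the sample standard deviation (n - 1 denominator) as returned by `statistics.stdev`:

```python
import math
from statistics import stdev

sample = [4, 8, 6, 5, 7]

sd = stdev(sample)                   # sample standard deviation
sem = sd / math.sqrt(len(sample))    # SEM = SD / sqrt(n)
print(round(sd, 4), round(sem, 4))   # 1.5811 0.7071
```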
SD vs SEM
•The SD describes the variability between
individuals in a sample
•The SEM describes the uncertainty in how
well the sample mean represents the
population mean
•SEM is smaller than SD whenever n > 1
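The distinction can be illustrated with simulated data: as the sample grows, the SD settles near the spread of individuals, while the SEM keeps shrinking. The population parameters (mean 50, SD 10), the seed, and the helper name `sd_and_sem` are assumptions made for this sketch.

```python
import math
import random
from statistics import stdev

rng = random.Random(42)  # fixed seed for reproducibility


def sd_and_sem(n):
    """Draw n values from a hypothetical normal population; return (SD, SEM)."""
    sample = [rng.gauss(50, 10) for _ in range(n)]
    sd = stdev(sample)
    return sd, sd / math.sqrt(n)


sd_small, sem_small = sd_and_sem(10)
sd_large, sem_large = sd_and_sem(1000)

# SD stays on the scale of individual variability; SEM falls roughly as 1/sqrt(n).
print(round(sd_small, 2), round(sem_small, 2))
print(round(sd_large, 2), round(sem_large, 2))
```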
Selecting dispersion measures
•Use standard deviation when the mean is reported
•Percentiles and IQR are used when the median is reported,
or when the mean is reported but the objective is to compare
individual observations with a set of norms
•IQR is used to describe the central 50% of a distribution
•Range is used when the purpose is to highlight extreme
values
•CV is used when the intent is to compare distributions
measured on different scales
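As a sketch of the last point, the coefficient of variation (SD divided by the mean) lets made-up heights in centimeters be compared with made-up weights in kilograms. The helper name `coefficient_of_variation` is illustrative, and the measure is only meaningful for ratio-scale data with a positive mean.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = SD / mean (unitless); assumes ratio-scale data with a positive mean."""
    return stdev(values) / mean(values)


heights_cm = [160, 170, 180]
weights_kg = [60, 70, 80]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Both samples have SD = 10 in their own units, yet relative spread differs.
print(round(cv_height, 3), round(cv_weight, 3))  # 0.059 0.143
```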