Descriptive statistics: Introduction to data mining

VirgilioGavichoUarro 16 views 27 slides Sep 06, 2024


Introduction to Statistics
Virgílio Gavicho Uarrota
(Agronomist, MSc, PhD)
https://www.r-project.org/

"Statistics" is an inclusive term
describing the...
Collection
Organization
Analysis
Interpretation
of DATA to answer questions and make
decisions
Definition

Statistics and data mining
•Statistics
»The science of data analysis;
»Collects, organizes, analyzes, and draws conclusions from the data
•Data mining
»Extracting information that is not directly apparent in the data;
»Building statistical models

Observations
•The raw material that researchers work with;
•Numbers;
•Data;
•Common characteristic = "Variability"
•Variables

Measurement scales/data types
Nominal Scale
•Classified by quality (attribute) rather than numerical scale
•Male, Female
•Eye color: brown, green, or blue
•Genetic phenotype
Ordinal Scale
•Have relative differences
•Consist of ordering or ranking the differences
•Size of 5 cell types might be 1,2,3,4,5 to denote their magnitudes
relative to each other
•Likert scale rating one's overall health on a scale of 1-5
•Exact measurements are not provided
Interval Scale
•Equal intervals between units
•Arbitrary zero point; i.e. negative numbers are allowed
»Degrees Fahrenheit; sea level

Measurement scales/data types
Ratio Scale
•Interval scale with absolute zero (no negative numbers); i.e. zero indicates
complete absence of the characteristic
»Height; weight; age
Discrete variables
•Variable that can take on only certain values
•1, 2, 3 - integers only
Continuous variables
•1.2, 2.4, 7.888 - can take on any real value
•Nominal data are discrete
•Ordinal, interval, and ratio data can be discrete or continuous
•Scale of measurement determines descriptive statistics and other analysis
options
•Nominal: frequency (%) only... mode
•Ordinal: frequency (%)... median and range
•Interval: addition and subtraction... mean and variance
•Ratio: all mathematical and statistical procedures are allowed
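
As a rough illustration of how the scale of measurement limits the available summaries, the sketch below uses Python's standard `statistics` module on small made-up data sets. The variable names and values are hypothetical and chosen only for this example.

```python
from collections import Counter
from statistics import mean, median_low, mode

# Hypothetical data, one variable per measurement scale.
eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
health_rating = [2, 4, 3, 5, 4, 4, 1]                     # ordinal (1-5 Likert)
height_cm = [160.0, 172.5, 168.0, 181.5]                  # ratio

# Nominal: only frequencies (%) and the mode are meaningful.
color_counts = Counter(eye_color)
color_pct = {c: 100 * k / len(eye_color) for c, k in color_counts.items()}
color_mode = mode(eye_color)

# Ordinal: frequencies plus median and range (no arithmetic on the codes).
rating_median = median_low(health_rating)
rating_range = (min(health_rating), max(health_rating))

# Ratio: all arithmetic is allowed, so the mean is meaningful.
height_mean = mean(height_cm)

print(color_pct, color_mode)
print(rating_median, rating_range)
print(height_mean)
```

`median_low` is used for the ordinal data because it always returns one of the observed categories instead of averaging two of them.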

Populations and samples
•Population or universe = all possible values of a variable
•Sample = part of the population
»Must be representative in order to draw sound inferences
Presentation, summarization, and characterization of data
-Tables, graphs, histograms, diagrams, etc.

Measures of central tendency and dispersion
Measures of central tendency
-Mean, median, mode.
Measures of dispersion
-Variance
-Standard deviation
-Standard error of the mean
-Coefficient of variation
-Quartile, decile, percentile

Measures of Central Tendency
•Mean
•Median
•Mode
•No single measure is the "best" summary for
all purposes

Disadvantages of the mean
Cannot be used for nominal, ordinal, or qualitative data
What is the mean gender for a group of 16 females and 4 males?
It can be affected strongly by outliers
Example: in Whoville, there are 10 people who make $10,000 a year and one
person who earns $1,000,000
The mean = $100,000, which does not reflect the salary of anyone in the sample
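
The Whoville arithmetic can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the list name `incomes` is introduced here only for illustration.

```python
from statistics import mean

# Ten people earning $10,000 and one person earning $1,000,000.
incomes = [10_000] * 10 + [1_000_000]

mean_with_outlier = mean(incomes)          # (10 * 10,000 + 1,000,000) / 11
mean_without_outlier = mean(incomes[:-1])  # drop the single high earner

print(mean_with_outlier)     # 100000
print(mean_without_outlier)  # 10000
```

One extreme value moves the mean by a factor of ten, even though ten of the eleven salaries are identical.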

Median
Identifies the midpoint of the data when they are sorted from highest to lowest
•If there is an even number of observations, take the average of the two middle values
Advantages of Median
•Works on ordered data as well as quantitative data
•Is largely unaffected by outliers
•In the prior income example, where the mean = $100,000, the median =
$10,000, which better reflects the typical salary from that sample
•Like the mean, the median is unique: a sample has only one

Disadvantage of Median
•Not affected by changes in data away from center

•If the highest income in Whoville drops to $25,000, median income
remains the same
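
A short sketch of the median on the Whoville incomes, using Python's standard `statistics` module. It shows both the advantage (robustness to the outlier) and the disadvantage (no reaction to changes away from the center). The variable names and the small even-length list are assumptions made for this example.

```python
from statistics import median

# Original Whoville incomes: ten at $10,000 and one at $1,000,000.
incomes = [10_000] * 10 + [1_000_000]
# Same town after the highest income drops to $25,000.
incomes_after_drop = [10_000] * 10 + [25_000]

median_before = median(incomes)
median_after = median(incomes_after_drop)

# With an even number of observations the two middle values are averaged.
even_sample_median = median([1, 3, 5, 7])  # (3 + 5) / 2

print(median_before, median_after)  # 10000 10000
print(even_sample_median)           # 4.0
```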

Mode
•The most common value
Advantage
•Works with categorical data
Disadvantages
•May not be unique; e.g. bi-modal or multi-modal
•May not lie near the center
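
The points about the mode can be sketched with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The data below are invented for illustration.

```python
from statistics import median, mode, multimode

# Works with categorical data.
blood_types = ["A", "O", "O", "B", "A", "O"]
most_common_type = mode(blood_types)

# May not be unique: this sample is bi-modal.
bimodal_sample = [1, 1, 2, 5, 5, 9]
all_modes = multimode(bimodal_sample)

# May not lie near the center: the mode sits at the minimum here.
skewed_sample = [0, 0, 0, 7, 8, 9, 10]
skewed_mode = mode(skewed_sample)
skewed_median = median(skewed_sample)

print(most_common_type)            # O
print(all_modes)                   # [1, 5]
print(skewed_mode, skewed_median)  # 0 7
```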

Dispersion measures
•Variation in values
•When the dispersion is large, the values are widely scattered; when it is
small they are tightly clustered
•The width of diagrams such as scatterplots, box plots, stem and leaf plots is
greater for samples with more dispersion and vice versa
•There are several measures of dispersion
•Variance is a fundamental measure of dispersion underlying many statistical
procedures
•Standard deviation is likely the most common measure of dispersion used as
a descriptive statistic
•These measures indicate to what degree the individual observations of a
data set are dispersed or 'spread out' around their mean
•In manufacturing or measurement, high precision is associated with low
dispersion
•Common measures include:
•Range
•Quantiles and interquartile range
•Variance and standard deviation
•Standard error (of the mean)
•Coefficient of variation
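
To make the idea of "spread around the mean" concrete, the sketch below compares two made-up samples that share the same mean but differ in dispersion. It uses the sample (n - 1) variance and standard deviation from Python's standard `statistics` module; the sample names are hypothetical.

```python
from statistics import mean, stdev, variance

# Two hypothetical samples with the same mean (10) but different spread.
tight = [9, 10, 10, 11]
wide = [2, 6, 14, 18]

tight_var = variance(tight)  # sum of squared deviations / (n - 1) = 2 / 3
wide_var = variance(wide)    # 160 / 3

tight_sd = stdev(tight)      # square root of the variance
wide_sd = stdev(wide)

print(mean(tight), mean(wide))
print(tight_var, wide_var)
print(tight_sd, wide_sd)
```

The means are identical, so only the dispersion measures reveal that the second sample is far more scattered.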

Range
•The range of a sample (or a data set) describes the
difference between the largest and the smallest
observed value for a given variable
•A lot of information is ignored when computing the range
since only the largest and the smallest values are
considered
•The range is greatly influenced by the presence of just
one unusually large or small value in the sample (outlier)
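
A minimal sketch of the range and its sensitivity to a single outlier, in plain Python. The values are invented for the example.

```python
# Hypothetical measurements.
values = [12, 15, 14, 13, 16]
value_range = max(values) - min(values)

# Adding one unusually large observation changes the range drastically.
values_with_outlier = values + [90]
range_with_outlier = max(values_with_outlier) - min(values_with_outlier)

print(value_range)         # 4
print(range_with_outlier)  # 78
```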

Quantiles
•25th, 50th, and 75th percentiles of data (the quartiles)
•50th = median
•25th = lower quartile
•75th = upper quartile
•Order the data, then select the values at those positions
•Useful for defining groups based on rankings

Interquartile Range (IQR)
The difference between the 75th and 25th
percentiles (3rd and 1st quartiles)
•Contains the central 50% of observations
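
Quartiles and the IQR can be sketched with `statistics.quantiles` from Python's standard library (Python 3.8 or later). Several interpolation conventions exist and give slightly different quartiles; the `method="inclusive"` choice and the sample values below are assumptions for this example only.

```python
from statistics import median, quantiles

# Hypothetical sample; quantiles() sorts the data internally.
data = [7, 2, 9, 4, 12, 4, 10, 5, 8]

# n=4 splits the data into four groups, returning three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# IQR = 75th percentile minus 25th percentile.
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 7.0 9.0
print(iqr)         # 5.0
```

The middle cut point equals the median, and the interval from `q1` to `q3` covers the central half of the observations.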

Standard Error of the Mean
•A measure of how much the means vary from
sample to sample when taken from the same
distribution
•It estimates the standard deviation of the
sampling distribution of the mean
•Quantifies how precisely you can estimate the
true population mean
•SEM = SD/√n
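
The formula SEM = SD/√n can be sketched as follows, where n is the number of observations and SD is the sample standard deviation. The sample values are hypothetical.

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of n = 5 observations.
sample = [4, 8, 6, 5, 7]
n = len(sample)

sd = stdev(sample)  # sample standard deviation, sqrt(2.5) here
sem = sd / sqrt(n)  # standard error of the mean

print(round(sd, 4), round(sem, 4))
```

Because the SD is divided by √n, collecting more observations from the same distribution shrinks the SEM while the SD stays roughly the same.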

SD vs SEM
•The SD describes the variability between
individuals in a sample
•The SEM describes the uncertainty in how
well the sample mean represents the
population mean
•SEM is always smaller than SD when n > 1

Selecting dispersion measures
•Use standard deviation when the mean is reported
•Percentiles and IQR are used when the median is used,
or when the mean is used but the objective is to compare
individual observations with a set of norms
•IQR is used to describe the central 50% of a distribution
•Range is used when the purpose is to highlight extreme
values
•CV is used when the intent is to compare distributions
measured on different scales
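
The coefficient of variation is commonly defined as CV = SD / mean. The sketch below, with invented height and weight data, shows why it suits comparisons across different scales: it has no units and does not change when the unit of measurement changes.

```python
from math import isclose
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return SD / mean (meaningful for ratio-scale data with a positive mean)."""
    return stdev(values) / mean(values)


# Hypothetical measurements on two different scales.
heights_cm = [160, 170, 180]
weights_kg = [50, 70, 90]

cv_height = coefficient_of_variation(heights_cm)  # 10 / 170
cv_weight = coefficient_of_variation(weights_kg)  # 20 / 70

# Changing units (cm to m) leaves the CV essentially unchanged.
heights_m = [h / 100 for h in heights_cm]
cv_height_m = coefficient_of_variation(heights_m)

print(round(cv_height, 4), round(cv_weight, 4))
```

Here the weights vary more relative to their mean than the heights do, a comparison the raw standard deviations (10 cm versus 20 kg) cannot support directly.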