data science lecture for data engineering and data analysis.pdf
omar2014oa98
Jul 24, 2024
Data Science (CSE121)
Prepared by:
Dr. Mohamed Azzam
Techniques – Lec. 2
Textbook
◼Shah, C. (2020). A Hands-On Introduction to Data Science. Cambridge:
Cambridge University Press. doi:10.1017/9781108560412
◼Coursera:
❑Applied Data Science with Python Specialization (University of Michigan, 5 courses)
Key terms related to Data
◼Outlier: A data point that is markedly different in value from the other data
points in the sample.
◼Noisy data: The dataset has one or more instances of errors or outliers.
◼Nominal data: The data type is nominal when there is no natural order between
the possible values, for example, colors.
◼Ordinal data: If the possible values of a data type are from an ordered set, the
type is ordinal. For example, grades in a mark sheet.
◼Continuous data: A data type that has an infinite number of possible values.
For example, real numbers.
◼Data cubes: They are multidimensional sets of data that can be stored in a
spreadsheet. A data cube could be in two, three, or higher dimensions. Each
dimension typically represents an attribute of interest.
◼Feature space selection: A method for selecting a subset of features or
columns from the given dataset as a way to do data reduction.
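Two of the terms above, outliers and feature space selection, can be sketched in a few lines of Python. This is only an illustration: the 1.5 × IQR fence is one common convention for flagging outliers and is not prescribed by the lecture, and the sample values, the column names, and the helpers `iqr_outliers` and `select_features` are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def select_features(rows, keep):
    """Feature space selection: keep only the named columns of each row."""
    return [{name: row[name] for name in keep} for row in rows]

# Hypothetical sample with one markedly different value.
sample = [12, 13, 12, 14, 13, 15, 14, 13, 98]
outliers = iqr_outliers(sample)

# Hypothetical dataset; "id" carries no analytical information, so drop it.
rows = [
    {"id": 1, "age": 34, "height_cm": 170},
    {"id": 2, "age": 29, "height_cm": 182},
]
reduced = select_features(rows, ["age", "height_cm"])

print(outliers)
print(reduced)
```

A dataset in which `iqr_outliers` returns a non-empty list would count as noisy data in the sense defined above.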
Key terms related to Data Analytics
◼Nominal variable: The variable type is nominal when there is no natural order
between the possible values that it stores, for example, colors.
◼Ordinal variable: If the possible values of a data type are from an ordered set,
then the type is ordinal. For example, grades in a mark sheet.
◼Interval variable: A kind of variable that provides numerical storage and allows
additions and subtractions on its values but not meaningful multiplications or
divisions. Example: temperature in degrees Celsius.
◼Ratio variable: A kind of variable that provides numerical storage and allows us
to do additions and subtractions, as well as multiplications or divisions, on them.
Example: weight.
◼Independent/predictor variable: A variable that is thought to be controlled or
not affected by other variables.
◼Dependent/outcome/response variable: A variable that depends on other
variables (most often on the independent variables).
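The difference between ordinal, interval, and ratio variables can be made concrete with a small sketch. The grade scale and temperatures below are hypothetical values chosen for illustration; the Kelvin conversion is included only to show what a ratio scale with a true zero looks like.

```python
# Ordinal: the values have an order, but arithmetic on them is not defined.
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # hypothetical scale, lowest to highest

def grade_rank(grade):
    """Position of a grade on the ordered scale."""
    return GRADE_ORDER.index(grade)

sorted_grades = sorted(["B", "A", "D", "C"], key=grade_rank)

# Interval: differences are meaningful, ratios are not.
celsius_a, celsius_b = 10.0, 20.0
difference = celsius_b - celsius_a   # 10 degrees warmer: meaningful
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 °C is not "twice as hot"

# Ratio: Kelvin has a true zero, so ratios become meaningful.
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)

print(sorted_grades)
print(difference, naive_ratio, round(kelvin_ratio, 4))
```

Weight behaves like the Kelvin case: 20 kg really is twice 10 kg, which is why it is listed above as a ratio variable.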
Key terms related to Data Analytics (cont’d)
◼Mean: Mean is the average of continuous data, found by summing the given
data and dividing by the number of data entries.
◼Median: Median is the middle data point of any dataset whose values can be
ordered (ordinal or numeric), taken after sorting.
◼Mode: Mode of a dataset is the value that occurs most frequently.
◼Normal distribution: A normal distribution is a type of distribution of data points
in which, when ordered, most values cluster in the middle of the range and the
rest of the values symmetrically taper off toward both extremes.
◼Correlation: This indicates how closely two variables are related and ranges
from −1 (negatively related) to +1 (positively related). A correlation of 0 indicates
no relation between the variables.
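The measures defined above can be computed with Python's standard `statistics` module. The sketch below uses made-up numbers; `pearson` is a hand-written helper that implements the Pearson correlation coefficient, which is assumed here to be the correlation measure the slide has in mind.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

scores = [2, 3, 3, 5, 7, 10]             # hypothetical data
mean_score = statistics.mean(scores)      # sum / count
median_score = statistics.median(scores)  # average of the two middle values
mode_score = statistics.mode(scores)      # most frequent value

hours = [1, 2, 3, 4, 5]                   # hypothetical study hours
marks = [2, 4, 6, 8, 10]                  # rises exactly with hours
r_positive = pearson(hours, marks)
r_negative = pearson(hours, marks[::-1])  # falls exactly with hours

print(mean_score, median_score, mode_score)
print(round(r_positive, 6), round(r_negative, 6))
```

Because `marks` is an exact linear function of `hours`, the two correlations sit at the extremes of the range, up to floating-point rounding.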
Data Storage and Presentation
◼Most commonly used formats that store data as simple text:
❑CSV (Comma-Separated Values): For example, Depression.csv
❑TSV (Tab-Separated Values)
❑XML (eXtensible Markup Language)
❑RSS (Really Simple Syndication)
❑JSON (JavaScript Object Notation): e.g., var obj = {"name": "John",
"age": 25, "state": "New Jersey"};
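As a rough illustration of how these text formats are read in practice, the sketch below parses the same record from CSV, TSV, XML, and JSON using only Python's standard library. The record reuses the John example from the slide; the in-memory strings and the `<person>` element name are assumptions made to keep the example self-contained, where a real script would open a file such as Depression.csv.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

csv_text = "name,age,state\nJohn,25,New Jersey\n"
tsv_text = "name\tage\tstate\nJohn\t25\tNew Jersey\n"
xml_text = "<person><name>John</name><age>25</age><state>New Jersey</state></person>"
json_text = '{"name": "John", "age": 25, "state": "New Jersey"}'

# CSV and TSV differ only in the delimiter; every field is read as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# XML is a tree of elements; element text is also a string.
person = ET.fromstring(xml_text)
xml_name = person.findtext("name")

# JSON keeps basic types, so age comes back as an integer.
record = json.loads(json_text)

print(csv_rows[0])
print(xml_name)
print(record["age"], type(record["age"]).__name__)
```

Note the type difference: CSV, TSV, and XML hand back `"25"` as text, while JSON preserves `25` as a number.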
Data Preprocessing
◼Why? Data in the real world is often dirty.
◼Factors that indicate that data is not clean or ready to process:
❑Incomplete. When some of the attribute values are lacking, certain
attributes of interest are lacking, or attributes contain only aggregate data.
❑Noisy. When data contains errors or outliers. For example, some of the data
points in a dataset may contain extreme values that can severely affect the
dataset’s range.
❑Inconsistent. When data contains discrepancies in codes or names. For example,
if the "Name" column of employee registration records contains characters
other than alphabetical letters, or if names do not start with a capital letter,
discrepancies are present.
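The three symptoms above can each be turned into a simple check. The sketch below is illustrative only: the employee records, the plausible age range of 16 to 100, and the helper names are all hypothetical, and real rules depend on the dataset at hand.

```python
records = [  # hypothetical employee registration records
    {"name": "Alice", "age": 34},
    {"name": "bob", "age": 29},     # name does not start with a capital
    {"name": "Carol", "age": None}, # missing attribute value
    {"name": "D4ve", "age": 41},    # name contains a non-letter
    {"name": "Erin", "age": 230},   # extreme, implausible value
]

def is_incomplete(record):
    """Incomplete: at least one attribute value is missing."""
    return any(value is None or value == "" for value in record.values())

def is_noisy(record, low=16, high=100):
    """Noisy: age is present but outside an assumed plausible range."""
    age = record["age"]
    return age is not None and not (low <= age <= high)

def is_inconsistent(record):
    """Inconsistent: name is not purely alphabetical or not capitalized."""
    name = record["name"]
    return not (name.isalpha() and name[0].isupper())

incomplete = [r["name"] for r in records if is_incomplete(r)]
noisy = [r["name"] for r in records if is_noisy(r)]
inconsistent = [r["name"] for r in records if is_inconsistent(r)]

print("incomplete:", incomplete)
print("noisy:", noisy)
print("inconsistent:", inconsistent)
```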
Forms of Data Preprocessing
1.Data Cleaning:
❑Data munging: converting data that is not in a format that is easy to work
with into one that is.
❑Handling missing data: ignoring that record, using a global constant
to fill in all missing values, imputation, inference-based solutions
(Bayesian formula or a decision tree), etc.
❑Smooth Noisy Data: identify or remove outliers, try to resolve
inconsistencies in the data.
2.Data Integration:
❑Combine data from multiple sources into a coherent storage place.
❑Engage in schema integration, or the combining of metadata from
different sources.
❑Detect and resolve data value conflicts.
❑Address redundant data in data integration.
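A minimal sketch of some of the steps above, assuming made-up data: three of the listed ways to handle missing values (ignoring the record, a global constant, and mean imputation), followed by a toy integration of two sources that reports value conflicts. The sentinel `-1`, the `hr` and `payroll` sources, and the `integrate` helper are hypothetical, and letting the first source win on a conflict is just one possible policy.

```python
import statistics

# --- Data cleaning: handling missing values ---
ages = [34, None, 29, 41, None]

dropped = [a for a in ages if a is not None]                # ignore the record
filled_constant = [a if a is not None else -1 for a in ages]  # global constant
mean_age = statistics.mean(dropped)
imputed = [a if a is not None else mean_age for a in ages]  # mean imputation

# --- Data integration: combine two sources keyed by employee id ---
hr = {
    101: {"name": "Alice", "dept": "Sales"},
    102: {"name": "Bob", "dept": "IT"},
}
payroll = {
    101: {"name": "Alice", "salary": 5000},
    102: {"name": "Robert", "salary": 6000},  # conflicts with hr on "name"
}

def integrate(first, second):
    """Merge two keyed sources and list fields whose values disagree."""
    merged, conflicts = {}, []
    for key in sorted(first.keys() | second.keys()):
        left, right = first.get(key, {}), second.get(key, {})
        for field in sorted(left.keys() & right.keys()):
            if left[field] != right[field]:
                conflicts.append((key, field, left[field], right[field]))
        merged[key] = {**right, **left}  # assumed policy: first source wins
    return merged, conflicts

merged, conflicts = integrate(hr, payroll)

print(imputed)
print(merged)
print(conflicts)
```

The shared `name` field appears only once in each merged record, which is a simple form of addressing redundant data; the conflict list is what a person or a rule would then need to resolve.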
Forms of Data Preprocessing (Cont’d)
3.Data Transformation: Data must be transformed so it is
consistent and readable (by a system).
4.Data Reduction: a key process that obtains a reduced
representation of a dataset while producing the same or similar analytical
results. Common techniques:
❑Data Cube Aggregation: use the smallest representation that is sufficient to address
the given task.
❑Dimensionality Reduction: the method chosen depends on the nature of the data.
It requires identifying redundancy in the given data and/or
creating composite dimensions or features that could sufficiently represent a set of
raw features.
5.Data Discretization: sometimes we need to convert
continuous values into more manageable parts. Three types of
attributes are involved in discretization:
a)Nominal: Values from an unordered set
b)Ordinal: Values from an ordered set
c)Continuous: Real numbers
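One simple way to discretize a continuous attribute is equal-width binning, sketched below. This is an assumed technique chosen for illustration, not necessarily the one used in the course; the ages, the labels, and the helper `equal_width_bins` are hypothetical. The result turns a continuous attribute into an ordinal one.

```python
def equal_width_bins(values, labels):
    """Assign each value to one of len(labels) equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    binned = []
    for value in values:
        # Interval index; all values fall in bin 0 if the range is zero.
        index = int((value - low) / width) if width else 0
        index = min(index, len(labels) - 1)  # the maximum goes in the last bin
        binned.append(labels[index])
    return binned

ages = [18, 22, 35, 47, 51, 64, 78]       # continuous attribute
labels = ["young", "middle", "senior"]    # ordered (ordinal) categories
binned = equal_width_bins(ages, labels)

for age, label in zip(ages, binned):
    print(age, label)
```

With these values the range 18 to 78 is split into three intervals of width 20, so the cut points fall at 38 and 58.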