DS601-Data Science Processes for Data science Student.pdf

340 views 50 slides Mar 26, 2024

About This Presentation

Processes involved in data science


Slide Content

DSC 601
Data Science Process
Dr M. Nyamsi

Course Objectives
•Give an overview of the data science process
•Understand the flow of a data science process
•Learn how to work with big data sets and streaming data
DSC601-Dr NYAMSI 2

Course outline
•Research Goal
•Retrieving data
•Data preparation
•Data exploration
•Data modeling
•Presentation.

Research goal
A project charter requires teamwork, and your input covers at
least the following:
•A clear research goal and the project mission and context
•How you’re going to perform your analysis
•What resources you expect to use
•Proof that it’s an achievable project, or a proof of
concept
•Deliverables and a measure of success
•A timeline

Research goal
•It states the purpose of your assignment in a clear and
focused manner
•Understand the business goals and context of the project
•Continue asking questions and devising examples until you
grasp the exact business expectations
•Identify how your project fits in the bigger picture
•Appreciate how your research is going to change the
business and understand how they’ll use your results.

Research goal
•Many data scientists fail here:
•Despite their mathematical wit and scientific brilliance,
they never seem to grasp the business goals and
context.
•Many students fail here:
•Despite their CSC background
•Despite their will
•Despite the explanations from their supervisors!

Research goal
So, take time to search, refine,
and ask good questions.

Research goal: project charter
A project charter requires teamwork, and your input covers at
least the following:
•A clear research goal
•The project mission and context
•How you’re going to perform your analysis
•What resources you expect to use
•Proof that it’s an achievable project, or a proof of concept
•Deliverables and a measure of success
•A timeline

Retrieving data
•Retrieve essential data to fit your needs
•Data can be stored in many forms, from simple text
files to tables in a database.
•The objective now is to acquire all the data you need.
•This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs
polishing to be of any use to you.
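As a rough illustration of the two storage forms mentioned above (a simple text file and a table in a database), here is a minimal sketch using only the Python standard library. The CSV text, the `sales` table, and all values are hypothetical stand-ins, not data from the course.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a simple text file.
csv_text = "customer_id,city\n1,Douala\n2,Yaounde\n"
file_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical in-memory database standing in for a company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.5), (2, 80.0)])
db_rows = conn.execute(
    "SELECT customer_id, amount FROM sales ORDER BY customer_id"
).fetchall()
conn.close()

print(file_rows)  # every CSV value arrives as a string
print(db_rows)    # database values keep their declared types
```

Note that the text file delivers every value as a string, while the database returns typed values. This is one reason the retrieved data still needs polishing.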

Retrieving data
Goal:
•Retrieve the required data
•The data can be internal or external.
•And make sure

Retrieving data: from the company
•First assess the relevance and quality of the data
that’s readily available within your company.
•This data can be found in data repositories such as
•Databases: an organized collection of structured
information, or data, typically stored electronically in a
computer system.
•Data marts: a subject-oriented database that meets the
demands of a specific group of users.

Retrieving data: from the company
•This data can be found in data repositories such as
•Data warehouses: a large store of data accumulated from a
wide range of sources within a company and used to guide
management decisions.
•Data lakes: a centralized repository designed to store,
process, and secure large amounts of structured, semi-
structured, and unstructured data.
•It is also possible that your data still resides in Excel files on
the desktop of a domain expert.

Retrieving data: from the company
•As companies grow, their data becomes scattered
around many places.
•Organizations understand the value and sensitivity of
data
•Organizations often have policies in place, so
everyone has access to what they need and nothing
more.

Retrieving data: out of the company
•You can buy data: Nielsen and GfK are well known
for this in the retail industry.
•Other companies provide data so that you, in turn, can
enrich their services and ecosystem.
•Example: Twitter, LinkedIn, and Facebook.
•More and more governments and organizations share their
data for free with the world, covering a broad range of
topics.

Retrieving data: data quality checks
•During data retrieval, you
•Check to see if the data is equal to the data in the source
document and
•Look to see if you have the right data types.
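The two retrieval-time checks above can be sketched as follows. This is a minimal example under assumed conditions: the record layout (id, date, amount) and the sample values are hypothetical, and the helper `has_expected_types` is introduced only for illustration.

```python
# Hypothetical source records and the copy obtained after retrieval.
source = [("1", "2024-01-05", "19.99"), ("2", "2024-01-06", "5.00")]
retrieved = [("1", "2024-01-05", "19.99"), ("2", "2024-01-06", "five")]


def has_expected_types(record):
    """Return True if the id parses as an int and the amount as a float."""
    record_id, _date, amount = record
    try:
        int(record_id)
        float(amount)
    except ValueError:
        return False
    return True


# Check 1: is the retrieved data equal to the data in the source?
same_as_source = retrieved == source

# Check 2: do the fields have the right data types?
bad_records = [r for r in retrieved if not has_expected_types(r)]

print(same_as_source)
print(bad_records)
```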

Retrieving data: data quality checks
•With data preparation, you do a more elaborate
check.
•During the exploratory phase your focus shifts to
what you can learn from the data.

Data preparation
Objective:
•Sanitize data
•Prepare it for the modeling and reporting phase

Data preparation: cleansing
•It focuses on removing errors in your data
•The data becomes a true and consistent representation of
the processes it originates from
•Avoid interpretation errors and standardization errors
•Example
•Gender: “F” versus “female”
•Money: cents versus euros, or pounds versus dollars
•There are some possible solutions
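One possible way to handle the two examples above (inconsistent gender labels and mixed money units) is sketched below. The mapping table, the helper names `clean_gender` and `to_euros`, and the sample records are assumptions made for illustration only.

```python
# Hypothetical mapping from free-form labels to one standard code.
GENDER_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean_gender(value):
    """Map a free-form gender label onto one code; None if unrecognized."""
    return GENDER_MAP.get(value.strip().lower())


def to_euros(amount, unit):
    """Convert an amount expressed in cents or euros to euros."""
    if unit == "cent":
        return amount / 100
    if unit == "euro":
        return float(amount)
    raise ValueError(f"unknown unit: {unit}")


raw = [("F", 1250, "cent"), ("female", 12.5, "euro"), (" Male ", 300, "cent")]
cleaned = [(clean_gender(g), to_euros(a, u)) for g, a, u in raw]
print(cleaned)
```

Returning `None` for an unrecognized label, rather than guessing, keeps the problem visible so it can be corrected at the source.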

Data preparation: correction of errors
•Correct errors as early as possible
•The data collection process is error prone,
•In a big organization it involves many steps and
teams.
•Data should be cleansed when acquired for many
reasons

Data preparation: correction of errors
•Reasons for cleaning data
•Decision-makers may make costly mistakes when they base
decisions on incorrect data
•Reusability of data: If not corrected early on in the
process, the cleansing will be done for every project that
uses that data.
•Data errors can point to bugs in software or in the
integration of software that may be critical to the
company

Data preparation: correction of errors
•Remarks:
•Always keep a copy of your original data (when
possible).

Data preparation: combine data from different sources
•Your data comes from several different places
•Data varies in size, type, and structure,
•Ranging from databases and Excel files to text
documents.
•We focus on data in table structures for the moment
•Keep in mind that other types of data sources exist,
such as key-value stores, document stores, etc.

Data preparation: combine data from different sources
•Two ways to combine information from different
data sets.
•Join: enrich an observation from one table with
information from another table.
•Appending or stacking: add the observations of one
table to those of another table.
•Set operations: union, difference, and intersection
•These operations come from relational algebra, as seen in
relational databases.
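The difference between joining and appending can be sketched with the standard-library `sqlite3` module. The tables `clients`, `sales_jan`, and `sales_feb` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales_jan (client_id INTEGER, amount REAL);
    CREATE TABLE sales_feb (client_id INTEGER, amount REAL);
    INSERT INTO clients VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales_jan VALUES (1, 100.0), (2, 50.0);
    INSERT INTO sales_feb VALUES (1, 70.0);
""")

# Join: enrich each January sale with the client's region (more columns).
joined = conn.execute("""
    SELECT s.client_id, s.amount, c.region
    FROM sales_jan AS s
    JOIN clients AS c ON s.client_id = c.client_id
    ORDER BY s.client_id
""").fetchall()

# Append (stack): add February's observations to January's (more rows).
appended = conn.execute("""
    SELECT client_id, amount FROM sales_jan
    UNION ALL
    SELECT client_id, amount FROM sales_feb
    ORDER BY client_id, amount
""").fetchall()
conn.close()

print(joined)
print(appended)
```

A join widens the table, while an append lengthens it. `UNION ALL` keeps duplicate rows; plain `UNION` would remove them, which matches the set-union operation of relational algebra.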

Data preparation: combine data from different sources
•To join tables:
•You use variables that represent the same object in both
tables
•These common fields are known as keys.
•They can be primary keys or not
•We can also use views (a virtual layer that combines the
tables) to simulate data joins or appends.
•We can enrich the data with aggregated measures
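A view and an aggregated measure can be sketched with `sqlite3` as follows. The table names, the view `all_sales`, and the "share of product total" measure are hypothetical choices made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (product TEXT, amount REAL);
    CREATE TABLE sales_feb (product TEXT, amount REAL);
    INSERT INTO sales_jan VALUES ('A', 10.0), ('B', 30.0);
    INSERT INTO sales_feb VALUES ('A', 20.0), ('B', 40.0);

    -- A view is a virtual table: it simulates the append without copying data.
    CREATE VIEW all_sales AS
        SELECT 'jan' AS month, product, amount FROM sales_jan
        UNION ALL
        SELECT 'feb' AS month, product, amount FROM sales_feb;
""")

# Aggregated measure: each sale's share of its product's total sales.
rows = conn.execute("""
    SELECT s.month, s.product, s.amount, s.amount / t.total AS share
    FROM all_sales AS s
    JOIN (SELECT product, SUM(amount) AS total
          FROM all_sales GROUP BY product) AS t
      ON s.product = t.product
    ORDER BY s.product, s.amount
""").fetchall()
conn.close()

for row in rows:
    print(row)
```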

Data preparation: transformation of data
•We have cleaned and integrated the data
•Certain models require their data to be in a certain
shape.
•We transform data so it takes a suitable form for data
modeling.
•We can transform the data, reduce the number of
variables, or turn variables into dummies

Data preparation: transformation of data
•Transformation:
•Find a relationship between an input variable and an
output variable
•The relationship can be linear or not
•Use numerical or statistical methods to do it
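As one example of such a transformation, a non-linear relationship of the form y = a·exp(b·x) becomes linear after taking the logarithm of y. The sketch below assumes this exponential form and uses made-up data; `fit_line` is a hypothetical helper implementing ordinary least squares.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data generated from y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Taking the log of y turns the curve into a straight line:
# log(y) = log(2) + 0.5 * x
slope, intercept = fit_line(xs, [math.log(y) for y in ys])

print(slope)                # estimate of b
print(math.exp(intercept))  # estimate of a
```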

Data preparation: transformation of data
•Reduce the number of variables
•Many variables don’t necessarily add value to your goal
•Having too many variables in your model makes the model
difficult to handle
•Certain techniques don’t perform well when you overload
them with too many input variables.
•Data scientists use special methods to reduce the number
of variables while retaining the maximum amount of information.
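The slide does not name a specific method. One commonly used option is principal component analysis (PCA), sketched here with NumPy (assumed to be installed). The data is synthetic: the third variable is almost a copy of the first, so it carries little extra information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 variables, where x3 is
# nearly identical to x1.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# PCA via singular value decomposition of the centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Keep the two strongest components: 3 variables become 2.
X_reduced = X_centered @ components[:2].T

print(X_reduced.shape)
print(explained)  # fraction of variance carried by each component
```

In this constructed case the first two components carry almost all of the variance, so dropping the third loses very little information.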

Data preparation: transformation of data
•We can turn variables into dummies
•Dummy variables can only take two values: true(1) or
false(0).
•Used to indicate the presence or absence of a categorical
effect that may explain the observation.
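A minimal sketch of turning one categorical variable into dummies, in plain Python. The `weekday` column and the helper `to_dummies` are hypothetical.

```python
def to_dummies(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


weekday = ["Mon", "Tue", "Mon", "Wed"]
dummies = to_dummies(weekday)

for name, column in dummies.items():
    print(name, column)
```

Each observation has a 1 in exactly one of the new columns, marking which category it belongs to.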

Data exploration
•Information becomes much easier to grasp when
shown in a picture.
•We mainly use graphical techniques to gain an
understanding of the data and the interactions between
variables
•You can still discover anomalies you missed in the
steps before

Data exploration
•There are many techniques for exploration.
•Visual: from simple line graphs or histograms to more
complex diagrams such as Sankey and network graphs
•Brushing and linking: combine and link different graphs
and tables
•Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis.
Now you understand the content of your cleansed
data. It is time to build your model.
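Tabulation, one of the non-graphical exploration techniques listed above, can be sketched with the standard library. The records and field names (`region`, `churned`, `spend`) are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical cleansed records.
records = [
    {"region": "North", "churned": "yes", "spend": 20.0},
    {"region": "North", "churned": "no", "spend": 55.0},
    {"region": "South", "churned": "no", "spend": 60.0},
    {"region": "South", "churned": "no", "spend": 65.0},
]

# Cross-tabulation: how often does each (region, churned) pair occur?
crosstab = Counter((r["region"], r["churned"]) for r in records)

# Summary per group: mean spend by region.
mean_spend = {}
for region in sorted({r["region"] for r in records}):
    mean_spend[region] = mean(
        r["spend"] for r in records if r["region"] == region
    )

print(crosstab)
print(mean_spend)
```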

Data modeling
•The goal:
•Making better predictions,
•Classifying objects,
•Gaining an understanding of the system that you’re
modeling.
You know what you’re looking for and what you
want the outcome to be.

Data modeling
•The techniques we use here are borrowed from the
field of machine learning, data mining, and/or
statistics.
•Building a model is an iterative process.
•Most models consist of the following main steps:
1.Selection of a modeling technique and variables to enter
in the model
2.Execution of the model
3.Diagnosis and model comparison

Data modeling: models and variables
•Objectives:
•Choose variables you need for your model,
•Choose a modeling technique
•From the exploratory analysis phase, we get a sense of
which variables will help us construct a good
model.

Data modeling: models and variables
•Many modeling techniques are available.
•Choosing the right model for a problem requires
judgment on your part.
•Consider model performance and whether your
project meets all the requirements to use your
model

Data modeling: models and variables
•Regression techniques: What is the predicted
value for the given data?
•Linear regression: a supervised machine learning
algorithm used for predictive analysis. Regression
models a target prediction value based on independent
variables. Y = ax + b is an example of a simple
regression equation.
•Multivariate regression: like linear regression but with
multiple input variables.
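Both forms of regression above can be sketched with NumPy (assumed to be installed). The data is hypothetical and noise-free, built from y = 3·x1 + 2·x2 + 1, so the recovered coefficients can be checked by eye.

```python
import numpy as np

# Hypothetical noise-free data built from y = 3 * x1 + 2 * x2 + 1.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 3 * x1 + 2 * x2 + 1

# Simple regression Y = a*x + b, using x1 only.
a, b = np.polyfit(x1, y, deg=1)

# Regression with multiple input variables: add a column of ones for
# the intercept and solve the least-squares problem.
design = np.column_stack([x1, x2, np.ones_like(x1)])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

print(a, b)
print(coefficients)  # weights for x1 and x2, then the intercept
```

The simple fit cannot recover the slope 3 exactly because it ignores x2. The fit with both inputs recovers the coefficients used to build the data.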

Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Decision Trees: A simple non-linear and explainable
algorithm based on if-else rules.
•Support Vector Machines (SVMs): aim to draw a line or
plane with a wide margin to separate data into different
categories.
•Naïve Bayes Classifiers: simple probabilistic classifiers
based on applying Bayes’ theorem (from Bayesian statistics)
with strong (naive) independence assumptions.
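To make the "if-else rules" idea concrete, here is a sketch of the smallest possible decision tree: a single split, often called a decision stump. It is a simplified illustration, not a full tree-learning algorithm, and the data (`ages`, `bought`) and function names are hypothetical.

```python
def fit_stump(xs, labels):
    """Pick the threshold on one variable that misclassifies the fewest
    points, predicting 1 when x > threshold and 0 otherwise."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(x, threshold):
    # The learned model is a single if-else rule.
    if x > threshold:
        return 1
    else:
        return 0


ages = [18, 22, 25, 40, 47, 52]
bought = [0, 0, 0, 1, 1, 1]
threshold = fit_stump(ages, bought)

print(threshold)
print(predict(30, threshold), predict(20, threshold))
```

A real decision tree repeats this kind of split recursively on each resulting subset and usually picks splits with an impurity measure rather than a raw error count.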

Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Logistic regression: a popular supervised learning
algorithm used to estimate the probability that an
observation has a given binary label, based on some
predictive features.
•K-Nearest Neighbor (KNN): one of the simplest and
most effective classical machine learning algorithms. It
classifies an unknown test point by finding its k nearest
neighbors in a set of M training points.
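The KNN idea can be sketched in a few lines of plain Python (3.8 or later, for `math.dist`). The training points, labels, and the function name `knn_predict` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6, 5)))
```

There is no training step: all the work happens at prediction time, when the distances to the M training points are computed.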

Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Random forests: one of the most widely used ML
classifiers. They are an ensemble learning method for
classification tasks.
•Artificial Neural Networks (ANNs): well suited to
finding non-linear patterns in data and to modeling
complex relationships between
independent and dependent variables.

Data modeling: model execution
•Once you’ve chosen a model, you’ll need to
implement it in code.
•Most programming languages, such as Python,
already have libraries for this, such as StatsModels or
Scikit-learn.
•These packages implement several of the most popular
techniques.
•Take the book, go to page 49, and try the given examples.
(homework 1)
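As a small illustration of model execution with one of the libraries named above, the sketch below fits a decision tree with Scikit-learn (assumed to be installed). It is not one of the book's examples; the data is hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one input variable (age) and a binary label.
X = [[18], [22], [25], [40], [47], [52]]
y = [0, 0, 0, 1, 1, 1]

# Select the technique, then execute it: fit on the data, then predict.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
predictions = model.predict([[20], [45]])

print(predictions)
```

Most Scikit-learn estimators follow the same `fit` then `predict` pattern, which makes it easy to swap one modeling technique for another.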

Data modeling: model diagnostic
•You’ll be building multiple models, from which you
then choose the best one based on multiple criteria.
•In general, work with a holdout sample (a part of the
data you leave out of the model building so it can be
used to evaluate the model afterward).
•The model is then unleashed on the unseen data and
error measures are calculated to evaluate it.
•Multiple error measures are available (distance, mean
squared error, etc.)
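The holdout procedure can be sketched in plain Python. The data is synthetic (a noisy line), the 75/25 split is an arbitrary choice, and `fit_line` and `mean_squared_error` are hypothetical helpers written for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


random.seed(0)
xs = list(range(20))
ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]  # hypothetical noisy data

# Hold out 25% of the observations; the model never sees them during fitting.
indices = list(range(len(xs)))
random.shuffle(indices)
train_idx, test_idx = indices[:15], indices[15:]

slope, intercept = fit_line([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])

# Evaluate on the unseen holdout sample only.
predictions = [slope * xs[i] + intercept for i in test_idx]
mse = mean_squared_error([ys[i] for i in test_idx], predictions)

print(slope, intercept, mse)
```

To compare several candidate models, the same holdout sample would be used for each and the model with the lowest error (alongside any other criteria) would be preferred.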

Data presentation and automation

Data presentation
•After you’ve successfully analyzed the data and built a
well-performing model, you’re ready to present your
findings to the world.
•You may need to repeat the analysis over and over again
because stakeholders value the predictions of your models
or the insights that you produced.