DS601-Data Science Processes for Data science Student.pdf
50 slides
Mar 26, 2024
About This Presentation
Processes involved in data science
DSC 601
Data Science Process
Dr M. Nyamsi
Course Objectives
•Give an overview of the data science process
•Understand the flow of a data science process
•Learn how to work with big data sets and streaming data
DSC601-Dr NYAMSI 2
Research goal
A project charter requires teamwork, and your input covers at
least the following:
•A clear research goal and the project mission and context
•How you’re going to perform your analysis
•What resources you expect to use
•Proof that it’s an achievable project, or a proof of
concept
•Deliverables and a measure of success
•A timeline
Research goal
•It states the purpose of your assignment in a clear and
focused manner
•Understand the business goals and context of the project
•Continue asking questions and devising examples until you
grasp the exact business expectations
•Identify how your project fits in the bigger picture
•Appreciate how your research is going to change the
business and understand how they’ll use your results.
Research goal
•Many data scientists fail here:
•Despite their mathematical wit and scientific brilliance,
•They never seem to grasp the business goals and
context.
•Many students fail here:
•Despite their CSC background
•Despite their will
•Despite the explanations from their supervisors!
Research goal
So, take time to search, refine,
and ask good questions.
Research goal: project charter
A project charter requires teamwork, and your input covers at
least the following:
•A clear research goal
•The project mission and context
•How you’re going to perform your analysis
•What resources you expect to use
•Proof that it’s an achievable project, or a proof of concept
•Deliverables and a measure of success
•A timeline
Retrieving data
•Retrieve essential data to fit your needs
•Data can be stored in many forms, from simple text
files to tables in a database.
•The objective now is to acquire all the data you need.
•This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs
polishing to be of any use to you.
Retrieving data
Goal:
•Retrieve the required data
•That can be internal or external.
•And make sure
Retrieving data: from the company
•First assess the relevance and quality of the data
that’s readily available within your company.
•This data can be found in data repositories such as
•Databases: an organized collection of structured
information, or data, typically stored electronically in a
computer system.
•Data marts: a subject-oriented database that meets the
demands of a specific group of users.
Retrieving data: from the company
•This data can be found in data repositories such as
•Data warehouses: a large store of data accumulated from a
wide range of sources within a company and used to guide
management decisions.
•Data lakes: a centralized repository designed to store,
process, and secure large amounts of structured,
semi-structured, and unstructured data.
•Your data may also still reside in Excel files on
the desktop of a domain expert.
Retrieving data: from the company
•As companies grow, their data becomes scattered
around many places.
•Organizations understand the value and sensitivity of
data
•Organizations often have policies in place, so
everyone has access to what they need and nothing
more.
Retrieving data: out of the company
•You can buy data: Nielsen and GfK are well known
for this in the retail industry.
•Other companies provide data so that you, in turn, can
enrich their services and ecosystem.
•Example: Twitter, LinkedIn, and Facebook.
•More and more governments and organizations share their
data for free with the world, covering a broad range of
topics.
Retrieving data: data quality checks
•During data retrieval, you
•Check to see if the data is equal to the data in the source
document and
•Look to see if you have the right data types.
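The two retrieval checks above can be sketched in a few lines of Python. This is only an illustration: the CSV contents, the column names, and the helper functions `type_problems` and `differing_rows` are all made up for the example, and the standard `csv` module stands in for whatever tool you actually use to retrieve data.

```python
import csv
import io

# Hypothetical source document and the copy obtained after retrieval.
SOURCE_CSV = "customer_id,age,signup_date\n1,34,2023-01-15\n2,41,2023-02-03\n"
RETRIEVED_CSV = "customer_id,age,signup_date\n1,34,2023-01-15\n2,forty-one,2023-02-03\n"

# Expected type of each column (an assumption made for this example).
EXPECTED_TYPES = {"customer_id": int, "age": int, "signup_date": str}


def read_rows(text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def differing_rows(source_rows, retrieved_rows):
    """Return the indexes of rows that differ between source and retrieved copy."""
    if len(source_rows) != len(retrieved_rows):
        raise ValueError("row counts differ")
    return [i for i, (s, r) in enumerate(zip(source_rows, retrieved_rows)) if s != r]


def type_problems(rows, expected_types):
    """Return (row_index, column, value) for every value that fails type conversion."""
    problems = []
    for i, row in enumerate(rows):
        for column, caster in expected_types.items():
            try:
                caster(row[column])
            except ValueError:
                problems.append((i, column, row[column]))
    return problems


source = read_rows(SOURCE_CSV)
retrieved = read_rows(RETRIEVED_CSV)
print(differing_rows(source, retrieved))         # rows that do not match the source
print(type_problems(retrieved, EXPECTED_TYPES))  # values with the wrong data type
```

Both checks are deliberately shallow; the more elaborate checks belong to the data preparation phase.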
Retrieving data: data quality checks
•With data preparation, you do a more elaborate
check.
•During the exploratory phase your focus shifts to
what you can learn from the data.
Data preparation
Objective:
•Sanitize data
•Prepare it for the modeling and reporting phase
Data preparation: cleansing
•It focuses on removing errors in your data
•The data becomes true and consistent
•Avoid interpretation errors and standardization errors
•Examples
•Gender: the same value recorded as both “F” and “female”
•Money: amounts mixing cents and euros, or pounds and dollars
•There are some possible solutions
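One possible solution, sketched under stated assumptions: map every observed spelling to a single code and convert every amount to a single unit. The mapping tables, the field names, and the helpers `clean_gender` and `to_euros` are hypothetical and only cover the gender and cents/euros examples from the slide.

```python
# Observed spellings mapped to one canonical code (assumed for this example).
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}

# How many of each unit make one euro (unit conversion only, no currency exchange).
UNITS_PER_EURO = {"euro": 1, "cent": 100}


def clean_gender(value):
    """Map a free-form gender entry to 'F' or 'M'; return None if it is unknown."""
    return GENDER_CODES.get(value.strip().lower())


def to_euros(amount, unit):
    """Express an amount given in cents or euros as euros."""
    return amount / UNITS_PER_EURO[unit]


records = [
    {"gender": "F", "amount": 12.5, "unit": "euro"},
    {"gender": " female ", "amount": 1250, "unit": "cent"},
    {"gender": "Male", "amount": 3.0, "unit": "euro"},
]

cleaned = [
    {"gender": clean_gender(r["gender"]), "amount_eur": to_euros(r["amount"], r["unit"])}
    for r in records
]
print(cleaned)
```

Returning `None` for an unknown spelling keeps the error visible instead of silently guessing a category.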
Data preparation: correction of errors
•Correct errors as early as possible
•The data collection process is error-prone
•In a big organization it involves many steps and
teams.
•Data should be cleansed when acquired, for several
reasons
Data preparation: correction of errors
•Reasons for cleaning data early
•Decision-makers may make costly mistakes when their
decisions rest on incorrect data
•Reusability of data: if errors are not corrected early in the
process, the cleansing has to be redone for every project that
uses that data.
•Data errors can point to bugs in software or in the
integration of software that may be critical to the
company
Data preparation: correction of errors
•Remarks:
•Always keep a copy of your original data (when
possible).
Data preparation: combine data from different sources
•Your data comes from several different places
•Data varies in size, type, and structure,
•Ranging from databases and Excel files to text
documents.
•We focus on data in table structures for the moment
•Keep in mind that other types of data sources exist,
such as key-value stores and document stores.
Data preparation: combine data from different sources
•Two ways to combine information from different
data sets.
•Join: enrich an observation from one table with
information from another table.
•Appending or stacking: adding the observations of one
table to those of another table.
•Set operations (union, difference, and intersection) can
also be used
•These are the relational algebra operations seen in relational
databases.
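Join and append can both be tried with Python's built-in `sqlite3` module. The sketch below uses an in-memory database; the tables `clients`, `orders_jan`, and `orders_feb` and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders_jan (client_id INTEGER, amount REAL);
    CREATE TABLE orders_feb (client_id INTEGER, amount REAL);
    INSERT INTO clients VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders_jan VALUES (1, 10.0), (2, 20.0);
    INSERT INTO orders_feb VALUES (1, 15.0);
""")

# Append (stack): the observations of one table are added to those of another.
appended = conn.execute(
    "SELECT client_id, amount FROM orders_jan "
    "UNION ALL SELECT client_id, amount FROM orders_feb "
    "ORDER BY client_id, amount"
).fetchall()

# Join: each order is enriched with the client's name through the shared key.
joined = conn.execute(
    "SELECT o.client_id, c.name, o.amount "
    "FROM orders_jan AS o JOIN clients AS c ON o.client_id = c.client_id "
    "ORDER BY o.client_id"
).fetchall()

print(appended)
print(joined)
```

`UNION ALL` keeps duplicate rows, which is usually what stacking observations requires; plain `UNION` would behave like the set union and drop them.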
Data preparation: combine data from different sources
•To join tables:
•You use variables that represent the same object in both
tables
•These common fields are known as keys.
•They can be primary keys or not
•We can also use views (a virtual layer that combines the
tables) to simulate data joins or appends.
•We can enrich tables with aggregated measures
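A view that enriches each row with an aggregated measure might look like the following sketch, again using `sqlite3` in memory. The `sales` table, the view name `sales_enriched`, and the derived column `share_of_product` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('pen', 'north', 10.0), ('pen', 'south', 30.0),
        ('ink', 'north', 5.0),  ('ink', 'south', 15.0);

    -- The view is a virtual layer: no rows are copied, the query runs on demand.
    CREATE VIEW sales_enriched AS
        SELECT s.product, s.region, s.amount,
               t.total AS product_total,
               s.amount / t.total AS share_of_product
        FROM sales AS s
        JOIN (SELECT product, SUM(amount) AS total
              FROM sales GROUP BY product) AS t
          ON s.product = t.product;
""")

rows = conn.execute(
    "SELECT product, region, share_of_product "
    "FROM sales_enriched ORDER BY product, region"
).fetchall()
print(rows)
```

Each sale now carries its share of the product total, an aggregated measure that a single row could not provide on its own.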
Data preparation: transformation of data
•We have cleaned and integrated the data
•Certain models require their data to be in a certain
shape.
•We transform data so it takes a suitable form for data
modeling.
•We can transform variables, reduce the number of
variables, and turn variables into dummies
Data preparation: transformation of data
•Transformation:
•Find a relationship between an input variable and an
output variable
•The relationship can be linear or not
•Use numerical or statistical methods to do it
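As an illustration of turning a non-linear relationship into a linear one, the sketch below assumes data of the form y = a·exp(b·x), which is not stated on the slide, and applies a log transform. The values a = 2.0 and b = 0.8 and the helper `steps` are made up for the example.

```python
import math

# Synthetic data following y = a * exp(b * x), with made-up a = 2.0 and b = 0.8.
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.8 * x) for x in xs]


def steps(values):
    """Differences between consecutive values; constant steps mean a straight line."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Taking the log turns y = a * exp(b * x) into log(y) = log(a) + b * x.
log_ys = [math.log(y) for y in ys]

raw_steps = steps(ys)      # steps keep growing: the raw relationship is curved
log_steps = steps(log_ys)  # steps are (almost) constant: linear after the transform

# Because x starts at 0 and increases by 1, slope and intercept can be read off directly.
recovered_b = log_steps[0]
recovered_a = math.exp(log_ys[0])
print(raw_steps)
print(log_steps)
print(recovered_a, recovered_b)
```

With real, noisy data the parameters would be estimated by fitting a line to the transformed values rather than read off two points.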
Data preparation: transformation of data
•Reduce the number of variables
•Many variables don’t necessarily add value to your goal
•Having too many variables in your model makes the model
difficult to handle
•Certain techniques don’t perform well when you overload
them with too many input variables.
•Data scientists use special methods to reduce the number
of variables but retain the maximum amount of information.
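The slide does not name a method. As one simple stand-in (not the specialized techniques a data scientist would normally reach for, such as principal component analysis), the sketch below drops a variable when it is almost perfectly correlated with one already kept. The column names, the data, the 0.95 threshold, and the helper `drop_correlated` are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |correlation| with every already-kept column is below threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up data: height in inches repeats the information in height in centimeters.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 22, 35, 51, 28],
}

reduced = drop_correlated(columns)
print(list(reduced))
```

The redundant column is removed while almost no information is lost, which is the trade-off the slide describes.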
Data preparation: transformation of data
•We can turn variables into dummies
•Dummy variables can only take two values: true (1) or
false (0).
•They indicate the presence or absence of a categorical
effect that may explain the observation.
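Turning a categorical variable into dummies can be sketched as follows. The weekday data and the helper `to_dummies` are hypothetical; libraries such as pandas offer ready-made equivalents.

```python
def to_dummies(values):
    """Turn a list of categories into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Made-up categorical variable: the weekday of each observation.
weekdays = ["Mon", "Tue", "Mon", "Wed"]
dummies = to_dummies(weekdays)

for category, column in dummies.items():
    print(category, column)
```

Each observation has exactly one dummy set to 1, so the original category can always be recovered from the dummy columns.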
Data exploration
Data exploration
•Information becomes much easier to grasp when
shown in a picture,
•We mainly use graphical techniques to gain an
understanding of the data and the interactions between
variables
•You can still discover anomalies you missed in the
steps before
Data exploration
•There are many techniques for exploration.
•Visual: from simple line graphs or histograms to more
complex diagrams such as Sankey and network graphs
•Brushing and linking: combine and link different graphs
and tables
•Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis.
Now you understand the content of your cleansed
data. It is time to build your model.
Data modeling
•The goal:
•Making better predictions,
•Classifying objects,
•Gaining an understanding of the system that you’re
modeling.
You know what you’re looking for and what you
want the outcome to be.
Data modeling
•The techniques we use here are borrowed from the
fields of machine learning, data mining, and/or
statistics.
•Building a model is an iterative process.
•Most models consist of the following main steps:
1.Selection of a modeling technique and variables to enter
in the model
2.Execution of the model
3.Diagnosis and model comparison
Data modeling: models and variables
•Objectives:
•Choose variables you need for your model,
•Choose a modeling technique
•From the exploratory analysis phase, we can get a sense of
which variables will help us construct a good
model.
Data modeling: models and variables
•Many modeling techniques are available.
•Choosing the right model for a problem requires
judgment on your part.
•Consider model performance and whether your
project meets all the requirements to use your
model
Data modeling: models and variables
•Regression techniques: What is the predicted
value for the given data?
•Linear regression: a supervised machine learning algorithm
used for predictive analysis. Regression models a target
value based on independent variables.
Y = ax + b is an example of a simple regression equation.
•Multiple regression: like simple linear regression but with
several independent variables.
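To make the simple regression equation Y = ax + b concrete, here is a minimal least-squares sketch in plain Python. The data points and the helper `fit_line` are invented for illustration; in practice a library such as StatsModels or Scikit-learn would do the fitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Made-up observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
prediction = a * 6 + b  # predicted value for a new x = 6
print(a, b, prediction)
```

For these points the fit gives roughly a ≈ 1.97 and b ≈ 1.09.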
Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Decision Trees: A simple non-linear and explainable
algorithm based on if-else rules.
•Support Vector Machines (SVMs): aim to draw a line or
plane with a wide margin to separate data into different
categories.
•Naïve Bayes Classifiers: simple probabilistic classifiers
based on applying Bayes’ theorem (from Bayesian statistics)
with strong (naive) independence assumptions.
Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Logistic regression: it is a popular supervised learning
algorithm used to assess the probability of a variable
having a binary label based on some predictive features.
•K-Nearest Neighbors (KNN): one of the simplest classical
machine learning algorithms, and often an effective one. It
classifies an unknown test point by finding its k nearest
neighbors among a set of M training points and taking the
most common label among them.
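A bare-bones version of k-nearest neighbors fits in a few lines. The sketch assumes Euclidean distance and a simple majority vote; the training points, labels, and the function name `knn_predict` are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point with the most common label among its k nearest training points."""
    # Pair every training label with its Euclidean distance to the query, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training set: two well-separated groups in the plane.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6, 5)))
```

There is no training step: all the work happens at prediction time, which is why the method is simple but can be slow on large data sets.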
Data modeling: models and variables
•Classification techniques: what category does this
data belong to?
•Random forests: one of the most widely used ML
classifiers. They are an ensemble learning method for
classification tasks.
•Artificial Neural Networks (ANNs): well suited to
finding non-linear patterns in data and to
modeling complex relationships between
independent and dependent variables.
Data modeling: model execution
•Once you’ve chosen a model you’ll need to
implement it in code.
•Most programming languages, such as Python,
already have libraries such as StatsModels or
Scikit-learn.
•These packages implement several of the most popular
techniques.
•Homework 1: take the book and try the examples given
on page 49.
Data modeling: model diagnostic
•You’ll be building multiple models from which you
then choose the best one based on multiple criteria.
•In general, work with a holdout sample (a part of the
data you leave out of the model building so it can be
used to evaluate the model afterward).
•The model is then unleashed on the unseen data and
error measures are calculated to evaluate it.
•Multiple error measures are available (distance, mean
squared error, and so on)
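The holdout idea can be sketched as follows: set part of the data aside, fit each candidate model on the rest, and compare their errors on the unseen part. Everything here is an assumption made for illustration: the synthetic data, the 25% holdout fraction, and the two deliberately naive candidate models.

```python
import random


def split_holdout(rows, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the holdout sample."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_mean_model(train):
    """Candidate 1: always predict the average target of the training rows."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_two_point_line(train):
    """Candidate 2: the straight line through the training rows with smallest and largest x."""
    (x0, y0), (x1, y1) = min(train), max(train)
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


# Synthetic (x, y) rows that follow y = 2x + 1 exactly.
data = [(x, 2 * x + 1) for x in range(20)]
train, holdout = split_holdout(data)

candidates = {"mean": fit_mean_model(train), "line": fit_two_point_line(train)}
scores = {
    name: mean_squared_error([y for _, y in holdout], [model(x) for x, _ in holdout])
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)  # the model with the lowest holdout error
print(scores, best)
```

The models never see the holdout rows while being fitted, so the error computed on those rows is an honest estimate of how each model behaves on new data.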
Data presentation and automation
Data presentation
•After you’ve successfully analyzed the data and built a
well-performing model, you’re ready to present your
findings to the world.
•You may need to repeat the work over and over again because
people value the predictions of your models or the
insights that you produced.