Data Science Project Lifecycle

JasonGeng 5,692 views 11 slides Feb 27, 2017

About This Presentation

Presentation at Dallas Data Science Conference 2017 (www.dsassn.org/dallas)


Slide Content

Data Science Project Lifecycle
Jason Geng @ Data Application Lab
Miya Du @ Data Science Association

Business Requirement
Data Acquisition
Data Preparation
Hypothesis & Modeling
Evaluation & Interpretation
Deployment
Operations
Optimization

Business Requirements
• Data scientists need to work with business people and domain experts to understand both the data and the business
• Specify the business requirements
• For instance, healthcare data

Understand the business:
Goal: Predict readmission rate
Database: Healthcare: Readmissions Database
Understand the data:
e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Modeling
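
Understanding a field like DISCWT matters because it changes how the goal is computed. A minimal sketch, assuming each discharge record carries its DISCWT weight plus a hypothetical 0/1 "readmitted" flag (the flag name and all values below are made up for illustration), of a weighted readmission rate:

```python
# Hypothetical discharge records. DISCWT is the discharge-level weight named on
# the slide; the "readmitted" flag and every value here are illustrative only.
discharges = [
    {"DISCWT": 3.0, "readmitted": 1},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 1},
]


def weighted_readmission_rate(records):
    """Return the weight-adjusted share of discharges flagged as readmitted."""
    total_weight = sum(r["DISCWT"] for r in records)
    if total_weight == 0:
        raise ValueError("total discharge weight is zero")
    readmitted_weight = sum(r["DISCWT"] * r["readmitted"] for r in records)
    return readmitted_weight / total_weight


# Weighted estimate versus the naive rate that ignores the weights.
national_rate = weighted_readmission_rate(discharges)
unweighted_rate = sum(r["readmitted"] for r in discharges) / len(discharges)
```

With these made-up numbers the two rates differ, which is the point: ignoring the weight answers a different question than the national estimate.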

Data Collection
• Data from product line
• Purchase third party data
• Social media (Facebook, LinkedIn)
• Web crawling
• Open source data (Opendata, U.S. Census Data)
Challenge: data storage and data management

Data sources: Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data
Data formats: XML, CSV, LOG, SQL
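
Because the sources arrive in different formats, one common first step is to map each format onto a shared record layout. A minimal sketch using only the Python standard library; the sample payloads, the field names (user_id, event) and the log line layout are all assumptions made for illustration:

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Made-up sample payloads standing in for three of the formats on the slide.
CSV_TEXT = "user_id,event\n1,purchase\n2,refund\n"
XML_TEXT = "<events><event user_id='3'>purchase</event></events>"
LOG_TEXT = "2017-02-16 10:00:00 user_id=4 event=login\n"

# Assumed log layout: key=value pairs somewhere in each line.
LOG_PATTERN = re.compile(r"user_id=(\d+) event=(\w+)")


def from_csv(text):
    """Parse CSV rows into the shared record layout."""
    return [
        {"user_id": int(row["user_id"]), "event": row["event"], "source": "csv"}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_xml(text):
    """Parse <event> elements into the shared record layout."""
    root = ET.fromstring(text)
    return [
        {"user_id": int(node.get("user_id")), "event": node.text, "source": "xml"}
        for node in root.iter("event")
    ]


def from_log(text):
    """Parse log lines, skipping any line that does not match the assumed layout."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            records.append(
                {"user_id": int(match.group(1)), "event": match.group(2), "source": "log"}
            )
    return records


# One list of uniform records, regardless of where each one came from.
records = from_csv(CSV_TEXT) + from_xml(XML_TEXT) + from_log(LOG_TEXT)
```

Keeping a "source" field on every record is a design choice that makes later data management (tracing a bad value back to its origin) easier.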

Product Line, Business Intelligence, Data Science App

Data Preparation (Data Wrangling)
• Cleaning data (semantic errors, missing entries, or inconsistent formatting)
• Challenge: data integration
• 80% of the time in the project workflow
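
The three kinds of cleaning named above can be sketched in a few lines. This is an illustrative example only: the rows, the field names, the accepted date formats and the 0 to 120 age range are all assumptions, not part of the presentation.

```python
from datetime import datetime

# Made-up raw rows showing inconsistent formatting, a missing entry,
# and a semantic error (a negative age).
raw_rows = [
    {"name": " Alice ", "age": "34", "admitted": "2017-02-16"},
    {"name": "BOB", "age": "", "admitted": "02/17/2017"},
    {"name": "carol", "age": "-5", "admitted": "2017-02-18"},
]

# Date layouts this sketch assumes may appear in the input.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return the date in ISO format, or None if no assumed layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Normalize one row: trim and title-case names, validate age, unify dates."""
    age = int(row["age"]) if row["age"].strip() else None  # missing entry -> None
    if age is not None and not 0 <= age <= 120:  # semantic error -> None
        age = None
    return {
        "name": row["name"].strip().title(),
        "age": age,
        "admitted": parse_date(row["admitted"]),
    }


cleaned = [clean_row(r) for r in raw_rows]
```

Marking bad values as None rather than dropping the row keeps the decision about imputation or removal for a later, explicit step.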
Data Source A, Data Source B, Data Source B → ETL → Data Warehouse
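
The integration challenge shows up as soon as two sources name and type the same key differently. A minimal ETL sketch under stated assumptions: the two in-memory sources, their field names, and the single "patient" table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical sources: A uses integer ids, B uses string ids under another name.
source_a = [{"patient_id": 1, "age": 70}, {"patient_id": 2, "age": 55}]
source_b = [{"pid": "1", "visits": "3"}, {"pid": "3", "visits": "1"}]


def extract_transform():
    """Merge both sources on patient id, keeping ids seen in either source."""
    merged = {}
    for row in source_a:
        merged[row["patient_id"]] = {"age": row["age"], "visits": None}
    for row in source_b:
        pid = int(row["pid"])  # align the key type with source A
        merged.setdefault(pid, {"age": None, "visits": None})["visits"] = int(row["visits"])
    return [(pid, rec["age"], rec["visits"]) for pid, rec in sorted(merged.items())]


def load(rows):
    """Load the merged rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, age INTEGER, visits INTEGER)"
    )
    conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = load(extract_transform())
rows = warehouse.execute(
    "SELECT patient_id, age, visits FROM patient ORDER BY patient_id"
).fetchall()
```

Ids present in only one source survive with NULL in the missing column, so the gap stays visible downstream instead of being silently dropped.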

Feature Engineering
Select or create features
Research feature relevance
Experiment and validation
Change the feature set
Go back to feature selection step
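
One pass through this loop can be sketched as scoring each candidate feature and keeping those above a cutoff. The relevance measure here (absolute Pearson correlation with the target), the 0.5 threshold, and the toy data are assumptions chosen for illustration; the slides do not prescribe a specific method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy candidate features and a binary target, all made up.
features = {
    "prior_visits": [0, 1, 2, 3, 4, 5],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "shoe_size": [9, 7, 10, 8, 9, 8],
}
target = [0, 0, 0, 1, 1, 1]


def rank_features(features, target):
    """Research feature relevance: score every feature against the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_features(features, target, threshold=0.5):
    """Change the feature set: keep only features scoring at or above the threshold."""
    return [name for name, score in rank_features(features, target) if score >= threshold]


selected = select_features(features, target)
```

In practice the experiment and validation step would retrain a model on the selected set and compare held-out performance before going around the loop again.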

Modeling
Reference Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
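
The referenced map is a decision flowchart for choosing a family of estimators. The function below is a simplified paraphrase of its top-level branches as a rough sketch; the function name, argument names and return strings are hypothetical, and the map itself remains the authoritative source for the detailed choices inside each family.

```python
def suggest_estimator_family(n_samples, predicting, has_labels=False):
    """Rough paraphrase of the top-level branches of the scikit-learn map.

    predicting: "category", "quantity", or "exploring" (names assumed here).
    """
    if n_samples <= 50:
        return "get more data"
    if predicting == "category":
        # Labeled data points to classification, unlabeled to clustering.
        return "classification" if has_labels else "clustering"
    if predicting == "quantity":
        return "regression"
    if predicting == "exploring":
        return "dimensionality reduction"
    raise ValueError(f"unknown prediction goal: {predicting!r}")


# Example: a labeled readmission dataset with a yes/no outcome.
family = suggest_estimator_family(10_000, "category", has_labels=True)
```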

Deploy to Product Line

Thank you!
https://www.DataAppLab.com
Feb 2017
PPT: Xiaolu Zhao @ Feb 16, 2017