Presentation at Dallas Data Science Conference 2017 (www.dsassn.org/dallas)
Added: Feb 27, 2017
Slides: 11 pages
Slide Content
Data Science Project Lifecycle
Jason Geng @ Data Application Lab
Miya Du @ Data Science Association
Business Requirement
Data Acquisition
Data Preparation
Hypothesis & Modeling
Evaluation & Interpretation
Deployment
Operations
Optimization
Business Requirements
• Data scientists need to work with business people and domain experts to understand both the data and the business
• Specify the business requirements
• For instance, with healthcare data:
Understand the data:
e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Understand the business:
Goal: Predict Readmission Rate
Database: Healthcare: Readmissions Database
Modeling
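To make the role of a discharge-level weight such as DISCWT concrete, here is a minimal sketch of a weighted readmission rate. The records and the `readmitted` field are hypothetical and invented for illustration; only the idea that summing weights scales a sample up to a national estimate comes from the slide.

```python
# Hypothetical sample: each record stands for DISCWT discharges nationwide.
# The "readmitted" flag is an assumed column, not a real HCUP data element.
records = [
    {"DISCWT": 2.0, "readmitted": 1},
    {"DISCWT": 1.5, "readmitted": 0},
    {"DISCWT": 2.5, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 1},
]


def weighted_rate(rows, weight_key="DISCWT", flag_key="readmitted"):
    """Return the weighted share of rows whose flag is set."""
    total_weight = sum(row[weight_key] for row in rows)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    flagged_weight = sum(row[weight_key] * row[flag_key] for row in rows)
    return flagged_weight / total_weight


# Sum of weights: estimated number of discharges the sample represents.
national_discharges = sum(row["DISCWT"] for row in records)
readmission_rate = weighted_rate(records)
```

Ignoring the weights would treat every sampled discharge as equally representative, which is why understanding a field like this matters before modeling.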
Data Collection
• Data from product line
• Purchase third party data
• Social media (Facebook, LinkedIn)
• Web crawling
• Open source data (Opendata, U.S. Census Data)
Challenge: Data Storage, Data Management
[Diagram: Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data, Product Line; formats XML, CSV, LOG, SQL, …; Business Intelligence, Data Science App]
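Collected data arrives in several of the formats named above. As a rough sketch, the snippet below reads an XML document and a web log line into the same list-of-dicts shape so they can be stored and managed together. The sample payloads, field names, and the log pattern (a common access-log layout) are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample payloads.
XML_TEXT = "<orders><order id='1' sku='A-100'/><order id='2' sku='B-200'/></orders>"
LOG_TEXT = '127.0.0.1 - - [27/Feb/2017:10:00:00 +0000] "GET /products/A-100 HTTP/1.1" 200 512\n'

# Assumed access-log layout: ip, two ignored fields, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def read_xml(text):
    """Turn each <order> element's attributes into a record."""
    root = ET.fromstring(text)
    return [{"source": "xml", **order.attrib} for order in root.iter("order")]


def read_log(text):
    """Turn each parseable log line into a record; skip lines that do not match."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            records.append({"source": "log", **match.groupdict()})
    return records


records = read_xml(XML_TEXT) + read_log(LOG_TEXT)
```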
Data Preparation (Data Wrangling)
• Cleaning data (semantic errors, missing entries, or inconsistent formatting)
• Challenge: data integration
• 80% of the time in the project workflow
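The three cleaning problems named above can be sketched on a few rows. This is only an illustration: the rows, the accepted date formats, and the plausible age range are assumptions, and real projects usually need many more rules.

```python
from datetime import datetime

# Hypothetical raw rows showing each problem once.
RAW_ROWS = [
    {"patient_id": " 001 ", "admit_date": "2017-02-27", "age": "54"},
    {"patient_id": "002", "admit_date": "02/27/2017", "age": ""},      # missing entry
    {"patient_id": "003", "admit_date": "27 Feb 2017", "age": "-3"},   # semantic error
]

# Assumed set of date layouts seen in the sources (inconsistent formatting).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def parse_date(text):
    """Normalize a date string to ISO format, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_age(text):
    """Return the age as an int, or None if missing or implausible."""
    text = text.strip()
    if not text:
        return None
    try:
        age = int(text)
    except ValueError:
        return None
    return age if 0 <= age <= 120 else None


def clean_row(row):
    return {
        "patient_id": row["patient_id"].strip(),
        "admit_date": parse_date(row["admit_date"]),
        "age": clean_age(row["age"]),
    }


cleaned = [clean_row(row) for row in RAW_ROWS]
```

Bad values become `None` rather than being silently dropped, so later steps can decide whether to impute or exclude them.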
[Diagram: Data Source A, Data Source B, Data Source B -> ETL -> Data Warehouse]
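The integration flow from several sources through ETL into a warehouse might look like the sketch below. An in-memory SQLite database stands in for the warehouse, and both sources, their field names, and the `purchases` table are hypothetical.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different field names and units.
SOURCE_A_CSV = "id,name,amount_usd\n1,alice,10.50\n2,bob,20.00\n"
SOURCE_B_ROWS = [{"key": 3, "customer": "carol", "cents": 1575}]


def extract_a(text):
    """Extract from CSV and transform to the shared schema (amounts in cents)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["id"]),
            "name": row["name"].title(),
            "amount_cents": round(float(row["amount_usd"]) * 100),
        }


def extract_b(rows):
    """Extract from in-memory records and rename fields to the shared schema."""
    for row in rows:
        yield {
            "id": row["key"],
            "name": row["customer"].title(),
            "amount_cents": row["cents"],
        }


def load(conn, rows):
    """Load transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(id INTEGER PRIMARY KEY, name TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO purchases VALUES (:id, :name, :amount_cents)", list(rows)
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(conn, list(extract_a(SOURCE_A_CSV)) + list(extract_b(SOURCE_B_ROWS)))
summary = conn.execute("SELECT COUNT(*), SUM(amount_cents) FROM purchases").fetchone()
```

Agreeing on one schema and one unit before loading is where most of the integration effort goes in this sketch.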
Feature Engineering
Select or create features
Research feature relevance
Experiment and validate
Change the feature set
Go back to the feature selection step
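One way to read this cycle is as a loop that ranks candidate features by relevance, validates each trial feature set, and keeps a change only when validation improves. The sketch below swaps in specific techniques that the slides do not name: absolute Pearson correlation for relevance and leave-one-out 1-nearest-neighbor accuracy for validation. The data is a toy example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def loo_accuracy(columns, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    rows = list(zip(*columns))

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    correct = 0
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others, key=lambda j: sq_dist(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def select_features(candidates, labels):
    """Try features in order of relevance; keep one only if validation improves."""
    ranked = sorted(
        candidates, key=lambda name: abs(pearson(candidates[name], labels)), reverse=True
    )
    selected, best_score = [], 0.0
    for name in ranked:
        trial = selected + [name]  # change the feature set
        score = loo_accuracy([candidates[f] for f in trial], labels)  # experiment
        if score > best_score:
            selected, best_score = trial, score
        # otherwise discard the change and go back to selection
    return selected, best_score


# Toy data: "signal" separates the classes, "noise" does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "signal": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "noise": [5.0, 1.0, 4.0, 2.0, 4.5, 1.5, 2.5, 3.5],
}
selected, best_score = select_features(candidates, labels)
```

Here the loop keeps `signal` and rejects `noise`, because adding the second feature cannot raise an already perfect validation score.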