Data Science and Culture

icaromedeiros 2,482 views 50 slides Jun 29, 2017

About This Presentation

Hiring data scientists and deploying Hadoop is not enough. Your company needs a data-driven culture, based on values such as honesty, democracy, creativity and strategy. Your company also needs good data engineering and good experimentation practices.


Slide Content

Data Science &
Culture
(Or how to stop worrying and love data driven culture)
Ícaro Medeiros
Data Science Forum
São Paulo, Jun 2017

Inspired by
(not limited to)
refs

Big Data
http://www.kdnuggets.com/2017/02/origins-big-data.html
✦Fundamental blocks: advances in CS, e.g.
distributed systems, databases, massive AI, etc.
✦Fuzzy concept, ill-defined
✦Popularized by Gartner

(hype-fueled consulting firm)

✦Big Data no longer considered an emerging
technology (pervasive in industry)
✦Entered Trough of Disillusionment in 2013
https://knowledgeimmersion.wordpress.com/2016/06/22/disillusionment-of-big-data/

http://www.mikelnino.com/2016/03/chronology-big-data.html
Chronology of antecedents

Data science
✦Statistics (late 19th century)
✦Computer Science (1950s)
✦Machine Learning (1950s)
✦Data Mining (1990s)
✦Data Science (2010s)
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
yet another hyped term

Beware: controversy
✦Data science is not all-science
✴It’s getting more and more engineering-like, a practice
✴Data storytelling is a creative endeavor
✦Hyper-inflated expectations, misunderstood
concepts and a rush to get value: a dangerous
recipe

A new hope

machine learning
big data
https://trends.google.com/trends/explore?date=today%2012-m&geo=US&q=machine%20learning,big%20data
or hype

Hype: not that bad
✦Haters gonna hate i.e. don’t fully hate the hype
✴More practitioners = faster evolution of tech and processes
✴Highly skilled professionals and innovation
✦Academics sometimes look for difficult unwanted
problems

Industry is more pragmatic, especially in tech
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science

What we need…
✦Forget about Big Data pokémons
✴OH so in Big Data we don’t need people to think about schemas?
✦Forget about misunderstood business expectations
✴OH in deep learning we don’t need people to train models?
✦You need PEOPLE
✴Collaborating with shared values
✴Awesome in tech but more importantly: CREATIVE

Shared values
and practices
Culture

Good people
✦People are more important than ideas
✴A mediocre team will screw up a good idea
✴Mediocre idea to great team: they will fix it or rethink it
✦A good lab: different kinds of autonomous thinkers
✴Why hire smart people if they can't fix what’s broken?
✦Prefer a heterogeneous and complementary team
instead of looking for unicorns

The mythical 10x professional
https://twitter.com/icaromedeiros/status/838968884023668737

Good communication
✦Honesty, excellence, originality and self-
criticism (values)
✦Communication structure ≠ organizational structure
✦Be ready to hear the truth
✴Sincerity is only valuable if people are open and willing to give
up on ideas that will not work
✦Braintrust: Leave ego and Jobs outside the door

Power to the people!
✦Product quality is everyone’s responsibility
✴Don’t ask permission to take responsibility
✦Passion and excellence versus autonomy
✦Good things might shadow the bad
✴People struggle to explore bad things to avoid being called
“complainers”

Rebels
http://qaspire.com/2017/05/19/sketchnote-what-rebels-want-from-their-boss/

Destroy data silos!
✦Without information about data there is no science
✦Software and data should be a collective property
within the company
✦Knowledge management matters
✦Communication between areas must be enforced

Data portals
✦Self-service platforms to publish datasets
✴Descriptions, schemas, samples, relations between datasets,
etc
✦Open Data initiatives, mostly governments
✦OSS platforms: CKAN, Airbnb’s Dataportal
✦Examples: data.gov.uk, dados.gov.br, etc

“When it comes to creative
inspiration, job titles and
hierarchy are meaningless”

Data storytelling
✦Explain what the numbers say in clear layman's terms
✦Make hidden premises clear
✴Outside data insights
✦Convince others about actions
✴Decreases insights-to-value interval
✦From data to knowledge
https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs

What is creativity
✦Unexpected connections of concepts and ideas
✦It's a marathon, it needs rhythm
✦Creativity must start somewhere and there’s power
in healthy feedback in an iterative process

Visual communication
✦Clean straightforward graphs > visually appealing
✴Choose dataviz libs wisely
✦“Don’t make me think”
✦The right graph for the right audience
✴Prefer a language everyone understands

Visual communication 101

Stats are not enough
https://www.autodeskresearch.com/publications/samestats
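A small illustration of the point, using two series from Anscombe's quartet (a classic textbook dataset, not the data from the linked publication). Both series share the same mean and variance to two decimals, yet their individual values differ widely, which only a plot would reveal. The names `summary`, `y_linear` and `y_curved` are made up for this sketch.

```python
from statistics import mean, variance

# Two y-series from Anscombe's quartet; they share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summary(ys):
    """Return (mean, sample variance), rounded to two decimals."""
    return round(mean(ys), 2), round(variance(ys), 2)


summary_linear = summary(y_linear)
summary_curved = summary(y_curved)

# Largest point-by-point difference between the two series.
largest_gap = max(abs(a - b) for a, b in zip(y_linear, y_curved))

print("linear:", summary_linear)
print("curved:", summary_curved)
print("largest gap at the same x:", round(largest_gap, 2))
```

The summaries match, but the shapes do not, so looking at the plot before trusting the summary is the safer habit.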

STRATEGY

Avoid egotrip data science
✦“OH my cluster has 10 Petabytes, I’m awesome”
✦Fancy ML algorithms are not the goal
✦The most important V in Big Data is value
https://twitter.com/amyhoy/status/847097034536554497

KPI versus HiPPO
✦Tech adoption per se is meaningless
✴Slide-driven Big Data
✴KPIs should grow from Big Data and data insights initiatives
✦Poorly defined goals -> bad decisions
✦Define viable but ambitious goals
✦Data beats opinion

Set goal, plan and GO!
✦Business questions can't be like “OH we want to
detect things related to millennials”
✦Clear goals must be set, with actionable metrics
✦Balance perfect models versus time-to-market
✦Brad Bird: “Sometimes, as a director, you’re
guiding. Sometimes you’re letting the car drive”
https://hbr.org/2017/02/how-chief-data-officers-can-get-their-companies-to-collect-clean-data

The process
✦The process is not the goal
✴It has no agenda or taste, it’s just a tool
✦Quality is the best business plan
✦Agile is a mindset: not only kanbans or scrum
✦If the model will become operational, mix scientists
and engineers from start

Build vs Buy
✦If you buy and your core business is not techie, you can be
illiterate in tech
✴Benchmark before buying
✴Accelerate results and boost internal knowledge
✦If you build and have a good-enough techie culture, you’re
more or less good to go
✴Assess pros and cons consciously
✦If you surf the tech hype AND build good systems you’re
awesome

https://twitter.com/Doug_Laney/status/847452219641356288
When data goes to vendors…

http://www.louisdorard.com/machine-learning-canvas/

DATA
ENGINEERING

Big Data vs Great Data
✦If your logical models do not make sense
✦If your most frequent queries are slow
✦If you have string-only databases
✦If you have unused expensive data
✦Maybe your data lake is a swamp

“The data is a mess”
✦First step: accelerate human understanding of data
✴Metadata, context, hidden assumptions
✦Datasets might serve multiple purposes
✴Define rationale and context
✴Data portals and understandable datasets > Dashboards
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770

Data lost in translation
✦Heterogeneous and siloed databases (and people)
✦Rethink ESB (microservices network)
✦State-of-the-art: data workflow
✴Luigi, Airflow (open source), almost every big tech vendor
✴Transparency, reusability, reproducibility, traceability
✴Automation and monitoring all the way!
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
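The core idea behind data workflow tools can be sketched in a few lines: declare tasks and their dependencies, then run them in dependency order while recording what ran. This is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+); it is not the Luigi or Airflow API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks; each one reads from and writes to a shared context dict.
def extract(ctx):
    ctx["raw"] = [" 3", "1 ", "2"]


def clean(ctx):
    ctx["clean"] = [int(value.strip()) for value in ctx["raw"]]


def report(ctx):
    ctx["report"] = sum(ctx["clean"])


TASKS = {"extract": extract, "clean": clean, "report": report}

# Each task maps to the set of upstream tasks it depends on.
DEPENDS_ON = {"clean": {"extract"}, "report": {"clean"}}


def run_pipeline():
    """Run every task after its dependencies and keep a trace of the order."""
    ctx, log = {}, []
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        TASKS[name](ctx)
        log.append(name)  # traceability: which task ran, and when
    return ctx, log


if __name__ == "__main__":
    context, trace = run_pipeline()
    print(trace, context["report"])
```

Real workflow engines add scheduling, retries and monitoring on top of this, but the explicit dependency graph is what makes a pipeline transparent and reproducible.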

Beyond relational models
✦Not all data problems fit well in traditional SQL or
DW models
✴Key-value, columnar, graph-based, inverted index, etc
✦Models are a framework for problem-solving
✴Not the ultimate answer
✴There’s no one-size-fits-all model
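As an example of one non-relational model named above, here is a minimal inverted index: a mapping from each word to the set of documents containing it, the structure behind most text search. The function names and sample documents are made up for illustration, and tokenizing by whitespace is a deliberate simplification.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "big data hype",
    2: "great data engineering",
    3: "data storytelling",
}
index = build_inverted_index(docs)

print(search(index, "data"))
print(search(index, "great data"))
```

A lookup by word is a single dictionary access here, whereas a relational table would typically need a scan or a purpose-built full-text index.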

Do not forget fluency
✦Check the company lingua franca
✦Make it easy for critical decision-makers
✴Ad hoc SQL queries?
✴Dashboards?
✴Reports?

EXPERIMENTATION

Experiments
✦Missions to discover facts towards understanding
✴They don’t fail, any result produces new information
✴If the initial theory was wrong: good
✴With new facts you can reformulate the question
✦Get more modeling questions asked more often
✦Iterative data science

Product experimentation (A/B)
✦Product experimentation should be hypothesis-
driven (not feature-driven)
✦Define the proper exposed population
✴No new users, no heavy users only, no early adopters
✦Understanding effect is essential
https://medium.com/airbnb-engineering/4-principles-for-making-experimentation-count-7a5f1a5268a
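One common way to quantify the effect in an A/B test on conversion rates is a two-proportion z-test. The sketch below is one possible approach, not the method from the linked article; the function name and the sample numbers are hypothetical, and it assumes samples large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 100/1000 conversions in A, 150/1000 in B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise; whether the effect is large enough to matter is still a product decision.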

5 stages of A/B tests
https://www.linkedin.com/pulse/ab-testing-which-do-i-pick-sahar-heidari

Some other quick tips
✦Focus on outcomes (not algorithms or methods)
✦Design the right metric and evaluation
✦Good experiments don't produce obvious insights
✦Mix of data and intuition
https://twitter.com/mrdatascience/status/869957499662860288

Being data driven
✦Be BAYESIAN - uncertainty is everywhere
✦Be CURIOUS - keep learning
✦Be AGILE - Fail fast, not too fast: evidence comes first
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/
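A minimal sketch of what "being Bayesian" can look like in practice: estimating a conversion rate with a Beta-Binomial model, assuming a uniform Beta(1, 1) prior. The function name and the numbers are hypothetical. The same observed rate gives a much tighter estimate when it rests on more data.

```python
from math import sqrt


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (mean, standard deviation) of the Beta posterior for a rate."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


if __name__ == "__main__":
    # Same 30% observed rate, but very different amounts of evidence.
    for successes, failures in [(3, 7), (300, 700)]:
        mean, std = beta_posterior(successes, failures)
        print(f"{successes + failures} trials: {mean:.3f} +/- {std:.3f}")
```

Reporting the spread together with the estimate keeps the uncertainty visible to decision-makers.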

Being data driven
✦Be TRUTHFUL - don’t torture data to please opinions
✦Be HELPFUL - work across silos, support democracy
✦Be WISE - know when to be analytical or intuitive
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/

With the right people,
Democracy,
Creativity,
Strategy,
Big Great Data™
and Experiments
there's a good chance to do great
SCIENCE
Take-away message

Ícaro Medeiros
Data Scientist
icaromedeiros