Data Science &
Culture
(Or how to stop worrying and love data-driven culture)
Ícaro Medeiros
Data Science Forum
São Paulo, Jun 2017
Inspired by
(not limited to)
refs
Big Data
http://www.kdnuggets.com/2017/02/origins-big-data.html
✦ Building blocks: advances in CS, e.g.
distributed systems, databases, large-scale AI, etc.

✦ Fuzzy concept, ill-defined

✦ Popularized by Gartner

(hype-fueled consulting firm)
✦ Big Data no longer considered an emerging
technology (pervasive in industry)

✦ Entered Trough of Disillusionment in 2013
https://knowledgeimmersion.wordpress.com/2016/06/22/disillusionment-of-big-data/
http://www.mikelnino.com/2016/03/chronology-big-data.html
Chronology of antecedents
Data science
✦ Statistics (late 19th century)

✦ Computer Science (1950s)

✦ Machine Learning (1950s)

✦ Data Mining (1990s)

✦ Data Science (2010s)
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
yet another hyped term
Beware: controversy
✦ Data science is not all-science
✴ It’s getting more and more engineering-like, a practice

✴ Data storytelling is a creative endeavor
✦ Hyper-inflated expectations, misunderstood
concepts and hurry to get value: a dangerous
recipe
A new hope, or hype?
[Figure: Google Trends search interest, “machine learning” vs “big data”, US, past 12 months]
https://trends.google.com/trends/explore?date=today%2012-m&geo=US&q=machine%20learning,big%20data
Hype: not that bad
✦ Haters gonna hate i.e. don’t fully hate the hype

✴ More practitioners = faster evolution of tech and processes
✴ Highly skilled professionals and innovation

✦ Academics sometimes look for difficult, unwanted
problems

✴ Industry is more pragmatic, especially in tech
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science
What we need…
✦ Forget about Big Data pokémons

✴ OH so in Big Data we don’t need people to think about schemas?

✦ Forget about misunderstood business expectations

✴ OH in deep learning we don’t need people to train models?

✦ You need PEOPLE

✴ Collaborating with shared values

✴ Awesome in tech but more importantly: CREATIVE
Shared values
and practices
Culture
Good people
✦ People are more important than ideas

✴ A mediocre team will screw up a good idea

✴ Give a mediocre idea to a great team: they will fix it or rethink it

✦ A good lab: different kinds of autonomous thinkers

✴ Why hire smart people if they can't fix what’s broken?

✦ Prefer a heterogeneous and complementary team
instead of looking for unicorns
The mythical 10x professional
https://twitter.com/icaromedeiros/status/838968884023668737
Good communication
✦ Honesty, excellence, originality and
self-criticism (values)

✦ Communication structure <> organizational structure

✦ Be ready to hear the truth

✴ Sincerity is only valuable if people are open and willing to give
up on ideas that will not work

✦ Braintrust: Leave ego and Jobs outside the door
Power to the people!
✦ Product quality is everyone’s responsibility
✴ Don’t ask permission to take responsibility

✦ Passion and excellence versus autonomy

✦ Good things might shadow the bad

✴ People hesitate to explore bad things to avoid being called
“complainers”
Rebels
http://qaspire.com/2017/05/19/sketchnote-what-rebels-want-from-their-boss/
Destroy data silos!
✦ Without information about data there is no science

✦ Software and data should be a collective property
within the company

✦ Knowledge management matters

✦ Communication between areas must be enforced
Data portals
✦ Self-service platforms to publish datasets

✴ Descriptions, schemas, samples, relations between datasets,
etc

✦ Open Data initiatives, mostly governments

✦ OSS platforms: CKAN, Airbnb’s Dataportal

✦ Examples: data.gov.uk, dados.gov.br, etc
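A minimal sketch of what one portal entry could hold, following the bullet on descriptions, schemas, samples and relations. The `DatasetEntry` and `Column` classes and every field name are hypothetical, for illustration only; they are not the schema of CKAN or of Airbnb's Dataportal.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    description: str


@dataclass
class DatasetEntry:
    """Hypothetical catalog record; the field names are illustrative only."""
    name: str
    description: str
    owner: str
    columns: list[Column]
    sample_rows: list[dict] = field(default_factory=list)
    related_datasets: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts the nested Column instances to plain dicts.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


entry = DatasetEntry(
    name="orders",
    description="One row per confirmed customer order.",
    owner="sales-analytics",
    columns=[
        Column("order_id", "int", "Unique order identifier"),
        Column("amount", "float", "Order total in BRL"),
    ],
    sample_rows=[{"order_id": 1001, "amount": 19.9}],
    related_datasets=["customers"],
)
print(entry.to_json())
```

Publishing the description, schema and a sample together is what lets someone outside the owning team judge a dataset without asking around.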
“When it comes to creative
inspiration, job titles and
hierarchy are meaningless”
Data storytelling
✦ Explain what the numbers say in clear, layman's terms

✦ Make hidden premises clear

✴ Outside data insights

✦ Convince others about actions

✴ Decreases insights-to-value interval
✦ From data to knowledge
https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs
What is creativity
✦ Unexpected connections of concepts and ideas

✦ It's a marathon, it needs rhythm

✦ Creativity must start somewhere and there’s power
in healthy feedback in an iterative process
Visual communication
✦ Clean, straightforward graphs > visually appealing ones

✴ Choose dataviz libs wisely

✦ “Don’t make me think”

✦ The right graph for the right audience

✴ Prefer a language everyone understands
Visual communication 101
Stats are not enough
https://www.autodeskresearch.com/publications/samestats
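A quick way to see the point without the figures: the sketch below uses two series from Anscombe's quartet (a classic example swapped in here, not the datasets from the linked page) and prints their mean, variance and correlation with x. The helper `pearson` is written out by hand for illustration.

```python
from statistics import mean, variance

# Series I and II of Anscombe's quartet; both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)


for label, ys in (("I", y1), ("II", y2)):
    print(label, round(mean(ys), 2), round(variance(ys), 2), round(pearson(x, ys), 2))
```

The summaries agree to two decimals, yet series I is roughly linear with noise while series II follows a smooth curve. Only a plot shows the difference.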
Strategy
Avoid egotrip data science
✦ “OH my cluster has 10 Petabytes, I’m awesome”

✦ Fancy ML algorithms are not the goal

✦ The most important V in Big Data is value
https://twitter.com/amyhoy/status/847097034536554497
KPI versus HiPPO
✦ Tech adoption per se is meaningless

✴ Slide-driven Big Data

✴ KPIs should grow from Big Data and data insights initiatives

✦ Poorly defined goals -> bad decisions

✦ Define viable but ambitious goals

✦ Data beats opinion
Set goal, plan and GO!
✦ Business questions can't be like “OH we want to
detect things related to millennials”

✦ Clear goals must be set, with actionable metrics

✦ Balance perfect models versus time-to-market

✦ Brad Bird: “Sometimes, as a director, you’re
guiding. Sometimes you’re letting the car drive”
https://hbr.org/2017/02/how-chief-data-officers-can-get-their-companies-to-collect-clean-data
The process
✦ The process is not the goal

✴ It has no agenda or taste, it’s just a tool

✦ Quality is the best business plan

✦ Agile is a mindset: not only kanbans or scrum

✦ If the model will become operational, mix scientists
and engineers from the start
Build vs Buy
✦ If you buy and your core business is not techie, you can be
illiterate in tech
✴ Benchmark before buying

✴ Accelerate results and boost internal knowledge

✦ If you build and have a good-enough techie culture, you’re
more or less good to go

✴ Assess pros and cons consciously

✦ If you surf the tech hype AND build good systems you’re
awesome
https://twitter.com/Doug_Laney/status/847452219641356288
When data goes to vendors…
http://www.louisdorard.com/machine-learning-canvas/
DATA
ENGINEERING
Big Data vs Great Data
✦ If your logical models do not make sense

✦ If your most frequent queries are slow

✦ If you have string-only databases

✦ If you have unused expensive data

✦ Maybe your data lake is a swamp
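One of the symptoms above, string-only databases, is easy to probe. The sketch below is an assumption about how such a check might look, not a tool from the talk: `infer_type` is a hypothetical helper and the rows are made-up sample data.

```python
def infer_type(values):
    """Return the narrowest of "int", "float", "str" that parses every value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue  # at least one value does not parse; try the next type
    return "str"


# Made-up rows where every field was stored as a string.
rows = [
    {"order_id": "1001", "amount": "19.90", "country": "BR"},
    {"order_id": "1002", "amount": "5.00", "country": "UK"},
]

columns = {key: [row[key] for row in rows] for key in rows[0]}
suggested = {name: infer_type(values) for name, values in columns.items()}
print(suggested)
```

Columns that come back as `int` or `float` are candidates for a typed schema. A real check would also need to handle nulls, dates and locale-specific number formats.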
“The data is a mess”
✦ First step: accelerate human understanding of data

✴ Metadata, context, hidden assumptions

✦ Datasets might serve multiple purposes

✴ Define rationale and context

✴ Data portals and understandable datasets > Dashboards
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
Data lost in translation
✦ Heterogeneous and siloed databases (and people)

✦ Rethink ESB (microservices network)

✦ State-of-the-art: data workflow

✴ Luigi, Airflow (open source), almost every big tech vendor

✴ Transparency, reusability, reproducibility, traceability

✴ Automation and monitoring all the way!
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
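To make the data workflow idea concrete, here is a library-free sketch of tasks with declared dependencies, run in dependency order with a trace of what ran. It does not use the Luigi or Airflow APIs; the task names, the `TASKS` table and the `run` function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    return ["3", "4", "oops", "5"]


def clean(raw):
    # Keep only values that are plain digits and convert them to integers.
    return [int(value) for value in raw if value.isdigit()]


def report(values):
    return {"count": len(values), "total": sum(values)}


# Task name -> (callable, names of upstream tasks whose outputs it consumes).
TASKS = {
    "extract": (extract, []),
    "clean": (clean, ["extract"]),
    "report": (report, ["clean"]),
}


def run(tasks):
    """Run every task after its dependencies; return the results and a run log."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results, log = {}, []
    for name in sorter.static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
        log.append(name)  # traceability: record the order in which tasks ran
    return results, log


results, log = run(TASKS)
print(log)
print(results["report"])
```

Declaring dependencies as data, rather than burying them in scripts, is what makes a pipeline transparent and reproducible. Workflow tools add scheduling, retries and monitoring on top of the same idea.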
Beyond relational models
✦ Not all data problems fit well in traditional SQL or
DW models

✴ Key-value, columnar, graph-based, inverted index, etc

✦ Models are a framework for problem-solving
✴ Not the ultimate answer

✴ There’s no one-size-fits-all model
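As an illustration of one non-relational model from the list, the sketch below builds a tiny inverted index, the structure behind text search. The documents and the function names `build_inverted_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Made-up documents keyed by id.
docs = {
    1: "big data is a fuzzy concept",
    2: "great data beats big data",
    3: "data storytelling is a creative endeavor",
}


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return sorted ids of documents containing every term (AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(docs)
print(search(index, "big", "data"))
print(search(index, "creative"))
```

The same lookup in a relational table would need a scan or a full-text extension. Here the data model is shaped around the question being asked.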
Do not forget fluency
✦ Check the company lingua franca

✦ Make it easy for critical decision-makers

✴ Ad hoc SQL queries?

✴ Dashboards?

✴ Reports?
EXPERIMENTATION
Experiments
✦ Missions to discover facts towards understanding

✴ They don’t fail, any result produces new information

✴ If the initial theory was wrong: good

✴ With new facts you can reformulate the question

✦ Get more modeling questions asked more often

✦ Iterative data science
Product experimentation (A/B)
✦ Product experimentation should be
hypothesis-driven (not feature-driven)

✦ Define the proper exposed population
✴ No new users, no heavy users only, no early adopters

✦ Understanding effect is essential
https://medium.com/airbnb-engineering/4-principles-for-making-experimentation-count-7a5f1a5268a
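A hedged sketch of "understanding effect": a two-proportion z-test on conversion counts, one common way (not necessarily the one used in the linked article) to check whether a variant's lift is distinguishable from noise. The counts are invented and `two_proportion_z_test` is a hypothetical helper.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Invented counts: control converts 200 of 2000 users, variant 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The hypothesis, the exposed population and the significance threshold should be written down before the experiment starts, not chosen after looking at the result.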
5 stages of A/B tests
https://www.linkedin.com/pulse/ab-testing-which-do-i-pick-sahar-heidari
Some other quick tips
✦ Focus on outcomes (not algorithms or methods)

✦ Design the right metric and evaluation
✦ Good experiments don't produce obvious insights

✦ Mix of data and intuition
https://twitter.com/mrdatascience/status/869957499662860288
Being data driven
✦ Be BAYESIAN - uncertainty is everywhere

✦ Be CURIOUS - keep learning
✦ Be AGILE - Fail fast, not too fast: evidence comes first
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/
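"Be BAYESIAN" can be read as: hold a belief as a distribution and update it as evidence arrives. The sketch below shows the simplest case, a Beta prior over a success rate updated with observed successes and failures. The `Beta` class and the counts are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) belief about an unknown success rate."""
    alpha: float
    beta: float

    def update(self, successes, failures):
        # Conjugate update: add observed counts to the prior pseudo-counts.
        return Beta(self.alpha + successes, self.beta + failures)

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)


prior = Beta(1, 1)  # uniform prior: no opinion about the rate yet
posterior = prior.update(successes=7, failures=3)
print(round(prior.mean, 3), round(posterior.mean, 3))
```

With only ten observations the posterior mean sits between the prior mean and the raw rate of 0.7, which is the uncertainty the slide asks you to keep in view.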
Being data driven
✦ Be TRUTHFUL - don’t torture data to please opinions

✦ Be HELPFUL - work across silos, support democracy
✦ Be WISE - know when to be analytical or intuitive
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/
With the right people,
Democracy,
Creativity,
Strategy,
Big Great Data™
and Experiments
there's a good chance to do great
SCIENCE
Take-away message
Ícaro Medeiros
Data Scientist
icaromedeiros
