Building Data Science Teams: A Moneyball Approach
Josh Wills | Senior Director of Data Science
© Cloudera, Inc. All rights reserved.
About Me
A Team Building Exercise
Data Scientist Supply vs. Data Scientist Demand
Recruiting Techniques
Moneyball and Data Science
Choosing The Right Metrics
1. Analyzing “Unstructured” Data Sources
2. Building Machine Learning Models
3. Turning Static Reports Into Analytical Applications
Answering More Questions in Less Time
How To Answer Questions Like A Data Scientist
The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
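The five stages of a data processing job can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the JSON record layout, the field names `user` and `amount`, and the in-memory "network" are all assumptions made for the example.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy single-process walk through the five stages of a job."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter: keep two (hypothetical) fields, drop non-positive amounts.
    projected = [(r["user"], r["amount"]) for r in records if r["amount"] > 0]

    # 3. Shuffle: serialize each record and route it to a partition by key.
    #    A dict of lists stands in for the network here.
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        partitions[sum(key.encode()) % num_partitions].append(payload)

    # 4. Deserialize on the receiving side and apply aggregation logic (a sum per key).
    totals = defaultdict(int)
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total": v}) for k, v in sorted(totals.items())]
```

Even in this toy version, serialization and deserialization occur in stages 1, 3, and 5, while the aggregation itself is a single line.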
Handling the Cost of Serialization
The Traditional RDBMS Approach
The Cost of The Traditional RDBMS Approach
Query Scheduling and Exploratory Data Analysis
The Spark Approach
The Cost of the Spark Approach
The MapReduce Approach
MapReduce In The Hands of a Data Scientist
Example: Hive Multi-Insert
Our Goal: Public Transit for Questions
Data Modeling for Data Scientists
Motivating Example: Spelling Correction
Event Series Analytics
A Simple Star Schema for Spell Correction
The Combinatorial Explosion
Designing the Spell Correction Data Product
• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
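As a rough sketch of the two candidate parameters named on this slide, the snippet below computes the lag time and a string similarity for each pair of consecutive queries in a session. The `(timestamp, query)` input format is an assumption, and `difflib.SequenceMatcher` is just one readily available similarity measure, not necessarily the one used in the talk.

```python
from difflib import SequenceMatcher


def candidate_features(events):
    """events: list of (timestamp_seconds, query) tuples for one session, sorted by time."""
    features = []
    for (t_prev, q_prev), (t_next, q_next) in zip(events, events[1:]):
        features.append({
            "from": q_prev,
            "to": q_next,
            # Lag time between consecutive events.
            "lag_seconds": t_next - t_prev,
            # Similarity of the two queries, between 0.0 and 1.0.
            "similarity": SequenceMatcher(None, q_prev, q_next).ratio(),
        })
    return features
```

A short lag combined with a high similarity (for example, "moneybal" followed four seconds later by "moneyball") is the kind of pair that could plausibly indicate a user correcting their own spelling.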
A Supernova Schema for Search
Spell Correction in SQL
Exhibit: http://github.com/jwills/exhibit
Querying Nested Types with Impala
Trading Up: From Data Analyst to Data Scientist
• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
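One possible way to track the core metric is sketched below: outputs per job, computed both per person and across the team. The job-log format of `(analyst, num_outputs)` pairs is hypothetical; a real deployment would pull these counts from whatever scheduler or job history is in use.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """job_log: iterable of (analyst, num_outputs) pairs, one pair per job run."""
    per_person = defaultdict(lambda: [0, 0])  # analyst -> [outputs, jobs]
    for analyst, num_outputs in job_log:
        per_person[analyst][0] += num_outputs
        per_person[analyst][1] += 1

    # Individual level: outputs per job for each analyst.
    individual = {a: o / j for a, (o, j) in per_person.items()}

    # Aggregate level: total outputs over total jobs.
    total_outputs = sum(o for o, _ in per_person.values())
    total_jobs = sum(j for _, j in per_person.values())
    aggregate = total_outputs / total_jobs if total_jobs else 0.0
    return individual, aggregate
```

Under this sketch, an analyst who runs one job per question scores 1.0, while someone whose single job writes six output tables scores 6.0.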
Thanks!
@josh_wills

Editor's Notes

  • #3 Expand on this definition here.
  • #8 Companies are trying to acquire data scientists. What they should be trying to acquire is insights. How do data scientists leverage their programming skills to create more insights than an equivalently knowledgeable data analyst?
  • #10 ML models: fraud/risk, ad clicks, next best action/recommenders, etc., etc.
  • #12 ML models: fraud/risk, ad clicks, next best action/recommenders, etc., etc.
  • #15 SUM MAX EXAMPLE
  • #17 Discuss traffic congestion and the problem of induced demand.
  • #18 Discuss scheduling and resource management (e.g., you’re only allowed to drive your Ferrari between midnight and 6 AM).
  • #22 Data scientists know how to structure data in a way that maximizes the number of questions that can be answered by a single MR job.
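
The idea in note #22, which is also the idea behind the Hive multi-insert slide, is to answer several questions with one scan of the input. A minimal Python sketch of that pattern follows; the record fields `user`, `country`, and `amount` and the three aggregates are illustrative assumptions, not the example used in the talk.

```python
from collections import defaultdict


def single_pass_multi_output(records):
    """records: iterable of dicts with hypothetical keys 'user', 'country', 'amount'."""
    total_by_user = defaultdict(float)
    max_by_country = {}
    count_by_country = defaultdict(int)

    for r in records:  # one scan of the input feeds every output
        total_by_user[r["user"]] += r["amount"]
        country = r["country"]
        current_max = max_by_country.get(country, r["amount"])
        max_by_country[country] = max(current_max, r["amount"])
        count_by_country[country] += 1

    return dict(total_by_user), max_by_country, dict(count_by_country)
```

Each additional output adds a little work inside the loop, but the cost of reading and deserializing the input is paid only once.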