Building Data Products


At the Federal Big Data Forum, Josh Wills shares how to succeed at building data products and explains what a data scientist is.



1. Building Data Products Josh Wills, Senior Director of Data Science

2. About Me

3. What Do Data Scientists Do?

4. What I Think I Do

5. What Other People Think I Do

6. What I Actually Do

7. Data Science and Data Products

8. Thinking About Data Products

9. The Best Way To Find Insights

10. Build A Team

11. Measure Everything

12. Solve the Right Problem

13. Building Data Products with Hadoop

14. Hadoop as a Platform for Data Products

15. ETL, Data Science, and Machine Learning

16. Changing the Unit of Analysis

17. Machine Learning and You

18. The Five Questions
    1. When should I use it?
    2. What does the input look like?
    3. What does the output look like?
    4. How many parameters do I have to tune?
    5. Why will it fail?

19. 1. Collaborative Filtering

20. Collaborative Filtering (cont.)
    1. To see things that are hidden.
    2. <user_id>,<item_id>,<weight>
    3. <item1>,<item2>,<score>
    4. The distance metric and the weight calculations.
    5. If the input data is too sparse.
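As a rough illustration of slide 20, the sketch below turns `<user_id>,<item_id>,<weight>` rows into `<item1>,<item2>,<score>` rows on a single machine. Cosine similarity is assumed as the distance metric, and the function name `item_similarities` is hypothetical; the slides do not say which metric or implementation the talk used.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Turn (user_id, item_id, weight) rows into (item1, item2, score) rows.

    The score is the cosine similarity between the two items' weight
    vectors. Cosine is an assumption; other distance metrics fit the same shape.
    """
    by_user = defaultdict(dict)  # user_id -> {item_id: weight}
    for user_id, item_id, weight in ratings:
        by_user[user_id][item_id] = weight  # last row wins for duplicates

    norms = defaultdict(float)  # item_id -> sum of squared weights
    dots = defaultdict(float)   # (item1, item2) -> dot product over shared users
    for items in by_user.values():
        for item_id, weight in items.items():
            norms[item_id] += weight * weight
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    return [
        (a, b, dot / sqrt(norms[a] * norms[b]))
        for (a, b), dot in sorted(dots.items())
        if norms[a] > 0 and norms[b] > 0
    ]


# Illustrative data only: three users, three items, unit weights.
sample = [
    ("u1", "A", 1.0), ("u1", "B", 1.0),
    ("u2", "A", 1.0), ("u2", "B", 1.0), ("u2", "C", 1.0),
    ("u3", "C", 1.0),
]
print(item_similarities(sample))
# prints [('A', 'B', 1.0), ('A', 'C', 0.5), ('B', 'C', 0.5)]
```

Only item pairs that share at least one user appear in the output. With very sparse input few pairs share a user, which is one way to read the failure mode in point 5.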

21. Collaborative Filtering on Hadoop

22. 2. K-Means Clustering

23. K-Means Clustering (cont.)
    1. To find anomalous events.
    2. Vectors of normally distributed values.
    3. Cluster centroids.
    4. The choice(s) of K.
    5. The points aren’t even remotely normally distributed.
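A minimal single-machine sketch of slide 23: plain Lloyd's-algorithm k-means takes vectors as input and returns cluster centroids, and distance to the nearest centroid can then serve as an anomaly score. The helper names `kmeans` and `anomaly_score`, the fixed iteration count, and the seeded random initialization are all assumptions made for illustration, not details from the talk.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm over a list of tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster ended up empty.
        updated = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                updated.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                updated.append(old)
        centroids = updated
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(dist(point, c) for c in centroids)


# Illustrative data only: two tight, well-separated groups of points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
centroids = kmeans(points, k=2)
print(sorted(centroids))
print(anomaly_score((5.0, 5.0), centroids), anomaly_score((0.0, 1.0), centroids))
```

The only parameter that changes the shape of the result here is `k`, which matches point 4. A point far from every centroid, such as `(5.0, 5.0)` in this toy data, gets a larger score than a point inside a cluster.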

24. K-Means on Hadoop

25. 3. Random Forests

26. Random Forests (cont.)
    1. To classify and predict.
    2. A dependent variable and many independent variables.
    3. Lots and lots of little trees.
    4. The number of variables to consider at each level.
    5. Too many independent variables.

27. Random Forests on Hadoop
    • R’s randomForest and rhadoop tools
    • Map: partition the input data among the reducers
    • Reduce: fit the random forests to each partition
    • Re-combine the resulting trees in the client
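The map, reduce, and re-combine steps on slide 27 can be mimicked in one Python process. This sketch does not use R's randomForest or the rhadoop tools: as a stand-in, each "reducer" fits a few decision stumps (one-level trees) on bootstrap samples of its partition, and the "client" concatenates the stumps and predicts by majority vote. Every function name below is hypothetical, and `n_features_to_try` plays the role of the number of variables considered at each level.

```python
import random
from collections import Counter


def fit_stump(rows, labels, n_features_to_try, rng):
    """Fit a one-level tree using a random subset of the features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in rng.sample(range(len(rows[0])), n_features_to_try):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def map_partition(rows, labels, n_partitions):
    """Map step: deal each record to one partition (round-robin here)."""
    parts = [([], []) for _ in range(n_partitions)]
    for i, (row, label) in enumerate(zip(rows, labels)):
        parts[i % n_partitions][0].append(row)
        parts[i % n_partitions][1].append(label)
    return parts


def reduce_fit(rows, labels, n_trees, n_features_to_try, seed):
    """Reduce step: fit a small forest on bootstrap samples of one partition."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, n_features_to_try, rng))
    return forest


def fit_forest(rows, labels, n_partitions=2, n_trees=5, n_features_to_try=1):
    """Client step: run map and reduce, then concatenate all trees."""
    parts = map_partition(rows, labels, n_partitions)
    forests = [
        reduce_fit(part_rows, part_labels, n_trees, n_features_to_try, seed)
        for seed, (part_rows, part_labels) in enumerate(parts)
    ]
    return [tree for forest in forests for tree in forest]


def predict(forest, row):
    """Majority vote over every tree in the combined forest."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Illustrative data only: class 0 has small values, class 1 has large ones.
rows = [(float(i), float(i)) for i in range(20)]
labels = [0] * 10 + [1] * 10
forest = fit_forest(rows, labels)
print(len(forest), predict(forest, (-5.0, -5.0)), predict(forest, (25.0, 25.0)))
```

Because each partition is fitted independently, the reduce calls could run in parallel; only the final concatenation and voting need to happen in one place.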

28. The Art of Model Design

29. Caution: Mind the Gap

30. The Joy of Experiments

31. Introduction to Data Science: Building Recommender Systems http://university.cloudera.com/

32. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills
