Data Science Day New York: Data Science: A Personal History

Follow Jeff Hammerbacher's path from Facebook, where he built scalable systems on Hadoop, to co-founding Cloudera and building an organization that provides the leading Hadoop platform.

  1. Data Science: A Personal History
     Jeff Hammerbacher
  2. Data Scientist
  3. Data Applications Scientist
     “I have only heard back from one person about that ‘Data Applications Scientist’ thing. I had anticipated more discussion” – me, February 29, 2008
  4. “I guess I’m arguing for ‘Data’ to replace ‘Research’ in those titles (I am happy to drop the ‘Applications’) as the primary focus of our organization is not corporate research.” – me, March 1, 2008
  5. Data Scientist
     “I’d like to avoid specialization at this early stage and I expect every member of our group to have a mix of research, engineering, and analysis in their workload.” – me, March 1, 2008
  6. Facebook Data Team
     The Facebook Data Team built scalable platforms for the collection, management, and analysis of data. We used these platforms to drive informed decisions in areas critical to the success of the company and to build data-intensive products and services.
  7. Data Science
  8. Introduction to Data Science
     1. Data Preparation
     2. Data Presentation
     3. Experimentation
     4. Observation
     5. Data Products
  9. Data Scientist-Computer Symbiosis
  10. Philosophy
      • Instrument everything
      • Put all of your data in one place
      • Data first, questions later
      • Store first, structure later
      • Keep raw data forever
      • Let everyone party on the data
      • Produce tools to support the whole research cycle
      • Modular and composable infrastructure
  11. CDH
      • Storage
          • Append-only unstructured data
          • Append-only tabular data
          • Mutable tabular data
  12. CDH
      • Compute
          • Resource management
          • Parallel frameworks
          • High-level interfaces
          • Libraries
  13. CDH
      • Integration
          • File system API
          • Database API
          • Batch data import/export
          • Event data import
          • User interface
  14. Cloudera Products
      • Subscription
          • Proprietary software
          • Support
      • Training and Certification
      • Services
  15. Cloudera Deployment
      [Diagram; component labels: Application Database, Data Warehouse, CDH, Business Intelligence, Analytics]
  16. Cloudera Workloads (Batch)
      • Active archive
      • Data reservoir
      • ETL/ELT offload
  17. Cloudera Workloads (Interactive)
      • Application data delivery
  18. Cloudera Customer Survey
      • 67% use Hive
      • 54% use HBase
      • 51% load data every 90 minutes or less
      • 71% move data from Hadoop to RDBMS for interactive SQL
      • 62% would like to consolidate into single platform
  19. Cloudera Impala
      • General-purpose SQL query engine
          • Should work both for analytic and transactional workloads
          • Will support queries that take from microseconds to hours
  20. Cloudera Impala
      • Runs directly within Hadoop
          • Reads widely used Hadoop file formats
          • Talks to widely used Hadoop storage managers
          • Runs on same nodes that run Hadoop processes
  21. Cloudera Impala
      • High performance
          • C++ instead of Java
          • Runtime code generation
          • Completely new execution engine, not MapReduce
  22. Cloudera Impala
      • Validated Beta Partners
          • MicroStrategy
          • QlikView
          • Tableau
          • Pentaho
          • Karmasphere
          • Capgemini
  23. New Cloudera Workloads (Interactive)
      • Operational reporting
      • Ad hoc query
  24. Cloudera Deployment
      [Diagram; component labels: Application Database, Data Warehouse, CDH, Business Intelligence, Analytics]
  25. The Future
  26. Potential Future Workloads
      • Search
      • MPI
      • Stream processing
      • Graph computations
      • Linear algebra
      • Optimization
      • Simulation
  27. The Last Mile
      • Data libraries
      • Language
      • Libraries
      • IDE for Data Scientists
      • Mixed-initiative
      • Memory
      • Collaboration
      • Model and analysis path selection
  28. Doing Data Science
      • More data sources
      • More rows
      • More columns (novel or derived)
      • Better data quality
      • Better outcomes
      • Better loss functions
      • Causal inference in observational studies
      • Effect size estimates
      • Meta-analysis
      • Model lifecycle
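The "Store first, structure later" and "Keep raw data forever" points from the Philosophy slide can be illustrated with a small sketch. This is not code from the talk and not how CDH does it; it is a minimal Python illustration that assumes events arrive as JSON text lines and uses a plain local file in place of a distributed store. The function names (`append_raw_event`, `read_events`, `count_by_field`, `demo`) and the sample events are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def append_raw_event(log_path: Path, raw_line: str) -> None:
    """Store first: append the event exactly as received, with no parsing."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(raw_line.rstrip("\n") + "\n")


def read_events(log_path: Path):
    """Structure later: parse each raw line only when a question is asked."""
    with log_path.open(encoding="utf-8") as log:
        for line in log:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Malformed lines stay in the raw log; this reader skips them.
                continue


def count_by_field(log_path: Path, field: str) -> Counter:
    """One possible later question: how often does each value of `field` occur?"""
    return Counter(
        event[field]
        for event in read_events(log_path)
        if isinstance(event, dict) and field in event
    )


def demo() -> Counter:
    """Write a few hypothetical raw events, then ask a question of them."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "events.log"
        for raw in (
            '{"action": "click", "user": 1}',
            '{"action": "view", "user": 2, "page": "/home"}',
            "not json at all",
            '{"action": "click", "user": 3}',
        ):
            append_raw_event(log_path, raw)
        return count_by_field(log_path, "action")


if __name__ == "__main__":
    print(demo())
```

Because the writer never parses or rejects anything, a different reader written later can apply a different structure to the same raw lines, including the line this reader skips.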