Data Science Day New York: Data Science: A Personal History

Understand the path Jeff Hammerbacher took, from building scalable systems on Hadoop at Facebook to co-founding Cloudera and building an organization that provides the leading Hadoop platform.

Transcript

  1. Data Science: A Personal History Jeff Hammerbacher
  2. Data Scientist
  3. Data Applications Scientist “I have only heard back from one person about that ‘Data Applications Scientist’ thing. I had anticipated more discussion” – me, February 29, 2008
  4. “I guess I’m arguing for ‘Data’ to replace ‘Research’ in those titles (I am happy to drop the ‘Applications’) as the primary focus of our organization is not corporate research.” – me, March 1, 2008
  5. Data Scientist “I’d like to avoid specialization at this early stage and I expect every member of our group to have a mix of research, engineering, and analysis in their workload.” – me, March 1, 2008
  6. Facebook Data Team The Facebook Data Team built scalable platforms for the collection, management, and analysis of data. We used these platforms to drive informed decisions in areas critical to the success of the company and to build data-intensive products and services.
  7. Data Science
  8. Introduction to Data Science 1. Data Preparation 2. Data Presentation 3. Experimentation 4. Observation 5. Data Products
  9. Data Scientist-Computer Symbiosis
  10. Philosophy • Instrument everything • Put all of your data in one place • Data first, questions later • Store first, structure later • Keep raw data forever • Let everyone party on the data • Produce tools to support the whole research cycle • Modular and composable infrastructure
  11. CDH • Storage • Append-only unstructured data • Append-only tabular data • Mutable tabular data
  12. CDH • Compute • Resource management • Parallel frameworks • High-level interfaces • Libraries
  13. CDH • Integration • File system API • Database API • Batch data import/export • Event data import • User interface
  14. Cloudera Products • Subscription • Proprietary software • Support • Training and Certification • Services
  15. Cloudera Deployment [diagram; labels: Application Database, Data Warehouse, CDH, Business Intelligence, Analytics]
  16. Cloudera Workloads (Batch) • Active archive • Data reservoir • ETL/ELT offload
  17. Cloudera Workloads (Interactive) • Application data delivery
  18. Cloudera Customer Survey • 67% use Hive • 54% use HBase • 51% load data every 90 minutes or less • 71% move data from Hadoop to RDBMS for interactive SQL • 62% would like to consolidate into single platform
  19. Cloudera Impala • General-purpose SQL query engine • Should work both for analytic and transactional workloads • Will support queries that take from microseconds to hours
  20. Cloudera Impala • Runs directly within Hadoop • Reads widely used Hadoop file formats • Talks to widely used Hadoop storage managers • Runs on same nodes that run Hadoop processes
  21. Cloudera Impala • High performance • C++ instead of Java • Runtime code generation • Completely new execution engine, not MapReduce
  22. Cloudera Impala • Validated Beta Partners • MicroStrategy • QlikView • Tableau • Pentaho • Karmasphere • Capgemini
  23. New Cloudera Workloads (Interactive) • Operational reporting • Ad hoc query
  24. Cloudera Deployment [diagram; labels: Application Database, Data Warehouse, CDH, Business Intelligence, Analytics]
  25. The Future
  26. Potential Future Workloads • Search • MPI • Stream processing • Graph computations • Linear algebra • Optimization • Simulation
  27. The Last Mile • Data libraries • Language • Libraries • IDE for Data Scientists • Mixed-initiative • Memory • Collaboration • Model and analysis path selection
  28. Doing Data Science • More data sources • More rows • More columns (novel or derived) • Better data quality • Better outcomes • Better loss functions • Causal inference in observational studies • Effect size estimates • Meta-analysis • Model lifecycle
