Data Science Day New York: Data Science: A Personal History
Follow Jeff Hammerbacher's path from Facebook, where he built scalable systems on Hadoop, to co-founding Cloudera and building an organization that provides the leading Hadoop platform.


Transcript

  • 1. Data Science: A Personal History Jeff Hammerbacher
  • 2. Data Scientist
  • 3. Data Applications Scientist “I have only heard back from one person about that ‘Data Applications Scientist’ thing. I had anticipated more discussion” – me, February 29, 2008
  • 4. “I guess I’m arguing for ‘Data’ to replace ‘Research’ in those titles (I am happy to drop the ‘Applications’) as the primary focus of our organization is not corporate research.” – me, March 1, 2008
  • 5. Data Scientist “I’d like to avoid specialization at this early stage and I expect every member of our group to have a mix of research, engineering, and analysis in their workload.” – me, March 1, 2008
  • 6. Facebook Data Team The Facebook Data Team built scalable platforms for the collection, management, and analysis of data. We used these platforms to drive informed decisions in areas critical to the success of the company and to build data-intensive products and services.
  • 7. Data Science
  • 8. Introduction to Data Science 1. Data Preparation 2. Data Presentation 3. Experimentation 4. Observation 5. Data Products
  • 9. Data Scientist-Computer Symbiosis
  • 10. Philosophy • Instrument everything • Put all of your data in one place • Data first, questions later • Store first, structure later • Keep raw data forever • Let everyone party on the data • Produce tools to support the whole research cycle • Modular and composable infrastructure
  • 11. CDH • Storage • Append-only unstructured data • Append-only tabular data • Mutable tabular data
  • 12. CDH • Compute • Resource management • Parallel frameworks • High-level interfaces • Libraries
  • 13. CDH • Integration • File system API • Database API • Batch data import/export • Event data import • User interface
  • 14. Cloudera Products • Subscription • Proprietary software • Support • Training and Certification • Services
  • 15. Cloudera Deployment [Diagram labels: Application Database, CDH, Data Warehouse, Business Intelligence, Analytics]
  • 16. Cloudera Workloads (Batch) • Active archive • Data reservoir • ETL/ELT offload
  • 17. Cloudera Workloads (Interactive) • Application data delivery
  • 18. Cloudera Customer Survey • 67% use Hive • 54% use HBase • 51% load data every 90 minutes or less • 71% move data from Hadoop to RDBMS for interactive SQL • 62% would like to consolidate into single platform
  • 19. Cloudera Impala • General-purpose SQL query engine • Should work both for analytic and transactional workloads • Will support queries that take from microseconds to hours
  • 20. Cloudera Impala • Runs directly within Hadoop • Reads widely used Hadoop file formats • Talks to widely used Hadoop storage managers • Runs on same nodes that run Hadoop processes
  • 21. Cloudera Impala • High performance • C++ instead of Java • Runtime code generation • Completely new execution engine, not MapReduce
  • 22. Cloudera Impala • Validated Beta Partners • MicroStrategy • QlikView • Tableau • Pentaho • Karmasphere • Capgemini
  • 23. New Cloudera Workloads (Interactive) • Operational reporting • Ad hoc query
  • 24. Cloudera Deployment [Diagram labels: Application Database, CDH, Data Warehouse, Business Intelligence, Analytics]
  • 25. The Future
  • 26. Potential Future Workloads • Search • MPI • Stream processing • Graph computations • Linear algebra • Optimization • Simulation
  • 27. The Last Mile • Data libraries • Language • Libraries • IDE for Data Scientists • Mixed-initiative • Memory • Collaboration • Model and analysis path selection
  • 28. Doing Data Science • More data sources • More rows • More columns (novel or derived) • Better data quality • Better outcomes • Better loss functions • Causal inference in observational studies • Effect size estimates • Meta-analysis • Model lifecycle