Why Hadoop for data science?

  1. Why Hadoop for data science? Ofer Mendelevitch. PASS BA Conference, April 2013. © Hortonworks Inc. 2013
  2. A brief history of Apache Hadoop
     [Timeline, 2004 to 2013. Milestones: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop]
     • 2005: Yahoo! creates team under E14 to work on Hadoop. Focus on INNOVATION
     • 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters. Focus on OPERATIONS
     • 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo. STABILITY
  3. Core Hadoop: HDFS & MapReduce. Deliver high-scale storage & processing
     • HDFS: distributed, self-healing data store
     • MapReduce: distributed computation framework that handles the complexities of distributed programming
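To make the MapReduce model on this slide concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the mappers and reducers on separate nodes next to the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The programmer writes only the mapper and the reducer; grouping by key, distribution, and failure handling are what the framework is described as taking care of.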
  4. Keys to Hadoop’s power
     • Computation co-located with data
       – Data and computation system co-designed and co-developed to work together
     • Process data in parallel across thousands of “commodity” hardware nodes
       – Self-healing; failure handled by software
     • Designed for one write and multiple reads
       – There are no random writes
       – Optimized for minimum seek on hard drives
  5. HDP: Enterprise-Ready Hadoop
     [Diagram: Hortonworks Data Platform (HDP) stack]
     • Operational services (manage & operate at scale) and data services (store, process and access data): Ambari, Flume, Hive, Pig, HBase, Sqoop, Oozie, HCatalog
     • Hadoop core (distributed storage & processing): MapReduce, HDFS
     • Platform services (enterprise readiness): HA, DR, Snapshots, Security, …
     • OS / VM, Cloud, Appliance
  6. What is a data product?
  7. What is a data product? “A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
  8. Example 1: Google Adwords
  9. Example 2: People you may know
  10. Example 3: spell correction
  11. What is data science?
  12. What is data science? #1: Extracting deep meaning from data (data mining; finding “gems” in data)
  13. Common data science tasks
      • Descriptive
        – Clustering: detect natural groupings
        – Outlier detection: detect anomalies
        – Affinity analysis: co-occurrence patterns
      • Predictive
        – Classification: predict a category
        – Regression: predict a value
        – Recommendation: predict a preference
  14. What is data science? #2: Building data products (delivering gems on a regular basis)
      [Diagram labels: Online serving, Pre-process, Build model, SQL, Periodic batch processing]
  15. Why Hadoop for data science? Reason #1: Explore full datasets
  16. Explore large datasets directly with Hadoop
      [Diagram: researcher laptop (R, Matlab, SAS, etc.) and the full dataset stored on Hadoop; cycle labels: Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate]
  17. Integrate Hadoop in your data analysis flow
      • Exploratory data analysis on full dataset
        – Simple statistics: mean, median, quantile, etc.
        – Pre-processing: grep, regex, etc.
      • Ad-hoc sampling / filtering
        – Random: with or without replacement
        – Sample by unique key
        – K-fold cross-validation
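The "sample by unique key" and "K-fold cross-validation" items above can be sketched with a deterministic hash of the key, which lets every node make the same keep/drop decision without coordination. This is one possible approach, not necessarily the one used in the talk; `_bucket`, `sample_by_key`, and `assign_fold` are hypothetical names, and the choice of MD5 is only an assumption for a stable hash.

```python
import hashlib


def _bucket(key: str, buckets: int) -> int:
    """Map a key to a stable bucket number in [0, buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def sample_by_key(records, key_field, percent):
    """Keep every record whose key falls in the first `percent` of 100 buckets.

    All records sharing a key are kept or dropped together.
    """
    return [r for r in records if _bucket(str(r[key_field]), 100) < percent]


def assign_fold(key, k):
    """Assign a key to one of k cross-validation folds, deterministically."""
    return _bucket(str(key), k)
```

Because the decision depends only on the key, the same function can run inside a mapper on each partition, and a user's records never end up split between a training fold and a test fold.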
  18. Why Hadoop for data science? Reason #2: Mine larger datasets
  19. More data -> better outcomes (Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009)
  20. Learning algorithms with large datasets…
      Challenges:
      • Data won’t fit in memory
      • Learning takes a lot longer…
      Using Hadoop:
      • Distribute data across nodes in the Hadoop cluster
      • Implement a distributed/parallel algorithm
        – Recommendation: Alternating Least Squares (ALS)
        – Clustering: K-means
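As an illustration of the "distributed/parallel algorithm" point for K-means, the sketch below splits each iteration into a per-partition map step (partial sums per centroid) and a reduce step (merge the sums, recompute centroids). It runs in a single process and simulates partitions with plain lists; the function names are hypothetical and this is not the implementation of any particular Hadoop library.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[List[float], int]]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Mapper: for one partition, accumulate (coordinate sums, count) per centroid."""
    partial: Partial = {}
    for point in points:
        idx = nearest(point, centroids)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials: Sequence[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Reducer: merge partial sums from all partitions into new centroids."""
    totals: Partial = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            acc, n = totals.get(idx, ([0.0] * len(sums), 0))
            totals[idx] = ([a + s for a, s in zip(acc, sums)], n + count)
    updated: List[Point] = []
    for idx, old in enumerate(centroids):
        if idx in totals:
            sums, count = totals[idx]
            updated.append(tuple(s / count for s in sums))
        else:
            updated.append(old)  # a centroid with no points keeps its position
    return updated


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds over the partitions."""
    for _ in range(iterations):
        partials = [map_partition(p, centroids) for p in partitions]
        centroids = reduce_partials(partials, centroids)
    return centroids
```

Only the small per-centroid sums travel between the map and reduce steps, so the full dataset never has to fit in one machine's memory.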
  21. Why Hadoop for data science? Reason #3: Large-scale data preparation
  22. 80% of data science work is data preparation
      [Diagram: Raw Data to Processed Data, via the steps below]
      • Sampling, filtering
      • Joins
      • Entity resolution
      • Strip away HTML/PDF/DOC/PPT
      • Document vector generation
      • Term normalization
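Three of the preparation steps named on this slide (stripping HTML, term normalization, document vector generation) can be sketched with the Python standard library. This is a deliberately naive illustration: the helper names are hypothetical, normalization here is just lowercasing and splitting on non-alphanumerics, and the "vector" is a plain term-count bag.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of `markup` with tags removed.

    Naive: text inside <script> or <style> elements is kept as well.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(markup):
    """Build a term-frequency vector (term -> count) for one document."""
    return Counter(normalize_terms(strip_html(markup)))
```

Each document is processed independently, which is why this kind of cleanup fits a batch job that maps over a large collection of files.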
  23. Hadoop is ideal for batch data preparation and cleanup of large datasets
  24. Why Hadoop for data science? Reason #4: Accelerate data-driven innovation
  25. Barriers to speed with traditional data architectures
      • RDBMS uses “schema on write”; change is expensive
      • High barrier for data-driven innovation
      [Timeline: Start: “I need new data” (schema change project); 6 months: “Finally, we start collecting”; 9 months: “Let me see… is it any good?”]
  26. “Schema on read” means faster time-to-innovation
      • Hadoop uses “schema on read”
      • Low barrier for data-driven innovation
      [Timeline: Start: “I need new data” (“Let’s just put it in a folder on HDFS”); 3 months: “Let me see… is it any good?”; 6 months: “My model is awesome!”]
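A small sketch may help show what "schema on read" means in practice: the raw file is stored untouched, and column names and types are applied only when the data is read. The sample data, the two schemas, and `read_with_schema` are all hypothetical and stand in for files in an HDFS folder.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no declared schema.
RAW_FILE = "2013-04-01,alice,42.50,US\n2013-04-02,bob,17.00,DE\n"

# Two schemas applied to the same bytes at read time.
SCHEMA_V1 = [("date", str), ("user", str), ("amount", float)]
SCHEMA_V2 = SCHEMA_V1 + [("country", str)]


def read_with_schema(raw, schema):
    """Parse raw CSV text, naming and converting columns per `schema`.

    Columns beyond the schema's length are ignored, so an older schema
    still works after new fields start being collected.
    """
    for row in csv.reader(io.StringIO(raw)):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}
```

Starting to use the new `country` field means writing a new reader schema, with no migration of the stored data, which is the contrast with the schema change project on the previous slide.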
  27. Summary: Why use Hadoop for data science?
      1. Data exploration with full datasets
      2. Mine larger datasets
      3. Pre-processing at scale
      4. Faster data-driven cycles
  28. Quick start: Hortonworks Sandbox
      • What is it
        – A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
        – A personal Hadoop environment
        – An integrated learning environment with frequently and easily updatable hands-on, step-by-step tutorials
      • What it does
        – Dramatically accelerates the process of learning Apache Hadoop
        – Accelerates and validates the use of Hadoop within your unique data architecture
        – Lets you use your data to explore and investigate your use cases
      • ZERO to big data in 15 minutes
      Download Hortonworks Sandbox: www.hortonworks.com/sandbox
      Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/
  29. Thank you! Any Questions? Ofer Mendelevitch, Director, Data Sciences @ Hortonworks. ofer@hortonworks.com. @ofermend, @hortonworks. Come visit us @ Booth S5. We’re hiring!
