Why Hadoop for data science?

  • Add 2007: formed first engineering team focused on this? Want to make point that 5 years of experience. Add: timing that Cloudera bailed. Started as Nutch project at Yahoo, became Hadoop.
  • At its core, Hadoop is HDFS and MapReduce: two projects that provide the distributed storage and data processing underpinning everything else. In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but by ensuring that these aspects are addressed within existing projects.

    1. Why Hadoop for data science?
       Ofer Mendelevitch
       PASS BA Conference, April 2013
       © Hortonworks Inc. 2013
    2. A brief history of Apache Hadoop
       Timeline, 2004 to 2013: Apache Project Established; Yahoo! begins to Operate at scale; Hortonworks Data Platform; Enterprise Hadoop
       • 2005: Yahoo! creates team under E14 to work on Hadoop (Focus on INNOVATION)
       • 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters (Focus on OPERATIONS)
       • 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo (STABILITY)
    3. Core Hadoop: HDFS & MapReduce
       Deliver high-scale storage & processing
       • HDFS: distributed, self-healing data store
       • MapReduce: distributed computation framework that handles the complexities of distributed programming
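The division of labour in the MapReduce model named on this slide can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the shuffle that the framework would perform across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this across nodes; here it is
    simulated with a dictionary in a single process.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

The programmer only writes the map and reduce functions; partitioning the input, moving data between nodes and retrying failed tasks are the "complexities of distributed programming" that the framework takes over.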
    4. Keys to Hadoop’s power
       • Computation co-located with data
         – Data and computation system co-designed and co-developed to work together
       • Process data in parallel across thousands of “commodity” hardware nodes
         – Self-healing; failure handled by software
       • Designed for one write and multiple reads
         – There are no random writes
         – Optimized for minimum seek on hard drives
    5. HDP: Enterprise-Ready Hadoop
       HORTONWORKS DATA PLATFORM (HDP)
       • OPERATIONAL SERVICES (Manage & Operate at Scale): AMBARI, OOZIE
       • DATA SERVICES (Store, Process and Access Data): FLUME, HIVE, PIG, HBASE, SQOOP, HCATALOG
       • HADOOP CORE (Distributed Storage & Processing): MAP REDUCE, HDFS
       • PLATFORM SERVICES (Enterprise Readiness): HA, DR, Snapshots, Security, …
       • Deployment: OS / VM, Cloud, Appliance
    6. What is a data product?
    7. What is a data product?
       “A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
    8. Example 1: Google Adwords
    9. Example 2: People you may know
    10. Example 3: spell correction
    11. What is data science?
    12. What is data science?
        #1: Extracting deep meaning from data (data mining; finding “gems” in data)
    13. Common data science tasks
        • Descriptive
          – Clustering: detect natural groupings
          – Outlier detection: detect anomalies
          – Affinity analysis: co-occurrence patterns
        • Predictive
          – Classification: predict a category
          – Regression: predict a value
          – Recommendation: predict a preference
    14. What is data science?
        #2: Building data products (delivering gems on a regular basis)
        [Diagram labels: Online serving; Pre-process; Build model; SQL; Periodic batch processing]
    15. Why Hadoop for data science?
        Reason #1: Explore full datasets
    16. Explore large datasets directly with Hadoop
        [Diagram: researcher laptop (R, Matlab, SAS, etc.) working against the full dataset stored on Hadoop; cycle steps: Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate]
    17. Integrate Hadoop in your data analysis flow
        • Exploratory data analysis on full dataset
          – Simple statistics: mean, median, quantile, etc.
          – Pre-processing: grep, regex, etc.
        • Ad-hoc sampling / filtering
          – Random: with or without replacement
          – Sample by unique key
          – K-fold cross-validation
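Two of the sampling ideas listed on this slide, sampling by unique key and K-fold assignment, can be sketched with a stable hash of the key. This is one common approach rather than something the slides prescribe; the helper names (`bucket`, `sample_by_key`, `assign_folds`) and the record layout are assumptions made for illustration. Because the decision depends only on the key, each record can be handled independently, which is what makes the idea suitable for a map-only job.

```python
import hashlib


def bucket(key, n_buckets):
    """Deterministically map a key to a bucket in [0, n_buckets).

    A stable hash (MD5 here) is used instead of Python's built-in hash(),
    which is randomized between processes for strings.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def sample_by_key(records, key_field, percent):
    """Keep all records for roughly `percent` percent of the distinct keys."""
    return [r for r in records if bucket(r[key_field], 100) < percent]


def assign_folds(records, key_field, k):
    """Add a `fold` id in [0, k) so all records of one key share a fold."""
    return [dict(r, fold=bucket(r[key_field], k)) for r in records]
```

Sampling by key rather than by row keeps every record of a selected user (or session, or document) together, so per-key statistics computed on the sample are not distorted by partial histories.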
    18. Why Hadoop for data science?
        Reason #2: Mine larger datasets
    19. More data -> better outcomes
        (Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009)
    20. Learning algorithms with large datasets…
        Challenges:
        • Data won’t fit in memory
        • Learning takes a lot longer…
        Using Hadoop:
        • Distribute data across nodes in the Hadoop cluster
        • Implement a distributed/parallel algorithm
          – Recommendation: Alternating Least Squares (ALS)
          – Clustering: K-means
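As an illustration of what "a distributed/parallel algorithm" can look like, here is a minimal sketch of K-means written as repeated map and reduce steps. It runs in a single process and only mimics the structure of a distributed job: the `splits` argument stands in for data blocks held on different nodes, and all function names are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(points, centroids):
    """Map step: emit (centroid_index, point) for every point in one split."""
    return [(nearest(point, centroids), point) for point in points]


def kmeans_reduce(pairs, centroids):
    """Reduce step: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centroids = list(centroids)  # a centroid with no points stays where it is
    for index, members in groups.items():
        new_centroids[index] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(splits, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds over all data splits."""
    for _ in range(iterations):
        pairs = [pair for split in splits for pair in kmeans_map(split, centroids)]
        centroids = kmeans_reduce(pairs, centroids)
    return centroids
```

Only the small list of centroids has to be shared with every mapper in each round; the points themselves never leave their split, which is why the data no longer needs to fit in one machine's memory.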
    21. Why Hadoop for data science?
        Reason #3: Large-scale data preparation
    22. 80% of data science work is data preparation
        Raw Data -> Processed Data:
        • Sampling, filtering
        • Joins
        • Entity resolution
        • Strip away HTML/PDF/DOC/PPT
        • Document vector generation
        • Term normalization
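Three of the preparation steps named on this slide (stripping HTML, term normalization and document vector generation) can be sketched for a single document using only the Python standard library. The function names are hypothetical and the choices are deliberately simple assumptions: normalization is just lowercasing plus alphanumeric tokenization, and the "vector" is a bag-of-words term count.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text between tags and discard the tags themselves.

    Simplification: text inside <script> or <style> is collected too.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(markup):
    """Bag-of-words vector: term -> number of occurrences in the document."""
    return Counter(normalize_terms(strip_html(markup)))
```

Each document is processed independently of all others, so this kind of cleanup parallelizes naturally as a map step over a large corpus.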
    23. Hadoop is ideal for batch data preparation and cleanup of large datasets
    24. Why Hadoop for data science?
        Reason #4: Accelerate data-driven innovation
    25. Barriers to speed with traditional data architectures
        • RDBMS uses “schema on write”; change is expensive
        • High barrier for data-driven innovation
        Timeline:
          – Start: “I need new data” (schema change project begins)
          – 6 months: “Finally, we start collecting”
          – 9 months: “Let me see… is it any good?”
    26. “Schema on read” means faster time-to-innovation
        • Hadoop uses “schema on read”
        • Low barrier for data-driven innovation
        Timeline:
          – Start: “I need new data” (“Let’s just put it in a folder on HDFS”)
          – 3 months: “Let me see… is it any good?”
          – 6 months: “My model is awesome!”
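The "schema on read" idea can be made concrete with a small sketch: the raw file is stored untouched, and column names and types are applied only when the data is read. The sample data, the two schemas and the `read_with_schema` helper are all hypothetical and exist only to illustrate the principle.

```python
import csv
import io

# Hypothetical raw file contents, stored exactly as they were collected.
RAW = "2013-04-10,alice,3\n2013-04-11,bob,5\n"

# Two schemas applied to the same raw bytes. Adding the `clicks` column
# later requires no rewrite of the stored data, only a new schema.
SCHEMA_V1 = [("day", str), ("user", str)]
SCHEMA_V2 = [("day", str), ("user", str), ("clicks", int)]


def read_with_schema(raw_text, schema):
    """Parse comma-separated text, naming and casting fields at read time.

    Fields beyond the end of the schema are ignored.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows
```

With "schema on write", the equivalent of moving from `SCHEMA_V1` to `SCHEMA_V2` would be a table migration before any new data could be loaded; here both interpretations coexist over the same file.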
    27. Summary
        Why use Hadoop for data science?
        1. Data exploration with full datasets
        2. Mine larger datasets
        3. Pre-processing at scale
        4. Faster data-driven cycles
    28. Quick start: Hortonworks Sandbox
        • What is it
          – A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
          – A personal Hadoop environment
          – An integrated learning environment with frequently and easily updatable hands-on, step-by-step tutorials
        • What it does
          – Dramatically accelerates the process of learning Apache Hadoop
          – Accelerates and validates the use of Hadoop within your unique data architecture
          – Lets you use your data to explore and investigate your use cases
        • ZERO to big data in 15 minutes
        Download Hortonworks Sandbox: www.hortonworks.com/sandbox
        Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/
    29. Thank you! Any Questions?
        Ofer Mendelevitch
        Director, Data Sciences @ Hortonworks
        ofer@hortonworks.com
        @ofermend, @hortonworks
        Come visit us @ Booth S5
        We’re hiring!
