10c introduction


Published in: Education, Technology
  • Arash,

    The slide you are talking about is an animation slide. You do not see all the transition...
  • The word count example on Slide 16 needs to be more clear to show parallel processing - To someone who does not have previous experience with Hadoop or has not seen this example in the past, the distinction between each parallel task in the map phase tokenizing the document is not seen, nor the consolidation done for each parallel reduce task. For this example, I would suggest showing several documents (or pages from Alice) as the input each going to a different map task and showing the key value pairs, lists, etc in different text boxes during the map and reduce phase. Ultimately combining those into one output with a consolidated list. (Perhaps adding in something in the shuffle and sort phase showing sorting and assigning by alphabetical order.)
  • Please add your comments and suggestions with page number... Thank you!
  • Problem: Scaling reliably is hard. What you need is a fault-tolerant store and a fault-tolerant framework: handle hardware faults transparently and efficiently, with high availability that does not depend on any one component. Even on a big cluster, some things take days, and even simple things are complicated in a failure-rich environment. Every point is a point where things can fail, and that failure has to be managed. With many computers and many disks, failures are common: with 1,000 computers × 10 disks each, we can expect 1 node failure and 10 disk failures per day. Some failures are intermittent or difficult to detect. Computation must succeed, and not run slower, in these conditions.
  • Apache Hadoop is a new paradigm. It scales to thousands of commodity computers and can effectively use all cores and spindles simultaneously; if you buy hardware, you want to maximize its use. It is a new software stack built on a different foundation, not very mature yet, but in use by most Web 2.0 companies and many Fortune 500 firms.
  • The first is "simple algorithms and lots of data trump complex models." This comes from an IEEE article written by three research directors at Google, titled "The Unreasonable Effectiveness of Data." It was a reaction to an article called "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," which made the point that simple formulas can explain the complex natural world, the most famous example being E=mc² in physics. The Google paper noted that economists were jealous, since they lacked similar models to neatly explain human behavior. But in natural language processing, a notoriously complex area that has been studied for years with many AI attempts at addressing it, the authors found that relatively simple approaches applied to massive data produced stunning results. They cited an example of scene completion: an algorithm eliminates something in a picture, a car for instance, and, based on a corpus of thousands of pictures, fills in the missing background. The algorithm did rather poorly until the corpus was increased to millions of photos; with that amount of data, the same algorithm performed extremely well. While not a direct example from financial services, I think it's a great analogy. After all, aren't you looking for an approach that can fill in the missing pieces of a picture or pattern?

    1. Introduction: MapR and Hadoop 7/6/2012 © 2012 MapR Technologies Introduction 1
    2. Introduction Agenda • Hadoop Overview • MapReduce Overview • Hadoop Ecosystem • How is MapR Different? • Summary
    3. Introduction Objectives At the end of this module you will be able to: • Explain why Hadoop is an important technology for effectively working with Big Data • Describe the phases of a MapReduce job • Identify some of the tools used with Hadoop • List the similarities and differences between MapR and other Hadoop distributions
    4. Hadoop Overview
    5. Data is Growing Faster than Moore's Law Business Analytics Requires a New Approach Data Volume Growing 44x 2010: 1.2 Zettabytes 2020: 35.2 Zettabytes (Source: IDC Digital Universe Study, sponsored by EMC, May 2010)
    6. Before Hadoop Web crawling to power search engines • Must be able to handle gigantic data • Must be fast! Problem: databases (B-Tree) are not so fast, and do not scale Solution: sort and merge • Eliminate the pesky seek time!
    7. How to Scale? Big Data has Big Problems • Petabytes of data • MTBF on 1000s of nodes is < 1 day • Something is always broken • There are limits to scaling Big Iron • Sequential and random access just don't scale
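    The "MTBF on 1000s of nodes is < 1 day" bullet follows from simple division: with roughly independent failures, a cluster's mean time between failures is the per-node MTBF divided by the node count. A minimal sketch, assuming an illustrative per-node MTBF of about 3 years (that figure is an assumption, not from the slides):

    ```python
    # Cluster-level MTBF estimate: per-node MTBF divided by node count,
    # assuming roughly independent node failures.

    NODE_MTBF_DAYS = 3 * 365   # assumed per-node MTBF (~3 years), illustrative only
    NODES = 1000               # cluster size from the slide

    cluster_mtbf_days = NODE_MTBF_DAYS / NODES
    print(f"{cluster_mtbf_days:.2f} days between failures")  # ~1.10 days
    ```

    With 1,000 nodes, even very reliable individual machines put the cluster at roughly one failure per day, which is why "something is always broken."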
    8. Example: Update 1% of 1TB • Data consists of 10 billion records, each 100 bytes • Task: Update 1% of these records
    9. Approach 1: Just Do It • Each update involves read, modify, and write – t = 1 seek + 2 disk rotations = 20 ms – 1% × 10^10 × 20 ms = 2 mega-seconds ≈ 23 days (552 hours) • Total time dominated by seek and rotation times
    10. Approach 2: The "Hard" Way • Copy the entire database 1 GB at a time • Update records sequentially – t = 2 × 1 GB / 100 MB/s + 20 ms ≈ 20 s – 10^3 × 20 s = 20,000 s ≈ 5.6 hours • 100x faster to move 100x more data! • Moral: Read data sequentially even if you only want 1% of it
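    The arithmetic behind the two approaches can be checked with a short sketch, using the numbers from the slides (20 ms per random update, 100 MB/s sequential throughput):

    ```python
    # Back-of-envelope comparison: random in-place updates vs. a full
    # sequential rewrite of a 1 TB database (10 billion x 100-byte records).

    RECORDS = 10_000_000_000      # 10 billion records
    RECORD_BYTES = 100            # 100 bytes each -> 1 TB total
    UPDATE_FRACTION = 0.01        # update 1% of the records
    SEEK_TIME_S = 0.020           # 1 seek + 2 disk rotations ~ 20 ms
    SEQ_RATE_BPS = 100_000_000    # 100 MB/s sequential transfer

    # Approach 1: seek to each record and rewrite it in place.
    random_secs = RECORDS * UPDATE_FRACTION * SEEK_TIME_S

    # Approach 2: stream the whole database through once (read + write).
    sequential_secs = 2 * RECORDS * RECORD_BYTES / SEQ_RATE_BPS

    print(f"random:     {random_secs / 3600:.0f} hours")    # 556 hours (~23 days)
    print(f"sequential: {sequential_secs / 3600:.1f} hours")  # 5.6 hours
    ```

    The hundredfold gap comes entirely from eliminating seeks: the sequential pass moves 100x more data yet finishes roughly 100x sooner.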
    11. Introducing Hadoop! • Now imagine you have thousands of disks on hundreds of machines with near-linear scaling – Commodity hardware – thousands of nodes! – Handles Big Data – petabytes and more! – Sequential file access – all spindles at once! – Sharding – data distributed evenly across the cluster – Reliability – self-healing, self-balancing – Redundancy – data replicated across multiple hosts and disks – MapReduce • Parallel computing framework • Moves the computation to the data
    12. Hadoop Architecture • MapReduce: parallel computing – Move the computation to the data – Minimizes network utilization • Distributed storage layer: keeping track of data and metadata – Data is sharded across the cluster • Cluster management tools • Applications and tools
    13. What's Driving Hadoop Adoption? "Simple algorithms and lots of data trump complex models" – Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
    14. MapReduce Overview
    15. MapReduce • A programming model for processing very large data sets – A framework for processing parallel problems across huge datasets using a large number of nodes – Brute-force parallel computing paradigm • Phases – Map • Job partitioned into "splits" – Shuffle and sort • Map output sent to reducer(s) using a hash – Reduce
    16. Inside Map-Reduce (diagram): lines from "The Walrus and the Carpenter" ("The time has come," the Walrus said, / "To talk of many things: / Of shoes—and ships—and sealing-wax) flow through Input → Map (word pairs such as the,1 time,1 come,1 has,1 …) → Shuffle and sort (grouped lists such as come,[3,2,1] has,[1,5,2] the,[1,2,1] time,[10,1,3] …) → Reduce (totals such as come,6 has,8 the,4 time,14 …) → Output
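    The word-count flow on this slide can be sketched as a plain-Python simulation of the three phases. This is not Hadoop API code, and the tokenization rule (lowercase, strip surrounding punctuation) is an assumption for illustration:

    ```python
    from collections import defaultdict

    def map_phase(split):
        """Map: tokenize each line of the input split into (word, 1) pairs."""
        for line in split:
            for token in line.lower().split():
                word = token.strip('",:;.!?')  # assumed tokenization rule
                if word:
                    yield (word, 1)

    def shuffle_and_sort(pairs):
        """Shuffle and sort: group all counts by word and sort the keys,
        as the framework does between the map and reduce phases."""
        groups = defaultdict(list)
        for word, count in pairs:
            groups[word].append(count)
        return sorted(groups.items())

    def reduce_phase(grouped):
        """Reduce: sum the list of counts for each word."""
        for word, counts in grouped:
            yield (word, sum(counts))

    split = [
        '"The time has come," the Walrus said,',
        '"To talk of many things:',
    ]
    result = dict(reduce_phase(shuffle_and_sort(map_phase(split))))
    print(result["the"])  # 2
    ```

    In a real cluster, many map tasks run this pipeline in parallel on different splits, the framework hashes each word to pick its reducer, and each reduce task emits its share of the totals; here everything runs in one process to show the data shapes at each phase.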
    17. JobTracker • Sends out tasks • Co-locates tasks with data • Gets data locations • Manages TaskTrackers
    18. TaskTracker • Performs tasks (Map, Reduce) • Slots determine the number of concurrent tasks • Notifies the JobTracker of completed tasks • Heartbeats to the JobTracker • Each task is a separate Java process
    19. Hadoop Ecosystem
    20. Hadoop Ecosystem • Pig: it will eat anything – High-level language, set algebra, careful semantics – Filter, transform, co-group, generate, flatten – Pig generates and optimizes MapReduce programs • Hive: busy as a bee – High-level language, more ad hoc than Pig – SQL-ish – Has a central metadata service – Loves external scripts • HBase: NoSQL for your cluster • Mahout: distributed/scalable machine learning algorithms
    21. How is MapR Different?
    22. Mostly, It's Not! • API-compatible – Move code over without modifications – Use the familiar Hadoop shell • Supports popular tools and applications – Hive, Pig, HBase (and Flume, if you want it)
    23. Very Different Where It Counts • No single point of failure • Faster shuffle, faster file creation • Read/write storage layer • NFS-mountable • Management tools – MCS, REST API, CLI • Data placement, protection, backup • HA at all layers (naming, NFS, JobTracker, MCS)
    24. Summary
    25. Questions