Introduction: MapR and Hadoop
  7/6/2012

© 2012 MapR Technologies   Introduction 1
Introduction
   Agenda
   • Hadoop Overview
   • MapReduce Overview
   • Hadoop Ecosystem
   • How is MapR Different?
   • Summary




Introduction
   Objectives
   At the end of this module you will be able to:
   • Explain why Hadoop is an important technology for effectively working with
     Big Data
   • Describe the phases of a MapReduce job
   • Identify some of the tools used with Hadoop
   • List the similarities and differences between MapR and other Hadoop
     distributions




Hadoop Overview




Data is Growing Faster than Moore’s Law

        Business Analytics Requires a New Approach



        Data volume growing 44x
        •   2010: 1.2 zettabytes
        •   2020: 35.2 zettabytes
        (Chart label: IDC Digital Universe Study 2011)



Source: IDC Digital Universe Study, sponsored by EMC, May 2010
Before Hadoop
  Web crawling to power search engines
  •    Must be able to handle gigantic data
  •    Must be fast!
  Problem: databases (B-Tree) not so fast, and do not scale
  Solution: Sort and Merge
  •    Eliminate the pesky seek time!




How to Scale?
  Big Data has Big Problems
  •   Petabytes of data
  •   MTBF on 1000s of nodes is < 1 day
  •   Something is always broken
  •   There are limits to scaling Big Iron
  •   Sequential and random access just don’t scale
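The MTBF bullet above can be made concrete with a rough estimate. This is only a sketch: it assumes nodes fail independently at a constant rate, so cluster MTBF is roughly node MTBF divided by node count. The 3-year per-node MTBF and the 2,000-node cluster are hypothetical figures chosen for illustration, not values from the slides.

```python
HOURS_PER_YEAR = 24 * 365


def cluster_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Approximate time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    failure rates simply add up across nodes.
    """
    return node_mtbf_hours / nodes


# Hypothetical inputs: each node fails about once every 3 years.
NODE_MTBF_HOURS = 3 * HOURS_PER_YEAR
NODES = 2000

estimate = cluster_mtbf_hours(NODE_MTBF_HOURS, NODES)
print(f"Expected time between node failures: {estimate:.1f} hours")
```

Under these assumed figures some node fails more often than once a day, which is why "something is always broken" at this scale.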




Example: Update 1% of 1TB

     Data consists of 10 billion records, each 100 bytes
     Task: Update 1% of these records




Approach 1: Just Do It

     Each update involves read, modify and write
      –   t = 1 seek + 2 disk rotations = 20ms
      –   1% x 10^10 x 20 ms = 2 megaseconds ≈ 23 days (about 555 hours)
     Total time dominated by seek and rotation times




Approach 2: The “Hard” Way

     Copy the entire database 1GB at a time
     Update records sequentially
      –   t = 2 x 1GB / 100MB/s + 20ms = 20s
      –   10^3 x 20s = 20,000s = 5.6 hours
     100x faster to move 100x more data!
     Moral: Read data sequentially even if you only want 1% of it
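The arithmetic behind the two approaches can be checked with a few lines of Python. This is a back-of-the-envelope sketch that uses only the figures quoted on the slides (20 ms per random update, 1 GB chunks, 100 MB/s sequential throughput); the variable names are illustrative.

```python
# Problem size from the slides: 10 billion records of 100 bytes (about 1 TB).
RECORDS = 10**10
RECORD_BYTES = 100
UPDATED_RECORDS = RECORDS // 100          # update 1% of the records

# Approach 1: one seek plus two rotations (about 20 ms) per updated record.
SECONDS_PER_RANDOM_UPDATE = 0.020
random_seconds = UPDATED_RECORDS * SECONDS_PER_RANDOM_UPDATE

# Approach 2: read and rewrite everything in 1 GB chunks at 100 MB/s,
# paying a single 20 ms positioning cost per chunk.
CHUNK_BYTES = 10**9
THROUGHPUT_BYTES_PER_SECOND = 100 * 10**6
chunks = RECORDS * RECORD_BYTES // CHUNK_BYTES
seconds_per_chunk = 2 * CHUNK_BYTES / THROUGHPUT_BYTES_PER_SECOND + 0.020
sequential_seconds = chunks * seconds_per_chunk

speedup = random_seconds / sequential_seconds

print(f"Approach 1: {random_seconds / 86400:.1f} days")
print(f"Approach 2: {sequential_seconds / 3600:.1f} hours")
print(f"Speedup:    about {speedup:.0f}x")
```

The per-chunk positioning cost is negligible next to the 20 seconds of transfer, which is why the sequential copy wins even though it moves 100 times more data.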




Introducing Hadoop!
     Now imagine you have thousands of disks on hundreds of
      machines with near linear scaling
      –   Commodity hardware – thousands of nodes!
      –   Handles Big Data – Petabytes and more!
      –   Sequential file access – all spindles at once!
      –   Sharding – data distributed evenly across cluster
      –   Reliability – self-healing, self-balancing
      –   Redundancy – data replicated across multiple hosts and disks
      –   MapReduce
          • Parallel computing framework
          • Moves the computation to the data




Hadoop Architecture
   • MapReduce: Parallel computing
           –   Move the computation to the data
           –   Minimizes network utilization

   • Distributed storage layer: Keeping track of data and metadata
           –   Data is sharded across the cluster

   • Cluster management tools
   • Applications and tools




What’s Driving Hadoop Adoption?


        “Simple algorithms and lots of data
            trump complex models ”



                                             Halevy, Norvig, and Pereira, Google
                                                         IEEE Intelligent Systems

MapReduce Overview




MapReduce
   •     A programming model for processing very large data sets
       ― A framework for processing parallel problems across huge datasets using
         a large number of nodes
       ― Brute force parallel computing paradigm

   •     Phases
       ― Map
            •    Input is partitioned into “splits”, each processed by one map task

       ― Shuffle and sort
            •    Map output sent to reducer(s) using a hash of the key

       ― Reduce
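The routing step in the shuffle can be sketched as hashing each key and taking the result modulo the number of reducers, so that every occurrence of a key reaches the same reducer. The sketch below is an illustration rather than Hadoop's own code: `partition` and `route` are hypothetical helper names, and `zlib.crc32` stands in for the key's hash because Python's built-in string hash varies between processes.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Choose a reducer index for a key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers: int):
    """Distribute (key, value) pairs from the map phase into reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


map_output = [("the", 1), ("time", 1), ("has", 1), ("come", 1), ("the", 1)]
for index, bucket in enumerate(route(map_output, 3)):
    print(index, bucket)
```

Because the reducer index depends only on the key, both ("the", 1) pairs land in the same bucket regardless of which map task produced them.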


Inside Map-Reduce




   Input → Map → Shuffle and sort → Reduce → Output

   Input
      "The time has come," the Walrus said,
      "To talk of many things:
      Of shoes—and ships—and sealing-wax

   Map
      the, 1
      time, 1
      has, 1
      come, 1
      …

   Shuffle and sort
      come, [3,2,1]
      has, [1,5,2]
      the, [1,2,1]
      time, [10,1,3]
      …

   Reduce (output)
      come, 6
      has, 8
      the, 4
      time, 14
      …
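The word-count flow on this slide can be imitated in a single process. The following is a minimal sketch, not Hadoop code: `map_phase`, `shuffle_and_sort`, `reduce_phase` and `word_count` are illustrative names, and a real job would run many map and reduce tasks in parallel across the cluster.

```python
import re
from collections import defaultdict


def map_phase(line: str):
    """Emit (word, 1) for every word in one line of input."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key: str, values):
    """Sum the partial counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and sort, then reduce over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))
```

Calling `word_count` on a list of text lines returns a dictionary from each word to its total count. The value lists on the slide, such as [3,2,1], suggest partial counts arriving from several map tasks over a larger input; the reduce step sums them in the same way.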




JobTracker
   • Sends out tasks
   • Co-locates tasks with data
   • Gets data location
   • Manages TaskTrackers




TaskTracker
   •     Performs tasks (Map, Reduce)
   •     Slots determine number of concurrent tasks
   •     Notifies JobTracker of completed tasks
   •     Heartbeats to the JobTracker
   •     Each task is a separate Java process




Hadoop Ecosystem




Hadoop Ecosystem
   • Pig: It will eat anything
     –   High level language, set algebra, careful semantics
     –   Filter, transform, co-group, generate, flatten
     –   Pig generates and optimizes map-reduce programs
   • Hive: Busy as a bee
     –   High level language, more ad hoc than Pig
     –   SQL-ish
     –   Has central meta-data service
     –   Loves external scripts
   • HBase: NoSQL for your cluster
   • Mahout: distributed/scalable machine learning algorithms


How is MapR Different?




Mostly, It’s Not!

     API-compatible
      –   Move code over without modifications
      –   Use the familiar Hadoop Shell
     Supports popular tools and applications
      –   Hive, Pig, HBase—Flume, if you want it




Very Different Where It Counts
   No single point of failure
   Faster shuffle, faster file creation
   Read/write storage layer
   NFS-mountable
   Management tools—MCS, REST API, CLI
   Data placement, protection, backup
   HA at all layers (Naming, NFS, JobTracker, MCS)




Summary




Questions





Epistemic Interaction - tuning interfaces to provide information for AI support
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

10c introduction

  • 1. Introduction: MapR and Hadoop 7/6/2012 © 2012 MapR Technologies Introduction 1
  • 2. Introduction Agenda • Hadoop Overview • MapReduce Overview • Hadoop Ecosystem • How is MapR Different? • Summary
  • 3. Introduction Objectives At the end of this module you will be able to: • Explain why Hadoop is an important technology for effectively working with Big Data • Describe the phases of a MapReduce job • Identify some of the tools used with Hadoop • List the similarities and differences between MapR and other Hadoop distributions
  • 4. Hadoop Overview
  • 5. Data is Growing Faster than Moore’s Law: Business Analytics Requires a New Approach [Chart: data volume growing 44x, from 1.2 zettabytes in 2010 to 35.2 zettabytes in 2020 (IDC Digital Universe Study 2011)] Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 6. Before Hadoop Web crawling to power search engines • Must be able to handle gigantic data • Must be fast! Problem: databases (B-Tree) not so fast, and do not scale Solution: Sort and Merge • Eliminate the pesky seek time!
  • 7. How to Scale? Big Data has Big Problems • Petabytes of data • MTBF on 1000s of nodes is < 1 day • Something is always broken • There are limits to scaling Big Iron • Sequential and random access just don’t scale
  • 8. Example: Update 1% of 1TB • Data consists of 10 billion records, each 100 bytes • Task: Update 1% of these records
  • 9. Approach 1: Just Do It • Each update involves read, modify and write – t = 1 seek + 2 disk rotations = 20ms – 1% x 10^10 x 20ms = 2 megaseconds ≈ 23 days (552 hours) • Total time dominated by seek and rotation times
  • 10. Approach 2: The “Hard” Way • Copy the entire database 1GB at a time • Update records sequentially – t = 2 x 1GB / 100MB/s + 20ms ≈ 20s – 10^3 x 20s = 20,000s ≈ 5.6 hours • 100x faster to move 100x more data! • Moral: Read data sequentially even if you only want 1% of it
  • 11. Introducing Hadoop! • Now imagine you have thousands of disks on hundreds of machines with near linear scaling – Commodity hardware: thousands of nodes! – Handles Big Data: petabytes and more! – Sequential file access: all spindles at once! – Sharding: data distributed evenly across cluster – Reliability: self-healing, self-balancing – Redundancy: data replicated across multiple hosts and disks – MapReduce • Parallel computing framework • Moves the computation to the data
  • 12. Hadoop Architecture • MapReduce: Parallel computing – Move the computation to the data – Minimizes network utilization • Distributed storage layer: Keeping track of data and metadata – Data is sharded across the cluster • Cluster management tools • Applications and tools
  • 13. What’s Driving Hadoop Adoption? “Simple algorithms and lots of data trump complex models” Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
  • 14. MapReduce Overview
  • 15. MapReduce • A programming model for processing very large data sets – A framework for processing parallel problems across huge datasets using a large number of nodes – Brute force parallel computing paradigm • Phases – Map • Job partitioned into “splits” – Shuffle and sort • Map output sent to reducer(s) using a hash – Reduce
  • 16. Inside Map-Reduce [Figure: word-count data flow, Input → Map → Shuffle and sort → Reduce → Output. Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax. Map output: (the, 1), (time, 1), (has, 1), (come, 1), … Shuffle and sort output: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; … Reduce output: come, 6; has, 8; the, 4; time, 14; …]
  • 17. JobTracker • Sends out tasks • Co-locates tasks with data • Gets data location • Manages TaskTrackers
  • 18. TaskTracker • Performs tasks (Map, Reduce) • Slots determine number of concurrent tasks • Notifies JobTracker of completed jobs • Heartbeats to the JobTracker • Each task is a separate Java process
  • 19. Hadoop Ecosystem
  • 20. Hadoop Ecosystem • PIG: It will eat anything – High level language, set algebra, careful semantics – Filter, transform, co-group, generate, flatten – PIG generates and optimizes map-reduce programs • Hive: Busy as a bee – High level language, more ad hoc than PIG – SQL-ish – Has central meta-data service – Loves external scripts • HBase: NoSQL for your cluster • Mahout: distributed/scalable machine learning algorithms
  • 21. How is MapR Different?
  • 22. Mostly, It’s Not! • API-compatible – Move code over without modifications – Use the familiar Hadoop Shell • Supports popular tools and applications – Hive, Pig, HBase, and Flume if you want it
  • 23. Very Different Where It Counts • No single point of failure • Faster shuffle, faster file creation • Read/write storage layer • NFS-mountable • Management tools: MCS, REST API, CLI • Data placement, protection, backup • HA at all layers (Naming, NFS, JobTracker, MCS)
  • 24. Summary
  • 25. Questions
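
The arithmetic on slides 8 to 10 (updating 1% of a 1TB data set) can be checked with a short sketch. The figures below are the ones given on the slides; the constant and function names are illustrative only, and the result for the sequential approach differs slightly from the slide because the slide rounds 20.02s per chunk down to 20s.

```python
# Worked arithmetic for the "Update 1% of 1TB" example (slides 8-10).
# All figures come from the slides; the names are illustrative.

RECORDS = 10**10                 # 10 billion records
RECORD_BYTES = 100               # 100 bytes each, so 1 TB in total
UPDATED_RECORDS = RECORDS // 100 # update 1% of the records
SEEK_PLUS_ROTATIONS_S = 0.020    # 1 seek + 2 disk rotations = 20 ms
CHUNK_BYTES = 10**9              # copy the data 1 GB at a time
THROUGHPUT_BPS = 100 * 10**6     # 100 MB/s sequential transfer


def random_update_seconds():
    """Approach 1: one seek-bound read-modify-write per updated record."""
    return UPDATED_RECORDS * SEEK_PLUS_ROTATIONS_S


def sequential_copy_seconds():
    """Approach 2: read and rewrite the whole data set chunk by chunk."""
    total_bytes = RECORDS * RECORD_BYTES
    chunks = total_bytes // CHUNK_BYTES
    # Each chunk is read once and written once, plus one seek.
    per_chunk = 2 * CHUNK_BYTES / THROUGHPUT_BPS + SEEK_PLUS_ROTATIONS_S
    return chunks * per_chunk


if __name__ == "__main__":
    slow = random_update_seconds()
    fast = sequential_copy_seconds()
    print(f"Approach 1: {slow:,.0f} s = {slow / 86400:.1f} days")
    print(f"Approach 2: {fast:,.0f} s = {fast / 3600:.1f} hours")
    print(f"Speed-up:   {slow / fast:.0f}x")
```

Run as a script, this gives about 23.1 days for the seek-bound approach against about 5.6 hours for the sequential copy, which is the roughly 100x gap the slide points to.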

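The three phases named on slide 15 (map, shuffle and sort, reduce) can be imitated in a single process with a word count like the one on slide 16. This is only a sketch under stated assumptions, not Hadoop's API: the function names are hypothetical, the "splits" are plain strings, and `zlib.crc32` stands in for whatever hash the real framework uses to route map output to reducers.

```python
from collections import defaultdict
import zlib


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for token in split.lower().split():
        word = token.strip('.,:;!?"\'')
        if word:
            yield word, 1


def partition(key, num_reducers):
    """Route a key to a reducer with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped_pairs, num_reducers):
    """Shuffle and sort: group values by key per reducer, sorted by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(splits, num_reducers=2):
    """Run the three phases in order and merge the reducers' outputs."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    output = {}
    for reducer_input in shuffle_and_sort(mapped, num_reducers):
        for key, values in reducer_input:
            word, total = reduce_phase(key, values)
            output[word] = total
    return output


if __name__ == "__main__":
    splits = ['"The time has come," the Walrus said,',
              '"To talk of many things:']
    for word, total in sorted(word_count(splits).items()):
        print(word, total)
```

Because every occurrence of a key hashes to the same reducer, the final counts do not depend on how many reducers are used; only the distribution of work changes.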
Editor's Notes

  1. Problem: scaling reliably is hard. What you need is a fault-tolerant store and a fault-tolerant framework. Handle hardware faults transparently and efficiently. High availability: not dependent on any one component. Even on a big cluster, some things take days. Even simple things are complicated in a failure-rich environment. Every point is a point where things can fail, and you have to manage that failure. With many computers and many disks, failures are common. With 1000 computers x 10 disks, we can have 1 node failure and 10 disk failures per day. Some failures are intermittent or difficult to detect. Computation must succeed and not run slower in these conditions.
  2. Apache Hadoop: a new paradigm. Scales to thousands of commodity computers. Can effectively use all cores and spindles simultaneously; if you buy hardware, you want to maximize its use. New software stack built on a different foundation. Not very mature yet. In use by most Web 2.0 companies and many Fortune 500 companies.
  3. The first is “simple algorithms and lots of data trump complex models”. This comes from an IEEE article written by three research directors at Google, titled “The Unreasonable Effectiveness of Data”. It was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, which made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. Their paper talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But they found that in natural language processing, a notoriously complex area that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and fill in the missing background based on a corpus of thousands of pictures. This algorithm did rather poorly until they increased the corpus to millions of photos; with that amount of data, the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
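
The failure figures in note 1 (1 node failure and 10 disk failures per day on 1000 computers with 10 disks each) can be reproduced with simple arithmetic. The notes do not state a per-component MTBF; the 1000-day value below is an assumption chosen only because it yields the quoted numbers, and the function name is illustrative.

```python
# Expected daily failures for a cluster, assuming independent components
# that each fail on average once every `mtbf_days` days.

ASSUMED_MTBF_DAYS = 1000   # assumption: not given in the notes
NODES = 1000               # from note 1
DISKS_PER_NODE = 10        # from note 1


def expected_failures_per_day(component_count, mtbf_days):
    """Average number of failures per day across all components."""
    return component_count / mtbf_days


if __name__ == "__main__":
    node_failures = expected_failures_per_day(NODES, ASSUMED_MTBF_DAYS)
    disk_failures = expected_failures_per_day(NODES * DISKS_PER_NODE,
                                              ASSUMED_MTBF_DAYS)
    hours_between = 24 / (node_failures + disk_failures)
    print(f"Node failures per day: {node_failures}")
    print(f"Disk failures per day: {disk_failures}")
    print(f"Hours between failures: {hours_between:.1f}")
```

Under that assumption the cluster sees some hardware failure roughly every 2.2 hours, which is consistent with slide 7's point that MTBF on thousands of nodes is under a day.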