Overview of Hadoop Distributed Computing
Raghu Juluri, Senior Member of Technical Staff, Oracle India Development Center
2/7/2011
Dealing with Lots of Data
20 billion web pages × 20 KB each ≈ 400 TB, or roughly 1,000 hard disks just to store the web. A single computer reading ~50 MB/sec from disk would need about 3 months to scan it all.
Solution: spread the work over many machines. That takes both hardware and software; the software must handle communication and coordination, recovery from failure, status reporting, and debugging. Without a common framework, every application (Google search indexing, PageRank, Trends, Picasa, ...) would have to implement this functionality itself. In 2003 Google addressed this with the MapReduce runtime library.
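The figures above follow from a quick back-of-the-envelope calculation:
20 × 10^9 pages × 20 KB/page = 4 × 10^14 bytes ≈ 400 TB
400 TB ÷ 50 MB/s = 8 × 10^6 seconds ≈ 93 days ≈ 3 months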
Standard Model (diagram slide)
Hadoop Ecosystem (diagram slide)
Hadoop, Why?
Need to process multi-petabyte datasets; it is expensive to build reliability into each application.
Nodes fail every day:
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure:
– Efficient, reliable, open source (Apache License).
These goals are the same as Condor's, but Hadoop workloads are I/O bound rather than CPU bound.
HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
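As an illustration of how that replication is typically configured, a minimal hdfs-site.xml sketch (the property names are standard HDFS configuration keys, but the values are only examples; older releases use dfs.block.size instead of dfs.blocksize):

```xml
<!-- hdfs-site.xml: example settings only -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- each block is stored on 3 DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- 128 MB blocks -->
  </property>
</configuration>
```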
Goals of HDFS
Very large distributed file system:
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware:
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing:
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space on heterogeneous operating systems.
HDFS Architecture
(Diagram slide: a Client sends a filename to the NameNode, receives block IDs and DataNode locations, then reads the data directly from the DataNodes; DataNodes report cluster membership to the NameNode, and a Secondary NameNode runs alongside.)
NameNode: maps a filename to its block IDs and the DataNodes that hold them
DataNode: maps a block ID to a physical location on disk
Secondary NameNode: performs periodic merges of the transaction log
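To make the read path concrete, here is a minimal client-side sketch using the standard org.apache.hadoop.fs.FileSystem API; the path is illustrative, and the NameNode lookup and DataNode reads described above happen inside open() and read():

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle to HDFS
        Path file = new Path("/user/demo/input.txt");  // illustrative path
        // open() asks the NameNode for block locations; reads then stream from DataNodes
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```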
MapReduce: Programming Model
(Diagram slide: the input text "How now brown cow / How does it work now" flows through map tasks that emit <word, 1> pairs, the MapReduce framework groups the pairs by key, and reduce tasks produce the final counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.)
MapReduce: Programming Model
Data is processed using two special functions, map() and reduce(). The map() function is called on every item in the input and emits a series of intermediate key/value pairs. All values associated with a given key are grouped together. The reduce() function is called on every unique key, with its list of values, and emits a value that is added to the output.
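A minimal sketch of these two functions in Hadoop's Java API, using the word-count example from the diagram above (class and field names are illustrative; this follows the org.apache.hadoop.mapreduce Mapper/Reducer interfaces):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(): called once per input record, emits <word, 1> for every token
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): called once per unique key with its grouped values, emits <word, count>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```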
MapReduce Benefits
Greatly reduces parallel programming complexity:
– Reduces synchronization complexity
– Automatically partitions data
– Provides failure transparency
– Handles load balancing
Practical: approximately 1,000 Google MapReduce jobs run every day.
MapReduce Example: Word Frequency
(Diagram slide: a document flows into Map, which emits <word,1> pairs; the runtime system groups them into <word, 1,1,1>; Reduce emits <word,3>.)
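A sketch of the driver that would wire the word-count Mapper and Reducer from the earlier sketch into a runnable job (the input/output paths and job name are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```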
A Brief History
MapReduce borrows from functional programming (e.g., Lisp):
– map(): applies a function to each value of a sequence
– reduce(): combines all elements of a sequence using a binary operator
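To illustrate those functional roots, the same map/reduce pattern expressed with Java streams on an in-memory list (not Hadoop code, just the underlying idea):

```java
import java.util.List;

public class FunctionalExample {
    public static void main(String[] args) {
        List<String> words = List.of("how", "now", "brown", "cow");
        // map(): apply a function to each element (word -> word length)
        // reduce(): combine all elements with a binary operator (sum)
        int totalLength = words.stream()
                               .map(String::length)
                               .reduce(0, Integer::sum);
        System.out.println(totalLength);  // prints 14
    }
}
```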
MapReduce Execution Overview
The user program, via the MapReduce library, shards the input data. Shards are typically 16-64 MB in size.
(Diagram: the user program splits the input data into Shard 0 through Shard 6.)
MapReduce Execution Overview
The user program starts copies of itself across the machines of the cluster. One copy becomes the master and the others become workers.
(Diagram: the user program spawns one Master and several Workers.)
MapReduce Resources
The master assigns M map tasks and R reduce tasks to idle workers, where M is the number of input shards and R is the number of partitions the intermediate key space is divided into.
(Diagram: the Master sends a do_map_task message to an idle worker.)
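The split of the intermediate key space into R parts is usually done by hashing; a minimal sketch of the idea (the class below is illustrative, though it mirrors the hash-and-modulo scheme of Hadoop's default HashPartitioner):

```java
// Illustrative hash partitioner: maps an intermediate key to one of R reduce partitions.
public class SimplePartitioner {
    private final int numReduceTasks;  // R

    public SimplePartitioner(int numReduceTasks) {
        this.numReduceTasks = numReduceTasks;
    }

    public int getPartition(String key) {
        // Mask off the sign bit so the result is non-negative, then take modulo R.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```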
MapReduce Resources
Each map-task worker reads its assigned input shard and produces intermediate key/value pairs, which are buffered in RAM.
(Diagram: a map worker reads Shard 0 and emits key/value pairs.)
MapReduce Execution Overview
Each map worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the master of the disk locations.
(Diagram: the map worker writes to local storage and reports disk locations to the Master.)
MapReduce Execution Overview
The master passes these disk locations to an available reduce-task worker, which reads all of its associated intermediate data remotely from the map workers' storage.
(Diagram: the Master sends disk locations to a reduce worker, which reads from remote storage.)
MapReduce Execution Overview
Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing in the key and its associated list of values. The reduce function's output is appended to the reduce task's partition output file.
(Diagram: the reduce worker sorts its data and writes a partition output file.)
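A plain-Java sketch of what the reduce worker does at this step, with no Hadoop APIs and the word-count reduce assumed as the user function (the intermediate pairs are made-up data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceWorkerSketch {
    public static void main(String[] args) {
        // Intermediate <word, 1> pairs fetched from the map workers (illustrative data).
        List<Map.Entry<String, Integer>> intermediate = List.of(
                Map.entry("how", 1), Map.entry("now", 1),
                Map.entry("how", 1), Map.entry("cow", 1));

        // Sort and group by key (a TreeMap keeps keys in sorted order).
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Call the user-supplied reduce function once per unique key.
        grouped.forEach((word, counts) -> {
            int sum = counts.stream().mapToInt(Integer::intValue).sum();
            System.out.println(word + "\t" + sum);  // appended to the partition output file
        });
    }
}
```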
MapReduce Execution Overview
When all tasks have completed, the master wakes up the user program. The output is contained in R output files, one per reduce partition.
(Diagram: the Master sends a wakeup to the User Program; the output files are ready.)
Pig
Data-flow oriented language, "Pig Latin"
– Datatypes include sets, associative arrays, and tuples
– High-level language for routing data; allows easy integration of Java for complex tasks
– Developed at Yahoo!
Hive
SQL-based data warehousing application
– Feature set is similar to Pig, but the language is more strictly SQL: supports SELECT, JOIN, GROUP BY, etc.
– Features for analyzing very large data sets: partition columns, sampling, buckets
– Developed at Facebook
HBase
Column-store database
– Based on the design of Google BigTable
– Provides interactive access to information
– Holds extremely large datasets (multi-TB)
Constrained access model
– (key, value) lookup
– Limited transactions (single-row only)
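A minimal sketch of that (key, value) access pattern with the HBase Java client; the table, column family, and row names are made up, and this uses the newer Connection/Table API rather than the HTable API contemporary with these slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // illustrative table
            // Single-row put: the unit of atomicity in HBase.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // (key, value) lookup by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```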
ZooKeeper
Distributed consensus engine
Provides well-defined concurrent access semantics:
– Leader election
– Service discovery
– Distributed locking / mutual exclusion
– Message board / mailboxes
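As an example of one of those recipes, a bare-bones leader-election sketch with the ZooKeeper Java client; the connection string and znode paths are placeholders, the /election parent is assumed to exist, and production code would also watch its predecessor node and handle reconnects:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> { });  // placeholder ensemble

        // Each candidate creates an ephemeral sequential znode under /election.
        String myNode = zk.create("/election/candidate-",
                new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the lowest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean isLeader = myNode.endsWith(children.get(0));
        System.out.println(isLeader ? "I am the leader" : "Following " + children.get(0));
    }
}
```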
Some More Projects...
– Chukwa: Hadoop log aggregation
– Scribe: more general log aggregation
– Mahout: machine learning library
– Cassandra: column-store database on a P2P backend
– Dumbo: Python library for Hadoop Streaming
– Ganglia: distributed monitoring
Conclusions
Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset. New ways of thinking about the problems are needed, and new tools provide the means to capture them: MapReduce, HDFS, and the rest of the Hadoop ecosystem can help.