
Hadoop

These slides give a basic overview of Hadoop distributed computing.


1. Overview on Hadoop Distributed Computing
   - Raghu Juluri
   - Senior Member Technical Staff
   - Oracle India Development Center
   - 2/7/2011
2. Dealing with lots of data
   - 20 billion web pages × 20 KB = 400 TB
   - ~1,000 hard disks just to store the web
   - One computer can read ~50 MB/sec from disk => about 3 months to read it all
   - Solution: spread the work over many machines
   - That takes both hardware and software; the software must handle communication and coordination, recovery from failure, status reporting, and debugging
   - Every application would need to implement that functionality itself (Google search indexing, page ranking, trends, Picasa, ...)
   - In 2003 Google came up with the MapReduce runtime library
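A quick check of the arithmetic behind those figures:

\[
20\times10^{9}\ \text{pages}\times 20\ \text{KB/page} = 4\times10^{14}\ \text{bytes} = 400\ \text{TB}
\]
\[
\frac{400\ \text{TB}}{50\ \text{MB/s}} = 8\times10^{6}\ \text{s} \approx 93\ \text{days} \approx 3\ \text{months}
\]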
3. (image-only slide)
4. (image-only slide)
5. Standard Model (diagram)
6. Hadoop EcoSystem (diagram)
7. (image-only slide)
8. (image-only slide)
9. Hadoop, why?
   - Need to process multi-petabyte datasets
   - Expensive to build reliability into each application
   - Nodes fail every day
     - Failure is expected, rather than exceptional
     - The number of nodes in a cluster is not constant
   - Need common infrastructure
     - Efficient, reliable, open source (Apache License)
   - The goals above are the same as Condor's, but the workloads are IO-bound, not CPU-bound
10. (image-only slide)
11. (image-only slide)
12. HDFS splits user data across the servers in a cluster and uses replication to ensure that even multiple node failures will not cause data loss.
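As a rough illustration of that guarantee, client code can set the replication factor per file through the HDFS Java API. A minimal sketch, assuming a NameNode at hdfs://namenode:8020 and a factor of 3 (both illustrative, not from the slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // Create a file with replication factor 3: the cluster keeps three
        // copies of each block, so losing two nodes still loses no data.
        short replication = 3;
        fs.create(new Path("/data/example.txt"), replication).close();
    }
}
```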
13. Goals of HDFS
   - Very large distributed file system: 10K nodes, 100 million files, 10 PB
   - Assumes commodity hardware
     - Files are replicated to handle hardware failure
     - Failures are detected and recovered from
   - Optimized for batch processing
     - Data locations are exposed so that computations can move to where the data resides
     - Provides very high aggregate bandwidth
   - Runs in user space, on heterogeneous operating systems
14. HDFS architecture
   - Read path: the client sends a filename to the NameNode (1), the NameNode answers with block IDs and the DataNodes that hold them (2), and the client reads the data directly from those DataNodes (3)
   - NameNode: maps a file to a file ID and a list of blocks on DataNodes
   - DataNode: maps a block ID to a physical location on disk
   - SecondaryNameNode: periodically merges the transaction log
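That read path maps directly onto the HDFS client API; a minimal sketch (the file path is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // fs.open() asks the NameNode for block IDs and DataNode locations,
        // then streams the blocks directly from the DataNodes (steps 1-3 above).
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            for (String line; (line = reader.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}
```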
15. MapReduce: programming model
   - Input documents: "How now brown cow" and "How does it work now"
   - Map: each map task emits intermediate pairs, e.g. <How,1> <now,1> <brown,1> <cow,1> and <How,1> <does,1> <it,1> <work,1> <now,1>
   - The MapReduce framework groups the pairs by key: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
   - Reduce: sums each group to produce the output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
16. MapReduce: programming model
   - Process data using special map() and reduce() functions
     - The map() function is called on every item in the input and emits a series of intermediate key/value pairs
     - All values associated with a given key are grouped together
     - The reduce() function is called on every unique key, with its value list, and emits a value that is added to the output
   - (A word-count sketch of both functions follows below)
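A minimal word-count sketch of the two functions, using Hadoop's Java MapReduce API (the class names are the usual textbook ones, not taken from the slides):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): emit <word, 1> for every token in the input line.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// reduce(): called once per unique word with its grouped counts; sum them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```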
17. MapReduce benefits
   - Greatly reduces parallel programming complexity
     - Reduces synchronization complexity
     - Automatically partitions data
     - Provides failure transparency
     - Handles load balancing
   - Practical: approximately 1,000 Google MapReduce jobs run every day
18. MapReduce example: word frequency
   - Map reads a document and emits a <word,1> pair for each occurrence
   - The runtime system groups the pairs into <word,1,1,1>
   - Reduce sums the list and emits the total, e.g. <word,3>
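Wiring the mapper and reducer sketched above into a runnable job looks roughly like this; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of <word,1> pairs
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```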
19. A brief history
   - Functional programming (e.g., Lisp)
     - map(): applies a function to each value of a sequence
     - reduce(): combines all elements of a sequence using a binary operator
20. MapReduce execution overview
   - The user program, via the MapReduce library, shards the input data (here into shards 0-6)
   - Shards are typically 16-64 MB in size
21. MapReduce execution overview
   - The user program creates copies of itself distributed across a machine cluster; one copy becomes the "Master" and the others become workers
22. MapReduce resources
   - The master distributes M map tasks and R reduce tasks to idle workers
     - M = the number of input shards
     - R = the number of parts the intermediate key space is divided into
   - (A sketch of the usual hash partitioning into R parts follows below)
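Dividing the intermediate key space into R parts is typically done by hashing; a minimal sketch that mirrors what Hadoop's default HashPartitioner does:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate key to one of R reduce partitions by hashing,
// so all pairs with the same key land on the same reducer.
public class ModHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then mod R.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```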
23. MapReduce resources
   - Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
     - Output is buffered in RAM
24. MapReduce execution overview
   - Each map worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the master of the disk locations
25. MapReduce execution overview
   - The master gives the disk locations to an available reduce-task worker, which reads all of the associated intermediate data from the map workers' (remote) storage
26. MapReduce execution overview
   - Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing in the key and its associated values; the reduce function's output is appended to the reduce task's partition output file
27. MapReduce execution overview
   - The master wakes up the user program when all tasks have completed; the output is contained in R output files, one per reduce partition
28. (image-only slide)
29. Pig and Hive
   - Pig
     - Data-flow oriented language, "Pig Latin"
     - Datatypes include sets, associative arrays, and tuples
     - High-level language for routing data; allows easy integration of Java for complex tasks
     - Developed at Yahoo!
   - Hive
     - SQL-based data warehousing application
     - Feature set is similar to Pig's, but the language is more strictly SQL: supports SELECT, JOIN, GROUP BY, etc.
     - Features for analyzing very large data sets: partition columns, sampling, buckets
     - Developed at Facebook
30. HBase
   - Column-store database, based on the design of Google BigTable
   - Provides interactive access to information
   - Holds extremely large datasets (multi-TB)
   - Constrained access model: (key, value) lookup, with transactions limited to a single row
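A minimal sketch of that (key, value) lookup model, using the modern HBase Java client rather than the 2011-era HTable API; the table, row key, and column names are illustrative assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("webtable"))) { // assumed table
            // The access model: fetch a single row by key.
            Result row = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] cell = row.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(cell == null ? "(no value)" : Bytes.toString(cell));
        }
    }
}
```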
31. ZooKeeper
   - Distributed consensus engine
   - Provides well-defined concurrent access semantics:
     - Leader election
     - Service discovery
     - Distributed locking / mutual exclusion
     - Message board / mailboxes
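The locking and leader-election recipes are built from ephemeral sequential znodes; a minimal sketch with the ZooKeeper Java client, assuming an ensemble at localhost:2181 and an already-existing /lock parent znode:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LockNode {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed ensemble address; the watcher is a no-op for brevity.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // Ephemeral + sequential znodes underpin locks and leader election:
        // the client holding the lowest sequence number wins, and the node
        // disappears automatically if that client's session dies.
        String path = zk.create("/lock/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + path);
        zk.close();
    }
}
```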
32. Some more projects...
   - Chukwa: Hadoop log aggregation
   - Scribe: more general log aggregation
   - Mahout: machine learning library
   - Cassandra: column-store database on a P2P backend
   - Dumbo: Python library for Hadoop Streaming
   - Ganglia: distributed monitoring
33. Conclusions
   - Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset
   - New ways of thinking about problems are needed, and new tools provide the means to capture this: MapReduce, HDFS, etc. can help
34. (image-only slide)
