Hadoop

These slides give a basic overview of Hadoop distributed computing.

  1. Overview of HADOOP Distributed Computing
     Raghu Juluri
     Senior Member Technical Staff
     Oracle India Development Center
     2/7/2011
  2. Dealing with Lots of Data
     - 20 billion web pages x 20 KB each = 400 TB
     - ~1,000 hard disks just to store the web
     - One computer reading ~50 MB/s from disk would need about 3 months (see the arithmetic check below)
     - Solution: spread the work over many machines
     - That takes both hardware and software
     - The software must handle communication and coordination, recovery from failure, status reporting, and debugging
     - Every application would otherwise have to implement this functionality itself (Google search indexing, PageRank, Trends, Picasa, ...)
     - In 2003 Google came up with the MapReduce runtime library
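     A quick sanity check of that arithmetic, as a sketch in Java (the 20 KB page size and 50 MB/s disk rate are the slide's own assumptions):

        public class BackOfEnvelope {
            public static void main(String[] args) {
                double pages = 20e9;         // 20 billion web pages
                double pageKB = 20;          // ~20 KB per page (slide's assumption)
                double totalTB = pages * pageKB / 1e9;          // KB -> TB
                double diskMBperSec = 50;    // one disk reads ~50 MB/s
                double seconds = totalTB * 1e6 / diskMBperSec;  // TB -> MB, divide by rate
                System.out.printf("corpus: %.0f TB, single-disk scan: %.0f days%n",
                        totalTB, seconds / 86400);
                // prints: corpus: 400 TB, single-disk scan: 93 days (~3 months)
            }
        }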
  3. (diagram slide)
  4. (diagram slide)
  5. Standard Model (diagram slide)
  6. Hadoop Ecosystem (diagram slide)
  7. (diagram slide)
  8. (diagram slide)
  9. Hadoop, Why?
     - Need to process multi-petabyte datasets
     - Expensive to build reliability into each application
     - Nodes fail every day
       - Failure is expected rather than exceptional
       - The number of nodes in a cluster is not constant
     - Need a common infrastructure
       - Efficient, reliable, open source (Apache License)
     - The goals above are the same as Condor's, but Hadoop workloads are I/O bound, not CPU bound
  10. (diagram slide)
  11. (diagram slide)
  12. HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
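      From an application's point of view the replication factor is just a parameter. A minimal sketch of writing a file through the standard org.apache.hadoop.fs.FileSystem API (the path, buffer size, and block size below are illustrative assumptions):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWriteSketch {
            public static void main(String[] args) throws Exception {
                // picks up cluster settings from core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(new Configuration());
                // create(path, overwrite, bufferSize, replication, blockSize):
                // replication = 3 means every block of this file is stored on
                // three different DataNodes, so one or two node failures do
                // not lose data.
                FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
                        true, 4096, (short) 3, 64L * 1024 * 1024);
                out.writeUTF("hello hdfs");
                out.close();
                fs.close();
            }
        }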
  13. Goals of HDFS
      - Very large distributed file system
        - 10K nodes, 100 million files, 10 PB
      - Assumes commodity hardware
        - Files are replicated to handle hardware failure
        - Detects failures and recovers from them
      - Optimized for batch processing
        - Data locations are exposed so that computations can move to where the data resides (see the sketch after this slide)
        - Provides very high aggregate bandwidth
      - Runs in user space on heterogeneous operating systems
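      The "data locations exposed" bullet corresponds to a real client API: a caller (such as the MapReduce scheduler) can ask where each block of a file physically lives and run computation on those hosts. A sketch; the path is an illustrative assumption:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockLocationsSketch {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                FileStatus st = fs.getFileStatus(new Path("/tmp/example.txt"));
                // each block of the file reports the hosts holding a replica
                for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                    System.out.println(b.getOffset() + " -> "
                            + String.join(",", b.getHosts()));
                }
            }
        }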
  14. HDFS Architecture (diagram: Client, NameNode, Secondary NameNode, DataNodes)
      Read path: (1) the client sends a filename to the NameNode, (2) the NameNode replies with block IDs and the DataNodes that hold them, (3) the client reads the data directly from those DataNodes.
      - NameNode: maps a file to a file ID and a list of blocks on DataNodes
      - DataNode: maps a block ID to a physical location on disk
      - Secondary NameNode: periodically merges the transaction log
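      On the client this whole exchange hides behind a single open() call. A minimal read sketch against the same FileSystem API (the path is an illustrative assumption):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class HdfsReadSketch {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // steps 1-2 (filename -> block IDs and DataNodes) happen inside
                // open(); the returned stream then pulls bytes directly from
                // the DataNodes (step 3)
                FSDataInputStream in = fs.open(new Path("/tmp/example.txt"));
                IOUtils.copyBytes(in, System.out, 4096, true); // print and close
            }
        }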
  15. MapReduce: Programming Model (diagram)
      Input lines "How now brown cow" and "How does it work now" flow through map tasks, which emit pairs such as <How,1> <now,1> <brown,1> <cow,1> and <How,1> <does,1> <it,1> <work,1> <now,1>. The framework groups values by key (<How,1 1>, <now,1 1>, ...), and reduce tasks produce the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.
  16. MapReduce: Programming Model
      - Process data using special map() and reduce() functions
        - The map() function is called on every item in the input and emits a series of intermediate key/value pairs
        - All values associated with a given key are grouped together
        - The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
  17. MapReduce Benefits
      - Greatly reduces parallel-programming complexity
        - Reduces synchronization complexity
        - Automatically partitions data
        - Provides failure transparency
        - Handles load balancing
      - Practical
        - Approximately 1,000 Google MapReduce jobs run every day
  18. MapReduce Example: Word Frequency (diagram)
      doc -> Map -> <word,1> <word,1> <word,1> -> runtime system groups by key -> <word,1,1,1> -> Reduce -> <word,3>
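      In code, this word-frequency job is essentially the stock Hadoop example. A compact version against the org.apache.hadoop.mapreduce API (input and output paths come from the command line):

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // map(): emit <word, 1> for every token in an input line
            public static class TokenizerMapper
                    extends Mapper<Object, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                public void map(Object key, Text value, Context ctx)
                        throws IOException, InterruptedException {
                    StringTokenizer tokens = new StringTokenizer(value.toString());
                    while (tokens.hasMoreTokens()) {
                        word.set(tokens.nextToken());
                        ctx.write(word, ONE);
                    }
                }
            }

            // reduce(): called once per unique word with all of its 1s; sum them
            public static class IntSumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    ctx.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }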
  19. A Brief History
      - MapReduce borrows from functional programming (e.g., Lisp)
        - map(): applies a function to each value of a sequence
        - reduce(): combines all elements of a sequence using a binary operator
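      The same two primitives survive in modern languages; a tiny Java-streams illustration of the Lisp-style semantics (the example values are arbitrary):

        import java.util.List;

        public class MapReduceRoots {
            public static void main(String[] args) {
                List<Integer> xs = List.of(1, 2, 3, 4);
                int sumOfSquares = xs.stream()
                        .map(x -> x * x)          // map: apply a function to each value
                        .reduce(0, Integer::sum); // reduce: fold with a binary operator
                System.out.println(sumOfSquares); // 30
            }
        }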
  20. MapReduce Execution Overview
      - The user program, via the MapReduce library, shards the input data (User Program -> Shard 0 ... Shard 6)
      - Shards are typically 16-64 MB in size
  21. MapReduce Execution Overview
      - The user program creates process copies distributed over a machine cluster; one copy becomes the "Master" and the others become workers
  22. MapReduce Resources
      - The master distributes M map tasks and R reduce tasks to idle workers (Master -> idle worker: do_map_task)
        - M == the number of input shards
        - R == the number of parts the intermediate key space is divided into (see the partitioning sketch below)
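      The slide does not show how a key is assigned to one of the R parts; the usual scheme, and what Hadoop's default HashPartitioner does, is hash(key) mod R:

        public class HashPartitionSketch {
            // mirrors Hadoop's HashPartitioner.getPartition(): mask off the
            // sign bit so the result is non-negative, then take mod R
            static int partitionFor(String key, int numReduceTasks) {
                return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }

            public static void main(String[] args) {
                int r = 4; // R reduce tasks
                for (String k : new String[] {"how", "now", "brown", "cow"}) {
                    System.out.println(k + " -> reduce task " + partitionFor(k, r));
                }
            }
        }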
  23. MapReduce Resources
      - Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs (Shard 0 -> Map worker -> key/value pairs)
        - Output is buffered in RAM
  24. MapReduce Execution Overview
      - Each worker flushes its intermediate values, partitioned into R regions, to local storage and notifies the Master process of the disk locations
  25. MapReduce Execution Overview
      - The Master process gives those disk locations to an available reduce-task worker, which reads all of the associated intermediate data remotely
  26. MapReduce Execution Overview
      - Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing the key and its associated list of values; the reduce function's output is appended to the reduce task's partition output file (a tiny in-memory illustration follows)
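      A tiny in-memory sketch of just this sort/group/reduce step, using hypothetical intermediate pairs (a simulation of the idea, not Hadoop's actual shuffle code):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.SortedMap;
        import java.util.TreeMap;

        public class ReducePhaseSketch {
            public static void main(String[] args) {
                // intermediate pairs as they arrive from map workers (unsorted)
                String[][] pairs = {{"now","1"},{"how","1"},{"now","1"},{"brown","1"}};
                // sort and group by key, as the reduce worker does before reduce()
                SortedMap<String, List<Integer>> grouped = new TreeMap<>();
                for (String[] p : pairs) {
                    grouped.computeIfAbsent(p[0], k -> new ArrayList<>())
                           .add(Integer.parseInt(p[1]));
                }
                // one reduce() call per unique key, with the key's full value list
                grouped.forEach((key, values) -> System.out.println(key + "\t"
                        + values.stream().mapToInt(Integer::intValue).sum()));
            }
        }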
  27. MapReduce Execution Overview
      - The Master process wakes the user program when all tasks have completed; the output is contained in R output files, one per reduce task
  28. (diagram slide)
  29. Pig
        - Data-flow oriented language: "Pig Latin"
        - Datatypes include sets, associative arrays, and tuples
        - High-level language for routing data; allows easy integration of Java for complex tasks
        - Developed at Yahoo!
      Hive
        - SQL-based data warehousing app
        - Feature set similar to Pig's, but the language is more strictly SQL
        - Supports SELECT, JOIN, GROUP BY, etc.
        - Features for analyzing very large data sets: partition columns, sampling, buckets
        - Developed at Facebook
  30. HBase
      - Column-store database
        - Based on the design of Google BigTable
        - Provides interactive access to information
      - Holds extremely large datasets (multi-TB)
      - Constrained access model (see the lookup sketch below)
        - (key, value) lookup
        - Limited transactions (single row only)
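      A minimal sketch of that constrained (key, value) access model, using the classic HBase client API of this era; the table, row key, column family, and qualifier names are illustrative assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseLookup {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "webtable");           // assumed table
                Get get = new Get(Bytes.toBytes("com.example/index")); // lookup by row key
                Result row = table.get(get);
                byte[] value = row.getValue(Bytes.toBytes("contents"),
                        Bytes.toBytes("html"));
                System.out.println(value == null ? "not found" : Bytes.toString(value));
                table.close();
            }
        }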
  31. ZooKeeper
      - Distributed consensus engine
      - Provides well-defined concurrent access semantics:
        - Leader election (see the sketch after this slide)
        - Service discovery
        - Distributed locking / mutual exclusion
        - Message board / mailboxes
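      A sketch of the standard leader-election recipe on the ZooKeeper client API: each candidate creates an EPHEMERAL_SEQUENTIAL znode, and whoever holds the lowest sequence number leads. The connect string and the /election path (assumed to already exist) are illustrative:

        import java.util.Collections;
        import java.util.List;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class LeaderElectionSketch {
            public static void main(String[] args) throws Exception {
                ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
                // Ephemeral: the znode vanishes if this process dies, so
                // leadership is released automatically on failure.
                // Sequential: ZooKeeper appends an increasing counter.
                String me = zk.create("/election/n_", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
                List<String> children = zk.getChildren("/election", false);
                Collections.sort(children);
                boolean leader = me.endsWith(children.get(0));
                System.out.println(me + (leader ? " is the leader" : " is a follower"));
            }
        }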
  32. Some more projects...
      - Chukwa: Hadoop log aggregation
      - Scribe: more general log aggregation
      - Mahout: machine learning library
      - Cassandra: column-store database on a P2P backend
      - Dumbo: Python library for Hadoop Streaming
      - Ganglia: distributed monitoring
  33. Conclusions
      - Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset
      - New ways of thinking about problems are needed
        - New tools provide the means to capture this
        - MapReduce, HDFS, etc. can help
  34. (closing slide)
