Distributed Computing with Apache Hadoop: Technology Overview


Presented at Yandex Summer School in Distributed Computing, July 2011. Moscow, Russia.

Published in: Technology

  1. Distributed Computing with Apache Hadoop
     Technology Overview
     Konstantin V. Shvachko
     14 July 2011
  2. Contents
     • Why life is interesting in Distributed Computing
     • Computational shift: New Data Domain
     • Data is more important than Algorithms
     • Hadoop as a technology
     • Ecosystem of Hadoop tools
  3. New Data Domain
     • Simple calculations can be performed by humans
     • Devices are needed to process larger computations
     • Large computations assume large data domain
     • Domain of numbers: the only one until recently
       – Crunching numbers from ancient times
       – Computers served the same purpose
       – Strict rules
     • Growth of the Internet provided a new vast domain
       – Word data: human generated texts
       – Digital data: photo, video, sound
       – Fuzzy rules. Errors & deviations are a part of study
       – Started to process texts
       – Barely touching digital data
  4. Words vs. Numbers
     • In 1997 IBM built the Deep Blue supercomputer
       – Played chess against the champion G. Kasparov
       – Human race was defeated
       – Strict rules for Chess
       – Fast deep analysis of current state
       – Still numbers
     • In 2011 IBM built the Watson computer to play Jeopardy
       – Questions and hints in human terms
       – Analysis of texts from library and the Internet
       – Human champions defeated
  5. The Library of Babel
     • Jorge Luis Borges, "The Library of Babel"
       – Vast storage universe
       – Composed of all possible manuscripts uniformly formatted as 410-page books
       – Most are meaningless sequences of symbols
       – The rest excitingly forms a complete and indestructible knowledge system
       – Stores any text written or to be written
       – Provides solutions to all problems in the world
       – Just find the right book
     • Hard copy size is larger than visible universe
       – a data domain worth discovering
     • What is the size of the electronic version?
     • Internet collection is a subset of The Library of Babel
  6. New Type of Algorithms
     • Scalability is more important than efficiency
       – Classic and Distributed sorting
       – In place sorting updates common state
     • More Hardware vs. development time
       – 20% improvements in efficiency are not important
       – Can add more nodes instead
     • Data is more important than algorithms
       – Hard to collect data. Historical data 6 months to 1 year
     • Example: Natural language processing
       – Effects of training data size on classification accuracy
       – Accuracy increases linearly with the size of the training data
       – Machine learning algorithms converge as the training data grows
  7. Big Data
     • Computations that need the power of many computers
       – Large datasets: hundreds of TBs, PBs
       – Or use of thousands of CPUs in parallel
       – Or both
     • Cluster as a computer
       – Big Data management, storage and analytics
  8. Big Data: Examples
     • Search Webmap as of 2008 @ Y!
       – Raw disk used 5 PB
       – 1500 nodes
     • High-energy physics LHC Collider:
       – PBs of events
       – 1 PB of data per sec, most filtered out
     • 2 quadrillionth (10^15) digit of π is 0
       – Tsz-Wo (Nicholas) Sze
       – 12 days of cluster time, 208 years of CPU time
       – No data, pure CPU workload
  9. Big Data: More Examples
     • eHarmony
       – Soul matching
     • Banking
       – Fraud detection
     • Processing of astronomy data
       – Image Stacking and Mosaicing
  10. What is Hadoop
      • Hadoop is an ecosystem of tools for processing “Big Data”
      • Hadoop is an open source project
  11. Hadoop: Architecture Principles
      • Linear scalability: more nodes can do more work in the same time
        – Linear on data size:
        – Linear on compute resources:
      • Move computation to data
        – Minimize expensive data transfers
        – Data are large, programs are small
      • Reliability and Availability: Failures are common
        – 1 drive fails every 3 years
          • Probability of failing today 1/1000
        – How many drives per day fail on a 1000-node cluster with 10 drives per node?
      • Simple computational model
        – hides complexity in efficient execution framework
      • Sequential data processing (avoid random reads)
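The drive-failure question on this slide can be worked through numerically. The sketch below is only an illustration: the class and method names are invented for this example, and it assumes the slide means 10 drives per node and that failures are independent and spread evenly over the 3-year period.

```java
// Back-of-the-envelope estimate of daily drive failures in a cluster.
// Assumptions (not from the slides): 365-day years, independent failures,
// and "10 drives" read as 10 drives per node.
final class DriveFailureEstimate {

    // Probability that one drive fails on a given day if it fails once
    // every meanYearsBetweenFailures years on average.
    static double dailyFailureProbability(double meanYearsBetweenFailures) {
        return 1.0 / (meanYearsBetweenFailures * 365.0);
    }

    // Expected number of drive failures per day across the whole cluster.
    static double expectedFailuresPerDay(int nodes, int drivesPerNode, double dailyProbability) {
        return (double) nodes * drivesPerNode * dailyProbability;
    }

    public static void main(String[] args) {
        double exact = dailyFailureProbability(3.0); // about 1/1095
        double rounded = 1.0 / 1000;                 // the slide's rounded figure
        System.out.printf("exact p = %.6f, failures/day = %.2f%n",
                exact, expectedFailuresPerDay(1000, 10, exact));
        System.out.printf("rounded p = %.6f, failures/day = %.2f%n",
                rounded, expectedFailuresPerDay(1000, 10, rounded));
    }
}
```

Under these assumptions a 10,000-drive cluster loses roughly ten drives per day, which is why the design treats failure as the normal case.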
  12. Hadoop Success Factors
      • Apache Hadoop won the 2011 MediaGuardian Innovation Award
        – Recognition for its influence on technological innovation
        – Other nominees: iPad, WikiLeaks
      1. Scalability
      2. Open source & commodity software
      3. Just works
  13. Hadoop Family
      HDFS        Distributed file system
      MapReduce   Distributed computation
      ZooKeeper   Distributed coordination
      HBase       Column store
      Pig         Dataflow language, SQL
      Hive        Data warehouse, SQL
      Oozie       Complex job workflow
      Avro        Data Serialization
  14. Hadoop Core
      • A reliable, scalable, high performance distributed computing system
      • Reliable storage layer
        – The Hadoop Distributed File System (HDFS)
        – With more sophisticated layers on top
      • MapReduce: distributed computation framework
      • Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers
      • Divide-and-conquer using lots of commodity hardware
  15. MapReduce
      • MapReduce: distributed computation framework
        – Invented by Google researchers
      • Two stages of a MR job
        – map: (k1; v1) → {(k2; v2)}
        – reduce: (k2; {v2}) → {(k3; v3)}
      • Map: a truly distributed stage
        Reduce: an aggregation, may not be distributed
      • Shuffle: sort and merge
        – transition from Map to Reduce
        – invisible to user
      • Combiners & Partitioners
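To make the map, shuffle, and reduce signatures concrete, here is a minimal single-process word count. It is a sketch of the computational model only and does not use the Hadoop API: the class `WordCountSketch` and its methods are hypothetical, and the "shuffle" is simulated with a sorted in-memory map rather than a distributed sort and merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce model (not the Hadoop API).
final class WordCountSketch {

    // map: (k1; v1) -> {(k2; v2)}. Here k1 is a line offset and v1 the line text.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1)); // emit (word, 1)
            }
        }
        return out;
    }

    // reduce: (k2; {v2}) -> {(k3; v3)}. Sums the counts collected for one word.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Map.entry(word, sum);
    }

    // Runs map over every line, groups values by key (the shuffle), then reduces.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>(); // sorted by key, like shuffle
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(group.getKey(), group.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

In a real job the calls to `map` would run in parallel on the nodes holding the input blocks, and the grouping step would move data across the network; only the two user-supplied functions would look similar.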
  16. MapReduce Workflow
  17. Where MapReduce cannot help
      • MapReduce solves about 95% of practical problems
        – Not a tool for everything
      • Batch processing vs. real-time
        – Throughput vs. Latency
      • Simultaneous update of common state
      • Intercommunication between tasks of a job
      • Coordinated execution
      • Use of other computational models
        – MPI
        – Dryad
  18. Hadoop Distributed File System
      • The name space is a hierarchy of files and directories
      • Files are divided into blocks (typically 128 MB)
      • Namespace (metadata) is decoupled from data
        – Lots of fast namespace operations, not slowed down by data streaming
      • Single NameNode keeps the entire name space in RAM
      • DataNodes store block replicas as files on local drives
      • Blocks are replicated on 3 DataNodes for redundancy
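The block and replica figures on this slide translate into simple arithmetic. The following sketch assumes the slide's typical values (128 MB blocks, 3 replicas) and assumes that the last block of a file occupies only its actual length; the class and method names are hypothetical and are not part of HDFS.

```java
// Block arithmetic for a file stored with the slide's typical settings.
final class BlockLayoutSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, per the slide
    static final int REPLICATION = 3;                  // 3 replicas, per the slide

    // Number of blocks needed for a file of the given size (rounded up).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Size of the final block, which may be shorter than BLOCK_SIZE.
    static long lastBlockSize(long fileSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        long remainder = fileSizeBytes % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    // Raw disk space consumed across all replicas.
    static long rawBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long size = 300L * 1024 * 1024; // a 300 MB file
        System.out.println("blocks: " + blockCount(size));
        System.out.println("last block bytes: " + lastBlockSize(size));
        System.out.println("raw bytes on disk: " + rawBytes(size));
    }
}
```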
  19. HDFS Read
      • To read a block, the client requests the list of replica locations from the NameNode
      • It then pulls data from a replica on one of the DataNodes
  20. HDFS Write
      • To write a block of a file, the client requests a list of candidate DataNodes from the NameNode, and organizes a write pipeline
  21. Replica Location Awareness
      • MapReduce schedules a task assigned to process block B to a DataNode serving a replica of B
      • Local access to data
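The placement rule on this slide can be sketched as a small selection function. This is an illustration under stated assumptions, not the Hadoop scheduler: the class name, the method, and the fallback to an arbitrary free node when no replica holder is available are all invented for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Picks a node for the task that processes a given block, preferring data locality.
final class LocalityAwareScheduling {

    // replicaLocations: block id -> DataNodes holding a replica of that block.
    // nodesWithFreeSlots: nodes that can currently accept a task.
    static Optional<String> pickNode(String blockId,
                                     Map<String, List<String>> replicaLocations,
                                     Set<String> nodesWithFreeSlots) {
        // First choice: a node that already stores a replica, so the read is local.
        for (String host : replicaLocations.getOrDefault(blockId, List.of())) {
            if (nodesWithFreeSlots.contains(host)) {
                return Optional.of(host);
            }
        }
        // Assumed fallback: any free node (chosen deterministically by name);
        // the task then reads the block over the network.
        return nodesWithFreeSlots.stream().sorted().findFirst();
    }

    public static void main(String[] args) {
        Map<String, List<String>> locations = Map.of("B", List.of("dn1", "dn2", "dn3"));
        System.out.println(pickNode("B", locations, Set.of("dn2", "dn7")));
        System.out.println(pickNode("B", locations, Set.of("dn7", "dn9")));
    }
}
```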
  22. Name Node
      • NameNode keeps 3 types of information
        – Hierarchical namespace
        – Block manager: block to DataNode mapping
        – List of DataNodes
      • The durability of the name space is maintained by a write-ahead journal and checkpoints
        – A BackupNode creates periodic checkpoints
        – A journal transaction is guaranteed to be persisted before replying to the client
        – Block locations are not persisted, but rather discovered from DataNodes during startup via block reports
  23. Data Nodes
      • DataNodes register with the NameNode, and provide periodic block reports that list the block replicas on hand
      • DataNodes send heartbeats to the NameNode
        – Heartbeat responses give instructions for managing replicas
      • If no heartbeat is received during a 10-minute interval, the node is presumed to be lost, and the replicas hosted by that node to be unavailable
        – NameNode schedules re-replication of lost replicas
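The heartbeat timeout described above can be sketched as follows. This is a simplified illustration, not the NameNode's implementation: the class `HeartbeatMonitorSketch` is hypothetical, timestamps are passed in explicitly so the logic is easy to follow, and only the 10-minute interval comes from the slide.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tracks the last heartbeat per DataNode and reports nodes presumed lost.
final class HeartbeatMonitorSketch {
    static final Duration TIMEOUT = Duration.ofMinutes(10); // interval from the slide

    private final Map<String, Instant> lastHeartbeat = new HashMap<>();

    // Records a heartbeat received from a DataNode at the given time.
    void onHeartbeat(String dataNode, Instant when) {
        lastHeartbeat.put(dataNode, when);
    }

    // Returns, sorted by name, the nodes silent for longer than TIMEOUT.
    // A real system would then schedule re-replication of their replicas.
    List<String> lostNodes(Instant now) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastHeartbeat.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(TIMEOUT) > 0) {
                lost.add(entry.getKey());
            }
        }
        Collections.sort(lost);
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch();
        Instant start = Instant.parse("2011-07-14T10:00:00Z");
        monitor.onHeartbeat("dn1", start);
        monitor.onHeartbeat("dn2", start.plus(Duration.ofMinutes(5)));
        System.out.println(monitor.lostNodes(start.plus(Duration.ofMinutes(11))));
    }
}
```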
  24. Quiz: What Is the Common Attribute?
  25. Hadoop Size
      • Y! cluster
        – 70 million files, 80 million blocks
        – 15 PB capacity
        – 4000+ nodes. 24,000 clients
        – 50 GB heap for NN
      • Data warehouse Hadoop cluster at Facebook
        – 55 million files, 80 million blocks. Estimate 200 million objects (files + blocks)
        – 2000 nodes. 21 PB capacity, 30,000 clients
        – 108 GB heap for NN should allow for 400 million objects
      • Analytics Cluster at eBay
        – 768 nodes
        – Each node: 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU
        – Cluster size is 18 PB
        – Runs 26,000 MapReduce tasks simultaneously
  26. Limitations of the Implementation
      • “HDFS Scalability: The limits to growth” USENIX ;login:
      • Single master architecture: a constraining resource
      • Limit to the number of namespace objects
        – 100 million objects; 25 PB of data
        – Block to file ratio is shrinking: 2 → 1.5 → 1.2
      • Limits for linear performance growth
        – Linear increase in # of workers puts a higher workload on the single NameNode
        – Single NameNode cannot support 100,000 clients
      • Hadoop MapReduce framework reached its scalability limit at 40,000 clients
        – Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
  27. Benchmarks
      • DFSIO
        – Read: 66 MB/s
        – Write: 40 MB/s
      • Observed on busy cluster
        – Read: 1.02 MB/s
        – Write: 1.09 MB/s
      • Sort (“Very carefully tuned user application”)

        Bytes (TB)  Nodes  Maps    Reduces  Time      HDFS I/O aggregate (GB/s)  HDFS I/O per node (MB/s)
        1           1460   8000    2700     62 s      32                         22.1
        1000        3558   80,000  20,000   58,500 s  34.2                       9.35
  28. ZooKeeper
      • A distributed coordination service for distributed apps
        – Event coordination and notification
        – Leader election
        – Distributed locking
      • ZooKeeper can help build HA systems
  29. HBase
      • Distributed table store on top of HDFS
        – An implementation of Google’s BigTable
      • Big table is Big Data, cannot be stored on a single node
      • Tables: big, sparse, loosely structured
        – Consist of rows, having unique row keys
        – Have an arbitrary number of columns, grouped into a small number of column families
        – Dynamic column creation
      • Table is partitioned into regions
        – Horizontally across rows; vertically across column families
      • HBase provides structured yet flexible access to data
      • Near real-time data processing
  30. HBase Functionality
      • HBaseAdmin: administrative functions
        – Create, delete, list tables
        – Create, update, delete columns, families
        – Split, compact, flush
      • HTable: access table data
        – Result HTable.get(Get g) // get cells of a row
        – void HTable.put(Put p) // update a row
        – void HTable.put(Put[] p) // batch update of rows
        – void HTable.delete(Delete d) // delete cells/row
        – ResultScanner getScanner(family) // scan col family
  31. HBase Architecture
  32. Pig
      • A language on top of MapReduce that simplifies it
      • Pig speaks Pig Latin
      • SQL-like language
      • Pig programs are translated into a series of MapReduce jobs
  33. Hive
      • Serves the same purpose as Pig
      • Closely follows SQL standards
      • Keeps metadata about Hive tables in MySQL RDBMS
  34. Oozie
      • Workflow actions are arranged as a Directed Acyclic Graph
        – Multiple steps: MR, Pig, Hive, Java, data mover, ...
      • Coordinator jobs (time/data driven workflow jobs)
        – A workflow job is scheduled at a regular frequency
        – The workflow job is started when all inputs are available
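Arranging actions as a directed acyclic graph means an action can start only after everything it depends on has finished. The sketch below shows one standard way to derive such an order (Kahn's topological sort). It is not Oozie code and does not use Oozie's workflow definition format; the class name and the action names `mr`, `pig`, `hive`, and `export` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Orders workflow actions so that every action runs after its prerequisites.
final class WorkflowDagSketch {

    // dependsOn: action -> actions that must finish first.
    // Every action must appear as a key, even if it has no prerequisites.
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();         // unmet prerequisites
        Map<String, List<String>> dependents = new LinkedHashMap<>(); // reverse edges
        for (String action : dependsOn.keySet()) {
            pending.put(action, 0);
            dependents.put(action, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            for (String prerequisite : entry.getValue()) {
                if (!dependents.containsKey(prerequisite)) {
                    throw new IllegalArgumentException("unknown action: " + prerequisite);
                }
                dependents.get(prerequisite).add(entry.getKey());
                pending.merge(entry.getKey(), 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            for (String next : dependents.get(action)) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next); // all prerequisites of next are done
                }
            }
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalStateException("workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> workflow = new LinkedHashMap<>();
        workflow.put("mr", List.of());
        workflow.put("pig", List.of("mr"));
        workflow.put("hive", List.of("mr"));
        workflow.put("export", List.of("pig", "hive"));
        System.out.println(executionOrder(workflow));
    }
}
```

In this example `pig` and `hive` depend only on `mr`, so a workflow engine would be free to run them in parallel; the sketch simply lists them one after the other.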
  35. The Future: Next Generation MapReduce
      • “Apache Hadoop: The scalability update” USENIX ;login:
      • Next Generation MapReduce
        – Separation of JobTracker functions
          1. Job scheduling and resource allocation
             • Fundamentally centralized
          2. Job monitoring and job life-cycle coordination
             • Delegate coordination of different jobs to other nodes
        – Dynamic partitioning of cluster resources: no fixed slots
      • HDFS Federation
        – Independent NameNodes sharing a common pool of DataNodes
        – Cluster is a family of volumes with shared block storage layer
        – User sees volumes as isolated file systems
        – ViewFS: the client-side mount table
        – Federated approach provides a static partitioning of the federated namespace
  36. The End