Design, Scale and Performance of MapR's Distribution for Hadoop

Details the first ever Exabyte-scale system that can hold a Trillion large files. Describes MapR's Distributed NameNode (tm) architecture, and how it scales very easily and seamlessly. Shows map-reduce performance across a variety of benchmarks like dfsio, pig-mix, nnbench, terasort and YCSB.


Transcript

  • 1. Design, Scale & Performance of the MapR Distribution
    M.C. Srivas, CTO, MapR Technologies, Inc. (6/29/2011)
  • 2. Outline of Talk
    - What does MapR do?
    - Motivation: why build this?
    - Distributed NameNode architecture: scalability factors, programming model, distributed transactions in MapR
    - Performance across a variety of loads
  • 3. Complete Distribution
    - Integrated, tested, hardened
    - Super simple
    - Unique advanced features
    - 100% compatible with the MapReduce, HBase, and HDFS APIs
    - No recompile required: drop in and use now
  • 4. MapR Areas of Development
    - HBase, MapReduce, Ecosystem, Storage, Management Services
  • 5. JIRAs Open for Years
    - HDFS-347 (7/Dec/08): streaming performance is sub-optimal
    - HDFS-273, HDFS-395 (7/Mar/07): DFS scalability problems; optimize block-reports
    - HDFS-222, HDFS-950: concatenate files into larger files. Tom White, 2/Jan/09: "Small files are a big problem for Hadoop ... 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible."
    - HDFS append: no blessed Apache Hadoop distro has the fix
    - HDFS-233 (25/Jun/08): snapshot support. Dhruba Borthakur, 10/Feb/09: "...snapshots can be designed very elegantly only if there is complete separation between namespace management and block management."
  • 6. Observations on Apache Hadoop
    - Inefficient (HDFS-347)
    - Scaling problems (HDFS-273, HDFS-395): NameNode bottleneck
    - Limited number of files (HDFS-222)
    - Admin overhead is significant
    - NameNode failure loses data: not trusted as a permanent store
    - Write-once: data is lost unless the file is closed; hflush/hsync exist, but it is unrealistic to expect folks to re-write their apps
    [Chart: READ/WRITE throughput in MB/sec, raw hardware vs. HDFS]
  • 7. MapR Approach
    - Some of these are architectural issues
    - Change at that level is a big deal: it will not be accepted unless proven, and it is hard to prove without building it first
    - Build it and prove it: improve reliability significantly, make it tremendously faster at the same time, and enable a new class of apps (e.g., real-time analytics)
  • 8. HDFS Architecture Review
    - Files are broken (sharded) into blocks, which are distributed across DataNodes
    - The NameNode holds, in memory: directories, files, and block replica locations
    - DataNodes store and serve blocks, have no idea about files/dirs
    - All ops go to the NameNode
  • 9. HDFS Architecture Review (continued)
    - DataNodes (DNs) report their blocks to the NameNode (NN)
    - A large DN does 60K blocks per report: 256 MB x 60K = 15 TB, i.e. 5 disks @ 3 TB per DataNode
    - More than ~100K blocks per report causes extreme load; a 40 GB NN restart takes 1-2 hours
    - The addressing unit is an individual block: the flat block-address space forces DNs to send giant block-reports
    - The NN can hold about ~300M blocks max, which limits cluster size to tens of petabytes
    - Increasing the block size negatively impacts map/reduce
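To make the arithmetic on slide 9 concrete, the following small program just multiplies out the figures quoted there (256 MB blocks, 60K blocks per report, a ~300M-block NameNode ceiling); none of these numbers are measured here, they are the slide's own.

```cpp
// Worked arithmetic for the HDFS scaling figures quoted on slide 9.
// All inputs (256 MB blocks, 60K blocks per block-report, ~300M-block
// NameNode ceiling) come from the slide; nothing here is measured.
#include <cstdio>

int main() {
    const double block_gb      = 0.256;   // 256 MB block size, in GB
    const double blocks_per_dn = 60e3;    // one large DataNode's block-report

    // Data held by one DataNode: 256 MB x 60K ~= 15 TB, i.e. ~5 disks of 3 TB.
    const double data_per_dn_tb = block_gb * blocks_per_dn / 1e3;

    // Rough NameNode ceiling of ~300M blocks. Multiplying out gives the raw
    // block capacity; after replication this lands in the "tens of petabytes"
    // the slide mentions.
    const double nn_blocks      = 300e6;
    const double addressable_pb = nn_blocks * block_gb / 1e6;

    std::printf("data per DataNode : %.1f TB (~%.0f disks of 3 TB)\n",
                data_per_dn_tb, data_per_dn_tb / 3.0);
    std::printf("NameNode ceiling  : ~%.0f PB of raw blocks\n", addressable_pb);
    return 0;
}
```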
  • 10. How to Scale
    - A central meta server does not scale: make every server a metadata server too
    - But memory is needed for map/reduce, so metadata must be paged to disk
    - Reduce the size of block-reports while increasing the number of blocks per DN
    - Reduce the memory footprint of the location service; memory cannot be added indefinitely
    - Need fast restart (HA)
  • 11. MapR Goal: Scale to 1000x
                  HDFS          MapR
    # files       150 million   1 trillion
    # data        10-50 PB      1-10 exabytes
    # nodes       2,000         10,000+
    - Full random read/write semantics: export via NFS and other protocols, with enterprise-class reliability (instant restart, snapshots, mirrors, no single point of failure, ...)
    - Run close to hardware speeds: at extreme scale, efficiency matters extremely; exploit emerging technology like SSD and 10GE
  • 12. MapR's Distributed NameNode
    - Files and directories are sharded into blocks, which are placed into mini NameNodes (containers) on disk
    - Each container contains directories & files plus data blocks, and is replicated on servers
    - Containers are 16-32 GB segments of disk, placed on nodes
    - No need to manage containers directly: use MapR Volumes
    - Patent pending
  • 13. MapR Volumes
    - Significant advantages over "cluster-wide" or "file-level" approaches
    - Volumes allow management attributes to be applied in a scalable way, at a very granular level and with flexibility: replication factor, scheduled mirroring, scheduled snapshots, data placement control, usage tracking, administrative permissions
    - 100K volumes are OK; create as many as desired!
    [Diagram: example volumes such as /projects, /tahoe, /yosemite, /user, /msmith, /bjohnson]
  • 14. MapR Distributed NameNode
    - Containers are tracked globally in a container location map (container -> replica servers, e.g. S1, S2, S4)
    - Clients cache container and server info for extended periods
    - To read, a client fetches the container's locations from the map, then contacts one of those servers directly for the data
    [Diagram: container map entries {S1,S2,S4}, {S1,S3}, {S1,S4,S5}, {S2,S3,S5}, {S2,S4,S5} spread across servers S1-S5]
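The lookup flow on slide 14 could be sketched roughly as below. This is an illustration only: the class and function names (ContainerLocationCache, fetchFromLocationService) are invented here, and the real MapR client API is not being quoted.

```cpp
// Illustrative sketch of the client-side container lookup described on
// slide 14: replica locations are cached for a long time and the location
// service is only consulted on a miss. All names are invented.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct ContainerLocations {
    std::vector<std::string> replicas;  // e.g. {"S1", "S2", "S4"}
};

class ContainerLocationCache {
public:
    // Return the replica servers for a container, hitting the (small) global
    // container map only on a cache miss. Entries can live for extended
    // periods because container locations change rarely.
    const ContainerLocations& locate(uint64_t containerId) {
        auto it = cache_.find(containerId);
        if (it == cache_.end()) {
            it = cache_.emplace(containerId,
                                fetchFromLocationService(containerId)).first;
        }
        return it->second;
    }

private:
    // Placeholder for the RPC to the container-location service; a real
    // client would go over the wire here.
    ContainerLocations fetchFromLocationService(uint64_t containerId) {
        (void)containerId;
        return ContainerLocations{{"S1", "S2", "S4"}};
    }

    std::unordered_map<uint64_t, ContainerLocations> cache_;
};

// Usage: a read first resolves the container, then talks to a replica directly.
//   ContainerLocationCache cache;
//   const auto& locs = cache.locate(42);   // contact locs.replicas[0] to read
```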
  • 15. MapR's Distributed NameNode: Scaling
    - Containers represent 16-32 GB of data; each can hold up to 1 billion files and directories
    - 100M containers = ~2 exabytes (a very large cluster)
    - 250 bytes of DRAM to cache a container: 25 GB caches all containers of a 2 EB cluster, but that is not necessary since entries can be paged to disk; a typical large 10 PB cluster needs 2 GB
    - Container-reports are 100x-1000x smaller than HDFS block-reports, so 100x more data nodes can be served
    - Increase the container size to 64 GB to serve a 4 EB cluster; map/reduce is not affected
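Slide 15's scaling claim is again just arithmetic on its own figures (16-32 GB containers, 250 bytes of DRAM per cached container, 100M containers); the short program below spells it out, including a rough container-report vs. block-report comparison for the 15 TB DataNode from slide 9.

```cpp
// The container-scaling arithmetic from slide 15, spelled out. Inputs
// (16-32 GB containers, 250 bytes of DRAM per cached container, 100M
// containers, the 15 TB DataNode from slide 9) are the slides' own figures.
#include <cstdio>

int main() {
    const double num_containers  = 100e6;   // 100M containers
    const double bytes_per_entry = 250.0;   // DRAM to cache one container

    // Cluster capacity at the two ends of the 16-32 GB container range.
    const double capacity_low_eb  = num_containers * 16.0 / 1e9;  // ~1.6 EB
    const double capacity_high_eb = num_containers * 32.0 / 1e9;  // ~3.2 EB

    // DRAM needed to cache every container of such a cluster.
    const double cache_gb = num_containers * bytes_per_entry / 1e9;  // ~25 GB

    // Report-size comparison for the 15 TB DataNode of slide 9: roughly
    // 500-1000 containers instead of 60,000 blocks, i.e. on the order of the
    // 100x reduction the slide claims.
    const double dn_tb = 15.0;
    const double containers_per_dn = dn_tb * 1e3 / 32.0;   // at 32 GB each
    const double report_ratio      = 60e3 / containers_per_dn;

    std::printf("capacity: %.1f-%.1f EB for 100M containers\n",
                capacity_low_eb, capacity_high_eb);
    std::printf("DRAM to cache all containers: %.0f GB\n", cache_gb);
    std::printf("report entries per 15 TB DataNode: ~%.0f containers "
                "(vs. 60,000 blocks, ~%.0fx fewer)\n",
                containers_per_dn, report_ratio);
    return 0;
}
```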
  • 16. MapR Distributed NameNode HA
    MapR:
    1. apt-get install mapr-cldb, while the cluster is online
    Apache Hadoop*:
    1. Stop the cluster very carefully
    2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
    3. Install and configure the DRBD + Heartbeat packages:
       i. yum -y install drbd82 kmod-drbd82 heartbeat
       ii. chkconfig --add heartbeat (both machines)
       iii. edit /etc/drbd.conf on both machines
       iv-xxxix. make a raid-0 md, ask drbd to manage the raid md, zero it if drbd dies & try again
       xxxx. mkfs ext3 on it, mount /hadoop (both machines)
       xxxxi. install all rpms in /hadoop, but don't run them yet (chkconfig off)
       xxxxii. umount /hadoop (!!)
       xxxxiii. edit the 3 files in /etc/ha.d/* to configure heartbeat
       ...
    40. Restart the cluster. If there are any problems, start at /var/log/ha.log for hints on what went wrong.
    *As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration; author: Christophe Bisciglia, Cloudera.
  • 17. Step Back & Rethink the Problem
    Big disruption in the hardware landscape:
                        Year 2000   Year 2012
    # cores per box     2           128
    DRAM per box        4 GB        512 GB
    # disks per box     250+        12
    Disk capacity       18 GB       6 TB
    Cluster size        2-10        10,000
    - No spin-locks / mutexes; 10,000+ threads
    - Minimal footprint: preserve resources for the app
    - Rapid re-replication; scale to several exabytes
  • 18. MapR's Programming Model
    - Written in C++ and fully asynchronous: ioMgr->read(..., callbackFunc, void *arg)
    - Each module runs requests from its own request-queue, with one OS thread per CPU core
    - Dispatch maps container -> queue -> CPU core; the callback is guaranteed to be invoked on the same core, so no mutexes are needed
    - When load increases, add a CPU core and move some queues to it
    - State machines on each queue: a thread stack is 4 KB, so 10,000+ threads cost ~40 MB; a context switch is 3 instructions, so 250K context switches per core per second is OK
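A minimal sketch of the dispatch pattern slide 18 describes: one worker per core, each draining its own queue, with completion callbacks posted back to the issuing core's queue. This only mirrors the ioMgr->read(..., callbackFunc, arg) shape from the slide; the names are invented, and a mutex-protected deque stands in for what would really be a lock-free queue.

```cpp
// Sketch of per-core, callback-driven dispatch: one thread drains each
// queue, and completions are posted back to the issuing core's queue, so
// per-core state needs no locking. Illustrative only; not MapR source.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

class CoreQueue {
public:
    void post(std::function<void()> fn) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(fn)); }
        cv_.notify_one();
    }
    // Drained by exactly one thread (one per CPU core), so any state touched
    // only inside this core's tasks needs no mutex of its own.
    void run() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return !q_.empty(); });
                fn = std::move(q_.front());
                q_.pop_front();
            }
            if (!fn) return;   // an empty task is the shutdown signal
            fn();
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> q_;
};

// An async read hands the I/O to an I/O core's queue and arranges for the
// callback to run back on the caller's own queue, mimicking the
// ioMgr->read(..., callbackFunc, arg) shape from the slide.
void asyncRead(CoreQueue& ioQueue, CoreQueue& callerQueue,
               std::function<void(int /*bytesRead*/)> cb) {
    ioQueue.post([&callerQueue, cb] {
        int bytesRead = 4096;                        // pretend the I/O happened
        callerQueue.post([cb, bytesRead] { cb(bytesRead); });
    });
}
```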
  • 19. MapR on Linux
    - User-space process, avoids system crashes
    - Minimal footprint: preserves CPU, memory & resources for the app; uses only 1/5th of system memory; runs on 1 or 2 cores, leaving the others for the app
    - Emphasis on efficiency, avoids lots of layering: raw devices, direct I/O, doesn't use the Linux VM
    - CPU/memory firewalls implemented: runaway tasks no longer impact system processes
  • 20. Random Writing in MapR
    - A client writing data asks the NameNode map for a 64 MB block; the map creates or attaches a container and picks a master plus 2 replica slaves
    - The client then writes the next chunk directly to the master (S2 in the diagram)
    [Diagram: client, NameNode map with container->replica entries (S1,S2,S4; S1,S3; S1,S4,S5; S2,S3,S5; S2,S4,S5), and servers S1-S5, with the write going to master S2]
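The write path slide 20 diagrams might look roughly like this from the client's point of view: ask the container map to attach a new chunk (it picks the master and two replica slaves), then stream data to the master only. All names here (attachNewChunk, sendChunk, ContainerAssignment) are invented for the sketch.

```cpp
// Illustrative client-side write path for the flow slide 20 diagrams. The
// container map is consulted only to attach a new chunk; the data itself goes
// to the container's master replica, which takes care of the slaves.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct ContainerAssignment {
    std::string master;               // e.g. "S2"
    std::vector<std::string> slaves;  // e.g. {"S4", "S5"}
};

// Stand-in for the RPC to the container map: attach a new 64 MB chunk and
// learn which servers hold the chosen container.
ContainerAssignment attachNewChunk(const std::string& path) {
    (void)path;
    return ContainerAssignment{"S2", {"S4", "S5"}};
}

// Stand-in for the data RPC; a real client would stream bytes over the wire.
void sendChunk(const std::string& server, const char* data, size_t len) {
    std::printf("sending %zu bytes to %s\n", len, server.c_str());
    (void)data;
}

void writeChunk(const std::string& path, const char* data, size_t len) {
    // 1. The container map picks a master and two replica slaves for the chunk.
    ContainerAssignment where = attachNewChunk(path);

    // 2. The client writes only to the master; replicating to the slaves is
    //    the master's responsibility, so the client never fans out itself.
    sendChunk(where.master, data, len);
}
```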
  • 21. MapR's Distributed NameNode
    - Distributed transactions stitch containers together
    - Each node uses a write-ahead log that supports both value logging and operational logging:
        value-log record = { disk-offset, old, new }
        op-log record = { op-details, undo-op, redo-op }
    - Recovery in 2 seconds
    - Global ids enable participation in distributed transactions
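The two log-record shapes slide 21 lists might be declared roughly as follows; the field names are taken from the slide's { disk-offset, old, new } and { op-details, undo-op, redo-op } descriptions, and everything else is an assumption.

```cpp
// Rough shape of the two write-ahead-log record kinds slide 21 describes:
// value logging (physical: old/new bytes at a disk offset) and operational
// logging (logical: redo/undo of an operation). Field names follow the
// slide's wording; the rest is assumed for illustration.
#include <cstdint>
#include <vector>

struct ValueLogRecord {             // physical (value) logging
    uint64_t diskOffset;            // where the bytes live
    std::vector<uint8_t> oldBytes;  // before-image, used for undo
    std::vector<uint8_t> newBytes;  // after-image, used for redo
};

enum class OpCode : uint16_t { CreateFile, DeleteFile, Rename /* ... */ };

struct OpLogRecord {                // operational (logical) logging
    uint64_t globalTxnId;           // global id: lets the record take part in a
                                    // distributed transaction across containers
    OpCode   redoOp;                // how to re-apply the operation
    OpCode   undoOp;                // how to roll it back
    std::vector<uint8_t> opDetails; // serialized arguments of the operation
};

// Recovery replays the log forward (redo) and rolls back anything that was
// never confirmed (undo); the slide quotes about 2 seconds for this.
```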
  • 22. 2-Phase Commit: Unsuitable
    - The app does BeginTrans .. work .. Commit (C = coordinator, P = participant)
    - On app-commit, C force-logs and sends prepare to each P; each P sends prepare-ack, gives up its right to abort, and waits for C even across crashes/reboots
    - A P unlocks only when the commit is received from C
    - Too many message exchanges; a single failure can lock up the entire cluster
  • 23. Quorum Completion: Unsuitable
    - The app does BeginTrans .. work .. Commit (C = coordinator, P = participant)
    - On app-commit, C broadcasts prepare; if a majority responds, C commits; if not, the cluster goes into election mode; if no majority is found, everything fails
    - Update throughput is very poor; does not work with fewer than N/2 nodes
    - Monolithic. Hierarchical? Cycles? Oh no!!
  • 24. MapR Lockless Transactions
    - BeginTrans + work + Commit, but with no explicit commit
    - Uses rollback: a confirm callback is piggy-backed on later traffic, work is undone on confirmed failure, and any replica can confirm
    - Update throughput is very high; no locks are held across messages
    - Crash resistant; cycles OK
    - Patent pending
    [Diagram: a transaction flowing across NN1-NN4]
  • 25. Small Files (Apache Hadoop, 10 nodes)
    - Op: create file, write 100 bytes, close
    - Notes: NN not replicated; NN uses 20 GB DRAM; DN uses 2 GB DRAM
    [Chart: file-create rate (files/sec) vs. # of files (millions), out-of-box vs. tuned configurations]
  • 26. MapR Distributed NameNode
    - Same 10 nodes, but with 3x replication added
    [Chart: create rate for 100-byte files (files/sec) vs. # of files (millions); "test stopped here" marker]
  • 27. MapR's Data Integrity
    - End-to-end checksums on all data (not optional): computed in the client's memory, written to disk at the server; on read, validated at both client & server
    - RPC packets have their own independent checksum, which detects RPC message corruption
    - Transactional with ACID semantics; metadata, including the log itself, is checksummed
    - Allocation bitmaps are written to two places (dual blocks)
    - Automatic compression built in
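A minimal sketch of the end-to-end checksum idea on slide 27: the checksum is computed in the client's memory, travels with the data, and is re-verified on the server at write time and again at the client on read. CRC-32 from zlib is used purely for illustration; the slide does not say which checksum MapR actually uses.

```cpp
// Minimal end-to-end checksum sketch: the checksum is computed in the
// client's memory, stored alongside the data, and re-verified by the server
// on write and by the client on read. CRC-32 via zlib is for illustration
// only; the slide does not name MapR's actual checksum.
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <zlib.h>

struct CheckedChunk {
    std::vector<uint8_t> data;
    uint32_t checksum;             // computed by the writer, stored on disk
};

static uint32_t checksumOf(const std::vector<uint8_t>& d) {
    return static_cast<uint32_t>(
        crc32(0L, d.data(), static_cast<uInt>(d.size())));
}

// Client side of a write: checksum first, then hand both to the server.
CheckedChunk prepareWrite(std::vector<uint8_t> data) {
    CheckedChunk c{std::move(data), 0};
    c.checksum = checksumOf(c.data);
    return c;
}

// Either end of a read: recompute and compare, failing loudly on mismatch.
void verify(const CheckedChunk& c) {
    if (checksumOf(c.data) != c.checksum)
        throw std::runtime_error(
            "checksum mismatch: data corrupted in transit or at rest");
}
```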
  • 28. MapR's Random-Write Eases Data Import
    With MapR, use NFS:
    1. mount /mapr (real-time, HA)
    Otherwise, use Flume/Scribe:
    1. Set up sinks (find unused machines??)
    2. Set up intrusive agents: i. tail("xxx"), tailDir("y") ii. agentBESink
    3. All reliability levels lose data: best-effort, one-shot, disk fail-over, end-to-end
    4. Data is not available immediately
  • 29. MapR's Streaming Performance
    - Tests: (i) 16 streams x 120 GB, (ii) 2,000 streams x 1 GB
    [Charts: read/write throughput in MB/sec for 11 x 7200 rpm SATA and 11 x 15K rpm SAS, comparing raw hardware, MapR, and Hadoop; higher is better]
  • 30. HBase on MapR
    - YCSB insert with 1 billion 1 KB records; 10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 rpm
    [Chart: inserts in thousands of records/sec, MapR vs. Apache, with WAL off and WAL on; higher is better]
  • 31. HBase on MapR
    - YCSB random read with 1 billion 1 KB records; 10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 rpm
    [Chart: records/sec, MapR vs. Apache, for Zipfian and uniform key distributions; higher is better]
  • 32. Terasort on MapR
    - 10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
    [Chart: elapsed time in minutes, MapR vs. Hadoop, at 1.0 TB and 3.5 TB; lower is better]
  • 33. PigMix on MapR
    [Chart: time in seconds, MapR vs. Hadoop; lower is better]
  • 34. Summary
    - Fully HA: JobTracker, snapshots, mirrors, multi-cluster capable
    - Super simple to manage; NFS mountable
    - Complete read/write semantics: file contents are visible immediately
    - MapR has signed the Apache CCLA: ZooKeeper, Mahout, YCSB, and HBase fixes contributed, with more to come
    - Download it at www.mapr.com