Inside MapR's M7

This talk takes a technical deep dive into MapR's M7, including some of the key challenges that were solved during its implementation. M7 is a clean-room implementation of the HBase API, written in C++ and fully integrated into the MapR platform.

In the process of implementing M7, we learned some lessons and solved some interesting challenges. Ted Dunning shares these experiences and lessons, many of which apply to high-performance query systems in general. Some of the resulting techniques have already been adopted by the Apache Drill project, but there are many more places where they can be used.

  • Speaker note: A major advantage of MapR is the distributed NameNode. The NameNode in stock Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster, which provides major advantages in HA, data-loss avoidance, scalability, and performance. Other distributions have this bottleneck regardless of the number of nodes in the cluster, and top out at roughly 100-200 million files, and that only with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.

Inside MapR's M7 - Presentation Transcript

  • Slide 1: Inside MapR's M7 - How to get a million ops per second on 10 nodes
  • Slide 2: Me, Us  Ted Dunning, Chief Application Architect, MapR – committer and PMC member, Mahout, ZooKeeper, Drill – bought the beer at the first HUG  MapR – distributes more open source components for Hadoop – adds major technology for performance, HA, and industry-standard APIs  Tonight – hash tags: #mapr #fast – see also @ApacheMahout @ApacheDrill @ted_dunning and @mapR
  • Slide 3: MapR does MapReduce (fast)  TeraSort record: 1 TB in 54 seconds on 1003 nodes  MinuteSort record: 1.5 TB in 59 seconds on 2103 nodes
  • Slide 4: MapR: Lights Out Data Center Ready  Reliable Compute – automated stateful failover – automated re-replication – self-healing from HW and SW failures – load balancing – rolling upgrades – no lost jobs or data – 99999's of uptime  Dependable Storage – business continuity with snapshots and mirrors – recover to a point in time – end-to-end checksumming – strong consistency – built-in compression – mirror between two sites by RTO policy
  • Slide 5: Part 1: What's past is prologue
  • Slide 6: Part 1: What's past is prologue – HBase is really good, except when it isn't, but it has a heart of gold
  • Slide 7: Part 2: An implementation tour
  • Slide 8: Part 2: An implementation tour – with many tricks and clever ploys
  • Slide 9: Part 3: Results
  • Slide 11: Part 1: What's past is prologue
  • Slide 12: NoSQL (word cloud): DynamoDB, ZopeDB, Shoal, CloudKit, VertexDB, FlockDB
  • Slide 13: HBase Table Architecture  Tables are divided into key ranges (regions)  Regions are served by nodes (RegionServers)  Columns are divided into access groups (column families)
  • Slide 14: HBase Architecture is Better  Strong consistency model – when a write returns, all readers will see the same value – "eventually consistent" is often "eventually inconsistent"  Scan works – does not broadcast – ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans  Scales automatically – splits when regions become too large – uses HDFS to spread data and manage space  Integrated with Hadoop – map-reduce on HBase is straightforward
  • Slide 15: But ... how well do you know HBCK? (a.k.a. HBase Recovery)  HBASE-5843: Improve HBase MTTR – Mean Time To Recover  HBASE-6401: HBase may lose edits after a crash with 1.0.3 – uses appends  HBASE-3809: .META. may not come back online if ….  etc. – about 40-50 JIRAs on this topic  A very complex algorithm to assign a region – and it still does not get it right on reboot
  • Slide 16: HBase Issues  Reliability – compactions disrupt operations – very slow crash recovery – unreliable splitting  Business continuity – common hardware/software issues cause downtime – administration requires downtime – no point-in-time recovery – complex backup process  Performance – many bottlenecks result in low throughput – limited data locality – limited # of tables  Manageability – compactions, splits and merges must be done manually (in reality) – basic operations like backup or table rename are complex
  • Slide 17: Examples: Performance Issues  Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)  Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from local drives. (HBASE-4755, HBASE-4491)  Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1 TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)  Limited # of tables: a single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
  • Slide 18: Examples: Manageability Issues  Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to trigger compactions manually. (http://hbase.apache.org/book.html#compaction)  Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)  Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.  Basic administration is complex: renaming a table requires copying all the data; backing up a cluster is a complex process. (HBASE-643)
  • Slide 19: Examples: Reliability Issues  Compactions disrupt HBase operations: I/O bursts overwhelm nodes. (http://hbase.apache.org/book.html#compaction)  Very slow crash recovery: a RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for the impacted regions. (HBASE-1111)  Unreliable splitting: region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)  No client throttling: the HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
  • Slide 20: One Issue – Crash Recovery Too Slow  HBASE-1111 superseded by HBASE-5843, which is blocked by: HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315 HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590 HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878 HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006 HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772 HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859 HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924 HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721 HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109 HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
  • Slide 21: What is the source of these problems?
  • Slide 22: RegionServers are problematic  Coordinating 3 separate distributed systems is very hard – HBase, HDFS, ZK – each of these systems has multiple internal subsystems – too many races, too many undefined properties  No distributed transaction framework is available – too many failures to deal with  Java GC wipes out the RS from time to time – cannot use -Xmx20g for a RS  Hence all the bugs – HBCK is your "friend"
  • Slide 23: Region Assignment in Apache HBase
  • Slide 24: HDFS Architecture Review  Files are sharded into blocks, distributed across DataNodes  The NameNode holds (in DRAM) the directories, files, and block replica locations  DataNodes serve blocks and have no idea about files/dirs  All ops go to the NN
  • Slide 25: A File at the NameNode  The NameNode holds in memory – the directory hierarchy ("names") – file attrs ("inode") – a composite file structure: an array of block-ids  A 1-byte file in HDFS costs 1 HDFS "block" on 3 DNs and 3 entries in the NN, totaling ~1K of DRAM
  • Slide 26: NN scalability problems  DNs report their blocks to the NN – with 128 MB blocks and 12 TB of disk, a DN sends ~100K blocks per report – the RPC on the wire is ~4 MB – causing extreme load at both the DN and the NN (worked out below)  With NN HA, DNs send dual block reports – one to the primary, one to the secondary – doubling the load on the DN
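A quick sanity check on those numbers (back of the envelope; the ~40 bytes per entry is inferred from the slide's own 4 MB figure):

$$\frac{12\ \mathrm{TB}}{128\ \mathrm{MB\ per\ block}} = \frac{12 \times 2^{40}}{128 \times 2^{20}} \approx 98{,}000\ \text{blocks per DataNode}$$

$$98{,}000\ \text{blocks} \times \sim\!40\ \mathrm{B\ per\ entry} \approx 4\ \mathrm{MB\ per\ block\ report}$$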
  • Slide 27: Scaling Parameters  Unit of I/O – 4K/8K (8K in MapR)  Unit of chunking (a map-reduce split) – 10-100's of megabytes  Unit of resync (a replica) – 10-100's of gigabytes – a container in MapR  Unit of administration (snapshots, replication, mirroring, quotas, backup) – 1 gigabyte to 1000's of terabytes – a volume in MapR – "what data is affected by my missing blocks?"  (Diagram: each unit is roughly 10^3 larger than the previous – i/o, map-red, resync, admin – with the HDFS 'block' straddling several of them.)
  • Slide 28: MapR's No-NameNode Architecture  HDFS Federation – multiple single points of failure – limited to 50-200 million files – performance bottleneck – commercial NAS appliance required  MapR (distributed metadata) – HA w/ automatic failover – instant cluster restart – up to 1 trillion files – 20x higher performance – 100% commodity hardware  (Diagram: federated NameNode pairs over DataNodes vs. metadata distributed across all nodes.)
  • Slide 29: MapR's Distributed NameNode  Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks  Containers are 16-32 GB segments of disk, placed on nodes  Each container contains directories & files and data blocks  Containers are replicated on servers  Millions of containers in a typical cluster (see the arithmetic below)  Patent pending
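For scale, a quick back-of-envelope using the 12 TB-per-node figure from the NN scalability slide:

$$\frac{12\ \mathrm{TB\ per\ node}}{16\text{-}32\ \mathrm{GB\ per\ container}} \approx 375\text{-}750\ \text{containers per node}$$

so a cluster of a few thousand such nodes does indeed carry millions of containers.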
  • Slide 30: M7 Containers  A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable – each can hold 100's of millions of files  A container is replicated to servers – the unit of resynchronization  A region lives entirely inside 1 container – all files + WALs + B-trees + bloom filters + range maps
  • Slide 31: Read-write Replication  Writes are synchronous – all copies have the same data  Data is replicated in "chain" fashion – better bandwidth, utilizes full-duplex network links well  Metadata is replicated in "star" fashion – better response time, and bandwidth is not a concern – data can also be done this way (see the model below)
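To see why chain suits bulk data and star suits small metadata, here is a back-of-envelope cost model in C++ (illustrative parameters only, not MapR code):

```cpp
#include <cstdio>

// Toy model: replicate a payload of S bytes to N replicas over links of
// bandwidth B bytes/s with per-hop latency L seconds.
int main() {
    const int    N = 3;          // replicas
    const double S = 256e6;      // 256 MB payload
    const double B = 125e6;      // ~1 Gb/s link
    const double L = 0.0005;     // 0.5 ms per hop

    // Chain: the payload crosses each link exactly once; with pipelining
    // the transfer costs ~S/B overall, plus one hop of latency per replica.
    double chain = S / B + N * L;

    // Star: the sender's single uplink must push N copies, but there is
    // only one hop of latency.
    double star = N * S / B + L;

    std::printf("chain %.3f s vs star %.3f s for bulk data\n", chain, star);
    // For tiny metadata updates S/B is negligible, so star's single round
    // trip wins on response time -- exactly the split the slide describes.
    return 0;
}
```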
  • Slide 33: Failure Handling  Containers are managed at the CLDB – the Container Location DataBase – via heartbeats and container reports  HB loss + an upstream entity reporting failure => server declared dead  Increment the epoch at the CLDB  Rearrange replication  Exact same code for files and M7 tables  No ZK needed at this level
  • Slide 34: Benchmark: file creates (100B)  Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM (same 10 nodes, with 3X replication)  Results: MapR 14-16K creates/s vs. 335-360 for the other distribution (40x advantage); MapR scales to 6B files vs. 1.3M (4615x advantage)  (Charts: file creates/s vs. millions of files, MapR distribution vs. other distribution.)
  • Slide 35: Recap  HBase has a good basis – but is handicapped by HDFS – yet can't do without HDFS – so HBase can't be fixed in isolation  Separating the key storage scaling parameters is key – allows an additional layer of storage indirection – results in huge scaling and performance improvements  Low-level transactions are hard – but allow a R/W file system and decentralized metadata – and also allow non-file implementations
  • Slide 36: Part 2: An implementation tour
  • Slide 37: An Outline of Important Factors  Start with MapR FS (mutability, transactions, real snapshots)  C++ not Java (data never moves, better control)  Lockless design, custom queue executive (3 ns switch)  New RPC layer (> 1 M RPC/s)  Cut out the middle man (single hop to data)  Hybridize log-structured merge trees and B-trees  Adjust sizes and fanouts  Don't be silly
  • Slide 38: An Outline of Important Factors (repeated)  Start with MapR FS (mutability, transactions, real snapshots) – "we get these all for free by putting tables into MapR FS"  C++ not Java (data never moves, better control)  Lockless design, custom queue executive (3 ns switch)  New RPC layer (> 1 M RPC/s)  Cut out the middle man (single hop to data)  Hybridize log-structured merge trees and B-trees  Adjust sizes and fanouts  Don't be silly
  • Slide 39: M7: Tables Integrated into Storage  No extra daemons to manage  One hop to data  Superior caching policies  No JVM problems
  • Slide 40: Lesson 0: Implement tables in the file system
  • Slide 41: Why Not Java?  Disclaimer: I am a pro-Java bigot  But that only goes so far …  Consider the memory size of struct {x, y}[] a; (sketched below)  Consider also interpreting data as it arrives from the wire  Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and a 3 ns context switch  Consider the problem of core-locked processes running a cache-aware, lock-free, zero-copy queue of tasks  Consider the GC-free lifestyle
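To make the struct point concrete, a minimal C++ illustration (hypothetical example code, not MapR's) of why layout control matters:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Point { int32_t x, y; };  // exactly 8 bytes, no object header

int main() {
    std::vector<Point> a(1000000);
    // Contiguous storage: 1M * 8 B = 8 MB, cache-friendly, no GC pressure.
    std::printf("%zu bytes\n", a.size() * sizeof(Point));
    // A Java Point[] of the same length is an array of references, each
    // pointing at a heap object with a ~12-16 B header: roughly 3-4x the
    // memory, scattered across the heap (typical 64-bit JVM figures; the
    // exact overhead varies by JVM and settings).
    return 0;
}
```

The same layout control is what lets wire bytes be reinterpreted in place instead of being deserialized into freshly allocated objects.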
  • Slide 42: At What Cost?  Writing performant C++ is hard  Managing low-level threads is hard  Implementing very fast failure recovery is hard  Doing manual memory allocation is hard (and dangerous)  The benefits outweigh the costs with the right dev team  The benefits are dwarfed by the costs with the wrong dev team
  • Slide 43: Lesson 1: With great speed comes great responsibility
  • Slide 44: M7 Table Architecture (diagram): a table splits into tablets, each tablet into partitions, and each partition into segments
  • Slide 45: M7 Table Architecture (same diagram) – this structure is internal and not user-visible
  • Slide 46: Multi-level Design  Fixed number of levels, like HBase  Specialized fanout to match sizes to device physics  A mutable file system allows a chimeric LSM-tree / B-tree (sketched below)  Sized to match the container structure  Guaranteed locality – if the data moves, the new node will handle it – if the node fails, the new node will handle it
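A minimal sketch of the LSM/B-tree hybrid idea (assuming a mutable on-disk tree; std::map stands in for it here, and none of this is MapR's actual code):

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Writes land in a small in-memory buffer (LSM-style); the buffer is then
// merged into the tree in place, which a mutable file system permits --
// no rewrite-the-world compaction of immutable files.
class HybridStore {
    std::map<std::string, std::string> memBuffer_;  // recent writes
    std::map<std::string, std::string> tree_;       // stand-in for on-disk B-tree
    static constexpr std::size_t kFlushThreshold = 1024;

public:
    void put(const std::string& k, const std::string& v) {
        memBuffer_[k] = v;
        if (memBuffer_.size() >= kFlushThreshold) flush();
    }

    std::optional<std::string> get(const std::string& k) const {
        if (auto it = memBuffer_.find(k); it != memBuffer_.end()) return it->second;
        if (auto it = tree_.find(k); it != tree_.end()) return it->second;
        return std::nullopt;
    }

private:
    void flush() {  // merge buffered writes into the tree in place
        for (auto& [k, v] : memBuffer_) tree_[k] = std::move(v);
        memBuffer_.clear();
    }
};
```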
  • Slide 47: Lesson 2: Physics. Not just a good idea. It's the law.
  • Slide 48: RPC Reimplementation  At very high data rates, protobuf is too slow – not good as an envelope, but still a great schema definition language – most systems never hit this limit  Alternative 1 – lazy parsing allows deferral of content parsing – a naïve implementation imposes (yet another) extra copy  Alternative 2 – bespoke parsing of the envelope from the wire – content packages can land fully aligned and ready for battle directly from the wire  Let's use BOTH ideas (sketched below)
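A sketch of what combining the two ideas can look like (the header layout and names are invented for illustration, not MapR's wire format):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fixed-size envelope, read straight out of the receive buffer with a
// couple of loads -- no general-purpose decoder on the hot path.
// (Assumes sender and receiver agree on byte order.)
struct WireHeader {
    uint32_t magic;
    uint32_t method_id;
    uint32_t payload_len;
};

struct Rpc {
    WireHeader     hdr;
    const uint8_t* payload;     // points into the receive buffer: zero copy
    uint32_t       payload_len;
};

bool parseEnvelope(const uint8_t* buf, std::size_t len, Rpc* out) {
    if (len < sizeof(WireHeader)) return false;
    std::memcpy(&out->hdr, buf, sizeof(WireHeader));  // safe unaligned read
    if (out->hdr.payload_len > len - sizeof(WireHeader)) return false;
    out->payload     = buf + sizeof(WireHeader);  // left unparsed: the content
    out->payload_len = out->hdr.payload_len;      // (e.g. a protobuf) is only
    return true;                                  // decoded lazily, if a
}                                                 // handler needs the fields
```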
  • Slide 49: Lesson 3: Hacking and abstraction can co-exist
  • Slide 50: Don't Be Silly  A detailed review of the code revealed an extra copy – it was subtle. Really.  Performance increased when this was eliminated  Not as easy to spot as it sounds – but absolutely still worth finding and fixing
  • Slide 51: Part 3: Results
  • Slide 52: Server Reboot  Full container reports are tiny – the CLDB needs 2 GB of DRAM for a 1000-node cluster  Volumes come online very fast – each volume is independent of the others – a volume is up as soon as the min-repl number of its containers is ready – no need to wait for the whole cluster (e.g., HDFS waits for 99.9% of blocks reporting)  A 1000-node cluster restarts in < 5 mins
  • Slide 53: M7 provides Instant Recovery  0-40 microWALs per region – idle WALs go to zero quickly, so most are empty – the region is up before all microWALs are recovered – recovery proceeds in the background, in parallel – when a key is accessed, its microWAL is recovered inline (sketched below) – 1000-10000x faster recovery  Why doesn't HBase do this? – M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations – no limit on the # of files on disk – no limit on the # of open files – the I/O path translates random writes into sequential writes on disk
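A sketch of the microWAL recovery idea (hypothetical types and names, not MapR's code; a production version would also need to wait out an in-progress replay):

```cpp
#include <array>
#include <atomic>
#include <functional>
#include <string>

struct MicroWal {
    std::atomic<bool> recovered{false};
    void replay() { /* apply this WAL's logged mutations to the region */ }
};

class Region {
    static constexpr int kWals = 40;  // slide: 0-40 microWALs per region
    std::array<MicroWal, kWals> wals_;

    MicroWal& walFor(const std::string& key) {  // keys spread across WALs
        return wals_[std::hash<std::string>{}(key) % kWals];
    }
    static void recoverIfNeeded(MicroWal& w) {
        bool expected = false;  // first caller claims the replay
        if (w.recovered.compare_exchange_strong(expected, true)) w.replay();
    }

public:
    // The region serves requests immediately; WALs replay on worker threads...
    void recoverInBackground() {
        for (auto& w : wals_) recoverIfNeeded(w);
    }
    // ...and a WAL whose key is touched first is replayed inline instead.
    std::string get(const std::string& key) {
        recoverIfNeeded(walFor(key));
        return lookup(key);
    }
    std::string lookup(const std::string& key) {
        (void)key; return {};  // stand-in: read from the region's tree
    }
};
```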
  • Slide 54: Other M7 Features  Smaller disk footprint – M7 never repeats the key or column name  Columnar layout – M7 supports 64 column families – including in-memory column families  Online admin – M7 schema changes on the fly – delete/rename/redistribute tables
  • Slide 55: Binary Compatible  HBase applications work "as is" with M7 – no need to recompile (binary compatible)  M7 and HBase can run side by side on the same cluster – e.g., during a migration – the same program can access both an M7 table and an HBase table  Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa: % hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
  • Slides 56-58: M7 vs CDH - Mixed Load 50-50 (benchmark charts)
  • Slide 59: Recap  HBase has some excellent core ideas – but is burdened by years of technical debt – and much of that debt was charged to the HDFS credit card  MapR FS provides an ideal substrate for an HBase-like service – one hop from client to data – many problems never exist in the first place – other problems have relatively simple solutions on a better foundation  Practical results bear out the theory
  • Slide 60: Me, Us  Ted Dunning, Chief Application Architect, MapR – committer and PMC member, Mahout, ZooKeeper, Drill – bought the beer at the first HUG  MapR – distributes more open source components for Hadoop – adds major technology for performance and HA – adds industry-standard APIs  Tonight – hash tags: #nosqlnow #mapr #fast – see also @ApacheMahout @ApacheDrill @ted_dunning and @mapR