Case Study of BigData use with MapR M7 in the Enterprise Datacenter



NoSQL is often used to meet large-scale data needs, but most NoSQL solutions lack strong data integrity features. I will describe how one company is solving its big data problems using MapR M7. MapR's Data Platform (MDP) provides tables as a built-in feature of the file system and exposes an HBase-compatible API for accessing these tables. The architecture is designed to scale to very large sizes, to avoid long pauses due to compaction, and to allow complete integration with MapR's snapshots and mirrors.
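
To make the "tables in the file system" idea concrete, here is a minimal in-memory sketch of an HBase-style table: rows are kept sorted by row key, each row maps `family:qualifier` columns to values, and the table is identified by a path. This is a toy model for illustration only. `ToyTable`, its methods, and the path `/user/demo/records` are hypothetical names, not part of the MapR or HBase APIs.

```python
from bisect import insort


class ToyTable:
    """In-memory stand-in for an HBase-style table (not a MapR or HBase API)."""

    def __init__(self, path):
        self.path = path   # hypothetical path; M7 tables share the file namespace
        self._rows = {}    # row key -> {"family:qualifier": value}
        self._keys = []    # row keys kept in sorted order for range scans

    def put(self, row_key, column, value):
        """Store one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of the row's columns, or an empty dict if absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        for key in self._keys:
            if start <= key < stop:
                yield key, dict(self._rows[key])


table = ToyTable("/user/demo/records")
table.put("person#0002", "vital:birth_year", "1901")
table.put("person#0001", "vital:birth_year", "1887")
for row_key, columns in table.scan("person#0000", "person#9999"):
    print(row_key, columns)
```

A real client would go through the HBase API that M7 exposes; the sketch only shows the shape of the data model (sorted row keys, column families, range scans) that such an API operates on.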

We will talk about a specific customer use case in which the customer's priorities for reliability, ease of use and business continuity made M7 the best choice.

Published in: Technology

  • MapR’s innovations have also expanded the use cases that are possible with Hadoop. In addition to supporting the full Hadoop API set, MapR supports NFS, so any file-based application can access the cluster with no changes or rewrites. MapR provides ODBC support, so any database application or SQL-based tool can access and manipulate data in a MapR cluster. MapR also supports real-time streaming access, which moves Hadoop beyond batch-only workloads and greatly expands the applications that are possible. Finally, MapR’s full HA, DR and data protection capabilities allow mission-critical applications to be deployed safely and allow administrators to meet stringent SLA targets.
  • In Hadoop today, the NameNode is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode: the NameNode function is distributed across the cluster. This provides major advantages in HA, data loss avoidance, scalability and performance. With other distributions, the NameNode remains a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is 200M, and only with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.
  • (Ed. note: this is a great whiteboard slide to summarize M7.) The stack on the left represents the HBase architecture found in all other distributions. HBase runs in a JVM and stores its data in the HDFS layer, which runs in another JVM and in turn stores its data in the Linux file system (ext3), which writes the data to disk. This stack results in many administrative tasks, performance issues, and reliability issues. Much of the infrastructure within HBase is an attempt to make up for the deficiencies of HDFS: you have a database that must handle random I/O running on top of a write-once file system. The middle stack shows how MapR simplified the lower part of the stack with our M5 edition, which replaced HDFS and the dependency on the Linux file system with a random read/write storage layer. However, HBase is still a separate infrastructure running on top of the storage layer within M5. The region servers are separate, and users still experience downtime and delays when recovering from node failures and snapshots. With M7, on the far right, MapR has unified tables and files into a single data platform. We’ve eliminated the separate HBase infrastructure, and with the redundant components gone the environment is much simpler to manage. We’ve provided a uniform data management layer and a consistent data protection layer across files and tables. Recovery from node failures takes seconds, there is 100% data locality, and HBase can read directly from snapshots. Files and tables share the same namespace, volumes, and directories.
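
The file-count figures quoted in the notes and on slide 9 can be checked with a little arithmetic. The sketch below uses only numbers from the deck (200M files in the notes, 50M files per NameNode and 1T files on the slide), plus one assumption that is not from the deck: roughly 150 bytes of NameNode heap per namespace object, a commonly cited rule of thumb for HDFS, with one file entry and one block per file. Function and constant names are illustrative.

```python
# Figures quoted in the deck.
NOTES_MAX_FILES = 200_000_000            # "200M at the maximum" (speaker notes)
SLIDE_FILES_PER_NAMENODE = 50_000_000    # "Limited to 50M files per NameNode" (slide 9)
MAPR_MAX_FILES = 1_000_000_000_000       # "Up to 1T files" (slide 9)

# Assumptions, NOT from the deck: a rule-of-thumb heap cost per namespace
# object, and single-block files (one file entry plus one block each).
BYTES_PER_OBJECT = 150
OBJECTS_PER_FILE = 2


def advantage(mapr_files, other_files):
    """Ratio of the MapR file limit to another distribution's limit."""
    return mapr_files // other_files


def namenode_heap_gib(files):
    """Estimated NameNode heap (GiB) needed to hold metadata for `files` files."""
    return files * OBJECTS_PER_FILE * BYTES_PER_OBJECT / 2**30


print(advantage(MAPR_MAX_FILES, NOTES_MAX_FILES))
print(advantage(MAPR_MAX_FILES, SLIDE_FILES_PER_NAMENODE))
print(round(namenode_heap_gib(NOTES_MAX_FILES), 1))
```

Under these figures, the "> 5000x advantage" on slide 9 matches the 200M limit from the notes (1T / 200M = 5000); against the slide's own 50M-per-NameNode figure the ratio would be 20,000. The heap estimate illustrates why "metadata must fit in memory" caps the file count: under the stated assumption, 200M files already need tens of GiB of heap on a single server.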

    1. Case Study of BigData use with MapR M7 in the Enterprise Datacenter
       Zeljko Dodlek, Sales Director DACH, +49 (0) 151 120 555 07
       ©MapR Technologies - Confidential
    2. Agenda
       • Ancestry Case Study
       • MapR Overview
       • Q&A
    3. Ancestry Use Case (page 1)
       What does Ancestry do?
       Ancestry is an online family history service that uses machine learning and several other statistical techniques to provide services such as ancestry information and DNA sequencing to its users.
       Business challenges:
       • 10 billion records in a 4 PB data store
       • 40,000 record collections (date of birth/death, census, military status, ...)
       • 2+ million subscribers
       • 10+ million registered users
       • DNA matching added to their offering
    4. Ancestry Use Case (page 2)
       Why MapR?
       • HA requirements for the NameNode & TaskTracker
       • Easy way to ingest data into the cluster
       • Safe way to run different jobs on the same cluster
       • Unified file & table platform
       Configuration: 3 separate clusters
       • DNA matching
       • Machine learning
       • Data mining
    5. MapRTech Overview
       • Enterprise-grade Hadoop distribution
       • Innovations in the areas of the data platform, MapReduce and HBase
       • Enabling customers to depend on our Hadoop distribution: no single points of failure, guaranteed SLAs, easy to install, run and expand
       • Professional services: installation, consulting and training
       • Support 7x24
    6. MapR Distribution
    7. MapR’s value addition: distribution made for the enterprise
    8. Expanding Hadoop Use Cases
       • Hadoop APIs for Hadoop applications
       • NFS for file-based applications
       • ODBC and JDBC for SQL-based applications
       • Real-time applications
       • Mission-critical and SLA-dependent applications
       Blue = MapR innovations
    9. No NameNode Architecture
       [Diagram: other distributions (HDFS Federation) with a NAS appliance, NameNodes and DataNodes, compared with MapR, where blocks A to F are spread across all nodes]
       Other Distributions (HDFS Federation):
       • Multiple single points of failure
       • Limited to 50M files per NameNode
       • Performance bottleneck
       • Commercial NAS required
       • Metadata must fit in memory
       MapR:
       • HA w/ automatic failover and re-replication
       • Up to 1T files (> 5000x advantage)
       • Higher performance
       • 100% commodity hardware
       • Metadata is persisted to disk
    10. Simplifying HBase Architecture
       [Diagram: three stacks, listed top to bottom]
       • Other Distributions: HBase / JVM / DFS / JVM / ext3 / Disks
       • Middle stack: HBase / JVM / MapR / Disks
       • Right stack: Unified / Disks
    11. Selected MapR Customers
       • Global threat analytics
       • Intrusion detection & prevention
       • Virus analysis
       • Forensic analysis
       • Recommendation engine
       • Family tree connections
       • Major Credit Card Company
       • Clickstream analysis
       • Log analysis
       • Quality profiling/field failure analysis
       • HBase
       • Fraud detection and prevention
       • Customer sentiment analysis
       • Channel analytics
       • Network analytics
       • Advertising exchange analysis and optimization
       • Customer revenue analytics
       • Customer targeting
       • Monitors and measures behavior of online shoppers
       • ETL offload
       • Social media analysis
    12. Thank You