Challenges & Capabilities in Managing a MapR Cluster by David Tucker

"If you're using Hadoop in production, how do you manage it? Does the distribution you're using provide any tools to make the job easier? What are the pitfalls? Are there parts of the system that are less robust or that have problems more often? Are you running Hadoop on bare metal, or in a cloud environment, and is one easier than the other?"

MapR Senior Solutions Architect David Tucker speaks about the challenges and capabilities in managing a MapR cluster. This talk was given at the SF Bay Area Large Scale Production Engineering Meetup (Sept 19, 2013).

  • We all know about Hadoop, so there is no need to get specific here.
  • Another area of ease of use is the MapR Control System and its heatmap, which simplify health monitoring, cluster administration, and application provisioning at scale. Each small rectangle in the UI represents a separate node. You can select from a wide variety of elements to monitor, including custom services. MapR also includes alerts and alarms, so administrators do not have to watch the cluster constantly. Filters and group operations simplify acting on many nodes at once.
  • With MapR, Hadoop is lights-out data center ready. MapR provides five 9s (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover. MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features. Point-in-time recovery protects against application and user errors. End-to-end checksumming means data corruption is automatically detected and corrected by MapR’s self-healing capabilities. Mirroring across sites is fully supported. All of these features support lights-out data center operations: every two weeks, an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.
  • The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in HA, data loss avoidance, scalability, and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the most files you can support is 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.
  • MapR also uniquely provides full snapshots. No other Hadoop distribution provides this capability. Other distributions provide replication, which keeps additional copies to protect against data loss but does nothing to protect against application or user errors, because those errors are replicated across the cluster too. With MapR you have snapshots and point-in-time recovery. A user or administrator can simply open the snapshot directory and recover a full directory or an individual file. Snapshots use a redirect-on-write method, which provides this protection without duplicating the data. In other words, you can snapshot a 1 petabyte cluster in seconds with no additional data storage.
  • MapR is also the only distribution for Apache Hadoop that provides wide area replication and mirroring, allowing you to provide full business continuity. MapR’s Hadoop distribution lets you automatically and transparently mirror your data to another cluster. The system synchronizes clusters incrementally, transferring only the changed data, which means very low overhead and higher performance. With MapR, you can also easily deploy a research cluster alongside a production cluster so that researchers, developers, and analysts can experiment without impacting production. You can mirror between two geographically separated clusters for disaster recovery and implement your Recovery Time Objectives to assure business continuity. MapR’s mirroring also supports bulk data transfer to other clusters. Hadoop users today do not have a way to interoperate between private and public clouds. You can use MapR’s mirroring to synchronize data between a research cluster and your production cluster, or between a private and a public cloud.
  • Snowden story: he got the documents because he was administering a file server that held classified information.
  • The MapR Control System also provides advanced job management capabilities, giving an administrator complete visibility and control over the operation of the cluster, jobs, and tasks. Unique capabilities of the MapR Control System: automated; comprehensive, covering hardware and software (Cloudera has no visibility into hardware faults); full visibility and control; supports lights-out operation.
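The redirect-on-write idea from the snapshot note above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Volume` class and its method names are hypothetical and are not MapR's API, and a real file system would share the block map rather than copy it. It shows why a snapshot needs no extra data blocks and why later writes leave the snapshot readable.

```python
class Volume:
    """Toy block store with redirect-on-write snapshots (hypothetical, not MapR's API)."""

    def __init__(self):
        self._store = {}      # physical block id -> bytes
        self._live = {}       # logical block name -> physical block id
        self._snapshots = {}  # snapshot name -> frozen copy of the block map
        self._next_id = 0

    def write(self, name, data):
        # Redirect on write: new data always lands in a fresh physical block,
        # so blocks still referenced by a snapshot are never overwritten.
        block_id = self._next_id
        self._next_id += 1
        self._store[block_id] = data
        self._live[name] = block_id

    def read(self, name, snapshot=None):
        block_map = self._live if snapshot is None else self._snapshots[snapshot]
        return self._store[block_map[name]]

    def snapshot(self, snap_name):
        # Only the block map (metadata) is copied; no data blocks are duplicated.
        self._snapshots[snap_name] = dict(self._live)

    def physical_blocks(self):
        return len(self._store)


vol = Volume()
for name in "ABCD":
    vol.write(name, f"{name}-v1".encode())

vol.snapshot("snap1")
blocks_after_snapshot = vol.physical_blocks()  # still 4: the snapshot added no data

vol.write("C", b"C-v2")  # corresponds to C' in the slide 10 diagram
print(vol.read("C"))                    # live view sees the new block
print(vol.read("C", snapshot="snap1"))  # snapshot still sees the original
```

Recovery in this model is just a read through the snapshot's block map, which mirrors the "open the snapshot directory and copy the file back" workflow described in the note.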

Transcript

  • 1. Challenges and Capabilities in Managing a MapR Cluster. David Tucker, Senior Solution Architect, MapR Technologies. ©MapR Technologies - Do Not Redistribute
  • 2. Overview. Business challenge: keep the cluster running; keep the data safe and secure; optimize resource utilization. Cluster capability: management at scale; integrated HA; resiliency; authentication / authorization; designed for high performance; data and processing locality.
  • 3. Business challenge: keep the cluster running; keep the data safe and secure; optimize resource utilization. Cluster capability: management at scale; integrated HA; resiliency; authentication / authorization; designed for high performance; data and processing locality.
  • 4. Easy Management at Scale: health monitoring; cluster administration; application resource provisioning.
  • 5. High Availability and Dependability. Reliable compute: automated stateful failover; automated re-replication; automated recovery from HW and SW failures; load balancing of critical services; rolling upgrades; no lost jobs or data; five 9s (99.999%) of uptime. Dependable storage: business continuity with snapshots and mirrors; point-in-time recovery; end-to-end checksumming; strong consistency; data safe; multi-site mirroring to meet Recovery Time Objectives.
  • 6. No NameNode Architecture. Other distributions (HDFS Federation): multiple single points of failure; limited to 50M files per NameNode; performance bottleneck; commercial NAS required; metadata must fit in memory. MapR: HA with automatic failover and re-replication; up to 1T files (> 5000x advantage); higher performance; 100% commodity hardware; metadata is persisted to disk. [Diagram: federated NameNodes (A B, C D, E F) with a NAS appliance and DataNodes, compared with MapR nodes holding A through F distributed across the cluster.]
  • 7. JobTracker HA: other distributions (MR or YARN) compared with MapR. [Diagram: JobTracker (JT) instances.]
  • 8. NFS HA (via managed VIPs)
  • 9. Business challenge: keep the cluster running; keep the data safe and secure; optimize resource utilization. Cluster capability: management at scale; integrated HA; resiliency; authentication / authorization; designed for high performance; data and processing locality.
  • 10. Data Protection via MapR Snapshots: snapshots without data duplication; saves space by sharing blocks; lightning fast; zero performance loss on writing to original; scheduled or on-demand; easy recovery by user. [Diagram: redirect on write for snapshot. Hadoop / HBase applications and NFS applications read and write through MapR Storage Services; Snapshot 1, Snapshot 2, and Snapshot 3 reference data blocks A, B, C, C’, D.]
  • 11. Business Continuity via MapR Mirroring. Business continuity and efficiency. Efficient design: differential deltas are updated; compressed and check-summed. Easy to manage: scheduled or on-demand; WAN, remote seeding; consistent point-in-time. [Diagram labels: Production, Research, Datacenter 1, WAN, EC2.]
  • 12. User Authentication and Authorization. PAM interfaces: multiple options for authentication registries. Basic Hadoop authorization: file and directory permissions; job queues. Advanced authorization options. Don’t forget separation of roles: cluster administration vs data access.
  • 13. Business challenge: keep the cluster running; keep the data safe and secure; optimize resource utilization. Cluster capability: management at scale; integrated HA; resiliency; authentication / authorization; designed for high performance; data and processing locality.
  • 14. Managing Cluster Resources. Isolation: tasks sandboxed so they don’t impact other tasks or system daemons; system resources protected from runaway jobs; volume-based data segregation based on users and groups; volume-based data placement; label-based job scheduling. Quotas: storage quotas by volume/user/group; CPU and memory quotas by queue/user/group. Reporting: detailed reporting on resource usage (~100 different cluster metrics); all reports are available via UI, CLI, and REST API.
  • 15. Advanced Job Management: job monitoring and management; job and data placement control; advanced monitoring, management, isolation and security for Hadoop.
  • 16. Q & A
  • 17. Thank You
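Two figures quoted in the deck can be checked with simple arithmetic. The sketch below assumes a 365-day year and takes the file counts exactly as stated (1T files for MapR, 50M files per NameNode on slide 6, 200M files maximum in the speaker notes). The deck does not say which baseline the "> 5000x" figure uses, so both ratios are computed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability with `nines` nines (5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


five_nines_downtime = downtime_minutes_per_year(5)  # about 5.26 minutes per year

files_mapr = 10 ** 12                  # "up to 1T files" (slide 6)
files_per_namenode = 50 * 10 ** 6      # "limited to 50M files per NameNode" (slide 6)
files_other_max = 200 * 10 ** 6        # "200M at the maximum" (speaker notes)

ratio_vs_notes_max = files_mapr // files_other_max         # 5000
ratio_vs_one_namenode = files_mapr // files_per_namenode   # 20000

print(f"five 9s allows about {five_nines_downtime:.2f} minutes of downtime per year")
print(f"1T files vs 200M files: {ratio_vs_notes_max}x")
print(f"1T files vs 50M files:  {ratio_vs_one_namenode}x")
```

The 5000x ratio matches the 200M ceiling from the speaker notes, which suggests (but does not confirm) that the slide compares against that figure rather than a single 50M-file NameNode.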