Huhadoop - v1.1
  • The architecture of MapReduce v1 came with its limitations.
    Scalability: even as server specs rose to accommodate more load, it still couldn't scale past the maximum number of concurrent tasks.
    Availability: a JobTracker failure kills all queued and running jobs. After a restart they have to be resubmitted and start from the beginning; they can't pick up where they left off. That can be a huge problem if you have long-running batch jobs.
    Resource partitioning: resources were broken up into distinct map and reduce slots which aren't "fungible". I love that word. It basically means they weren't interchangeable. Map slots might be "full" while reduce slots remain empty, and vice versa. This needed to be addressed to ensure the entire system could be used at max capacity for high utilization.
    Lack of support for other paradigms: you were stuck using MapReduce.
  • In Hadoop 1.0, all methods of accessing the data within the cluster were constrained to MapReduce. Open-source Hadoop projects like Pig and Hive are built on top of MapReduce, and even though they make MapReduce more accessible, they still suffer from its limitations. You have seen some distributions move outside the Hadoop ecosystem, like Cloudera's Impala, to get around the limitations of MapReduce and improve performance. Unfortunately, those tools aren't community supported and lag behind in features because they don't have the backing of the innovative open-source committers. The crazy thing is that even with these limitations, 90% of the use cases Nick spoke about yesterday are based on this.
  • So, what has Trace3 found out about YARN on its journey through Big Data? Well, first of all, we discovered that it's not the type of yarn that cats play with. YARN will be the de-facto distributed operating system for Big Data, and by the end of this hour you are going to see why we believe it is and why companies like Cloudera, Hortonworks and MapR are banking on it.
  • YARN is taking Hadoop beyond batch. YARN solves the limitations of MapReduce v1. It gives you the ability to store all your data in one place, have mixed workloads working with that data, and still get predictable performance and QoS. YARN is moving Hadoop beyond just MapReduce and batch into interactive, online, streaming, graph, in-memory and other workloads.
  • YARN is fairly new to the scene, but that shouldn't deter you from being confident in it. It was conceived and architected at Yahoo! and has gone through a very quick maturing process thanks to the open-source community putting it through its paces. Currently YARN is running on over 100,000 nodes and is responsible for 400,000+ jobs and 10 million+ hours of compute time daily.
  • Yes, I said 10 ‘millllllion’
  • Here are some of the applications that make up that compute time. HBase will be deployed on YARN, which we will talk about more a bit later. Master-worker applications: MapReduce has been moved out into its own application framework. Real-time streaming analytics: this, in my opinion, is the most promising of the application types. I don't want to steal my associate Rikin's thunder, but he will be speaking in a lot more depth about real-time streaming analytics in a session later today. Graph processing: YARN has enabled iterative applications like Apache Giraph within your cluster, where previously MapReduce v1 just wasn't a viable option.
  • So, what has changed with YARN for it to accomplish this? YARN splits the two major functions of the JobTracker into the ResourceManager and the ApplicationMaster.
    Global ResourceManager: handles all of the cluster resources; its Scheduler performs scheduling based on the resource requirements of the applications.
    Per-node slave NodeManager: responsible for launching application containers, monitoring their resource usage, and reporting it to the ResourceManager.
    Per-application ApplicationMaster: responsible for negotiating the appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.
    Per-application containers run on the NodeManagers. Let's see how these all work together (a client-side submission sketch appears after these notes).
  • Scale: YARN is no longer limited to the 40,000 concurrent tasks that MapReduce v1 had; today YARN is already handling over 10 million hours of compute time on a daily basis.
    New programming models and services: you aren't limited to just MapReduce; if your app can benefit from a distributed operating system, you can run it on YARN.
    Improved cluster utilization: YARN no longer has a hard partition of resources into map and reduce slots; it uses resource leases, aka "containers", that aren't limited to a single function.
    Agility: by moving MapReduce out and on top of YARN, customers gain more agility to make changes, upgrade, and run different versions of their framework without affecting the entire cluster.
    Backwards compatible: what you are currently doing with Hadoop 1.x and MapReduce v1 will work with YARN.
    Mixed workloads on the same data source: you can use the "data lake" architecture and run all your apps while still having predictable performance and quality of service.
  • One of the projects I'm keeping a close eye on is the Stinger project. Speed: a 100x speed increase over Hive 10. SQL: improve HiveQL to make it more ANSI SQL-like. Scale: the ability to run queries on terabytes to petabytes of information.
  • The Stinger project is tackling the speed portion by utilizing Apache Tez. Tez sits in the execution layer beneath MapReduce, Pig and Hive, on top of YARN, to optimize the execution of these applications.
  • Tez does a lot to optimize the execution of tasks. For one, it allows what they call "Map Reduce Reduce", which avoids unnecessary map tasks by letting data be pipelined between reduce stages. Also, Tez is capable of handling small datasets in memory, avoiding unneeded writes to HDFS, which are one of the biggest causes of latency in MapReduce applications.
  • Another project to watch closely is HOYA, HBase on YARN. Dynamic scaling: the cluster scales with usage, growing and shrinking as load changes. Easier deployment: HBase cluster deployment can be somewhat complicated; they are looking to correct that by letting you do it using built-in APIs. Availability: when a RegionServer is lost, recovering it is just a matter of deploying another container within the cluster.
  • This is a project that lies outside my wheelhouse, but from what I've learned about it, it's going to do amazing things for machine learning. I've also highlighted this project to add even more credibility to YARN by showing that a company like Microsoft is dedicating internal time and resources to building applications that run on YARN.
  • Previously a NameNode saw only one classification of storage media. As of 2.3, NameNodes have the ability to distinguish the storage media available to them. Adding awareness of storage media allows HDFS to make better decisions about the placement of block data, with input from applications. An application can choose the distribution of its replicas based on its performance and durability requirements (a small storage-policy sketch appears after these notes).
  • NodeManager restart allows the NM to be restarted without losing jobs; they continue where they left off after the restart. Dynamic resource configuration: currently containers are static, allocating a fixed amount of CPU and memory to each process. With this change, processes will be able to scale up within a container if resources are available on that NodeManager.
  • Jobs are submitted to the ResourceManager via a public submission protocol and go through an admission-control phase during which security credentials are validated and various checks are performed. The RM runs as a daemon on a dedicated machine and acts as the central authority arbitrating resources among the competing applications in the cluster. Because it has a central and global view of the cluster resources, it can enforce properties such as fairness, capacity and locality across nodes. Accepted jobs are passed to the scheduler to be run. Once the scheduler has enough resources, the application moves from the accepted to the running state. This involves allocating a resource lease, aka a container (a bound JVM), for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM failure. The ApplicationMaster is the "head" of a job, managing all lifecycle aspects including dynamically increasing and decreasing resource consumption, managing the flow of execution and handling faults. By delegating all these functions to AMs, YARN's architecture gains a great deal of scalability, programming-model flexibility and improved upgrading/testing, since multiple versions of the same framework can coexist. The RM interacts with a special system daemon running on each node called the NodeManager (NM). Communication between the RM and the NMs is heartbeat-based for scalability. NMs are responsible for monitoring resource availability, reporting faults, and managing the container lifecycle (e.g., starting and killing containers). The RM assembles its global view from these snapshots of NM state (an ApplicationMaster-side sketch of this negotiation appears after these notes).
  • MapReduce v1 consisted of two daemons/processes. The JobTracker is a master node responsible for managing the cluster resources (map and reduce slots) and job scheduling. The TaskTracker is a per-node agent that manages the map and reduce tasks.
  • Backwards compatible: whatever you are doing with Hadoop 1.0 and MapReduce today will work with YARN. Even if you don't need all the capabilities of YARN right now, don't hesitate to move to it; as new tools and applications become available on YARN, your company will be able to use them.
    One source of data: YARN allows you to have that data lake, with all of your data applications running against it, while still maintaining predictable performance and quality of service.
    Resource management: YARN accomplishes this through the way it manages resources for better cluster utilization, which translates to "more bang for your buck".
    Enabling smart people: YARN is an extremely flexible framework that is giving smart people and companies the ability to do amazing things with data.
    All these benefits add up to "YARN will be the de-facto distributed operating system for Big Data". We see the innovation in Big Data happening on YARN, and we want to help you make the right choice now to avoid the headaches and costs that come with making the wrong one.
  • Before we can fully understand what YARN is solving, we need to review what it's replacing: Hadoop 1.0. The initial design of Hadoop was focused on running massive MapReduce jobs to process a web crawl. Although it did end up evolving beyond that initial use case and helped solve the "data silo problem", it created a different issue, something called the "data system silo problem". Users were forced into creating data-system silos because of mixed workloads (the HBase example). Developers were forced to abuse the very specific MapReduce programming model to try to accommodate their use cases. One of the biggest costs of a Hadoop cluster is copying data between clusters to accommodate mixed workloads.
  • Hive's SQL datatype and semantics support is already pretty extensive, and they plan on adding more to allow nearly full use of SQL concepts within Hive.
  • PMC members are the people who provide oversight of the project roadmap and guidance to the committers. One thing to highlight: Hortonworks is a spin-off of Yahoo!
  • The committers are the ones who actually commit code to the project. Again, one thing to highlight: Hortonworks is a spin-off of Yahoo!
  • The question may arise: how can I state that YARN will be the de-facto distributed operating system for Big Data? Here are the arguments for my conclusion / prediction.
  • YARN is still uncharted territory. Unless you like reading whitepapers, browsing blog posts and analyzing flowcharts, it's very difficult to understand how YARN can add value to your company. Trace3 has done the work for you and charted the territory. You need to make the right choice when adopting technologies because of the huge cost and time sink associated with migrating once you realize you chose the wrong one.
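To make the ResourceManager / ApplicationMaster / NodeManager split described in these notes more concrete, here is a minimal client-side sketch of submitting an application to YARN using the YarnClient API from hadoop-yarn-client (Hadoop 2.x). The application name, AM launch command and resource sizes are illustrative placeholders, not anything taken from the deck.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml on the classpath.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the RM for a new application id (the admission-control step).
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app"); // placeholder name

        // Describe how to launch the ApplicationMaster container.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "/bin/sleep 60 1>/tmp/am-stdout 2>/tmp/am-stderr")); // placeholder AM command
        appContext.setAMContainerSpec(amContainer);

        // The resource lease ("container") requested for the AM: memory in MB plus vcores.
        appContext.setResource(Resource.newInstance(512, 1));

        // Hand the application to the ResourceManager; its Scheduler decides where the AM runs.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```

A real client would also ship the AM jar and its dependencies as local resources and then poll the application report for status; those steps are left out to keep the sketch short.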
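The other half of the handshake from the job-submission note is the ApplicationMaster negotiating containers from the Scheduler. Below is a stripped-down sketch using the AMRMClient API: it registers with the ResourceManager, requests one container (a resource lease), and picks it up through the heartbeat-style allocate loop. The resource sizes and sleep interval are arbitrary, and a real AM would go on to launch work in the granted container via NMClient and handle failures.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MiniAppMaster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Register this ApplicationMaster with the ResourceManager's Scheduler.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

        // Ask for one worker container: 256 MB, 1 vcore, no locality constraint.
        Resource capability = Resource.newInstance(256, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // Heartbeat/allocate loop: the RM grants containers as resources free up.
        int granted = 0;
        while (granted < 1) {
            AllocateResponse response = rmClient.allocate(0.0f);
            for (Container container : response.getAllocatedContainers()) {
                granted++;
                System.out.println("Granted " + container.getId() + " on " + container.getNodeId());
                // A real AM would now start its task in the container using NMClient.
            }
            Thread.sleep(1000);
        }

        // Tell the RM we are done so the application leaves the running state cleanly.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
    }
}
```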

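For the heterogeneous-storage note, here is a small sketch of how an application could express a placement preference once storage policies are exposed. Two assumptions to flag: Hadoop 2.3 adds the awareness of storage types (e.g., SATA vs. SSD), while the setStoragePolicy call and the built-in ALL_SSD policy used below only appear in later 2.x releases, and the /data/hot/clickstream path is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PinHotDataToSsd {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and the rest of the cluster settings from the classpath.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Ask HDFS to keep all replicas of this directory on SSD media; the NameNode
        // then uses the storage-type information it tracks per DataNode when placing
        // new blocks written under this path.
        dfs.setStoragePolicy(new Path("/data/hot/clickstream"), "ALL_SSD");

        dfs.close();
    }
}
```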
Huhadoop - v1.1 Presentation Transcript

  • 4/18/2014 Prepared for: Big Data Expedition Roadshow Presented by: “Big Data Joe” Rossi Huhadoop?
  • What Makes Up Hadoop 1.x?
  • Hadoop 1.0 – HDFS + MapReduce NameNode DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker Secondary NameNode / JobTracker Client 1-1 1-2 1-3
  • Hadoop 1.0 – HDFS + MapReduce NameNode DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker Secondary NameNode / JobTracker Client 1-1 1-2 1-3 Reduce Map 2-1 3-2 3-3 4-1 2-3 4-2 2-2 3-1 4-3 Reduce Map
  • MapReduce v1 Limitations Scalability Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000 Availability JobTracker failure kills all queued and running jobs Resources Partitioned into Map and Reduce Hard partitioning of Map and Reduce slots led to low resource utilization No Support for Alternate Paradigms / Services Only MapReduce batch jobs, nothing else
  • HADOOP 1.0 Single Use System Batch Apps Apache Hadoop 1.0: Single Use System HDFS (redundant, reliable storage) MapReduce (cluster resource management and data processing) Pig Hive
  • What’s New In Hadoop 2.x?
  • YARN Replaces MapReduce Yet Another Resource Negotiator YARN YARN will be the de-facto distributed operating system for Big Data
  • Store DATA in one place YARN: Taking Hadoop Beyond Batch Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively IN Hadoop HDFS2 (redundant, reliable storage) YARN (cluster resource management) BATCH (MapReduce) INTERACTIVE (Tez, Spark) ONLINE (HBase) STREAMING (DataTorrent) GRAPH (Giraph)
  • YARN: Moving Quickly (timeline, 2010 to today): Conceived at Yahoo!; Alpha Releases – 2.0; Beta Releases – 2.1; GA Released – 2.2; Version 2.3; today: 100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily
  • YARN: Dr. Evil Approved
  • YARN: Applications. MapReduce v2, Real-Time Streaming Analytics, Master-Worker, Online, Graph Processing, all running on the same Hadoop cluster to give applications access to all the same source data!
  • YARN: What Has Changed? (diagram mapping MRv1 components to YARN) MRv1: JobTracker (JT) with its Scheduler, TaskTracker (TT), Map and Reduce slots. YARN: ResourceManager (RM) with its Scheduler, ApplicationMaster (AM), NodeManager (NM), Containers.
  • 7 Benefits of YARN: Scale; New programming models and services; Improved cluster utilization; Agility; Backwards compatible with MapReduce v1; Mixed workloads on the same source of data; Enables running apps in memory within the cluster
  • The Future of Hadoop Projects and Roadmap
  • Stinger: Interactive Query for Hive. Speed: Deliver interactive query through 100x performance increases as compared to Hive 10. SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop. Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
  • Stinger: Speed – Apache Tez HDFS2 (redundant, reliable storage) YARN (cluster resource management) Tez (execution layer) MR Pig Hive
  • Stinger: Speed – Apache Tez
  • HOYA: HBase on YARN. Dynamic Scaling: On-demand cluster size; increase and decrease the size with load. Easier Deployment: APIs to create, start, stop and delete HBase clusters. Availability: Recover from RegionServer loss with a new container.
  • Microsoft REEF (Retainable Evaluator Execution Framework). Machine Learning: a framework well suited for building machine learning jobs. Scalable / Fault Tolerant: makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. Maintain State: users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
  • Heterogeneous Storages in HDFS (diagram): NameNode aware of storage types SATA, SSD, Fusion IO
  • Apache Hadoop 2.4 ResourceManager HA / Auto Failover HDFS Rolling Upgrades Apache Hadoop 2.5 NodeManager Restart w/o disruption Dynamic Resource Configuration Hadoop Roadmap EARLY Q2 2014 MID Q2 2014
  • Questions? No such thing as a stupid question. Huhadoop?
  • Thank You! Huhadoop? Big Data Joe Rossi: http://about.me/bigdatajoe jrossi@trace3.com c. 858.761.2918
  • Supporting Slides Slides with information that may be asked
  • YARN: How It Works ResourceManager NodeManager ApplicationMaster NodeManager NodeManager NodeManager Scheduler Container Container Container Client
  • YARN: Example App Deployment ResourceManager NodeManager HOYA / HBase Master NodeManager NodeManager NodeManager Scheduler Region Server Region Server Region Server HOYA Client
  • Storm Vs. DataTorrent Solution Matrix (DataTorrent | Apache Storm)
    Atomic Micro-batch: 1 | 3
    Events per Second: Billions | Thousands
    Automated Parallelism: 1 | 3
    Dynamic Runtime Changes: 1 | 3
    Linear Scalability: 1 | 3
    State Checkpointing: 1 | 3
  • Apache Spark + Shark HDFS2 (redundant, reliable storage) YARN (cluster resource management) Apache Spark Shark Hive (sql)
  • Hadoop 2.x – YARN + HDFS NameNode DataNode / NodeManager DataNode / NodeManager DataNode / NodeManager DataNode / NodeManager Standby NameNode / ResourceManager ContainerContainer ContainerContainer ContainerContainer ContainerContainer
  • YARN: Key Take-Aways. Backwards Compatible: YARN is backwards compatible with your existing MapReduce applications; you can get value from it right away. Resource Management: YARN enables fine-grained resource management for better cluster utilization. One Source of Data: YARN allows you to interact with one source of data in multiple ways while maintaining predictable performance and quality of service. Enabling Smart People: YARN is a flexible framework that is giving smart people and companies the ability to do amazing things with data. YARN will be the de-facto distributed operating system for Big Data
  • Storm Vs. DataTorrent - Detailed Solution Matrix (DataTorrent | Apache Storm)
    Proprietary / Open Source: O | O
    Support for Hadoop 1.x: 1 | 1
    Support for Hadoop 2.x: 1 | 1
    Native YARN: 1 | 3
    Dashboard: 1 | 3
    Extensible via Modules: 1 | 1
    Technical Support: 1 | 1
    Atomic Micro-batch: 1 | 3
    Events per Second: Billions | Thousands
    Automated Parallelism: 1 | 3
    Dynamic Runtime Changes: 1 | 3
    High Availability: 1 | 2
    Prog. Languages Supported: Java, Python, etc. | Java, Python, etc.
    Log Analysis: 1 | 3
    Site Operations: 1 | 3
    MapReduce Diagnostics: 1 | 3
    Open Source Operators Library: 1 | 2
    Open Source Application Templates: 1 | 3
    Complex Computations (DAG): 1 | 3
    Linear Scalability: 1 | 3
    Security: 1 | 3
    CLI and Macros: 1 | 3
    Configuration Based Specification: 1 | 3
    State Checkpointing: 1 | 3
  • Users forced to create data system silos for managing mixed workloads Developers forced to abuse very specific MapReduce to fit their use cases The 1st Generation Of Hadoop Hadoop HBase
  • Stinger: HiveQL – SQL Support Hive SQL Datatypes Hive SQL Semantics
  • Apache Spark HDFS2 (redundant, reliable storage) YARN (cluster resource management) Apache Spark Shark Hive (sql) Spark Streaming MLib (machine learning)
  • Project Mgt Committee Members (bar chart): Hortonworks, Others, Cloudera, Yahoo!, Facebook; values shown: 7, 6, 3, 15, 11
  • Project Committers (bar chart): Hortonworks, Others, Cloudera, Yahoo!, Facebook; values shown: 24, 24, 11, 11, 5
  • YARN: Why The De-Facto Distributed OS. Technology Adoption: 100,000+ nodes, 400,000+ jobs, 10M+ compute hours daily. Enables Innovation: smart people and companies doing amazing things with data. Financial Backing: $568M+ invested in Hadoop-contributing companies, nearly $400M in 2013 alone.
  • Apache Storm Topology Bolt (Filter)Spout Stream (Data Source) Spout Stream (Data Source) Bolt (RDBMS Writes) Bolt (Calculation) Bolt (HDFS Writes) RDBMS HDFS
  • Hadoop 1.0 – MR + HDFS NameNode DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker DataNode / TaskTracker Secondary NameNode / JobTracker Reduce Map Reduce Map Reduce Map Reduce Map
  • Hadoop 1.0 – MapReduce JobTracker TaskTracker Reduce Map TaskTracker Reduce Map TaskTracker Reduce Map TaskTracker Reduce Map
  • YARN: Uncharted Territory You Are Here Technology Value